Issue with parsing large text file

#1

Hi,
I’m trying to get a bag-of-words for a large text file (123MB, I can provide the file of course).
I tried to open and parse it with both Flat File Document Parser and Tika Parser followed by Strings to Document.
Neither of them finished in a reasonable time. I had to stop after a few hours of running on a desktop with 8GB total RAM (4GB accessible by KNIME).
Is there any limitation in the size of the input file?
Eric

1 Like

#2

Hi @EricXL -

There aren’t any limits on the input file size, but 4GB allocated to KNIME is definitely on the low side for text processing applications. Any chance you can try out your workflow on a machine with more resources?

If you like, post the input file here and perhaps we can test it out.

2 Likes

#3

Hi Scott,
I tried to upload the full file without success (the max allowed size is 4MB).
The file I’m working with is a news corpus from Leipzig University. I’m trying to develop a corpus comparison approach for automatic term recognition (domain-specific term extraction).
I’ll see what’s feasible internally to get more computing resources.
Thanks.
Eric

1 Like

#4

Hi @EricXL -

Sorry about that. If you want to upload a file that large you’ll have to use Dropbox or a similar cloud sharing service, as the forum file sizes are restricted.

0 Likes

#5

I’m trying again with this link: https://drive.google.com/open?id=1U0DUXkWJzGlRhlMbdlFmTjo8_PRdGkBF
Hopefully it works this way.
Cheers,
Eric

0 Likes

#6

Unfortunately, that link is asking for credentials to sign in.

0 Likes

#7

Hi Scott,
I tried to sort it out through our internal IT, but that did not work, so I put the file on GDrive at the link above.
I’d be grateful if you could have a look at it.

0 Likes

#8

I just downloaded it. No credentials were required.

0 Likes

#9

For a txt file, you need to load it with the File Reader node, using tab as the delimiter. Then use Strings to Document and continue with document exploration.

1 Like

#10

Hi @EricXL -

I just downloaded the file, thanks for updating the link. As @izaychik63 mentioned, here you can just use the File Reader node, followed by a Strings to Document node. The Tika Parser and Flat File Document Parser nodes aren’t necessary.

When I did this on my machine with 12GB of RAM dedicated to KNIME, I was able to parse the strings to documents in just under a minute.
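
For anyone who wants to sanity-check the same idea outside KNIME, here is a minimal Python sketch of the approach: read the file one row at a time, split each row on the tab delimiter, and count words from the text column. It is not what the KNIME nodes do internally. The function names, the assumption that the sentence sits in the second tab-separated column, and the simple regex tokenizer are all illustrative choices, so adjust them to the actual file layout.

```python
import re
from collections import Counter
from typing import Iterable, Iterator

# Assumption: a "word" is any run of letters, digits or underscores.
TOKEN = re.compile(r"\w+")


def read_sentences(path: str, delimiter: str = "\t",
                   text_column: int = 1,
                   encoding: str = "utf-8") -> Iterator[str]:
    """Yield the text column of each row, one row at a time.

    Iterating over the file handle keeps only the current line in
    memory, so the whole file is never loaded at once.
    """
    with open(path, encoding=encoding, errors="replace") as handle:
        for line in handle:
            fields = line.rstrip("\n").split(delimiter)
            # Skip rows that do not have the expected text column.
            if len(fields) > text_column:
                yield fields[text_column]


def bag_of_words(sentences: Iterable[str]) -> Counter:
    """Count lowercased tokens across all sentences."""
    counts: Counter = Counter()
    for sentence in sentences:
        counts.update(TOKEN.findall(sentence.lower()))
    return counts
```

With a hypothetical file name, `bag_of_words(read_sentences("corpus.txt")).most_common(20)` would list the 20 most frequent tokens. Because the rows are streamed, memory use depends mainly on the vocabulary size rather than on the size of the file.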

Hope that helps.

3 Likes

#11

Hi @ScottF, @izaychik63,
Thanks to both of you. It looks like, being new to KNIME, I didn’t understand the node descriptions correctly. With File Reader and Strings to Document it worked marvelously, even with only 4GB of RAM.
Eric

2 Likes

#12

Hi there @EricXL,

welcome to the KNIME Community and to KNIME itself!

To get up to speed with KNIME, you can check the KNIME Learning page, where you can find various resources.

Also make sure to explore the KNIME Hub, a place where you can find nodes, example workflows and much more :wink:

Br,
Ivan

0 Likes