Multiple text files into a single table

I’ve been trying to read a bunch of emails (text files) located in many directories and sub-directories. I have managed to create a list of the files, but I cannot get the ‘Iterate List of Files’ metanode to work. The File Reader complains about a file or directory. I know I am supposed to set a file path in the configuration of the node, but I have thousands of files; which ones do I need to set? I don’t get what goes in the File Reader’s ‘data file location’ field. Any ideas?

You can insert the URL variable in the File Reader’s configuration (Flow Variables / DataURL):


Ahh, makes sense! Thank you!

Incidentally, if you want to read an entire file (i.e. an entire email) into a single cell in the data table, then the ‘Load text-based files’ node in the Vernalis community contribution will help. With this node, you also don’t need to loop over your list of files.

Steve
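
For anyone who wants to see the idea outside KNIME, here is a minimal Python sketch of "one whole file per cell, no loop over a file list": walk a directory tree and collect each file's full text as one row. The function name `load_text_files`, the `*.txt` pattern and the UTF-8 default are assumptions made for illustration; this is not how the Vernalis node is implemented.

```python
from pathlib import Path


def load_text_files(root, pattern="*.txt", encoding="utf-8"):
    """Return one (path, full_text) row per matching file under root."""
    rows = []
    # rglob descends into every sub-directory, so nested folders need no extra handling
    for path in sorted(Path(root).rglob(pattern)):
        if path.is_file():
            # errors="replace" keeps going if a byte cannot be decoded
            text = path.read_text(encoding=encoding, errors="replace")
            rows.append((str(path), text))
    return rows
```

Each row holds the complete content of one file, headers and body included, which is the same shape of table the thread is after.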


Looks great, but it only reads a certain part (the recipients’ email addresses). Is there a way to set the Vernalis node to read the entire file, i.e. the metadata and the body of the emails?

Off the top of my head, I don’t know. What format are the messages saved in? What do they look like if you open them in e.g. Notepad++?

Steve

Each email is in .txt and they look like this:

Message-ID: <21670167.1075841554200.JavaMail.evans@thyme>
Date: Thu, 7 Feb 2002 06:57:13 -0800 (PST)
From: mike.purcell@enron.com
To: serena.bishop@enron.com, cara.semperger@enron.com, diane.cutsforth@enron.com, 
	donald.robinson@enron.com
Subject: RE: TAG 40700
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Purcell, Mike </O=ENRON/OU=NA/CN=RECIPIENTS/CN=MPURCELL>
X-To: Bishop, Serena </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Sbishop3>, Semperger, Cara </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Csemper>, Cutsforth, Diane </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Dcutsfor>, Robinson, Donald </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Drobins>
X-cc: 
X-bcc: 
X-Folder: \ExMerge - Semperger, Cara\Inbox
X-Origin: SEMPERGER-C
X-FileName: cara semperger 6-26-02.PST

The new tag is 40719 and I will correct this in the workspace today.

Thank you Serena!

MP

 -----Original Message-----
From: 	Bishop, Serena  
Sent:	Wednesday, February 06, 2002 3:59 PM
To:	Purcell, Mike; Semperger, Cara; Cutsforth, Diane; Robinson, Donald
Subject:	TAG 40700

Hey Guys - Well, I had to learn some Cod Scheduling fast this afternoon, (we really do need to exchange new cell phone numbers :-)). Anyway, tag #40700 was incorrect and BPA called about 3:45.  I cancelled that tag and replaced it with tag #40716
I hope what I did was okay, let me know if I should have done it differently.  On the NW sheet that tag number is highlighted in yellow in case there is a problem.  I assume the something needs to be corrected in the preschedule workspace, but I don't know what it is, so I will leave it to Cara or Donald to explain it to me in the morning.  Thanks again, Serena
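
Files in this layout (header lines, a blank line, then the body) can be checked outside KNIME to confirm that both the metadata and the body are present. The following is a minimal Python sketch using the standard library `email` module; the helper name `split_email` is made up for illustration, and it assumes a plain, non-multipart message like the one above.

```python
from email import message_from_string


def split_email(raw_text):
    """Return (headers, body) for a message in the header/blank-line/body layout."""
    msg = message_from_string(raw_text)
    # items() gives the header fields in file order as (name, value) pairs
    headers = list(msg.items())
    # for a non-multipart message get_payload() returns the body as a string
    body = msg.get_payload()
    return headers, body
```

Anything after the first blank line, including a quoted "-----Original Message-----" block, ends up in the body rather than in the headers.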

OK, when you tried the node, what option did you try for the file encoding? And if you tried ‘Guess’, what did you see in the console?

A console INFO entry is added for each file format detected, and a WARN entry added when the default is used because none could be detected.
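
As a rough illustration of that logging pattern, here is a Python sketch (not the node's actual Java code). The candidate encodings, the `latin-1` default and the name `guess_and_decode` are all assumptions chosen for the example.

```python
import logging

logger = logging.getLogger("encoding_guess")

# Candidate encodings and the fallback are illustrative assumptions
CANDIDATES = ("ascii", "utf-8")
DEFAULT = "latin-1"


def guess_and_decode(data, name="<bytes>"):
    """Decode bytes with the first candidate that succeeds, else the default."""
    for enc in CANDIDATES:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue
        # INFO entry: an encoding was detected for this file
        logger.info("%s: detected encoding %s", name, enc)
        return enc, text
    # WARN entry: nothing matched, so the default is used
    logger.warning("%s: no encoding detected, using default %s", name, DEFAULT)
    return DEFAULT, data.decode(DEFAULT)
```

With a scheme like this, a WARN line in the console for a file is the signal that the guess failed and the text may be decoded imperfectly.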

Steve

(PS - I hope that’s either a fictitious example or there are no issues with making it public?)

You have a very good point about the privacy issue. This is from the Enron Corpus, a body of approximately 500,000 emails in the public domain.

I have changed the node’s setting from ‘Guess’ to UTF-8 and not much changed. I can only see the first 5 lines of the emails. I appreciate your help!


So, I just copied your example above into a new text file, and it reads OK (see below), but obviously I will have lost the encoding of the original. When you had the node set to ‘guess encoding’, what did the console log say about the result of the guess?

One option is that the files are in a different encoding, and the reader somehow reads a little before being persuaded that the file has ended. (Java doesn’t natively support UTF-7 encoding, which is why it isn’t in the encoding list.) However, I tried saving as UTF-7 and as US-ASCII, and both read the whole file, if imperfectly.

The only other thing I can think of is whether you can attach a copy of one of the files without breaching any sort of license agreement under which you obtained them?

Steve

PS - I’m assuming that you tried resizing the row height of the cell in the KNIME table view?!

Unfortunately, I cannot upload a .txt file.

“PS - I’m assuming that you tried resizing the row height of the cell in the KNIME table view?!”

Can you explain that, please? Where should I set that, in which node?

OK.

For the row height, it is something you can adjust in the output port view for any node. If you hover the mouse over the line between rows in the Row ID column (about where the yellow highlight is below) then the cursor changes to a double-headed arrow. Left click and drag to make the row taller or shorter. If you hold down Shift while doing so, all rows in the table change to the same new size. You can do the same with column widths.

image

Steve


Thank you!


By the way, Steve, the Vernalis node you recommended (‘Load text-based files’) is waaaay faster than using the other method (using a File Reader node in a loop). Excellent advice, thanks!

No worries - that was why I wrote it! (Actually, we had been using Java Snippet and JPython Function nodes to load whole files for a while, but it was always a pain to repeat the process again for yet another workflow, and more so when we were using chemistry-type files, where we then had to convert to the correct cell format anyhow once read.)

Hope the Row height fixed it?

Steve
(Yes, the same Steve as above!)

All good Steve, thank you, the row height trick worked. Excellent node!


That’s great - that’s the best sort of bug fix! :grinning:

Always nice to hear other people are using the nodes too.

Steve