Read XLS file in a text processing workflow

leo_vogels · November 22, 2016, 12:45pm

I have copied all the articles of our site ZDnet.be to an Excel spreadsheet. For 2016 it is a spreadsheet with approximately 4000 rows and 5 columns. All the articles are posted in the "Content" column, and every row is a new article. I want to read this spreadsheet (especially the Content column) into a text processing node. I have tried XLS Reader followed by Strings to Term (did not work) and XLS Reader followed by Strings to Document (did not work). I want to data mine all the articles as a whole (so not an analysis of every separate article/row) because I want to be able to say something about the content of the whole site.

Which workflow do I have to follow to read this Excel spreadsheet and then pass it on to the Text Processing module?

I have copied the first 5 rows into the attachment so you can see the structure of the spreadsheet.

Leo

zdnet2016.xlsx

marco_ghislanzoni · November 22, 2016, 4:00pm

Hi Leo,

When you say "did not work", what exactly didn't work? Reading in the file, or turning the articles into Documents? Which error did you get, if any?

I haven't tested this, but I assume you can read in the XLS then concatenate all the Content cells into one gigantic string, convert that string to a Document, then use the Text Processing nodes to carry on your analysis task.
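To make the concatenation step concrete, here is a minimal sketch in plain Python rather than KNIME nodes. It is only an illustration of the idea: the function name `concatenate_column`, the blank-line separator, and the sample rows are assumptions made up for this example, and the real rows would come from whatever reads the spreadsheet.

```python
from typing import Iterable, Mapping, Optional


def concatenate_column(
    rows: Iterable[Mapping[str, Optional[str]]],
    column: str = "Content",
    separator: str = "\n\n",
) -> str:
    """Join the non-empty cells of one column into a single string."""
    parts = []
    for row in rows:
        value = row.get(column)
        if value is None:
            continue  # skip rows where the cell is missing
        text = str(value).strip()
        if text:  # skip cells that contain only whitespace
            parts.append(text)
    return separator.join(parts)


# Made-up rows standing in for the spreadsheet (one article per row).
sample_rows = [
    {"Title": "First article", "Content": "Text of the first article."},
    {"Title": "Empty row", "Content": None},
    {"Title": "Second article", "Content": "  Text of the second article.  "},
]

# One big string that can then be treated as a single document.
corpus = concatenate_column(sample_rows)
print(corpus)
```

Keeping a separator between articles avoids gluing the last word of one article to the first word of the next, which would otherwise distort term counts.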

Did you try that already?

Cheers,
Marco.

leo_vogels · November 22, 2016, 6:06pm

Hello Marco,

What a good idea! I split the spreadsheet up into 2 tables and then concatenated them. It worked! I can now go on with my preprocessing. I hope I will not have to bother you again! Thank you for the tip.

Leo

marco_ghislanzoni · November 23, 2016, 12:08pm

You are welcome. Feel free to post here again in case you get stuck.

Cheers,
Marco.