Read list of files

Hi,

I have a problem that is keeping me from going further with my LDA topic modeling analysis. I have a set of UTF-8 encoded .txt files, but when I run the example workflow for reading a list of files (from the KNIME example workflows), it returns only the list of my texts (the title and the URL) without their content, which is what I need to process in order to find the topics in them.

Can you help me?

Am I using nodes that don’t fit my goal, which is to take a list of texts, read them, process them and find the topics and their top key terms?

Thank you
Claudia

Hi,

IMHO, you need to use the “Flat File Document Parser” node, which outputs the content of your documents as the “Document” data type, suitable for further text processing.

Martin K.

Thank you, Martin

Is this Document Parser an example workflow that I have to load from the examples list and use instead of the list-of-files one, or is it a node that I should add to the workflow I mentioned earlier?
If it is the second case, where in the flow do I have to put the node?

Thank you so much
Claudia

Hi Claudia,

The “Flat File Document Parser” node itself creates the list of documents, so you don’t need the “List Files” node. The file path of each document is also stored there and can be extracted with the “Document Data Extractor” node.
Could you give the full name of the example workflow you are referring to? Thanks.
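
For readers who find code easier to follow than node names, here is a rough Python analogue of what a directory-based document parser does: it walks a folder, reads each .txt file, and keeps the title, the file path and the full text together in one record. This is only a sketch, not the KNIME implementation; the `Document` record and the `read_documents` function are hypothetical names made up for the illustration.

```python
from pathlib import Path
from typing import List, NamedTuple


class Document(NamedTuple):
    """Hypothetical stand-in for a "Document" cell: metadata plus content."""
    title: str  # file name without its extension
    path: str   # where the file was read from
    text: str   # full textual content of the file


def read_documents(directory: str, pattern: str = "*.txt",
                   encoding: str = "utf-8") -> List[Document]:
    """Read every file matching `pattern` in `directory` into a Document."""
    docs = []
    # sorted() gives a stable, reproducible order of the documents
    for file in sorted(Path(directory).glob(pattern)):
        if file.is_file():
            docs.append(Document(file.stem, str(file),
                                 file.read_text(encoding=encoding)))
    return docs
```

Calling `read_documents("my_texts")` on a folder of UTF-8 .txt files (the folder name is just an example) returns one record per file. The point of the sketch is that the content travels with the title and path, so a separate file-listing step is not needed.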

Martin K.

Yes,

I’ve taken the example from “Control Structures”->“Loops”->“Example_of_reading_a_list_of_Files” and then for LDA topic modeling I’ve used the workflow from “Social Media”->“Topic Detection on Movie Reviews”.

My goal is to make KNIME read my .txt files, which contain just plain text not divided into columns, run the text processing steps on all of them together, and then run the LDA algorithm on the processed texts. The problem is that the workflows I’ve used don’t read any textual content :\

If you want I can screenshot the workflow I’m trying to run.

Many thanks,
Claudia

Hi Claudia,

The workflow should be adapted in the following way:

  1. Instead of the “File Reader” node, use the “Flat File Document Parser” node, because (as I suppose) you don’t have a list of documents stored in a CSV file; you just have documents to read from a specific directory on your PC.
  2. In the “Document creation and document preprocessing” metanode, there is a “Strings to Document” node. You don’t need it, because you already have a dataset of documents coming out of the “Flat File Document Parser” node.

Basically, you need a single column of “Document” type going into the “Topic Extractor (Parallel LDA)” node.
Don’t be confused if the content of the Document column looks empty: the content is stored there!

Martin K.

Thank you very much.

So, if I’ve understood correctly, I have to delete the example for reading a list of files, put the Flat File Document Parser at the start of my workflow, connect it to the “Document creation and document preprocessing” metanode, and finally link these two to the KNIME example “Topic Detection Based on Movie Reviews” to run the LDA model. Or is it better to create a new workflow from scratch by putting together the three nodes you mentioned?

I’m sorry if I keep repeating myself, but I don’t understand where the mistakes in my workflow configuration are :)

Thanks
Claudia

Hi Claudia,

See the attached workflow; I have also left comments under some nodes.
Hope it helps! :)

Martin K.

Topic Detection Based on Movie Reviews_Claudia.knwf (77.1 KB)

Thank you very much, Martin. I’m going to run it right away and see if it works with my textual data!

;)

Claudia

Martin, the workflow works really well and meets all my objectives! Just one last question: is there a way to add to the pre-processing metanode a node that loads a list of words from my own folder, instead of the existing Italian stop word list, so that the words I don’t want in my topics are removed during text processing?

Thank you

Claudia

Hi Claudia,

In the “Stop word Filter” node (included in the metanode), there is an option to use your own stop word list instead of the built-in one. Just uncheck the “Use built-in list” option and point to your own file in the “Selected File:” field.
See the description of the “Stop word Filter” node and follow the instructions on the structure of the file.
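
To make the idea of a custom stop word list concrete, here is a minimal Python sketch of the same filtering step outside KNIME. It assumes a plain text file with one word per line, which is an assumption made for this example only; the node description remains the reference for the format the node actually expects. The function names `load_stop_words` and `remove_stop_words` are hypothetical.

```python
import re
from pathlib import Path
from typing import List, Set


def load_stop_words(path: str, encoding: str = "utf-8") -> Set[str]:
    """Load a custom stop word list, assuming one word per line."""
    words = set()
    for line in Path(path).read_text(encoding=encoding).splitlines():
        word = line.strip().lower()
        if word:  # skip blank lines
            words.add(word)
    return words


def remove_stop_words(text: str, stop_words: Set[str]) -> List[str]:
    """Lowercase and tokenize `text`, dropping every token in `stop_words`."""
    # \w+ matches runs of Unicode word characters, so accented letters survive
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in stop_words]
```

With a list containing "il", "la" and "di", the sentence "Il film di Fellini è la storia" is reduced to the tokens film, fellini, è and storia, so only the words you listed are kept out of the later topic extraction.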

Martin K.