I am teaching a module on text mining next week, so I am finally diving into workflows to help teach the fundamental concepts.
The concept of documents having metadata like author, date, title, etc., is fantastic. However, I have to admit that I do not understand the intuition behind the decision to include these elements in downstream preprocessing steps like Bag of Words, Term Frequency, etc.
Am I missing a setting (or overall pattern) so that the column we identify as the document is the only thing included in the text cleaning steps?
What kind of documents are you trying to process, and what node are you using to read them? The Tika Parser node allows choosing the metadata elements. Depending on the configuration of the input file, it may not work perfectly.
I am using Table Creator to show a small sample of records to build intuition for my students. After manually creating the data, I am using Strings to Document.
It caught my attention when I was looking at Term Frequencies. I would have expected items like the title and other metadata to be excluded from calculations and analysis.
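To make the expectation concrete, here is a minimal sketch in plain Python (not KNIME) of how relative term frequencies shift when a title is counted alongside the body text. The record, its field names, and the `term_frequencies` helper are made up for illustration, and the whitespace tokenizer is a simplification; this is not how any particular node computes its values.

```python
from collections import Counter


def term_frequencies(text):
    """Relative term frequency: count of each term / total number of tokens."""
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


# Hypothetical record, similar to a row typed into a small sample table.
record = {
    "title": "Mining text",
    "author": "A. Teacher",
    "text": "text mining finds patterns in text",
}

# Expected behavior: only the body text contributes to the counts.
body_only = term_frequencies(record["text"])

# Observed behavior: the title terms are counted together with the body.
with_title = term_frequencies(record["title"] + " " + record["text"])

print(body_only["text"], with_title["text"])
```

In this toy record the term "text" makes up 2 of 6 tokens in the body alone, but 3 of 8 once the title is included, so the same document yields different frequencies depending on which fields are counted.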
Without knowing how your input table is formatted, it's difficult to offer specific comments. Take a look at this workflow, which extracts data from multiple PDFs.