I am teaching a module on text mining next week, so I am finally diving into workflows to help teach the fundamental concepts.
The concept of documents having metadata like author, date, title, etc., is fantastic. However, I admit I don't understand the intuition behind including these elements in downstream preprocessing steps like Bag of Words, Term Frequency, and so on.
Am I missing a setting (or overall pattern) so that the column we identify as the document is the only thing included in the text cleaning steps?
What kind of documents are you trying to process, and what node are you using to read them? The Tika Parser node lets you choose which metadata elements to extract. Depending on how the input file is structured, it may not work perfectly.
I am using Table Creator to show a small sample of records to build intuition for my students. After manually creating the data, I am using Strings to Document.
It caught my attention when I was looking at term frequencies. I would have expected items like the title and other metadata to be excluded from the calculations and analysis.
Without knowing how your input table is formatted, it's difficult to offer specific comments. Take a look at this workflow, which extracts data from multiple PDFs.
Here’s a better way to get an absolute count.
You've assigned the same column to both Title and Full Text (because the Strings to Document node forces you to choose a column for each), so the TF node counts the words in both the Title and the Full Text. One workaround is to create an empty column (named e.g. Title) before the Strings to Document node and assign it to Title. An alternative is to divide the term frequency by 2 using a Math Formula node. This workflow demonstrates both methods. A final option, probably the easiest, is to set the title to an empty string.
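To build intuition outside of KNIME, here is a minimal plain-Python sketch (not KNIME's actual implementation) of why assigning the same column to both Title and Full Text doubles every term frequency, and how the empty-title workaround fixes it:

```python
# Conceptual sketch only: KNIME builds a Document from a title plus a body,
# and the TF node counts terms across both. We mimic that by concatenating
# the two fields before counting.
from collections import Counter

text = "the cat sat on the mat"

# Same string assigned to both Title and Full Text, as in the original workflow:
doubled_tokens = (text + " " + text).split()
print(Counter(doubled_tokens)["cat"])  # "cat" is counted twice

# Workaround: an empty title means only the body contributes to the counts.
fixed_tokens = ("" + " " + text).split()
print(Counter(fixed_tokens)["cat"])  # "cat" is counted once
```

The divide-by-2 workaround produces the same final numbers here, but only because every term is duplicated exactly once; the empty-title approach fixes the counts at the source.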
This is the point of my question. I am trying to understand why TF would include a document's metadata (the title) in the calculations. I find this rather counterintuitive and would have assumed that only the document column would be evaluated. The other data elements are just metadata; why would we want to process those?
I am closing this as answered; I just feel this is very counterintuitive, implicit behavior.
The node does exactly what you tell it to do. If you tell it to process a column twice, that’s what it does. There are many instances where you might want to process separate title and text columns. The node allows you to do that. Here’s a simple example.