Why on Earth do the preprocessing nodes such as the BoW creator strip away existing columns on the data flow? This is very annoying and I cannot see a good design reason for doing it. How is one expected to join data subsequently?
Moreover, why do nodes such as the taggers demand that you have only one document column in the input? Why can't they be configured to let you select any one of the document columns?
This sort of issue is getting in the way of my using KNIME for a text processing application. I think these design decisions are perverse, to say the least.
We are aware that it is inconvenient that the preprocessing nodes strip attached columns. This issue is already on our list.
You can join additional columns to a bag of words by using the original document as the join column. Make sure the original documents are appended in your BoW.
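Just to illustrate the idea outside of KNIME, here is a rough pandas sketch of that join, assuming the original, unmodified document travels along as a column. The column names and data are made up for illustration; this is not the KNIME implementation.

```python
import pandas as pd

# Input table: documents plus an extra column that the BoW creator drops.
docs = pd.DataFrame({
    "Document": ["the cat sat", "dogs bark loudly"],
    "Source":   ["fileA.txt", "fileB.txt"],
})

# What a BoW conceptually looks like: one row per (document, term) pair.
bow = pd.DataFrame({
    "Document": ["the cat sat"] * 3 + ["dogs bark loudly"] * 3,
    "Term":     ["the", "cat", "sat", "dogs", "bark", "loudly"],
})

# Re-attach the dropped columns by joining on the original document.
# This only works if the document column was not modified in between.
bow_with_meta = bow.merge(docs, on="Document", how="left")
print(bow_with_meta)
```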
Keeping additional columns makes no sense for the BoW creator. The node requires a list of documents and returns a BoW. Additional columns cannot be handled reasonably here, since every extra value would have to be duplicated across all the term rows of its document, and that duplication is not reasonable.
We are also aware that it is inconvenient that the tagger nodes require exactly one document column. This is on our list as well. Use the Column Filter before the tagger node to filter out the other columns.
Btw, applying the preprocessing nodes to the BoW is not recommended (since 2.9). Apply the preprocessing nodes directly to the list of documents (as with the tagger nodes); this is much faster. Create the BoW afterwards, before computing the frequencies and creating the document vectors.
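Only as a rough sketch outside of KNIME, to show why the ordering matters: a BoW has one row per document/term pair, so a per-document preprocessing step applied to the BoW is repeated once for every term. The toy preprocess function and names below are illustrative assumptions, not the actual node logic.

```python
def preprocess(doc: str) -> str:
    # Stand-in for stop-word filtering, stemming, case conversion, ...
    return doc.lower()

documents = ["The Cat Sat", "Dogs Bark Loudly"]

# Recommended order: preprocess each document once, then build the BoW.
cleaned = [preprocess(d) for d in documents]                 # N calls
bow = [(d, term) for d in cleaned for term in d.split()]

# Discouraged order: build the BoW first, then preprocess per row.
raw_bow = [(d, term) for d in documents for term in d.split()]
slow = [(preprocess(d), term) for d, term in raw_bow]        # N * terms calls
```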
The problem with the approach that you outline is that, because the tagger nodes require one document column, you can't keep the original document as a column to join on! The document column gets modified by the tagging procedure and subsequently becomes useless for joining.
If you could just retain the columns that went into the BoW creator, then I could use a counter or suchlike to index individual documents, and this would not be an issue.
A student of mine just had another nice idea. If you don't need the Author or the Category, you can use one of those fields for an identifying ID. Afterwards you can extract it with the DocumentDataExtractor and join on this ID only.
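If it helps, here is a rough sketch of the trick outside of KNIME: stash a row ID in a metadata field that the preprocessing leaves alone, then pull it back out afterwards and join on it. The field and column names are purely illustrative; the real workflow would put the ID in the Category (or Author) field and recover it with the DocumentDataExtractor node.

```python
import pandas as pd

table = pd.DataFrame({
    "doc_id": [0, 1],
    "text":   ["The Cat Sat", "Dogs Bark Loudly"],
    "Source": ["fileA.txt", "fileB.txt"],
})

# Pack the ID into a metadata field before tagging/preprocessing.
docs = [{"category": str(i), "text": t}
        for i, t in zip(table["doc_id"], table["text"])]

# ... tagging and preprocessing modify "text" but leave "category" alone ...
for d in docs:
    d["text"] = d["text"].lower()

# Equivalent of the DocumentDataExtractor step: recover the ID and join on it.
extracted = pd.DataFrame({
    "doc_id": [int(d["category"]) for d in docs],
    "text":   [d["text"] for d in docs],
})
result = extracted.merge(table[["doc_id", "Source"]], on="doc_id", how="left")
print(result)
```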
Kilian, does this make sense? I never tried it myself.
Thank you for taking the time and effort to show me this approach. I think I will use Iris' method.
Can you
1. find some way of making this information more public on your site, as I'm sure I'm not the only one who has had these issues, and
2. ensure that the problems with the document nodes that I originally flagged up, namely preprocessors stripping off columns and allowing only one document column, get fixed, and soon, please?
Why do we, the users, need to keep track, anyway? It might be a good idea to keep an internal id, and to provide several equals()-methods (or whatever is used internally), and then give us a way to choose which one to use. Like, say, whenever there's an option for deep preprocessing, add a second option to "change internal id" or something.
I also don't see an obvious reason why you don't want to duplicate rows in the BoW. If you had to store physical copies of the contents, maybe, but couldn't you just forward to another iterator or something?
For me, the additional steps are not the main concern, but the join after processing is quite expensive.
Why? Because I worked a lot with Pipeline Pilot in a previous job, and it made things simple by allowing you to keep columns on the data flow, so you could uniquely identify which documents they belonged to. Having to cram identifiers into a document's metadata and then extract them seems like a long-winded approach.
I also don't see why it's unreasonable to expect the BoW to preserve the column values from the input. If these did produce an overhead, you could just leave it to the developer to restrict the input flow with a column filter so that the overhead stays manageable. Far better than giving the poor developer no option whatsoever. I could understand this design decision if the bag of words were produced across all the documents in the stream, but it is built on a document-by-document basis, so terms are already duplicated in the output flow.