When using the Text Processing PDF Parser node, there are numerous occasions where rogue spaces appear before or after punctuation such as brackets, commas, and dashes. Would it be possible to allow the String Replacer node to replace text inside a document column rather than just a string column? Or to have a separate string replacer node in the text processing/transformation section?
The reason is that the OSCAR tagger can miss an awful lot of structures in patents etc. because of these rogue spaces, so the document needs to be cleaned up prior to tagging and BagOfWords.
I appreciate there is a workaround: convert the document to a string, use String Replacer, and then convert back to a document again. But it is rather cumbersome.
You can use the "Replacer" node of the Textprocessing plugin, which replaces terms in a bag of words, and the same terms in the documents as well (which is what you are looking for), based on a regular expression. Alternatively, you could use the "Dict Replacer" node, which replaces based on a dictionary.
To use these nodes, you would:
1. Use the Parser node (e.g. PDF)
2. Convert the list of documents into a bag of words with the "BoW" node
3. Replace the rogue spaces using the "Replacer" node (use the Deep Preprocessing option, which is switched on by default anyway)
4. Group by the Document column using the GroupBy node (the result is the preprocessed list of documents)
5. Use the OSCAR tagger to tag compounds.
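If it helps to prototype the regular expressions before putting them into the "Replacer" node, here is a rough Python sketch of the kind of patterns that strip rogue spaces around brackets, commas, and dashes. This is only an illustration outside KNIME: the function name `remove_rogue_spaces`, the rule list, and the sample string are made up for this example, and the node's own regex and replacement syntax may differ. Note also that the dash rule will join any "word - word" sequence, which may be too aggressive for ordinary prose.

```python
import re

# Hypothetical cleanup rules (assumptions for illustration, not the
# Replacer node's configuration). Each entry is (pattern, replacement).
_RULES = [
    # Whitespace right after an opening bracket: "( foo" -> "(foo"
    (re.compile(r"([(\[{])\s+"), r"\1"),
    # Whitespace right before a closing bracket or comma: "foo )" -> "foo)"
    (re.compile(r"\s+([)\]},])"), r"\1"),
    # Whitespace around a dash that sits between two word characters:
    # "2 -chloro" or "chloro- 4" -> "2-chloro", "chloro-4"
    (re.compile(r"(?<=\w)\s*-\s*(?=\w)"), "-"),
]


def remove_rogue_spaces(text: str) -> str:
    """Apply each cleanup rule in order and return the cleaned text."""
    for pattern, replacement in _RULES:
        text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    sample = "2 -chloro- 4-( trifluoromethyl )pyridine , m.p."
    print(remove_rogue_spaces(sample))
```

Running it on a few lines copied out of a parsed PDF is a quick way to check that the patterns do not join things you want kept apart.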
I hope this helps,
Great, many thanks.
I forgot that the preprocessing nodes actually replace the contents of the document cells.
This is just what I need to do.