The Sentence Extractor node extracts the document cell as the first sentence, followed by the other sentences contained within it. This leads to duplicate terms and similar problems when the extracted sentences are analyzed further, since the first extracted row contains all the other sentences too.
Is there a way to avoid extracting the document as the first sentence? Failing that, could an option be provided so that users can opt out of extracting the document itself as a sentence?
The Sentence Extractor node extracts all sentences from documents. The title of a document is considered a sentence as well and is extracted too. With a little Java code in the Java Snippet node, you can mark the title rows in the output table by detecting where the sentences of a new document begin. Based on these markers you can filter out the title rows.
Attached is an example workflow.
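For readers who cannot open the workflow, the marking-and-filtering idea can be sketched roughly as follows. This is plain Java, not the Java Snippet node's own API and not the code from the attached workflow. The class `TitleRowFilter`, the `Row` type and the `documentKey` field are hypothetical names, and the sketch assumes that the rows are ordered by document and that the title is the first row extracted for each document.

```java
import java.util.ArrayList;
import java.util.List;

public class TitleRowFilter {

    /** Hypothetical stand-in for one output row: a document key plus one extracted sentence. */
    public static final class Row {
        public final String documentKey;
        public final String sentence;

        public Row(String documentKey, String sentence) {
            this.documentKey = documentKey;
            this.sentence = sentence;
        }
    }

    /**
     * Drops the first row of every document, assuming that row holds the title.
     * Rows must be grouped by document, as they come out of the extraction.
     */
    public static List<Row> dropTitleRows(List<Row> rows) {
        List<Row> kept = new ArrayList<>();
        String previousKey = null;
        for (Row row : rows) {
            // A change of document key marks the start of a new document.
            boolean startsNewDocument = !row.documentKey.equals(previousKey);
            previousKey = row.documentKey;
            if (!startsNewDocument) {
                kept.add(row);
            }
        }
        return kept;
    }
}
```

Inside a workflow the same comparison against the previous row's document would set a marker column, and a row filter on that column would then remove the title rows.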
As usual you have been both prompt and helpful with a handy workflow. I noticed that the Sentence Extractor does not seem to split Document 2 in your example, and I am not sure why.
Though the Java Snippet node is helpful, I was wondering what purpose it serves to extract the document into the same column as the extracted sentences. Is there a technical issue preventing a more elegant solution that also avoids yet another node (the Java Snippet node) and an additional processing step?
Cheers and thanks again
You are right, in document two the sentences are not split. This is due to the sentence tokenizer (OpenNLP). The tokenizer sometimes has problems with very short sentences that contain a number followed by a dot ".". For example, "2." could be interpreted as the ordinal "second" rather than the number 2 followed by a full stop. Replacing the numbers "1" and "2" with "one" and "two" results in the correct tokenization.
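The replacement can be done by hand in the source text, or as a small preprocessing step before the documents are created. Here is a minimal sketch in plain Java of such a step. The class `DigitSpeller` and the sample sentence are made up for illustration, and only the standalone digits "1" and "2" are handled, as in the example above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DigitSpeller {

    // Only the two digits from the example; extend the map and the pattern as needed.
    private static final Map<String, String> WORDS = new HashMap<>();
    static {
        WORDS.put("1", "one");
        WORDS.put("2", "two");
    }

    // Matches "1" or "2" only when not attached to other digits or letters.
    private static final Pattern STANDALONE_DIGIT = Pattern.compile("\\b([12])\\b");

    /** Replaces standalone "1" and "2" with "one" and "two". */
    public static String spellOut(String text) {
        Matcher matcher = STANDALONE_DIGIT.matcher(text);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            matcher.appendReplacement(result, WORDS.get(matcher.group(1)));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(spellOut("This is document 2. It is short."));
    }
}
```

Note that this simple pattern also rewrites the "1" in a decimal such as "1.5", so it only suits text where such numbers do not occur.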
The extracted strings "Document 1" and "Document 2" are the titles of the documents, not the documents themselves. The titles are considered as sentences as well and thus extracted.
Though there is an entry for yogesh123, I am not able to see the message. Is there something that I am missing?
Hello, I am trying to do some text processing too, but my problem is that I need to keep the document information (such as category, title, and authors) after a Sentence Extractor step. I am quite new to KNIME.
It's easier for me to post a part of my workflow. Until now what I've done is:
import file to document -> do some tagging process -> BoW -> filtering stuff -> Select documents by category -> tag cloud.
This workflow does what I want, but I think that using a Sentence Extractor would make my tagging process work much better (in my current workflow I am using a lot of NER nodes, by the way). But how can I do it? Is there a way to do the tagging and filtering on the sentences, and then put the labelled sentences back into my documents, so that I can keep all of the documents' original meta-information?
I took a look at this other post but it didn't help me: https://tech.knime.org/forum/knime-textprocessing/clean-data-sentence-by-sentence
Thank you very much
To apply tagging, filtering, etc. to sentences, you would need to extract the sentences as strings using the Sentence Extractor node and convert them back to documents using the Strings To Document node. Tagging and filtering can only be applied to documents. However, I assume that applying a tagger node to documents that consist of only one sentence will not affect the tagging quality. Tagger nodes operate at the sentence level internally anyway: each sentence is passed to the integrated tagging libraries separately and is thus tagged separately.
By the way, you can apply the "filters" meta node containing the preprocessing nodes directly after the Stanford Tagger node, before the BoW Creator node. This will reduce the processing time.
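To keep the original meta-information, the general idea is that every extracted sentence carries the metadata of its parent document along with it, so that the one-sentence documents can be created with the same category, title and authors. A rough sketch of that data flow in plain Java follows. This is not the KNIME API; `SentenceFlattener`, `SentenceRecord` and the field names are hypothetical, and the sentences are assumed to be already split.

```java
import java.util.ArrayList;
import java.util.List;

public class SentenceFlattener {

    /** Hypothetical record: one sentence plus the metadata of its parent document. */
    public static final class SentenceRecord {
        public final String title;
        public final String category;
        public final String authors;
        public final String sentence;

        public SentenceRecord(String title, String category, String authors, String sentence) {
            this.title = title;
            this.category = category;
            this.authors = authors;
            this.sentence = sentence;
        }
    }

    /** Copies the parent document's metadata onto each of its sentences. */
    public static List<SentenceRecord> flatten(String title, String category,
                                               String authors, List<String> sentences) {
        List<SentenceRecord> records = new ArrayList<>();
        for (String sentence : sentences) {
            records.add(new SentenceRecord(title, category, authors, sentence));
        }
        return records;
    }
}
```

In a workflow this corresponds to keeping the metadata columns next to the sentence column, so that they are still available when the sentences are turned back into documents.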
I'm struggling with the same problem, so I downloaded the sample workflow you provided. However, KNIME (I run the current version on Windows 7) either crashes or simply refuses to read the workflow every time.
Could you please send the Java snippet or give me a hint on how to remove the document title from the Sentence Extractor output?
Thanks very much