Other languages: Source for Dictionary Tagger

baltow · January 7, 2013, 1:15pm

Hello everyone, I created a text processing workflow:

I have some PDF documents in Italian, and my goal is to find certain terms in them. I started with PDF Parser -> POS Tagger -> BoW Creator -> various filters and, last, the Snowball Stemmer set to Italian (the whole workflow is shown in the attached file).

I would like to know whether any of the tagger and filter nodes in my workflow are useless when applied to Italian rather than English or German and, if so, how to use the Dictionary Tagger with my Italian vocabulary txt file (the file contains Italian terms without spaces between them) to isolate the words I'm looking for in those PDF files.

Thanks in advance, best regards,

Riccardo

P.S. I have just started using KNIME for my engineering graduation thesis. I'm a newbie, so please be patient.

workflow.png

kilian.thiel · January 7, 2013, 3:04pm

Hi Riccardo,

the "Dictionary Tagger" gives you all the flexibility you need to tag words in documents of various languages, as long as you can provide the dictionaries. The words you want to tag (multi-word terms are allowed too) must be listed in a dictionary file, one per line. Read this file in with the "File Reader" node, which creates a table with a single column containing the words to tag. This data table must be connected to the second input port of the "Dictionary Tagger" node. The first input port of the tagger node takes the list of documents created by the PDF Parser (or any other parser node). In the "Dictionary Tagger" you can specify which tag type is used when tagging the words. Please use the tag type "NE" (in version 2.7.0 there is a bug if you use any other tag type; it will be fixed in the next version). Then create a bag of words with the "BoW creator" node, followed by a "Named Entity Filter" node, which filters the words and keeps only those that were tagged before. Finally, you can use the "TF" or "IDF" node to compute the frequencies of the remaining words.

In short:

"PDF Parser" & "File Reader" -> "Dictionary Tagger" -> "BoW creator" -> "Named Entity Filter" (optional) -> "TF" -> ...
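For readers who want to see the idea outside of KNIME, here is a minimal Python sketch of the same chain: load a one-word-per-line dictionary, keep only the tokens found in it, and compute a relative term frequency per document. It is only an illustration, not how the KNIME nodes are implemented; the function names `load_dictionary` and `tag_and_count` and the sample words are made up, and multi-word dictionary entries are not handled.

```python
from collections import Counter


def load_dictionary(lines):
    """Build a lookup set from dictionary lines (one entry per line)."""
    return {line.strip().lower() for line in lines if line.strip()}


def tag_and_count(documents, dictionary):
    """For each document, return the relative frequency of dictionary terms."""
    results = []
    for text in documents:
        tokens = text.lower().split()  # split at whitespace only
        tagged = [t for t in tokens if t in dictionary]  # keep dictionary hits
        counts = Counter(tagged)
        total = len(tokens) or 1  # avoid division by zero on empty documents
        results.append({term: n / total for term, n in counts.items()})
    return results


# Hypothetical dictionary content and documents, for illustration only.
dictionary = load_dictionary(["aria", "acqua", "", "motore"])
docs = ["il motore usa aria e acqua", "aria aria fuoco"]
frequencies = tag_and_count(docs, dictionary)
```

With a real file, the lines could come from `open(path, encoding=...)` instead of the inline list.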

Hope this will help you.

Cheers, Kilian

baltow · March 14, 2013, 5:42pm

Hello Kilian, first of all thanks for the short response time!

It worked out, Thanks again!

On the other hand, I ran into another issue related to this topic. I built a Stop Word Filter into my KNIME workflow, with an ASCII txt file as its source that I filled with the most common Italian stopwords (including, among others, all the Italian article forms, one per line). It worked, except for articles followed by an apostrophe. The example below, "the air", shows when an Italian article takes an apostrophe (for those who don't know):

English: "The air" -> Italian: "La aria" ("La" is one of the Italian articles, but here it is followed by a word that begins with a vowel, so the "a" after the "L" is replaced by an apostrophe) -> "L'aria"

The problem is this little "bug" in the Stop Word Filter: it doesn't filter these article forms. I assumed it could be due to the lack of a space between the article and the word it refers to, so I added a Punctuation Erasure node to the workflow, but the article with its apostrophe still ended up on the same line as its noun in the output table. In the example above, a "keygraph word extractor" or a "TF/IDF filter" placed after the filters counts "l'aria" as a single word, whereas the word for "air" is just "aria".

Any suggestions would be really appreciated, thanks in advance,

I hope I was clear and stayed on topic.

Best regards

kilian.thiel · March 17, 2013, 4:49pm

Hello baltow,

the stop word filter can only filter entire terms (usually tokens). In your example, L'aria is a single term (token), since the tokenizer splits only at whitespace characters. I guess that term is not in your stop word list, so it is not found and not filtered. One solution is to use the "Replacer" node and replace the "L'" prefixes via a regular expression. In the dialog of the node, the regular expression would be "L'" and the replacement an empty string.
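The same regular expression idea can be tried out in plain Python before configuring the node. This is only a sketch: the pattern below goes beyond the single "L'" mentioned above and covers a few other elided forms, and that list is an assumption and certainly not exhaustive. The names `ELISION` and `strip_elided_articles` are hypothetical.

```python
import re

# Assumed, non-exhaustive list of elided Italian articles and prepositions
# (l', un', dell', ...) attached directly to the following word.
# Both the straight (') and the typographic (’) apostrophe are accepted.
ELISION = re.compile(r"\b(?:l|un|dell|all|dall|nell|sull)['’](?=\w)", re.IGNORECASE)


def strip_elided_articles(text):
    """Remove the elided prefix so that only the bare word remains."""
    return ELISION.sub("", text)


# Whitespace tokenization after the replacement now yields the bare nouns.
tokens = strip_elided_articles("L'aria e l'acqua dell'isola").split()
```

After the replacement, "aria" is a token of its own, so a stop word or dictionary lookup on whole tokens can match it.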

Cheers,

Kilian

baltow · March 26, 2013, 7:26pm

Thanks for the kind answer and for your constant feedback. I really appreciate it, as it helps me understand the KNIME basics that I am discovering day by day. Back to the topic: I added the apostrophe forms to the stopword txt file I created earlier, and it worked. (Speaking of txt encoding: when the txt file is used with KNIME, is it better to keep it as ANSI or to change it to Unicode?)

Thanks again, and have a good rest of the day,

Riccardo

kilian.thiel · March 29, 2013, 2:15am

Good to read that it worked. It is better to save the stopword file in ANSI format.
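A short Python illustration of why the file encoding matters for stopword matching. It assumes that "ANSI" means Windows-1252 (the usual case on Western European Windows) and uses an accented Italian word as a made-up example; it does not say anything about which encoding a particular KNIME node expects.

```python
# "perché" is a hypothetical stopword containing a non-ASCII character.
word = "perché"

ansi_bytes = word.encode("cp1252")  # one byte for "é"
utf8_bytes = word.encode("utf-8")   # two bytes for "é"

# Reading UTF-8 bytes as if they were ANSI garbles the accented letter,
# so the garbled term no longer matches the intended stopword.
misread = utf8_bytes.decode("cp1252")
stopwords = {word}
still_matches = misread in stopwords
```

Pure ASCII entries, such as "l'" with a straight apostrophe, are encoded identically in both encodings, so only accented words are affected by such a mismatch.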

Cheers, Kilian

jmbittes · April 9, 2021, 6:22pm

Hello, I'm doing text mining in Portuguese and I tried to use the Dictionary Tagger. However, when I look at the BoW output, no words are found: the Term column shows the value UNKNOW (NE) for all words. When I use the POS Tagger with the Spanish tokenizer, it fills the Term column, but with errors, because the grammars of the two languages differ.
I checked your tip in this thread and I'm doing it that way. Is there something else I have to pay attention to?

jmbittes · April 11, 2021, 12:56pm

I need help about dictionary tagger.
