We would like to identify the language of given texts and tag the rows with the language ID.
I tried to get the Java language detection library working with the Java Snippet node and included langdetect.jar as an additional library. However, I always get a compile error when reading the profiles: "Error: Unhandled exception type com.cybozu.labs.langdetect.LangDetectException".
I would be grateful if anyone with experience could help us.
Have you tried using a try/catch block to catch possible exceptions thrown by the external library?
Thanks for this, we got it working.
The issue was that DetectorFactory.loadProfile(path) can be called only once.
After adding a boolean flag to make sure it is called only once, the script works fine.
(And you are right, the catch block is needed.)
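For illustration, here is a minimal, self-contained sketch of the one-time-load guard described above. The class and method names are my own; in the real Java Snippet, the guarded block would wrap DetectorFactory.loadProfile(path) in a try/catch for LangDetectException (shown here only as comments, so the sketch runs without langdetect.jar):

```java
// Sketch of the one-time initialization guard used in the Java Snippet.
// In the real snippet, the guarded block would call
//     DetectorFactory.loadProfile(profileDirectory);
// inside the try, since loadProfile() throws LangDetectException
// and must not be called a second time.
public class ProfileLoader {
    private static boolean profilesLoaded = false;
    private static int loadCount = 0; // for demonstration only

    public static synchronized void loadOnce() {
        if (profilesLoaded) {
            return; // profiles already loaded, skip the second call
        }
        try {
            // real code: DetectorFactory.loadProfile("/path/to/profiles");
            loadCount++; // stands in for the actual loading work
            profilesLoaded = true;
        } catch (Exception e) { // real code catches LangDetectException
            throw new RuntimeException("Could not load language profiles", e);
        }
    }

    public static int getLoadCount() {
        return loadCount;
    }
}
```

Calling loadOnce() at the top of the snippet for every row is then safe, because only the first call does any work.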
I'd also be interested in language detection; in my case I'd simply like to discriminate between texts in German (with some low fraction of English words) and texts in English.
For my problem I thought about some simple (but probably computation-intensive) approaches, like a detection based on letter frequencies or the most common stop words, and then I stumbled across your thread...
Since my knowledge of Java is fairly limited (to non-existent): would you be willing to share your language detection?
If not, would you have a hint for me on how best to solve my problem?
Any constructive input is welcome!
I solved the problem by simply row-filtering for words, or combinations of words, that are unique to one language and not the other. In my case I was looking for English, French, and German. For example, the German " ein " and the English " the " work quite well to distinguish those two languages. It is not 100% failsafe, but close. You might want to take a few test documents in each language and find the keywords that are both most common and unique. This of course becomes difficult as the number of languages you want to distinguish increases.
Best regards, Jerry
Thank you for your answer. In my case I still have small proportions of the other language in the texts, which interfered with this otherwise very neat approach.
I have now ended up using a stop-word-based document classification (top 30 words in EN/DE), somewhat similar in spirit, which works fine for me as well.
I use a combination of row filters with include/exclude sequences: the exclude filter after the include filter removes the false positives. So, include rows with " ein ", followed by exclude rows with " the " (or similar combinations), cleans it up quite well. Stop-word filtering surely also works.
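The include/exclude row-filter idea can also be sketched in plain Java. This is a toy version under my own assumptions: the marker words below are only examples, not a tuned stop-word list, and a real classifier would use many more markers per language:

```java
import java.util.Arrays;
import java.util.List;

// Toy keyword-based language guess, mirroring the include/exclude
// row-filter idea: a text counts as German if it contains a German
// marker word and no English marker word, and vice versa.
public class MarkerWordClassifier {
    // Example marker words only; in practice these would be the most
    // common stop words that are unique to each language.
    private static final List<String> GERMAN_MARKERS =
            Arrays.asList(" ein ", " und ", " der ");
    private static final List<String> ENGLISH_MARKERS =
            Arrays.asList(" the ", " and ", " of ");

    public static String classify(String text) {
        // pad with spaces so markers also match at the text boundaries
        String padded = " " + text.toLowerCase() + " ";
        boolean de = GERMAN_MARKERS.stream().anyMatch(padded::contains);
        boolean en = ENGLISH_MARKERS.stream().anyMatch(padded::contains);
        if (de && !en) return "DE"; // German marker found, no English one
        if (en && !de) return "EN"; // English marker found, no German one
        return "UNKNOWN"; // both or neither matched: leave for manual review
    }
}
```

The "UNKNOWN" bucket plays the role of the rows that survive neither filter cleanly and still need a second look.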
Recently I realized that there are “simple” statistical approaches that do in fact use letter frequencies, i.e. the probability of an h following a t, to compute the language of a word.
For sorting some word lists with mixed languages that I generated (e.g. including English technical terms in a German text), this would be tremendously useful: I would at least get a pre-classification that only has to be checked, instead of having to assign a language manually word by word. That is possible for several thousand words, but not a lot of fun, and it becomes even less desirable for bigger lists. Since I work with a high proportion of special (if not endemic) words, this approach also has great appeal to me.
It should be easy to obtain text corpora / word lists online to perform statistical analyses on them, but how to go on from there? Computing word frequencies in KNIME is easy thanks to the available TF nodes, but right now I wouldn’t know how to access the statistics of one letter following another. Could this be done with a simple approach, or is it easier to implement something like the Java language detection mentioned above by Dnreb?
The NGram node would do the trick: create character 2-grams with frequencies, then extract the first letter, group by it to sum up all occurrences, and join the sums back to the n-gram frequency table. Finally, divide each n-gram frequency by its sum. Attached you find an example workflow.
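The workflow steps above amount to estimating the conditional probability P(next letter | current letter) — e.g. the probability of an h following a t. A small self-contained Java sketch of the same computation (class and method names are my own, chosen for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Estimate P(next letter | current letter) from a list of words,
// mirroring the KNIME workflow: count character 2-grams, sum the
// counts per first letter, and divide each 2-gram count by that sum.
public class BigramModel {
    private final Map<String, Integer> bigramCounts = new HashMap<>();
    private final Map<Character, Integer> firstLetterCounts = new HashMap<>();

    public void addWord(String word) {
        String w = word.toLowerCase();
        for (int i = 0; i + 1 < w.length(); i++) {
            String bigram = w.substring(i, i + 2);           // the 2-gram
            bigramCounts.merge(bigram, 1, Integer::sum);     // 2-gram count
            firstLetterCounts.merge(w.charAt(i), 1, Integer::sum); // group-by sum
        }
    }

    // P(second | first), e.g. probability('t', 'h') for "h after t"
    public double probability(char first, char second) {
        int total = firstLetterCounts.getOrDefault(first, 0);
        if (total == 0) {
            return 0.0; // letter never seen in first position
        }
        return bigramCounts.getOrDefault("" + first + second, 0) / (double) total;
    }
}
```

Training one such model per language and scoring each word against both would then give the kind of pre-classification described above.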
Thanks a lot for the workflow! I didn’t have the NGram character option in mind...
I tested it with word lists of 10k words and it worked fine. However, when using a bigger number of words (including plurals and the like, e.g. 1.6 million German words), the NGram node ends up with the attached error message. Since I used a list of German words, I tested what happens if a different document title and author is used: the error still persists. (See both workflows attached.)
What could be the reason for this error? How could I avoid it?
Thanks a lot in advance