Separation of multiple languages in one document

#1

Hello,

I would like to set up a workflow for classifying patent documents as relevant/not relevant regarding certain technologies or processes. The classification should be based on the set of claims. In many documents (especially European patents), the claims appear in more than one language — mostly English, German and French — within the same document. My question is: is it possible to separate, for example, the German, English and French text blocks from each other, so that they could be processed further by a separate workflow branch or a separate workflow?

Kind regards

Michael

0 Likes

#2

Hi @michael19602016 -

That’s an interesting problem! Our existing nodes that detect languages (Tika Language Detector and Amazon Comprehend (Dominant Language)) do so for either strings or documents, but for the entirety of the field that is input. So one approach might be:

  • first convert your strings to documents (using language independent whitespace tokenization)
  • extract sentences from those documents
  • detect the language of each sentence
  • branch based on the language
  • continue analysis downstream
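For reference, the detect-then-branch idea in the steps above can be sketched outside KNIME in plain Python. This is only a toy stopword-count heuristic — not the actual Tika detection algorithm — and the claim sentences are invented examples:

```python
# Toy per-sentence language detection and branching. NOT the Tika
# algorithm -- just a stopword-count heuristic to illustrate
# "detect the language of each sentence, then branch on it".

STOPWORDS = {
    "en": {"the", "of", "a", "for", "and", "wherein", "according"},
    "de": {"der", "die", "das", "und", "ein", "eine", "nach", "zur"},
    "fr": {"le", "la", "de", "un", "une", "et", "selon", "pour"},
}

def detect_language(sentence: str) -> str:
    """Return the language code whose stopwords overlap the sentence most."""
    tokens = set(sentence.lower().replace(",", " ").replace(".", " ").split())
    scores = {lang: len(tokens & words) for lang, words in STOPWORDS.items()}
    return max(scores, key=scores.get)

def branch_by_language(sentences):
    """Group sentences into one 'branch' (list) per detected language."""
    branches = {}
    for s in sentences:
        branches.setdefault(detect_language(s), []).append(s)
    return branches

claims = [
    "A process for the preparation of a composition according to claim 1.",
    "Verfahren zur Herstellung einer Zusammensetzung nach Anspruch 1.",
    "Procédé pour la préparation d'une composition selon la revendication 1.",
]
print(branch_by_language(claims))
```

In KNIME the same split would be done with the language-detector node followed by a row splitter on the detected language column.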

A dummy workflow might look something like this:

Does that help get you started?

3 Likes

#3

Dear Scott,

thanks a lot for your answer and advice. I tried following your model workflow using an input that contains English, French and German text. For test purposes, I used only one document: a set of patent claims in all three languages. Converting to Document with whitespace tokenization works, and the Sentence Extractor provides an output table with sentences in all three languages. Unfortunately, the Tika Language Detector doesn’t seem to work properly in my case, since “fr” is assigned to all sentences. Do you have an idea what the reason could be (for example, that the sentences still contain numbers and, at the same time, some technical (chemical) expressions)? As far as I can see, the rule-based row splitter works properly.

Kind regards

Michael

0 Likes

#4

Hello Scott,

I made a mistake - it was late in the evening and I chose the wrong column to be processed. Now it works as you described.

Thanks,

Michael

2 Likes

#5

That’s good to hear!

0 Likes

#6

Hello Scott,

I have one remaining question: after “stripping off” the unneeded languages, is it possible to put the sentences back together to form a document? Or is that not necessary for further processing towards machine learning (the identity and content of the document should be retained)?

Kind regards

Michael

0 Likes

#7

Hi @michael19602016 -

I suppose it depends on what your ultimate goal is on the machine learning side. All of the metadata associated with the original document is still retained, and you can create explicit fields for that metadata using the Document Data Extractor node.

The Sentence Extractor produces strings, so you could always join them back together (minus the sentences of other languages) to create a new, cleaned document.
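Rejoining the kept sentences per document can be sketched in plain Python as a stand-in for a GroupBy/concatenate step before Strings To Document. The row layout and document IDs here are illustrative assumptions, not actual KNIME fields:

```python
# Sketch: rebuild one cleaned text per document from per-sentence rows,
# keeping only one language. Rows are (document_id, language, sentence)
# tuples; IDs and column layout are invented for illustration.
from collections import OrderedDict

rows = [
    ("EP0001", "en", "A process for preparing a composition."),
    ("EP0001", "de", "Verfahren zur Herstellung einer Zusammensetzung."),
    ("EP0001", "en", "The process of claim 1, wherein X is used."),
]

def rebuild_documents(rows, keep_lang="en"):
    """Concatenate sentences of the kept language, grouped by document id."""
    docs = OrderedDict()
    for doc_id, lang, sentence in rows:
        if lang == keep_lang:
            docs.setdefault(doc_id, []).append(sentence)
    return {doc_id: " ".join(parts) for doc_id, parts in docs.items()}

print(rebuild_documents(rows))
```

The resulting one-string-per-document table is what you would then feed into Strings To Document to get cleaned documents back.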

But the question remains: what is your ultimate goal with the cleaned data? Is it classification, or topic extraction, or…?

0 Likes

#8

Hello Scott,

my ultimate goal is to classify the documents. I have manually identified a number (several hundred) of patent documents as relevant to a certain technology I’m interested in. I would like to use those documents to train a model.
Using the trained model, I would like to classify unknown documents. I would prefer a predictor that outputs the relevancy as a value between 0 and 1 for each document (for example, a Logistic Regression Learner / Predictor?).
Since I’m new to data science and KNIME, I’m looking for a model that is not too complicated to parameterize to start with. Could you give me a hint which model to try?

Kind regards
Michael

0 Likes

#9

Hello Scott,

it seems I’m not able to combine my extracted sentences back into documents. I tried the Strings To Document node. Is that correct, or is there another way?

Kind regards

Michael

0 Likes