How to make the most consistent, most compatible classification in every language?

umutcankurt · January 7, 2024, 11:04am

Hello everyone;
For a few days, I have been struggling to produce the most accurate, best-matching classifications against a common standard classification across different languages.

I’m looking for an accurate, working workflow that can be applied in any language: assigning classification codes to the records of a data set, using a reference code list prepared for each language.

** The most important part is producing the best, most relevant classification in each language. For English, French, Spanish, German… it’s easy, but I’m already working with data in 26 languages, and the number of languages will increase.

What I have in mind right now is the simplest approach. I created a reference list of codes and descriptions for each of the 26 languages. If a code can be assigned whenever any word from its description appears in the data text, as in the example below, I will be one step closer: at least a classification will be produced from the reference code list for each language.

Example;

Data: Cable and construction repairs will be carried out in Paris.

classification/category
code: 112 / description: construction engineering services
code: 488 / description: cable supply
code: 266 / description: cable works
code: 996 / description: construction products

As a result, all four codes should appear in the classification code column for this data. The matching has to be flexible to give the most relevant, closest result. If a code is assigned only when its full description appears in the text, there will be very few matches. The same goes for a rule that requires the whole word group of the description to be present in the text, even out of order: matches will still be low.

** The rule I need for assigning codes: if any word in a code's description appears in the data text, add that code to the code column.

In other words, the column output should look like this for the example data above:

Classification code: 112, 488, 266, 996
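
For readers who want to see the rule outside KNIME, here is a minimal Python sketch of the "any word matches" idea, using the example data and code list from this post. The names `REFERENCE`, `tokens`, and `classify` are made up for illustration; this is not the attached workflow.

```python
import re

# Reference code list from the example above (code -> description).
REFERENCE = {
    "112": "construction engineering services",
    "488": "cable supply",
    "266": "cable works",
    "996": "construction products",
}

def tokens(text):
    """Lowercase the text and return its set of words (Unicode-aware)."""
    return set(re.findall(r"\w+", text.lower()))

def classify(text, reference):
    """Return every code whose description shares at least one word with the text."""
    text_words = tokens(text)
    return [code for code, desc in reference.items() if tokens(desc) & text_words]

data = "Cable and construction repairs will be carried out in Paris."
print("Classification code:", ", ".join(classify(data, REFERENCE)))
# Classification code: 112, 488, 266, 996
```

A rule this loose will also match on very common words (for example "and" or "services") if they occur in a description, so a per-language stop-word list would probably be needed in practice.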

Finally, if there is a better method, please share an example. What I need is an example workflow that works like this and that I can use in any language.

umutcankurt · January 7, 2024, 11:47am

Tag Documents With Reference Code.knwf (58.9 KB)
The workflow closest to a solution is attached. However, it only works if the words in each description in the reference code list are separated by ", ". Naturally, this means inserting ", " by hand in a structure with a lot of codes and in many different languages. If it worked without ", ", it seems it could produce a consistent category classification and match in different languages. … I look forward to suggestions and example solutions.
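
One way to avoid depending on the ", " separator is to split descriptions on any run of non-word characters. The Python sketch below only illustrates that idea and says nothing about how the attached workflow is built; `description_words` is a hypothetical helper, and the Turkish sample string is invented for the example.

```python
import re

def description_words(description):
    """Split a description into lowercase words on any run of non-word characters."""
    return [w for w in re.split(r"\W+", description.lower()) if w]

# The same words come out with or without the ", " separator.
print(description_words("cable, works"))   # ['cable', 'works']
print(description_words("cable works"))    # ['cable', 'works']

# \W is Unicode-aware in Python 3, so letters such as ş are kept inside words.
print(description_words("kablo/inşaat işleri"))
```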

mlauber71 · January 7, 2024, 12:52pm

@umutcankurt you could think about further text manipulations, especially stemmers (tractor instead of tractors) that could result in more matches. Also removing numbers and punctuation characters might help - though in your case this might also be irrelevant.
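
As a rough illustration of these manipulations outside KNIME, the Python sketch below lowercases text, drops numbers and punctuation, and applies a toy plural-stripping function. `toy_stem` is only an English-only stand-in written for this example; a real workflow would use a proper stemmer for each language.

```python
import re

def clean(text):
    """Lowercase, then replace digits and punctuation with spaces and split into words."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)      # remove numbers
    text = re.sub(r"[^\w\s]", " ", text)  # remove punctuation characters
    return text.split()

def toy_stem(word):
    """Toy English-only stand-in for a stemmer: strips a trailing plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def normalize(text):
    """Clean the text and stem every remaining word."""
    return [toy_stem(w) for w in clean(text)]

print(normalize("2 Tractors, 14 cables!"))  # ['tractor', 'cable']
```

After this step "tractor" and "tractors" reduce to the same token, which is what lets a description and a text match even when the word forms differ.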

You will have to check which languages are supported.

At the very left there are some further manipulations in the Component.

umutcankurt · January 7, 2024, 2:05pm

@mlauber71 Thank you for the answer, I will look into this.

umutcankurt · January 7, 2024, 5:47pm

Unfortunately it didn’t work, even though I chose the Turkish NLP package. If I can put together an alternative to the example I sent, I think it will work. I haven’t found the solution yet.

mlauber71 · January 7, 2024, 5:54pm

@umutcankurt an empty title does not mean that it did not work, just that no title has been configured. The document should still be there and be usable. Just make sure the cascade of documents being manipulated is set up correctly.

Maybe you can post your sample workflow with Turkish texts.

umutcankurt · January 7, 2024, 6:22pm

Hi @mlauber71 The attached workflow includes a sample data table (for classification) and a reference code list.

If we can build a working model with Turkish as the example language, I can then create a workflow that I can use for other languages.

The data table and reference code list are in Turkish.
Turkish Data and Turkish Referance Code List.knwf (339.2 KB)

mlauber71 · January 7, 2024, 7:13pm

@umutcankurt this is the modified workflow with the stemmers in Turkish also being used.

Tag Documents With Reference Code - Turkish - KNIME Forum (76399).knwf (649.7 KB)

umutcankurt · January 7, 2024, 8:20pm

@mlauber71
This is great work! Thank you very much for your time and interest. I have checked it, and the code matching is quite successful. If the reference code list is edited (manual editing is required), it will code more accurately.

There is only a small correction needed for Turkish. Turkish has special letters, such as "ş, ü, ö, ğ, i", etc.

For example, “golet” comes out wrong in the result: it should be “gölet”, a word meaning pond (a small lake). If these letters are replaced in this way, whether uppercase or lowercase, semantic errors occur.
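
If the cause is that the text normalization strips diacritics or lowercases with non-Turkish rules (an assumption; the workflow itself was not inspected here), the effect can be reproduced in a few lines of Python. `tr_lower` and `strip_diacritics` are hypothetical helpers written for this illustration.

```python
import unicodedata

def tr_lower(text):
    """Lowercase with Turkish casing rules: dotted capital I -> i, plain I -> dotless i."""
    # "\u0130" is the dotted capital I, "\u0131" is the dotless lowercase i.
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()

def strip_diacritics(text):
    """Decompose characters and drop combining marks (this is what loses ö, ü, ş, ğ)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

word = "G\u00d6LET"                      # GÖLET
print(tr_lower(word))                    # gölet (diacritic preserved)
print(strip_diacritics(tr_lower(word)))  # golet (meaning lost)
```

Keeping the Turkish letters and lowercasing with a Turkish-aware rule on both the data text and the reference code list should keep words such as “gölet” intact on both sides of the match.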

Yes, I think this will give many people ideas for text classification / categorization.

Thanks again for taking the time on a Sunday.
