An Expert in Text Mining

Hello everyone:

I need help from someone who knows text mining, or at least I think that is what this is. I have two files. The first contains part of the information from used car sale ads. Its most important field is "Model", where the person who placed the ad types the model of the car they want to sell. The second file contains data from the Internal Revenue Service, listing different cars with their makes, their "Models", and the amount that must be paid to drive them on the streets of Chile. What I need to know is which model in the second file each row of the first file corresponds to. The "Model" field in the second file is very orderly, but in the first file it is written at the discretion of whoever placed the ad, so it is very messy and the same model may appear written in different ways. The "Model" field of the first file also contains information that has nothing to do with the model itself.

The original files are much larger and contain many other makes and models. For this example I filtered them down to 3 that are common in the Chilean market.

Thanks in advance to anyone who can help me. I have included a very basic workflow and sample files.


Hi Gabriel,

It sounds to me like you need to normalize the "Model" field of the first file, which contains the messy information, before you match it against the second file, e.g. with a join. Alternatively, you could solve this with fuzzy string matching. KNIME provides distance functions that support string distances, e.g. edit distance. A join could then be done for all pairs whose distance is below a certain threshold.
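
To make the idea concrete outside of KNIME, here is a minimal Python sketch of both steps: normalize the text, then match by edit distance under a threshold. It is not a KNIME workflow, and the function names (`normalize`, `edit_distance`, `best_match`), the threshold of 1, and the sample strings are all assumptions made up for illustration. Because the ad text contains extra words, the sketch compares each reference model against sliding windows of words in the ad rather than against the whole string.

```python
import re
import unicodedata


def normalize(text):
    """Lowercase, strip accents, replace punctuation with spaces, trim."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def edit_distance(a, b):
    """Levenshtein distance using the standard dynamic-programming rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def best_match(ad_text, reference_models, max_distance=1):
    """Return the closest reference model, or None if none is close enough.

    Each reference model with k words is compared against every run of
    k consecutive words in the ad, so unrelated words are ignored.
    """
    ad_tokens = normalize(ad_text).split()
    best, best_dist = None, max_distance + 1
    for model in reference_models:
        ref = normalize(model)
        if not ref:
            continue  # skip empty reference entries
        k = len(ref.split())
        n_windows = max(1, len(ad_tokens) - k + 1)
        windows = [" ".join(ad_tokens[i:i + k]) for i in range(n_windows)]
        dist = min(edit_distance(w, ref) for w in windows)
        if dist < best_dist:
            best, best_dist = model, dist
    return best


# Hypothetical sample data, not taken from the attached files.
reference = ["Yaris", "Corolla", "Grand Vitara"]
for ad in ["TOYOTA YARRIS 2010 full equipo",
           "suzuki gran vitara 4x4",
           "vendo camioneta"]:
    print(ad, "->", best_match(ad, reference))
```

The threshold is the part that needs tuning on real data: too low and misspellings are missed, too high and short model names start matching unrelated words.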

Cheers, Kilian
