database with more than a million items

Hi guys, I have a problem that I don’t know how to tackle. I have a database with more than a million items with the following issue, for example: coca, coca cola, coca 1lt, coca cola 1lt, cocacola lt, coca 1.0lt, coca cola bott., coca cola bott.1, coca cola bott.1lt. This applies to a thousand types of items, and I should be able to associate them with ‘coca cola 1lt’ and so on for all the other items. How can I address this issue? Thanks a lot

@JEdgar first. I have a collection of articles and examples how to deal with string similarities.

First question would be if you have some sort of ground truth against which to check all the items or if you have to start from scratch grouping them. Both can be achieved.

Then I would think about if the numbers would be significant so if it is “cola 1ltr” and “cola 500ml” or if they would both be cola. If you want to include numbers you might want to check the settings of the algorithms you might choose to use.

Is there a meaningful list of main categories that you could use to pre-sort the items like pane, grissini … this could maybe help.

Then if this is a recurring task you might want to establish a collection of references that would occur more often and just match them without some fancy algorithm.

This being said, would it be possible to provide a meaningful sample of the data.


Thanks so much for the reply, sure I can pass it on

Link per il download

@JEdgar OK so maybe now you can tell us more about what would be the scenario/task that you want to do. The file contains several levels of text. What would the result look like and maybe you can also think about the other questions I asked.

Hi Edgar,

looking at the “Description” column inside your image, I doubt the result for string similarity will give good (automatic) results: if you are dealing with brand names (ex Coca Cola) it’s easy, but how can you put “Gli Stirati a Mano” together with other grissini? How can you deal with all abbreviations (ex: “Prep. per ciamb. limone” will be more similar to “Limone” ie fruit rather than other cake mixtures)?

Maybe I haven’t fully understood the problem, but why do you need to deduce the type of good from the description if the system has a 5-level category name?

Thank you and have a nice evening,
Raffaello Barri


Good afternoon friend, thank you very much for your help. I understand that I need to create a system capable of understanding ‘prep per ciamb. limone’ is actually a product for preparing ‘chambelle,’ and so in the list there are some very complex characteristics

1 Like

Each customer requests a product from the warehouse with an approximation to the original name. For example, one may call it ‘coca,’ another ‘coca cola,’ and another ‘co.la1lt,’ and so on. However, for all products, I need to create an automatic system that normalizes all names assigned by customers to 'coca cola 1 lt

@JEdgar I still do not understand what data you provided, what column is the one with the texts you want to explore - “Description” I assume, and what is the case with the 5 levels of text. Maybe you give us a description of the dataset and try to explain again what you initially have and how a solution might look like along the lines of questions I have asked, like: is there a ground truth or will you start from scratch aggregating data.

Also you seem to have Brand names in there also. Once approach could be to remove the band names in the first place from the description if you want to have products that are ‘brand agnostic’ so to say.

1 Like

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.