Similarity search

marco_aurelio_maia_silva · November 17, 2016, 5:32pm

Hi everybody. I'm trying to do a similarity search in my workflow. I have a column named "Oferta" that lists many kinds of offers made by a company. When I add the node, it does not return all the results. Is that normal?

marco_ghislanzoni · November 17, 2016, 5:45pm

It may depend on your data, on your configuration/workflow or on both. Difficult to help without more information. Can you share some of the data and your workflow?

Cheers,
Marco.

marco_aurelio_maia_silva · November 18, 2016, 1:39pm

That's my project.

I have a database with some of a company's product offers, but only for a few customers.

I thought of clustering both the entire customer base and this small base, and then finding the similarity between the two. That way I would arrive at a final model with all products, but "Isenção", for example, does not appear at the final node. Can anyone help me, please?

marco_ghislanzoni · November 18, 2016, 2:56pm

Hi Marco Aurelio,

Having looked at your workflow, I think I understand what you are trying to do.

In practice you have clustered your customers on a subset of the data and now you are trying to assign a broader set of customers to the same clusters by evaluating their proximity to the 4 centroids.

If this is the case, here is what you should do:

1. Connect the 2nd output port of the k-Means node to the 2nd input port of a Similarity Search node. This feeds the node the centroids of the 4 clusters defined by the smaller initial data set.
2. Connect the new, larger data set, after normalization, to the 1st input port of the Similarity Search node.
3. Configure the Similarity Search node to use the Euclidean distance and include all the Double type columns (the node should do this automatically once you select the Euclidean distance, provided the inputs are already connected).

Now when you execute the Similarity Search node you will get 2 additional columns in your larger dataset, one indicating the nearest cluster for each row and the other indicating the Euclidean distance to the centroid of that cluster. See the following picture.
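To make the calculation concrete, here is a minimal Python sketch of what this nearest-centroid assignment amounts to. It only illustrates the arithmetic and is not how the Similarity Search node is implemented; the function name, the cluster names and the sample values are all made up for the example.

```python
import math

def assign_to_nearest_centroid(rows, centroids):
    """For each row, return (nearest cluster name, Euclidean distance to it)."""
    results = []
    for row in rows:
        # Compare the row against every centroid and keep the closest one.
        name, centre = min(centroids.items(), key=lambda item: math.dist(row, item[1]))
        results.append((name, math.dist(row, centre)))
    return results

# Hypothetical centroids from the smaller data set (already normalized).
centroids = {
    "cluster_0": (0.0, 0.0),
    "cluster_1": (1.0, 1.0),
}

# Hypothetical rows from the larger data set, normalized the same way.
rows = [(0.1, 0.0), (0.9, 1.0), (0.6, 0.8)]

for row, (cluster, distance) in zip(rows, assign_to_nearest_centroid(rows, centroids)):
    print(row, cluster, round(distance, 3))
```

Each row ends up with the two extra values described above: the name of the closest cluster and the distance to its centroid.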

https://www.knime.org/files/knime_text_processing_introduction_technical_report_120515.pdf

Well, you were almost there I would say.

Cheers,
Marco.

18-11-2016_14-46-33.png

marco_aurelio_maia_silva · November 18, 2016, 5:33pm

Great! That's what I was looking for. Thanks a lot!

Now a question. Is there a node that lets me save the second output of k-Means, so that I can use just that output in another workflow?

I was thinking of creating 2 workflows: one that builds and saves the clusters and the model, to run just once a month, for example, and another that runs once a week to score the whole base. In that second workflow I would simply load the cluster model.

marco_aurelio_maia_silva · November 18, 2016, 5:47pm

Another question. How do I now go back and, based on the cluster, get the "Oferta"?

marco_ghislanzoni · November 21, 2016, 9:57am

Sure. Just use a Table Writer node in your k-means workflow and read it back with Table Reader in your second workflow.
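Outside KNIME, the same write-once, read-many pattern could look roughly like the Python sketch below. It is an analogy only: a CSV file stands in for the table file, and the function names, file name and centroid values are assumptions for the example.

```python
import csv
import os
import tempfile

def write_centroids(path, centroids):
    """Write one row per cluster: its name followed by the centroid coordinates."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for name, coords in centroids.items():
            writer.writerow([name, *coords])

def read_centroids(path):
    """Read the file back into a {cluster name: coordinates} dictionary."""
    with open(path, newline="") as handle:
        return {row[0]: tuple(float(value) for value in row[1:]) for row in csv.reader(handle)}

# Hypothetical centroids produced by the monthly clustering run.
centroids = {"cluster_0": (0.25, 0.5), "cluster_1": (0.75, 0.125)}

# A temporary directory stands in for a location both workflows can reach.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "centroids.csv")
    write_centroids(path, centroids)   # done by the monthly job
    restored = read_centroids(path)    # done by the weekly job

print(restored)
```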

Cheers,
Marco.

marco_ghislanzoni · November 21, 2016, 10:05am

Is Oferta associated with each cluster on a 1:1 basis? Or is it an outcome variable of some sort? As I understand it, it is assigned according to some sort of rule. Is that right?

Cheers,
Marco.

marco_aurelio_maia_silva · November 21, 2016, 2:11pm

The client has some rules for offering an "Oferta". The idea is for us to say which offers are more likely to be accepted.

I thought about taking the lowest distance value for each contract and joining it with the model clusters to get the offers from each cluster. If you can, take a look at the final joiner.
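Roughly, the idea looks like the Python sketch below. It is not the actual workflow, and the contract names, cluster names, offers and distances are invented for the example.

```python
from collections import Counter

# Hypothetical distances from each contract to each cluster centroid.
distances = [
    ("contract_A", "cluster_0", 0.20),
    ("contract_A", "cluster_1", 0.90),
    ("contract_B", "cluster_0", 0.70),
    ("contract_B", "cluster_1", 0.30),
]

# Hypothetical offers observed among the customers of each cluster in the small base.
cluster_offers = {
    "cluster_0": ["Offer X", "Offer X", "Offer Y"],
    "cluster_1": ["Offer Z", "Offer Y", "Offer Z"],
}

def nearest_cluster_per_contract(distances):
    """Keep, for every contract, only the cluster with the lowest distance."""
    best = {}
    for contract, cluster, distance in distances:
        if contract not in best or distance < best[contract][1]:
            best[contract] = (cluster, distance)
    return best

def offers_for_contracts(distances, cluster_offers):
    """Join each contract's nearest cluster with that cluster's offers, most frequent first."""
    joined = {}
    for contract, (cluster, _) in nearest_cluster_per_contract(distances).items():
        counts = Counter(cluster_offers.get(cluster, []))
        joined[contract] = [offer for offer, _ in counts.most_common()]
    return joined

print(offers_for_contracts(distances, cluster_offers))
```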

Do you think what I did is a good idea?

model.knwf

marco_ghislanzoni · November 23, 2016, 12:21pm

OK, at this point I am not sure I still understand the earlier logic.

So you are not trying to cluster the customers according to their characteristics, but rather to predict which offer they are more likely to accept, given their characteristics and the behavior of previous customers (whether they accepted that offer, or didn't accept it and churned).

If this is the case, you should be looking into recommendation systems rather than clustering. See the following examples:

https://www.knime.org/knime-applications/lastfm-recommodation

https://www.knime.org/knime-applications/market-basket-analysis-and-recommendation-engines
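To give a feel for the difference from clustering, here is a very small Python sketch that scores offers for a new customer by the acceptance rate among the most similar previous customers. This is a simple nearest-neighbour illustration under made-up data, not what the linked examples implement; the function name, the features, the offers and the choice of k are all assumptions.

```python
import math
from collections import defaultdict

# Hypothetical history: (customer features, offer, whether it was accepted).
history = [
    ((0.1, 0.2), "Offer X", True),
    ((0.2, 0.1), "Offer X", True),
    ((0.15, 0.25), "Offer Y", False),
    ((0.9, 0.8), "Offer Y", True),
    ((0.8, 0.9), "Offer X", False),
]

def score_offers(customer, history, k=3):
    """Return the acceptance rate of each offer among the k most similar past customers."""
    # Sort past records by Euclidean distance to the new customer and keep the k closest.
    neighbours = sorted(history, key=lambda record: math.dist(customer, record[0]))[:k]
    shown = defaultdict(int)
    accepted = defaultdict(int)
    for _, offer, was_accepted in neighbours:
        shown[offer] += 1
        accepted[offer] += int(was_accepted)
    return {offer: accepted[offer] / shown[offer] for offer in shown}

print(score_offers((0.12, 0.18), history))
```

The offer with the highest score would be the one to propose first. With real data you would also need enough history per offer for the rates to mean anything.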

Cheers,
Marco.

marco_aurelio_maia_silva · November 29, 2016, 1:26pm

Thanks a lot Marco!
