Similarity search

Hi everybody. I'm trying to do a similarity search in my workflow. I have a column named "Oferta" that lists the many kinds of offers made by a company. When I add the node, it does not return all the results. Is that normal?

It may depend on your data, on your configuration/workflow or on both. Difficult to help without more information. Can you share some of the data and your workflow?


That's my project.

I have a database with some product offers from a company, but only for a few customers.

I thought of clustering the entire customer base and this small base, and then finding the similarity between the two. That way I arrive at a final model with all the products, but "Isenção", for example, is not in the final node. Can anyone help me, please?

Hi Marco Aurelio,

having looked at your Workflow I think I understand what you are trying to do.

In practice you have clustered your customers on a subset of the data and now you are trying to assign a broader set of customers to the same clusters by evaluating their proximity to the 4 centroids.

If this is the case, here is what you should do:

  1. Connect the 2nd output port of the k-Means node to the 2nd input port of a Similarity Search node. This will feed the node with the centroids of the 4 clusters defined by the smaller initial data set.
  2. Connect the new larger dataset, after normalization, to the 1st input port of the Similarity Search node.
  3. Configure the Similarity Search node to use the Euclidean distance and include all the Double type columns (the node should do this automatically when you select the Euclidean distance and the inputs are already connected).

Now when you execute the Similarity Search node you will get 2 additional columns in your larger dataset, one indicating the nearest cluster for each row and the other indicating the Euclidean distance to the centroid of that cluster. See the following picture.

Well, you were almost there I would say.
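
For readers who want to see the logic outside KNIME, here is a minimal Python sketch of the same nearest-centroid assignment. It is not what the Similarity Search node runs internally; the centroid values, the customer rows and the `nearest_centroid` helper are all made up for illustration, and both inputs are assumed to be normalized the same way.

```python
import math

# Hypothetical centroids from a k-means run on the smaller, normalized data set
# (4 clusters, 2 numeric features; the values are invented for illustration).
centroids = {
    "cluster_0": (0.1, 0.2),
    "cluster_1": (0.8, 0.1),
    "cluster_2": (0.5, 0.9),
    "cluster_3": (0.9, 0.8),
}


def nearest_centroid(row, centroids):
    """Return (cluster name, Euclidean distance) for the centroid closest to row."""
    name = min(centroids, key=lambda c: math.dist(row, centroids[c]))
    return name, math.dist(row, centroids[name])


# Hypothetical rows from the larger data set, normalized like the centroids.
customers = [(0.15, 0.25), (0.85, 0.75), (0.5, 0.5)]

# One (nearest cluster, distance) pair per row, like the two appended columns.
assigned = [nearest_centroid(row, centroids) for row in customers]

for row, (cluster, distance) in zip(customers, assigned):
    print(row, cluster, round(distance, 4))
```

Each row ends up with the name of its closest centroid and the distance to it, which corresponds to the two extra columns described above.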


Great! That's what I was looking for. Thanks a lot!

Now a question: is there a node that lets me save the second output from k-Means, so that I can use just that output in another workflow?

I was thinking of creating 2 workflows: one that builds and saves the clusters and the model, to run just once a month, for example, and another that runs once a week to score the whole base. In the second one I would just load the cluster model.

Another question: how do I now go back and, based on the cluster, get the "Oferta"?

Sure. Just use a Table Writer node in your k-means workflow and read it back with Table Reader in your second workflow.
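
As a rough analogy in Python (not the KNIME nodes themselves), the write-once, read-many pattern could look like the sketch below. The file name `centroids.csv`, the column names and the two helper functions are assumptions made for illustration; in KNIME the Table Writer and Table Reader nodes handle the file format for you.

```python
import csv
import os
import tempfile

# Hypothetical centroid table: cluster name plus normalized feature values.
centroids = [
    {"cluster": "cluster_0", "feature_a": 0.1, "feature_b": 0.2},
    {"cluster": "cluster_1", "feature_a": 0.8, "feature_b": 0.1},
]


def write_centroids(path, rows):
    """Save the centroid table (last step of the monthly clustering workflow)."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


def read_centroids(path):
    """Load the centroid table (first step of the weekly scoring workflow)."""
    with open(path, newline="") as f:
        return [
            {k: (v if k == "cluster" else float(v)) for k, v in row.items()}
            for row in csv.DictReader(f)
        ]


# A temporary directory stands in for the location shared by both workflows.
path = os.path.join(tempfile.mkdtemp(), "centroids.csv")
write_centroids(path, centroids)
loaded = read_centroids(path)
print(loaded == centroids)
```

The scoring side only needs the saved centroids, so the clustering does not have to be rerun every week.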


Is Oferta associated with each cluster on a 1:1 basis? Or is it an outcome variable of some sort? As I understand it, it is assigned according to some sort of rule. Is that right?


The client has some rules for making an "Oferta". The idea is for us to say which offers are more likely to be accepted.

I thought about taking the lowest distance value for each contract and associating it with the model clusters to get the offers from each cluster. If you can, take a look at the final joiner.

Do you think what I did is a good idea?




Ok, at this point I am not sure I understand the earlier logic anymore.

So you are not trying to cluster the customers according to their characteristics, but more to predict which offer they are more likely to accept given their characteristics and on the basis of the behavior of previous customers (whether they accepted that offer or didn't accept it and churned).

If this is the case, you should be looking into recommendation systems rather than clustering. See the following examples:




Thanks a lot Marco!