I have data from different cars with columns such as VIN, speed, power, rpm, latitude, longitude, timestamp and vehicle type. For each VIN I have multiple rows of data (i.e. for each car I have a lot of data).

Now I will get a new car's data with multiple rows - I need to do a similarity check and should be able to tell which of the old cars this new car is similar to.

1) That is, how similar are the new car's running conditions to those of any of the old cars?

2) Also, is there a way to say that the new car is similar to different old cars with some percentage?

Example:

Old Cars - A, B, C, D

New Car - N [Where A, B, C, D, N are VIN numbers of those respective cars]

Results should be like (N is similar to A with 90%, N is similar to C with 80%) etc.

This will help me do preventive maintenance on the new car based on the maintenance data I have for that particular similar old car. If there is any example you can refer me to, that would be of great help.

I think k-NN would be a good approach. On the other hand, it would be a great help if you could share a sample of your data; that would make it easier to think about how we could help you.

- "geographical": latitude, longitude -> based on your use case description, it appears to me that these two are not required for a similarity search, are they?

- categorical: vehicle type -> does comparing similarity include vehicle type as a descriptor for similarity, or would you rather search within each vehicle type? If vehicle types should not be mixed, the similarity search problem can be split per vehicle type (i.e. with a Group Loop), which makes finding the neighbors much easier and more reliable.

- string: VIN -> have you already extracted the manufacturer and vehicle attributes (vehicle type, model year) from the VIN code (e.g. using String Manipulation)? The VIN itself appears useless IMO for similarity search, for it will render the cars practically unique. Though I'm intrigued: how come you have several rows per VIN? Is your VIN truncated already?

- how does the timestamp fit in? What does it represent? Date of acquisition? Date of data entry? etc.

The problem definition

The Similarity Search node will find the nearest neighbor for each new car and provide information on how close it is. To find all similarities, you would have to analyze the distance matrix - you could automatically extract the n closest, etc.
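For intuition, here is a minimal Python sketch of that distance-matrix idea outside of KNIME. Each car is reduced to a single hypothetical feature vector (e.g. mean speed, power and rpm per VIN); the numbers and the distance-to-percentage mapping are made up for illustration, not part of the Similarity Search node:

```python
import numpy as np

# Hypothetical per-VIN feature vectors (mean speed, power, rpm); values are invented
old_cars = {
    "A": np.array([62.0, 110.0, 2100.0]),
    "B": np.array([45.0,  80.0, 1700.0]),
    "C": np.array([60.0, 105.0, 2000.0]),
    "D": np.array([30.0,  60.0, 1400.0]),
}
new_car = np.array([61.0, 108.0, 2050.0])  # car "N"

# One row of the distance matrix: N against every old car (Euclidean here)
distances = {vin: float(np.linalg.norm(vec - new_car))
             for vin, vec in old_cars.items()}

# Map distances to a 0-100% similarity score (one of many possible mappings)
max_d = max(distances.values())
similarity = {vin: round(100 * (1 - d / max_d), 1) for vin, d in distances.items()}

# Extract the n closest neighbours, most similar first
n = 2
closest = sorted(distances, key=distances.get)[:n]
print(closest)      # the two nearest old cars to N
print(similarity)   # "N is similar to X with ...%" style scores
```

Note that rpm dominates these raw distances because of its much larger scale - which is exactly why normalization, discussed further down, matters.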

All you need to do beforehand is define the similarity problem and choose the appropriate (set of) distance function(s). Regarding similarity: which are precisely the attributes that you consider important to determine maintenance based on past experience? What would you look at if you had to look it up manually? Which simplifications do you perform when searching manually? (e.g. categories of power, rpm, model year range, etc.)

- The characteristics you mentioned are all right.
- As of now, I too don't see a need to use latitude and longitude for similarity search.
- I already have all the details of the vehicle - make, model, year, and the purpose the vehicle is used for (e.g. van, utility service, construction, etc.).
- Timestamp - it represents the time at which the data is recorded. The data for these different vehicles is recorded while they are running (so each time data is taken, that particular date and timestamp is recorded; obviously multiple rows of data will be recorded for the same VIN).

I can do different similarity searches here - based on model, based on utilization of the vehicle. But while doing so, I need to see which vehicle of that model my new vehicle corresponds to.

How about transforming the numeric variables into bins, thus making them categorical?
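As an illustration of that binning idea, here is a sketch with pandas; the bin edges, labels and sample readings are assumptions you would replace with domain-driven choices:

```python
import pandas as pd

# Hypothetical readings; the values and bin edges are made up for illustration
df = pd.DataFrame({"speed": [12, 48, 95, 130],
                   "rpm":   [900, 1800, 3200, 5200]})

# Turn each numeric variable into an ordered categorical via fixed-width bins
df["speed_bin"] = pd.cut(df["speed"], bins=[0, 30, 60, 100, 200],
                         labels=["low", "medium", "high", "very_high"])
df["rpm_bin"] = pd.cut(df["rpm"], bins=[0, 1500, 3000, 6000],
                       labels=["idle", "cruise", "hard"])
print(df[["speed_bin", "rpm_bin"]])
```

Once binned, the variables can go through a categorical distance function just like vehicle type.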

Alternatively, you could use two distance functions (e.g. Manhattan for the numeric variables and Tanimoto for the categorical ones), then combine them using Aggregated Distance and perform the similarity search based on that combined function.
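Outside of KNIME, that combination can be sketched in plain Python; the weights, the one-hot encoding of vehicle type and the feature values below are all assumptions for illustration:

```python
import numpy as np

def manhattan(a, b):
    # L1 distance on the numeric part (e.g. normalized speed, power)
    return float(np.abs(a - b).sum())

def tanimoto_distance(a, b):
    # Tanimoto (Jaccard) distance on binary / one-hot categorical features
    both = np.logical_and(a, b).sum()
    either = np.logical_or(a, b).sum()
    return 1.0 - both / either if either else 0.0

def aggregated_distance(num_a, num_b, cat_a, cat_b, w_num=0.5, w_cat=0.5):
    # Weighted combination of the two distance functions
    return w_num * manhattan(num_a, num_b) + w_cat * tanimoto_distance(cat_a, cat_b)

# Hypothetical normalized numeric features plus one-hot vehicle-type flags
num_x, cat_x = np.array([0.6, 0.8]), np.array([1, 0, 0])  # car X: van
num_y, cat_y = np.array([0.5, 0.7]), np.array([1, 0, 0])  # car Y: van
print(aggregated_distance(num_x, num_y, cat_x, cat_y))
```

The weighting between the numeric and categorical parts is itself a modelling decision, just like the choice of distance functions.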

Or you can create a group out of model, and with Group Loop perform x Similarity Searches.
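The Group Loop idea translates roughly to a groupby in Python - one similarity search per model group, restricted to the new car's own model. The data, column names and models below are invented for illustration:

```python
import pandas as pd

# Hypothetical summarized data: one row per old car, plus the new car's attributes
old = pd.DataFrame({
    "vin":   ["A", "B", "C", "D"],
    "model": ["T1", "T1", "T2", "T2"],
    "speed": [62.0, 45.0, 60.0, 30.0],
})
new_model, new_speed = "T1", 61.0

# Group Loop equivalent: run a separate nearest-neighbor search per model group
for model, group in old.groupby("model"):
    if model != new_model:
        continue  # only compare within the new car's model
    nearest = group.loc[(group["speed"] - new_speed).abs().idxmin(), "vin"]
    print(f"model {model}: nearest old car is {nearest}")
```

Splitting by group keeps cars of different models from ever being matched to each other.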

Finally, another way to define the analysis is using association rules. This analyzes the similarity between features instead of between observations.

Attached is a simple example workflow to perform a (Euclidean) distance calculation by vehicle type. The distance matrix shows it all. You can attach the same distance function to Similarity Search, which then selects the nearest neighbor for you.

In the end, everything depends on how YOU define similarity. If b, h and j were your new records to assess similarity for, to which record would you associate each of them, and why? Ask yourself more specifically: which variables are useful? Are there groups of variables that need to be weighted? How should each variable be transformed (dummies, binning, normalization, PCA, etc.)? Which distance function?

Thank you so much for your help. This example really helps me a lot. Speed and vehicle type are the variables I primarily need to consider for similarity, as I have already run a linear correlation and found that these two variables are very important.

Can you show me how to attach the distance function to Similarity Search when I have the new data for which I need to find similarity? I have attached the sample data for which I need to find the similarity with the old data.

The third port of Similarity Search is the one you need to connect the distance function to. The distance function should come from the reference data set. It's as simple as that, even within the Group Loop.

Regarding variable selection, you should not be worried about multicollinearity: you can also include power in the similarity search, for this will increase precision. Moreover, consider that you have only tested for LINEAR correlation and not for correlation in general. It is, however, a good idea not to mix trucks, vans and cars in the similarity search (maintenance is certainly different for each type); therefore, I'd keep the by-vehicle-type approach.

What you also have to decide is whether you want to punish bigger distances between pairs more than proportionally (Euclidean, squared pairwise distances, L2 norm) or not (Manhattan, unsquared pairwise distances, L1 norm). The former punishes outliers, while the latter does not treat them any differently from any other point.
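A tiny numeric example of that difference between the L1 and L2 norms (the points are made up to make the contrast obvious):

```python
import numpy as np

# Two candidate neighbours, equally far from the reference in L1 terms
ref     = np.array([0.0, 0.0])
spread  = np.array([4.0, 4.0])   # moderate differences in both variables
outlier = np.array([8.0, 0.0])   # one large difference in a single variable

l1_spread  = np.abs(ref - spread).sum()    # Manhattan: 8.0
l1_outlier = np.abs(ref - outlier).sum()   # Manhattan: 8.0 - treated the same
l2_spread  = np.linalg.norm(ref - spread)  # Euclidean: ~5.66
l2_outlier = np.linalg.norm(ref - outlier) # Euclidean: 8.0 - punished harder

print(l1_spread, l1_outlier)
print(l2_spread, l2_outlier)
```

Under L1 both candidates are equally close; under L2 the one with the single large deviation ends up farther away.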

The same goes for the Normalizer node: z-score punishes outliers, while min-max (0-1) normalization shortens the distance between outliers and the "normal" points. The advantage of the Normalizer is that you can practically spot the close-by points yourself at a glance at the data set (given the right sorting). Normalizing is important in any case; otherwise the variable with the bigger range and values will dominate the distance calculations.
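The contrast between z-score and min-max normalization can be sketched in Python with a deliberately extreme speed value (the numbers are invented):

```python
import numpy as np

speed = np.array([20.0, 30.0, 40.0, 50.0, 200.0])  # 200 is an outlier

# z-score: the outlier keeps a large standardized distance from the rest
z = (speed - speed.mean()) / speed.std()

# min-max 0-1: the outlier squeezes the "normal" points into a narrow band
mm = (speed - speed.min()) / (speed.max() - speed.min())

print(np.round(z, 2))   # the outlier sits ~2 standard deviations out
print(np.round(mm, 2))  # the four normal points are compressed near 0
```

Whichever you pick, both variables end up on comparable scales, so no single one dominates the distance.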

BTW, there is no other rule in this exercise: it is unsupervised, and you should do what your domain knowledge and common sense recommend - and that includes experimenting :-)