# Hierarchical Clustering after using Symbolic Aggregate approXimation (SAX)

Hello,

I'm working at CETHIL (Lyon Thermal Center), a research department at INSA Lyon, in France.

One part of my research project is a data mining study of experimental thermal data. We have been advised (by researchers in computer science) to use SAX to create a distance matrix. Here is the web page of the algorithm: http://www.cs.ucr.edu/~eamonn/SAX.htm

I want to use KNIME to perform hierarchical clustering, but I don't know if there is a node capable of doing that. After adding some KNIME extensions I wasn't able to find what I wanted, which is why I'm posting on this forum.

So my question is: does someone know if such a node exists?

Guillaume

PS: My English is probably bad, I'm sorry...

What about the "Hierarchical Clustering" node ;-) If you have a distance matrix generated externally, you need to load it via the Distance Matrix Reader and then feed it into the "Hierarchical Clustering (Distmatrix)" node. You may need to install the "KNIME Distance Matrix" extension first.
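To illustrate what clustering on a precomputed distance matrix means, here is a minimal pure-Python sketch of single-linkage agglomerative clustering. It is not the KNIME node's implementation; the function name `single_linkage` and the example matrix are made up for illustration, and the linkage choice is an assumption.

```python
def single_linkage(dist, n_clusters):
    """Agglomerative clustering on a precomputed, symmetric distance matrix.

    dist is a square list of lists; the result is a list of clusters,
    each a sorted list of row indices. (Hypothetical helper, not KNIME code.)
    """
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    # Start with every item in its own cluster.
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the closest pair of clusters (b > a, so deleting b is safe).
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


# Made-up distances between four items: 0 and 1 are close, 2 and 3 are close.
example = [
    [0, 1, 9, 8],
    [1, 0, 7, 9],
    [9, 7, 0, 2],
    [8, 9, 2, 0],
]
print(single_linkage(example, 2))
```

The point is only that the clustering step needs numeric pairwise distances, whatever produced them.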

I installed the "KNIME Distance Matrix" extension and explored the "Hierarchical Clustering (Distmatrix)" and "Distance Matrix Reader" nodes, but they don't seem able to accept a symbolic distance matrix.

Maybe I didn't give enough details about our project.

SAX is an algorithm that transforms time series into a symbolic distance matrix. We want to use it because it seems more effective at clustering time series than the other methods. SAX runs in Matlab, but we want to use KNIME for clustering the symbolic distance matrix.

Do you think it's possible to do it with Knime?

Hm, I have never heard of "symbolic" distance matrices. What do they look like?

To add: there are MATLAB nodes for KNIME/MATLAB integration. I have never tried them, though.

Cheers
E

A row could look like: 'b'    'b'    'c'    'd'    'c'    'c'    'b'    'a'

It transforms a time series into a string.

(this string is from a sample application of the Matlab code)
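For readers who want to see how a time series can become such a string, here is a minimal Python sketch of the usual SAX steps: z-normalize, reduce with PAA, then map each segment mean to a letter. It is not the Matlab code mentioned above. The function names are made up, the alphabet size of 4 is an assumption, the breakpoints are the rounded values for a standard normal distribution with 4 symbols, and the series length is assumed to be divisible by the number of segments.

```python
from bisect import bisect_right
from statistics import mean, pstdev

# Breakpoints cutting a standard normal distribution into 4 equiprobable
# regions (rounded); one letter per region. Alphabet size 4 is an assumption.
BREAKPOINTS = [-0.67, 0.0, 0.67]
ALPHABET = "abcd"


def znormalize(series):
    """Rescale the series to zero mean and unit standard deviation."""
    mu = mean(series)
    sigma = pstdev(series)
    if sigma == 0:
        return [0.0 for _ in series]  # constant series: avoid dividing by zero
    return [(x - mu) / sigma for x in series]


def paa(series, n_segments):
    """Piecewise Aggregate Approximation: the mean of each equal-length segment.

    Assumes len(series) is divisible by n_segments.
    """
    seg_len = len(series) // n_segments
    return [mean(series[k * seg_len:(k + 1) * seg_len]) for k in range(n_segments)]


def sax_word(series, n_segments):
    """Return the SAX string for one time series (hypothetical helper)."""
    reduced = paa(znormalize(series), n_segments)
    # bisect_right counts how many breakpoints are <= the value,
    # which is the index of the region the value falls in.
    return "".join(ALPHABET[bisect_right(BREAKPOINTS, v)] for v in reduced)


print(sax_word([1, 2, 3, 4, 5, 6, 7, 8], 4))
```

A steadily rising series like the one above maps to increasing letters, one letter per segment.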

I installed the "Matlab Scripting" extension in KNIME. The "Matlab Snippet" node is really interesting because it allows running Matlab during the KNIME execution. So we can give it a time series as input, and the output will be our "Symbolic Distance Matrix" from SAX (if I am not mistaken).

But we still have the problem of getting the "Hierarchical Clustering (DistMatrix)" and "Distance Matrix Reader" nodes to read the "Symbolic Distance Matrix"...

I hope you have the beginning of a solution for us :)

I'm a little bit familiar with SAX, however I do not know what a symbolic distance matrix is... Can you maybe post a paper where the term is introduced?

But you basically need a distance measure between your symbolic values? And for that you need an ordering of the values?

Can you give two example rows and what would be the difference between them?

Actually I don't know exactly how it works, but here is what I understood (maybe not well):

If you give SAX a matrix as input, it returns the same matrix plus a new symbolic column which represents the "distance" (in our research the matrix can, for example, contain the temperature, the wind velocity and the solar radiation during a day). The symbolic column can look like: 'b'    'b'    'c'    'd'    'c'    'c'    'b'    'a'.

Here is the output explanation of SAX (first version) from the readme file:

Output:
symbolic_data:    matrix of symbolic data (no-repetition).  If consecutive subsequences
have the same string, then only the first occurrence is recorded, with
a pointer to its location stored in "pointers"
pointers:         location of the first occurrences of the strings

I have a co-worker who asked me the same question about measuring the distance between two symbolic values... but I have no idea how it works...

If you want to, I can make a list of papers I read.

Anyway, I have a meeting with co-workers next week. I will keep you informed about this problem.

You should try to isolate your problem. I fear that you don't yet understand how to calculate the distance between two symbolic strings. Once you understand that, you should be able to build your distance matrix. The calculation is described in almost every paper by Keogh. He also gave a talk at Google, which is available on YouTube, in which he explains SAX using simple examples.
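As a rough illustration of that calculation, the sketch below follows the lower-bounding distance that the SAX papers call MINDIST: adjacent or equal symbols count as distance 0, and otherwise the distance is the gap between the breakpoints separating the two symbols. The alphabet size of 4, the rounded breakpoints and all function names are assumptions made for this example; `original_length` is the length of the raw time series before PAA.

```python
from math import sqrt

# Rounded breakpoints for a 4-letter alphabet (an assumption for this sketch).
BREAKPOINTS = [-0.67, 0.0, 0.67]
ALPHABET = "abcd"


def symbol_dist(s1, s2):
    """Distance between two SAX symbols, based on their order in the alphabet."""
    i, j = ALPHABET.index(s1), ALPHABET.index(s2)
    if abs(i - j) <= 1:
        return 0.0  # equal or neighbouring symbols are treated as distance 0
    # Gap between the breakpoints that separate the two symbol regions.
    return BREAKPOINTS[max(i, j) - 1] - BREAKPOINTS[min(i, j)]


def mindist(word1, word2, original_length):
    """MINDIST between two SAX words of equal length."""
    if len(word1) != len(word2):
        raise ValueError("SAX words must have the same length")
    w = len(word1)
    total = sum(symbol_dist(a, b) ** 2 for a, b in zip(word1, word2))
    return sqrt(original_length / w) * sqrt(total)


def distance_matrix(words, original_length):
    """Numeric, symmetric distance matrix for a list of SAX words."""
    return [[mindist(u, v, original_length) for v in words] for u in words]


words = ["bbcdccba", "aabbccdd", "ddccbbaa"]
for row in distance_matrix(words, 128):
    print(["%.2f" % d for d in row])
```

The result is an ordinary numeric distance matrix, which is the kind of input a distance-matrix-based hierarchical clustering expects.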

Cheers

Sebastian

Hello gruedin,

Did you find a solution for implementing SAX correctly using KNIME?

I'm currently trying to build a very similar workflow based on the work of Usman Habib and Gerhard Zucker.
I successfully implemented the first part of the SAX transformation (doing PAA with an R Snippet node).

I think the solution for the symbolic distance matrix is to count the occurrences of each symbol for each time series. In a next step I would calculate the Euclidean distance between each pair of time series (that's basically like comparing histograms).
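The histogram idea can be sketched in a few lines of Python. This is only an illustration of the approach described in this post, not a tested KNIME workflow; the 4-letter alphabet and the function names are assumptions.

```python
from collections import Counter
from math import sqrt

ALPHABET = "abcd"  # assumed alphabet size of 4


def symbol_histogram(word, alphabet=ALPHABET):
    """Count how often each symbol of the alphabet occurs in a SAX word."""
    counts = Counter(word)
    return [counts[s] for s in alphabet]


def euclidean(u, v):
    """Euclidean distance between two equal-length numeric vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def histogram_distance_matrix(words):
    """Pairwise Euclidean distances between the symbol histograms."""
    hists = [symbol_histogram(w) for w in words]
    return [[euclidean(h1, h2) for h2 in hists] for h1 in hists]


print(symbol_histogram("bbcdccba"))
print(histogram_distance_matrix(["aabb", "bbaa", "ccdd"]))
```

Note that a histogram ignores the order of the symbols, so "aabb" and "bbaa" end up at distance 0 even though the underlying series have different shapes.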

I think this is a good solution (in theory!), but I'm struggling with the implementation!
If you happen to have some functioning implementation I would really appreciate some advice on it.

Oso

Hello Oso and Gruedin,

I am also struggling to find any implementation or support of SAX within the KNIME platform; have either of you found anything?

I did learn that Teradata Aster offers SAX within their product; see the link below for details. I am setting up my environment now and may just need to walk away from KNIME altogether.

Data Science - IOT Pattern Discovery with SAX