Two tough KNIME PMML questions

Hi KNIME users!

I have 2 very specific questions regarding a k-means cluster model PMML file that I created from a KNIME project.

I really hope someone can help.

Both questions relate to the same PMML file, which was generated from a k-means cluster model in KNIME. The KNIME workspace also contains several PMML data prep nodes in succession.

Q1: Decoding the actual cluster model.

I have previously decoded other k-means models/PMMLs and always found that the number of ClusteringField elements matches the array size. In the one below, however, I have 18 ClusteringField elements but cluster center arrays of size n=25. Given that apparent mismatch, how can I construct the cluster distance function?

    <ClusteringField field="WIMP*" compareFunction="absDiff"/>

    <ClusteringField field="SMS_CALLS*" compareFunction="absDiff"/>

    <ClusteringField field="SSD_MB*" compareFunction="absDiff"/>

    <ClusteringField field="MINUTTER*" compareFunction="absDiff"/>

    <ClusteringField field="Ikke PBS_PBS_FLAG*" compareFunction="absDiff"/>

    <ClusteringField field="PBS betaling_PBS_FLAG*" compareFunction="absDiff"/>

    <ClusteringField field="Foreningsaftaleholder_AGREEMENT_TYPE*" compareFunction="absDiff"/>

    <ClusteringField field="_AGREEMENT_TYPE*" compareFunction="absDiff"/>

    <ClusteringField field="_DEVICE_CATEGORY*" compareFunction="absDiff"/>

    <ClusteringField field="Feature phone_DEVICE_CATEGORY*" compareFunction="absDiff"/>

    <ClusteringField field="Smartphone_DEVICE_CATEGORY*" compareFunction="absDiff"/>

    <ClusteringField field="Entry Low_DEVICE_CATEGORY*" compareFunction="absDiff"/>

    <ClusteringField field="Entry High_DEVICE_CATEGORY*" compareFunction="absDiff"/>

    <ClusteringField field="Unknown_DEVICE_CATEGORY*" compareFunction="absDiff"/>

    <ClusteringField field="El-, gas- og fjernvarmeforsyning_OVERBRANCHE*" compareFunction="absDiff"/>

    <ClusteringField field="Private husholdninger med ansat medhjælp, husholdningers produktion af varer og tjenesteydelser til eget brug, i.a.n._OVERBRANCHE*" compareFunction="absDiff"/>

    <ClusteringField field="Bund50_DATAKALD_RANK10_binned" compareFunction="absDiff"/>

    <ClusteringField field="Top50_DATAKALD_RANK10_binned" compareFunction="absDiff"/>

    <Cluster name="cluster_0" size="1">

      <Array n="25" type="real">1.09936353E8 1.0920504761489662 -0.13657098566510845 -0.08871041220997422 0.9082287759009873 -0.0720947162535468 0.8862631402695447 -0.8862631402695447 0.0 1.4666454497176948 -1.274620567574196 0.0 -0.3795552369395683 -0.44879004404154327 -0.34698560762900704 -0.15785859445333428 -0.11502153268973915 -0.017327874182993024 0.0 0.0 -0.06725208779207874 -0.017327874182993024 0.0 0.0 1.0</Array>

    </Cluster>

    <Cluster name="cluster_1" size="786">

      <Array n="25" type="real">1.0241622963867685E9 -0.11816707638260911 -0.07015406511379416 -0.09669280607172925 -0.0521662059613372 3.4359484548460764E-4 -0.1773337232128674 0.1773337232128674 0.0 0.037131313624519324 -0.06957857983579316 0.0 -0.30286764421918966 0.28678311636337867 0.09251987701961893 0.13947076835359773 0.1315055638727917 -0.017327874182992923 0.0 0.0 0.008750186825534124 -0.017327874182992923 0.0 0.41475826972010177 0.5852417302798982</Array>

    </Cluster>

    <Cluster name="cluster_2" size="945">

      <Array n="25" type="real">1.1049297073597884E9 -0.18743113855257726 -0.03397868117632062 -0.08270569374616787 0.041033713258794305 -0.09513008545843013 -0.1049627604997198 0.1049627604997198 0.0 0.06849914634767855 -0.10891500911117245 0.0 0.12753234044864556 0.1460252893723146 -0.032674965232612346 0.02074877939506172 0.006143155750662331 -0.017327874182992857 0.0 0.0 -0.05144844021286006 -0.017327874182992857 0.0 0.4708994708994709 0.5291005291005291</Array>

    </Cluster>

    <Cluster name="cluster_3" size="4930">

      <Array n="25" type="real">1.3382572494432049E9 0.05454557758474083 0.017725663260999936 0.03128724879811668 2.6725153755483217E-4 0.018194718037712245 0.048212546040015 -0.048212546040015 0.0 -0.019347576319910128 0.03222881702404958 0.0 0.023917943584026657 -0.07362203608000215 -0.008416997076060357 -0.026181290438095115 -0.022120412546794022 0.006087602045626721 0.0 0.0 0.008480401875066384 0.006087602045625657 0.0 0.3594320486815416 0.6405679513184585</Array>

    </Cluster>


Q2: PMML writer or reader seems to "stumble"

In the same KNIME workspace as the above question, I have several PMML data prep nodes in succession and a number of derived variables that are built on previously derived variables. Initially the chain of nodes works fine: I am able to construct the model and export the PMML file. But when I subsequently try to import the PMML file back into KNIME in another workspace, the import fails. It seems that the PMML reader gets "confused" by the multiple derived variables. Is that a bug in the reader, and/or is there a way to get around it?

I am somewhat new to KNIME (but really like it! :)), so I hope that these questions are not ones that I should have been able to get answered quickly elsewhere - instead of bothering you with them! :). If someone can point me to some PMML+KNIME documentation then that would probably also help me in finding the answers...

Thank you very much!!

John Westberg

Hi John,

I would like to have a closer look at that problem. Could you provide us with a workflow that shows the clustering problem (only if it's not top secret, of course)? And can you also attach the PMML document that is causing the problems with the PMML reader?

Regards,

Alexander Fillbrunn

Hi Alexander,

Thank you very much for trying to answer my questions.

I have exported the knime project to here:
https://www.dropbox.com/s/aekfavnwhe1v0o0/KNIME_K-means_PMML_example.zip

And the PMML file is here:
https://www.dropbox.com/s/jnn2n4x6bv2o0h3/knime_kmeans2.pmml

The KNIME workflow is connected to a PMML decoding project that I am working on, so I deliberately created a rather complex PMML file to challenge my decoding algorithm. As you can see in the workflow, the chain of nodes works fine, but when I try to re-import the generated PMML file, it fails.

Thank you!!

/John Westberg

Hi John,

thank you for providing this information. The problem with the PMML that cannot be imported lies in your "Column Filter (PMML)" node. This node removes fields from the data dictionary, which is a list of all input fields used by your models in the PMML document. When you remove fields from this list, you have to make sure that they are not used subsequently in models or transformations. In your case, there are several transformations in the local transformations section of the k-means model that reference fields which were removed by the column filter. These "dangling" transformations are then invalid because they try to transform a field that, according to the data dictionary, does not exist. Hence the warning in the column filter node: "Transformation dictionary uses excluded column xxx".
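For anyone who wants to spot such "dangling" transformations in an exported file before re-importing it, here is a minimal Python sketch. It is not how the KNIME PMML reader validates documents; it only assumes that every element inside a DerivedField that carries a `field` attribute (for example NormContinuous, Discretize, FieldRef) must point at a DataField from the data dictionary or at a DerivedField defined earlier in the document. The function name `find_dangling_refs` and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace so the check works for any PMML version."""
    return tag.rsplit("}", 1)[-1]


def find_dangling_refs(pmml_text):
    """Return (derived_field, missing_field) pairs for unresolved references."""
    root = ET.fromstring(pmml_text)

    # Fields declared in the data dictionary.
    known = {el.get("name") for el in root.iter() if local(el.tag) == "DataField"}

    dangling = []
    # Walk DerivedFields in document order; each one may use data fields
    # or derived fields that were defined before it.
    for derived in root.iter():
        if local(derived.tag) != "DerivedField":
            continue
        for child in derived.iter():
            field = child.get("field")
            if field is not None and field not in known:
                dangling.append((derived.get("name"), field))
        known.add(derived.get("name"))
    return dangling


# Hypothetical document: "B" was filtered out of the data dictionary,
# but a transformation still refers to it.
SAMPLE = """<PMML xmlns="http://www.dmg.org/PMML-4_1" version="4.1">
  <DataDictionary>
    <DataField name="A" optype="continuous" dataType="double"/>
  </DataDictionary>
  <TransformationDictionary>
    <DerivedField name="A*" optype="continuous" dataType="double">
      <NormContinuous field="A">
        <LinearNorm orig="0" norm="0"/>
        <LinearNorm orig="1" norm="1"/>
      </NormContinuous>
    </DerivedField>
    <DerivedField name="B_binned" optype="categorical" dataType="string">
      <Discretize field="B"/>
    </DerivedField>
  </TransformationDictionary>
</PMML>"""

print(find_dangling_refs(SAMPLE))  # [('B_binned', 'B')]
```

The sketch treats the whole document as one scope, so it ignores the fact that local transformations belong to a single model; for a file with one model that simplification should not matter.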

As a solution for this problem, I would suggest you remove the column filter node and select the clustering fields you want to use directly in the k-means node.

I will have a look at the clustering fields problem tomorrow and let you know what I find out.

Regards,

Alexander

Hi John,

regarding the second problem, the clustering fields: this is a bug in KNIME. You probably configured the k-means node with 18 columns and then changed something in the preceding nodes, so that more columns are used for the clustering than the node's configuration knows about. As a workaround, you can open the configuration dialog of the k-means node, remove all columns and then include them again. This updates the configuration and the clustering fields in the PMML document.

Regards,

Alexander 

Hi Alexander!

Thank you so very much!! These were exactly the answers I was hoping for - because otherwise I would have misunderstood the PMML and/or the k-means model completely.

With regard to the PMML import problem (Q2), I removed the column filter node - and as you predicted - it worked.

Regarding the clustering fields (Q1), I am actually happy that it is a bug in KNIME - because it means that the rest of my PMML decoding work is fine, and that I can just write a bit of code that checks for this bug.
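A check of that kind could look roughly like the Python sketch below. It assumes a single clustering model per file and compares the number of ClusteringField elements (skipping any marked `isCenterField="false"`) with the number of values in each Cluster's Array. The function name `check_cluster_arrays` and the tiny sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace so the check works for any PMML version."""
    return tag.rsplit("}", 1)[-1]


def check_cluster_arrays(pmml_text):
    """Return (number of center fields, [(cluster name, array length), ...])
    where the list holds only clusters whose array length does not match."""
    root = ET.fromstring(pmml_text)

    # Count the clustering fields that contribute a center coordinate.
    n_fields = sum(
        1
        for el in root.iter()
        if local(el.tag) == "ClusteringField"
        and el.get("isCenterField", "true") != "false"
    )

    mismatches = []
    for cluster in root.iter():
        if local(cluster.tag) != "Cluster":
            continue
        for child in cluster:
            if local(child.tag) == "Array":
                values = (child.text or "").split()
                if len(values) != n_fields:
                    mismatches.append((cluster.get("name"), len(values)))
    return n_fields, mismatches


# Hypothetical model: two clustering fields, one center with three values.
SAMPLE = """<ClusteringModel>
  <ClusteringField field="X*" compareFunction="absDiff"/>
  <ClusteringField field="Y*" compareFunction="absDiff"/>
  <Cluster name="cluster_0">
    <Array n="3" type="real">0.1 0.2 0.3</Array>
  </Cluster>
  <Cluster name="cluster_1">
    <Array n="2" type="real">0.4 0.5</Array>
  </Cluster>
</ClusteringModel>"""

print(check_cluster_arrays(SAMPLE))  # (2, [('cluster_0', 3)])
```

For the model in my question this would report 18 fields and flag all four clusters with an array length of 25.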

I really appreciate your help - so please don't hesitate to write me, if I can be of assistance in any way!

Thanks again,

John Westberg

Hi,

you don't need any code to check for this bug. Just open the configuration dialog of the k-means node, remove all columns and then include them again, as I suggested above. The bug only affects the model if you change the input of the node without updating the configuration. In that case the columns KNIME uses for clustering and the columns it writes into the model are out of sync, and these steps bring them back in line.

Regards,

Alexander