Hi KNIME users!
I have two very specific questions about a PMML file for a k-means cluster model that I created from a KNIME project.
I really hope someone can help.
Both questions relate to the same PMML file, which was generated from KNIME for a k-means cluster model. The KNIME workspace also contains several PMML data prep nodes in succession.
Q1: Decoding the actual cluster model.
I have previously decoded other k-means models/PMMLs and always found that the number of ClusteringField elements matches the array size. In the one below, however, I have 18 ClusteringField elements but cluster center arrays of size n=25. Given this apparent mismatch, how can I construct the cluster distance function?
<ClusteringField field="WIMP*" compareFunction="absDiff"/>
<ClusteringField field="SMS_CALLS*" compareFunction="absDiff"/>
<ClusteringField field="SSD_MB*" compareFunction="absDiff"/>
<ClusteringField field="MINUTTER*" compareFunction="absDiff"/>
<ClusteringField field="Ikke PBS_PBS_FLAG*" compareFunction="absDiff"/>
<ClusteringField field="PBS betaling_PBS_FLAG*" compareFunction="absDiff"/>
<ClusteringField field="Foreningsaftaleholder_AGREEMENT_TYPE*" compareFunction="absDiff"/>
<ClusteringField field="_AGREEMENT_TYPE*" compareFunction="absDiff"/>
<ClusteringField field="_DEVICE_CATEGORY*" compareFunction="absDiff"/>
<ClusteringField field="Feature phone_DEVICE_CATEGORY*" compareFunction="absDiff"/>
<ClusteringField field="Smartphone_DEVICE_CATEGORY*" compareFunction="absDiff"/>
<ClusteringField field="Entry Low_DEVICE_CATEGORY*" compareFunction="absDiff"/>
<ClusteringField field="Entry High_DEVICE_CATEGORY*" compareFunction="absDiff"/>
<ClusteringField field="Unknown_DEVICE_CATEGORY*" compareFunction="absDiff"/>
<ClusteringField field="El-, gas- og fjernvarmeforsyning_OVERBRANCHE*" compareFunction="absDiff"/>
<ClusteringField field="Private husholdninger med ansat medhjælp, husholdningers produktion af varer og tjenesteydelser til eget brug, i.a.n._OVERBRANCHE*" compareFunction="absDiff"/>
<ClusteringField field="Bund50_DATAKALD_RANK10_binned" compareFunction="absDiff"/>
<ClusteringField field="Top50_DATAKALD_RANK10_binned" compareFunction="absDiff"/>
<Cluster name="cluster_0" size="1">
<Array n="25" type="real">1.09936353E8 1.0920504761489662 -0.13657098566510845 -0.08871041220997422 0.9082287759009873 -0.0720947162535468 0.8862631402695447 -0.8862631402695447 0.0 1.4666454497176948 -1.274620567574196 0.0 -0.3795552369395683 -0.44879004404154327 -0.34698560762900704 -0.15785859445333428 -0.11502153268973915 -0.017327874182993024 0.0 0.0 -0.06725208779207874 -0.017327874182993024 0.0 0.0 1.0</Array>
</Cluster>
<Cluster name="cluster_1" size="786">
<Array n="25" type="real">1.0241622963867685E9 -0.11816707638260911 -0.07015406511379416 -0.09669280607172925 -0.0521662059613372 3.4359484548460764E-4 -0.1773337232128674 0.1773337232128674 0.0 0.037131313624519324 -0.06957857983579316 0.0 -0.30286764421918966 0.28678311636337867 0.09251987701961893 0.13947076835359773 0.1315055638727917 -0.017327874182992923 0.0 0.0 0.008750186825534124 -0.017327874182992923 0.0 0.41475826972010177 0.5852417302798982</Array>
</Cluster>
<Cluster name="cluster_2" size="945">
<Array n="25" type="real">1.1049297073597884E9 -0.18743113855257726 -0.03397868117632062 -0.08270569374616787 0.041033713258794305 -0.09513008545843013 -0.1049627604997198 0.1049627604997198 0.0 0.06849914634767855 -0.10891500911117245 0.0 0.12753234044864556 0.1460252893723146 -0.032674965232612346 0.02074877939506172 0.006143155750662331 -0.017327874182992857 0.0 0.0 -0.05144844021286006 -0.017327874182992857 0.0 0.4708994708994709 0.5291005291005291</Array>
</Cluster>
<Cluster name="cluster_3" size="4930">
<Array n="25" type="real">1.3382572494432049E9 0.05454557758474083 0.017725663260999936 0.03128724879811668 2.6725153755483217E-4 0.018194718037712245 0.048212546040015 -0.048212546040015 0.0 -0.019347576319910128 0.03222881702404958 0.0 0.023917943584026657 -0.07362203608000215 -0.008416997076060357 -0.026181290438095115 -0.022120412546794022 0.006087602045626721 0.0 0.0 0.008480401875066384 0.006087602045625657 0.0 0.3594320486815416 0.6405679513184585</Array>
</Cluster>
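The dimension check described in Q1 can be sketched in Python as follows. This is only a minimal illustration, not KNIME's own logic: the toy fragment and its values, the helper names (`read_model`, `check_dimensions`, `squared_euclidean`), and the choice of squared Euclidean distance with unit field weights are all assumptions, since the excerpt above does not show the model's ComparisonMeasure element.

```python
import xml.etree.ElementTree as ET

# Toy fragment with hypothetical values, shaped like the excerpt above:
# two ClusteringField elements, but three values per cluster center.
PMML_FRAGMENT = """
<ClusteringModel>
  <ClusteringField field="A" compareFunction="absDiff"/>
  <ClusteringField field="B" compareFunction="absDiff"/>
  <Cluster name="cluster_0"><Array n="3" type="real">10.0 1.0 2.0</Array></Cluster>
  <Cluster name="cluster_1"><Array n="3" type="real">20.0 4.0 6.0</Array></Cluster>
</ClusteringModel>
"""


def local(tag):
    """Strip an XML namespace prefix such as '{http://...}Cluster'."""
    return tag.rsplit("}", 1)[-1]


def read_model(xml_text):
    """Return (field names, {cluster name: center vector})."""
    root = ET.fromstring(xml_text)
    fields, centers = [], {}
    for el in root.iter():
        name = local(el.tag)
        if name == "ClusteringField":
            fields.append(el.get("field"))
        elif name == "Cluster":
            arr = next(c for c in el if local(c.tag) == "Array")
            centers[el.get("name")] = [float(v) for v in arr.text.split()]
    return fields, centers


def check_dimensions(fields, centers):
    """Map each cluster whose center length differs from the field count to that length."""
    return {n: len(v) for n, v in centers.items() if len(v) != len(fields)}


def squared_euclidean(x, center):
    """Sum of squared absDiff terms, assuming unit field weights."""
    if len(x) != len(center):
        raise ValueError("record and center have different lengths")
    return sum(abs(a - b) ** 2 for a, b in zip(x, center))


fields, centers = read_model(PMML_FRAGMENT)
mismatches = check_dimensions(fields, centers)
print(len(fields), "fields; mismatching clusters:", mismatches)
```

When the lengths agree, `squared_euclidean` gives the distance from a record to one center. When they do not, as with 18 fields against 25 center values, it raises an error rather than guessing which center entries belong to which field.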
Q2: PMML writer or reader seems to "stumble"
In the same KNIME workspace as in the question above, I have several PMML data prep nodes in succession, and a number of derived variables that are built on previously derived variables. Initially the chain of nodes works fine: I am able to build the model and export the PMML file without problems. But when I subsequently try to import the PMML file back into KNIME in another workspace, the import fails. It seems that the PMML reader gets "confused" by the multiple derived variables. Is that a bug in the reader, and/or is there a way to get around it?
I am somewhat new to KNIME (but really like it! :)), so I hope these are not questions that I should have been able to get answered quickly elsewhere instead of bothering you with them! :) If someone can point me to some PMML+KNIME documentation, that would probably also help me find the answers.
Thank you very much!!
John Westberg