Dear KNIMErs,
I find the KNIME vision really good (and not only since I did my Ph.D. in ETHZ),
and am trying to “convert” myself from another commercial environment which does not deliver such a
user friendly workflow environment and powerful integration with several state-of-the-science activities (e.g. R, …).
Nevertheless, I am facing now a task which is trivial in the “other” application, while seems to me impossible in Knime. I am alluding at the task of joining tables using a shared column not being a ROWID.
Just to make the example simple to understand, I got 2 tables:
TABLE1: Patient ID, Lab Measurement Type, Lab Value
TABLE2: Lab Measurement Type, Lab Measurement super group, Lab Measurement Type
TABLE2
Lab Measurement Type, Lab Measurement super group, Lab Measurement Type
GLUCOSE GROUP1 CHEAP
LDH GROUP2 COMMON
I would like to join them so that I can then get
Patient ID LabMeasType LabVal
0001 GLUCOSE 10 GROUP1 CHEAP
0001 LDH 12 GROUP2 COMMON
0002 GLUCOSE 11 GROUP1 CHEAP
If I try to do as adviced in the documentation, I have to use the Joiner preceeded by RowID.
However unfortunately when I connect RowID to the TABLE1 the system “complains” since there are multiple values. Of course if I check on uniqueness then I am going to loose patients…
I am thinking now to use an R external script in order to do this task as fastest “workaround”,
any other suggestion ?
It seems that a solution in R is really straightforward:
dataToMerge=data.frame(yourtablename2$PROBE_ID2, yourtablename2$SEARCH_KEY2)
mergedData=merge(yourtablename1,dataToMerge,by.x=SEARCH_KEY1,by.y=SEARCH_KEY
2,all.x=T,all.y=F)
found in
http://finzi.psych.upenn.edu/Rhelp08/2008-October/176161.html
I guess if you put the RowID node before the second table you will not have a problem with the duplication. Is it a correct assumption?
In case it is, then I would use the second table’s RowID as the first input, and the first table as second input. (And select Right outer join.) If neither first, neither second is unique you can try GroupBy one of them based on Lab Measurement Type column (keeping the other important columns’ contents in a List).
Unfortunately, I am not that familiar with Python and have very little time. I went as far as this
dts0=inData0.getDataTableSpec()
dts1=inData1.getDataTableSpec()
ic0=dts0.findColumnIndex(“Lab Measurement Type”)
ic1=dts1.findColumnIndex(“Lab Measurement Type”)
iterator0=inData0.iterator()
iterator1=inData1.iterator()
while iterator0.hasNext():
row=iterator0.next()
cell=row.getCell(ic0)
while iterator1.hasNext():
row1=iterator1.next()
cell1=row.getCell(ic1)
if (cell0==cell1) :
iterator2=row1.iterator()
while iterator2.hasNext():
newRow=AppendedColumnRow(row,iterator2.getCell())
outContainer.addRowToTable(newRow)
however it does not work and I have no much time available to be “clean” on this topic.
Therefore I will do the data transformation using “the commercial tool” and then import that data in Knime.
Perhaps some of you could help in cleaning up this script ?
Help appreciated !
Hi, I am not sure you have seen my previous comment (I think it should work, and easy and hard to do it wrong), although I see a little problem with your solution. You mentioned LabMeasType in your second table, and used Lab Measurement Type in your code. That might be a problem. Bests, gabor
Dear Gabor,
indeed I haven’t seen your previous comment and I thankyou for pointing my attention to it.
I will give it a try this morning before doing the external way…
I was wrong in my example: the column name doing the “join” in the 2 tables is called precisely with the same label.
Eventually, a Jpython code would be very appreciated as well
I wish you a nice day and thankyou for your help.
I give it a try later on and let you know.
Luca
Dear Gabor,
sorry but either I do not understand what you suggest or it does not work…
Here is what I did:
Created Node “File Reader” to read a file with the TABLE1 content.
Created Node “RowID"
Connect the output of “File Reader” Node with the input of “RowID” Node.
Configure the “RowID” Node as follows:
x Replace RowID with selected column values
New RowID column: [String]LabMeasType
When I now Execute the “RowID” Node, I get the error message
"Execute failed: error in row 5 Duplicate value …”
Well, the image is still not so good… So the lines should cross each other in this setup. (From RowID to first inport, from the File Reader (Table 1) to the second inport.)
Hi all,
Sorry for jumping into your interesting discussion. I am not sure if your approach is applicable if both table columns to be joint are not unique. Let me give an example where the first column in both tables should be joint:
Table1:
A 1 2
B 3 4
B 5 6
Table2:
A 3 4
A 4 2
B 3 5
Grouping Table2 leads to (where {} indicates a set of values):
A {3,4} {4,2}
B {3} {5}
Joining Table1 with Table2* leads to:
A 1 2 {3,4} {4,2}
B 3 4 {3} {5}
B 5 6 {3} {5}
I am not sure if this is the expected result which in my option would look like this:
A 1 2 3 4
A 1 2 4 2
B 3 4 3 5
B 5 6 3 5
In order to gain this result currently with KNIME, I could only imagine to use a flow using the looping mechanism. That means, you loop the values from the first table and filter the second table based on the current loop value. The remaining columns from the first table are joint onto the first table. This is a little bit tricky since you need to aggregate them first into one column, encapsulate them into a new variable which is appended to Table2. Split up the aggregated column at the end of the loop. Hope this becomes helpful…
Cheers, Thomas
Dear Thomas,
thankyou for your “jumping into” !
The second table column’s are unique, while the first table are duplicated.
Taking your example, assuming to join using shared key in column1
Table1:
A 1 2
B 3 4
B 5 6
Table2:
1 3 4
2 4 5
3 5 6
4 6 7
5 7 8
The result I would expect is
A 1 2 3 4
B 3 4 5 6
B 5 6 7 8
namely column1 and column2 from table 2 are added to table1
Perfect. As aborg already mentioned: You just need to to use the RowID node to transfer the first column from Table2 as row ID (if the column is unique the RowID node does not complaint!) and then connect this table to the second input of the Joiner node. An right outer join should deliver exactly the desired result. Good luck, Thomas
Gabriel, I guess you meant: first input in this sentence: You just need to to use the RowID node to transfer the first column from Table2 as row ID (if the column is unique the RowID node does not complaint!) and then connect this table to the second input of the Joiner node.
The solution you described for non-unique tables is really complicated, but I guess that is the easiest way currently.
Hi aborg, you are right. You have to use the second table on the first input of the Joiner node. Sorry and thanks! Yezz, using the looping behavior in the case is not that straight forward, but this is currently the only solution I see when non-unique columns are present.
Regards, Thomas
Dear Thomas and Aborg,
I think I have to go through the looping way, since the Table1 contains duplicated values in column1 (the value 1) and using the GroupBy messes up completely my data.
I realised that in my previous example I did not provide you with the duplicated row for Table1.
Here the example which has the appropriate row showing the problem.
Table1