Joining tables, using a shared column, not RowID.

lucatoldo · August 24, 2009, 3:36pm

Dear KNIMErs,
I find the KNIME vision really good (and not only since I did my Ph.D. in ETHZ),
and am trying to “convert” myself from another commercial environment which does not deliver such a
user friendly workflow environment and powerful integration with several state-of-the-science activities (e.g. R, …).
Nevertheless, I am facing now a task which is trivial in the “other” application, while seems to me impossible in Knime. I am alluding at the task of joining tables using a shared column not being a ROWID.

Just to make the example simple to understand, I got 2 tables:
TABLE1: Patient ID, Lab Measurement Type, Lab Value
TABLE2: Lab Measurement Type, Lab Measurement super group, Lab Measurement Type

For example

TABLE1
Patient ID LabMeasType LabVal
0001 GLUCOSE 10
0001 LDH 12
0002 GLUCOSE 11

TABLE2
Lab Measurement Type, Lab Measurement super group, Lab Measurement Type
GLUCOSE GROUP1 CHEAP
LDH GROUP2 COMMON

I would like to join them so that I can then get

Patient ID LabMeasType LabVal
0001 GLUCOSE 10 GROUP1 CHEAP
0001 LDH 12 GROUP2 COMMON
0002 GLUCOSE 11 GROUP1 CHEAP

If I try to do as adviced in the documentation, I have to use the Joiner preceeded by RowID.
However unfortunately when I connect RowID to the TABLE1 the system “complains” since there are multiple values. Of course if I check on uniqueness then I am going to loose patients…

I am thinking now to use an R external script in order to do this task as fastest “workaround”,
any other suggestion ?

Your advice is really appreciated.

lucatoldo · August 24, 2009, 3:47pm

It seems that a solution in R is really straightforward:


dataToMerge=data.frame(yourtablename2$PROBE_ID2, yourtablename2$SEARCH_KEY2)
mergedData=merge(yourtablename1,dataToMerge,by.x=SEARCH_KEY1,by.y=SEARCH_KEY
2,all.x=T,all.y=F)

found in http://finzi.psych.upenn.edu/Rhelp08/2008-October/176161.html

aborg · August 24, 2009, 3:49pm

I guess if you put the RowID node before the second table you will not have a problem with the duplication. Is it a correct assumption?
In case it is, then I would use the second table’s RowID as the first input, and the first table as second input. (And select Right outer join.) If neither first, neither second is unique you can try GroupBy one of them based on Lab Measurement Type column (keeping the other important columns’ contents in a List).

lucatoldo · August 24, 2009, 3:59pm

Unfortunately, there is no 2:1 metanode available for R… therefore I will have to eyploit the JPython Script 2:1 node for this purpose !

lucatoldo · August 24, 2009, 4:02pm

Indeed, seems that the same way was also seeked by others …: http://www.knime.org/node/531

lucatoldo · August 24, 2009, 4:58pm

Unfortunately, I am not that familiar with Python and have very little time. I went as far as this
dts0=inData0.getDataTableSpec() dts1=inData1.getDataTableSpec() ic0=dts0.findColumnIndex(“Lab Measurement Type”) ic1=dts1.findColumnIndex(“Lab Measurement Type”) iterator0=inData0.iterator() iterator1=inData1.iterator() while iterator0.hasNext(): row=iterator0.next() cell=row.getCell(ic0) while iterator1.hasNext(): row1=iterator1.next() cell1=row.getCell(ic1) if (cell0==cell1) : iterator2=row1.iterator() while iterator2.hasNext(): newRow=AppendedColumnRow(row,iterator2.getCell()) outContainer.addRowToTable(newRow)
however it does not work and I have no much time available to be “clean” on this topic.
Therefore I will do the data transformation using “the commercial tool” and then import that data in Knime.
Perhaps some of you could help in cleaning up this script ?
Help appreciated !

aborg · August 24, 2009, 5:15pm

Hi, I am not sure you have seen my previous comment (I think it should work, and easy and hard to do it wrong), although I see a little problem with your solution. You mentioned LabMeasType in your second table, and used Lab Measurement Type in your code. That might be a problem. Bests, gabor

lucatoldo · August 25, 2009, 8:04am

Dear Gabor,
indeed I haven’t seen your previous comment and I thankyou for pointing my attention to it.
I will give it a try this morning before doing the external way…
I was wrong in my example: the column name doing the “join” in the 2 tables is called precisely with the same label.
Eventually, a Jpython code would be very appreciated as well
I wish you a nice day and thankyou for your help.
I give it a try later on and let you know.
Luca

lucatoldo · August 25, 2009, 11:38am

Dear Gabor,
sorry but either I do not understand what you suggest or it does not work…
Here is what I did:
Created Node “File Reader” to read a file with the TABLE1 content.
Created Node “RowID"
Connect the output of “File Reader” Node with the input of “RowID” Node.
Configure the “RowID” Node as follows:
x Replace RowID with selected column values
New RowID column: [String]LabMeasType
When I now Execute the “RowID” Node, I get the error message
"Execute failed: error in row 5 Duplicate value …”

Am I doing it in the way you suggested ?

aborg · August 25, 2009, 12:08pm

I suggested the RowID for the second table. (And if there the values are not unique, then you should GroupBy your values.)

aborg · August 25, 2009, 12:39pm

Sorry, my English is not perfect, I will try to draw the setup I usually use:

File Reader (Table 1) --------------------Joiner File Reader (Table 2) (⇒ GroupBy) ⇒ RowID/
RowID: replace by Lab Measurement Type
Joiner: column from second table (from table 1!) by LabMeasureType, Join mode: Right outer join

(Is it intentional that you have two Lab Measurement Type columns in second table?)

aborg · August 25, 2009, 12:43pm

Well, the image is still not so good… So the lines should cross each other in this setup. (From RowID to first inport, from the File Reader (Table 1) to the second inport.)

lucatoldo · August 25, 2009, 12:57pm

Dear Aborg,
thankyou for your efforts ! I’ll try to make it … would be nice if this Forum would have the possibility to upload images

gabriel · August 25, 2009, 1:27pm

Hi all,
Sorry for jumping into your interesting discussion. I am not sure if your approach is applicable if both table columns to be joint are not unique. Let me give an example where the first column in both tables should be joint:

Table1:
A 1 2
B 3 4
B 5 6

Table2:
A 3 4
A 4 2
B 3 5

Grouping Table2 leads to (where {} indicates a set of values):
A {3,4} {4,2}
B {3} {5}

Joining Table1 with Table2* leads to:
A 1 2 {3,4} {4,2}
B 3 4 {3} {5}
B 5 6 {3} {5}

I am not sure if this is the expected result which in my option would look like this:
A 1 2 3 4
A 1 2 4 2
B 3 4 3 5
B 5 6 3 5

In order to gain this result currently with KNIME, I could only imagine to use a flow using the looping mechanism. That means, you loop the values from the first table and filter the second table based on the current loop value. The remaining columns from the first table are joint onto the first table. This is a little bit tricky since you need to aggregate them first into one column, encapsulate them into a new variable which is appended to Table2. Split up the aggregated column at the end of the loop. Hope this becomes helpful…
Cheers, Thomas

lucatoldo · August 25, 2009, 2:06pm

Dear Thomas,
thankyou for your “jumping into” !
The second table column’s are unique, while the first table are duplicated.
Taking your example, assuming to join using shared key in column1
Table1:
A 1 2 B 3 4 B 5 6
Table2:
1 3 4 2 4 5 3 5 6 4 6 7 5 7 8
The result I would expect is
A 1 2 3 4 B 3 4 5 6 B 5 6 7 8
namely column1 and column2 from table 2 are added to table1

gabriel · August 25, 2009, 2:48pm

Perfect. As aborg already mentioned: You just need to to use the RowID node to transfer the first column from Table2 as row ID (if the column is unique the RowID node does not complaint!) and then connect this table to the second input of the Joiner node. An right outer join should deliver exactly the desired result. Good luck, Thomas

aborg · August 25, 2009, 3:00pm

Gabriel, I guess you meant: first input in this sentence: You just need to to use the RowID node to transfer the first column from Table2 as row ID (if the column is unique the RowID node does not complaint!) and then connect this table to the second input of the Joiner node.

The solution you described for non-unique tables is really complicated, but I guess that is the easiest way currently.

gabriel · August 25, 2009, 3:18pm

Hi aborg, you are right. You have to use the second table on the first input of the Joiner node. Sorry and thanks! Yezz, using the looping behavior in the case is not that straight forward, but this is currently the only solution I see when non-unique columns are present.
Regards, Thomas

lucatoldo · August 25, 2009, 5:19pm

Dear Thomas and Aborg, thankyou so much for your help.
I will try to do what you described and feedback you.
thankyou again
luca

lucatoldo · August 26, 2009, 10:39am

Dear Thomas and Aborg,
I think I have to go through the looping way, since the Table1 contains duplicated values in column1 (the value 1) and using the GroupBy messes up completely my data.
I realised that in my previous example I did not provide you with the duplicated row for Table1.
Here the example which has the appropriate row showing the problem.
Table1

Table2:

The result I need is