I’m hoping that someone can help me with a data problem I just can’t quick crack. I’m trying to match titles based on lists of ISBNs. What I need to do is find a transitive relationship among the titles. For example, if Title A matches with Title B due to a shared ISBN, and Title B matches with Title C due to another shared ISBN, then A should be associated with both B and C (even if A and C don’t directly share an ISBN).
As an example, let’s say we have 3 ISBNs with their associated ID numbers:
How can I cluster all of these IDs together, seeing that the first ISBN matches the second by virtue of ID 2878, and the second ISBN is connected to the third by virtue of ID 2879.
@Cosmo72 Can I assume that your list may be “infinitely” long and the associated ISBNs are not physically adjacent to each other? Is it possible to provide a sample dataset?
Thanks for helping out! You are correct that the list may be infinitely long (generally up to about 15,000 lines) and the ISBNs are sometimes, but not always in rows that are adjacent to each other. I will attach a test file. The particular problem I am trying to solve is illustrated in lines 125, 126, 127. In these entries for the book " A House Divided?", you will see that line 125 has the ISBNs 0919666132, 9780919666139, line 126 has the ISBN 0919666132 and line 127 has the ISBN 9780919666139. I need a solution that would capture all of these titles as the same book.
Tnx for sharing a sample file. This is not straight forward. Do you have any idea what the output should look like? You can e.g. create a network graph, to see how isbn and the book_id’s are related. I have not enough knowledge about the configuration of the KNIME Network Nodes. But maybe its is possible to extract the cluster created in these graphs. isbn_network.knwf (126.0 KB)
Wow, that is really cool! I did not know that could be accomplished in KNIME. I will remember it for other projects. But for this one, the output I am looking for is simply a condensed list of titles, with a new column containing the IDs which were included in the collapsed title.
I think this workflow does what you want. The titles were pretty dirty and needed some preprocessing to make them uniform. I haven’t spent much time QAing the workflow or splitting the output into separate columns. I thought I’d get your comments first.
I made a few changes. A few of the titles had the “remainder” included in the main title. I removed those so the grouping would be consistent. I also added an entry counter so you can see how many times a title appears in the original list.
Thanks for taking the time to work on this. While some aspects of what you present here will be useful for my workflow (and future workflows), it doesn’t address some of the data problems I am trying to overcome. For example, I cannot simply group on title alone, because there are often multiple editions of these titles and I need to treat each edition as a distinct entity. Also, titles are often not entered exactly the same, so a group on title will inevitably miss some that have been entered in a non-standard way. With no field being a good “groupby” candidate, I was hoping to find a solution that was based on the ISBN numbers themselves.
I will keep working on it, but would like to thank you for your generosity in trying to find a solution!
A question: Are the ISBN numbers manually input into your database? If so, they’re as likely to have errors as the book titles.
An obsvervation: Some of your entries have no ISBNs and the secondary identifiers don’t match. See the example below. Unless these are truly different books with the same title, you’ll mever match them with ISBNs alone.
Yes, the ISBNs are enter by humans, and yes they sometimes contain mistakes. You are correct in observing that ISBN matching will not be able to match all the titles. This is an only component of a much larger workflow that takes into consideration other match keys that are generated from the data (i.e. title, author date, etc.). The matching is done in sequences, each time collapsing the list, until no more matches can be made.