Any way to extract the "Keywords" section of scientific publications using Document Data Extractor or any other nodes?

I queried PubMed using the Document Grabber node, entering a PMCID in the query field. Besides extracting things like the title and abstract, I would also like to extract the keywords section that is often found in scientific publications. Is this possible? Any ideas would be greatly appreciated!

Hey zhuma,

extracting the keywords section is currently not possible by using the Document Grabber node, but I will create a ticket for this. However, I can’t give you a date when this will be implemented.

Another way might be to use the XPath node to parse the PubMed XML pages and pull out the keywords section. However, you would first have to get the URLs of the articles you want to parse by searching PubMed via REST.
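To illustrate the parsing step outside of KNIME, here is a minimal Python sketch. It assumes the article XML follows the JATS layout, where keywords sit in `<kwd>` elements inside a `<kwd-group>`; the sample document and the function name `extract_keywords` are made up for illustration and are not real PubMed output. In the XPath node, a query along the lines of `//kwd-group/kwd` would be the rough equivalent, again assuming that layout.

```python
import xml.etree.ElementTree as ET

# Made-up sample shaped like a JATS article record (assumption, not real PubMed output).
SAMPLE_XML = """<article>
  <front>
    <article-meta>
      <title-group><article-title>Example title</article-title></title-group>
      <kwd-group>
        <kwd>text mining</kwd>
        <kwd><italic>in silico</italic> screening</kwd>
      </kwd-group>
    </article-meta>
  </front>
</article>"""


def extract_keywords(xml_text):
    """Return the text of every <kwd> element, in document order."""
    root = ET.fromstring(xml_text)
    keywords = []
    for kwd in root.iter("kwd"):
        # itertext() also picks up text inside nested markup such as <italic>;
        # split/join collapses any stray whitespace.
        text = " ".join("".join(kwd.itertext()).split())
        if text:
            keywords.append(text)
    return keywords


print(extract_keywords(SAMPLE_XML))
```

Articles without a keywords section simply yield an empty list. If the XML you get back declares a default namespace, the plain tag name `kwd` will not match and the namespace has to be included in the lookup.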

I hope this helps,
Julian


Thanks, Julian! I tried your suggestion, but many of the articles I am querying return a ? in the body column, and I’m not sure why. The URLs seem fine. KNIME_project4.knwf (778.7 KB) Any insights you can provide as to why this may be happening would be greatly appreciated!

Hello @zhuma -

If you look at the status codes in the output of the GET Request node, you will see that almost all of them are 429. This is a “Too Many Requests” response from the server. You could try setting a delay in the GET Request configuration dialog; perhaps that will help.

EDIT: I just tried it myself using a 1000 ms delay on 20 URLs, and it seemed to work fine. You might be able to reduce the delay further.
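For anyone doing the same thing in a script rather than in the GET Request node, the idea of spacing out requests can be sketched in Python as below. This is only a sketch under assumptions: the names `http_get` and `fetch_with_delay` are made up, the 1-second default just mirrors the delay tried above, and retrying a 429 a couple of times is an addition of this example rather than something the node is known to do.

```python
import time
import urllib.error
import urllib.request


def http_get(url, timeout=30):
    """Return (status_code, body_text); HTTP errors are returned as a status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        return err.code, ""


def fetch_with_delay(urls, get=http_get, delay_s=1.0, max_retries=2, sleep=time.sleep):
    """Fetch URLs one at a time, pausing delay_s seconds between requests.

    A 429 response is retried up to max_retries times, with the same pause
    before each retry (an assumption of this sketch).
    """
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_s)  # space out consecutive requests
        status, body = get(url)
        attempts = 0
        while status == 429 and attempts < max_retries:
            sleep(delay_s)
            status, body = get(url)
            attempts += 1
        results[url] = (status, body)
    return results
```

The `get` and `sleep` parameters are injectable so the pacing logic can be tried out without touching the network, for example by passing a fake `get` that returns 429 once and then 200.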


Thank you, Scott! This worked perfectly!!

A couple of URLs still returned a ? even in the status code column, though. Any ideas what this could be? I tried a 500 ms delay, so I will try increasing it to see if that helps further.

My guess is that those particular PMCIDs are malformed, or don’t correspond to anything in the database. But that’s just speculation. It’s definitely odd that the status code is returned missing.

Thanks for all your help, Scott! Increasing the time delay to 1000ms worked on all but one URL, so that’s awesome!
