Create a new column based on flow variable or input excel?

paramasi · August 22, 2022, 12:56pm

Hi Everyone,
I am creating an automated PubMed search for multiple topics. I built it based on a few earlier workflows posted in this forum. Everything works fine, except that I want to create a new column that annotates the topic for each query. For example, in the first image I have the search terms and topics. In the second image, I need a topics column for each query. Does anyone have a clue how to do it? I also uploaded the workflow in case it makes things easier to understand. Thank you.

Pubmed.knwf (121.9 KB)

ArjenEX · August 22, 2022, 1:22pm

Hi @paramasi ,

You can quickly achieve this with the Variable to Table Column node. You already have the topic as a flow variable because of your loop start configuration.

The topic per loop iteration is therefore already known.

Add the mentioned node, for example between the doc grabber and the loop end, and select the topic variable so that it is (force) included.

This should give you the desired output, if I understood it correctly.
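
For readers who think in code, the effect of placing Variable to Table Column inside the loop can be sketched in plain Python. This is only an analogy, not KNIME code: `fetch_documents` and `fake_fetch` are hypothetical stand-ins for the document grabber, and the column names `search_term`, `topic` and `title` are assumptions.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def annotate_with_topic(
    queries: List[Row],
    fetch_documents: Callable[[str], List[Row]],
) -> List[Row]:
    """Run every query and stamp each result row with that query's topic."""
    collected: List[Row] = []
    for query in queries:  # one pass per loop iteration
        topic = query["topic"]  # plays the role of the flow variable
        for doc in fetch_documents(query["search_term"]):
            # Same effect as Variable to Table Column: a constant-valued column
            collected.append({**doc, "topic": topic})
    return collected  # the loop end concatenates the rows of all iterations


def fake_fetch(term: str) -> List[Row]:
    """Hypothetical stand-in for the document grabber; returns made-up rows."""
    return [{"title": f"Paper about {term} #{i}"} for i in (1, 2)]


queries = [
    {"search_term": "(Covid) AND (biology)", "topic": "Covid biology"},
    {"search_term": "(Covid) AND (vaccine)", "topic": "Covid vaccine"},
]
rows = annotate_with_topic(queries, fake_fetch)
```

Every row in `rows` carries the topic of the iteration that produced it, which is what the extra column adds in the workflow.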

Hope this helps!

aworker · August 22, 2022, 1:25pm

Hi @Paramasi

Please find below your workflow with the required modifications to get it to work.

I also added a generic way of getting the current workflow's temp1 directory, “knime://knime.workflow/data/temp1”, so that temp files are stored locally. This makes the workflow usable by everybody:

Pubmed (v2).knwf (242.8 KB)

Hope it helps.

Best
Ael

paramasi · August 22, 2022, 2:11pm

Hi @aworker and @ArjenEX,

Brilliant!!! Thank you both for explaining how to achieve it and for editing the workflow. I appreciate it very much. I am learning a lot every day from this forum.

Best,
Prasath

aworker · August 22, 2022, 2:18pm

Thanks @paramasi for your kind comments and glad to help

Best wishes,
Ael

paramasi · August 22, 2022, 3:16pm

Hi @aworker @ArjenEX and all,
It seems there is a problem with the workflow when I search for a narrow date range. For the search term (Covid) AND (biology), I get 4 hits on the PubMed website, but the workflow returns an empty table. Unfortunately, I could not identify where this problem is coming from. Could any of you spot the issue?

ArjenEX · August 22, 2022, 4:08pm

It’s on the left hand side under Publication date, meaning from yesterday and today.

@paramasi
As to why you are getting different results, it unfortunately appears to be a long-standing bug.
See:

github.com/dami82/easyPubMed

Number of records retrieved with batch_pubmed_download() don't match the site

opened 09:56PM - 26 Jun 20 UTC

JFormoso

Hi! I can't figure out what I am doing wrong. Pubmed shows 66 records and the df… resulting from this code returns 27. I've tried altering the sintax of the string and I always get the same result. If anyone can point me in the right direction... Thanks!

busqueda <- '((inference) AND (verbal ability)) AND (comprehension)'
output <- batch_pubmed_download(pubmed_query_string = busqueda, dest_file_prefix = "NUBL_18_", encoding = "ASCII")
archivo <- output[[1]]
base <- table_articles_byAuth(pubmed_data = archivo, included_authors = "first", max_chars = -1, encoding = "ASCII")

I also played around with it, but the only other noteworthy thing I found is that the website generates a slightly different query when I search for the same terms and date range: (Covid) AND (biology) AND (2022/8/20:2022/8/22[pdat]). It uses pdat instead of dp like you have, but ultimately that didn’t make a difference.
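
To make it easier to try both date tags, the query string can be assembled programmatically. The following is a minimal Python sketch that only builds the text of the query in the same form as the one quoted from the website. The function name `build_pubmed_query` and its parameters are made up for illustration, and nothing here contacts PubMed.

```python
from datetime import date
from typing import Iterable


def build_pubmed_query(
    terms: Iterable[str], start: date, end: date, date_tag: str = "pdat"
) -> str:
    """Combine search terms and a publication date range into one query string."""
    if start > end:
        raise ValueError("start date must not be after end date")

    def fmt(d: date) -> str:
        # Year/month/day without zero padding, matching the quoted query
        return f"{d.year}/{d.month}/{d.day}"

    term_part = " AND ".join(f"({t})" for t in terms)
    return f"{term_part} AND ({fmt(start)}:{fmt(end)}[{date_tag}])"


query = build_pubmed_query(["Covid", "biology"], date(2022, 8, 20), date(2022, 8, 22))
query_dp = build_pubmed_query(
    ["Covid", "biology"], date(2022, 8, 20), date(2022, 8, 22), date_tag="dp"
)
```

Here `query` reproduces the string shown on the website, and `query_dp` is the same search with the dp tag, so the two variants can be compared side by side.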

paramasi · August 22, 2022, 4:32pm

Hi @aworker I was using the date range from 21.08.2022 to 22.08.2022.
As @ArjenEX mentioned above, there seems to be a bug that may be causing this issue. Thanks for bringing it to my attention!
When I played around with a wider date range, I noticed that the most recent articles are not listed by the document grabber, but the rest match a direct search on the PubMed website perfectly.

I updated the workflow with a few more options to visualize the results (attached).

Is there any option other than the document grabber that would solve the issue with the search results? I was trying to use the ‘European PubMed Central Advanced Search’ node to replace the document grabber, but the result is in XML and I do not know how to extract the information from it. It seems more work and help are needed.

Pubmed-V1.knwf (319.5 KB)

ArjenEX · August 22, 2022, 6:28pm

@paramasi

You won’t get this done overnight, I’m afraid. The result you’re currently getting out of the node contains the id and pmcid of each document. You then need to use these values as input for a subsequent query.

See their documentation:

Getting these values is not an issue: you can use the XPath node to extract information from XML.

Then create the required URL for the fullTextXML request.

The GET Request node then returns the full XML, which you can query further.
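
Outside KNIME, the first two steps can be sketched in Python as follows. Treat everything specific here as an assumption to check against the Europe PMC documentation: the base URL, the `resultList/result` layout, and the `id` and `pmcid` element names. The sample response is made up and heavily trimmed, and the sketch only builds the URLs without sending any request.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Assumed base URL of the Europe PMC REST service; verify before use.
BASE_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest"

# Made-up, trimmed search response used only for illustration.
SAMPLE_RESPONSE = """
<responseWrapper>
  <resultList>
    <result><id>11111111</id><pmcid>PMC0000001</pmcid></result>
    <result><id>22222222</id></result>
    <result><id>33333333</id><pmcid>PMC0000003</pmcid></result>
  </resultList>
</responseWrapper>
"""


def full_text_urls(search_xml: str) -> Dict[str, str]:
    """Map each result id to a fullTextXML URL, skipping results without a pmcid."""
    root = ET.fromstring(search_xml.strip())
    urls: Dict[str, str] = {}
    # Path expression comparable to what the XPath node would be configured with
    for result in root.findall("./resultList/result"):
        doc_id = result.findtext("id")
        pmcid = result.findtext("pmcid")
        if doc_id and pmcid:  # no pmcid means there is no URL to build
            urls[doc_id] = f"{BASE_URL}/{pmcid}/fullTextXML"
    return urls


urls = full_text_urls(SAMPLE_RESPONSE)
```

Each value in `urls` is what would be passed to the GET Request step. Many of these requests can come back with a 404, so any real fetch should handle failed responses.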

Notes:

The XML that it returns is very poorly built, which makes it a pain in the rear to extract data properly.
It’s a very slow process due to the size of each XML document; it will take ages to process all the results.
The majority of the search results are not even accessible and will give you a 404 error.
The structure of the XML documents also differs, so getting the actual title, authors, etc. has to be done dynamically. It could become a complex matter, even more so if you don’t have a lot of XML experience.

Honestly, I would take this route as a last resort. You might even be better off with a web scraper approach, for example with the Selenium nodes.

The EU website is somewhat structured, so the different sections are pretty easily recognizable.

But this also comes with its disadvantages, like having to account for pagination.

What you could also try is approaching the great people of @Vernalis, who made the PubMed Central Advanced Search node. I don’t see any public examples on the KNIME Hub, but maybe they have some reference material on how it can be used effectively.

paramasi · August 23, 2022, 7:51am

Hi @ArjenEX, thank you so much for your very detailed answers. You are absolutely right, it does seem complicated. For the moment, I will use the current workflow and keep an eye out for alternative options.
Thanks again to both @ArjenEX and @aworker for your support.

Cheers,
P

Vernalis · August 23, 2022, 1:06pm

Unfortunately, we don’t have any specific examples for the node, but I think it should be reasonably obvious how to use it. Most of the query can be copied from the ePMC website or your input table into the ‘General Query’ box:

e.g. (Covid) AND (Biology) AND 2022/08/10:2022/08/22[dp]

I hope that helps. You will still need an XPath node to parse the resulting XML.

Steve

system · November 21, 2022, 1:07pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.