Document Grabber - Read .gz

I’m new to KNIME and currently working on searching the PubMed database. My goal is to generate a file that contains the PubMed ID, all authors, and their corresponding affiliations.

With the EuropePMC node, I can perform the query and extract the XML data using the X-Path node in the required format. However, the EuropePMC database is not as reliable as PubMed. Therefore, I now want to search the PubMed database using the Document Grabber node, which works well for retrieving the data.

The issue arises when extracting the information using the Document Data Extractor node—I am unable to retrieve the affiliations and PubMed IDs. The downloaded .gz files contain XML data, including the ID and affiliations, but I cannot read them directly in KNIME.

Is there a workaround for this, or does anyone have tips on how to implement it? Any help is appreciated!

@SCordes welcome to the KNIME forum. Do you have problems decompressing the files or reading the XML?

Maybe you can provide an example.
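
If reading the XML is the sticking point, one option outside the standard nodes is a few lines of Python, for example in a Python Script node. This is only a minimal sketch. It assumes the content of the .gz files follows the usual PubMed layout (`PubmedArticleSet` / `PubmedArticle` / `MedlineCitation`), which I have not verified for the files the Document Grabber writes. The function name `extract_author_rows` and the file name in the usage line are made up for illustration.

```python
import gzip
import xml.etree.ElementTree as ET


def extract_author_rows(gz_path):
    """Return (pmid, author, affiliation) tuples from a gzipped PubMed XML file.

    Assumption: the file contains PubmedArticle elements in the standard
    PubMed structure. Authors without an affiliation get an empty string.
    """
    # gzip.open decompresses on the fly, so no manual extraction is needed.
    with gzip.open(gz_path, "rb") as handle:
        root = ET.parse(handle).getroot()

    rows = []
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID", default="")
        authors = article.iterfind("MedlineCitation/Article/AuthorList/Author")
        for author in authors:
            fore = author.findtext("ForeName", default="")
            last = author.findtext("LastName", default="")
            name = " ".join(part for part in (fore, last) if part)
            if not name:
                # Consortia and groups are listed under CollectiveName.
                name = author.findtext("CollectiveName", default="")
            affiliations = [
                "".join(aff.itertext()).strip()
                for aff in author.iterfind("AffiliationInfo/Affiliation")
            ]
            # One row per affiliation; a single row if there are none.
            for aff in affiliations or [""]:
                rows.append((pmid, name, aff))
    return rows
```

Calling `extract_author_rows("pubmed_result.xml.gz")` (hypothetical file name) would give one row per author and affiliation, which could then be turned into a table for the rest of the workflow.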

Thanks for the quick response.

I had problems reading the XML in the fetched .gz files. Initially, I was not able to read the compressed files directly, so I extracted them. However, the XML Reader could not read the extracted XML either.

Yesterday, I ran a comparison search in EuropePMC and PubMed. A significant portion of the literature in EuropePMC was not correctly indexed, so I reported the issue to EuropePMC, and they quickly implemented a fix. Today, I ran another comparison search and the IDs matched at 99.7%.
This is why I will use the EuropePMC node: it provides the XML directly and contains more metadata than the PubMed XML files.

I will test the nodes you suggested when I get the chance. However, my problem is currently solved.