Skipping corrupt URLs in File Reader loop

jdwill · August 7, 2020, 1:59pm

Greetings KNIME community,

I have (I hope) a relatively simple problem, but I can’t seem to figure out how to get around it…

I am a chemist working on a big data-related project. The ZINC database provides a useful cache of molecules to use with molecular modeling and other related tasks. You can download these for free, but the download process is a bit laborious. One must download a large number of “tranches”, each containing a few hundred or a few thousand compounds. I would like to automate this process, so I constructed a simple workflow to accomplish this.

The first File Reader takes the list of URL addresses generated by the ZINC database and passes them to the variable loop. The loop contains its own reader that then uses the URL list to read the data and compile it into one list. The filters just get rid of some unwanted formatting; the writer then creates the file I want. It works great until I run into one particular address that is not valid. Then the whole workflow stops. I get the following warning generated by the reader in the loop: “Execute failed: Not a file or knime URL: 'http://files.docking.org/2D/KB/KBCA.smi'”. I tried accessing that file with my browser, and it looks like the address is simply corrupt.

I tried a couple of workarounds, using try/catch loops and using the File Meta Info node to pre-filter the list of URLs, but neither was successful. The try/catch approach gets stuck, I think, because the message generated isn’t actually an error, just a warning. The metadata approach doesn’t work because the “exists” column does not populate for remote sites.

Can anybody think of a good way I can have my loop continue if it comes upon a corrupt address? I don’t mind skipping the data, but it is quite inconvenient if I need to keep going back to trim out the corrupt locations from my original list of URLs.

Thanks for your help!

-JW

elsamuel · August 7, 2020, 3:23pm

Hi @jdwill, welcome to the forum.

“Execute failed: Not a file or knime URL: 'http://files.docking.org/2D/KB/KBCA.smi'”. I tried accessing that file with my browser, and it looks like the address is simply corrupt.

That address works just fine for me.

Can you post your workflow?

jdwill · August 7, 2020, 3:44pm

My apologies, but I pasted the wrong link that was generating the error. The actual URL is: http://files.docking.org/2D/KB/KBCD.smi

The workflow is attached:
ZINC download.knwf (15.8 KB)

Thanks in advance for the help!

-JW

jdwill · August 7, 2020, 3:50pm

Oh, and this file would also be helpful…
ZINC-downloader-2D-smi.txt (200 Bytes)

elsamuel · August 7, 2020, 5:34pm

For this I’d try using a GET request, because it will return the status of each file download request.

As you’ve pointed out, the one that failed is the KBCD SMILES file. It turns out that this file doesn’t exist in http://files.docking.org/2D/KB/.

In the configuration of this node, you can instruct it not to fail on a 404/page not found error.
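
For anyone who wants to see the same idea outside KNIME, here is a minimal Python sketch of "request each URL, record the status, and keep going on a 404". It is not what the node does internally; the function names `fetch_status` and `download_all` are made up for illustration, and it only uses the standard library's `urllib`.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=30):
    """Return (status_code, body_text) for one URL without raising on HTTP errors."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # 404 and other HTTP error statuses land here; report them instead of failing.
        return err.code, ""


def download_all(urls, fetch=fetch_status):
    """Split a list of URLs into successful downloads and skipped entries."""
    downloaded = {}  # url -> body text
    skipped = []     # (url, status) pairs for anything that did not return 200
    for url in urls:
        status, body = fetch(url)
        if status == 200:
            downloaded[url] = body
        else:
            skipped.append((url, status))
    return downloaded, skipped
```

Calling `download_all(list_of_urls)` would return the text of every tranche that exists plus a list of the URLs that were skipped and their status codes. Connection-level failures (for example a DNS error) are deliberately not caught in this sketch, so they would still stop the run.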

Getting the requested data in usable shape requires some manipulation, but it’s nothing too complicated.
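
As a rough illustration of that kind of manipulation, here is a Python sketch that merges several downloaded text bodies into one table. It assumes (this is my assumption, not something checked against every tranche) that each file starts with a single header line followed by whitespace-separated rows; `combine_bodies` is a hypothetical helper name.

```python
def combine_bodies(bodies):
    """Merge downloaded text bodies into one list of rows, keeping the header once."""
    header = None
    rows = []
    for body in bodies:
        # Drop blank lines; an empty body (e.g. from a skipped URL) contributes nothing.
        lines = [line for line in body.splitlines() if line.strip()]
        if not lines:
            continue
        if header is None:
            # Assumption: the first non-blank line of each file is the header.
            header = lines[0].split()
        rows.extend(line.split() for line in lines[1:])
    return header, rows
```

The repeated header line of every file after the first is discarded, which is roughly the kind of cleanup the filter nodes in a workflow like this would handle.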

jdwill · August 10, 2020, 2:38pm

Wow, that worked great! I suppose that using a Get Request node (and not the file reader) was the trick to making the whole thing work. Thanks so much for the help on this!
