Downloading 'Input Files' for a KNIME workflow

I'm using KNIME 3.1.2 on OS X and Linux for OpenMS analysis (mass spectrometry).

Currently, the workflow uses static filename.mzML files that are put in a directory manually. It always takes in more than one file at a time (the 'Input Files' KNIME node, not the 'Input File' node), using a ZipLoopStart.

I want these files to be downloaded dynamically and then fed into the workflow...but I'm not sure of the best way to do that.

Currently, I have a Python script that downloads .gz files (from AWS S3) and then unzips them. I already have variations that can unzip the files into memory (using StringIO and maybe pass them into the workflow from there as data??). 

It can also download them to a directory...which maybe can then be used as the source? But I don't know how to tell the ZipLoop to wait and check the directory after the Python script is run.

I also could have the python script run as a separate entity (outside of knime) and then, once the directory is populated, call knime...HOWEVER there will always be a different number of files (maybe 1, maybe three)...and I don't know how to make the 'Input Files' knime node to handle an unknown number of input files. 

I hope this makes sense. 

Thanks!

Mike

Hi Mike,

I would have the Python script do its job of downloading and unzipping the files to a directory accessible to KNIME, then write a placeholder trigger file in the same directory once done (say "trigger.txt").
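
A rough sketch of what the Python side could look like, assuming the .gz files have already been downloaded into a local folder (the S3 download itself is left out here). The helper name unpack_and_signal and its arguments are made up for illustration; only the "trigger.txt" name comes from the suggestion above.

```python
import gzip
import shutil
from pathlib import Path


def unpack_and_signal(download_dir, target_dir, trigger_name="trigger.txt"):
    """Gunzip every *.gz in download_dir into target_dir, then write a trigger file.

    Hypothetical helper: names and folder layout are assumptions, not KNIME API.
    """
    download_dir = Path(download_dir)
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    unpacked = []
    for gz_path in sorted(download_dir.glob("*.gz")):
        # "sample.mzML.gz" -> "sample.mzML"
        out_path = target_dir / gz_path.stem
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        unpacked.append(out_path)

    # Write the trigger last, so it only appears once every file is complete.
    (target_dir / trigger_name).write_text("done\n")
    return unpacked
```

Writing the trigger file as the very last step is the point of the design: anything watching the folder never sees a half-written .mzML file.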

On the KNIME side, you can use the Wait... node to monitor the folder until the trigger.txt file appears, then use the List Files to list all files in that folder (filtering out the trigger.txt file which you don't need).

With the list of files in the folder, you can enter a Chunk Loop and go through them one by one, reading, processing and writing what you need to.

Would that work?

Cheers,
Marco.

Hi Marco, 

That's a good idea. I did play around a little bit with the 'List Files' node...but, while I could get it to list the files, I couldn't attach it directly to the ZipLoop. Can you tell me what I need to put in between?

I'm very new to KNIME. I'm sure I'm missing something obvious.

Thanks!

P.s. Also, having looked at the 'Wait node' I have absolutely no idea how to configure that. Continuing to read the docs and google the web...but if you could offer some advice there, I'd take it :D

 

HA! OK. I figured it out. Being new to Knime, I don't know if this is an efficient use of Knime, or a complete Kluge...but it does work. 

So, part of the problem is some of the Knime specific objects - One of which is called URIDataValue.

A Python pandas DataFrame is, apparently, interchangeable with the KNIME tables. However, I don't know if there's a way to import one of these URIDataValue objects into Python. So here's what I did...

1. I wrote a Python script that creates a pandas DataFrame and populates it with one column:

from pandas import DataFrame

# Build a one-column table of file URIs, stored as plain strings
T = DataFrame(
    [
        ['file:///Users/.../copy/lfq_spikein_dilution_1.mzML'],
        ['file:///Users/.../copy/lfq_spikein_dilution_2.mzML'],
    ],
    columns=['URIDataValue'],
)
# print(T)
# The Python Source node returns whatever is assigned to output_table
output_table = T

Resulting pandas DataFrame:

URIDataValue (string)
file:///Users/.../copy/lfq_spikein_dilution_1.mzML (string)
file:///Users/.../copy/lfq_spikein_dilution_2.mzML (string)

Note: The column name and values are strings. This is important, since (apparently) if the column name is not 'URIDataValue' the next node doesn't know what to do. 

NEXT, the 'output_table' from the 'Python Source' node is patched to a 'String to URI' node, which (apparently and magically) knows to change the entire column's string values to URIDataValues (presumably based on the name of the first column...don't know that for sure).

Finally, the NEW table, with the correct data objects, goes to a 'URI to Port' node...since apparently 'Port' objects and 'URI' objects are different.

This, then, matches the needed input to the ZipLoop...which is normally the output from a static (hard coded) 'Input Files' node.
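
To cope with an unknown number of files, the hard-coded list could be replaced by a directory listing. This is only a sketch, assuming Python 3 (for pathlib) and that the files end in .mzML; the function name list_mzml_uris is made up for illustration.

```python
from pathlib import Path


def list_mzml_uris(directory, pattern="*.mzML"):
    """Return file:// URIs (as plain strings) for every matching file in directory."""
    folder = Path(directory).resolve()  # as_uri() needs an absolute path
    return [p.as_uri() for p in sorted(folder.glob(pattern)) if p.is_file()]
```

Inside the 'Python Source' node, the result would then go into the same one-column table as before, e.g. output_table = DataFrame({'URIDataValue': list_mzml_uris('/path/to/folder')}), where the folder path is a placeholder.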

Tadaa!

I have no idea what I'm doing, but it worked. 

 

Hi, glad you could figure it out in the meantime.

The difference between a URI and a String, besides being labeled as such internally, is that the URI has a well-defined structure while a String hasn't. Converting from String to URI ensures that the URI has the proper structure and is therefore valid:

  scheme:[//[user:password@]host[:port]][/]path[?query][#fragment]
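
To see those components in practice, here is a small Python sketch using the standard library's urlsplit. The example URIs are hypothetical, and this is not what the String to URI node does internally; it only shows how a string with URI structure differs from a bare file name.

```python
from urllib.parse import urlsplit


def describe_uri(text):
    """Split text into the URI components named in the template above."""
    parts = urlsplit(text)
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


# Hypothetical example values, not taken from the workflow
full = describe_uri("https://user:secret@example.org:8080/data/run1.mzML?rev=2#top")
local = describe_uri("file:///data/run1.mzML")
bare = describe_uri("run1.mzML")  # a plain string: no scheme at all
```

The file:// form has a scheme and an absolute path but no host, while the bare name has an empty scheme.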

The ZipLoopStart node on the other hand expects not a URI table but a URI data object to its input port. Hence the necessity to use the URI to Port node to convert between the two. This is exactly what you did.

Regarding the Wait... node, here is how to use it.

1) Place the Wait... node at the beginning of the workflow that needs to be executed when the trigger.txt file is created.

2) Configure the Wait... node to wait for File Creation (third option) and indicate the directory/filename to wait for (e.g. trigger.txt)

3) Display the Flow Variable port of your second workflow node, the one that should start executing once the Wait... node has been triggered by the creation of the trigger.txt file. You should see two red "Mickey Mouse ears" on the node. 

4) Connect the output port of the Wait... node (right port) to the flow variable input port (left port) of the second workflow node. In this way, you are telling the second node to wait for the Wait... node to trigger before executing.

This is pretty much it. The Wait... node will now wait for the creation of the trigger.txt file and, once that happens, allow the workflow to continue from the second node on.
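
If you ever need the same behaviour in a script outside KNIME, it boils down to a polling loop. The sketch below only illustrates the idea and makes no claim about how the Wait... node is implemented; the function name and the timeout and poll_interval parameters are my own assumptions.

```python
import time
from pathlib import Path


def wait_for_file(path, timeout=60.0, poll_interval=0.5):
    """Block until path exists; return True if it appeared, False on timeout."""
    deadline = time.monotonic() + timeout
    target = Path(path)
    while True:
        if target.exists():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```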

Here is how it should look with your other nodes.

Hope this helps.

Cheers,
Marco.

Hi Marco, 

Thanks again for the reply...and the tutorial on the Wait... node. That's going to come in handy.

Some questions on the URI object: 
- I'm assuming this is a Java object (that's the underlying code for KNIME, isn't it?)
- Do you know if Python can generate the object directly (not through all the conversion nodes I used)?

Thanks!

Mike

Hi Mike,

yes, it's a Java IURIPortObject, but I don't know whether you can create such an object in Python and then inject it into a KNIME workflow. This kind of Python to Java interaction goes beyond my knowledge, sorry!

You may want to have a look at the source code for the ZipLoopStart node here:

http://www.programcreek.com/java-api-examples/index.php?source_dir=GenericKnimeNodes-master/com.genericworkflownodes.knime/src/com/genericworkflownodes/knime/nodes/flow/listzip/ListZipLoopStartNodeModel.java

Look for the execute method.

This said, what you can do (and already did) is to generate in Python a table with a string column and then convert it to URI Port Object via the String to URI and URI to Port nodes. Any reasons why this isn't a suitable solution?

Cheers,
Marco.

 

Hey Mike,

I know it is a bit late but I want to share it with others: Exactly for that problem I rewrote the List Remote Files node from the File Handling plugin to directly output URIPortObjects.

It is called Input Directory and is part of the GenericKNIMENodes package (required by OpenMS). You can give it a regex (so the number of files can vary) and it is directly compatible with the ZipLoop.
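
To illustrate how a regex lets the file count vary, here is a small Python sketch. It assumes the pattern has to match the whole file name; check the node description for the exact matching rules. The function name select_files and the sample names are made up.

```python
import re


def select_files(names, pattern=r".*\.mzML"):
    """Keep only the names that match pattern in full."""
    regex = re.compile(pattern)
    return [n for n in names if regex.fullmatch(n)]


# Hypothetical folder contents
selected = select_files(["a.mzML", "trigger.txt", "b.mzML.gz", "c.mzML"])
```

With that pattern, only the names ending in .mzML are kept, however many there are.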

Nonetheless, the String To URI and URI to Port combination might come up at other places interfacing with usual workflow control nodes, so it's handy to know.

Cheers

Julianus