Loop for accumulating data

Hi everybody,

I need to build a focused crawler to download some web pages and extract specific data from them, and I just had the idea of implementing it as a KNIME workflow. The general mechanism would be pretty straightforward using some Palladian and XML nodes.

In general it works as follows: I have a table containing some seed URLs, for which the corresponding HTML pages are fetched and parsed. Some of these pages might contain further links (extracted via XPath), which I also want to process.

Problem/question: Is there a way to create a loop that dynamically adds the extracted URLs to the initial seed URL table, reruns the workflow until a certain iteration count or condition is reached, and continuously appends all the extracted results to a single target table (maybe similar to the X-Validation nodes collecting data from multiple iterations)?

I looked at the various looping nodes, but I'm a bit puzzled as to whether my plan is possible at all.

Any hints or comments would be greatly appreciated :)


You need two tables along the way:

  1. extracted_data_from_one_url: the results parsed from one URL
  2. urls_left_to_handle: the URLs still left to work on at the current point

Flow to build:

  1. Send the initial URL list to the node "Delegating Loop Start".
  2. Take the top URL in the list and get extracted_data_from_one_url from it.
  3. Remove the processed URL and add the newly found URLs to get urls_left_to_handle.
  4. a. Send extracted_data_from_one_url (the outcome of step 2) to the first input of the node "Delegating Loop End".
     b. Send urls_left_to_handle to the second input of the node "Delegating Loop End".
  5. The node "Delegating Loop Start" takes urls_left_to_handle as its input to begin the next iteration.

So, the key here is that the Delegating Loop allows you to feed the output of a set of processing steps back to the loop start.
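
To make the data flow concrete, here is a rough Python sketch of the same loop logic outside KNIME. It is only an illustration, not how the nodes are implemented: FAKE_WEB, fetch_and_parse, crawl and max_iterations are made-up names, the "web" is an in-memory dictionary instead of real HTTP requests, and the `seen` set (so that a URL is not queued twice) is my own addition to the steps above.

```python
from collections import deque

# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.com/a": {"data": "A", "links": ["http://example.com/b", "http://example.com/c"]},
    "http://example.com/b": {"data": "B", "links": ["http://example.com/c"]},
    "http://example.com/c": {"data": "C", "links": ["http://example.com/a"]},
}


def fetch_and_parse(url):
    """Stand-in for the fetch + XPath steps: returns (extracted data, found links)."""
    page = FAKE_WEB.get(url, {"data": None, "links": []})
    return page["data"], page["links"]


def crawl(seed_urls, max_iterations=100):
    urls_left_to_handle = deque(seed_urls)  # the table fed back to the loop start
    seen = set(seed_urls)                   # assumption: skip URLs already queued
    collected = []                          # the table accumulated at the loop end
    iterations = 0
    while urls_left_to_handle and iterations < max_iterations:
        url = urls_left_to_handle.popleft()          # take the top URL
        data, links = fetch_and_parse(url)           # extracted_data_from_one_url
        collected.append((url, data))                # first input of the loop end
        for link in links:                           # add newly found URLs
            if link not in seen:
                seen.add(link)
                urls_left_to_handle.append(link)
        iterations += 1                              # second input feeds the next round
    return collected


if __name__ == "__main__":
    for row in crawl(["http://example.com/a"]):
        print(row)
```

The loop stops when urls_left_to_handle is empty or when max_iterations is reached, which corresponds to the "certain iteration/condition" asked about in the question.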

Thanks to Iris for letting me know about this node!

Thank you! Sounds promising, I'll give it a try, when I find some spare time :)