I need to build a focused crawler to download some web pages and extract specific data from them, and I had the idea of implementing it as a KNIME workflow. The general mechanism would be pretty straightforward using some Palladian and XML nodes.
In general, it works as follows: I have a table containing some seed URLs for which the corresponding HTML pages are fetched and parsed. Some of these pages might contain further links (extracted via XPath), which I also want to process.
Problem/question: Is there a way to create a loop that dynamically adds the extracted URLs to the initial seed URL table, reruns the workflow until a certain iteration count or condition is reached, and continuously appends all the extracted results to a single target table (maybe similar to how the X-Validation nodes collect data from multiple iterations)? A plain-Python sketch of the logic I have in mind follows below.
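To make the intended loop concrete, here is a minimal sketch in plain Python (not KNIME) of the behaviour I'd like to reproduce with looping nodes; the URLs, XPath expressions, and the iteration limit are placeholders, not part of my actual setup:

```python
import requests
from lxml import html

MAX_ITERATIONS = 3                            # stop condition: number of crawl rounds (placeholder)
seed_urls = ["https://example.com/start"]     # hypothetical seed URL table
results = []                                  # "target table" collecting results from all iterations

visited = set()
frontier = list(seed_urls)

for iteration in range(MAX_ITERATIONS):
    next_frontier = []
    for url in frontier:
        if url in visited:
            continue
        visited.add(url)

        # fetch and parse the HTML page
        page = html.fromstring(requests.get(url, timeout=10).content)

        # extract the data of interest and append it to the accumulated results (placeholder XPath)
        results.extend(page.xpath("//div[@class='item']/text()"))

        # extract further links to feed back into the "seed table" for the next iteration (placeholder XPath)
        next_frontier.extend(page.xpath("//a/@href"))

    frontier = next_frontier
    if not frontier:
        break
```

The key points are the frontier being fed back into the loop each round and the results being appended across iterations, which is exactly what I'm unsure how to model with KNIME's looping nodes.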
I looked at the various looping nodes, but I'm a bit puzzled as to whether my plan is possible at all.
Any hints or comments would be greatly appreciated :)