Using AI Agents for Dynamic Web Scraping and Data Enrichment in KNIME

Hi KNIME Community,

I’m experimenting with AI Agents in KNIME to orchestrate multiple tools dynamically based on user input. Here’s a concrete example I’m exploring:

Imagine a user wants to find Michelin-starred restaurants matching multiple criteria (cuisine type, city, price range, etc.). The workflow could be structured as follows:

  1. Predefined Scraping Tools – Multiple scraping workflows are already prepared for different sources. For example, one tool scrapes restaurants from Site A, another from Site B, etc. The AI Agent determines which tool(s) to execute based on the user’s request. An LLM could dynamically adjust filters and selectors within each scraping workflow to match the specified criteria (a rough sketch of such a parameterized tool follows after this list).

  2. Data Enrichment Tool – For each restaurant, enrich missing data by querying other sources, such as Google Maps for addresses, financial databases for business status, or other public information.

  3. Deduplication & Merge Tool – Merge restaurant entries from multiple sources and remove duplicates to ensure a clean dataset.

  4. Response Generation / Export Tool – Format the final results and export them to CSV or present them in a user-friendly report.
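To make step 1 a bit more concrete, here is roughly what I have in mind, sketched in plain Python. The URL, CSS selectors and field names are made up; in KNIME this would of course be a workflow whose filters the agent parameterizes (e.g. via flow variables) rather than a script:

```python
# Illustrative only: a plain-Python sketch of a parameterized "scraping tool".
# The site, selectors and fields are hypothetical; the agent would supply the
# parameters it extracted from the user's request.
import requests
from bs4 import BeautifulSoup

def scrape_restaurants(city: str, cuisine: str, max_price: int) -> list[dict]:
    """Scrape restaurant listings matching the criteria chosen by the agent."""
    url = f"https://example-restaurant-guide.test/search?city={city}&cuisine={cuisine}"
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    results = []
    for card in soup.select("div.restaurant-card"):            # hypothetical selector
        name = card.select_one("h2.name").get_text(strip=True)
        price = int(card.select_one("span.price").get_text(strip=True))  # e.g. "120"
        if price <= max_price:                                  # filter set by the agent
            results.append({"name": name, "city": city, "cuisine": cuisine, "price": price})
    return results

# The agent would call this with parameters parsed from the user message, e.g.:
# scrape_restaurants(city="Lyon", cuisine="French", max_price=150)
```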

My questions are:

  • How flexible are AI Agents in KNIME for modifying tool parameters or scripts on the fly, especially when an LLM needs to adjust scraping filters based on user input?

  • How are tools executed by the Agent? Can they be parallelized, or is execution strictly sequential?

  • Any best practices for managing complex scraping and enrichment workflows using AI Agents?

I’d love to hear from anyone who has experimented with similar AI-driven scraping and data enrichment workflows.

Thanks in advance!

Hey there,

I’ve been experimenting with simple to medium-complexity agentic setups, so let me try to summarise my thoughts on your project and your questions:

  1. Predefined scraping tools - in general, tools can be built for almost anything, including scraping. You can do this e.g. with the Web Interaction extension. Needless to say, you’ll have to keep a close eye on any changes to your source, as they may break your scraper.
  2. Data Enrichment / Deduplication / Export: all of the above are possible in my view. I get the feeling that you expect the agent to be very flexible, e.g. if Source A does not deliver all the results, then find Source B and fill in the gaps. That is an interesting concept, but definitely on the high-complexity side. I think it could be possible by, for example, generically extracting links from a website and then feeding the raw HTML back to the LLM to extract the relevant information (a rough plain-Python sketch of this idea follows below). Given the somewhat recursive nature of this, I am not sure whether there are technical limitations.
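To illustrate the "extract links, feed the raw HTML back to the LLM" idea, here is a rough sketch. It is not KNIME-specific; call_llm() is just a placeholder for whatever chat model / endpoint you use, and the prompts and URL are assumptions:

```python
# Rough illustration of LLM-driven enrichment from a generic page.
import requests
from bs4 import BeautifulSoup

def call_llm(prompt: str) -> str:
    raise NotImplementedError("placeholder for your LLM / chat model call")

def enrich_from_page(start_url: str, restaurant_name: str) -> str:
    html = requests.get(start_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    # Generically collect candidate links instead of hard-coding selectors.
    links = [a["href"] for a in soup.find_all("a", href=True)]

    # Let the LLM decide which link most likely contains the missing details.
    chosen = call_llm(
        f"Which of these links most likely contains details about '{restaurant_name}'? "
        "Answer with the URL only.\n" + "\n".join(links[:50])
    )

    # Fetch that page and have the LLM extract the missing field from raw HTML.
    detail_html = requests.get(chosen, timeout=30).text
    return call_llm(
        f"Extract the postal address of '{restaurant_name}' from this HTML. "
        f"Answer with the address only.\n{detail_html[:20000]}"
    )
```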

Now to your questions:

  1. Flexibility: extracting information from user messages to decide whether and which tools to pick, which parameters to send to each tool, etc. is the core job of your agent / the LLM behind it - so the answer is: very flexible.
  2. From what I have experienced so far, execution seems to be sequential, but multiple tools can be used before the agent responds. For example, say one tool queries a database for countries’ GDP growth and another creates a bar chart, and the user wants the bar chart to show GDP growth for countries A, B and C: the agent would use the query tool three times and then send the data to the bar chart tool, so there would be four tool calls (see the loop sketch after this list).
  3. Not from my end - and given that agents in KNIME are still fairly young, you might actually be dealing with a blank-canvas / blue-ocean type of scenario - so feel free to keep us (me) posted in this thread on what your experience is :slight_smile:
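On question 2, here is a conceptual sketch of the kind of sequential tool-calling loop I mean. To be clear, this is not KNIME’s actual implementation, just an illustration of "one tool call per turn, several turns before the final answer"; llm_decide and the tool bodies are placeholders:

```python
# Conceptual agent loop: the LLM is asked what to do next; each response is either
# a tool call (executed one at a time) or a final answer for the user.
from typing import Callable

def query_gdp_growth(country: str) -> dict:
    ...  # e.g. a database query tool

def make_bar_chart(data: list[dict]) -> str:
    ...  # e.g. a chart-building tool

TOOLS: dict[str, Callable] = {
    "query_gdp_growth": query_gdp_growth,
    "make_bar_chart": make_bar_chart,
}

def run_agent(user_message: str, llm_decide) -> str:
    """llm_decide is a placeholder: given the conversation so far, it returns
    either ("tool", name, args) or ("final", answer)."""
    history = [("user", user_message)]
    while True:
        decision = llm_decide(history)
        if decision[0] == "final":
            return decision[1]
        _, name, args = decision
        result = TOOLS[name](**args)   # tools run sequentially, one per loop turn
        history.append(("tool", name, result))

# For "bar chart of GDP growth for A, B, C" the loop would typically make four
# tool calls: query_gdp_growth three times, then make_bar_chart once.
```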

Hi @MartinDDDD

Thank you so much for your detailed and thoughtful reply!
It’s extremely helpful to have feedback from someone with your experience, especially since this whole “agentic” approach is quite new to me as well.

I’ll start exploring on my side with something simple first, just to get a solid understanding of the logic behind agentic workflows. Once I’m more comfortable, I plan to gradually increase the complexity.

For example, I’m thinking about a simple workflow where a user wants to collect restaurant data:

  • Tool A scrapes restaurants from a website

  • Tool B enriches the data, for instance by filling missing addresses

One thing I’m still unclear about is how the agent decides which rows to pass to each tool. For instance, if some addresses are missing after Tool A, should I pre-filter the data so that only those rows are passed to Tool B, or does the agent detect this dynamically? How is the interaction between tools handled in terms of passing variables or partial datasets?
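For reference, this is the kind of pre-filtering I would do manually if the agent does not handle it, sketched with pandas outside of KNIME (the column names and lookup_address() are assumptions, just stand-ins for Tool B):

```python
# Minimal sketch: only rows with a missing address are handed to the enrichment tool.
import pandas as pd

def lookup_address(name: str, city: str) -> str:
    ...  # placeholder for Tool B, e.g. a Google Maps / geocoding lookup

def enrich_missing_addresses(restaurants: pd.DataFrame) -> pd.DataFrame:
    missing = restaurants["address"].isna()
    # Pass only the incomplete rows to the enrichment step.
    restaurants.loc[missing, "address"] = restaurants.loc[missing].apply(
        lambda row: lookup_address(row["name"], row["city"]), axis=1
    )
    return restaurants
```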

I’ll start with this basic setup to understand the core mechanics and then gradually add more complexity.

Thanks again for taking the time to share your insights—it’s a big help and very motivating!

Best regards,
