Add ability to mark data sensitivity level to prevent accidental sharing with AI

Hi,

I’d like to propose the option to persistently mark the sensitivity level of data. That would make it possible, for example, to prevent sensitive data from being passed un-anonymized to AI agents, providing the ability to put safety measures and automations in place to anonymize data.

Best
Mike

Adding to this (I have discussed this with Mike and also with some KNIMErs at DataHop in Stuttgart):

I feel this would be a great feature - especially in the context of working with AI Agents.

The overall story of having a data layer that lets the user control what data (if any) gets fed to the LLM is already great. In my use case I still face the challenge that any data required to actually call a tool currently has to go through the LLM.

E.g. if my agent should create a customer record in my CRM, say for customer “MartinDDDD”, then:

  1. MartinDDDD is typed into chat interface
  2. It is then sent to the LLM
  3. The LLM generates the payload required to call the tool (which will include the parameter, e.g. customerName: MartinDDDD)
  4. The tool creates the record
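To make the problem concrete, here is a toy sketch of steps 1–4, with a stub standing in for the LLM (the tool and parameter names are hypothetical, not a KNIME API). The point is that whatever the user types ends up verbatim in the generated tool payload:

```python
# Stub "LLM": a real model would parse the request, but the effect is
# the same - the raw customer name passes through it into the payload.
def fake_llm_tool_call(user_message: str) -> dict:
    name = user_message.split("for customer ")[-1]
    return {"tool": "create_crm_record", "arguments": {"customerName": name}}

call = fake_llm_tool_call("Create a record for customer MartinDDDD")
# The sensitive value has now been seen by the model and sits in the payload:
assert call["arguments"]["customerName"] == "MartinDDDD"
```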

I have thought about alternatives on how to implement this:

  • Option 1: Create a custom agent UI (i.e. not using the Agent Chat view etc.) and sanitise user input before it goes to the LLM (e.g. using the Presidio extension) => this works for the input, but since the tool call happens before the response gets back, a record would be created for the sanitised name (and not using the great Chat Views / Widgets seems like a waste)
  • Option 2 (and this is just conceptual, probably impossible): somehow funnel the customer name into the data layer - but this would probably require the user to have a form available that saves the name in e.g. .table format in a temp location (which I understand is tricky right now, as relative paths are not supported between the agent-level workflow and the tools), then have the agent use a tool which fetches the data and, in the same tool, creates the record. This seems super complicated and impractical
  • Option 3: ignore this challenge in KNIME and e.g. use a self-hosted LLM (renting GPUs…), which would be pricey and likely not an option for most users
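Option 1’s pitfall can be shown in a few lines. This is a minimal sketch in which a simple regex stands in for Presidio’s entity detection (a real setup would use Presidio’s `AnalyzerEngine`/`AnonymizerEngine`; the `sanitize` function and the demo name are illustrative only):

```python
import re

# Toy stand-in for Presidio: replace the detected name with a neutral
# placeholder before the text is sent to the LLM.
def sanitize(text: str) -> str:
    return re.sub(r"MartinDDDD", "<PERSON>", text)

prompt = sanitize("Create a CRM record for customer MartinDDDD")
# The LLM now only ever sees the placeholder ...
assert "MartinDDDD" not in prompt
# ... which is exactly the problem with Option 1: without a reverse
# mapping, the tool would create a record named "<PERSON>" instead of
# the real customer.
```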

Here is how I think this could be solved in KNIME (possibly with a trade-off in terms of increased latency):

  1. MartinDDDD is typed into chat interface
  2. Agent Chat View / Prompter / Chat Widget contain a setting to trigger anonymisation, e.g. using Presidio under the hood => if active:
    1. Text from the interface / table is anonymised and a temporary (e.g. Presidio) model is stored - e.g.:
      1. Customer_A: MartinDDDD
  3. It is then sent to the LLM
  4. The LLM generates the payload required to call the tool (which will include the anonymised parameter values, e.g. customerName: Customer_A)
  5. Before the tools are invoked, the temporary Presidio model is used to de-anonymise, turning customerName: Customer_A back into customerName: MartinDDDD
  6. The tool creates the record
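The steps above can be sketched end to end. In this minimal sketch a plain dict stands in for the “temp presidio model” and a regex stands in for Presidio’s entity detection; all function and tool names are illustrative, not a KNIME or Presidio API:

```python
import re

def anonymise(text: str, mapping: dict) -> str:
    """Steps 2-3: replace detected names with stable placeholders."""
    def repl(match):
        # Customer_A, Customer_B, ... one per detected entity
        placeholder = f"Customer_{chr(ord('A') + len(mapping))}"
        mapping[placeholder] = match.group(0)
        return placeholder
    # Presidio's AnalyzerEngine would detect PERSON entities here:
    return re.sub(r"MartinDDDD", repl, text)

def deanonymise(arguments: dict, mapping: dict) -> dict:
    """Step 5: restore the real values in the tool payload before the call."""
    return {key: mapping.get(value, value) for key, value in arguments.items()}

mapping = {}
prompt = anonymise("Create a CRM record for MartinDDDD", mapping)
# Steps 3-4: the LLM only ever sees "Customer_A" and echoes it back:
payload = {"customerName": "Customer_A"}
restored = deanonymise(payload, mapping)   # step 5
assert "MartinDDDD" not in prompt
assert restored["customerName"] == "MartinDDDD"  # step 6 gets the real name
```

The key design point is that the mapping never leaves the local environment: the LLM round trip happens entirely on placeholders, and the real values are re-injected only at the boundary where the tool is invoked.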

I hope the above reasoning and example make sense. Happy to explain my views in more detail :slight_smile: