Question about summarization and A.I.

Hello friends
I would like to ask a question about the concept of API requests.

I am using Gemini, and I decided to run the following test:

I am studying languages and using a flashcard app.
This app (AnkiApp) allows uploading content via a CSV file.
I downloaded a list of 10,000 words from GitHub.

  1. I created a workflow in KNIME to read the Excel file with the 10,000 rows.
  2. I connected to Gemini and set up the following prompt: Translate the word from English to Japanese and give me at least 1 example, or if needed, a max of 5 examples with context: "word"
  3. My list has 10,000 words, meaning 10,000 rows. When I send the request to Gemini, 10,000 requests are made, one per row. The return is the translation with the examples, and the whole run takes about 20 minutes (see the sketch just after this list).
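For reference, here is a minimal sketch of that row-by-row pattern outside of KNIME, assuming the google-generativeai Python SDK, a placeholder API key, and a hypothetical one-column words.csv; in the workflow itself the same logic is spread across the reader and prompter nodes:

```python
import csv
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")           # placeholder key
model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model name

with open("words.csv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f):
        word = row[0]
        prompt = (
            "Translate the word from English to Japanese and give me at least "
            f"1 example, or if needed, a max of 5 examples with context: \"{word}\""
        )
        # One request per row -> 10,000 rows means 10,000 API calls.
        response = model.generate_content(prompt)
        print(word, "->", response.text)
```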

But my real question arose from the following:
What if my database were related to another topic and I needed Gemini to analyze the entire content and give me a summary of the result?
What I want to know is: if the analysis depends on the A.I. reading the entire document before it can produce an output, would I still have to make 10,000 requests?

In practical and performance terms, how does an A.I. produce a summary analysis of a spreadsheet with 10,000 rows?
Doing it row by row doesn't make sense to me.

(I understand that KNIME has its own tools for data analysis, like GroupBy, Average, etc., and obviously not everything needs to be done by A.I.; my question is what to do in a situation that requires summarizing a large table of information.)

(I understand that if I wanted to summarize a text-related subject, I wouldn't do it row by row, but would instead use a data type like 'document'.)


Hey,

So in general I’d split this into two scenarios:

  1. The 10k rows in your DB together have a meaning that can be understood by reading the full text - say, for whatever reason, you have a book split into 10k chunks and each chunk is a row in your DB.
  2. Your 10k rows are more like true tabular data - e.g. assume you have 10k records of time series data, maybe related to the stock price development of different stocks.

In scenario 1 - if it applies - I think you have given part of the answer yourself: combine the 10k chunks into one string and create your prompt. Then make one request to a model with a large enough context window instead of 10k requests…
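A minimal sketch of that single-request approach, again assuming the google-generativeai SDK and a hypothetical one-text-column CSV (file and model names are placeholders):

```python
import csv
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")          # placeholder key
model = genai.GenerativeModel("gemini-1.5-pro")  # assumed long-context model

# Concatenate all row texts into one document string.
with open("chunks.csv", newline="", encoding="utf-8") as f:
    full_text = "\n".join(row[0] for row in csv.reader(f))

prompt = f"Summarize the following document:\n\n{full_text}"

# One request for the whole table instead of 10,000 per-row requests.
response = model.generate_content(prompt)
print(response.text)
```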

In scenario 2: LLMs are known to be pretty bad with tabular data - it might be that this has improved since I last looked deeper into it (I played around with "Chain of Table" about a year ago, with not-so-good results - maybe google it…).
So in this case, in my view, the way to go is LLMs with enhanced capabilities like code writing and accepting files as input. E.g. you can already upload a file into ChatGPT, switch on canvas (coding mode), and let it analyse the file with Python…
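As an illustration, the code such a model typically writes for that kind of summary is plain pandas; the file and column names below are made up:

```python
import pandas as pd

# Hypothetical stock time-series table with columns: ticker, date, close.
df = pd.read_csv("stocks.csv", parse_dates=["date"])

# Classic aggregation instead of one LLM call per row.
summary = (
    df.groupby("ticker")["close"]
      .agg(["mean", "min", "max", "std"])
      .round(2)
)
print(summary)
```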
I'm not sure if this can be done via KNIME yet, but I'm pretty certain this new "Message" data type that was introduced with 5.5 alongside the Agentic Features will keep evolving…
Maybe also try to turn your 10k records into a JSON object and pass that in as a string in your prompt…
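That could look roughly like the following sketch (hypothetical file name; pandas assumed for the serialization):

```python
import pandas as pd

df = pd.read_csv("records.csv")  # hypothetical 10k-row table

# Serialize the whole table as a JSON array of records.
records_json = df.to_json(orient="records")

prompt = (
    "Here is a table as JSON records. Summarize the main trends:\n\n"
    f"{records_json}"
)

# Send `prompt` in a single generate_content call as in the earlier sketches;
# for 10k rows, first check that the JSON still fits the model's context window.
print(len(prompt), "characters in the prompt")
```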

You could also already build a KNIME-based agent and give it a tool on the "data layer" - i.e. a tool that uses your table of 10k records as an input. In that case you need to define this tool yourself for your specific use case…
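Conceptually, such a data-layer tool is just a function over the table that the agent can call; how it gets registered depends on the agent setup, so the example below is only a hypothetical sketch with made-up names:

```python
import pandas as pd

df = pd.read_csv("records.csv")  # the 10k-row table the tool works on

def describe_column(column: str) -> str:
    """Hypothetical data-layer tool: return summary statistics for one column.
    An agent calls this instead of pulling all 10k rows into its context."""
    if column not in df.columns:
        return f"Unknown column: {column}"
    return df[column].describe().to_string()

# Example call an agent might issue while answering a user question:
print(describe_column("close"))
```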

Just some thoughts :slight_smile:


Thanks for the information, Martin (as always :slight_smile: ) :fist_right: :sparkles: :fist_left:

The JSON strategy I think is good.
I'll try it later.

