Building a Reproducible LLM Workflow for Investment Advisory

Hi everyone,

I’m currently writing my bachelor’s thesis on AI in investment advisory, and I want to run an experiment where an AI “competes” against humans. I’d like to build the workflow in KNIME so that the process is as reproducible as possible.

Since reproducibility is tricky with generative AI, my current idea is to structure the pipeline like this: extract the relevant text from a PDF, send it to the model, run the same prompt multiple times, and then aggregate the outputs (for example, by taking the most frequent answer). The goal is to get at least some level of consistency through repeated runs. I’m also considering setting the temperature to 0 and constraining the output format further via the prompt.
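To make the aggregation step concrete, here is a rough Python sketch of what I have in mind. The `ask_model` function is just a placeholder for the actual API call (it's stubbed so the sketch runs standalone); the majority-vote logic with `collections.Counter` is the part I'd actually keep:

```python
from collections import Counter

def ask_model(prompt: str) -> str:
    """Placeholder for the real API call (e.g. a chat completion
    with temperature=0). Stubbed so the sketch runs without a key."""
    return "Buy"  # stand-in answer

def majority_answer(prompt: str, n_runs: int = 5) -> tuple[str, float]:
    """Run the same prompt n_runs times, return the most frequent
    answer plus its share of runs (a rough consistency score)."""
    answers = [ask_model(prompt) for _ in range(n_runs)]
    top, count = Counter(answers).most_common(1)[0]
    return top, count / n_runs

answer, agreement = majority_answer("Should the client rebalance?", n_runs=5)
print(answer, agreement)  # with the stub: Buy 1.0
```

The agreement score would also double as documentation for the thesis: reporting it per question makes the (in)consistency of the model measurable rather than anecdotal.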

What do you think about this approach? How would you implement it cleanly in practice (especially in terms of prompt design, evaluation/aggregation, and documentation)? And what should I expect cost-wise if I do this via an API, given that token usage adds up quickly?
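On the cost side, my back-of-the-envelope estimate looks like this. All the numbers below (prices, token counts, number of cases) are made-up placeholders; the real per-token prices would have to come from the provider's current pricing page:

```python
# Placeholder prices -- check the provider's pricing page for real numbers.
PRICE_IN_PER_1M = 0.50   # USD per 1M input tokens (assumed)
PRICE_OUT_PER_1M = 1.50  # USD per 1M output tokens (assumed)

tokens_in, tokens_out = 3_000, 300  # per run: PDF extract + prompt / answer
runs_per_case, n_cases = 5, 40      # repeated runs x test cases

total_in = tokens_in * runs_per_case * n_cases
total_out = tokens_out * runs_per_case * n_cases
cost = total_in / 1e6 * PRICE_IN_PER_1M + total_out / 1e6 * PRICE_OUT_PER_1M
print(f"~${cost:.2f} for {runs_per_case * n_cases} calls")
```

Even with repeated runs, the input tokens (the PDF extract sent every time) dominate, so trimming the extracted text seems like the main cost lever.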

I noticed that with a standard ChatGPT account I can’t reliably control parameters like temperature, so using the API seems like the most realistic option. I’d really appreciate any input on how to present this method in an academically sound way and how to configure it sensibly.