How to Benchmark Multiple LLMs in KNIME?

Hi everyone,

I’m currently evaluating the performance of different Large Language Models (LLMs) within KNIME and was wondering whether anyone here has already explored a similar approach.

More specifically, I’m interested in building a structured evaluation framework that would let me compare:

  • Multiple models (both local deployments and API-based models)

  • Different parameter settings (e.g. temperature, top-k, max tokens)

  • A variety of tasks, such as:

    • Text classification

    • Text generation

    • Information extraction

    • Image analysis (for multimodal models)

The goal is to assess how model choice and parameter tuning impact output quality depending on the task.
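
To make that concrete, here’s a rough sketch of the kind of sweep I’m imagining, e.g. inside a KNIME Python Script node. Note that `call_model` is a placeholder I made up for whatever client actually runs the prompt (local or API-based), and the model names, parameter values, and task prompts are purely illustrative:

```python
import itertools
import json

# Placeholder: swap in the real client for each model (local or API-based).
def call_model(model, prompt, params):
    # e.g. dispatch to an Ollama / OpenAI / other client here
    return f"[{model} @ {params}] response to: {prompt[:40]}"

models = ["local-llama", "gpt-api"]  # hypothetical model identifiers
param_grid = {
    "temperature": [0.0, 0.7],
    "top_k": [20, 50],
    "max_tokens": [256],
}
tasks = {
    "classification": "Classify the sentiment of: 'Great product!'",
    "extraction": "Extract all dates from: 'Meeting on 2024-05-01.'",
}

# Expand the parameter grid into a list of concrete settings
keys, values = zip(*param_grid.items())
settings = [dict(zip(keys, combo)) for combo in itertools.product(*values)]

# One row per (model, settings, task) combination
results = []
for model, params, (task, prompt) in itertools.product(
    models, settings, tasks.items()
):
    output = call_model(model, prompt, params)
    results.append({"model": model, "task": task, **params, "output": output})

# In KNIME this would leave the node as an output table; printed here for brevity
print(json.dumps(results, indent=2))
```

In a real workflow I’d probably drive this from a configuration table with loop nodes instead, so that every run is logged together with its full settings, which should also help with reproducibility.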

I’m particularly curious about:

  • Existing KNIME workflows or components designed for LLM benchmarking or evaluation

  • Best practices for setting up reproducible experiments in KNIME

  • Approaches for automating comparisons across models and configurations

  • Methods for integrating human evaluation or proxy metrics into the workflow (see the metric sketch after this list for what I mean)
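
On the proxy-metric side, something along these lines is what I have in mind, assuming the sweep above produces rows of (task, model, prediction, reference). The metric choices here (exact match for classification, token-overlap F1 for extraction) are only examples, not settled decisions:

```python
from collections import Counter

def exact_match(pred: str, ref: str) -> float:
    """Simple proxy metric for classification: 1 if normalized strings match."""
    return float(pred.strip().lower() == ref.strip().lower())

def token_f1(pred: str, ref: str) -> float:
    """Token-overlap F1, a common proxy metric for extraction tasks."""
    pred_tokens, ref_tokens = pred.lower().split(), ref.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Hypothetical scored rows, as they might come out of the sweep above
rows = [
    {"task": "classification", "model": "local-llama",
     "prediction": "Positive", "reference": "positive"},
    {"task": "extraction", "model": "gpt-api",
     "prediction": "2024-05-01", "reference": "2024-05-01"},
]

# Map each task to the metric used to score it
METRICS = {"classification": exact_match, "extraction": token_f1}

for row in rows:
    score = METRICS[row["task"]](row["prediction"], row["reference"])
    print(row["model"], row["task"], round(score, 3))
```

Human ratings could then be joined in as an extra score column alongside these automatic ones, so both end up in the same comparison table.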

If anyone has already worked on something similar or has relevant resources, examples, or ideas to share, I’d really appreciate your input.

Thanks in advance!