Evaluating Tool Calling Capabilities in Large Language Models: A Literature Review

Large language models (LLMs) are increasingly used to interact with external tools and APIs. However, evaluating their tool calling capabilities presents unique challenges compared to traditional single-turn or retrieval-augmented use cases. In this post, we synthesize findings from a dozen papers published between 2023 and 2025 to help builders evaluate LLM tool calling effectively. ...

May 6, 2025 · Freddie Vargus, Deanna Emery

Introducing HalluMix: A Task-Agnostic, Multi-Domain Benchmark for Detecting Hallucinations in Real-World Scenarios

As large language models (LLMs) are increasingly adopted in critical industries, ensuring their outputs are factually grounded has emerged as a major concern. One prominent issue is “hallucination,” where models generate content unsupported by or contrary to the provided evidence. Existing hallucination detection benchmarks are often limited, synthetic, or narrowly focused on specific tasks like question-answering. Recognizing this gap, we developed HalluMix: a task-agnostic, multi-domain benchmark designed to evaluate hallucination detection in realistic, diverse contexts. ...

May 2, 2025 · Deanna Emery, Mike Goitia, Freddie Vargus, Julia Neagu

Subject-Matter Expert Language Liaison (SMELL): A framework for aligning LLM evaluators to human feedback

Evaluating large language model (LLM) outputs efficiently and accurately, especially for domain-specific tasks, remains a significant challenge in AI development. We present Subject-Matter Expert Language Liaison (SMELL), a novel framework that combines human expertise with LLM capabilities to automatically create feedback-informed LLM judges. SMELL addresses the limitations of generic LLM judges while maintaining scalability through a four-stage pipeline: data annotation by human experts, feedback synthesis, rubric generation, and evaluation using the generated rubric. ...

October 8, 2024 · Deanna Emery, James Liounis
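
The SMELL abstract above names a four-stage pipeline. The sketch below is a minimal, hypothetical Python outline of how such a pipeline could be wired together; every name in it (Annotation, synthesize_feedback, generate_rubric, evaluate) is an illustrative assumption rather than the actual SMELL implementation, and the LLM calls are stubbed with placeholders.

```python
# Hypothetical sketch of a SMELL-style four-stage pipeline.
# Names and logic are illustrative assumptions, not the actual SMELL code.
from dataclasses import dataclass


@dataclass
class Annotation:
    output: str   # the LLM output being judged
    label: str    # e.g. "good" / "bad"
    comment: str  # the expert's free-text rationale


def synthesize_feedback(annotations: list[Annotation]) -> str:
    """Stage 2: condense expert comments into a feedback summary.
    A real implementation would prompt an LLM; here we just join them."""
    return "\n".join(a.comment for a in annotations)


def generate_rubric(feedback_summary: str) -> str:
    """Stage 3: turn synthesized feedback into scoring criteria.
    Placeholder for an LLM call that drafts the rubric."""
    return f"Score each output against these expert-derived criteria:\n{feedback_summary}"


def evaluate(output: str, rubric: str) -> str:
    """Stage 4: judge a new output using the generated rubric.
    Placeholder for an LLM-as-judge call with the rubric as instructions."""
    return f"[judge prompt]\n{rubric}\n\n[output to score]\n{output}"


# Stage 1 (data annotation) happens outside the code: human experts label outputs.
annotations = [
    Annotation("The capital of France is Lyon.", "bad", "Factually wrong city."),
    Annotation("The capital of France is Paris.", "good", "Correct and concise."),
]

rubric = generate_rubric(synthesize_feedback(annotations))
print(evaluate("Berlin is the capital of Germany.", rubric))
```

The key design point the sketch illustrates is that the rubric is derived from expert feedback rather than written generically, so the stage-4 judge inherits domain-specific criteria instead of relying on a one-size-fits-all prompt.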