Introducing limbic-tool-use: A Small Model for Detecting Problems with Tool Use in AI Agents
Despite widespread adoption of tool use, there has been no dedicated model for evaluating tool-use accuracy—until now.
Hallucinations—model-generated outputs that appear confident yet contain fabricated or incorrect information—remain one of the peskiest issues facing AI engineers today. As retrieval- and search-augmented systems have proliferated, systematically identifying and mitigating hallucinations has become critical. ...
Large language models (LLMs) are increasingly being used to interact with external tools and APIs. However, evaluating their tool calling capabilities presents unique challenges compared to traditional single-turn or retrieval-augmented use cases. We reviewed a dozen papers published between 2023 and 2025 to better understand how tool calling is evaluated, and this post synthesizes our findings to help builders evaluate LLM tool calling effectively. ...
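To make the evaluation target concrete, here is a minimal, illustrative sketch of the kind of exact-match check many tool-calling evaluations start from: comparing a predicted tool call against a gold reference on both function name and arguments. The `get_weather` function and both call dicts are hypothetical examples, not taken from any of the surveyed papers.

```python
def tool_call_matches(predicted: dict, expected: dict) -> bool:
    """True when the predicted call names the expected function and
    supplies exactly the expected arguments (strict exact match)."""
    return (
        predicted.get("name") == expected.get("name")
        and predicted.get("arguments", {}) == expected.get("arguments", {})
    )

# Hypothetical example: right function, but a hallucinated extra argument.
predicted = {"name": "get_weather", "arguments": {"city": "Paris", "units": "kelvin"}}
expected = {"name": "get_weather", "arguments": {"city": "Paris"}}
print(tool_call_matches(predicted, expected))  # False
```

Strict matching like this is only a baseline; multi-turn and multi-tool settings typically need looser, semantics-aware comparisons, which is part of what makes tool-calling evaluation harder than single-turn tasks.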
On April 25th, OpenAI shared a surprising update: after introducing thumbs-up/down feedback from ChatGPT users into their GPT‑4o fine-tuning process, the model got noticeably worse. ...
As large language models (LLMs) are increasingly adopted in critical industries, ensuring their outputs are factually grounded has emerged as a major concern. One prominent issue is “hallucination,” where models generate content unsupported by or contrary to the provided evidence. Existing hallucination detection benchmarks are often limited, synthetic, or narrowly focused on specific tasks like question-answering. Recognizing this gap, we developed HalluMix: a task-agnostic, multi-domain benchmark designed to evaluate hallucination detection in realistic, diverse contexts. ...
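To show the general shape of such an evaluation, here is a small, hypothetical sketch of scoring a binary hallucination detector against labeled (documents, response) pairs. The example items, the `evaluate` helper, and the trivial keyword detector are illustrative assumptions, not the HalluMix data or methodology.

```python
from typing import Callable

# Hypothetical benchmark items: each pairs supporting documents with a response
# and a binary label (1 = hallucinated, 0 = supported by the documents).
items = [
    {"documents": ["The Eiffel Tower is in Paris."],
     "response": "The Eiffel Tower is in Paris.", "label": 0},
    {"documents": ["The Eiffel Tower is in Paris."],
     "response": "The Eiffel Tower is in Rome.", "label": 1},
]

def evaluate(detector: Callable[[list, str], int]) -> dict:
    """Score a binary hallucination detector with precision, recall, and F1."""
    tp = fp = fn = 0
    for item in items:
        pred = detector(item["documents"], item["response"])
        if pred == 1 and item["label"] == 1:
            tp += 1
        elif pred == 1 and item["label"] == 0:
            fp += 1
        elif pred == 0 and item["label"] == 1:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# A trivial substring-based detector, purely for illustration.
print(evaluate(lambda docs, resp: int(resp not in " ".join(docs))))
```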
We're thrilled to announce the release of evaluations in Quotient's Python SDK!
We're excited to introduce judges, an open-source library of LLM-as-a-judge evaluators designed to help bootstrap your evaluation process!
We present Subject-Matter Expert Language Liaison (SMELL), a novel framework that combines human expertise with LLM capabilities to automatically create feedback-informed LLM judges.
Quotient's AI-backed Policy Compliance Evaluator rigorously assessed the compliance of Wayfair's AI agent responses with internal customer support policies.
In this article, we discuss how developing AI products differs from traditional software development and why reliable evaluations are the key to consistently shipping them.
We’re excited to announce that Quotient and MongoDB are teaming up to provide developers with a new, transformative solution for enhancing their AI products with Retrieval Augmented Generation (RAG)!
We're excited to announce a new, dynamic partnership between Quotient and Qdrant.
We’re on a mission to change all that, and help developers build better AI products, faster.