Introducing limbic-tool-use: A Small Model for Detecting Problems with Tool Use in AI Agents
Despite widespread adoption of tool use, there has been no dedicated model for evaluating tool-use accuracy—until now.
Hallucinations—model-generated outputs that appear confident yet contain fabricated or incorrect information—remain one of the peskiest issues facing AI engineers today. As retrieval- and search-augmented systems have proliferated, systematically identifying and mitigating hallucinations has become critical. ...
Large language models (LLMs) are increasingly being used to interact with external tools and APIs. However, evaluating their tool calling capabilities presents unique challenges compared to traditional single-turn or retrieval-augmented use cases. We reviewed a dozen papers published between 2023 and 2025 to better understand how tool calling is evaluated, and this post synthesizes our findings to help builders evaluate LLM tool calling effectively. ...
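To make the evaluation target concrete, here is a minimal, illustrative sketch of the kind of exact-match check many tool-calling evaluations start from: comparing a predicted tool call against a gold reference on both function name and arguments. The `get_weather` function and both call dicts are hypothetical examples, not taken from any of the surveyed papers.

```python
def tool_call_matches(predicted: dict, expected: dict) -> bool:
    """True when the predicted call names the expected function and
    supplies exactly the expected arguments (strict exact match)."""
    return (
        predicted.get("name") == expected.get("name")
        and predicted.get("arguments", {}) == expected.get("arguments", {})
    )

# Hypothetical example: right function, but a hallucinated extra argument.
predicted = {"name": "get_weather", "arguments": {"city": "Paris", "units": "kelvin"}}
expected = {"name": "get_weather", "arguments": {"city": "Paris"}}
print(tool_call_matches(predicted, expected))  # False
```

Strict matching like this is only a baseline; multi-turn and multi-tool settings typically need looser, semantics-aware comparisons, which is part of what makes tool-calling evaluation harder than single-turn tasks.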
On April 25th, OpenAI shared a surprising update: after introducing thumbs-up/down feedback from ChatGPT users into their GPT‑4o fine-tuning process, the model got noticeably worse. ...
As large language models (LLMs) are increasingly adopted in critical industries, ensuring their outputs are factually grounded has emerged as a major concern. One prominent issue is “hallucination,” where models generate content unsupported by or contrary to the provided evidence. Existing hallucination detection benchmarks are often limited, synthetic, or narrowly focused on specific tasks like question-answering. Recognizing this gap, we developed HalluMix: a task-agnostic, multi-domain benchmark designed to evaluate hallucination detection in realistic, diverse contexts. ...
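To show the general shape of such an evaluation, here is a small, hypothetical sketch of scoring a binary hallucination detector against labeled (documents, response) pairs. The example items, the `evaluate` helper, and the trivial keyword detector are illustrative assumptions, not the HalluMix data or methodology.

```python
from typing import Callable

# Hypothetical benchmark items: each pairs supporting documents with a response
# and a binary label (1 = hallucinated, 0 = supported by the documents).
items = [
    {"documents": ["The Eiffel Tower is in Paris."],
     "response": "The Eiffel Tower is in Paris.", "label": 0},
    {"documents": ["The Eiffel Tower is in Paris."],
     "response": "The Eiffel Tower is in Rome.", "label": 1},
]

def evaluate(detector: Callable[[list, str], int]) -> dict:
    """Score a binary hallucination detector with precision, recall, and F1."""
    tp = fp = fn = 0
    for item in items:
        pred = detector(item["documents"], item["response"])
        if pred == 1 and item["label"] == 1:
            tp += 1
        elif pred == 1 and item["label"] == 0:
            fp += 1
        elif pred == 0 and item["label"] == 1:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# A trivial substring-based detector, purely for illustration.
print(evaluate(lambda docs, resp: int(resp not in " ".join(docs))))
```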
We're thrilled to announce the release of evaluations in Quotient's Python SDK!
We're excited to introduce judges, an open-source library of LLM-as-a-judge evaluators designed to help bootstrap your evaluation process!
We present Subject-Matter Expert Language Liaison (SMELL), a novel framework that combines human expertise with LLM capabilities to automatically create feedback-informed LLM judges.
Quotient's AI-backed Policy Compliance Evaluator rigorously assessed the compliance of Wayfair's AI agent responses with internal customer support policies.
In this article, we discuss how developing AI products differs from traditional software development and why reliable evaluations are the key to consistently shipping them.
We’re excited to announce that Quotient and MongoDB are teaming up to provide developers with a new, transformative solution for enhancing their AI products with Retrieval Augmented Generation (RAG)!
We're excited to announce a new, dynamic partnership between Quotient and Qdrant.
We’re on a mission to change all that, and help developers build better AI products, faster.