Welcome to Quotient's blog 👋

Evaluating Tool Calling Capabilities in Large Language Models: A Literature Review

May 6, 2025 · Freddie Vargus, Deanna Emery
Research

Large language models (LLMs) are increasingly being used to interact with external tools and APIs. However, evaluating their tool calling capabilities presents unique challenges compared to traditional single-turn or retrieval-augmented use cases. We reviewed a dozen papers published between 2023 and 2025 to understand how tool calling is evaluated, and in this post we synthesize our findings to help builders evaluate LLM tool calling effectively. ...

What OpenAI’s Sycophancy Problem Teaches Us About Using User Data to Evaluate AI

May 2, 2025 · Julia Neagu
Education

On April 25th, OpenAI shared a surprising update: after introducing thumbs-up/down feedback from ChatGPT users into their GPT‑4o fine-tuning process, the model got noticeably worse. ...

Introducing HalluMix: A Task-Agnostic, Multi-Domain Benchmark for Detecting Hallucinations in Real-World Scenarios

May 2, 2025 · Deanna Emery, Mike Goitia, Freddie Vargus, Julia Neagu
Research

As large language models (LLMs) are increasingly adopted in critical industries, ensuring their outputs are factually grounded has emerged as a major concern. One prominent issue is “hallucination,” where models generate content unsupported by or contrary to the provided evidence. Existing hallucination detection benchmarks are often limited, synthetic, or narrowly focused on specific tasks like question-answering. Recognizing this gap, we developed HalluMix: a task-agnostic, multi-domain benchmark designed to evaluate hallucination detection in realistic, diverse contexts. ...

Supercharge your AI evaluation pipeline with Quotient's SDK

February 3, 2025 · Julia Neagu
Announcement

We're thrilled to announce the release of evaluations in Quotient's Python SDK!

Introducing judges: a library of research-backed LLM-as-a-judge evaluators

January 9, 2025 · Julia Neagu, Freddie Vargus
Announcement

We're excited to introduce judges, an open-source library of LLM-as-a-judge evaluators designed to help bootstrap your evaluation process!

Subject-Matter Expert Language Liaison (SMELL): A framework for aligning LLM evaluators to human feedback

October 8, 2024 · Deanna Emery, James Liounis
Research

We present Subject-Matter Expert Language Liaison (SMELL), a novel framework that combines human expertise with LLM capabilities to automatically create feedback-informed LLM judges.

Wayfair: Building Customer Support AI for the Fortune 500

August 21, 2024 · Julia Neagu
Case Study

Quotient's AI-backed Policy Compliance Evaluator rigorously assessed the compliance of Wayfair's AI agent responses with internal customer support policies.

Eval-driven development for the AI Engineer

August 6, 2024 · Julia Neagu
Education

In this article, we discuss how developing AI products differs from traditional software development and why reliable evaluations are the key for consistently shipping them.

MongoDB and Quotient join forces to accelerate enterprise AI product development

May 23, 2024 · Freddie Vargus
Announcement

We’re excited to announce that Quotient and MongoDB are teaming up to provide developers with a new, transformative solution for enhancing their AI products with Retrieval Augmented Generation (RAG)!

Building high-quality RAG applications with Qdrant and Quotient

April 23, 2024 · Julia Neagu
Announcement

We're excited to announce a new, dynamic partnership between Quotient and Qdrant.

Hello World, we’re Quotient 👋

April 10, 2024 · Julia Neagu, Freddie Vargus
Announcement

We’re on a mission to change all that, and help developers build better AI products, faster.