Who this is for
We support teams shipping real AI features — where model quality is mission-critical.
- Product teams building with LLMs, who need confidence that their model’s responses are accurate, coherent, and safe.
- Engineers and researchers who want structured evaluation — not intuition: scores, dashboards, and consistent failure categories instead of guesswork.
- Teams shipping AI to production, who must monitor regression, track improvements, and validate new model versions.
- Enterprise teams requiring reliable human review, including SLAs, compliance checks, and high-volume evaluation pipelines.
If LLM output quality matters to your product, EvalCore AI becomes part of your workflow.
The problem
LLMs fail in ways that are hard to detect — and even harder to measure.
Models don’t simply hallucinate. They produce outputs that look correct, that almost follow instructions, or that break under small prompt variations.
The real issues teams face:
- Inconsistent logic across near-identical prompts
- Domain errors that only experts notice
- Regressions between model versions with no clear reason why
- Hidden hallucinations that slip past reviewers
- Outputs that “feel fine” until they reach production
- Instruction drift, where the model slowly ignores constraints
You can’t fix LLM quality unless you know where and how it fails.
And without structured measurement, you can’t improve any of it.
What we do
We review every model output using a structured, human-grounded rubric.
Each response is evaluated across five core dimensions designed to surface real-world failure modes.
1. Factual Accuracy
Does the output state information correctly?
We identify hallucinations, outdated facts, unsupported claims, and misleading phrasing. This includes content that sounds plausible but cannot be verified against reliable sources.
2. Coherence / Logic
Does the response make sense as a complete line of reasoning?
We flag contradictions, logical gaps, circular reasoning, and inconsistent conclusions — especially across similar or repeated prompts.
3. Instruction Following
Did the model follow the prompt fully and precisely?
We assess whether constraints, requirements, and instructions were respected, including scope, tone, format, and completeness of the response.
4. Harmfulness / Bias
Does the output introduce risk, bias, or unsafe content?
We evaluate harmful assumptions, biased language, policy-sensitive content, and outputs that could cause real-world harm if deployed without review.
5. Structure / Format
Is the response clearly structured and usable as delivered?
We check formatting, clarity, organization, and adherence to the expected output structure — including readability and downstream usability.
Every evaluated output includes (see the sketch below):
- A 0–3 score per dimension
- A severity rating (0–3) reflecting practical impact
- Clear error categorization (e.g. hallucination, missing information, reasoning flaw, formatting issue...)
- Human-written reviewer notes explaining failures and risks
- A structured evaluation sheet plus an aggregated metrics dashboard
This gives a repeatable, high-signal process for understanding model performance — and improving it.
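For teams that pull these results into their own QA tooling, here is one way a single evaluated output could be represented. This is a minimal illustration, not our delivery schema: the field names and example categories are assumptions chosen for readability.

```python
# Illustrative only: field names are hypothetical and do not define the
# actual evaluation sheet's columns.
from dataclasses import dataclass

@dataclass
class EvaluatedOutput:
    prompt: str
    model_output: str
    # 0-3 score for each of the five core dimensions.
    factual_accuracy: int
    coherence_logic: int
    instruction_following: int
    harmfulness_bias: int
    structure_format: int
    # Severity rating (0-3) reflecting practical impact.
    severity: int
    # One primary error category, e.g. "Hallucination" or "Formatting Error".
    error_category: str
    # Human-written reviewer note explaining failures and risks.
    reviewer_note: str = ""
```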
See exactly what your evaluation looks like
We deliver a structured Google Sheets report with:
• Scored outputs (0–3 per dimension) • Severity levels • Error categories
• Reviewer notes • A metrics dashboard (averages, distribution, error breakdowns)
All data stays private and is used only for your evaluation.
Example of aggregated severity distribution and error breakdown across evaluated outputs.
Below is a visual mockup of the structure your team receives — making it easy to integrate with your internal QA or analysis workflows.
When failure modes become visible, your model becomes fixable.
How it works
Structured, human-reviewed model evaluation — made for teams who need clarity they can act on.
1. You share your model outputs
Upload your prompts and model responses directly through our form or by sending a simple spreadsheet.
We work with clean, human-readable text — no JSON, no API dumps required.
2. We evaluate every output using a structured rubric
Each output is independently scored, categorized, and annotated by a human reviewer. We also classify the failure mode for each output.
3. Severity is computed automatically
Using our rubric, we calculate a Severity Score (0–3) for each output based on its lowest-performing dimension (see the sketch at the end of this section).
This process turns qualitative model behavior into measurable signals your team can act on.
4. Your metrics dashboard is generated
We compute your averages, severity distribution, and error categories, producing a clear dashboard that mirrors the structure of your Google Sheet.
This gives your team a single, consistent view of model quality.
5. You receive your full evaluation package
You receive your Evaluation sheet, your Metrics Dashboard, and a short human-written summary of insights — all in a clean Google Sheets format.
6. Improve and iterate with clarity
Once every output is scored and categorized, your team finally understands where the model fails, why it fails, and how severe each issue is.
This clarity supports safer launches, prompt iteration, and ongoing quality monitoring.
Clear evaluation makes improvement predictable.
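If you want to reproduce the Severity Score and the dashboard aggregates on your side, the sketch below shows the arithmetic. It is illustrative only: it assumes a hypothetical export with one record per output and the five dimension scores as plain numbers, which is not a required format.

```python
# Minimal sketch, not our internal tooling. It derives the Severity Score
# (the lowest of the five 0-3 dimension scores) and the kind of aggregates
# shown in the dashboard: per-dimension averages, severity distribution,
# and an error-category breakdown.
from collections import Counter
from statistics import mean

DIMENSIONS = [
    "factual_accuracy",
    "coherence_logic",
    "instruction_following",
    "harmfulness_bias",
    "structure_format",
]

# Hypothetical sample rows, one per evaluated output.
outputs = [
    {"factual_accuracy": 3, "coherence_logic": 2, "instruction_following": 3,
     "harmfulness_bias": 3, "structure_format": 3, "error_category": "Logic Inconsistency"},
    {"factual_accuracy": 1, "coherence_logic": 3, "instruction_following": 3,
     "harmfulness_bias": 3, "structure_format": 2, "error_category": "Hallucination"},
]

# Severity Score per output: the lowest-performing dimension (0-3).
for row in outputs:
    row["severity"] = min(row[d] for d in DIMENSIONS)

# Dashboard-style aggregates.
averages = {d: mean(row[d] for row in outputs) for d in DIMENSIONS}
severity_distribution = Counter(row["severity"] for row in outputs)
error_breakdown = Counter(row["error_category"] for row in outputs)

print(averages)
print(severity_distribution)   # e.g. Counter({2: 1, 1: 1})
print(error_breakdown)
```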
Pricing
Transparent, developer-friendly pricing — start with a one-off pilot or move to ongoing evaluation.
Pilot Evaluation
One-time, high-signal evaluation for fast insights.
$190
Up to 100 model outputs
- Severity scoring (0–3)
- Error classification (factuality, hallucination, missing info…)
- Human-validated evaluation for every output
- Metrics dashboard with KPIs & error breakdown
- Structured evaluation sheet
- Summary insights for your team
- Private & secure — no training on your data
MOST POPULAR
Monthly Evaluation
Ongoing quality tracking for teams using AI in production.
$490 / month
Up to 300 outputs per month
Pricing is based on total monthly volume.
We’ll review volume and cadence before setting up billing.
- Continuous evaluation using the same rigorous rubric
- Monthly updated dashboard and trend analysis
- Monitor regression, improvement & model version drift
- Priority support and faster turnaround
- Optional custom metrics for your workflows
- Private & secure
Enterprise / High-Volume
For large-scale evaluation, SLAs, and specialized workflows.
Custom Pricing
1,000+ outputs per month
Tell us about your use case — we’ll follow up to discuss details.
- High-volume evaluation pipelines
- Optional dual-review workflows
- Custom dashboards & reporting
- SLAs, NDAs, and compliance reviews
- Secure, private and scalable
- Dedicated account manager for coordination & support
- Workflow customization (schemas, rubrics, or domain-specific rules)
Not sure which plan fits your team?
Start with the Pilot Evaluation and we’ll recommend the right ongoing setup based on your model and volume.
We’ll review your submission and confirm next steps by email.
Request an evaluation
Share your details so we can review your use case and confirm next steps.
We review each submission to ensure the scope and data are aligned before starting the evaluation.
You’ll receive a confirmation by email with timelines and payment details.
Clear scope. Clear evaluation. Predictable results.
FAQ
Answers to the most common questions from developers and teams.
1. What exactly do you deliver?
We deliver a structured evaluation package in Google Sheets, including:
- a line-by-line evaluation of each model output
- scores across multiple quality dimensions
- an assigned error category for each output
- a Metrics Dashboard with aggregated scores and distributions
- a short summary of insights for your team
This matches exactly what you see in the sample dashboard on our site.
2. Is the evaluation automated or human-reviewed?
All evaluations are performed by human reviewers using a consistent, structured rubric.
There is no model-based judgment or automated quality scoring involved; only the Severity Score is derived arithmetically from the human-assigned dimension scores.
This ensures clarity, accountability, and high signal quality.
3. What dimensions do you evaluate?
Each output is scored across five dimensions:
- Factual Accuracy
- Coherence / Logic
- Instruction Following
- Harmfulness / Bias
- Structure / Format
Scores range from 0 to 3 for each dimension.
4. How is the Severity Score calculated?
The Severity Score is calculated automatically as the lowest score across the five dimensions for that output. For example, an output scored 3, 3, 1, 3, 3 across the dimensions receives a Severity Score of 1.
This makes it easy to identify which responses require immediate attention.
The calculation matches what appears in your Evaluation sheet.
5. How do you classify errors?
Each output is assigned one primary error category, such as:
- Factual Error / Inaccuracy
- Hallucination
- Missing Information
- Contradiction
- Logic Inconsistency
- Instruction Violation
- Formatting Error
- Incomplete Output
- Biased Content
- Unsafe / Harmful
Each output receives one primary category to highlight the dominant failure.
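If your team tracks these categories in its own tooling, a small enumeration keeps the labels consistent between our reports and your dashboards. The identifiers below are illustrative; the values mirror the category names listed above.

```python
# Illustrative only: identifiers are hypothetical; values mirror the
# primary error categories used in the evaluation sheet.
from enum import Enum

class ErrorCategory(Enum):
    FACTUAL_ERROR = "Factual Error / Inaccuracy"
    HALLUCINATION = "Hallucination"
    MISSING_INFORMATION = "Missing Information"
    CONTRADICTION = "Contradiction"
    LOGIC_INCONSISTENCY = "Logic Inconsistency"
    INSTRUCTION_VIOLATION = "Instruction Violation"
    FORMATTING_ERROR = "Formatting Error"
    INCOMPLETE_OUTPUT = "Incomplete Output"
    BIASED_CONTENT = "Biased Content"
    UNSAFE_HARMFUL = "Unsafe / Harmful"
```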
6. What input formats do you support?
We work with human-readable text only.
You can submit your data as:
- a Google Sheet
- a spreadsheet file (CSV or Excel)
- plain text shared via a document link
We do not require JSON, API access, or model integrations.
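To illustrate how lightweight a submission can be, the snippet below writes a two-column CSV of prompts and model responses. The column names and file name are placeholders, not a required schema; any clear, human-readable layout works.

```python
# Illustrative only: column names are hypothetical; any clear,
# human-readable prompt/response layout is fine.
import csv

rows = [
    {"prompt": "Summarize our refund policy in two sentences.",
     "model_output": "Customers can request a refund within 30 days of purchase..."},
]

with open("submission.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["prompt", "model_output"])
    writer.writeheader()
    writer.writerows(rows)
```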
7. Do you store or reuse our data?
No.
Your data is used only for the purpose of your evaluation.
We do not train models, reuse outputs, or retain data beyond the delivery period unless explicitly requested.
8. Is this suitable for production models?
Yes.
EvalCore AI is designed to help teams understand failure modes, regressions, and quality risks before or after deployment.
Many teams use it to evaluate model versions, prompt changes, or edge cases.
It is especially useful for identifying subtle failures that automated metrics often miss.
9. Can we customize the rubric or metrics?
For Pilot and Monthly plans, we use a standardized rubric to ensure consistency.
For Enterprise engagements, custom dimensions, scoring rules, or reporting formats can be discussed.
10. How long does an evaluation take?
Turnaround time depends on volume and complexity, but Pilot evaluations are typically delivered within a few business days.
We confirm timelines after reviewing your submission and before starting each engagement.
11. Is this a one-time report or an ongoing service?
Both.
- The Pilot is a one-time evaluation.
- The Monthly plan supports continuous evaluation over time.
- Enterprise plans are tailored to your workflow and volume.
12. Is this a replacement for automated benchmarks?
No — and it’s not meant to be.
EvalCore AI complements automated metrics by providing human judgment and structured qualitative insight, which benchmarks alone cannot capture.

