# RubricLLM
Lightweight LLM evaluation framework for Ruby, inspired by Ragas, powered by RubyLLM.
Provider-agnostic evaluation with pluggable metrics, statistical A/B comparison, and test framework integration — no Rails, no ActiveRecord, no UI. Works anywhere Ruby runs.
## Installation

Add to your Gemfile:

```ruby
gem "rubric_llm"
```

Or install directly:

```shell
gem install rubric_llm
```
## Quick Start

```ruby
require "rubric_llm"

RubricLLM.configure do |c|
  c.judge_model = "gpt-4o"
  c.judge_provider = :openai
end

result = RubricLLM.evaluate(
  question: "What is the capital of France?",
  answer: "The capital of France is Paris, located on the Seine river.",
  context: ["Paris is the capital and largest city of France."],
  ground_truth: "Paris"
)

result.faithfulness # => 0.95
result.relevance    # => 0.92
result.correctness  # => 0.98
result.overall      # => 0.94
result.pass?        # => true
```
## Configuration

### Global

```ruby
RubricLLM.configure do |c|
  c.judge_model = "gpt-4o"   # any model RubyLLM supports
  c.judge_provider = :openai # :openai, :anthropic, :gemini, etc.
  c.temperature = 0.0        # deterministic scoring (default)
  c.max_tokens = 4096        # max tokens for judge response
end
```
### Environment Variables

All config fields can be set via environment variables:

| Variable | Default | Description |
|---|---|---|
| `RUBRIC_JUDGE_MODEL` | `gpt-4o` | Judge LLM model name |
| `RUBRIC_JUDGE_PROVIDER` | `openai` | RubyLLM provider |
| `RUBRIC_TEMPERATURE` | `0.0` | Judge temperature |
| `RUBRIC_MAX_TOKENS` | `4096` | Max response tokens |
| `RUBRIC_MAX_RETRIES` | `2` | Max retries on transient failures |
| `RUBRIC_RETRY_BASE_DELAY` | `1.0` | Base delay (seconds) for exponential backoff |
| `RUBRIC_CONCURRENCY` | `1` | Thread pool size for batch evaluation |

```ruby
# Reads all RUBRIC_* env vars automatically
config = RubricLLM::Config.from_env
```
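The two retry settings combine as exponential backoff: each retry waits longer than the last. A minimal sketch of the resulting delay schedule (the doubling formula here is an assumption for illustration, not taken from the gem's source):

```ruby
# Illustrative delay schedule from the retry settings (hypothetical formula).
base_delay  = 1.0 # RUBRIC_RETRY_BASE_DELAY
max_retries = 2   # RUBRIC_MAX_RETRIES

delays = (0...max_retries).map { |attempt| base_delay * (2**attempt) }
delays # => [1.0, 2.0] — wait 1s before the first retry, 2s before the second
```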
### Per-Evaluation Override

```ruby
custom = RubricLLM::Config.new(judge_model: "claude-haiku-4-5", judge_provider: :anthropic)

result = RubricLLM.evaluate(question: "...", answer: "...", config: custom)
report = RubricLLM.evaluate_batch(dataset, config: custom)
```
### Rails Setup

```ruby
# config/initializers/rubric_llm.rb
RubricLLM.configure do |c|
  c.judge_model = "gpt-4o"
  c.judge_provider = :openai
end
```
## Metrics

### LLM-as-Judge Metrics
These metrics use a judge LLM to evaluate quality. Each sends a structured prompt and parses a JSON response with a 0.0–1.0 score.
| Metric | Question it answers | Requires |
|---|---|---|
| Faithfulness | Is every claim in the answer supported by the context? | context |
| Relevance | Does the answer address what was asked? | question |
| Correctness | Does the answer match the known correct answer? | ground_truth |
| Context Precision | Are the retrieved context chunks actually relevant? | question, context |
| Context Recall | Do the contexts cover the information in the ground truth? | context, ground_truth |
| Factual Accuracy | Are there factual discrepancies between candidate and reference? | ground_truth |
```ruby
# Only context — gets faithfulness, relevance, context_precision
result = RubricLLM.evaluate(
  question: "How does photosynthesis work?",
  answer: "Plants convert sunlight into energy.",
  context: ["Photosynthesis is the process by which plants convert light energy into chemical energy."]
)

# With ground truth — gets all metrics
result = RubricLLM.evaluate(
  question: "How does photosynthesis work?",
  answer: "Plants convert sunlight into energy.",
  context: ["Photosynthesis is the process by which plants convert light energy into chemical energy."],
  ground_truth: "Plants use photosynthesis to convert sunlight, water, and CO2 into glucose and oxygen."
)
```
### Custom Metrics

```ruby
class ToneMetric < RubricLLM::Metrics::Base
  SYSTEM_PROMPT = "Rate professional tone from 0.0 to 1.0. Respond with JSON: {\"score\": 0.0, \"tone\": \"description\"}"

  def call(answer:, **)
    result = judge_eval(system_prompt: SYSTEM_PROMPT, user_prompt: "Answer: #{answer}")
    return { score: nil, details: result } unless result.is_a?(Hash) && result["score"]

    { score: Float(result["score"]), details: { tone: result["tone"] } }
  end
end

result = RubricLLM.evaluate(
  question: "q", answer: "a",
  metrics: [RubricLLM::Metrics::Faithfulness, ToneMetric]
)
result.scores[:tone_metric] # => 0.85
```
### Retrieval Metrics

Pure math — no LLM calls, no API key needed.

```ruby
result = RubricLLM.evaluate_retrieval(
  retrieved: ["doc_a", "doc_b", "doc_c", "doc_d"],
  relevant: ["doc_a", "doc_c"]
)

result.precision_at_k(3) # => 0.67
result.recall_at_k(3)    # => 1.0
result.mrr               # => 1.0
result.ndcg              # => 0.86
result.hit_rate          # => 1.0
```
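For reference, the standard information-retrieval definitions behind these numbers can be sketched in a few lines of plain Ruby (an illustration of the formulas, not the gem's implementation; `ndcg` is omitted since its exact gain/discount convention varies between libraries):

```ruby
retrieved = ["doc_a", "doc_b", "doc_c", "doc_d"]
relevant  = ["doc_a", "doc_c"]

# precision@k: fraction of the top-k retrieved docs that are relevant
def precision_at_k(retrieved, relevant, k)
  (retrieved.first(k) & relevant).size.to_f / k
end

# recall@k: fraction of all relevant docs found in the top k
def recall_at_k(retrieved, relevant, k)
  (retrieved.first(k) & relevant).size.to_f / relevant.size
end

# MRR: reciprocal rank of the first relevant doc (0.0 if none found)
def mrr(retrieved, relevant)
  rank = retrieved.index { |doc| relevant.include?(doc) }
  rank ? 1.0 / (rank + 1) : 0.0
end

precision_at_k(retrieved, relevant, 3).round(2) # => 0.67
recall_at_k(retrieved, relevant, 3)             # => 1.0
mrr(retrieved, relevant)                        # => 1.0
```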
## Batch Evaluation

Evaluate a dataset and get aggregate statistics:

```ruby
dataset = [
  { question: "What is Ruby?", answer: "A programming language.",
    context: ["Ruby is a dynamic language."], ground_truth: "Ruby is a programming language." },
  { question: "What is Rails?", answer: "A web framework.",
    context: ["Rails is a web framework for Ruby."], ground_truth: "Rails is a Ruby web framework." },
  # ...
]

report = RubricLLM.evaluate_batch(dataset)

# Speed up with concurrent evaluation (thread pool)
report = RubricLLM.evaluate_batch(dataset, concurrency: 4)

puts report.summary
# RubricLLM Evaluation Report
# ========================================
# Samples: 20
# Duration: 45.2s
# faithfulness mean=0.920 std=0.050 min=0.850 max=0.980 n=20

report.worst(3)                    # 3 lowest-scoring results
report.failures(threshold: 0.8)    # results below 0.8
report.export_csv("results.csv")   # export to CSV
report.export_json("results.json") # export to JSON
report.to_json                     # returns JSON string
```
## A/B Model Comparison

Compare two models with statistical significance testing:

```ruby
config_a = RubricLLM::Config.new(judge_model: "gpt-4o")
config_b = RubricLLM::Config.new(judge_model: "claude-sonnet-4-6")

report_a = RubricLLM.evaluate_batch(dataset, config: config_a)
report_b = RubricLLM.evaluate_batch(dataset, config: config_b)

comparison = RubricLLM.compare(report_a, report_b)
puts comparison.summary
# A/B Comparison
# ======================================================================
# Metric              A       B      Delta    p-value  Sig
# ----------------------------------------------------------------------
# faithfulness        0.880   0.920  +0.040   0.0230   *
# relevance           0.850   0.860  +0.010   0.4210
# correctness         0.910   0.940  +0.030   0.0089   **

comparison.significant_improvements # => [:faithfulness, :correctness]
comparison.significant_regressions  # => []
```

Significance markers: `*` (p < 0.05), `**` (p < 0.01), `***` (p < 0.001)
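The significance test is a paired t-test over per-sample scores. A hand-rolled sketch of the statistic it computes, using made-up scores (illustrative only; the gem's actual implementation and p-value lookup may differ):

```ruby
# Paired t-test statistic over per-sample metric scores from two runs.
scores_a = [0.85, 0.90, 0.88, 0.92, 0.87]
scores_b = [0.90, 0.93, 0.91, 0.95, 0.89]

diffs = scores_a.zip(scores_b).map { |a, b| b - a }
n     = diffs.size
mean  = diffs.sum / n
var   = diffs.sum { |d| (d - mean)**2 } / (n - 1)
t     = mean / Math.sqrt(var / n)
# Compare |t| against the t-distribution with n - 1 degrees of freedom
# to obtain the p-value shown in the summary table.
```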
## Test Integration

### Minitest

```ruby
require "rubric_llm/minitest"

class AdvisorTest < Minitest::Test
  include RubricLLM::Assertions

  def test_answer_is_faithful
    answer = my_llm.ask("What is Ruby?", context: docs)
    assert_faithful answer, docs, threshold: 0.8
  end

  def test_answer_is_correct
    answer = my_llm.ask("What is 2+2?")
    assert_correct answer, "4", threshold: 0.9
  end

  def test_no_hallucination
    answer = my_llm.ask("Summarize this", context: docs)
    refute_hallucination answer, docs
  end

  def test_answer_is_relevant
    answer = my_llm.ask("How do I deploy Rails?")
    assert_relevant "How do I deploy Rails?", answer, threshold: 0.7
  end
end
```
### RSpec

```ruby
require "rubric_llm/rspec"

RSpec.describe "My LLM" do
  include RubricLLM::RSpecMatchers

  let(:answer) { my_llm.ask(question, context: docs) }

  it { expect(answer).to be_faithful_to(docs).with_threshold(0.8) }
  it { expect(answer).to be_relevant_to(question) }
  it { expect(answer).to be_correct_for(expected_answer) }
  it { expect(answer).not_to hallucinate_from(docs) }
end
```
## Error Handling

```ruby
begin
  result = RubricLLM.evaluate(question: "q", answer: "a", context: ["c"])
rescue RubricLLM::JudgeError => e
  # LLM call failed (network, auth, rate limit)
  puts "Judge error: #{e.message}"
rescue RubricLLM::ConfigurationError => e
  # Invalid configuration
  puts "Config error: #{e.message}"
rescue RubricLLM::Error => e
  # Catch-all for any RubricLLM error
  puts "Error: #{e.message}"
end
```

Individual metric failures are handled gracefully — a failed metric returns `nil` for the score and includes the error in `details`:

```ruby
result = RubricLLM.evaluate(question: "q", answer: "a")
result.scores[:faithfulness]          # => nil (if judge failed)
result.details[:faithfulness][:error] # => "Judge call failed: ..."
result.overall                        # => mean of non-nil scores only
```
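The `overall` behavior described above amounts to a nil-filtered mean. A one-line sketch (illustrative, not the gem's source):

```ruby
# Mean of non-nil scores, matching the documented fallback behavior.
scores  = { faithfulness: nil, relevance: 0.92, correctness: 0.98 }
valid   = scores.values.compact
overall = valid.sum / valid.size
overall.round(2) # => 0.95
```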
## Development

```shell
bundle install
bundle exec rake test
bundle exec rubocop
```
## Limitations
RubricLLM uses LLM-as-Judge — an LLM scores another LLM's output. This is the industry-standard approach (used by Ragas, DeepEval, ARES), but it means the judge shares the same class of failure modes as the system being evaluated. If the judge hallucinates that an answer is faithful, you get a false positive.
Mitigations built into the framework:

- **Cross-model judging.** Configure a different model as judge than the one being evaluated. Don't let GPT-4o grade GPT-4o.
- **Retrieval metrics are pure math.** `precision_at_k`, `recall_at_k`, `mrr`, `ndcg` — no LLM involved, no judge bias.
- **Custom non-LLM metrics.** Subclass `Metrics::Base` with regex checks, embedding similarity, or any deterministic logic.
- **Statistical comparison.** A/B testing with paired t-tests surfaces systematic judge bias across runs.
For high-stakes evaluation, pair LLM-as-Judge metrics with retrieval metrics and periodic human review.
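As an example of a deterministic check, here is a hypothetical hedging-phrase detector, shown standalone for clarity (in practice it would subclass `RubricLLM::Metrics::Base` and return the same `{ score:, details: }` shape as the Custom Metrics example; the phrase list and scoring rule are invented for illustration):

```ruby
# Hypothetical deterministic metric: penalize hedging language. No LLM involved.
HEDGES = /\b(might|maybe|possibly|i think)\b/i

def hedging_score(answer)
  hits = answer.scan(HEDGES).size
  { score: hits.zero? ? 1.0 : 0.0, details: { hedge_count: hits } }
end

hedging_score("Paris is the capital of France.")[:score] # => 1.0
hedging_score("It might possibly be Paris.")[:score]     # => 0.0
```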
## Why RubricLLM?

Ruby has two other LLM evaluation options today. Neither fits most use cases:

| | eval-ruby | leva | RubricLLM |
|---|---|---|---|
| What it is | Generic RAG metrics | Rails engine with UI | Lightweight eval framework |
| LLM access | Raw HTTP (OpenAI/Anthropic only) | You implement it | RubyLLM (any provider) |
| Rails required? | No | Yes (engine + 6 migrations) | No |
| ActiveRecord? | No | Yes | No |
| A/B comparison | Basic | No | Paired t-test with p-values |
| Test assertions | Minitest + RSpec | No | Minitest + RSpec |
| Pluggable metrics | No (fixed set) | Yes | Yes |
| Retrieval metrics | Yes | No | Yes |
## Requirements
- Ruby >= 3.4
- ruby_llm ~> 1.0
- An API key for your chosen LLM provider (set via RubyLLM configuration)
## Contributing
Bug reports and pull requests are welcome on GitHub.