# RubricLLM
Lightweight LLM evaluation framework for Ruby, inspired by Ragas, powered by RubyLLM.
Provider-agnostic evaluation with pluggable metrics, statistical A/B comparison, and test framework integration — no Rails, no ActiveRecord, no UI. Works anywhere Ruby runs.
## Installation

Add to your Gemfile:

```ruby
gem "rubric_llm"
```

Or install directly:

```shell
gem install rubric_llm
```
## Quick Start

```ruby
require "rubric_llm"

RubricLLM.configure do |c|
  c.judge_model = "gpt-4o"
  c.judge_provider = :openai
end

result = RubricLLM.evaluate(
  question: "What is the capital of France?",
  answer: "The capital of France is Paris, located on the Seine river.",
  context: ["Paris is the capital and largest city of France."],
  ground_truth: "Paris"
)

result.faithfulness # => 0.95
result.relevance    # => 0.92
result.correctness  # => 0.98
result.overall      # => 0.94
result.pass?        # => true
```
## Configuration

### Global

```ruby
RubricLLM.configure do |c|
  c.judge_model = "gpt-4o"   # any model RubyLLM supports
  c.judge_provider = :openai # :openai, :anthropic, :gemini, etc.
  c.temperature = 0.0        # deterministic scoring (default)
  c.max_tokens = 4096        # max tokens for judge response
end
```
### Environment Variables

All config fields can be set via environment variables:

| Variable | Default | Description |
|---|---|---|
| `RUBRIC_JUDGE_MODEL` | `gpt-4o` | Judge LLM model name |
| `RUBRIC_JUDGE_PROVIDER` | `openai` | RubyLLM provider |
| `RUBRIC_TEMPERATURE` | `0.0` | Judge temperature |
| `RUBRIC_MAX_TOKENS` | `4096` | Max response tokens |
| `RUBRIC_MAX_RETRIES` | `2` | Max retries on transient failures |
| `RUBRIC_RETRY_BASE_DELAY` | `1.0` | Base delay (seconds) for exponential backoff |
| `RUBRIC_CONCURRENCY` | `1` | Thread pool size for batch evaluation |

```ruby
# Reads all RUBRIC_* env vars automatically
config = RubricLLM::Config.from_env
```
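The two retry settings combine as exponential backoff: each retry waits longer than the last. A minimal sketch of the resulting delay schedule (the doubling formula here is an assumption for illustration, not taken from the gem's source):

```ruby
# Illustrative delay schedule from the retry settings (hypothetical formula).
base_delay  = 1.0 # RUBRIC_RETRY_BASE_DELAY
max_retries = 2   # RUBRIC_MAX_RETRIES

delays = (0...max_retries).map { |attempt| base_delay * (2**attempt) }
delays # => [1.0, 2.0] — wait 1s before the first retry, 2s before the second
```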
### Per-Evaluation Override

```ruby
custom = RubricLLM::Config.new(judge_model: "claude-haiku-4-5", judge_provider: :anthropic)

result = RubricLLM.evaluate(question: "...", answer: "...", config: custom)
report = RubricLLM.evaluate_batch(dataset, config: custom)
```
### Rails Setup

```ruby
# config/initializers/rubric_llm.rb
RubricLLM.configure do |c|
  c.judge_model = "gpt-4o"
  c.judge_provider = :openai
end
```
## Metrics

### LLM-as-Judge Metrics
These metrics use a judge LLM to evaluate quality. Each sends a structured prompt and parses a JSON response with a 0.0–1.0 score.
| Metric | Question it answers | Requires |
|---|---|---|
| Faithfulness | Is every claim in the answer supported by the context? | context |
| Relevance | Does the answer address what was asked? | question |
| Correctness | Does the answer match the known correct answer? | ground_truth |
| Context Precision | Are the retrieved context chunks actually relevant? | question, context |
| Context Recall | Do the contexts cover the information in the ground truth? | context, ground_truth |
| Factual Accuracy | Are there factual discrepancies between candidate and reference? | ground_truth |
```ruby
# Only context — gets faithfulness, relevance, context_precision
result = RubricLLM.evaluate(
  question: "How does photosynthesis work?",
  answer: "Plants convert sunlight into energy.",
  context: ["Photosynthesis is the process by which plants convert light energy into chemical energy."]
)

# With ground truth — gets all metrics
result = RubricLLM.evaluate(
  question: "How does photosynthesis work?",
  answer: "Plants convert sunlight into energy.",
  context: ["Photosynthesis is the process by which plants convert light energy into chemical energy."],
  ground_truth: "Plants use photosynthesis to convert sunlight, water, and CO2 into glucose and oxygen."
)
```
### Custom Metrics

```ruby
class ToneMetric < RubricLLM::Metrics::Base
  SYSTEM_PROMPT = "Rate professional tone from 0.0 to 1.0. Respond with JSON: {\"score\": 0.0, \"tone\": \"description\"}"

  def call(answer:, **)
    result = judge_eval(system_prompt: SYSTEM_PROMPT, user_prompt: "Answer: #{answer}")
    return { score: nil, details: result } unless result.is_a?(Hash) && result["score"]

    { score: Float(result["score"]), details: { tone: result["tone"] } }
  end
end

result = RubricLLM.evaluate(
  question: "q", answer: "a",
  metrics: [RubricLLM::Metrics::Faithfulness, ToneMetric]
)
result.scores[:tone_metric] # => 0.85
```
### Retrieval Metrics

Pure math — no LLM calls, no API key needed.

```ruby
result = RubricLLM.evaluate_retrieval(
  retrieved: ["doc_a", "doc_b", "doc_c", "doc_d"],
  relevant: ["doc_a", "doc_c"]
)

result.precision_at_k(3) # => 0.67
result.recall_at_k(3)    # => 1.0
result.mrr               # => 1.0
result.ndcg              # => 0.86
result.hit_rate          # => 1.0
```
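For reference, the standard information-retrieval definitions behind these numbers can be sketched in a few lines of plain Ruby (an illustration of the formulas, not the gem's implementation; `ndcg` is omitted since its exact gain/discount convention varies between libraries):

```ruby
retrieved = ["doc_a", "doc_b", "doc_c", "doc_d"]
relevant  = ["doc_a", "doc_c"]

# precision@k: fraction of the top-k retrieved docs that are relevant
def precision_at_k(retrieved, relevant, k)
  (retrieved.first(k) & relevant).size.to_f / k
end

# recall@k: fraction of all relevant docs found in the top k
def recall_at_k(retrieved, relevant, k)
  (retrieved.first(k) & relevant).size.to_f / relevant.size
end

# MRR: reciprocal rank of the first relevant doc (0.0 if none found)
def mrr(retrieved, relevant)
  rank = retrieved.index { |doc| relevant.include?(doc) }
  rank ? 1.0 / (rank + 1) : 0.0
end

precision_at_k(retrieved, relevant, 3).round(2) # => 0.67
recall_at_k(retrieved, relevant, 3)             # => 1.0
mrr(retrieved, relevant)                        # => 1.0
```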
## Batch Evaluation

Evaluate a dataset and get aggregate statistics:

```ruby
dataset = [
  { question: "What is Ruby?", answer: "A programming language.",
    context: ["Ruby is a dynamic language."], ground_truth: "Ruby is a programming language." },
  { question: "What is Rails?", answer: "A web framework.",
    context: ["Rails is a web framework for Ruby."], ground_truth: "Rails is a Ruby web framework." },
  # ...
]

report = RubricLLM.evaluate_batch(dataset)

# Speed up with concurrent evaluation (thread pool)
report = RubricLLM.evaluate_batch(dataset, concurrency: 4)

puts report.summary
# RubricLLM Evaluation Report
# ========================================
# Samples: 20
# Duration: 45.2s
# faithfulness mean=0.920 std=0.050 min=0.850 max=0.980 n=20

report.worst(3)                    # 3 lowest-scoring results
report.failures(threshold: 0.8)    # results below 0.8
report.export_csv("results.csv")   # export to CSV
report.export_json("results.json") # export to JSON
report.to_json                     # returns JSON string
```
## A/B Model Comparison

Compare two models with statistical significance testing:

```ruby
config_a = RubricLLM::Config.new(judge_model: "gpt-4o")
config_b = RubricLLM::Config.new(judge_model: "claude-sonnet-4-6")

report_a = RubricLLM.evaluate_batch(dataset, config: config_a)
report_b = RubricLLM.evaluate_batch(dataset, config: config_b)

comparison = RubricLLM.compare(report_a, report_b)
puts comparison.summary
# A/B Comparison
# ======================================================================
# Metric              A       B      Delta    p-value  Sig
# ----------------------------------------------------------------------
# faithfulness        0.880   0.920  +0.040   0.0230   *
# relevance           0.850   0.860  +0.010   0.4210
# correctness         0.910   0.940  +0.030   0.0089   **

comparison.significant_improvements # => [:faithfulness, :correctness]
comparison.significant_regressions  # => []
```

Significance markers: `*` (p < 0.05), `**` (p < 0.01), `***` (p < 0.001)
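The significance test is a paired t-test over per-sample scores. A hand-rolled sketch of the statistic it computes, using made-up scores (illustrative only; the gem's actual implementation and p-value lookup may differ):

```ruby
# Paired t-test statistic over per-sample metric scores from two runs.
scores_a = [0.85, 0.90, 0.88, 0.92, 0.87]
scores_b = [0.90, 0.93, 0.91, 0.95, 0.89]

diffs = scores_a.zip(scores_b).map { |a, b| b - a }
n     = diffs.size
mean  = diffs.sum / n
var   = diffs.sum { |d| (d - mean)**2 } / (n - 1)
t     = mean / Math.sqrt(var / n)
# Compare |t| against the t-distribution with n - 1 degrees of freedom
# to obtain the p-value shown in the summary table.
```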
## Test Integration

### Minitest

```ruby
require "rubric_llm/minitest"

class AdvisorTest < Minitest::Test
  include RubricLLM::Assertions

  def test_answer_is_faithful
    answer = my_llm.ask("What is Ruby?", context: docs)
    assert_faithful answer, docs, threshold: 0.8
  end

  def test_answer_is_correct
    answer = my_llm.ask("What is 2+2?")
    assert_correct answer, "4", threshold: 0.9
  end

  def test_no_hallucination
    answer = my_llm.ask("Summarize this", context: docs)
    refute_hallucination answer, docs
  end

  def test_answer_is_relevant
    answer = my_llm.ask("How do I deploy Rails?")
    assert_relevant "How do I deploy Rails?", answer, threshold: 0.7
  end
end
```
### RSpec

```ruby
require "rubric_llm/rspec"

RSpec.describe "My LLM" do
  include RubricLLM::RSpecMatchers

  let(:answer) { my_llm.ask(question, context: docs) }

  it { expect(answer).to be_faithful_to(docs).with_threshold(0.8) }
  it { expect(answer).to be_relevant_to(question) }
  it { expect(answer).to be_correct_for(expected_answer) }
  it { expect(answer).not_to hallucinate_from(docs) }
end
```
## Error Handling

```ruby
begin
  result = RubricLLM.evaluate(question: "q", answer: "a", context: ["c"])
rescue RubricLLM::JudgeError => e
  # LLM call failed (network, auth, rate limit)
  puts "Judge error: #{e.message}"
rescue RubricLLM::ConfigurationError => e
  # Invalid configuration
  puts "Config error: #{e.message}"
rescue RubricLLM::Error => e
  # Catch-all for any RubricLLM error
  puts "Error: #{e.message}"
end
```

Individual metric failures are handled gracefully — a failed metric returns `nil` for the score and includes the error in `details`:

```ruby
result = RubricLLM.evaluate(question: "q", answer: "a")
result.scores[:faithfulness]          # => nil (if judge failed)
result.details[:faithfulness][:error] # => "Judge call failed: ..."
result.overall                        # => mean of non-nil scores only
```
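The `overall` behavior described above amounts to a nil-filtered mean. A one-line sketch (illustrative, not the gem's source):

```ruby
# Mean of non-nil scores, matching the documented fallback behavior.
scores  = { faithfulness: nil, relevance: 0.92, correctness: 0.98 }
valid   = scores.values.compact
overall = valid.sum / valid.size
overall.round(2) # => 0.95
```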
## Development

```shell
bundle install
bundle exec rake test
bundle exec rubocop
```
## Limitations
RubricLLM uses LLM-as-Judge — an LLM scores another LLM's output. This is the industry-standard approach (used by Ragas, DeepEval, ARES), but it means the judge shares the same class of failure modes as the system being evaluated. If the judge hallucinates that an answer is faithful, you get a false positive.
Mitigations built into the framework:

- **Cross-model judging.** Configure a different model as judge than the one being evaluated. Don't let GPT-4o grade GPT-4o.
- **Retrieval metrics are pure math.** `precision_at_k`, `recall_at_k`, `mrr`, `ndcg` — no LLM involved, no judge bias.
- **Custom non-LLM metrics.** Subclass `Metrics::Base` with regex checks, embedding similarity, or any deterministic logic.
- **Statistical comparison.** A/B testing with paired t-tests surfaces systematic judge bias across runs.
For high-stakes evaluation, pair LLM-as-Judge metrics with retrieval metrics and periodic human review.
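As an example of a deterministic check, here is a hypothetical hedging-phrase detector, shown standalone for clarity (in practice it would subclass `RubricLLM::Metrics::Base` and return the same `{ score:, details: }` shape as the Custom Metrics example; the phrase list and scoring rule are invented for illustration):

```ruby
# Hypothetical deterministic metric: penalize hedging language. No LLM involved.
HEDGES = /\b(might|maybe|possibly|i think)\b/i

def hedging_score(answer)
  hits = answer.scan(HEDGES).size
  { score: hits.zero? ? 1.0 : 0.0, details: { hedge_count: hits } }
end

hedging_score("Paris is the capital of France.")[:score] # => 1.0
hedging_score("It might possibly be Paris.")[:score]     # => 0.0
```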
## Why RubricLLM?

Ruby has two other LLM evaluation options today. Neither fits most use cases:

| | eval-ruby | leva | RubricLLM |
|---|---|---|---|
| What it is | Generic RAG metrics | Rails engine with UI | Lightweight eval framework |
| LLM access | Raw HTTP (OpenAI/Anthropic only) | You implement it | RubyLLM (any provider) |
| Rails required? | No | Yes (engine + 6 migrations) | No |
| ActiveRecord? | No | Yes | No |
| A/B comparison | Basic | No | Paired t-test with p-values |
| Test assertions | Minitest + RSpec | No | Minitest + RSpec |
| Pluggable metrics | No (fixed set) | Yes | Yes |
| Retrieval metrics | Yes | No | Yes |
## Requirements
- Ruby >= 3.4
- ruby_llm ~> 1.0
- An API key for your chosen LLM provider (set via RubyLLM configuration)
## Contributing
Bug reports and pull requests are welcome on GitHub.