
DeepEval and Confident AI

Pricing

Free plan: basically enough for ordinary use

Starter plan: US$20/month


Quick Start

Install the latest version (see the official documentation):

pip install -U deepeval

Register for Confident AI, get an API key, and log in:

deepeval login --api-key xxxx

Create the file test_example.py:

from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

def test_answer_relevancy():
    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        # Replace this with the actual output of your LLM application
        actual_output="We offer a 30-day full refund at no extra cost."
    )
    assert_test(test_case, [answer_relevancy_metric])

The file name must begin with test_

Run the test:

deepeval test run test_example.py

Set the folder in which results are saved via an environment variable:

export DEEPEVAL_RESULTS_FOLDER="./deep-eval-results"

Options for deepeval test run

Parallel execution:

deepeval test run test_example.py -n 4

Use the cache:

deepeval test run test_example.py -c

Repeat each test:

deepeval test run test_example.py -r 2
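These options can be combined in a single invocation, e.g.:

deepeval test run test_example.py -n 4 -c -r 2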

Hook (run a function after the test run ends):

...

@deepeval.on_test_run_end
def function_to_be_called_after_test_run():
    print("Test finished!")

Conceptual Framework

Official documentation

Test Case: contains input / actual_output / retrieval_context
Dataset: a collection of Test Cases
Golden: unlike a Test Case, a Golden does not require actual_output at creation time
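A minimal Golden sketch (assuming the Golden class exported by deepeval.dataset, per the official docs):

from deepeval.dataset import EvaluationDataset, Golden

# A Golden carries the input and reference fields but no actual_output;
# the actual_output is produced later, when the dataset is evaluated
golden = Golden(input="What if these shoes don't fit?")
dataset = EvaluationDataset(goldens=[golden])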

An actual Test Case:

test_case = LLMTestCase(
    input="Who is the current president of the United States of America?",
    actual_output="Joe Biden",
    retrieval_context=["Joe Biden serves as the current president of America."]
)

Use Pytest for evaluation

import pytest
from deepeval import assert_test
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

dataset = EvaluationDataset(test_cases=[...])

@pytest.mark.parametrize(
    "test_case",
    dataset,
)
def test_customer_chatbot(test_case: LLMTestCase):
    answer_relevancy_metric = AnswerRelevancyMetric()
    assert_test(test_case, [answer_relevancy_metric])

@pytest.mark.parametrize is a decorator provided by Pytest. It simply iterates over the EvaluationDataset and evaluates each test case one by one.

Run without CLI

Documentation

# A hypothetical LLM application example
import chatbot
from deepeval import evaluate
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

# ...
test_cases = [first_test_case, second_test_case]

metric = HallucinationMetric(threshold=0.7)
evaluate(test_cases, [metric])
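For illustration, the elided test cases might look like the sketch below; chatbot.run is a hypothetical application call, and HallucinationMetric needs each test case to carry a context:

first_input = "What are your operating hours?"
first_test_case = LLMTestCase(
    input=first_input,
    actual_output=chatbot.run(first_input),  # hypothetical helper
    context=["Our company operates from 10 AM to 6 PM, Monday to Friday."],
)

second_input = "Do you offer free shipping?"
second_test_case = LLMTestCase(
    input=second_input,
    actual_output=chatbot.run(second_input),
    context=["Yes, we offer free shipping on orders over $50."],
)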

Test Case

A standard Test Case:

test_case = LLMTestCase(
    input="What if these shoes don't fit?",  # required
    expected_output="You're eligible for a 30 day refund at no extra cost.",
    actual_output="We offer a 30-day full refund at no extra cost.",  # required

    context=["All customers are eligible for a 30 day full refund at no extra cost."],  # reference value
    retrieval_context=["Only shoes can be refunded."],  # actual retrieval results

    latency=10.0
)

context is the ideal retrieval result for a given input, normally taken from the evaluation dataset; retrieval_context is the actual retrieval result returned by the LLM application.
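To make the distinction concrete: HallucinationMetric grades actual_output against context, while FaithfulnessMetric grades it against retrieval_context. A minimal sketch with the test case above:

from deepeval import evaluate
from deepeval.metrics import HallucinationMetric, FaithfulnessMetric

# HallucinationMetric reads test_case.context (the reference)
# FaithfulnessMetric reads test_case.retrieval_context (what the app retrieved)
evaluate([test_case], [HallucinationMetric(threshold=0.7), FaithfulnessMetric(threshold=0.7)])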

Dataset

Manually build a dataset and, if desired, push it to Confident AI:

from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset

# Raw data
original_dataset = [
    {
        "input": "What are your operating hours?",
        "actual_output": "...",
        "context": [
            "Our company operates from 10 AM to 6 PM, Monday to Friday.",
            "We are closed on weekends and public holidays.",
            "Our customer service is available 24/7.",
        ],
    },
    {
        "input": "Do you offer free shipping?",
        "actual_output": "...",
        "expected_output": "Yes, we offer free shipping on orders over $50.",
    },
    {
        "input": "What is your return policy?",
        "actual_output": "...",
    },
]

# Iterate over the raw data and build LLMTestCase instances
test_cases = []
for datapoint in original_dataset:
    input = datapoint.get("input", None)
    actual_output = datapoint.get("actual_output", None)
    expected_output = datapoint.get("expected_output", None)
    context = datapoint.get("context", None)

    test_case = LLMTestCase(
        input=input,
        actual_output=actual_output,
        expected_output=expected_output,
        context=context,
    )
    test_cases.append(test_case)

# Turn the list of LLMTestCases into an EvaluationDataset
dataset = EvaluationDataset(test_cases=test_cases)

# Push to Confident AI
dataset.push(alias="My Confident Dataset")

View the results in Confident AI's web UI.

Datasets can also be created manually in Confident AI's web UI.
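A dataset created in the web UI (or pushed earlier) can be pulled back by its alias; a minimal sketch, assuming the alias used above:

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="My Confident Dataset")  # fetches the dataset from Confident AI
print(dataset.test_cases)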

Chinese-language support

Manually modify all prompts under /Users/xxx/anaconda3/envs/LI311-b/lib/python3.11/site-packages/deepeval/synthesizer/template.py, appending:

6. `Rewritten Input` should be in Chinese.
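With the template edited, the Synthesizer should emit Chinese rewritten inputs. A minimal usage sketch (assuming the Synthesizer API from the official docs; the document path is hypothetical):

from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()
# Generates goldens (input + context, no actual_output) from local documents
synthesizer.generate_goldens_from_docs(document_paths=["knowledge_base.pdf"])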

Metric

Dimensions of Assessment

Accuracy and Semantic Similarity: comparison of the generated answer with the reference answer
Context Relevancy: relevance of the retrieved context to the query
Faithfulness: consistency of the generated answer with the retrieved context
Answer Relevancy: relevance of the generated answer to the query
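In deepeval these dimensions map roughly onto built-in metrics; a sketch with illustrative thresholds (answer accuracy against a reference answer can be expressed with a custom GEval metric):

from deepeval.metrics import (
    AnswerRelevancyMetric,
    ContextualRelevancyMetric,
    FaithfulnessMetric,
)

metrics = [
    ContextualRelevancyMetric(threshold=0.7),  # context relevancy
    FaithfulnessMetric(threshold=0.7),         # faithfulness
    AnswerRelevancyMetric(threshold=0.7),      # answer relevancy
]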

Confident AI