Collaborator

@czajkub czajkub commented Sep 30, 2025

This pull request adds evaluations and tests for this server using Opik. The tests are run on a prepared dataset and include:

  • Unit tests for the database calls behind each tool, run against DuckDB (in query_tests.py)
  • A (redundant) e2e test template that uses LLM-as-a-judge and checks tool calls (in e2e_tests.py)
  • Opik experiments (in opik/tool_calls.py): answers to a set of questions are evaluated in Opik and judged on metrics such as hallucination and answer relevancy
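To illustrate what "checking tool calls" can look like, here is a minimal sketch in plain Python. It is not the code in e2e_tests.py: the `ToolCall` shape, the `missing_tool_calls` helper, and the tool names are all hypothetical and only show the idea of comparing the tools an agent was expected to use against the ones it actually called.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One tool invocation recorded from an agent run (hypothetical shape)."""
    name: str
    args: dict = field(default_factory=dict)


def missing_tool_calls(expected: list[str], actual: list[ToolCall]) -> list[str]:
    """Return the expected tool names that never appear in the recorded calls."""
    called = {call.name for call in actual}
    return [name for name in expected if name not in called]


# Hypothetical recorded run; these tool names are illustrative only.
recorded = [
    ToolCall(name="query_records", args={"limit": 10}),
    ToolCall(name="summarize_records"),
]

# An empty list means every expected tool was called at least once.
print(missing_tool_calls(["query_records"], recorded))
print(missing_tool_calls(["query_records", "export_records"], recorded))
```

A check like this only covers which tools were called, not argument values or call order; those would need stricter comparisons.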

In addition, the unit tests and Opik tests are added to GitHub Actions; however, an Opik API key and workspace name need to be set in secrets.
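A small pre-flight check can make a missing secret fail fast with a clear message. The sketch below assumes the key and workspace are exposed as the environment variables `OPIK_API_KEY` and `OPIK_WORKSPACE`; the actual secret names used in this workflow are not stated here, and `missing_opik_settings` is a hypothetical helper.

```python
import os

# Assumed variable names; adjust to whatever the workflow actually exports.
REQUIRED_VARS = ("OPIK_API_KEY", "OPIK_WORKSPACE")


def missing_opik_settings(env=None) -> list[str]:
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_opik_settings()
    if missing:
        raise SystemExit(f"Missing Opik settings: {', '.join(missing)}")
    print("Opik settings found.")
```

The same check works locally if the variables are loaded from a .env file before the tests start.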

Opik experiment result example:
image

Actions results:
image
image

It is also possible to show the results of individual tests in the pipeline instead of averages (for Opik).

also change parquetpath to path and add .parquet suffix to the path in config
Collaborator

@KaliszS KaliszS left a comment

We need instructions for running those tests locally, without Actions. To achieve that, we need to update .env.example and the README.

dev = [
"fastapi>=0.116.2",
"opik>=1.8.56",
"pydantic-ai>=1.0.10",
Collaborator Author

pydantic-ai is for the agent used in opik tests

Collaborator Author

generated from (old) template

Collaborator Author

generated from gist
