Adding MCP Evals with Opik #44
Open: czajkub wants to merge 89 commits into the-momentum:main from czajkub:opiktests
Commits
8a7b919 added sum to trend data (czajkub)
7795bf7 added device grouping to duckdb for test (czajkub)
a70831c added device as well to query (czajkub)
74c5428 ch and duck device/interval grouping (czajkub)
954654c docstring tweak (czajkub)
1afd7fe docstring improving (czajkub)
0e42919 remove debug code (czajkub)
c3cbcb6 standardise errors and change trend docstrings (czajkub)
18e16d0 add localhost support for parquet (czajkub)
3beeb1f Merge branch 'main' of https://github.com/czajkub/apple-health-mcp-se… (czajkub)
32e6dda Merge branch 'main' of https://github.com/czajkub/apple-health-mcp-se… (czajkub)
fab22eb remove debug from client (czajkub)
b619143 unterminated string (czajkub)
1c1678f remove debug and add fileserver example (czajkub)
bd0bb50 Update README.md (czajkub)
664bbad add fastapi to dev group (czajkub)
f0a90b1 Merge branch 'main' of https://github.com/czajkub/apple-health-mcp-se… (czajkub)
d6b01f8 Merge branch 'main' of https://github.com/czajkub/apple-health-mcp-se… (czajkub)
84b6428 Merge branch 'main' of https://github.com/czajkub/apple-health-mcp-se… (czajkub)
4204fc6 Merge branch 'tableschemas' of https://github.com/czajkub/apple-healt… (czajkub)
60e64bb workouts and stats added as pq files (czajkub)
e73004b concat check (czajkub)
4fa826d asfas (czajkub)
d60551b import fix (czajkub)
fabea70 is nto noene (czajkub)
554d9a5 tests (czajkub)
5b92857 order by sourcename + add unit tests for all queries from duckdb (czajkub)
5fa274e linting i think + change textvalue case + all unit tests added (czajkub)
504576b name fix (czajkub)
73c40ca stupid linter (czajkub)
8f0d515 inspector workflow (czajkub)
08d5f88 add uv (czajkub)
dc9a6d2 add e2e tests with llm judge and test workflow (czajkub)
befeebb workflow tweak (czajkub)
16b5250 hehehe (czajkub)
4d1889a commit (czajkub)
b337cbb ennvar (czajkub)
89e070b added inspector test (czajkub)
c26fe15 node uv (czajkub)
817e671 dev deps (czajkub)
9e068f4 test file paht (czajkub)
6dcc5a8 mcp path (czajkub)
4d33a70 opik added (czajkub)
6500bd5 opik costam (czajkub)
31f94b1 comemgnae (czajkub)
4a3356f curl mcp server (czajkub)
c6e7374 typo.. (czajkub)
cf0f139 linting + cleaned up tests + added action (czajkub)
2eba7c9 using composite action (czajkub)
68b5e22 added shell and removed api keys (czajkub)
4457709 added mcp composite to opik tests (czajkub)
bb99d6c debug (czajkub)
b9c1bb5 more debug (czajkub)
3face71 github guy fix (czajkub)
1128791 name fix typo (czajkub)
7ff2d57 workspace name fix (czajkub)
db279bc cleanup & linting (czajkub)
55525e1 split readme (czajkub)
243955a removing debug code & linting (czajkub)
fc45a4c new parquet file example (czajkub)
4a801dd changed test path (czajkub)
333a386 fixed trend data test (czajkub)
723d1cc test debug for pipeline (czajkub)
f3fa00e print test output in pipeline (czajkub)
7dca53a readme merge conflict (czajkub)
003cb32 Merge branch 'the-momentum:main' into opiktests (czajkub)
c5a5924 Create tests.md (czajkub)
f8afe7e rollback unstable changes (czajkub)
b3afd7c Merge branch 'opiktests' of https://github.com/czajkub/apple-health-m… (czajkub)
674d75d remove redundant tests (czajkub)
920a617 lint (czajkub)
2f38793 Update tests.md (czajkub)
a83535d Update tests.md (czajkub)
4402014 test improvement (czajkub)
08c1286 Merge branch 'opiktests' of https://github.com/czajkub/apple-health-m… (czajkub)
a8de0be add config for opik in .env (czajkub)
1116778 Merge branch 'the-momentum:main' into opiktests (czajkub)
47452da Update README.md (czajkub)
b0674e7 Update tests.md (czajkub)
2b8185e lint (czajkub)
883b51b Merge branch 'opiktests' of https://github.com/czajkub/apple-health-m… (czajkub)
082bab2 Merge branch 'main' into opiktests (czajkub)
06e5f92 add test for workouts and tweak current tests (czajkub)
c1b72a8 changed example file (czajkub)
fee3925 changed workflow example (czajkub)
08b3c4e changed pytest path (czajkub)
afe214d linter (czajkub)
39e9eee change composite action default value (czajkub)
2cacf9f Update tests.md (czajkub)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,34 @@ | ||
| name: 'MCP server setup' | ||
| description: 'Sync uv dependencies and run MCP server' | ||
|
|
||
|
|
||
| inputs: | ||
| DUCKDB_FILENAME: | ||
| description: 'path to duckdb file' | ||
| required: false | ||
| default: 'tests/duckdb.example' | ||
|
|
||
| runs: | ||
| using: "composite" | ||
| steps: | ||
| - name: Set up Python | ||
| uses: actions/setup-python@v5 | ||
| with: | ||
| python-version: '3.13' | ||
| - name: Install uv | ||
| run: curl -LsSf https://astral.sh/uv/install.sh | sh | ||
| shell: bash | ||
| - name: Install dependencies | ||
| run: uv sync --group dev | ||
| shell: bash | ||
| - name: Run fileserver | ||
| run: uv run --directory tests/ fileserver.py & | ||
| shell: bash | ||
| - name: Run mcp server | ||
| run: uv run fastmcp run -t http app/main.py & | ||
| env: | ||
| DUCKDB_FILENAME: ${{ inputs.DUCKDB_FILENAME }} | ||
| shell: bash | ||
| - name: Wait for mcp initialization | ||
| run: sleep 5 | ||
| shell: bash |
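The last step of the composite action waits a fixed five seconds for the MCP server to start. A possible alternative, sketched here under assumptions and not part of this PR, is to poll the port until it accepts connections. The helper name `wait_for_port` is hypothetical; the host and port would be the `127.0.0.1:8000` address that the inspector job connects to.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet (or connection refused): wait and retry.
            time.sleep(interval)
    return False
```

A workflow step could call `wait_for_port("127.0.0.1", 8000)` and fail the job when it returns `False`, which avoids both waiting longer than needed and proceeding before the server is up. Note that an open port only shows that the process is listening, not that the MCP endpoint is fully initialized.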
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,41 @@ | ||
| name: tests | ||
|
|
||
| on: [push] | ||
|
|
||
| env: | ||
| DUCKDB_FILENAME: tests/duckdb.example | ||
| OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }} | ||
| OPIK_API_KEY: ${{ secrets.OPIK_API_KEY }} | ||
| OPIK_WORKSPACE: ${{ secrets.OPIK_WORKSPACE }} | ||
|
|
||
| jobs: | ||
| test: | ||
| runs-on: ubuntu-latest | ||
| steps: | ||
| - uses: actions/checkout@v4 | ||
| - uses: ./.github/workflows/mcp-composite-action | ||
| with: | ||
| DUCKDB_FILENAME: 'tests/duckdb.example' | ||
| - name: Run tests | ||
| run: uv run --directory tests/ pytest query_tests.py | ||
| inspector: | ||
| runs-on: ubuntu-latest | ||
| steps: | ||
| - uses: actions/checkout@v5 | ||
| - uses: actions/setup-node@v4 | ||
| with: | ||
| node-version: '20' | ||
| - uses: ./.github/workflows/mcp-composite-action | ||
| with: | ||
| DUCKDB_FILENAME: 'tests/duckdb.example' | ||
| - name: Run inspector | ||
| run: npx @modelcontextprotocol/inspector --cli http://127.0.0.1:8000/mcp --method tools/list | ||
| opik: | ||
| runs-on: ubuntu-latest | ||
| steps: | ||
| - uses: actions/checkout@v5 | ||
| - uses: ./.github/workflows/mcp-composite-action | ||
| with: | ||
| DUCKDB_FILENAME: 'tests/duckdb.example' | ||
| - name: Run opik experiments | ||
| run: uv run tests/opik/tool_calls.py |
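The workflow above passes three secrets to the jobs through `env:`. In GitHub Actions an undefined secret expands to an empty string, so a missing key may only surface as a late failure inside the Opik run. The following is a minimal sketch of an early check, assuming it would run before the experiments; the function name `missing_env_vars` is hypothetical and not part of this PR.

```python
import os
from typing import List, Mapping, Optional, Sequence

# Variable names taken from the workflow's `env:` block.
REQUIRED_VARS = ("OPENAI_API_KEY", "OPIK_API_KEY", "OPIK_WORKSPACE")


def missing_env_vars(
    env: Optional[Mapping[str, str]] = None,
    required: Sequence[str] = REQUIRED_VARS,
) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    # Empty strings count as missing, since unset secrets arrive as "".
    return [name for name in required if not env.get(name)]
```

Calling `missing_env_vars()` with no arguments inspects the real process environment; a script could exit with an error listing the returned names when the list is non-empty.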
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,72 @@ | ||
| [← Back to README](../README.md) | ||
|
|
||
| ## Testing 🧪 | ||
|
|
||
| There are 3 types of tests in this project, all of which are included in the pipeline. | ||
|
|
||
| Every test runs on [pre-prepared mock Apple Health data](https://gist.github.com/czajkub/7ee7a01c35990f910f034f46dbf83b66). | ||
|
|
||
|
|
||
| ## Unit tests 🔧: | ||
| - Test importing XML data into a `.duckdb` file and the database calls made to DuckDB | ||
|
|
||
| ## MCP Inspector tests 🔍: | ||
| - Uses the [MCP Inspector](https://modelcontextprotocol.io/docs/tools/inspector) provided by Anthropic to test connection to the server hosted with streamable HTTP | ||
| - Mainly used in the pipeline, but can be run locally | ||
|
|
||
| ## Opik tests 🤖: | ||
| - End-to-end tests using an agent created from the [python-ai-kit](https://github.com/the-momentum/python-ai-kit) AI development kit | ||
| - Two types of tests: | ||
| - Checking whether the correct tool was called | ||
| - Judging the LLM's answer by three metrics: | ||
| - Answer relevancy: whether the answer is relevant to the user's question 🎯 | ||
| - Hallucination: whether the answer contains misleading or false information 🚫 | ||
| - Levenshtein ratio: heuristic measuring how similar the answer text is to the expected output 📊 | ||
|
|
||
| # How to run tests locally 💻: | ||
| - ### Unit tests 🔧: | ||
| ```bash | ||
| pytest tests/query_tests.py | ||
| ``` | ||
|
|
||
| Before running the next tests, make sure you have the server up and running: | ||
| ```bash | ||
| uv run fastmcp run -t http app/main.py | ||
| ``` | ||
|
|
||
| - ### Inspector tests 🔍: | ||
| ```bash | ||
| npx @modelcontextprotocol/inspector --cli http://localhost:8000/mcp --transport http --method tools/list | ||
| ``` | ||
|
|
||
| - ### Opik tests 🤖: | ||
| Make sure your `OPENAI_API_KEY`, `OPIK_WORKSPACE` and `OPIK_API_KEY` environment variables are set | ||
| (the Opik workspace is your profile name, not the project name) | ||
| ```bash | ||
| uv run tests/opik/tool_calls.py | ||
| ``` | ||
|
|
||
| ### How to run Opik tests in pipeline: | ||
| - Create an account on Opik if you haven't already | ||
| - Copy your `OPIK_API_KEY` and `OPIK_WORKSPACE` to GitHub secrets (the workflow also reads `OPENAI_API_KEY` from secrets) | ||
|
|
||
|
|
||
| To add new tests, you can either add them in code ([example from Opik](https://www.comet.com/docs/opik/evaluation/manage_datasets)): | ||
| ```python | ||
| import opik | ||
| # Get or create a dataset | ||
| client = opik.Opik() | ||
| dataset = client.get_or_create_dataset(name="My dataset") | ||
| # Add dataset items to it | ||
| dataset.insert([ | ||
| {"user_question": "Hello, world!", "expected_output": {"assistant_answer": "Hello, world!"}}, | ||
| {"user_question": "What is the capital of France?", "expected_output": {"assistant_answer": "Paris"}}, | ||
| ]) | ||
| ``` | ||
|
|
||
| Or add them on the Opik website: | ||
| <img width="1919" height="873" alt="image" src="https://github.com/user-attachments/assets/dc9f3807-40b4-4227-b4c2-5a1ea44396e7" /> | ||
|
|
||
| When adding tool call questions, make sure the `input` and `tool_call` values are present, and when adding output checks make sure `input` and `expected_output` are set correctly. | ||
|
|
||
| [← Back to README](../README.md) |
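The note in tests.md about required fields can be turned into a small pre-insert check. The sketch below is an assumption about how such a check might look, not code from this PR: the kind labels `"tool_call"` and `"output_check"`, the helper name `missing_keys`, and the sample item values are hypothetical, while the key names `input`, `tool_call` and `expected_output` come from the text.

```python
from typing import Dict, List, Tuple

# Required keys per item kind, as stated in tests.md.
# The kind labels themselves are hypothetical.
REQUIRED_KEYS: Dict[str, Tuple[str, ...]] = {
    "tool_call": ("input", "tool_call"),
    "output_check": ("input", "expected_output"),
}


def missing_keys(item: dict, kind: str) -> List[str]:
    """Return the required keys that are absent or empty in a dataset item."""
    return [key for key in REQUIRED_KEYS[kind] if not item.get(key)]


# Hypothetical items; the tool name is made up for illustration.
tool_item = {"input": "How many steps did I take last week?", "tool_call": "some_tool_name"}
output_item = {"input": "What is the capital of France?"}
```

Here `missing_keys(tool_item, "tool_call")` returns an empty list, while `missing_keys(output_item, "output_check")` reports `expected_output`, so incomplete items can be rejected before they reach the dataset.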
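To make the Levenshtein ratio metric listed in tests.md more concrete, here is a pure-Python sketch of one common definition: the edit distance normalized by the length of the longer string. This is an illustration only; the exact normalization used by Opik's metric may differ, and the function names are hypothetical.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity_ratio(a: str, b: str) -> float:
    """Return 1.0 for identical strings, lower values as more edits are needed."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein_distance(a, b) / longest
```

Because the score depends only on characters, two answers with the same meaning but different wording can score low, which is why it is paired with the LLM-judged metrics.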
Review comment (Collaborator, Author): generated from (old) template
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,81 @@ | ||
| import asyncio | ||
| import os | ||
|
|
||
| from pydantic_ai.models.openai import OpenAIChatModel | ||
| from pydantic_ai.providers.openai import OpenAIProvider | ||
| from pydantic_ai import Agent | ||
| from pydantic_ai.tools import Tool | ||
| from pydantic_ai.mcp import MCPServerStreamableHTTP | ||
|
|
||
| from dotenv import load_dotenv | ||
| import opik | ||
|
|
||
| load_dotenv() | ||
|
|
||
| class AgentManager: | ||
| def __init__(self): | ||
| self.agent: Agent | None = None | ||
| self.mcp_client: MCPServerStreamableHTTP | None = None | ||
| self.tools: list[Tool] | None = None | ||
| self._initialized = False | ||
|
|
||
| async def initialize(self, model: str = "gpt-4o", | ||
| system_prompt: str | None = None): | ||
| if self._initialized: | ||
| return | ||
| try: | ||
| self.mcp_client = MCPServerStreamableHTTP("http://localhost:8000/mcp") | ||
| except Exception as e: | ||
| self.mcp_client = None | ||
| raise ConnectionError("Could not connect to MCP server") from e | ||
|
|
||
| if system_prompt is None: | ||
| system_prompt = "You are an AI assistant to help the user as best as you can. You can use the tools provided to you to help the user." | ||
|
|
||
| self.agent = self._create_agent(model, system_prompt) | ||
| self._initialized = True | ||
|
|
||
| def _create_agent(self, model: str, system_prompt: str) -> Agent: | ||
| model = OpenAIChatModel(model, provider=OpenAIProvider(api_key=os.getenv("OPENAI_API_KEY"))) | ||
| return Agent( | ||
| model=model, | ||
| deps_type=dict[str, str], | ||
| system_prompt=system_prompt, | ||
| toolsets=[self.mcp_client], | ||
| output_type=str, | ||
| ) | ||
|
|
||
| @opik.track | ||
| async def handle_message(self, message: str) -> str: | ||
| if not self._initialized: | ||
| raise RuntimeError("Agent not initialized. Call initialize() first.") | ||
|
|
||
| async with self.agent: | ||
| result = await self.agent.run(message) | ||
| return result.output | ||
|
|
||
| def is_initialized(self) -> bool: | ||
| return self._initialized | ||
|
|
||
|
|
||
|
|
||
| agent_manager = AgentManager() | ||
|
|
||
|
|
||
| async def main(): | ||
| await agent_manager.initialize() | ||
|
|
||
| try: | ||
| while True: | ||
| user_input = input("Enter your message: ") | ||
| if user_input == "exit": | ||
| break | ||
| print("User: ", user_input) | ||
| response = await agent_manager.handle_message(user_input) | ||
| print("Agent: ", response) | ||
| finally: | ||
| print("Closing agent") | ||
|
|
||
|
|
||
| if __name__ == "__main__": | ||
| asyncio.run(main()) |
Review comment (Collaborator, Author): generated from gist
Binary file not shown.
pydantic-ai is for the agent used in opik tests