Tool System (MCP-Based)
ProbeLLM’s tool system is the core extensibility layer, allowing you to inject custom test generation strategies without modifying the search engine.
Overview
Design Goals:
Pluggable: Add/remove tools at runtime
Standardized: Follow Model Context Protocol (MCP)
Type-Safe: JSON Schema validation for inputs
Composable: Tools can call other tools
Architecture:
User Code
│
▼
ToolRegistry (central dispatcher)
│
├──> LocalMCPTool("perturbation", handler_fn)
├──> LocalMCPTool("python_exec", handler_fn)
├──> LocalMCPTool("web_search", handler_fn)
└──> LocalMCPTool("my_custom_tool", handler_fn)
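For orientation, here is what that dispatch pattern boils down to. The class below is a toy stand-in, not ProbeLLM's actual ToolRegistry; it assumes only that each tool pairs an MCP-style spec (with a name field) with a handler callable.
from typing import Callable

class ToyRegistry:
    """Illustrative dispatcher: maps tool names to handler functions."""
    def __init__(self):
        self._tools: dict[str, Callable[[dict], dict]] = {}

    def register(self, spec: dict, handler: Callable[[dict], dict]) -> None:
        self._tools[spec["name"]] = handler

    def call_tool(self, name: str, arguments: dict) -> dict:
        if name not in self._tools:
            return {"error": f"unknown tool: {name}"}
        return self._tools[name](arguments)

toy_registry = ToyRegistry()
toy_registry.register({"name": "echo"}, lambda args: {"echoed": args})
print(toy_registry.call_tool("echo", {"q": "What is 2+2?"}))  # {'echoed': {'q': 'What is 2+2?'}}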
Built-in Tools
perturbation
Purpose: Generate semantic-preserving variations
Input Schema:
{
    "input": str,        # Original question
    "expected": str,     # Ground truth
    "operations": list,  # ["paraphrase", "reformulate"]
    "forms": list,       # ["multiple_choice", "true_false"]
    "num_variants": int  # Number to generate
}
Output:
{
    "variants": [
        {
            "operation": "paraphrase",
            "form": "free_text",
            "text": "Reworded question...",
            "rationale": "..."
        },
        {
            "operation": "reformulate",
            "form": "multiple_choice",
            "text": "Question stem...",
            "options": ["A", "B", "C", "D"],
            "answer_key": "B",
            "rationale": "..."
        }
    ]
}
Example:
from probellm.tools import build_default_tool_registry
registry = build_default_tool_registry(model="gpt-5.2")
result = registry.call_tool("perturbation", {
    "input": "What is 2+2?",
    "expected": "4",
    "operations": ["paraphrase"],
    "num_variants": 3
})

for variant in result["variants"]:
    print(variant["text"])
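If the request also sets forms, multiple-choice variants carry the extra options and answer_key fields from the output schema above, so consumers typically branch on form. A short sketch:
for variant in result["variants"]:
    if variant["form"] == "multiple_choice":
        # Render the stem plus labeled options; answer_key marks the correct one
        print(variant["text"])
        for label, option in zip("ABCD", variant["options"]):
            print(f"  {label}. {option}")
        print(f"  (answer: {variant['answer_key']})")
    else:
        print(variant["text"])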
python_exec
Purpose: Execute Python code for computational/algorithmic questions
Input Schema:
{
    "code": str,         # Python code to execute
    "purpose": str,      # Description (for error correction)
    "max_retries": int,  # Auto-fix attempts (default: 3)
    "timeout_sec": int   # Execution timeout (default: 6)
}
Output:
{
    "success": bool,
    "stdout": str,      # Captured output
    "stderr": str,      # Error messages
    "returncode": int,
    "fix_tokens": int   # Tokens used in retry attempts
}
Safety Features:
Sandbox: Runs in a temporary directory with the -I -B -S interpreter flags (see the sketch below)
Timeout: The process is killed after timeout_sec seconds
Auto-repair: If execution fails, the error is sent to the LLM for a fix (up to max_retries attempts)
Standard library only: No numpy, pandas, etc.
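The handler's internals are not shown here, but those flags map onto a standard subprocess pattern. A minimal sketch of such a sandbox (illustrative only, not ProbeLLM's actual implementation; retries and token accounting omitted):
import os
import subprocess
import sys
import tempfile

def run_sandboxed(code: str, timeout_sec: int = 6) -> dict:
    """Run code in an isolated interpreter inside a temp dir, with a hard timeout."""
    with tempfile.TemporaryDirectory() as workdir:
        script = os.path.join(workdir, "snippet.py")
        with open(script, "w") as f:
            f.write(code)
        try:
            # -I isolated mode, -B no .pyc files, -S skip the site module
            proc = subprocess.run(
                [sys.executable, "-I", "-B", "-S", script],
                capture_output=True, text=True, timeout=timeout_sec, cwd=workdir,
            )
        except subprocess.TimeoutExpired:
            return {"success": False, "stdout": "", "stderr": "timeout", "returncode": -1}
        return {
            "success": proc.returncode == 0,
            "stdout": proc.stdout,
            "stderr": proc.stderr,
            "returncode": proc.returncode,
        }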
Example:
result = registry.call_tool("python_exec", {
    "code": "import math\nresult = math.factorial(5)\nprint(result)",
    "purpose": "Calculate 5 factorial"
})

if result["success"]:
    print(result["stdout"])  # "120"
web_search
Purpose: Retrieve external knowledge via OpenAI Responses API
Input Schema:
{
    "topic": str,       # Search query
    "max_results": int  # Max results to retrieve (default: 5)
}
Output:
{
    "question": str,
    "answer": str,
    "evidence": [
        {"url": str, "title": str, "quote": str},
        ...
    ],
    "citations_validation": {"valid": bool, "reason": str}
}
Example:
result = registry.call_tool("web_search", {
    "topic": "quantum entanglement",
    "max_results": 3
})

print(result["answer"])
for ev in result["evidence"]:
    print(f"- {ev['title']}: {ev['url']}")
Custom Tool Development
Step 1: Define Tool Specification
spec = {
    "name": "my_domain_tool",
    "description": "Generates biology-specific test cases",
    "inputSchema": {
        "type": "object",
        "properties": {
            "topic": {"type": "string", "description": "Biology topic"},
            "difficulty": {"type": "string", "enum": ["easy", "hard"]}
        },
        "required": ["topic"]
    }
}
Step 2: Implement Handler Function
def biology_tool_handler(arguments: dict) -> dict:
    topic = arguments.get("topic", "")
    difficulty = arguments.get("difficulty", "easy")

    # Your custom logic here
    # - Could call external APIs
    # - Query specialized databases
    # - Use domain-specific LLM prompts
    questions = generate_biology_questions(topic, difficulty)

    return {"questions": questions}
Step 3: Register Tool
from probellm.tools import ToolRegistry, LocalMCPTool
registry = ToolRegistry()
registry.register(LocalMCPTool(spec, biology_tool_handler))
Step 4: Use in Pipeline
from probellm import VulnerabilityPipelineAsync
pipeline = VulnerabilityPipelineAsync(
    model_name="gpt-5.2",
    test_model="gpt-4o-mini",
    tool_registry=registry  # Inject custom registry
)
# Now test generation can use "my_domain_tool"
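A quick call through the registry confirms the wiring before a full run (photosynthesis here is just a sample topic):
result = registry.call_tool("my_domain_tool", {"topic": "photosynthesis"})
print(result["questions"])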
Advanced: Tool Composition
Tools can call other tools via the registry:
def composite_tool_handler(arguments: dict, registry: ToolRegistry) -> dict:
    # Step 1: Use web_search to get context
    web_result = registry.call_tool("web_search", {
        "topic": arguments["topic"]
    })

    # Step 2: Use python_exec to process the data
    # (generate_processing_code is a user-supplied helper, not shown)
    code = generate_processing_code(web_result)
    exec_result = registry.call_tool("python_exec", {"code": code})

    # Step 3: Synthesize the final result
    return {"processed_data": exec_result["stdout"]}
# Register with registry access
spec = {...}
tool = LocalMCPTool(spec, lambda args: composite_tool_handler(args, registry))
registry.register(tool)
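The lambda closes over registry so the handler keeps the single-argument signature that LocalMCPTool expects while still being able to dispatch to other tools.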
Tool Selection in MCTS
During expansion, TestCaseGenerator automatically:
Calls the LLM to select an appropriate tool
Executes the tool via ToolRegistry.call_tool()
Uses the tool output to synthesize a test case
You don’t need to modify search logic — just register your tool and it becomes available for selection.
Planner sees:
Available tools: perturbation, python_exec, web_search, my_domain_tool
Base QA: {...}
Choose the best tool and provide arguments.
Best Practices
Validation: Validate inputs using JSON Schema
Error Handling: Return {"error": ...} instead of raising exceptions (see the sketch below)
Idempotency: Same inputs → same outputs (for reproducibility)
Documentation: Write clear description and inputSchema property descriptions
Logging: Use print() for debugging (output is captured in logs)
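The first two practices combine naturally into a defensive handler. A minimal sketch, assuming the third-party jsonschema package for validation:
from jsonschema import ValidationError, validate

def safe_handler(arguments: dict) -> dict:
    schema = {
        "type": "object",
        "properties": {"topic": {"type": "string"}},
        "required": ["topic"],
    }
    try:
        validate(instance=arguments, schema=schema)  # validate inputs up front
        return {"questions": [f"Explain {arguments['topic']}."]}
    except ValidationError as e:
        return {"error": f"invalid arguments: {e.message}"}  # never raise to the caller
    except Exception as e:
        return {"error": str(e)}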
See Also
Custom Tool Integration Guide: Step-by-step tutorial
API reference: Full API reference
Core Concepts: Tool selection strategy