Tool System (MCP-Based)

ProbeLLM’s tool system is the core extensibility layer, allowing you to inject custom test generation strategies without modifying the search engine.

Overview

Design Goals:

  1. Pluggable: Add/remove tools at runtime

  2. Standardized: Follow Model Context Protocol (MCP)

  3. Type-Safe: JSON Schema validation for inputs

  4. Composable: Tools can call other tools

Architecture:

User Code
   │
   ▼
ToolRegistry (central dispatcher)
   │
   ├──> LocalMCPTool("perturbation", handler_fn)
   ├──> LocalMCPTool("python_exec", handler_fn)
   ├──> LocalMCPTool("web_search", handler_fn)
   └──> LocalMCPTool("my_custom_tool", handler_fn)
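
Under the hood, the registry is a plain name-to-handler dispatcher. A minimal sketch of the pattern (illustrative only, not ProbeLLM's actual source; it assumes the LocalMCPTool(spec, handler) constructor shown under Custom Tool Development):

from typing import Callable

class LocalMCPTool:
    """Pairs an MCP-style spec with a local handler function."""
    def __init__(self, spec: dict, handler: Callable[[dict], dict]):
        self.spec = spec
        self.handler = handler

class ToolRegistry:
    """Central dispatcher: routes call_tool() to the named handler."""
    def __init__(self) -> None:
        self._tools: dict[str, LocalMCPTool] = {}

    def register(self, tool: LocalMCPTool) -> None:
        self._tools[tool.spec["name"]] = tool

    def call_tool(self, name: str, arguments: dict) -> dict:
        if name not in self._tools:
            # Error payload rather than an exception (see Best Practices)
            return {"error": f"unknown tool: {name}"}
        return self._tools[name].handler(arguments)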

Built-in Tools

perturbation

Purpose: Generate semantics-preserving variations of a base question

Input Schema:

{
  "input": str,          # Original question
  "expected": str,       # Ground truth
  "operations": list,    # ["paraphrase", "reformulate"]
  "forms": list,         # ["multiple_choice", "true_false"]
  "num_variants": int    # Number to generate
}

Output:

{
  "variants": [
    {
      "operation": "paraphrase",
      "form": "free_text",
      "text": "Reworded question...",
      "rationale": "..."
    },
    {
      "operation": "reformulate",
      "form": "multiple_choice",
      "text": "Question stem...",
      "options": ["A", "B", "C", "D"],
      "answer_key": "B",
      "rationale": "..."
    }
  ]
}

Example:

from probellm.tools import build_default_tool_registry

registry = build_default_tool_registry(model="gpt-5.2")
result = registry.call_tool("perturbation", {
    "input": "What is 2+2?",
    "expected": "4",
    "operations": ["paraphrase"],
    "num_variants": 3
})

for variant in result["variants"]:
    print(variant["text"])

python_exec

Purpose: Execute Python code for computational/algorithmic questions

Input Schema:

{
  "code": str,           # Python code to execute
  "purpose": str,        # Description (for error correction)
  "max_retries": int,    # Auto-fix attempts (default: 3)
  "timeout_sec": int     # Execution timeout (default: 6)
}

Output:

{
  "success": bool,
  "stdout": str,         # Captured output
  "stderr": str,         # Error messages
  "returncode": int,
  "fix_tokens": int      # Tokens used in retry attempts
}

Safety Features:

  • Sandbox: Runs in a temporary directory with Python's -I -B -S flags (isolated mode, no .pyc files, no site imports); see the sketch after this list

  • Timeout: Kills after timeout_sec

  • Auto-repair: On failure, the error is sent to the LLM for a fix (up to max_retries attempts)

  • Standard library only: No numpy, pandas, etc.
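
A minimal sketch of how such a sandboxed run could be implemented (illustrative only; ProbeLLM's internals may differ, and the auto-repair loop is omitted):

import subprocess
import sys
import tempfile

def run_sandboxed(code: str, timeout_sec: int = 6) -> dict:
    # -I isolated mode, -B no .pyc files, -S skip site imports
    with tempfile.TemporaryDirectory() as tmp:
        try:
            proc = subprocess.run(
                [sys.executable, "-I", "-B", "-S", "-c", code],
                cwd=tmp, capture_output=True, text=True, timeout=timeout_sec,
            )
        except subprocess.TimeoutExpired:
            return {"success": False, "stdout": "", "stderr": "timeout", "returncode": -1}
    return {
        "success": proc.returncode == 0,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "returncode": proc.returncode,
    }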

Example:

result = registry.call_tool("python_exec", {
    "code": "import math\\nresult = math.factorial(5)\\nprint(result)",
    "purpose": "Calculate 5 factorial"
})

if result["success"]:
    print(result["stdout"])  # "120"

Custom Tool Development

Step 1: Define Tool Specification

spec = {
    "name": "my_domain_tool",
    "description": "Generates biology-specific test cases",
    "inputSchema": {
        "type": "object",
        "properties": {
            "topic": {"type": "string", "description": "Biology topic"},
            "difficulty": {"type": "string", "enum": ["easy", "hard"]}
        },
        "required": ["topic"]
    }
}

Step 2: Implement Handler Function

def biology_tool_handler(arguments: dict) -> dict:
    topic = arguments.get("topic", "")
    difficulty = arguments.get("difficulty", "easy")

    # Your custom logic here
    # - Could call external APIs
    # - Query specialized databases
    # - Use domain-specific LLM prompts

    questions = generate_biology_questions(topic, difficulty)

    return {"questions": questions}

Step 3: Register Tool

from probellm.tools import ToolRegistry, LocalMCPTool

registry = ToolRegistry()
registry.register(LocalMCPTool(spec, biology_tool_handler))

Step 4: Use in Pipeline

from probellm import VulnerabilityPipelineAsync

pipeline = VulnerabilityPipelineAsync(
    model_name="gpt-5.2",
    test_model="gpt-4o-mini",
    tool_registry=registry  # Inject custom registry
)

# Now test generation can use "my_domain_tool"

Advanced: Tool Composition

Tools can call other tools via the registry:

def composite_tool_handler(arguments: dict, registry: ToolRegistry) -> dict:
    # Step 1: Use web_search to get context
    web_result = registry.call_tool("web_search", {
        "topic": arguments["topic"]
    })

    # Step 2: Use python_exec to process data
    code = generate_processing_code(web_result)
    exec_result = registry.call_tool("python_exec", {"code": code})

    # Step 3: Synthesize final result
    return {"processed_data": exec_result["stdout"]}

# Register with registry access (the lambda closes over registry,
# so the handler can dispatch to other tools at call time)
spec = {...}  # tool spec as in Step 1
tool = LocalMCPTool(spec, lambda args: composite_tool_handler(args, registry))
registry.register(tool)

Tool Selection in MCTS

During expansion, TestCaseGenerator automatically:

  1. Calls LLM to select appropriate tool

  2. Executes tool via ToolRegistry.call_tool()

  3. Uses tool output to synthesize test case

You don’t need to modify search logic — just register your tool and it becomes available for selection.

Planner sees:

Available tools: perturbation, python_exec, web_search, my_domain_tool

Base QA: {...}

Choose the best tool and provide arguments.
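
A simplified sketch of that expansion step (illustrative; the real TestCaseGenerator adds prompt construction, retries, and scoring, and the llm_choose_tool / synthesize_test_case helpers are hypothetical):

def expand_with_tool(base_qa: dict, registry: ToolRegistry) -> dict:
    # 1. LLM picks a tool and arguments from the registered specs
    choice = llm_choose_tool(base_qa, registry)   # hypothetical helper
    # e.g. {"tool": "perturbation", "arguments": {...}}

    # 2. Execute the chosen tool via the central dispatcher
    output = registry.call_tool(choice["tool"], choice["arguments"])

    # 3. Synthesize a concrete test case from the tool output
    return synthesize_test_case(base_qa, output)  # hypothetical helper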

Best Practices

  1. Validation: Validate inputs against your inputSchema before acting on them (see the sketch after this list)

  2. Error Handling: Return {"error": ...} instead of raising exceptions

  3. Idempotency: Same inputs → same outputs (for reproducibility)

  4. Documentation: Clear description + inputSchema descriptions

  5. Logging: Use print() for debugging (captured in logs)
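
For items 1 and 2, a handler can validate against its own inputSchema and convert failures into error payloads. A sketch using the jsonschema package (an assumption; any validator works), reusing the biology tool from Custom Tool Development:

from jsonschema import ValidationError, validate

def safe_biology_handler(arguments: dict) -> dict:
    try:
        validate(instance=arguments, schema=spec["inputSchema"])  # Best Practice 1
        return biology_tool_handler(arguments)
    except ValidationError as exc:
        return {"error": f"invalid arguments: {exc.message}"}  # Best Practice 2
    except Exception as exc:
        return {"error": str(exc)}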

See Also