Tool System (MCP-Based)
=======================

ProbeLLM's tool system is the **core extensibility layer**, allowing you to inject custom test generation strategies without modifying the search engine.

Overview
--------

**Design Goals**:

1. **Pluggable**: Add/remove tools at runtime
2. **Standardized**: Follow the Model Context Protocol (MCP)
3. **Type-Safe**: JSON Schema validation for inputs
4. **Composable**: Tools can call other tools

**Architecture**:

.. code-block:: text

   User Code
       │
       ▼
   ToolRegistry (central dispatcher)
       │
       ├──> LocalMCPTool("perturbation", handler_fn)
       ├──> LocalMCPTool("python_exec", handler_fn)
       ├──> LocalMCPTool("web_search", handler_fn)
       └──> LocalMCPTool("my_custom_tool", handler_fn)

Built-in Tools
--------------

perturbation
^^^^^^^^^^^^

**Purpose**: Generate semantics-preserving variations of a base question.

**Input Schema**:

.. code-block:: python

   {
       "input": str,        # Original question
       "expected": str,     # Ground truth
       "operations": list,  # ["paraphrase", "reformulate"]
       "forms": list,       # ["multiple_choice", "true_false"]
       "num_variants": int  # Number to generate
   }

**Output**:

.. code-block:: python

   {
       "variants": [
           {
               "operation": "paraphrase",
               "form": "free_text",
               "text": "Reworded question...",
               "rationale": "..."
           },
           {
               "operation": "reformulate",
               "form": "multiple_choice",
               "text": "Question stem...",
               "options": ["A", "B", "C", "D"],
               "answer_key": "B",
               "rationale": "..."
           }
       ]
   }

**Example**:

.. code-block:: python

   from probellm.tools import build_default_tool_registry

   registry = build_default_tool_registry(model="gpt-5.2")

   result = registry.call_tool("perturbation", {
       "input": "What is 2+2?",
       "expected": "4",
       "operations": ["paraphrase"],
       "num_variants": 3
   })

   for variant in result["variants"]:
       print(variant["text"])

python_exec
^^^^^^^^^^^

**Purpose**: Execute Python code for computational/algorithmic questions.

**Input Schema**:

.. code-block:: python

   {
       "code": str,         # Python code to execute
       "purpose": str,      # Description (for error correction)
       "max_retries": int,  # Auto-fix attempts (default: 3)
       "timeout_sec": int   # Execution timeout (default: 6)
   }

**Output**:

.. code-block:: python

   {
       "success": bool,
       "stdout": str,       # Captured output
       "stderr": str,       # Error messages
       "returncode": int,
       "fix_tokens": int    # Tokens used in retry attempts
   }

**Safety Features**:

- **Sandbox**: Runs in a temporary directory with the ``-I -B -S`` interpreter flags
- **Timeout**: Kills the process after ``timeout_sec``
- **Auto-repair**: If execution fails, sends the error to the LLM for a fix (up to ``max_retries``)
- **Standard library only**: No ``numpy``, ``pandas``, etc.

**Example**:

.. code-block:: python

   result = registry.call_tool("python_exec", {
       "code": "import math\nresult = math.factorial(5)\nprint(result)",
       "purpose": "Calculate 5 factorial"
   })

   if result["success"]:
       print(result["stdout"])  # "120"

web_search
^^^^^^^^^^

**Purpose**: Retrieve external knowledge via the OpenAI Responses API.

**Input Schema**:

.. code-block:: python

   {
       "topic": str,        # Search query
       "max_results": int   # Max results to retrieve (default: 5)
   }

**Output**:

.. code-block:: python

   {
       "question": str,
       "answer": str,
       "evidence": [
           {"url": str, "title": str, "quote": str},
           ...
       ],
       "citations_validation": {"valid": bool, "reason": str}
   }

**Example**:

.. code-block:: python

   result = registry.call_tool("web_search", {
       "topic": "quantum entanglement",
       "max_results": 3
   })

   print(result["answer"])
   for ev in result["evidence"]:
       print(f"- {ev['title']}: {ev['url']}")
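**Combined example**: the built-in tools compose naturally from plain user code. The sketch below is illustrative only (the question text and expected answer are made up); it asks ``perturbation`` for paraphrased variants and uses ``python_exec`` to confirm the ground truth before printing them.

.. code-block:: python

   from probellm.tools import build_default_tool_registry

   registry = build_default_tool_registry(model="gpt-5.2")

   # Generate paraphrased variants of a base QA pair.
   variants = registry.call_tool("perturbation", {
       "input": "What is the sum of the first 100 positive integers?",
       "expected": "5050",
       "operations": ["paraphrase"],
       "num_variants": 3
   })

   # Independently confirm the expected answer with the sandboxed executor.
   check = registry.call_tool("python_exec", {
       "code": "print(sum(range(1, 101)))",
       "purpose": "Verify that the expected answer 5050 is correct"
   })

   if check["success"] and check["stdout"].strip() == "5050":
       for variant in variants["variants"]:
           print(variant["text"])

Chaining tools from user code like this is distinct from tool composition inside a handler, which is covered below.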
Custom Tool Development
-----------------------

Step 1: Define Tool Specification
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

   spec = {
       "name": "my_domain_tool",
       "description": "Generates biology-specific test cases",
       "inputSchema": {
           "type": "object",
           "properties": {
               "topic": {"type": "string", "description": "Biology topic"},
               "difficulty": {"type": "string", "enum": ["easy", "hard"]}
           },
           "required": ["topic"]
       }
   }

Step 2: Implement Handler Function
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

   def biology_tool_handler(arguments: dict) -> dict:
       topic = arguments.get("topic", "")
       difficulty = arguments.get("difficulty", "easy")

       # Your custom logic here:
       # - Could call external APIs
       # - Query specialized databases
       # - Use domain-specific LLM prompts
       questions = generate_biology_questions(topic, difficulty)

       return {"questions": questions}

Step 3: Register Tool
^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

   from probellm.tools import ToolRegistry, LocalMCPTool

   registry = ToolRegistry()
   registry.register(LocalMCPTool(spec, biology_tool_handler))

Step 4: Use in Pipeline
^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

   from probellm import VulnerabilityPipelineAsync

   pipeline = VulnerabilityPipelineAsync(
       model_name="gpt-5.2",
       test_model="gpt-4o-mini",
       tool_registry=registry  # Inject custom registry
   )

   # Now test generation can use "my_domain_tool"

Advanced: Tool Composition
--------------------------

Tools can call other tools via the registry:

.. code-block:: python

   def composite_tool_handler(arguments: dict, registry: ToolRegistry) -> dict:
       # Step 1: Use web_search to get context
       web_result = registry.call_tool("web_search", {
           "topic": arguments["topic"]
       })

       # Step 2: Use python_exec to process data
       code = generate_processing_code(web_result)
       exec_result = registry.call_tool("python_exec", {"code": code})

       # Step 3: Synthesize final result
       return {"processed_data": exec_result["stdout"]}

   # Register with registry access
   spec = {...}
   tool = LocalMCPTool(spec, lambda args: composite_tool_handler(args, registry))
   registry.register(tool)

Tool Selection in MCTS
----------------------

During expansion, ``TestCaseGenerator`` automatically:

1. Calls the LLM to select an appropriate tool
2. Executes the tool via ``ToolRegistry.call_tool()``
3. Uses the tool output to synthesize a test case

**You don't need to modify the search logic**: just register your tool and it becomes available for selection.

**Planner sees**:

.. code-block:: text

   Available tools: perturbation, python_exec, web_search, my_domain_tool

   Base QA: {...}

   Choose the best tool and provide arguments.

Best Practices
--------------

1. **Validation**: Validate inputs against your JSON Schema (see the sketch at the end of this page)
2. **Error Handling**: Return ``{"error": ...}`` instead of raising exceptions
3. **Idempotency**: Same inputs → same outputs (for reproducibility)
4. **Documentation**: Clear ``description`` plus per-field ``inputSchema`` descriptions
5. **Logging**: Use ``print()`` for debugging (captured in logs)

See Also
--------

- :doc:`../guides/custom_tools`: Step-by-step tutorial
- :doc:`../api`: Full API reference
- :doc:`../concepts`: Tool selection strategy
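As a concrete illustration of best practices 1 and 2 above, the following sketch wraps a handler so that invalid inputs come back as ``{"error": ...}`` results instead of exceptions. It assumes the third-party ``jsonschema`` package is installed; the ``validated`` wrapper and the ``demo_spec``/``demo_handler`` names are hypothetical and not part of the ProbeLLM API.

.. code-block:: python

   # Minimal sketch of best practices 1 and 2: validate arguments against the
   # tool's inputSchema and trap failures as error dicts. Requires the
   # third-party `jsonschema` package; all names below are illustrative.
   from jsonschema import validate, ValidationError


   def validated(spec, handler):
       """Wrap a handler so bad inputs become {"error": ...} results."""
       def wrapper(arguments: dict) -> dict:
           try:
               validate(instance=arguments, schema=spec["inputSchema"])
               return handler(arguments)
           except ValidationError as exc:
               # Best practice 2: surface the problem as data, not an exception.
               return {"error": f"invalid arguments: {exc.message}"}
           except Exception as exc:
               return {"error": str(exc)}
       return wrapper


   demo_spec = {
       "name": "demo_tool",
       "description": "Echoes a topic string",
       "inputSchema": {
           "type": "object",
           "properties": {"topic": {"type": "string"}},
           "required": ["topic"]
       }
   }


   def demo_handler(arguments: dict) -> dict:
       return {"echo": arguments["topic"]}


   safe_handler = validated(demo_spec, demo_handler)
   print(safe_handler({"topic": "photosynthesis"}))  # {"echo": "photosynthesis"}
   print(safe_handler({}))                           # {"error": "invalid arguments: ..."}

The wrapped handler can then be registered like any other, e.g. ``registry.register(LocalMCPTool(demo_spec, safe_handler))``.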