Tool System (MCP-Based)
=======================

ProbeLLM's tool system is the **core extensibility layer**, allowing you to inject custom test generation strategies without modifying the search engine.

Overview
--------

**Design Goals**:

1. **Pluggable**: Add/remove tools at runtime
2. **Standardized**: Follow the Model Context Protocol (MCP)
3. **Type-Safe**: JSON Schema validation for inputs
4. **Composable**: Tools can call other tools

**Architecture**:

.. code-block:: text

   User Code
       │
       ▼
   ToolRegistry (central dispatcher)
       │
       ├──> LocalMCPTool("perturbation", handler_fn)
       ├──> LocalMCPTool("python_exec", handler_fn)
       ├──> LocalMCPTool("web_search", handler_fn)
       └──> LocalMCPTool("my_custom_tool", handler_fn)

Built-in Tools
--------------

perturbation
^^^^^^^^^^^^

**Purpose**: Generate semantics-preserving variations of a base question.

**Input Schema**:

.. code-block:: python

   {
       "input": str,        # Original question
       "expected": str,     # Ground truth
       "operations": list,  # ["paraphrase", "reformulate"]
       "forms": list,       # ["multiple_choice", "true_false"]
       "num_variants": int  # Number to generate
   }

**Output**:

.. code-block:: python

   {
       "variants": [
           {
               "operation": "paraphrase",
               "form": "free_text",
               "text": "Reworded question...",
               "rationale": "..."
           },
           {
               "operation": "reformulate",
               "form": "multiple_choice",
               "text": "Question stem...",
               "options": ["A", "B", "C", "D"],
               "answer_key": "B",
               "rationale": "..."
           }
       ]
   }

**Example**:

.. code-block:: python

   from probellm.tools import build_default_tool_registry

   registry = build_default_tool_registry(model="gpt-5.2")

   result = registry.call_tool("perturbation", {
       "input": "What is 2+2?",
       "expected": "4",
       "operations": ["paraphrase"],
       "num_variants": 3
   })

   for variant in result["variants"]:
       print(variant["text"])

python_exec
^^^^^^^^^^^

**Purpose**: Execute Python code for computational/algorithmic questions.

**Input Schema**:

.. code-block:: python

   {
       "code": str,         # Python code to execute
       "purpose": str,      # Description (for error correction)
       "max_retries": int,  # Auto-fix attempts (default: 3)
       "timeout_sec": int   # Execution timeout (default: 6)
   }

**Output**:

.. code-block:: python

   {
       "success": bool,
       "stdout": str,       # Captured output
       "stderr": str,       # Error messages
       "returncode": int,
       "fix_tokens": int    # Tokens used in retry attempts
   }

**Safety Features**:

- **Sandbox**: Runs in a temporary directory with the ``-I -B -S`` interpreter flags
- **Timeout**: Kills the process after ``timeout_sec``
- **Auto-repair**: If execution fails, sends the error to the LLM for a fix (up to ``max_retries``)
- **Standard library only**: No ``numpy``, ``pandas``, etc.

**Example**:

.. code-block:: python

   result = registry.call_tool("python_exec", {
       "code": "import math\nresult = math.factorial(5)\nprint(result)",
       "purpose": "Calculate 5 factorial"
   })

   if result["success"]:
       print(result["stdout"])  # "120"

web_search
^^^^^^^^^^

**Purpose**: Retrieve external knowledge via the OpenAI Responses API.

**Input Schema**:

.. code-block:: python

   {
       "topic": str,        # Search query
       "max_results": int   # Max results to retrieve (default: 5)
   }

**Output**:

.. code-block:: python

   {
       "question": str,
       "answer": str,
       "evidence": [
           {"url": str, "title": str, "quote": str},
           ...
       ],
       "citations_validation": {"valid": bool, "reason": str}
   }

**Example**:

.. code-block:: python

   result = registry.call_tool("web_search", {
       "topic": "quantum entanglement",
       "max_results": 3
   })

   print(result["answer"])
   for ev in result["evidence"]:
       print(f"- {ev['title']}: {ev['url']}")
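**Combined example**: the built-in tools compose naturally from plain user code. The sketch below is illustrative only (the question text and expected answer are made up); it asks ``perturbation`` for paraphrased variants and uses ``python_exec`` to confirm the ground truth before printing them.

.. code-block:: python

   from probellm.tools import build_default_tool_registry

   registry = build_default_tool_registry(model="gpt-5.2")

   # Generate paraphrased variants of a base QA pair.
   variants = registry.call_tool("perturbation", {
       "input": "What is the sum of the first 100 positive integers?",
       "expected": "5050",
       "operations": ["paraphrase"],
       "num_variants": 3
   })

   # Independently confirm the expected answer with the sandboxed executor.
   check = registry.call_tool("python_exec", {
       "code": "print(sum(range(1, 101)))",
       "purpose": "Verify that the expected answer 5050 is correct"
   })

   if check["success"] and check["stdout"].strip() == "5050":
       for variant in variants["variants"]:
           print(variant["text"])

Chaining tools from user code like this is distinct from tool composition inside a handler, which is covered below.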
Custom Tool Development
-----------------------

Step 1: Define Tool Specification
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

   spec = {
       "name": "my_domain_tool",
       "description": "Generates biology-specific test cases",
       "inputSchema": {
           "type": "object",
           "properties": {
               "topic": {"type": "string", "description": "Biology topic"},
               "difficulty": {"type": "string", "enum": ["easy", "hard"]}
           },
           "required": ["topic"]
       }
   }

Step 2: Implement Handler Function
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

   def biology_tool_handler(arguments: dict) -> dict:
       topic = arguments.get("topic", "")
       difficulty = arguments.get("difficulty", "easy")

       # Your custom logic here:
       # - Could call external APIs
       # - Query specialized databases
       # - Use domain-specific LLM prompts
       questions = generate_biology_questions(topic, difficulty)

       return {"questions": questions}

Step 3: Register Tool
^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

   from probellm.tools import ToolRegistry, LocalMCPTool

   registry = ToolRegistry()
   registry.register(LocalMCPTool(spec, biology_tool_handler))

Step 4: Use in Pipeline
^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

   from probellm import VulnerabilityPipelineAsync

   pipeline = VulnerabilityPipelineAsync(
       model_name="gpt-5.2",
       test_model="gpt-4o-mini",
       tool_registry=registry  # Inject custom registry
   )

   # Now test generation can use "my_domain_tool"

Advanced: Tool Composition
--------------------------

Tools can call other tools via the registry:

.. code-block:: python

   def composite_tool_handler(arguments: dict, registry: ToolRegistry) -> dict:
       # Step 1: Use web_search to get context
       web_result = registry.call_tool("web_search", {
           "topic": arguments["topic"]
       })

       # Step 2: Use python_exec to process data
       code = generate_processing_code(web_result)
       exec_result = registry.call_tool("python_exec", {"code": code})

       # Step 3: Synthesize final result
       return {"processed_data": exec_result["stdout"]}

   # Register with registry access
   spec = {...}
   tool = LocalMCPTool(spec, lambda args: composite_tool_handler(args, registry))
   registry.register(tool)

Tool Selection in MCTS
----------------------

During expansion, ``TestCaseGenerator`` automatically:

1. Calls the LLM to select an appropriate tool
2. Executes the tool via ``ToolRegistry.call_tool()``
3. Uses the tool output to synthesize a test case

**You don't need to modify the search logic**: just register your tool and it becomes available for selection.

**Planner sees**:

.. code-block:: text

   Available tools: perturbation, python_exec, web_search, my_domain_tool

   Base QA: {...}

   Choose the best tool and provide arguments.

Best Practices
--------------

1. **Validation**: Validate inputs against your JSON Schema (see the sketch at the end of this page)
2. **Error Handling**: Return ``{"error": ...}`` instead of raising exceptions
3. **Idempotency**: Same inputs → same outputs (for reproducibility)
4. **Documentation**: Clear ``description`` plus per-field ``inputSchema`` descriptions
5. **Logging**: Use ``print()`` for debugging (captured in logs)

See Also
--------

- :doc:`../guides/custom_tools`: Step-by-step tutorial
- :doc:`../api`: Full API reference
- :doc:`../concepts`: Tool selection strategy
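As a concrete illustration of best practices 1 and 2 above, the following sketch wraps a handler so that invalid inputs come back as ``{"error": ...}`` results instead of exceptions. It assumes the third-party ``jsonschema`` package is installed; the ``validated`` wrapper and the ``demo_spec``/``demo_handler`` names are hypothetical and not part of the ProbeLLM API.

.. code-block:: python

   # Minimal sketch of best practices 1 and 2: validate arguments against the
   # tool's inputSchema and trap failures as error dicts. Requires the
   # third-party `jsonschema` package; all names below are illustrative.
   from jsonschema import validate, ValidationError


   def validated(spec, handler):
       """Wrap a handler so bad inputs become {"error": ...} results."""
       def wrapper(arguments: dict) -> dict:
           try:
               validate(instance=arguments, schema=spec["inputSchema"])
               return handler(arguments)
           except ValidationError as exc:
               # Best practice 2: surface the problem as data, not an exception.
               return {"error": f"invalid arguments: {exc.message}"}
           except Exception as exc:
               return {"error": str(exc)}
       return wrapper


   demo_spec = {
       "name": "demo_tool",
       "description": "Echoes a topic string",
       "inputSchema": {
           "type": "object",
           "properties": {"topic": {"type": "string"}},
           "required": ["topic"]
       }
   }


   def demo_handler(arguments: dict) -> dict:
       return {"echo": arguments["topic"]}


   safe_handler = validated(demo_spec, demo_handler)
   print(safe_handler({"topic": "photosynthesis"}))  # {"echo": "photosynthesis"}
   print(safe_handler({}))                           # {"error": "invalid arguments: ..."}

The wrapped handler can then be registered like any other, e.g. ``registry.register(LocalMCPTool(demo_spec, safe_handler))``.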