Core Concepts
=============

Understanding these concepts will help you effectively use ProbeLLM and extend it for your needs.

Monte Carlo Tree Search (MCTS)
------------------------------

**Why MCTS for Vulnerability Detection?**

Traditional fuzzing generates random inputs, but LLM input spaces are **vast and semantic**. MCTS addresses this by:

1. **Guided Exploration**: Uses UCB1 to balance exploration vs exploitation
2. **Adaptive Budget**: Focuses compute on promising failure regions
3. **Tree Structure**: Maintains history for analysis and replay

**MCTS Phases in ProbeLLM**:

.. code-block:: text

   Phase 1: Selection
   ──────────────────
   Start from root → navigate tree using UCB1

   UCB1(node) = error_rate + C × sqrt(ln(parent_visits) / node_visits)

   → Selects nodes with high error_rate OR low visit count

   Phase 2: Expansion
   ──────────────────
   At selected node:
   1. Tool planning: LLM chooses tool (perturbation/python_exec/web_search)
   2. Tool execution: Generate new test case
   3. Answer generation: Create ground truth for synthetic question
   4. Add child node to tree

   Phase 3: Simulation
   ───────────────────
   Test the target model:
   1. Send query to model-under-test
   2. Compare prediction vs ground truth (using judge LLM)
   3. Record: correct / error + reasoning

   Phase 4: Backpropagation
   ────────────────────────
   Update statistics from leaf → root:
   - visits += 1
   - error_count += 1 (if error detected)

   → Future selections favor branches with high error rates

**Stopping Criteria**:

- Reaches ``num_simulations`` iterations
- Exhausts sample budget (``num_samples``)
- Maximum tree depth (``max_depth``)
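To make the selection rule concrete, here is a minimal sketch of a UCB1-driven selection step. The ``Node`` dataclass, the ``select`` helper, and the exploration constant ``c`` are illustrative assumptions rather than ProbeLLM's actual API; the score simply mirrors the formula above (observed error rate plus an exploration bonus).

.. code-block:: python

   import math
   from dataclasses import dataclass, field

   @dataclass
   class Node:
       """Illustrative tree node; ProbeLLM's real node carries more state."""
       visits: int = 0
       error_count: int = 0
       children: list["Node"] = field(default_factory=list)

   def ucb1(node: Node, parent_visits: int, c: float = 1.4) -> float:
       """Observed error rate plus an exploration bonus (UCB1)."""
       if node.visits == 0:
           return float("inf")  # Always try unvisited children first
       error_rate = node.error_count / node.visits
       return error_rate + c * math.sqrt(math.log(parent_visits) / node.visits)

   def select(root: Node) -> Node:
       """Phase 1: walk from the root, taking the highest-scoring child each step."""
       node = root
       while node.children:
           node = max(node.children, key=lambda child: ucb1(child, node.visits))
       return node

Unvisited children score ``inf``, so every branch is tried at least once before the error-rate term starts steering the search toward failure-rich regions.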
Micro vs Macro Search
---------------------

ProbeLLM uses **dual-strategy MCTS** to balance local exploitation and global exploration.

Micro Search (Local Exploitation)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Goal**: Find variations of known failures

**Strategy**:

- Uses ``perturbation`` tool (paraphrasing, reformulation)
- Stays in "trust region" around existing failures
- Preserves semantic content but changes surface form

**Example**:

.. code-block:: text

   Original failure:
   Q: "What is the capital of France?"
   Model: "London" ❌

   Micro-generated variations:
   - "Name the capital city of France."
   - "France's capital is which city?"
   - "Multiple choice: France's capital: (A) Berlin (B) Paris (C) Rome"

**When to use**: When you have a known failure and want to understand its **robustness boundary**.

Macro Search (Global Exploration)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Goal**: Discover **distant** failure modes

**Strategy**:

- Uses ``web_search`` tool + greedy-k-center sampling
- Selects topics maximally different from seen failures (embedding distance)
- Generates questions on novel domains

**Example**:

.. code-block:: text

   Existing failures (MMLU physics):
   - Kinematics, thermodynamics, optics

   Macro-generated topics:
   - Quantum entanglement (far from classical mechanics)
   - Astrophysics (different scale, different intuitions)

**When to use**: When you want to expand test coverage beyond the initial failure set.

**Greedy-K-Center Algorithm**:

.. code-block:: text

   # Pseudo-code
   embeddings = [embed(q) for q in seen_failures]
   selected = [random_pick(embeddings)]

   for i in range(k-1):
       # Find point farthest from any selected point
       distances = [min(dist(e, s) for s in selected) for e in embeddings]
       next_idx = argmax(distances)
       selected.append(embeddings[next_idx])

   # Use selected failures to prompt web_search tool
   # -> Generate question on distant topic
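The pseudo-code maps directly onto a small NumPy routine. The sketch below is a generic greedy k-center selection over precomputed embedding vectors; the embedding model itself and the hand-off to the ``web_search`` tool are ProbeLLM-specific steps that are assumed here, not shown.

.. code-block:: python

   import numpy as np

   def greedy_k_center(embeddings: np.ndarray, k: int, seed: int = 0) -> list[int]:
       """Pick k maximally spread-out failure indices (greedy k-center).

       embeddings: (n, d) array of failure-question embeddings (assumed precomputed).
       """
       rng = np.random.default_rng(seed)
       n = embeddings.shape[0]
       selected = [int(rng.integers(n))]  # Seed with a random failure
       # Distance from every point to its nearest selected center so far
       min_dist = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)

       for _ in range(min(k, n) - 1):
           next_idx = int(np.argmax(min_dist))  # Farthest from all current centers
           selected.append(next_idx)
           new_dist = np.linalg.norm(embeddings - embeddings[next_idx], axis=1)
           min_dist = np.minimum(min_dist, new_dist)

       return selected

The selected failures then seed the ``web_search`` prompt, steering newly generated questions toward topics far from anything already covered.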
Tool Selection Strategy
-----------------------

During expansion, ProbeLLM uses an **LLM-based planner** to choose the appropriate tool.

**Planning Prompt**:

.. code-block:: text

   "You are a tool planning expert. Based on the question, choose the best tool.
   Available tools: perturbation, python_exec, web_search
   Return JSON: {tool: ..., args: {...}, purpose: ...}"

**Decision Factors**:

.. list-table::
   :header-rows: 1
   :widths: 20 40 40

   * - Tool
     - When Selected
     - Example Trigger
   * - ``perturbation``
     - Nearby search, paraphrasing
     - "similar question but different form"
   * - ``python_exec``
     - Computational/algorithmic tasks
     - "needs calculation", "code execution"
   * - ``web_search``
     - Factual knowledge, far search
     - "needs external knowledge", "novel domain"

**Why LLM planning?**

- **Adaptive**: Tool choice depends on question characteristics
- **Explainable**: The ``purpose`` field explains the reasoning
- **Extensible**: Add new tools → the planner automatically considers them

Test Case Validity
------------------

Not all generated test cases are valid. ProbeLLM uses **multi-stage validation**:

1. **Generation-Time Validation** (``TestCaseGenerator``):

   .. code-block:: python

      def generate_nearby(q, a):
          candidate = llm.generate(...)
          verdict = llm.check(candidate)  # {valid: bool, reasons: [str]}
          if not verdict["valid"]:
              # Log failure, potentially retry
              pass
          return candidate

2. **Answer-Time Validation** (``AnswerGenerator``):

   - Ensures the answer is non-empty
   - Detects format-specification leakage (e.g., "mapping: {0: 'A', ...}")
   - Retries with feedback if invalid

3. **Execution-Time Validation**:

   - For ``python_exec``: timeout (6s), sandbox constraints
   - LLM-based error correction (up to 3 retries)

4. **Judge-Time Validation**:

   - Strict factual equivalence check
   - Lenient on formatting/wording

Ground Truth Generation
-----------------------

For **synthetic test cases**, we need to generate ground truth. ProbeLLM uses the following strategy:

1. **Tool selection**: Choose ``web_search`` (factual) or ``python_exec`` (computational)
2. **Evidence retrieval**: Execute the tool → gather context
3. **LLM synthesis**:

   .. code-block:: python

      prompt = f"""Question: {question}
      Evidence: {tool_output}
      Provide a concise answer (<= 3 sentences), cite sources if applicable."""

      answer = llm.generate(prompt)

4. **Validation**: Check that the answer is substantive (not metadata or format specs)

**Confidence Tracking**:

Each generated answer includes:

- ``answer``: The actual answer text
- ``confidence``: Float 0-1 (LLM self-assessed)
- ``reasoning``: Justification

**Use case**: Filter low-confidence questions or use confidence in scoring.

Checkpoint & Resume
-------------------

Long searches can be interrupted. ProbeLLM supports **resumable search**.

**Checkpoint Structure**:

.. code-block:: text

   {
     "metadata": {
       "dataset_id": "mmlu",
       "last_simulation": 42,
       "timestamp": "2026-01-27T12:00:00"
     },
     "root_state": {
       "visits": 100,
       "error_count": 25,
       "tree_layer_num": [5, 12, 8]
     },
     "nodes": [
       {
         "id": "syn_mmlu_1_0",
         "parent_id": "root_mmlu",
         "depth": 1,
         "sample": {"query": "...", "ground_truth": "..."},
         "visits": 10,
         "error_count": 3,
         "results": [...]
       },
       ...
     ]
   }

**Usage**:

.. code-block:: python

   from probellm import create_checkpoints, resume_from_checkpoint

   # Create checkpoint from interrupted run
   create_checkpoints("results/run_xxx/")

   # Resume
   resume_from_checkpoint("results/run_xxx/")

**Features**:

- Preserves tree structure + statistics
- Resumes from the exact simulation count
- Appends to existing results files
- Rebuilds ``embeddings.pkl`` if missing

Token Usage Tracking
--------------------

ProbeLLM tracks token consumption at **every step**.

**Tracked Operations**:

1. **Tool Planning**: Selecting which tool to use
2. **Tool Execution**: Running the tool (e.g., LLM calls inside ``perturbation``)
3. **Candidate Generation**: Synthesizing the new question
4. **Validation**: Checking question validity
5. **Answer Generation**: Creating ground truth
6. **Model Inference**: Testing the model-under-test
7. **Judging**: Comparing prediction vs ground truth

**Result Format**:

.. code-block:: text

   {
     "id": "syn_mmlu_2_5",
     "query": "...",
     "prediction": "...",
     "token_usage": {
       "testcase_gen": {
         "plan": {"prompt_tokens": 123, "completion_tokens": 45, "total_tokens": 168},
         "candidate_generation": { ... },
         "validation": { ... },
         "total_tokens": 512
       },
       "answer_gen": {
         "plan": { ... },
         "answer_generation": { ... },
         "total_tokens": 300
       },
       "model_inference": {"total_tokens": 150},
       "judge": {"total_tokens": 200},
       "total_tokens": 1162
     }
   }

**Use case**: Cost analysis, budget allocation, optimization.

Error Classification
--------------------

ProbeLLM records **why** a model failed.

**Judge Output**:

.. code-block:: python

   {
       "correct": False,
       "error_reason": "Model claimed Paris is in Germany, which is factually incorrect.",
       "correct_reason": ""  # Empty for errors
   }

**Analysis** (``pcaAnalysisEnhanced.py``):

- Clusters errors by embedding similarity
- Identifies **systematic failure patterns**
- Generates human-readable reports

**Example Clusters**:

.. code-block:: text

   Cluster 0 (23 errors): "Capital city confusion"
   - Model consistently swaps European capitals
   - Likely memorization issue

   Cluster 1 (18 errors): "Unit conversion"
   - Fails when converting meters → feet
   - Suggests training data bias (metric vs imperial)

MCP Tool Protocol
-----------------

ProbeLLM tools follow **Model Context Protocol** (MCP) conventions.

**Tool Specification**:

.. code-block:: python

   {
       "name": "tool_name",
       "description": "Human-readable description",
       "inputSchema": {
           "type": "object",
           "properties": {
               "arg1": {"type": "string", "description": "..."},
               "arg2": {"type": "integer", "minimum": 0}
           },
           "required": ["arg1"]
       }
   }

**Tool Response** (JSON-RPC-like envelope):

.. code-block:: python

   # Success
   {
       "result": {...}  # Tool-specific payload
   }

   # Error
   {
       "error": {
           "code": -32603,
           "message": "Tool execution failed",
           "data": {"details": "..."}
       }
   }

**Why MCP?**

- **Standardized**: Interoperable with other MCP-compliant systems
- **Typed**: Input schemas enforce validation
- **Extensible**: Easy to add new tools without modifying the core

Next Steps
----------

- :doc:`quickstart`: Hands-on tutorial
- :doc:`modules/tools`: Deep dive into the tool system
- :doc:`modules/search`: MCTS implementation details
- :doc:`guides/custom_tools`: Build your own tools