Core Concepts

Understanding these concepts will help you effectively use ProbeLLM and extend it for your needs.

Monte Carlo Tree Search (MCTS)

Why MCTS for Vulnerability Detection?

Traditional fuzzing generates random inputs, but the input space of an LLM is vast and semantically structured, so unguided sampling rarely surfaces failures. MCTS addresses this by:

  1. Guided Exploration: Uses UCB1 to balance exploration vs exploitation

  2. Adaptive Budget: Focuses compute on promising failure regions

  3. Tree Structure: Maintains history for analysis and replay

MCTS Phases in ProbeLLM:

Phase 1: Selection
──────────────────
Start from root → navigate tree using UCB1
UCB1(node) = error_rate + C × sqrt(ln(parent_visits) / node_visits)

→ Selects nodes with high error_rate OR low visit count
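
To make the selection step concrete, here is a minimal sketch in Python. It assumes an illustrative node object with visits, error_count, and children attributes and an exploration constant C; these names are for illustration only and are not ProbeLLM's actual classes.

import math

C = 1.4  # exploration constant (illustrative value)

def ucb1(node, parent_visits):
    # Unvisited nodes score infinity so they are tried at least once
    if node.visits == 0:
        return float("inf")
    error_rate = node.error_count / node.visits
    return error_rate + C * math.sqrt(math.log(parent_visits) / node.visits)

def select(root):
    # Walk from the root to a leaf, always following the highest-UCB1 child
    node = root
    while node.children:
        node = max(node.children, key=lambda c: ucb1(c, node.visits))
    return node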

Phase 2: Expansion
──────────────────
At selected node:
1. Tool planning: LLM chooses tool (perturbation/python_exec/web_search)
2. Tool execution: Generate new test case
3. Answer generation: Create ground truth for synthetic question
4. Add child node to tree

Phase 3: Simulation
────────────────────
Test the target model:
1. Send query to model-under-test
2. Compare prediction vs ground truth (using judge LLM)
3. Record: correct / error + reasoning

Phase 4: Backpropagation
─────────────────────────
Update statistics from leaf → root:
- visits += 1
- error_count += 1 (if error detected)

→ Future selections favor branches with high error rates
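
A sketch of the backpropagation step, assuming each node keeps a parent reference (again, illustrative names rather than ProbeLLM's actual API):

def backpropagate(leaf, error_detected):
    # Update statistics from the simulated leaf back up to the root
    node = leaf
    while node is not None:
        node.visits += 1
        if error_detected:
            node.error_count += 1
        node = node.parent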

Stopping Criteria:

  • Completes num_simulations iterations

  • Exhausts the sample budget (num_samples)

  • Reaches the maximum tree depth (max_depth)
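
Putting the phases and stopping criteria together, the outer search loop looks roughly like this. The parameter names follow the configuration above; select and backpropagate are the sketches from the earlier phases, and expand/simulate are placeholders for Phases 2 and 3.

def search(root, config):
    samples_used = 0
    for sim in range(config.num_simulations):
        if samples_used >= config.num_samples:
            break                          # sample budget exhausted
        leaf = select(root)                # Phase 1
        if leaf.depth >= config.max_depth:
            continue                       # do not expand past max tree depth
        child = expand(leaf)               # Phase 2 (placeholder)
        error = simulate(child)            # Phase 3 (placeholder)
        backpropagate(child, error)        # Phase 4
        samples_used += 1
    return root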

Tool Selection Strategy

During expansion, ProbeLLM uses an LLM-based planner to choose the appropriate tool.

Planning Prompt:

"You are a tool planning expert. Based on the question, choose the best tool.

Available tools: perturbation, python_exec, web_search

Return JSON: {tool: ..., args: {...}, purpose: ...}"
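
A minimal sketch of how such a planner call might look, assuming a generic chat-completion callable and the prompt above; the chat function and the exact prompt wording are placeholders, not ProbeLLM's actual API.

import json

PLANNING_PROMPT = """You are a tool planning expert. Based on the question, choose the best tool.

Available tools: perturbation, python_exec, web_search

Question: {question}

Return JSON: {{"tool": "...", "args": {{}}, "purpose": "..."}}"""

def plan_tool(chat, question):
    # Ask the planner LLM for a structured tool choice
    raw = chat(PLANNING_PROMPT.format(question=question))
    plan = json.loads(raw)  # expected keys: tool, args, purpose
    assert plan["tool"] in {"perturbation", "python_exec", "web_search"}
    return plan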

Decision Factors:

| Tool | When Selected | Example Trigger |
|------|---------------|-----------------|
| perturbation | Nearby search, paraphrasing | “similar question but different form” |
| python_exec | Computational/algorithmic tasks | “needs calculation”, “code execution” |
| web_search | Factual knowledge, far search | “needs external knowledge”, “novel domain” |

Why LLM planning?

  • Adaptive: Tool choice depends on question characteristics

  • Explainable: purpose field explains reasoning

  • Extensible: Add new tools → planner automatically considers them

Test Case Validity

Not all generated test cases are valid. ProbeLLM uses multi-stage validation:

  1. Generation-Time Validation (TestCaseGenerator):

    def generate_nearby(q, a):
        candidate = llm.generate(...)
        verdict = llm.check(candidate)  # {valid: bool, reasons: [str]}
    
        if not verdict["valid"]:
            # Log failure, potentially retry
            pass
    
        return candidate
    
  2. Answer-Time Validation (AnswerGenerator):

    • Ensures answer is non-empty

    • Detects format specification leakage (e.g., “mapping: {0: ‘A’, …}”)

    • Retry with feedback if invalid

  3. Execution-Time Validation:

    • For python_exec: Timeout (6s), sandbox constraints

    • LLM-based error correction (up to 3 retries)

  4. Judge-Time Validation:

    • Strict factual equivalence check

    • Lenient on formatting/wording
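
To make the retry-with-feedback flow in stages 2 and 3 concrete, here is a sketch; generate_answer and is_valid_answer are hypothetical helpers, not ProbeLLM's actual functions.

def generate_with_retries(question, generate_answer, is_valid_answer, max_retries=3):
    feedback = ""
    for attempt in range(max_retries):
        answer = generate_answer(question, feedback)
        ok, reason = is_valid_answer(answer)   # e.g. non-empty, no format-spec leakage
        if ok:
            return answer
        # Feed the failure reason back into the next generation attempt
        feedback = f"Previous answer rejected: {reason}"
    return None  # give up; the caller may discard this test case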

Ground Truth Generation

Synthetic test cases have no reference answers, so ProbeLLM generates its own ground truth:

Strategy:

  1. Tool selection: Choose web_search (factual) or python_exec (computational)

  2. Evidence retrieval: Execute tool → gather context

  3. LLM synthesis:

    prompt = f"""Question: {question}
    Evidence: {tool_output}
    
    Provide concise answer (<= 3 sentences), cite sources if applicable."""
    
    answer = llm.generate(prompt)
    
  4. Validation: Check answer is substantive (not metadata/format specs)

Confidence Tracking:

Each generated answer includes:

  • answer: The actual answer text

  • confidence: Float 0-1 (LLM self-assessed)

  • reasoning: Justification

Use case: Filter low-confidence questions or use confidence in scoring.
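
For example, filtering on the confidence field might look like this (the threshold is an arbitrary illustration):

def filter_by_confidence(generated, threshold=0.7):
    # Keep only answers the generator was reasonably confident about
    return [g for g in generated if g["confidence"] >= threshold]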

Checkpoint & Resume

Long searches can be interrupted. ProbeLLM supports resumable search:

Checkpoint Structure:

{
  "metadata": {
    "dataset_id": "mmlu",
    "last_simulation": 42,
    "timestamp": "2026-01-27T12:00:00"
  },
  "root_state": {
    "visits": 100,
    "error_count": 25,
    "tree_layer_num": [5, 12, 8]
  },
  "nodes": [
    {
      "id": "syn_mmlu_1_0",
      "parent_id": "root_mmlu",
      "depth": 1,
      "sample": {"query": "...", "ground_truth": "..."},
      "visits": 10,
      "error_count": 3,
      "results": [...]
    },
    ...
  ]
}

Usage:

from probellm import create_checkpoints, resume_from_checkpoint

# Create checkpoint from interrupted run
create_checkpoints("results/run_xxx/")

# Resume
resume_from_checkpoint("results/run_xxx/")

Features:

  • Preserves tree structure + statistics

  • Resumes from exact simulation count

  • Appends to existing results files

  • Rebuilds embeddings.pkl if missing

Token Usage Tracking

ProbeLLM tracks token consumption at every step:

Tracked Operations:

  1. Tool Planning: Selecting which tool to use

  2. Tool Execution: Running tool (e.g., LLM calls inside perturbation)

  3. Candidate Generation: Synthesizing new question

  4. Validation: Checking question validity

  5. Answer Generation: Creating ground truth

  6. Model Inference: Testing model-under-test

  7. Judging: Comparing prediction vs ground truth

Result Format:

{
  "id": "syn_mmlu_2_5",
  "query": "...",
  "prediction": "...",
  "token_usage": {
    "testcase_gen": {
      "plan": {"prompt_tokens": 123, "completion_tokens": 45, "total_tokens": 168},
      "candidate_generation": { ... },
      "validation": { ... },
      "total_tokens": 512
    },
    "answer_gen": {
      "plan": { ... },
      "answer_generation": { ... },
      "total_tokens": 300
    },
    "model_inference": {"total_tokens": 150},
    "judge": {"total_tokens": 200},
    "total_tokens": 1162
  }
}

Use case: Cost analysis, budget allocation, optimization
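
As a simple cost-analysis example, total tokens per stage can be aggregated across a results file. The field names follow the format above; the one-JSON-object-per-line layout and the per-token price are assumptions for illustration.

import json
from collections import Counter

def summarize_tokens(results_path, usd_per_1k_tokens=0.0):
    totals = Counter()
    with open(results_path) as f:
        for line in f:                      # assumes one JSON result per line
            usage = json.loads(line)["token_usage"]
            for stage in ("testcase_gen", "answer_gen", "model_inference", "judge"):
                totals[stage] += usage[stage]["total_tokens"]
            totals["overall"] += usage["total_tokens"]
    cost = totals["overall"] / 1000 * usd_per_1k_tokens
    return dict(totals), cost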

Error Classification

ProbeLLM records why a model failed:

Judge Output:

{
  "correct": false,
  "error_reason": "Model claimed Paris is in Germany, which is factually incorrect.",
  "correct_reason": ""  # Empty for errors
}

Analysis (pcaAnalysisEnhanced.py):

  • Clusters errors by embedding similarity

  • Identifies systematic failure patterns

  • Generates human-readable reports

Example Clusters:

Cluster 0 (23 errors): "Capital city confusion"
  - Model consistently swaps European capitals
  - Likely memorization issue

Cluster 1 (18 errors): "Unit conversion"
  - Fails when converting meters → feet
  - Suggests training data bias (metric vs imperial)
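
The clustering itself can be sketched with scikit-learn; the embedding callable and cluster count are illustrative, and the actual script may differ.

from sklearn.cluster import KMeans

def cluster_error_reasons(error_reasons, embed, n_clusters=5):
    # embed: callable mapping a list of strings to a 2-D array of vectors
    vectors = embed(error_reasons)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vectors)
    clusters = {}
    for reason, label in zip(error_reasons, labels):
        clusters.setdefault(int(label), []).append(reason)
    return clusters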

MCP Tool Protocol

ProbeLLM tools follow Model Context Protocol (MCP) conventions:

Tool Specification:

{
  "name": "tool_name",
  "description": "Human-readable description",
  "inputSchema": {
    "type": "object",
    "properties": {
      "arg1": {"type": "string", "description": "..."},
      "arg2": {"type": "integer", "minimum": 0}
    },
    "required": ["arg1"]
  }
}

Tool Response (JSON-RPC-like envelope):

# Success
{
  "result": {...}  # Tool-specific payload
}

# Error
{
  "error": {
    "code": -32603,
    "message": "Tool execution failed",
    "data": {"details": "..."}
  }
}
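
A small wrapper that produces this envelope might look like the following sketch; ProbeLLM's actual dispatcher may differ.

def call_tool(tool_fn, args):
    # Wrap a tool invocation in a JSON-RPC-like success/error envelope
    try:
        return {"result": tool_fn(**args)}
    except Exception as exc:
        return {
            "error": {
                "code": -32603,          # internal error, per JSON-RPC conventions
                "message": "Tool execution failed",
                "data": {"details": str(exc)},
            }
        }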

Why MCP?

  • Standardized: Interoperable with other MCP-compliant systems

  • Typed: Input schemas enforce validation

  • Extensible: Easy to add new tools without modifying core

Next Steps