Core Concepts

Understanding these concepts will help you effectively use ProbeLLM and extend it for your needs.

Monte Carlo Tree Search (MCTS)

Why MCTS for Vulnerability Detection?

Traditional fuzzing generates random inputs, but the input space of an LLM is vast and semantically structured, so unguided sampling rarely surfaces failures. MCTS addresses this by:

  1. Guided Exploration: Uses UCB1 to balance exploration vs exploitation

  2. Adaptive Budget: Focuses compute on promising failure regions

  3. Tree Structure: Maintains history for analysis and replay

MCTS Phases in ProbeLLM:

Phase 1: Selection
──────────────────
Start from root → navigate tree using UCB1
UCB1(node) = error_rate + C × sqrt(ln(parent_visits) / node_visits)

→ Selects nodes with high error_rate OR low visit count
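
To make the selection step concrete, here is a minimal sketch in Python. It assumes an illustrative node object with visits, error_count, and children attributes and an exploration constant C; these names are for illustration only and are not ProbeLLM's actual classes.

import math

C = 1.4  # exploration constant (illustrative value)

def ucb1(node, parent_visits):
    # Unvisited nodes score infinity so they are tried at least once
    if node.visits == 0:
        return float("inf")
    error_rate = node.error_count / node.visits
    return error_rate + C * math.sqrt(math.log(parent_visits) / node.visits)

def select(root):
    # Walk from the root to a leaf, always following the highest-UCB1 child
    node = root
    while node.children:
        node = max(node.children, key=lambda c: ucb1(c, node.visits))
    return node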

Phase 2: Expansion
──────────────────
At selected node:
1. Tool planning: LLM chooses tool (perturbation/python_exec/web_search)
2. Tool execution: Generate new test case
3. Answer generation: Create ground truth for synthetic question
4. Add child node to tree

Phase 3: Simulation
────────────────────
Test the target model:
1. Send query to model-under-test
2. Compare prediction vs ground truth (using judge LLM)
3. Record: correct / error + reasoning

Phase 4: Backpropagation
─────────────────────────
Update statistics from leaf → root:
- visits += 1
- error_count += 1 (if error detected)

→ Future selections favor branches with high error rates
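
A sketch of the backpropagation step, assuming each node keeps a parent reference (again, illustrative names rather than ProbeLLM's actual API):

def backpropagate(leaf, error_detected):
    # Update statistics from the simulated leaf back up to the root
    node = leaf
    while node is not None:
        node.visits += 1
        if error_detected:
            node.error_count += 1
        node = node.parent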

Stopping Criteria:

  • Completes num_simulations iterations

  • Exhausts the sample budget (num_samples)

  • Reaches the maximum tree depth (max_depth)
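
Putting the phases and stopping criteria together, the outer search loop looks roughly like this. The parameter names follow the configuration above; select and backpropagate are the sketches from the earlier phases, and expand/simulate are placeholders for Phases 2 and 3.

def search(root, config):
    samples_used = 0
    for sim in range(config.num_simulations):
        if samples_used >= config.num_samples:
            break                          # sample budget exhausted
        leaf = select(root)                # Phase 1
        if leaf.depth >= config.max_depth:
            continue                       # do not expand past max tree depth
        child = expand(leaf)               # Phase 2 (placeholder)
        error = simulate(child)            # Phase 3 (placeholder)
        backpropagate(child, error)        # Phase 4
        samples_used += 1
    return root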

Tool Selection Strategy

During expansion, ProbeLLM uses an LLM-based planner to choose the appropriate tool.

Planning Prompt:

"You are a tool planning expert. Based on the question, choose the best tool.

Available tools: perturbation, python_exec, web_search

Return JSON: {tool: ..., args: {...}, purpose: ...}"
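
A minimal sketch of how such a planner call might look, assuming a generic chat-completion callable and the prompt above; the chat function and the exact prompt wording are placeholders, not ProbeLLM's actual API.

import json

PLANNING_PROMPT = """You are a tool planning expert. Based on the question, choose the best tool.

Available tools: perturbation, python_exec, web_search

Question: {question}

Return JSON: {{"tool": "...", "args": {{}}, "purpose": "..."}}"""

def plan_tool(chat, question):
    # Ask the planner LLM for a structured tool choice
    raw = chat(PLANNING_PROMPT.format(question=question))
    plan = json.loads(raw)  # expected keys: tool, args, purpose
    assert plan["tool"] in {"perturbation", "python_exec", "web_search"}
    return plan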

Decision Factors:

| Tool | When Selected | Example Trigger |
|------|---------------|-----------------|
| perturbation | Nearby search, paraphrasing | “similar question but different form” |
| python_exec | Computational/algorithmic tasks | “needs calculation”, “code execution” |
| web_search | Factual knowledge, far search | “needs external knowledge”, “novel domain” |

Why LLM planning?

  • Adaptive: Tool choice depends on question characteristics

  • Explainable: purpose field explains reasoning

  • Extensible: Add new tools → planner automatically considers them

Test Case Validity

Not all generated test cases are valid. ProbeLLM uses multi-stage validation:

  1. Generation-Time Validation (TestCaseGenerator):

    def generate_nearby(q, a):
        candidate = llm.generate(...)
        verdict = llm.check(candidate)  # {valid: bool, reasons: [str]}
    
        if not verdict["valid"]:
            # Log failure, potentially retry
            pass
    
        return candidate
    
  2. Answer-Time Validation (AnswerGenerator):

    • Ensures answer is non-empty

    • Detects format specification leakage (e.g., “mapping: {0: ‘A’, …}”)

    • Retry with feedback if invalid

  3. Execution-Time Validation:

    • For python_exec: Timeout (6s), sandbox constraints

    • LLM-based error correction (up to 3 retries)

  4. Judge-Time Validation:

    • Strict factual equivalence check

    • Lenient on formatting/wording
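
To make the retry-with-feedback flow in stages 2 and 3 concrete, here is a sketch; generate_answer and is_valid_answer are hypothetical helpers, not ProbeLLM's actual functions.

def generate_with_retries(question, generate_answer, is_valid_answer, max_retries=3):
    feedback = ""
    for attempt in range(max_retries):
        answer = generate_answer(question, feedback)
        ok, reason = is_valid_answer(answer)   # e.g. non-empty, no format-spec leakage
        if ok:
            return answer
        # Feed the failure reason back into the next generation attempt
        feedback = f"Previous answer rejected: {reason}"
    return None  # give up; the caller may discard this test case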

Ground Truth Generation

Synthetic test cases have no reference answers, so ProbeLLM generates its own ground truth:

Strategy:

  1. Tool selection: Choose web_search (factual) or python_exec (computational)

  2. Evidence retrieval: Execute tool → gather context

  3. LLM synthesis:

    prompt = f"""Question: {question}
    Evidence: {tool_output}
    
    Provide concise answer (<= 3 sentences), cite sources if applicable."""
    
    answer = llm.generate(prompt)
    
  4. Validation: Check answer is substantive (not metadata/format specs)

Confidence Tracking:

Each generated answer includes:

  • answer: The actual answer text

  • confidence: Float 0-1 (LLM self-assessed)

  • reasoning: Justification

Use case: Filter low-confidence questions or use confidence in scoring.
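
For example, filtering on the confidence field might look like this (the threshold is an arbitrary illustration):

def filter_by_confidence(generated, threshold=0.7):
    # Keep only answers the generator was reasonably confident about
    return [g for g in generated if g["confidence"] >= threshold]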

Checkpoint & Resume

Long searches can be interrupted. ProbeLLM supports resumable search:

Checkpoint Structure:

{
  "metadata": {
    "dataset_id": "mmlu",
    "last_simulation": 42,
    "timestamp": "2026-01-27T12:00:00"
  },
  "root_state": {
    "visits": 100,
    "error_count": 25,
    "tree_layer_num": [5, 12, 8]
  },
  "nodes": [
    {
      "id": "syn_mmlu_1_0",
      "parent_id": "root_mmlu",
      "depth": 1,
      "sample": {"query": "...", "ground_truth": "..."},
      "visits": 10,
      "error_count": 3,
      "results": [...]
    },
    ...
  ]
}

Usage:

from probellm import create_checkpoints, resume_from_checkpoint

# Create checkpoint from interrupted run
create_checkpoints("results/run_xxx/")

# Resume
resume_from_checkpoint("results/run_xxx/")

Features:

  • Preserves tree structure + statistics

  • Resumes from exact simulation count

  • Appends to existing results files

  • Rebuilds embeddings.pkl if missing

Token Usage Tracking

ProbeLLM tracks token consumption at every step:

Tracked Operations:

  1. Tool Planning: Selecting which tool to use

  2. Tool Execution: Running tool (e.g., LLM calls inside perturbation)

  3. Candidate Generation: Synthesizing new question

  4. Validation: Checking question validity

  5. Answer Generation: Creating ground truth

  6. Model Inference: Testing model-under-test

  7. Judging: Comparing prediction vs ground truth

Result Format:

{
  "id": "syn_mmlu_2_5",
  "query": "...",
  "prediction": "...",
  "token_usage": {
    "testcase_gen": {
      "plan": {"prompt_tokens": 123, "completion_tokens": 45, "total_tokens": 168},
      "candidate_generation": { ... },
      "validation": { ... },
      "total_tokens": 512
    },
    "answer_gen": {
      "plan": { ... },
      "answer_generation": { ... },
      "total_tokens": 300
    },
    "model_inference": {"total_tokens": 150},
    "judge": {"total_tokens": 200},
    "total_tokens": 1162
  }
}

Use case: Cost analysis, budget allocation, optimization
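
As a simple cost-analysis example, total tokens per stage can be aggregated across a results file. The field names follow the format above; the one-JSON-object-per-line layout and the per-token price are assumptions for illustration.

import json
from collections import Counter

def summarize_tokens(results_path, usd_per_1k_tokens=0.0):
    totals = Counter()
    with open(results_path) as f:
        for line in f:                      # assumes one JSON result per line
            usage = json.loads(line)["token_usage"]
            for stage in ("testcase_gen", "answer_gen", "model_inference", "judge"):
                totals[stage] += usage[stage]["total_tokens"]
            totals["overall"] += usage["total_tokens"]
    cost = totals["overall"] / 1000 * usd_per_1k_tokens
    return dict(totals), cost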

Error Classification

ProbeLLM records why a model failed:

Judge Output:

{
  "correct": false,
  "error_reason": "Model claimed Paris is in Germany, which is factually incorrect.",
  "correct_reason": ""  # Empty for errors
}

Analysis (pcaAnalysisEnhanced.py):

  • Clusters errors by embedding similarity

  • Identifies systematic failure patterns

  • Generates human-readable reports

Example Clusters:

Cluster 0 (23 errors): "Capital city confusion"
  - Model consistently swaps European capitals
  - Likely memorization issue

Cluster 1 (18 errors): "Unit conversion"
  - Fails when converting meters → feet
  - Suggests training data bias (metric vs imperial)
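
The clustering itself can be sketched with scikit-learn; the embedding callable and cluster count are illustrative, and the actual script may differ.

from sklearn.cluster import KMeans

def cluster_error_reasons(error_reasons, embed, n_clusters=5):
    # embed: callable mapping a list of strings to a 2-D array of vectors
    vectors = embed(error_reasons)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vectors)
    clusters = {}
    for reason, label in zip(error_reasons, labels):
        clusters.setdefault(int(label), []).append(reason)
    return clusters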

MCP Tool Protocol

ProbeLLM tools follow Model Context Protocol (MCP) conventions:

Tool Specification:

{
  "name": "tool_name",
  "description": "Human-readable description",
  "inputSchema": {
    "type": "object",
    "properties": {
      "arg1": {"type": "string", "description": "..."},
      "arg2": {"type": "integer", "minimum": 0}
    },
    "required": ["arg1"]
  }
}

Tool Response (JSON-RPC-like envelope):

# Success
{
  "result": {...}  # Tool-specific payload
}

# Error
{
  "error": {
    "code": -32603,
    "message": "Tool execution failed",
    "data": {"details": "..."}
  }
}
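
A small wrapper that produces this envelope might look like the following sketch; ProbeLLM's actual dispatcher may differ.

def call_tool(tool_fn, args):
    # Wrap a tool invocation in a JSON-RPC-like success/error envelope
    try:
        return {"result": tool_fn(**args)}
    except Exception as exc:
        return {
            "error": {
                "code": -32603,          # internal error, per JSON-RPC conventions
                "message": "Tool execution failed",
                "data": {"details": str(exc)},
            }
        }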

Why MCP?

  • Standardized: Interoperable with other MCP-compliant systems

  • Typed: Input schemas enforce validation

  • Extensible: Easy to add new tools without modifying core

Next Steps