Core Concepts
Understanding these concepts will help you effectively use ProbeLLM and extend it for your needs.
Monte Carlo Tree Search (MCTS)
Why MCTS for Vulnerability Detection?
Traditional fuzzing generates random inputs, but an LLM's input space is vast and semantically structured, so purely random sampling is inefficient. MCTS addresses this with:
Guided Exploration: Uses UCB1 to balance exploration vs exploitation
Adaptive Budget: Focuses compute on promising failure regions
Tree Structure: Maintains history for analysis and replay
MCTS Phases in ProbeLLM:
Phase 1: Selection
──────────────────
Start from root → navigate tree using UCB1
UCB1(node) = error_rate + C × sqrt(ln(parent_visits) / node_visits)
→ Selects nodes with high error_rate OR low visit count
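A minimal selection sketch, assuming a node object with visits, error_count, parent, and children attributes (illustrative names, not ProbeLLM's actual API):
import math
def ucb1(node, c=1.4):
    if node.visits == 0:
        return float("inf")  # always try unvisited children first
    error_rate = node.error_count / node.visits
    return error_rate + c * math.sqrt(math.log(node.parent.visits) / node.visits)
def select(root):
    # Walk down from the root, always following the highest-scoring child
    node = root
    while node.children:
        node = max(node.children, key=ucb1)
    return node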
Phase 2: Expansion
──────────────────
At selected node:
1. Tool planning: LLM chooses tool (perturbation/python_exec/web_search)
2. Tool execution: Generate new test case
3. Answer generation: Create ground truth for synthetic question
4. Add child node to tree
Phase 3: Simulation
────────────────────
Test the target model:
1. Send query to model-under-test
2. Compare prediction vs ground truth (using judge LLM)
3. Record: correct / error + reasoning
Phase 4: Backpropagation
─────────────────────────
Update statistics from leaf → root:
- visits += 1
- error_count += 1 (if error detected)
→ Future selections favor branches with high error rates
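A sketch of this update, under the same assumed node attributes as in the selection sketch:
def backpropagate(leaf, error_detected):
    # Propagate the simulation outcome from the tested node back to the root
    node = leaf
    while node is not None:
        node.visits += 1
        if error_detected:
            node.error_count += 1
        node = node.parent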
Stopping Criteria:
- Reaches num_simulations iterations
- Exhausts the sample budget (num_samples)
- Reaches the maximum tree depth (max_depth)
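Roughly, these map onto a global budget check plus a per-branch depth check (a sketch; the config keys mirror the parameters above):
def budget_exhausted(sim_count, samples_used, cfg):
    # Global stop: iteration budget or sample budget used up
    return sim_count >= cfg["num_simulations"] or samples_used >= cfg["num_samples"]
def can_expand(node, cfg):
    # Per-branch stop: never grow a branch past the maximum tree depth
    return node.depth < cfg["max_depth"]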
Micro vs Macro Search
ProbeLLM uses dual-strategy MCTS to balance local exploitation and global exploration.
Micro Search (Local Exploitation)
Goal: Find variations of known failures
Strategy:
- Uses perturbation tool (paraphrasing, reformulation)
- Stays in “trust region” around existing failures
- Preserves semantic content but changes surface form
Example:
Original failure:
Q: "What is the capital of France?"
Model: "London" ❌
Micro-generated variations:
- "Name the capital city of France."
- "France's capital is which city?"
- "Multiple choice: France's capital: (A) Berlin (B) Paris (C) Rome"
When to use: When you have a known failure and want to understand its robustness boundary.
Macro Search (Global Exploration)
Goal: Discover distant failure modes
Strategy:
- Uses web_search tool + greedy-k-center sampling
- Selects topics maximally different from seen failures (embedding distance)
- Generates questions on novel domains
Example:
Existing failures (MMLU physics):
- Kinematics, thermodynamics, optics
Macro-generated topics:
- Quantum entanglement (far from classical mechanics)
- Astrophysics (different scale, different intuitions)
When to use: When you want to expand test coverage beyond the initial failure set.
Greedy-K-Center Algorithm:
# Greedy k-center sketch: pick k seen failures that are maximally spread out
# in embedding space (embed() is assumed to return a numpy vector per question;
# seen_failures and k come from the caller).
import random
import numpy as np
embeddings = [embed(q) for q in seen_failures]
selected = [random.choice(embeddings)]
for _ in range(k - 1):
    # For each candidate, distance to its nearest already-selected point
    distances = [min(np.linalg.norm(e - s) for s in selected) for e in embeddings]
    # Take the candidate farthest from everything selected so far
    selected.append(embeddings[int(np.argmax(distances))])
# The selected failures seed the web_search tool,
# which generates questions on maximally distant topics.
Tool Selection Strategy
During expansion, ProbeLLM uses an LLM-based planner to choose the appropriate tool.
Planning Prompt:
"You are a tool planning expert. Based on the question, choose the best tool.
Available tools: perturbation, python_exec, web_search
Return JSON: {tool: ..., args: {...}, purpose: ...}"
Decision Factors:
| Tool | When Selected | Example Trigger |
|---|---|---|
| perturbation | Nearby search, paraphrasing | “similar question but different form” |
| python_exec | Computational/algorithmic tasks | “needs calculation”, “code execution” |
| web_search | Factual knowledge, far search | “needs external knowledge”, “novel domain” |
Why LLM planning?
Adaptive: Tool choice depends on question characteristics
Explainable: purpose field explains the reasoning
Extensible: Add new tools → planner automatically considers them
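For illustration, a planner call might look like the following (the prompt wrapper and helper are assumptions, not ProbeLLM's actual interface):
import json
def plan_tool(llm, question):
    prompt = (
        "You are a tool planning expert. Based on the question, choose the best tool.\n"
        "Available tools: perturbation, python_exec, web_search\n"
        'Return JSON: {"tool": ..., "args": {...}, "purpose": ...}\n\n'
        "Question: " + question
    )
    plan = json.loads(llm.generate(prompt))  # e.g. {"tool": "web_search", "args": {...}, "purpose": "..."}
    if plan["tool"] not in {"perturbation", "python_exec", "web_search"}:
        raise ValueError(f"planner returned unknown tool: {plan['tool']}")
    return plan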
Test Case Validity
Not all generated test cases are valid. ProbeLLM uses multi-stage validation:
Generation-Time Validation (TestCaseGenerator):
def generate_nearby(q, a):
    candidate = llm.generate(...)
    verdict = llm.check(candidate)  # {valid: bool, reasons: [str]}
    if not verdict["valid"]:
        # Log failure, potentially retry
        pass
    return candidate
Answer-Time Validation (AnswerGenerator):
- Ensures the answer is non-empty
- Detects format specification leakage (e.g., “mapping: {0: ‘A’, …}”)
- Retries with feedback if invalid (a minimal check is sketched after this list)
Execution-Time Validation:
- For python_exec: timeout (6s), sandbox constraints
- LLM-based error correction (up to 3 retries)
Judge-Time Validation:
- Strict factual equivalence check
- Lenient on formatting/wording
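A minimal sketch of such an answer check, using simple string heuristics (an assumption, not the actual AnswerGenerator logic):
def is_valid_answer(answer: str):
    if not answer or not answer.strip():
        return False, ["answer is empty"]
    reasons = []
    # Crude heuristic for format-specification leakage such as "mapping: {0: 'A', ...}"
    if "mapping:" in answer.lower() or answer.lower().startswith("format:"):
        reasons.append("answer leaks the format specification instead of content")
    return len(reasons) == 0, reasons
valid, reasons = is_valid_answer("mapping: {0: 'A', 1: 'B'}")  # -> (False, [...])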
Ground Truth Generation
For synthetic test cases, we need ground truth. ProbeLLM uses:
Strategy:
1. Tool selection: Choose web_search (factual) or python_exec (computational)
2. Evidence retrieval: Execute tool → gather context
3. LLM synthesis:
   prompt = f"""Question: {question}
   Evidence: {tool_output}
   Provide concise answer (<= 3 sentences), cite sources if applicable."""
   answer = llm.generate(prompt)
4. Validation: Check answer is substantive (not metadata/format specs)
Confidence Tracking:
Each generated answer includes:
- answer: The actual answer text
- confidence: Float 0-1 (LLM self-assessed)
- reasoning: Justification
Use case: Filter low-confidence questions or use confidence in scoring.
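For example, low-confidence ground truths can be dropped before scoring (a sketch; the record fields follow the list above, and the 0.7 threshold is an arbitrary assumption):
def filter_by_confidence(test_cases, min_confidence=0.7):
    # Keep only test cases whose generated ground truth is self-assessed as reliable
    return [tc for tc in test_cases if tc["confidence"] >= min_confidence]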
Checkpoint & Resume
Long searches can be interrupted. ProbeLLM supports resumable search:
Checkpoint Structure:
{
"metadata": {
"dataset_id": "mmlu",
"last_simulation": 42,
"timestamp": "2026-01-27T12:00:00"
},
"root_state": {
"visits": 100,
"error_count": 25,
"tree_layer_num": [5, 12, 8]
},
"nodes": [
{
"id": "syn_mmlu_1_0",
"parent_id": "root_mmlu",
"depth": 1,
"sample": {"query": "...", "ground_truth": "..."},
"visits": 10,
"error_count": 3,
"results": [...]
},
...
]
}
Usage:
from probellm import create_checkpoints, resume_from_checkpoint
# Create checkpoint from interrupted run
create_checkpoints("results/run_xxx/")
# Resume
resume_from_checkpoint("results/run_xxx/")
Features:
Preserves tree structure + statistics
Resumes from exact simulation count
Appends to existing results files
Rebuilds embeddings.pkl if missing
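To sanity-check a checkpoint before resuming, the JSON can be read directly (a sketch; the checkpoint.json file name is an assumption, the fields follow the structure above):
import json
with open("results/run_xxx/checkpoint.json") as f:
    ckpt = json.load(f)
print(ckpt["metadata"]["last_simulation"])   # simulation count to resume from
print(len(ckpt["nodes"]), "nodes in tree")
print(ckpt["root_state"]["tree_layer_num"])  # nodes per depth level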
Token Usage Tracking
ProbeLLM tracks token consumption at every step:
Tracked Operations:
Tool Planning: Selecting which tool to use
Tool Execution: Running the tool (e.g., LLM calls inside perturbation)
Candidate Generation: Synthesizing a new question
Validation: Checking question validity
Answer Generation: Creating ground truth
Model Inference: Testing model-under-test
Judging: Comparing prediction vs ground truth
Result Format:
{
"id": "syn_mmlu_2_5",
"query": "...",
"prediction": "...",
"token_usage": {
"testcase_gen": {
"plan": {"prompt_tokens": 123, "completion_tokens": 45, "total_tokens": 168},
"candidate_generation": { ... },
"validation": { ... },
"total_tokens": 512
},
"answer_gen": {
"plan": { ... },
"answer_generation": { ... },
"total_tokens": 300
},
"model_inference": {"total_tokens": 150},
"judge": {"total_tokens": 200},
"total_tokens": 1162
}
}
Use case: Cost analysis, budget allocation, optimization
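For example, a run's total spend can be tallied straight from these records (a sketch; the results.json path is an assumption, the fields follow the format above):
import json
with open("results/run_xxx/results.json") as f:
    results = json.load(f)
total = sum(r["token_usage"]["total_tokens"] for r in results)
judge = sum(r["token_usage"]["judge"]["total_tokens"] for r in results)
print(f"total tokens: {total} (judge share: {judge / total:.1%})")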
Error Classification
ProbeLLM records why a model failed:
Judge Output:
{
"correct": false,
"error_reason": "Model claimed Paris is in Germany, which is factually incorrect.",
"correct_reason": "" # Empty for errors
}
Analysis (pcaAnalysisEnhanced.py):
Clusters errors by embedding similarity
Identifies systematic failure patterns
Generates human-readable reports
Example Clusters:
Cluster 0 (23 errors): "Capital city confusion"
- Model consistently swaps European capitals
- Likely memorization issue
Cluster 1 (18 errors): "Unit conversion"
- Fails when converting meters → feet
- Suggests training data bias (metric vs imperial)
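A simplified stand-in for that clustering step (illustrative only, not the actual pcaAnalysisEnhanced.py code; embed() is assumed to return one vector per text, and results to hold judge records like the one above):
import numpy as np
from sklearn.cluster import KMeans
# Embed the judge's error explanations, then group similar failures
reasons = [r["error_reason"] for r in results if not r["correct"]]
X = np.array([embed(reason) for reason in reasons])
labels = KMeans(n_clusters=5, n_init="auto").fit_predict(X)
for cluster_id in sorted(set(labels)):
    members = [reasons[i] for i, lbl in enumerate(labels) if lbl == cluster_id]
    print(f"Cluster {cluster_id}: {len(members)} errors, e.g. {members[0][:80]}")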
MCP Tool Protocol
ProbeLLM tools follow Model Context Protocol (MCP) conventions:
Tool Specification:
{
"name": "tool_name",
"description": "Human-readable description",
"inputSchema": {
"type": "object",
"properties": {
"arg1": {"type": "string", "description": "..."},
"arg2": {"type": "integer", "minimum": 0}
},
"required": ["arg1"]
}
}
Tool Response (JSON-RPC-like envelope):
# Success
{
"result": {...} # Tool-specific payload
}
# Error
{
"error": {
"code": -32603,
"message": "Tool execution failed",
"data": {"details": "..."}
}
}
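Client code can branch on this envelope before using a tool's output (a minimal sketch; ToolError is a hypothetical exception type):
class ToolError(Exception):
    """Raised when a tool returns an MCP-style error envelope."""
def unwrap(response: dict):
    # Error envelopes carry "error"; success envelopes carry "result"
    if "error" in response:
        err = response["error"]
        raise ToolError(f"{err['code']}: {err['message']} ({err.get('data')})")
    return response["result"]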
Why MCP?
Standardized: Interoperable with other MCP-compliant systems
Typed: Input schemas enforce validation
Extensible: Easy to add new tools without modifying core
Next Steps
Quickstart: Hands-on tutorial
Tool System (MCP-Based): Deep dive into tool system
MCTS Search Engine: MCTS implementation details
Custom Tool Integration Guide: Build your own tools