Architecture Overview

ProbeLLM follows a modular, extensible architecture where each component can operate independently or be composed into complex workflows.

High-Level Architecture

┌─────────────────────────────────────────────────────────────┐
│                   ProbeLLM Toolkit                          │
└─────────────────────────────────────────────────────────────┘
                            │
     ┌──────────────────────┼──────────────────────┐
     │                      │                      │
┌────▼─────┐          ┌─────▼──────┐         ┌─────▼──────┐
│  Tools   │          │   Search   │         │ Validation │
│  Layer   │◄─────────┤   Engine   │         │   System   │
└────┬─────┘          └─────┬──────┘         └────────────┘
     │                      │
     │               ┌──────▼──────────┐
     │               │   Data Loader   │
     │               └─────────────────┘
     │
┌────▼──────────────────────────────┐
│    MCP Tool Registry (Extensible) │
│  • perturbation  • python_exec    │
│  • web_search    • custom_tool    │
└───────────────────────────────────┘

Core Components

1. Tool Layer (MCP-Based)

Purpose: Provide pluggable test generation strategies

Location: probellm/tools/

Key Classes:

  • ToolRegistry: Central registry managing all available tools

  • LocalMCPTool: Wrapper conforming to MCP protocol

  • build_default_tool_registry(): Factory for built-in tools

Built-in Tools:

Tool Name    | Purpose                             | Use Case
------------ | ----------------------------------- | ---------------------------------
perturbation | Semantic-preserving rewording       | Micro-search (local exploration)
python_exec  | Execute Python code for computation | Math/algorithmic questions
web_search   | Retrieve external knowledge         | Factual questions, macro-search

Extension Point: Add custom tools by registering with ToolRegistry

from probellm.tools import ToolRegistry, LocalMCPTool

def my_tool(args: dict) -> dict:
    # Handler receives the tool's arguments and returns a result payload.
    return {"result": "custom output"}

# MCP-style spec; see the Custom Tool Integration Guide for the full schema.
spec = {
    "name": "my_tool",
    "description": "One-line summary shown to the tool-planning LLM.",
    "inputSchema": {"type": "object", "properties": {}},
}
registry = ToolRegistry()
registry.register(LocalMCPTool(spec, my_tool))

2. Search Engine (MCTS)

Purpose: Intelligently explore the space of test cases

Location: probellm/search.py

Key Classes:

  • VulnerabilityPipelineAsync: Main search orchestrator

  • RootNode: Represents a dataset root

  • SyntheticNode: Generated test case node

  • TestCaseGenerator: Uses tools to create new tests

  • AnswerGenerator: Generates ground truth for synthetic questions

Search Strategies:

MCTS Loop:
┌──────────────────────────────────────┐
│  1. Selection (UCB1)                 │
│     → Choose most promising node     │
├──────────────────────────────────────┤
│  2. Expansion (Tool Selection)       │
│     → Generate new test via tool     │
├──────────────────────────────────────┤
│  3. Simulation (Model Inference)     │
│     → Test model on new case         │
├──────────────────────────────────────┤
│  4. Backpropagation (Update Stats)   │
│     → Update visit counts & errors   │
└──────────────────────────────────────┘
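Selection ranks children by the UCB1 score, which trades off exploitation (the observed error rate of a node's subtree) against exploration (how rarely it has been visited). A minimal sketch of the scoring rule, assuming the visit and error counters maintained by backpropagation; the exact weighting in probellm/search.py may differ:

import math

def ucb1_score(errors: int, visits: int, parent_visits: int,
               c: float = 1.41) -> float:
    # Treat the error rate as the reward, so the search keeps
    # returning to regions of the test space where the model fails.
    if visits == 0:
        return float("inf")  # unvisited children are expanded first
    exploit = errors / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore

# Selection step: descend to the highest-scoring child, e.g.
# best = max(node.children, key=lambda ch: ucb1_score(ch.errors, ch.visits, node.visits))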

Dual-Strategy Search:

  • Micro: perturbation tool → stay in trust region around failures

  • Macro: web_search + greedy-k-center sampling → explore distant semantic spaces (sketched below)
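For the macro strategy, greedy-k-center picks candidates that are maximally spread out in embedding space, so each new seed lands far from everything explored so far. An illustrative farthest-point implementation over numpy embeddings; ProbeLLM's actual sampler may differ in seeding and distance metric:

import numpy as np

def greedy_k_center(embeddings: np.ndarray, k: int) -> list[int]:
    # Farthest-point traversal: each pick maximizes the distance
    # to its nearest already-selected point.
    selected = [0]  # arbitrary seed point
    dist = np.linalg.norm(embeddings - embeddings[0], axis=1)
    while len(selected) < min(k, len(embeddings)):
        idx = int(dist.argmax())  # farthest remaining point
        selected.append(idx)
        new_dist = np.linalg.norm(embeddings - embeddings[idx], axis=1)
        dist = np.minimum(dist, new_dist)  # refresh nearest-selected distances
    return selected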

3. Data Loader

Purpose: Unified interface to benchmark datasets

Location: dataloader/

Key Components:

  • YAMLDatasetLoader: Reads datasets_config.yaml

  • DatasetInterface: Provides structured dataset access (usage sketch below)

  • HierarchicalSampler: Ensures balanced sampling across subsets

  • datasets_config.yaml: Declarative dataset configuration
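A short usage sketch: load_dataset_structured and its return shape appear in the Data Flow section below, while the import path and constructor argument here are assumptions:

from dataloader import DatasetInterface  # import path assumed

interface = DatasetInterface("datasets_config.yaml")  # constructor argument assumed
dataset_index, sample_store = interface.load_dataset_structured("mmlu")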

Supported Datasets (out of the box):

  • MMLU (5 subjects)

  • SuperGLUE (5 tasks)

  • HellaSwag

  • TruthfulQA

  • MBPP (code generation)

Custom Dataset Support: See Custom Datasets Guide

4. Validation System

Purpose: Pre-flight checks before expensive searches

Location: probellm/validate.py, validate_config.py

Checks:

Check Type            | Details
--------------------- | --------------------------------------------
Dependencies          | openai, datasets, numpy, yaml, etc.
Environment Variables | OPENAI_API_KEY, OPENROUTER_API_KEY
YAML Schema           | Valid dataset configuration structure
Hard-coded Secrets    | Warns if API keys found in code (heuristic)
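For illustration, a minimal sketch of what the environment-variable check might look like; the real logic lives in probellm/validate.py and may differ:

import os

REQUIRED_ENV_VARS = ["OPENAI_API_KEY", "OPENROUTER_API_KEY"]

def missing_env_vars() -> list[str]:
    # Report required variables that are unset or empty.
    return [name for name in REQUIRED_ENV_VARS if not os.environ.get(name)]

if missing := missing_env_vars():
    print(f"Missing environment variables: {', '.join(missing)}")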

Usage:

python validate_config.py --limit-datasets 5

Data Flow

Typical Search Flow:

┌──────────────┐
│ User Script  │
└──────┬───────┘
       │
┌──────▼─────────────────────────────────────────┐
│ VulnerabilityPipelineAsync                     │
│  • add_datasets_batch()                        │
│  • run() → _run_concurrent()                   │
└──────┬─────────────────────────────────────────┘
       │
┌──────▼──────────────────────────────────┐
│ DatasetInterface.load_dataset_structured│
│  → Returns (dataset_index, sample_store)│
└──────┬──────────────────────────────────┘
       │
┌──────▼────────────────────────────┐
│ HierarchicalSampler               │
│  → Balanced sampling plan         │
└──────┬────────────────────────────┘
       │
┌──────▼───────────────────────────────┐
│ init_tree_async()                    │
│  → Find initial failures → MCTS root │
└──────┬───────────────────────────────┘
       │
┌──────▼────────────────────────────────────┐
│ MCTS Loop (_mcts_search_async)            │
│  ┌─────────────────────────────────────┐  │
│  │ Select → Expand → Simulate → Back   │  │
│  │   ▲                        │        │  │
│  │   └────────────────────────┘        │  │
│  └─────────────────────────────────────┘  │
│                                            │
│  Expansion uses:                           │
│    • TestCaseGenerator (→ Tool Registry)  │
│    • AnswerGenerator (→ Tool Registry)    │
└──────┬────────────────────────────────────┘
       │
┌──────▼────────────────────┐
│ Results JSON (per dataset)│
│  + Tree visualization     │
└───────────────────────────┘
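Putting the flow together, a hedged user-script sketch: add_datasets_batch() and run() appear in the diagram above, but the constructor arguments and dataset identifiers here are illustrative assumptions:

import asyncio
from probellm.search import VulnerabilityPipelineAsync

async def main():
    pipeline = VulnerabilityPipelineAsync(target_model="gpt-4o-mini")  # args assumed
    pipeline.add_datasets_batch(["mmlu", "truthfulqa"])  # dataset ids assumed
    await pipeline.run()

asyncio.run(main())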

Tool Invocation During Expansion:

TestCaseGenerator.generate_nearby(q, a)
       │
┌──────▼──────────────────────┐
│ _plan_tool_nearby()          │
│  → LLM selects tool          │
└──────┬──────────────────────┘
       │
┌──────▼──────────────────────────────┐
│ ToolRegistry.call_tool(name, args)  │
│  → Executes tool handler            │
└──────┬──────────────────────────────┘
       │
┌──────▼─────────────────────┐
│ Tool result → LLM synthesis│
│  → New question candidate  │
└────────────────────────────┘
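At the bottom of this chain, a tool call is just a registry lookup plus handler execution. Reusing the registry from the Tool Layer example above; the argument payload is an assumption:

result = registry.call_tool("my_tool", {"question": "What is 2 + 2?"})
print(result)  # e.g. {"result": "custom output"}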

Extensibility Points

  1. Custom Tools (Custom Tool Integration Guide)

    • Implement handler function

    • Define MCP-compatible spec

    • Register with ToolRegistry

  2. Custom Datasets (Custom Datasets Guide)

    • Add entry to datasets_config.yaml

    • Specify load_params, key_mapping, ground_truth mapping (YAML example below)

  3. Custom Samplers (dataloader/sampler.py)

    • Inherit from BaseSampler

    • Implement _build_plan() (sketch below)

  4. Custom Search Strategies

    • Override _select() / _expand_async() in VulnerabilityPipelineAsync (sketch below)

    • Or: Use tools to inject custom expansion logic
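An illustrative datasets_config.yaml entry for extensibility point 2: the field names load_params, key_mapping, and ground_truth come from the list above, but every value shown is an assumption:

my_dataset:
  load_params:
    path: my_org/my_dataset   # Hugging Face dataset id (illustrative)
    split: test
  key_mapping:
    question: input_text      # maps dataset columns to ProbeLLM fields
  ground_truth: answer        # column holding the reference answer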
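A skeletal sketch for extensibility point 3: BaseSampler and _build_plan() are named above, but the method's signature, return shape, and the self.num_samples attribute are assumptions:

from dataloader.sampler import BaseSampler  # module path from the list above

class EveryOtherSampler(BaseSampler):  # hypothetical custom sampler
    def _build_plan(self):
        # Assumed contract: return the sample indices to draw.
        return list(range(0, self.num_samples, 2))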
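And a skeletal override sketch for extensibility point 4; the method name _select comes from the list above, but its signature and the node.children attribute are assumptions:

from probellm.search import VulnerabilityPipelineAsync

class DepthFirstPipeline(VulnerabilityPipelineAsync):  # hypothetical subclass
    def _select(self, node):
        # Assumed signature: always descend to the newest child
        # instead of the UCB1 pick.
        return node.children[-1] if node.children else node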

Design Principles

  1. Modularity: Each component has a single responsibility

  2. Extensibility: Easy to add new tools, datasets, samplers

  3. Async-First: Concurrent execution for speed (asyncio)

  4. MCP-Compatible: Tools follow standardized protocol

  5. Library-First: Every feature accessible programmatically

  6. Fail-Safe: Validation and error handling at all layers

Next Steps