Architecture Overview

ProbeLLM follows a modular, extensible architecture where each component can operate independently or be composed into complex workflows.

High-Level Architecture

┌─────────────────────────────────────────────────────────────┐
│                   ProbeLLM Toolkit                          │
└─────────────────────────────────────────────────────────────┘
                            │
     ┌──────────────────────┼──────────────────────┐
     │                      │                      │
┌────▼─────┐          ┌─────▼──────┐         ┌─────▼──────┐
│  Tools   │          │   Search   │         │ Validation │
│  Layer   │◄─────────┤   Engine   │         │   System   │
└────┬─────┘          └─────┬──────┘         └────────────┘
     │                      │
     │               ┌──────▼──────────┐
     │               │   Data Loader   │
     │               └─────────────────┘
     │
┌────▼──────────────────────────────┐
│    MCP Tool Registry (Extensible) │
│  • perturbation  • python_exec    │
│  • web_search    • custom_tool    │
└───────────────────────────────────┘

Core Components

1. Tool Layer (MCP-Based)

Purpose: Provide pluggable test generation strategies

Location: probellm/tools/

Key Classes:

  • ToolRegistry: Central registry managing all available tools

  • LocalMCPTool: Wrapper conforming to MCP protocol

  • build_default_tool_registry(): Factory for built-in tools

Built-in Tools:

Tool Name    | Purpose                             | Use Case
------------ | ----------------------------------- | ---------------------------------
perturbation | Semantic-preserving rewording       | Micro-search (local exploration)
python_exec  | Execute Python code for computation | Math/algorithmic questions
web_search   | Retrieve external knowledge         | Factual questions, macro-search

Extension Point: Add custom tools by registering with ToolRegistry

from probellm.tools import ToolRegistry, LocalMCPTool

def my_tool(args: dict) -> dict:
    # Handler receives the tool's arguments and returns a result payload.
    return {"result": "custom output"}

# MCP-style spec; see the Custom Tool Integration Guide for the full schema.
spec = {
    "name": "my_tool",
    "description": "One-line summary shown to the tool-planning LLM.",
    "inputSchema": {"type": "object", "properties": {}},
}
registry = ToolRegistry()
registry.register(LocalMCPTool(spec, my_tool))

2. Search Engine (MCTS)

Purpose: Intelligently explore the space of test cases

Location: probellm/search.py

Key Classes:

  • VulnerabilityPipelineAsync: Main search orchestrator

  • RootNode: Represents a dataset root

  • SyntheticNode: Generated test case node

  • TestCaseGenerator: Uses tools to create new tests

  • AnswerGenerator: Generates ground truth for synthetic questions

Search Strategies:

MCTS Loop:
┌──────────────────────────────────────┐
│  1. Selection (UCB1)                 │
│     → Choose most promising node     │
├──────────────────────────────────────┤
│  2. Expansion (Tool Selection)       │
│     → Generate new test via tool     │
├──────────────────────────────────────┤
│  3. Simulation (Model Inference)     │
│     → Test model on new case         │
├──────────────────────────────────────┤
│  4. Backpropagation (Update Stats)   │
│     → Update visit counts & errors   │
└──────────────────────────────────────┘
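Selection ranks children by the UCB1 score, which trades off exploitation (the observed error rate of a node's subtree) against exploration (how rarely it has been visited). A minimal sketch of the scoring rule, assuming the visit and error counters maintained by backpropagation; the exact weighting in probellm/search.py may differ:

import math

def ucb1_score(errors: int, visits: int, parent_visits: int,
               c: float = 1.41) -> float:
    # Treat the error rate as the reward, so the search keeps
    # returning to regions of the test space where the model fails.
    if visits == 0:
        return float("inf")  # unvisited children are expanded first
    exploit = errors / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore

# Selection step: descend to the highest-scoring child, e.g.
# best = max(node.children, key=lambda ch: ucb1_score(ch.errors, ch.visits, node.visits))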

Dual-Strategy Search:

  • Micro: perturbation tool → stay in trust region around failures

  • Macro: web_search + greedy-k-center sampling → explore distant semantic spaces (sketched below)
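For the macro strategy, greedy-k-center picks candidates that are maximally spread out in embedding space, so each new seed lands far from everything explored so far. An illustrative farthest-point implementation over numpy embeddings; ProbeLLM's actual sampler may differ in seeding and distance metric:

import numpy as np

def greedy_k_center(embeddings: np.ndarray, k: int) -> list[int]:
    # Farthest-point traversal: each pick maximizes the distance
    # to its nearest already-selected point.
    selected = [0]  # arbitrary seed point
    dist = np.linalg.norm(embeddings - embeddings[0], axis=1)
    while len(selected) < min(k, len(embeddings)):
        idx = int(dist.argmax())  # farthest remaining point
        selected.append(idx)
        new_dist = np.linalg.norm(embeddings - embeddings[idx], axis=1)
        dist = np.minimum(dist, new_dist)  # refresh nearest-selected distances
    return selected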

3. Data Loader

Purpose: Unified interface to benchmark datasets

Location: dataloader/

Key Components:

  • YAMLDatasetLoader: Reads datasets_config.yaml

  • DatasetInterface: Provides structured dataset access (usage sketch below)

  • HierarchicalSampler: Ensures balanced sampling across subsets

  • datasets_config.yaml: Declarative dataset configuration
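A short usage sketch: load_dataset_structured and its return shape appear in the Data Flow section below, while the import path and constructor argument here are assumptions:

from dataloader import DatasetInterface  # import path assumed

interface = DatasetInterface("datasets_config.yaml")  # constructor argument assumed
dataset_index, sample_store = interface.load_dataset_structured("mmlu")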

Supported Datasets (out of the box):

  • MMLU (5 subjects)

  • SuperGLUE (5 tasks)

  • HellaSwag

  • TruthfulQA

  • MBPP (code generation)

Custom Dataset Support: See Custom Datasets Guide

4. Validation System

Purpose: Pre-flight checks before expensive searches

Location: probellm/validate.py, validate_config.py

Checks:

Check Type            | Details
--------------------- | --------------------------------------------
Dependencies          | openai, datasets, numpy, yaml, etc.
Environment Variables | OPENAI_API_KEY, OPENROUTER_API_KEY
YAML Schema           | Valid dataset configuration structure
Hard-coded Secrets    | Warns if API keys found in code (heuristic)
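For illustration, a minimal sketch of what the environment-variable check might look like; the real logic lives in probellm/validate.py and may differ:

import os

REQUIRED_ENV_VARS = ["OPENAI_API_KEY", "OPENROUTER_API_KEY"]

def missing_env_vars() -> list[str]:
    # Report required variables that are unset or empty.
    return [name for name in REQUIRED_ENV_VARS if not os.environ.get(name)]

if missing := missing_env_vars():
    print(f"Missing environment variables: {', '.join(missing)}")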

Usage:

python validate_config.py --limit-datasets 5

Data Flow

Typical Search Flow:

┌──────────────┐
│ User Script  │
└──────┬───────┘
       │
┌──────▼─────────────────────────────────────────┐
│ VulnerabilityPipelineAsync                     │
│  • add_datasets_batch()                        │
│  • run() → _run_concurrent()                   │
└──────┬─────────────────────────────────────────┘
       │
┌──────▼──────────────────────────────────┐
│ DatasetInterface.load_dataset_structured│
│  → Returns (dataset_index, sample_store)│
└──────┬──────────────────────────────────┘
       │
┌──────▼────────────────────────────┐
│ HierarchicalSampler               │
│  → Balanced sampling plan         │
└──────┬────────────────────────────┘
       │
┌──────▼───────────────────────────────┐
│ init_tree_async()                    │
│  → Find initial failures → MCTS root │
└──────┬───────────────────────────────┘
       │
┌──────▼────────────────────────────────────┐
│ MCTS Loop (_mcts_search_async)            │
│  ┌─────────────────────────────────────┐  │
│  │ Select → Expand → Simulate → Back   │  │
│  │   ▲                        │        │  │
│  │   └────────────────────────┘        │  │
│  └─────────────────────────────────────┘  │
│                                            │
│  Expansion uses:                           │
│    • TestCaseGenerator (→ Tool Registry)  │
│    • AnswerGenerator (→ Tool Registry)    │
└──────┬────────────────────────────────────┘
       │
┌──────▼────────────────────┐
│ Results JSON (per dataset)│
│  + Tree visualization     │
└───────────────────────────┘
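Putting the flow together, a hedged user-script sketch: add_datasets_batch() and run() appear in the diagram above, but the constructor arguments and dataset identifiers here are illustrative assumptions:

import asyncio
from probellm.search import VulnerabilityPipelineAsync

async def main():
    pipeline = VulnerabilityPipelineAsync(target_model="gpt-4o-mini")  # args assumed
    pipeline.add_datasets_batch(["mmlu", "truthfulqa"])  # dataset ids assumed
    await pipeline.run()

asyncio.run(main())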

Tool Invocation During Expansion:

TestCaseGenerator.generate_nearby(q, a)
       │
┌──────▼──────────────────────┐
│ _plan_tool_nearby()          │
│  → LLM selects tool          │
└──────┬──────────────────────┘
       │
┌──────▼──────────────────────────────┐
│ ToolRegistry.call_tool(name, args)  │
│  → Executes tool handler            │
└──────┬──────────────────────────────┘
       │
┌──────▼─────────────────────┐
│ Tool result → LLM synthesis│
│  → New question candidate  │
└────────────────────────────┘
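At the bottom of this chain, a tool call is just a registry lookup plus handler execution. Reusing the registry from the Tool Layer example above; the argument payload is an assumption:

result = registry.call_tool("my_tool", {"question": "What is 2 + 2?"})
print(result)  # e.g. {"result": "custom output"}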

Extensibility Points

  1. Custom Tools (Custom Tool Integration Guide)

    • Implement handler function

    • Define MCP-compatible spec

    • Register with ToolRegistry

  2. Custom Datasets (Custom Datasets Guide)

    • Add entry to datasets_config.yaml

    • Specify load_params, key_mapping, ground_truth mapping (YAML example below)

  3. Custom Samplers (dataloader/sampler.py)

    • Inherit from BaseSampler

    • Implement _build_plan() (sketch below)

  4. Custom Search Strategies

    • Override _select() / _expand_async() in VulnerabilityPipelineAsync (sketch below)

    • Or: Use tools to inject custom expansion logic
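An illustrative datasets_config.yaml entry for extensibility point 2: the field names load_params, key_mapping, and ground_truth come from the list above, but every value shown is an assumption:

my_dataset:
  load_params:
    path: my_org/my_dataset   # Hugging Face dataset id (illustrative)
    split: test
  key_mapping:
    question: input_text      # maps dataset columns to ProbeLLM fields
  ground_truth: answer        # column holding the reference answer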
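A skeletal sketch for extensibility point 3: BaseSampler and _build_plan() are named above, but the method's signature, return shape, and the self.num_samples attribute are assumptions:

from dataloader.sampler import BaseSampler  # module path from the list above

class EveryOtherSampler(BaseSampler):  # hypothetical custom sampler
    def _build_plan(self):
        # Assumed contract: return the sample indices to draw.
        return list(range(0, self.num_samples, 2))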
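And a skeletal override sketch for extensibility point 4; the method name _select comes from the list above, but its signature and the node.children attribute are assumptions:

from probellm.search import VulnerabilityPipelineAsync

class DepthFirstPipeline(VulnerabilityPipelineAsync):  # hypothetical subclass
    def _select(self, node):
        # Assumed signature: always descend to the newest child
        # instead of the UCB1 pick.
        return node.children[-1] if node.children else node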

Design Principles

  1. Modularity: Each component has a single responsibility

  2. Extensibility: Easy to add new tools, datasets, samplers

  3. Async-First: Concurrent execution for speed (asyncio)

  4. MCP-Compatible: Tools follow standardized protocol

  5. Library-First: Every feature accessible programmatically

  6. Fail-Safe: Validation and error handling at all layers

Next Steps