# Custom Datasets Guide

This guide walks through adding custom datasets to ProbeLLM's data loading system using a YAML-based configuration.
## Example: Adding TriviaQA Dataset

We'll add the TriviaQA question-answering dataset to demonstrate the complete workflow.

### Step 1: Examine the Dataset Structure

First, inspect the dataset on HuggingFace to understand its schema:
```python
from datasets import load_dataset

ds = load_dataset("trivia_qa", "rc")
print(ds['train'][0])
# Output:
# {
#   'question': 'What is the capital of France?',
#   'answer': {'aliases': ['Paris', 'paris'], 'value': 'Paris'},
#   'entity_pages': {...},
#   ...
# }
```
Key fields identified:

- `question`: The query text
- `answer.value`: The ground-truth answer (a nested field; see the check below)
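A dot path like `answer.value` refers to a nested field, which the loader presumably resolves with successive dictionary lookups. You can verify the value exists by walking the path yourself:

```python
sample = ds['train'][0]

# Walk the dot path manually: 'answer.value' -> sample['answer']['value']
value = sample
for key in "answer.value".split("."):
    value = value[key]
print(value)  # 'Paris'
```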
### Step 2: Write the YAML Configuration

Add the dataset definition to `dataloader/datasets_config.yaml`:
```yaml
datasets:
  - id: "trivia_qa"
    description: "TriviaQA - Reading Comprehension with Evidence"
    load_params:
      path: "trivia_qa"
      name: "rc"
    key_mapping:
      query:
        template: |
          Question: {question}
          Please provide a concise answer.
        fields: ["question"]
      ground_truth: "answer.value"
```
**Configuration Components:**

- `id`: Unique identifier used in the pipeline
- `load_params`: Passed directly to HuggingFace's `load_dataset()`
- `key_mapping`: Transforms raw fields to the standardized format (see the sketch below)
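Conceptually, `key_mapping` turns a raw HuggingFace record into a standardized `{query, ground_truth}` pair. A minimal sketch of that transform, with the actual implementation living in `dataloader/dataset_loader.py`:

```python
# Conceptual sketch of the key_mapping transform (illustrative only;
# the real logic lives in dataloader/dataset_loader.py).
raw = {
    "question": "What is the capital of France?",
    "answer": {"aliases": ["Paris", "paris"], "value": "Paris"},
}

template = "Question: {question}\nPlease provide a concise answer."
standardized = {
    "query": template.format(question=raw["question"]),
    "ground_truth": raw["answer"]["value"],  # resolved from "answer.value"
}
print(standardized["ground_truth"])  # 'Paris'
```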
### Step 3: Validate the Configuration

Run the configuration validator:

```bash
python validate_config.py --datasets-config dataloader/datasets_config.yaml
```
### Step 4: Test Data Loading

Verify the dataset loads correctly:
```python
from dataloader import YAMLDatasetLoader, DatasetInterface

loader = YAMLDatasetLoader()
interface = DatasetInterface(loader)

# Load 10 samples
index, store = interface.load_dataset_structured("trivia_qa", num=10)

# Inspect results
print(f"Loaded {len(store)} samples")
for sample_id in list(store.keys())[:3]:
    sample = store[sample_id]
    print(f"\nQuery: {sample['query'][:80]}...")
    print(f"Answer: {sample['ground_truth']}")
```
### Step 5: Use in the Pipeline

Add the dataset to your vulnerability search:
```python
from probellm import VulnerabilityPipelineAsync

pipeline = VulnerabilityPipelineAsync(
    model_name="gpt-5.2",
    test_model="gpt-4o-mini",
    judge_model="gpt-5.2",
    num_simulations=100,
    num_samples=50
)

# Add your custom dataset by ID
pipeline.add_datasets_batch(['trivia_qa', 'mmlu'])

# Run vulnerability search
pipeline.run()
```
## Advanced Configuration Patterns

### Multiple-Choice Datasets

For datasets with labeled options (e.g., HellaSwag):
- id: "hellaswag"
description: "HellaSwag - Commonsense Reasoning"
load_params:
path: "Rowan/hellaswag"
key_mapping:
query:
template: |
Context: {ctx}
Which ending makes the most sense?
A. {endings[0]}
B. {endings[1]}
C. {endings[2]}
D. {endings[3]}
Please select A, B, C, or D.
fields: ["ctx", "endings"]
ground_truth:
field: "label"
mapping:
type: "index_to_letter"
map:
0: "A"
1: "B"
2: "C"
3: "D"
**Features:**

- `{endings[0]}`: Array indexing in templates (demonstrated below)
- `mapping`: Transforms numeric labels to letters
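Assuming the templates are rendered with Python's `str.format` semantics, index access into list fields works out of the box; the `endings` value here is made-up sample data:

```python
# Index access in format-style templates (assuming the loader uses
# Python str.format semantics; "endings" is illustrative sample data).
template = "A. {endings[0]}\nB. {endings[1]}"
endings = ["went home", "kept running"]
print(template.format(endings=endings))
# A. went home
# B. kept running
```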
### Multi-Subset Datasets

For benchmarks with multiple subjects (e.g., MMLU):
- id: "mmlu"
description: "MMLU - Multiple subject areas"
subsets:
- name: "abstract_algebra"
load_params:
path: "cais/mmlu"
name: "abstract_algebra"
- name: "college_biology"
load_params:
path: "cais/mmlu"
name: "college_biology"
# Shared key_mapping for all subsets
key_mapping:
query:
template: |
Question: {question}
Choices:
A. {choices[0]}
B. {choices[1]}
C. {choices[2]}
D. {choices[3]}
fields: ["question", "choices"]
ground_truth:
field: "answer"
mapping:
type: "index_to_letter"
map: {0: "A", 1: "B", 2: "C", 3: "D"}
**Per-Subset Overrides:**

Override the shared `key_mapping` for specific subsets:
```yaml
subsets:
  - name: "boolq"
    load_params:
      path: "super_glue"
      name: "boolq"
    key_mapping:  # Subset-specific override
      query:
        template: |
          Passage: {passage}
          Question: {question}
          Answer with True or False.
        fields: ["passage", "question"]
      ground_truth:
        field: "label"
        mapping:
          type: "number_to_bool"
          map: {0: "False", 1: "True"}
```
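The precedence rule is simple: a subset-level `key_mapping`, when present, wins over the dataset-level one. A conceptual sketch of that rule (illustrative; the actual merge logic lives in the loader):

```python
# Conceptual precedence rule for per-subset overrides (illustrative;
# not the loader's actual code).
def effective_key_mapping(dataset_cfg: dict, subset_cfg: dict) -> dict:
    # A subset-level key_mapping, if present, overrides the shared one.
    return subset_cfg.get("key_mapping", dataset_cfg.get("key_mapping", {}))
```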
### Sampling Strategies

Choose from three built-in samplers:

**HierarchicalSampler** (default; stratified sampling):
```python
from dataloader.sampler import HierarchicalSampler

sampler = HierarchicalSampler(
    dataset_index=index,
    sample_store=store,
    total_samples=100,
    seed=42
)
```
**SequentialSampler** (deterministic ordering):
```python
from dataloader.sampler import SequentialSampler

sampler = SequentialSampler(
    dataset_index=index,
    sample_store=store,
    total_samples=50
)
```
**WeightedSampler** (prioritize subsets):
```python
from dataloader.sampler import WeightedSampler

sampler = WeightedSampler(
    dataset_index=index,
    sample_store=store,
    subset_weights={
        "abstract_algebra": 2.0,  # 2x sampling rate
        "college_biology": 1.0,
    },
    total_samples=100,
    seed=42
)
```
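To see how weights translate into per-subset counts, here is the proportional allocation you would expect from the configuration above (a sketch of the math, not the sampler's internals):

```python
# Expected per-subset allocation under proportional weighting
# (a sketch of the math, not WeightedSampler's actual internals).
subset_weights = {"abstract_algebra": 2.0, "college_biology": 1.0}
total_samples = 100

weight_sum = sum(subset_weights.values())
allocation = {
    name: round(total_samples * w / weight_sum)
    for name, w in subset_weights.items()
}
print(allocation)  # {'abstract_algebra': 67, 'college_biology': 33}
```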
## Configuration Reference

### Query Mapping

Simple field:

```yaml
query: "question"
```
Template format (recommended):

```yaml
query:
  template: "Context: {context}\nQuestion: {question}"
  fields: ["context", "question"]
```
### Ground Truth Mapping

Simple field:

```yaml
ground_truth: "answer"
```

With transformation:

```yaml
ground_truth:
  field: "label"
  mapping:
    type: "index_to_letter"  # or "number_to_bool", "number_to_label"
    map: {0: "A", 1: "B", 2: "C", 3: "D"}
```
Mapping types:

| Type | Usage |
|---|---|
| `index_to_letter` | Convert 0 → "A", 1 → "B" for multiple choice |
| `number_to_bool` | Convert 0 → "False", 1 → "True" |
| `number_to_label` | Custom numeric-to-text mapping |
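Whatever the type, applying a mapping amounts to a dictionary lookup on the raw field value. A minimal sketch (illustrative names, not the loader's API):

```python
# Applying a ground_truth mapping is essentially a dict lookup
# (illustrative sketch; not the loader's actual API).
mapping = {0: "A", 1: "B", 2: "C", 3: "D"}
raw_sample = {"label": 2}

ground_truth = mapping[raw_sample["label"]]
print(ground_truth)  # 'C'
```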
## Complete Example

Here's a full configuration for a custom sentiment dataset:
- id: "my_sentiment"
description: "Custom sentiment analysis dataset"
load_params:
path: "my_org/my_dataset"
# For local files:
# path: "json"
# data_files: "path/to/data.json"
key_mapping:
query:
template: |
Analyze the sentiment of the following text.
Output: positive, negative, or neutral.
Text: {review_text}
fields: ["review_text"]
ground_truth:
field: "sentiment"
mapping:
type: "number_to_label"
map:
0: "negative"
1: "neutral"
2: "positive"
Test and run:

```python
from dataloader import YAMLDatasetLoader, DatasetInterface
from probellm import VulnerabilityPipelineAsync

# Test loading
loader = YAMLDatasetLoader()
interface = DatasetInterface(loader)
index, store = interface.load_dataset_structured("my_sentiment", num=10)

# Use in pipeline
pipeline = VulnerabilityPipelineAsync(
    model_name="gpt-5.2",
    test_model="gpt-4o-mini",
    judge_model="gpt-5.2",
    num_simulations=50,
    num_samples=20
)
pipeline.add_datasets_batch(['my_sentiment'])
pipeline.run()
```
## Troubleshooting

### Dataset Not Found

```
Dataset 'xyz' not found in configuration
```

→ Verify the `id` field matches exactly in `datasets_config.yaml`

### Template Field Missing

```
KeyError: 'field_name'
```

→ Check field names with `print(dataset[0].keys())`

### Empty Ground Truth

```
[WARNING] Skipping sample with empty ground_truth
```

→ Some samples have missing labels (these are skipped automatically)
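To gauge how many raw samples would be skipped, you can count empty values up front. This sketch reuses the TriviaQA `ds` object from Step 1 and checks a slice for speed:

```python
# Count raw samples with an empty answer value (TriviaQA schema from
# Step 1; checking the first 1,000 samples for speed).
subset = ds['train'].select(range(1000))
n_empty = sum(1 for ex in subset if not ex['answer']['value'])
print(f"{n_empty} of {len(subset)} samples would be skipped")
```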
### Mapping Errors

If label mapping fails:

- Verify the `field` name is correct
- Check that the mapping `type` matches your data
- Ensure all label values appear in the `map` (a quick check follows this list)
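A quick way to catch missing `map` entries before running the pipeline; the data and field names here are placeholders to adapt to your dataset:

```python
# Sanity check: every label value in the data needs a map entry.
# ("raw_samples" and "label" are placeholders for your own data.)
label_map = {0: "negative", 1: "neutral", 2: "positive"}
raw_samples = [{"label": 0}, {"label": 2}, {"label": 3}]  # example data

observed = {ex["label"] for ex in raw_samples}
missing = observed - set(label_map)
assert not missing, f"Labels without a map entry: {missing}"
# AssertionError: Labels without a map entry: {3}
```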
## Next Steps

- **Explore Examples**: Check `datasets_config.yaml` for more patterns
- **Custom Samplers**: Implement weighted sampling for failure-prone areas
- **Add Multiple Datasets**: Build a benchmark suite
- **Integrate with Tools**: Combine datasets with custom tools for domain-specific testing
## See Also

- MCTS Search Engine: Dataset integration with MCTS
- Custom Tool Integration Guide: Custom tool integration
- `dataloader/dataset_loader.py`: DataLoader implementation