Custom Datasets Guide

This guide walks through adding custom datasets to ProbeLLM’s data loading system using a YAML-based configuration.

Example: Adding TriviaQA Dataset

We’ll add the TriviaQA question-answering dataset to demonstrate the complete workflow.

Step 1: Examine the Dataset Structure

First, inspect the dataset on HuggingFace to understand its schema:

from datasets import load_dataset

ds = load_dataset("trivia_qa", "rc")
print(ds['train'][0])
# Output:
# {
#   'question': 'What is the capital of France?',
#   'answer': {'aliases': ['Paris', 'paris'], 'value': 'Paris'},
#   'entity_pages': {...},
#   ...
# }

Key fields identified:

  • question: The query text

  • answer.value: The ground truth answer
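
You can confirm the nested answer path directly before writing the config:

sample = ds['train'][0]
print(sample['question'])
print(sample['answer']['value'])  # nested dict access, written as "answer.value" in Step 2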

Step 2: Write the YAML Configuration

Add the dataset definition to dataloader/datasets_config.yaml:

datasets:
  - id: "trivia_qa"
    description: "TriviaQA - Reading Comprehension with Evidence"
    load_params:
      path: "trivia_qa"
      name: "rc"
    key_mapping:
      query:
        template: |
          Question: {question}

          Please provide a concise answer.
        fields: ["question"]
      ground_truth: "answer.value"

Configuration Components:

  • id: Unique identifier used in the pipeline

  • load_params: Passed directly to HuggingFace’s load_dataset()

  • key_mapping: Transforms raw dataset fields into the standardized query/ground_truth format
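
To make the transformation concrete, here is a rough Python illustration of what this mapping produces for one record (the loader does this internally; the exact implementation may differ):

# Raw record, as seen in Step 1
record = {
    "question": "What is the capital of France?",
    "answer": {"aliases": ["Paris", "paris"], "value": "Paris"},
}

# Render the query template with the listed fields
template = "Question: {question}\n\nPlease provide a concise answer."
query = template.format(question=record["question"])

# Hypothetical helper: walk a dotted path like "answer.value" through nested dicts
def resolve(record, dotted_path):
    value = record
    for key in dotted_path.split("."):
        value = value[key]
    return value

ground_truth = resolve(record, "answer.value")  # "Paris"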

Step 3: Validate the Configuration

Run the configuration validator:

python validate_config.py --datasets-config dataloader/datasets_config.yaml
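
For a quick manual sanity check before (or instead of) the validator, a few lines of PyYAML suffice; this only confirms the entry exists, not the full schema:

import yaml

with open("dataloader/datasets_config.yaml") as f:
    config = yaml.safe_load(f)

ids = [entry["id"] for entry in config["datasets"]]
assert "trivia_qa" in ids, "trivia_qa missing from datasets_config.yaml"
print(f"{len(ids)} datasets configured")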

Step 4: Test Data Loading

Verify the dataset loads correctly:

from dataloader import YAMLDatasetLoader, DatasetInterface

loader = YAMLDatasetLoader()
interface = DatasetInterface(loader)

# Load 10 samples
index, store = interface.load_dataset_structured("trivia_qa", num=10)

# Inspect results
print(f"Loaded {len(store)} samples")
for sample_id in list(store.keys())[:3]:
    sample = store[sample_id]
    print(f"\nQuery: {sample['query'][:80]}...")
    print(f"Answer: {sample['ground_truth']}")

Step 5: Use in the Pipeline

Add the dataset to your vulnerability search:

from probellm import VulnerabilityPipelineAsync

pipeline = VulnerabilityPipelineAsync(
    model_name="gpt-5.2",
    test_model="gpt-4o-mini",
    judge_model="gpt-5.2",
    num_simulations=100,
    num_samples=50
)

# Add your custom dataset by ID
pipeline.add_datasets_batch(['trivia_qa', 'mmlu'])

# Run vulnerability search
pipeline.run()

Advanced Configuration Patterns

Multiple-Choice Datasets

For datasets with labeled options (e.g., HellaSwag):

- id: "hellaswag"
  description: "HellaSwag - Commonsense Reasoning"
  load_params:
    path: "Rowan/hellaswag"
  key_mapping:
    query:
      template: |
        Context: {ctx}

        Which ending makes the most sense?
        A. {endings[0]}
        B. {endings[1]}
        C. {endings[2]}
        D. {endings[3]}

        Please select A, B, C, or D.
      fields: ["ctx", "endings"]
    ground_truth:
      field: "label"
      mapping:
        type: "index_to_letter"
        map:
          0: "A"
          1: "B"
          2: "C"
          3: "D"

Features:

  • {endings[0]}: Array indexing in templates (demonstrated below)

  • mapping: Transform numeric labels to letters
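
Python's built-in str.format already supports this element-access syntax, so array indexing needs no special machinery, assuming the loader renders templates with str.format (which the syntax suggests):

template = "A. {endings[0]}\nB. {endings[1]}"
print(template.format(endings=["walks away", "keeps running"]))
# A. walks away
# B. keeps running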

Multi-Subset Datasets

For benchmarks with multiple subjects (e.g., MMLU):

- id: "mmlu"
  description: "MMLU - Multiple subject areas"
  subsets:
    - name: "abstract_algebra"
      load_params:
        path: "cais/mmlu"
        name: "abstract_algebra"
    - name: "college_biology"
      load_params:
        path: "cais/mmlu"
        name: "college_biology"
  # Shared key_mapping for all subsets
  key_mapping:
    query:
      template: |
        Question: {question}

        Choices:
        A. {choices[0]}
        B. {choices[1]}
        C. {choices[2]}
        D. {choices[3]}
      fields: ["question", "choices"]
    ground_truth:
      field: "answer"
      mapping:
        type: "index_to_letter"
        map: {0: "A", 1: "B", 2: "C", 3: "D"}

Per-Subset Overrides:

Override key_mapping for specific subsets:

subsets:
  - name: "boolq"
    load_params:
      path: "super_glue"
      name: "boolq"
    key_mapping:  # Subset-specific override
      query:
        template: |
          Passage: {passage}
          Question: {question}
          Answer with True or False.
        fields: ["passage", "question"]
      ground_truth:
        field: "label"
        mapping:
          type: "number_to_bool"
          map: {0: "False", 1: "True"}
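
Conceptually, the loader iterates over subsets, loads each with its own load_params, and falls back to the shared key_mapping when a subset defines no override. A rough sketch of that logic (illustrative only; the real implementation lives in the dataloader package):

from datasets import load_dataset

# Hypothetical sketch, not the actual loader code
def load_subsets(dataset_cfg):
    for subset in dataset_cfg["subsets"]:
        ds = load_dataset(**subset["load_params"])
        # A per-subset key_mapping wins; otherwise use the shared one
        mapping = subset.get("key_mapping", dataset_cfg.get("key_mapping"))
        yield subset["name"], ds, mapping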

Sampling Strategies

Choose from three built-in samplers:

HierarchicalSampler (Default - Stratified sampling):

from dataloader.sampler import HierarchicalSampler

sampler = HierarchicalSampler(
    dataset_index=index,
    sample_store=store,
    total_samples=100,
    seed=42
)

SequentialSampler (Deterministic ordering):

from dataloader.sampler import SequentialSampler

sampler = SequentialSampler(
    dataset_index=index,
    sample_store=store,
    total_samples=50
)

WeightedSampler (Prioritize subsets):

from dataloader.sampler import WeightedSampler

sampler = WeightedSampler(
    dataset_index=index,
    sample_store=store,
    subset_weights={
        "abstract_algebra": 2.0,    # 2x sampling rate
        "college_biology": 1.0,
    },
    total_samples=100,
    seed=42
)
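
Conceptually, stratified sampling draws proportionally from every subset, and subset_weights scales those proportions. A simplified sketch of the idea, assuming dataset_index maps subset names to lists of sample IDs (an assumption about its shape; the actual implementations live in dataloader/sampler):

import random

# Hypothetical sketch of weighted subset sampling, not the real WeightedSampler
def weighted_draw(dataset_index, subset_weights, total_samples, seed=42):
    rng = random.Random(seed)
    subsets = list(dataset_index)
    weights = [subset_weights.get(s, 1.0) for s in subsets]
    picked = []
    for _ in range(total_samples):
        subset = rng.choices(subsets, weights=weights, k=1)[0]  # weight-proportional subset pick
        picked.append(rng.choice(dataset_index[subset]))        # uniform pick within the subset
    return picked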

Configuration Reference

Query Mapping

Simple field:

query: "question"

Template format (recommended):

query:
  template: "Context: {context}\nQuestion: {question}"
  fields: ["context", "question"]

Ground Truth Mapping

Simple field:

ground_truth: "answer"

With transformation:

ground_truth:
  field: "label"
  mapping:
    type: "index_to_letter"  # or "number_to_bool", "number_to_label"
    map: {0: "A", 1: "B", 2: "C", 3: "D"}

Mapping types:

  • index_to_letter: Convert numeric indices to letters (0→"A", 1→"B") for multiple choice

  • number_to_bool: Convert 0→"False", 1→"True"

  • number_to_label: Custom numeric-to-text mapping

Complete Example

Here’s a full configuration for a custom sentiment dataset:

- id: "my_sentiment"
  description: "Custom sentiment analysis dataset"
  load_params:
    path: "my_org/my_dataset"
    # For local files:
    # path: "json"
    # data_files: "path/to/data.json"
  key_mapping:
    query:
      template: |
        Analyze the sentiment of the following text.
        Output: positive, negative, or neutral.

        Text: {review_text}
      fields: ["review_text"]
    ground_truth:
      field: "sentiment"
      mapping:
        type: "number_to_label"
        map:
          0: "negative"
          1: "neutral"
          2: "positive"

Test and run:

# Test loading
loader = YAMLDatasetLoader()
interface = DatasetInterface(loader)
index, store = interface.load_dataset_structured("my_sentiment", num=10)

# Use in pipeline
pipeline = VulnerabilityPipelineAsync(
    model_name="gpt-5.2",
    test_model="gpt-4o-mini",
    judge_model="gpt-5.2",
    num_simulations=50,
    num_samples=20
)
pipeline.add_datasets_batch(['my_sentiment'])
pipeline.run()

Troubleshooting

Dataset Not Found

Dataset 'xyz' not found in configuration

→ Verify the id field matches exactly in datasets_config.yaml

Template Field Missing

KeyError: 'field_name'

→ Check the raw field names with print(ds['train'][0].keys()) and confirm that every name listed in fields exists

Empty Ground Truth

[WARNING] Skipping sample with empty ground_truth

→ Some samples have missing labels (automatically skipped)

Mapping Errors

If label mapping fails:

  1. Verify the field name is correct

  2. Check mapping type matches your data

  3. Ensure all values are in the map
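
For point 3, a quick coverage check helps, assuming the raw labels live in a column named label (adjust the names to your schema):

from datasets import load_dataset

ds = load_dataset("my_org/my_dataset", split="train")
raw_labels = set(ds["label"])     # column access returns every value in the split
mapped_keys = {0, 1, 2}           # the keys declared in your YAML map
missing = raw_labels - mapped_keys
if missing:
    print(f"Unmapped label values: {missing}")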

Next Steps

  1. Explore Examples: Check datasets_config.yaml for more patterns

  2. Custom Samplers: Implement weighted sampling for failure-prone areas

  3. Add Multiple Datasets: Build a benchmark suite

  4. Integrate with Tools: Combine datasets with custom tools for domain-specific testing

See Also