Custom Datasets Guide
=====================

This guide walks through adding custom datasets to ProbeLLM's data loading system using a YAML-based configuration.

Example: Adding TriviaQA Dataset
--------------------------------

We'll add the TriviaQA question-answering dataset to demonstrate the complete workflow.

Step 1: Examine the Dataset Structure
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

First, inspect the dataset on HuggingFace to understand its schema:

.. code-block:: python

    from datasets import load_dataset

    ds = load_dataset("trivia_qa", "rc")
    print(ds['train'][0])
    # Output:
    # {
    #     'question': 'What is the capital of France?',
    #     'answer': {'aliases': ['Paris', 'paris'], 'value': 'Paris'},
    #     'entity_pages': {...},
    #     ...
    # }

**Key fields identified**:

- ``question``: The query text
- ``answer.value``: The ground-truth answer

Step 2: Write the YAML Configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Add the dataset definition to ``dataloader/datasets_config.yaml``:

.. code-block:: yaml

    datasets:
      - id: "trivia_qa"
        description: "TriviaQA - Reading Comprehension with Evidence"
        load_params:
          path: "trivia_qa"
          name: "rc"
        key_mapping:
          query:
            template: |
              Question: {question}

              Please provide a concise answer.
            fields: ["question"]
          ground_truth: "answer.value"

**Configuration components**:

- ``id``: Unique identifier used in the pipeline
- ``load_params``: Passed directly to HuggingFace's ``load_dataset()``
- ``key_mapping``: Transforms raw fields into the standardized format

Step 3: Validate the Configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Run the configuration validator:

.. code-block:: bash

    python validate_config.py --datasets-config dataloader/datasets_config.yaml

Step 4: Test Data Loading
^^^^^^^^^^^^^^^^^^^^^^^^^

Verify the dataset loads correctly:

.. code-block:: python

    from dataloader import YAMLDatasetLoader, DatasetInterface

    loader = YAMLDatasetLoader()
    interface = DatasetInterface(loader)

    # Load 10 samples
    index, store = interface.load_dataset_structured("trivia_qa", num=10)

    # Inspect results
    print(f"Loaded {len(store)} samples")
    for sample_id in list(store.keys())[:3]:
        sample = store[sample_id]
        print(f"\nQuery: {sample['query'][:80]}...")
        print(f"Answer: {sample['ground_truth']}")

Step 5: Use in the Pipeline
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Add the dataset to your vulnerability search:

.. code-block:: python

    from probellm import VulnerabilityPipelineAsync

    pipeline = VulnerabilityPipelineAsync(
        model_name="gpt-5.2",
        test_model="gpt-4o-mini",
        judge_model="gpt-5.2",
        num_simulations=100,
        num_samples=50
    )

    # Add your custom dataset by ID
    pipeline.add_datasets_batch(['trivia_qa', 'mmlu'])

    # Run vulnerability search
    pipeline.run()

Advanced Configuration Patterns
-------------------------------

Multiple-Choice Datasets
^^^^^^^^^^^^^^^^^^^^^^^^

For datasets with labeled options (e.g., HellaSwag):

.. code-block:: yaml

    - id: "hellaswag"
      description: "HellaSwag - Commonsense Reasoning"
      load_params:
        path: "Rowan/hellaswag"
      key_mapping:
        query:
          template: |
            Context: {ctx}

            Which ending makes the most sense?
            A. {endings[0]}
            B. {endings[1]}
            C. {endings[2]}
            D. {endings[3]}

            Please select A, B, C, or D.
          fields: ["ctx", "endings"]
        ground_truth:
          field: "label"
          mapping:
            type: "index_to_letter"
            map:
              0: "A"
              1: "B"
              2: "C"
              3: "D"

**Features**:

- ``{endings[0]}``: Array indexing in templates
- ``mapping``: Transforms numeric labels into letters (sketched below)

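To make the label transformation concrete, here is a rough sketch of what a loader can do with a raw HellaSwag-style row under the configuration above. The helper names (``render_query``, ``map_label``) and the sample row are illustrative, not part of the actual ``dataloader`` API; the lookup tolerates labels stored as strings as well as integers:

.. code-block:: python

    # Sketch only; the real logic lives in dataloader/dataset_loader.py.

    INDEX_TO_LETTER = {0: "A", 1: "B", 2: "C", 3: "D"}

    def render_query(template, row, fields):
        """Fill {field} and {field[i]} placeholders from a raw dataset row."""
        return template.format(**{f: row[f] for f in fields})

    def map_label(value, mapping):
        """Look up a raw label, tolerating both int (2) and string ("2") forms."""
        if value in mapping:
            return mapping[value]
        return mapping.get(int(value), value)

    row = {
        "ctx": "A man is standing on a ladder.",
        "endings": ["He paints the wall.", "He swims away.",
                    "He flies a kite.", "He bakes bread."],
        "label": "0",
    }

    query = render_query(
        "Context: {ctx}\nA. {endings[0]}\nB. {endings[1]}\n"
        "C. {endings[2]}\nD. {endings[3]}\nPlease select A, B, C, or D.",
        row, ["ctx", "endings"],
    )
    print(query)
    print(map_label(row["label"], INDEX_TO_LETTER))  # -> "A"

Python's ``str.format`` natively supports ``{endings[0]}``-style indexing, which is why array indices can appear directly inside the YAML templates.
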
Multi-Subset Datasets
^^^^^^^^^^^^^^^^^^^^^

For benchmarks with multiple subjects (e.g., MMLU):

.. code-block:: yaml

    - id: "mmlu"
      description: "MMLU - Multiple subject areas"
      subsets:
        - name: "abstract_algebra"
          load_params:
            path: "cais/mmlu"
            name: "abstract_algebra"
        - name: "college_biology"
          load_params:
            path: "cais/mmlu"
            name: "college_biology"
      # Shared key_mapping for all subsets
      key_mapping:
        query:
          template: |
            Question: {question}

            Choices:
            A. {choices[0]}
            B. {choices[1]}
            C. {choices[2]}
            D. {choices[3]}
          fields: ["question", "choices"]
        ground_truth:
          field: "answer"
          mapping:
            type: "index_to_letter"
            map: {0: "A", 1: "B", 2: "C", 3: "D"}

**Per-Subset Overrides**:

Override ``key_mapping`` for specific subsets:

.. code-block:: yaml

    subsets:
      - name: "boolq"
        load_params:
          path: "super_glue"
          name: "boolq"
        key_mapping:  # Subset-specific override
          query:
            template: |
              Passage: {passage}

              Question: {question}

              Answer with True or False.
            fields: ["passage", "question"]
          ground_truth:
            field: "label"
            mapping:
              type: "number_to_bool"
              map: {0: "False", 1: "True"}

Sampling Strategies
^^^^^^^^^^^^^^^^^^^

Choose from three built-in samplers:

**HierarchicalSampler** (default; stratified sampling):

.. code-block:: python

    from dataloader.sampler import HierarchicalSampler

    sampler = HierarchicalSampler(
        dataset_index=index,
        sample_store=store,
        total_samples=100,
        seed=42
    )

**SequentialSampler** (deterministic ordering):

.. code-block:: python

    from dataloader.sampler import SequentialSampler

    sampler = SequentialSampler(
        dataset_index=index,
        sample_store=store,
        total_samples=50
    )

**WeightedSampler** (prioritize subsets):

.. code-block:: python

    from dataloader.sampler import WeightedSampler

    sampler = WeightedSampler(
        dataset_index=index,
        sample_store=store,
        subset_weights={
            "abstract_algebra": 2.0,  # 2x sampling rate
            "college_biology": 1.0,
        },
        total_samples=100,
        seed=42
    )

Configuration Reference
-----------------------

Query Mapping
^^^^^^^^^^^^^

**Simple field**:

.. code-block:: yaml

    query: "question"

**Template format** (recommended):

.. code-block:: yaml

    query:
      template: "Context: {context}\nQuestion: {question}"
      fields: ["context", "question"]

Ground Truth Mapping
^^^^^^^^^^^^^^^^^^^^

**Simple field**:

.. code-block:: yaml

    ground_truth: "answer"

**With transformation**:

.. code-block:: yaml

    ground_truth:
      field: "label"
      mapping:
        type: "index_to_letter"  # or "number_to_bool", "number_to_label"
        map: {0: "A", 1: "B", 2: "C", 3: "D"}

**Mapping types**:

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Type
     - Usage
   * - ``index_to_letter``
     - Convert 0 → "A", 1 → "B", etc. for multiple choice
   * - ``number_to_bool``
     - Convert 0 → "False", 1 → "True"
   * - ``number_to_label``
     - Custom numeric-to-text mapping

Complete Example
----------------

Here's a full configuration for a custom sentiment dataset:

.. code-block:: yaml

    - id: "my_sentiment"
      description: "Custom sentiment analysis dataset"
      load_params:
        path: "my_org/my_dataset"
        # For local files:
        # path: "json"
        # data_files: "path/to/data.json"
      key_mapping:
        query:
          template: |
            Analyze the sentiment of the following text.
            Output: positive, negative, or neutral.

            Text: {review_text}
          fields: ["review_text"]
        ground_truth:
          field: "sentiment"
          mapping:
            type: "number_to_label"
            map:
              0: "negative"
              1: "neutral"
              2: "positive"

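The commented-out ``path: "json"`` lines above cover local files. As a quick sanity check before wiring the configuration in, you can generate a small JSON Lines file whose field names match the ``key_mapping``; the file name and records below are made up for illustration:

.. code-block:: python

    import json

    from datasets import load_dataset

    # Hypothetical records; only the field names ("review_text", "sentiment")
    # need to line up with the key_mapping above.
    records = [
        {"review_text": "Absolutely loved it.", "sentiment": 2},
        {"review_text": "It was fine, nothing special.", "sentiment": 1},
        {"review_text": "Would not recommend.", "sentiment": 0},
    ]

    # HuggingFace's "json" builder accepts JSON Lines: one object per line.
    with open("my_sentiment.jsonl", "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

    ds = load_dataset("json", data_files="my_sentiment.jsonl")
    print(ds["train"][0])
    # {'review_text': 'Absolutely loved it.', 'sentiment': 2}

If this loads cleanly, pointing ``data_files`` in the configuration at the same path should behave identically.
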
**Test and run**:

.. code-block:: python

    from dataloader import YAMLDatasetLoader, DatasetInterface
    from probellm import VulnerabilityPipelineAsync

    # Test loading
    loader = YAMLDatasetLoader()
    interface = DatasetInterface(loader)
    index, store = interface.load_dataset_structured("my_sentiment", num=10)

    # Use in the pipeline
    pipeline = VulnerabilityPipelineAsync(
        model_name="gpt-5.2",
        test_model="gpt-4o-mini",
        judge_model="gpt-5.2",
        num_simulations=50,
        num_samples=20
    )
    pipeline.add_datasets_batch(['my_sentiment'])
    pipeline.run()

Troubleshooting
---------------

Dataset Not Found
^^^^^^^^^^^^^^^^^

.. code-block:: text

    Dataset 'xyz' not found in configuration

→ Verify that the ``id`` field matches exactly in ``datasets_config.yaml``.

Template Field Missing
^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: text

    KeyError: 'field_name'

→ Check the available field names with ``print(dataset[0].keys())``.

Empty Ground Truth
^^^^^^^^^^^^^^^^^^

.. code-block:: text

    [WARNING] Skipping sample with empty ground_truth

→ Some samples have missing labels; these are skipped automatically.

Mapping Errors
^^^^^^^^^^^^^^

If label mapping fails:

1. Verify that the ``field`` name is correct.
2. Check that the mapping ``type`` matches your data.
3. Ensure every label value that occurs in the data has an entry in ``map``.

Next Steps
----------

1. **Explore examples**: Check ``datasets_config.yaml`` for more patterns.
2. **Custom samplers**: Implement weighted sampling for failure-prone areas (see the sketch at the end of this page).
3. **Add multiple datasets**: Build a benchmark suite.
4. **Integrate with tools**: Combine datasets with custom tools for domain-specific testing.

See Also
--------

- :doc:`/modules/search`: Dataset integration with MCTS
- :doc:`custom_tools`: Custom tool integration
- ``dataloader/dataset_loader.py``: DataLoader implementation

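To close out the **Custom samplers** item above, here is a minimal sketch of a failure-weighted sampler. The constructor mirrors the built-in samplers shown earlier, but the ``sample()`` method name, the ``failure_rates`` input, and the assumption that ``dataset_index`` maps sample IDs to subset names are all guesses; check ``dataloader/sampler.py`` for the real base interface before building on this:

.. code-block:: python

    import random

    class FailureWeightedSampler:
        """Oversample subsets where the test model failed in earlier runs.

        Sketch only: the sample() and dataset_index conventions below are
        assumptions, not taken from dataloader/sampler.py.
        """

        def __init__(self, dataset_index, sample_store, failure_rates,
                     total_samples=100, seed=42):
            self.dataset_index = dataset_index    # assumed: sample_id -> subset name
            self.sample_store = sample_store
            self.failure_rates = failure_rates    # e.g. {"abstract_algebra": 0.4}
            self.total_samples = total_samples
            self.rng = random.Random(seed)

        def sample(self):
            ids = list(self.sample_store.keys())
            # Weight each ID by its subset's observed failure rate, with a
            # small floor so untested subsets are still represented.
            weights = [
                max(self.failure_rates.get(self.dataset_index.get(i), 0.0), 0.05)
                for i in ids
            ]
            # random.choices draws with replacement, which keeps the sketch short.
            return self.rng.choices(ids, weights=weights,
                                    k=min(self.total_samples, len(ids)))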