Custom Datasets Guide
=====================

This guide walks through adding custom datasets to ProbeLLM's data loading system using a YAML-based configuration.

Example: Adding TriviaQA Dataset
--------------------------------

We'll add the TriviaQA question-answering dataset to demonstrate the complete workflow.

Step 1: Examine the Dataset Structure
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

First, inspect the dataset on HuggingFace to understand its schema:

.. code-block:: python

    from datasets import load_dataset

    ds = load_dataset("trivia_qa", "rc")
    print(ds['train'][0])
    # Output:
    # {
    #     'question': 'What is the capital of France?',
    #     'answer': {'aliases': ['Paris', 'paris'], 'value': 'Paris'},
    #     'entity_pages': {...},
    #     ...
    # }

**Key fields identified**:

- ``question``: The query text
- ``answer.value``: The ground-truth answer

Step 2: Write the YAML Configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Add the dataset definition to ``dataloader/datasets_config.yaml``:

.. code-block:: yaml

    datasets:
      - id: "trivia_qa"
        description: "TriviaQA - Reading Comprehension with Evidence"
        load_params:
          path: "trivia_qa"
          name: "rc"
        key_mapping:
          query:
            template: |
              Question: {question}

              Please provide a concise answer.
            fields: ["question"]
          ground_truth: "answer.value"

**Configuration components**:

- ``id``: Unique identifier used in the pipeline
- ``load_params``: Passed directly to HuggingFace's ``load_dataset()``
- ``key_mapping``: Transforms raw fields into the standardized format

Step 3: Validate the Configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Run the configuration validator:

.. code-block:: bash

    python validate_config.py --datasets-config dataloader/datasets_config.yaml

Step 4: Test Data Loading
^^^^^^^^^^^^^^^^^^^^^^^^^

Verify the dataset loads correctly:

.. code-block:: python

    from dataloader import YAMLDatasetLoader, DatasetInterface

    loader = YAMLDatasetLoader()
    interface = DatasetInterface(loader)

    # Load 10 samples
    index, store = interface.load_dataset_structured("trivia_qa", num=10)

    # Inspect results
    print(f"Loaded {len(store)} samples")
    for sample_id in list(store.keys())[:3]:
        sample = store[sample_id]
        print(f"\nQuery: {sample['query'][:80]}...")
        print(f"Answer: {sample['ground_truth']}")

Step 5: Use in the Pipeline
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Add the dataset to your vulnerability search:

.. code-block:: python

    from probellm import VulnerabilityPipelineAsync

    pipeline = VulnerabilityPipelineAsync(
        model_name="gpt-5.2",
        test_model="gpt-4o-mini",
        judge_model="gpt-5.2",
        num_simulations=100,
        num_samples=50
    )

    # Add your custom dataset by ID
    pipeline.add_datasets_batch(['trivia_qa', 'mmlu'])

    # Run vulnerability search
    pipeline.run()

Advanced Configuration Patterns
-------------------------------

Multiple-Choice Datasets
^^^^^^^^^^^^^^^^^^^^^^^^

For datasets with labeled options (e.g., HellaSwag):

.. code-block:: yaml

    - id: "hellaswag"
      description: "HellaSwag - Commonsense Reasoning"
      load_params:
        path: "Rowan/hellaswag"
      key_mapping:
        query:
          template: |
            Context: {ctx}

            Which ending makes the most sense?
            A. {endings[0]}
            B. {endings[1]}
            C. {endings[2]}
            D. {endings[3]}

            Please select A, B, C, or D.
          fields: ["ctx", "endings"]
        ground_truth:
          field: "label"
          mapping:
            type: "index_to_letter"
            map:
              0: "A"
              1: "B"
              2: "C"
              3: "D"

**Features**:

- ``{endings[0]}``: Array indexing in templates
- ``mapping``: Transforms numeric labels into letters (sketched below)

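To make the label transformation concrete, here is a rough sketch of what a loader can do with a raw HellaSwag-style row under the configuration above. The helper names (``render_query``, ``map_label``) and the sample row are illustrative, not part of the actual ``dataloader`` API; the lookup tolerates labels stored as strings as well as integers:

.. code-block:: python

    # Sketch only; the real logic lives in dataloader/dataset_loader.py.

    INDEX_TO_LETTER = {0: "A", 1: "B", 2: "C", 3: "D"}

    def render_query(template, row, fields):
        """Fill {field} and {field[i]} placeholders from a raw dataset row."""
        return template.format(**{f: row[f] for f in fields})

    def map_label(value, mapping):
        """Look up a raw label, tolerating both int (2) and string ("2") forms."""
        if value in mapping:
            return mapping[value]
        return mapping.get(int(value), value)

    row = {
        "ctx": "A man is standing on a ladder.",
        "endings": ["He paints the wall.", "He swims away.",
                    "He flies a kite.", "He bakes bread."],
        "label": "0",
    }

    query = render_query(
        "Context: {ctx}\nA. {endings[0]}\nB. {endings[1]}\n"
        "C. {endings[2]}\nD. {endings[3]}\nPlease select A, B, C, or D.",
        row, ["ctx", "endings"],
    )
    print(query)
    print(map_label(row["label"], INDEX_TO_LETTER))  # -> "A"

Python's ``str.format`` natively supports ``{endings[0]}``-style indexing, which is why array indices can appear directly inside the YAML templates.
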
Multi-Subset Datasets
^^^^^^^^^^^^^^^^^^^^^

For benchmarks with multiple subjects (e.g., MMLU):

.. code-block:: yaml

    - id: "mmlu"
      description: "MMLU - Multiple subject areas"
      subsets:
        - name: "abstract_algebra"
          load_params:
            path: "cais/mmlu"
            name: "abstract_algebra"
        - name: "college_biology"
          load_params:
            path: "cais/mmlu"
            name: "college_biology"
      # Shared key_mapping for all subsets
      key_mapping:
        query:
          template: |
            Question: {question}

            Choices:
            A. {choices[0]}
            B. {choices[1]}
            C. {choices[2]}
            D. {choices[3]}
          fields: ["question", "choices"]
        ground_truth:
          field: "answer"
          mapping:
            type: "index_to_letter"
            map: {0: "A", 1: "B", 2: "C", 3: "D"}

**Per-Subset Overrides**:

Override ``key_mapping`` for specific subsets:

.. code-block:: yaml

    subsets:
      - name: "boolq"
        load_params:
          path: "super_glue"
          name: "boolq"
        key_mapping:  # Subset-specific override
          query:
            template: |
              Passage: {passage}

              Question: {question}

              Answer with True or False.
            fields: ["passage", "question"]
          ground_truth:
            field: "label"
            mapping:
              type: "number_to_bool"
              map: {0: "False", 1: "True"}

Sampling Strategies
^^^^^^^^^^^^^^^^^^^

Choose from three built-in samplers:

**HierarchicalSampler** (default; stratified sampling):

.. code-block:: python

    from dataloader.sampler import HierarchicalSampler

    sampler = HierarchicalSampler(
        dataset_index=index,
        sample_store=store,
        total_samples=100,
        seed=42
    )

**SequentialSampler** (deterministic ordering):

.. code-block:: python

    from dataloader.sampler import SequentialSampler

    sampler = SequentialSampler(
        dataset_index=index,
        sample_store=store,
        total_samples=50
    )

**WeightedSampler** (prioritize subsets):

.. code-block:: python

    from dataloader.sampler import WeightedSampler

    sampler = WeightedSampler(
        dataset_index=index,
        sample_store=store,
        subset_weights={
            "abstract_algebra": 2.0,  # 2x sampling rate
            "college_biology": 1.0,
        },
        total_samples=100,
        seed=42
    )

Configuration Reference
-----------------------

Query Mapping
^^^^^^^^^^^^^

**Simple field**:

.. code-block:: yaml

    query: "question"

**Template format** (recommended):

.. code-block:: yaml

    query:
      template: "Context: {context}\nQuestion: {question}"
      fields: ["context", "question"]

Ground Truth Mapping
^^^^^^^^^^^^^^^^^^^^

**Simple field**:

.. code-block:: yaml

    ground_truth: "answer"

**With transformation**:

.. code-block:: yaml

    ground_truth:
      field: "label"
      mapping:
        type: "index_to_letter"  # or "number_to_bool", "number_to_label"
        map: {0: "A", 1: "B", 2: "C", 3: "D"}

**Mapping types**:

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Type
     - Usage
   * - ``index_to_letter``
     - Convert 0 → "A", 1 → "B", etc. for multiple choice
   * - ``number_to_bool``
     - Convert 0 → "False", 1 → "True"
   * - ``number_to_label``
     - Custom numeric-to-text mapping

Complete Example
----------------

Here's a full configuration for a custom sentiment dataset:

.. code-block:: yaml

    - id: "my_sentiment"
      description: "Custom sentiment analysis dataset"
      load_params:
        path: "my_org/my_dataset"
        # For local files:
        # path: "json"
        # data_files: "path/to/data.json"
      key_mapping:
        query:
          template: |
            Analyze the sentiment of the following text.
            Output: positive, negative, or neutral.

            Text: {review_text}
          fields: ["review_text"]
        ground_truth:
          field: "sentiment"
          mapping:
            type: "number_to_label"
            map:
              0: "negative"
              1: "neutral"
              2: "positive"

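The commented-out ``path: "json"`` lines above cover local files. As a quick sanity check before wiring the configuration in, you can generate a small JSON Lines file whose field names match the ``key_mapping``; the file name and records below are made up for illustration:

.. code-block:: python

    import json

    from datasets import load_dataset

    # Hypothetical records; only the field names ("review_text", "sentiment")
    # need to line up with the key_mapping above.
    records = [
        {"review_text": "Absolutely loved it.", "sentiment": 2},
        {"review_text": "It was fine, nothing special.", "sentiment": 1},
        {"review_text": "Would not recommend.", "sentiment": 0},
    ]

    # HuggingFace's "json" builder accepts JSON Lines: one object per line.
    with open("my_sentiment.jsonl", "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

    ds = load_dataset("json", data_files="my_sentiment.jsonl")
    print(ds["train"][0])
    # {'review_text': 'Absolutely loved it.', 'sentiment': 2}

If this loads cleanly, pointing ``data_files`` in the configuration at the same path should behave identically.
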
**Test and run**:

.. code-block:: python

    from dataloader import YAMLDatasetLoader, DatasetInterface
    from probellm import VulnerabilityPipelineAsync

    # Test loading
    loader = YAMLDatasetLoader()
    interface = DatasetInterface(loader)
    index, store = interface.load_dataset_structured("my_sentiment", num=10)

    # Use in the pipeline
    pipeline = VulnerabilityPipelineAsync(
        model_name="gpt-5.2",
        test_model="gpt-4o-mini",
        judge_model="gpt-5.2",
        num_simulations=50,
        num_samples=20
    )
    pipeline.add_datasets_batch(['my_sentiment'])
    pipeline.run()

Troubleshooting
---------------

Dataset Not Found
^^^^^^^^^^^^^^^^^

.. code-block:: text

    Dataset 'xyz' not found in configuration

→ Verify that the ``id`` field matches exactly in ``datasets_config.yaml``.

Template Field Missing
^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: text

    KeyError: 'field_name'

→ Check the available field names with ``print(dataset[0].keys())``.

Empty Ground Truth
^^^^^^^^^^^^^^^^^^

.. code-block:: text

    [WARNING] Skipping sample with empty ground_truth

→ Some samples have missing labels; these are skipped automatically.

Mapping Errors
^^^^^^^^^^^^^^

If label mapping fails:

1. Verify that the ``field`` name is correct.
2. Check that the mapping ``type`` matches your data.
3. Ensure every label value that occurs in the data has an entry in ``map``.

Next Steps
----------

1. **Explore examples**: Check ``datasets_config.yaml`` for more patterns.
2. **Custom samplers**: Implement weighted sampling for failure-prone areas (see the sketch at the end of this page).
3. **Add multiple datasets**: Build a benchmark suite.
4. **Integrate with tools**: Combine datasets with custom tools for domain-specific testing.

See Also
--------

- :doc:`/modules/search`: Dataset integration with MCTS
- :doc:`custom_tools`: Custom tool integration
- ``dataloader/dataset_loader.py``: DataLoader implementation

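To close out the **Custom samplers** item above, here is a minimal sketch of a failure-weighted sampler. The constructor mirrors the built-in samplers shown earlier, but the ``sample()`` method name, the ``failure_rates`` input, and the assumption that ``dataset_index`` maps sample IDs to subset names are all guesses; check ``dataloader/sampler.py`` for the real base interface before building on this:

.. code-block:: python

    import random

    class FailureWeightedSampler:
        """Oversample subsets where the test model failed in earlier runs.

        Sketch only: the sample() and dataset_index conventions below are
        assumptions, not taken from dataloader/sampler.py.
        """

        def __init__(self, dataset_index, sample_store, failure_rates,
                     total_samples=100, seed=42):
            self.dataset_index = dataset_index    # assumed: sample_id -> subset name
            self.sample_store = sample_store
            self.failure_rates = failure_rates    # e.g. {"abstract_algebra": 0.4}
            self.total_samples = total_samples
            self.rng = random.Random(seed)

        def sample(self):
            ids = list(self.sample_store.keys())
            # Weight each ID by its subset's observed failure rate, with a
            # small floor so untested subsets are still represented.
            weights = [
                max(self.failure_rates.get(self.dataset_index.get(i), 0.0), 0.05)
                for i in ids
            ]
            # random.choices draws with replacement, which keeps the sketch short.
            return self.rng.choices(ids, weights=weights,
                                    k=min(self.total_samples, len(ids)))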