Dataset Creation Results
`DatasetCreationResults` is returned by `DataDesigner.create()`. It provides access to persisted creation artifacts, including the generated dataset, profiling analysis, processor outputs, task traces, dataset metadata, and Hugging Face Hub upload support.
For preview runs, `DataDesigner.preview()` returns the in-memory `data_designer.config.preview_results.PreviewResults` object instead; persisted dataset creation uses `DatasetCreationResults`.
DatasetCreationResults
Bases: WithRecordSamplerMixin
Results container for a Data Designer dataset creation run.
This class provides access to the generated dataset, profiling analysis, and visualization utilities. It is returned by the `DataDesigner.create()` method and implements the `ResultsProtocol` of the `DataDesigner` interface.
Creates a new instance with results based on a dataset creation run.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `artifact_storage` | `ArtifactStorage` | Storage manager for accessing generated artifacts. | *required* |
| `analysis` | `DatasetProfilerResults` | Profiling results for the generated dataset. | *required* |
| `config_builder` | `DataDesignerConfigBuilder` | Configuration builder used to create the dataset. | *required* |
| `dataset_metadata` | `DatasetMetadata` | Metadata about the generated dataset (e.g., seed column names). | *required* |
| `task_traces` | `list[TaskTrace] \| None` | Optional list of `TaskTrace` objects from the async scheduler. | `None` |
Methods:
| Name | Description |
|---|---|
| `count_records` | Return the total number of records in the generated dataset. |
| `export` | Export the generated dataset to a single file by streaming batch files. |
| `get_path_to_processor_artifacts` | Get the path to the artifacts generated by a processor. |
| `load_analysis` | Load the profiling analysis results for the generated dataset. |
| `load_dataset` | Load the generated dataset as a pandas DataFrame. |
| `load_processor_dataset` | Load the dataset generated by a processor. |
| `push_to_hub` | Push the dataset to the Hugging Face Hub. |
Source code in packages/data-designer/src/data_designer/interface/results.py
count_records()
Return the total number of records in the generated dataset.
Counts rows by reading Parquet file metadata only — no data pages are loaded, so memory usage is constant regardless of dataset size.
Returns:
| Type | Description |
|---|---|
| `int` | Total row count across all batch parquet files. |
export(path, *, format=None)
Export the generated dataset to a single file by streaming batch files.
The output format is inferred from the file extension when `format` is omitted. Pass `format` explicitly to override the extension (e.g., to write a `.txt` file as JSONL).
Unlike `load_dataset`, this method never materializes the full dataset in memory; it reads batch parquet files one at a time and appends each to the output file, keeping peak memory proportional to a single batch.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path \| str` | Output file path. The exact path is used as-is; the extension is not rewritten. | *required* |
| `format` | `ExportFormat \| None` | Output format. One of the supported `ExportFormat` values; inferred from the file extension when omitted. | `None` |
Returns:
| Type | Description |
|---|---|
| `Path` | Path to the written file. |
Raises:
| Type | Description |
|---|---|
| `InvalidFileFormatError` | If the format cannot be determined or is not one of the supported values. |
| `ArtifactStorageError` | If no batch parquet files are found. |
Example
```python
>>> results = data_designer.create(config, num_records=1000)
>>> results.export("output.jsonl")
PosixPath('output.jsonl')
>>> results.export("output.csv")
PosixPath('output.csv')
>>> results.export("output.txt", format="jsonl")
PosixPath('output.txt')
```
get_path_to_processor_artifacts(processor_name)
Get the path to the artifacts generated by a processor.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `processor_name` | `str` | The name of the processor to load the artifact from. | *required* |
Returns:
| Type | Description |
|---|---|
| `Path` | The path to the artifacts. |
load_analysis()
Load the profiling analysis results for the generated dataset.
Returns:
| Type | Description |
|---|---|
| `DatasetProfilerResults` | Statistical analysis and quality metrics for configured columns in the generated dataset. |
load_dataset()
Load the generated dataset as a pandas DataFrame.
Returns:
| Type | Description |
|---|---|
| `DataFrame` | A pandas DataFrame containing the full generated dataset. |
load_processor_dataset(processor_name)
Load the dataset generated by a processor.
This only works for processors that write their artifacts in Parquet format.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `processor_name` | `str` | The name of the processor to load the dataset from. | *required* |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | A pandas DataFrame containing the dataset generated by the processor. |
push_to_hub(repo_id, description, *, token=None, private=False, tags=None)
Push the dataset to the Hugging Face Hub.
Uploads all artifacts, including:
- Main parquet batch files (`data` subset)
- Processor output batch files (`{processor_name}` subsets)
- Configuration (`builder_config.json`)
- Metadata (`metadata.json`)
- Auto-generated dataset card (`README.md`)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `repo_id` | `str` | Hugging Face repo ID (e.g., `"username/my-dataset"`). | *required* |
| `description` | `str` | Custom description text for the dataset card. Appears after the title. | *required* |
| `token` | `str \| None` | Hugging Face API token. If `None`, the token is automatically resolved from the `HF_TOKEN` environment variable or cached local credentials. | `None` |
| `private` | `bool` | Whether to create the repository as private. | `False` |
| `tags` | `list[str] \| None` | Additional custom tags for the dataset. | `None` |
Returns:
| Type | Description |
|---|---|
| `str` | URL of the uploaded dataset. |
Example
```python
>>> results = data_designer.create(config, num_records=1000)
>>> description = "This dataset contains synthetic conversations for training chatbots."
>>> results.push_to_hub(
...     "username/my-synthetic-dataset",
...     description,
...     tags=["chatbot", "conversation"],
... )
'https://huggingface.co/datasets/username/my-synthetic-dataset'
```