Dataset Creation Results

DatasetCreationResults is returned by DataDesigner.create(). It provides access to persisted creation artifacts, including the generated dataset, profiling analysis, processor outputs, task traces, and dataset metadata, and it supports uploading the dataset to the Hugging Face Hub.

For preview runs, DataDesigner.preview() returns the in-memory data_designer.config.preview_results.PreviewResults object; persisted dataset creation via DataDesigner.create() returns DatasetCreationResults.
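
A minimal usage sketch, assuming data_designer is an initialized DataDesigner instance and config is a valid configuration; the record count is illustrative, and only methods documented on this page are used:

results = data_designer.create(config, num_records=1000)
print(results.count_records())      # row count read from parquet metadata
df = results.load_dataset()         # full dataset as a pandas DataFrame
analysis = results.load_analysis()  # DatasetProfilerResults for the run
results.export("output.jsonl")      # stream batches into a single JSONL file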

DatasetCreationResults

Bases: WithRecordSamplerMixin

Results container for a Data Designer dataset creation run.

This class provides access to the generated dataset, profiling analysis, and visualization utilities. It is returned by the DataDesigner.create() method and implements the ResultsProtocol of the DataDesigner interface.

Creates a new instance with results based on a dataset creation run.

Parameters:

- artifact_storage (ArtifactStorage, required): Storage manager for accessing generated artifacts.
- analysis (DatasetProfilerResults, required): Profiling results for the generated dataset.
- config_builder (DataDesignerConfigBuilder, required): Configuration builder used to create the dataset.
- dataset_metadata (DatasetMetadata, required): Metadata about the generated dataset (e.g., seed column names).
- task_traces (list[TaskTrace] | None, default None): Optional list of TaskTrace objects from the async scheduler.

Methods:

- count_records: Return the total number of records in the generated dataset.
- export: Export the generated dataset to a single file by streaming batch files.
- get_path_to_processor_artifacts: Get the path to the artifacts generated by a processor.
- load_analysis: Load the profiling analysis results for the generated dataset.
- load_dataset: Load the generated dataset as a pandas DataFrame.
- load_processor_dataset: Load the dataset generated by a processor.
- push_to_hub: Push dataset to HuggingFace Hub.

Source code in packages/data-designer/src/data_designer/interface/results.py
def __init__(
    self,
    *,
    artifact_storage: ArtifactStorage,
    analysis: DatasetProfilerResults,
    config_builder: DataDesignerConfigBuilder,
    dataset_metadata: DatasetMetadata,
    task_traces: list[TaskTrace] | None = None,
):
    """Creates a new instance with results based on a dataset creation run.

    Args:
        artifact_storage: Storage manager for accessing generated artifacts.
        analysis: Profiling results for the generated dataset.
        config_builder: Configuration builder used to create the dataset.
        dataset_metadata: Metadata about the generated dataset (e.g., seed column names).
        task_traces: Optional list of TaskTrace objects from the async scheduler.
    """
    self.artifact_storage = artifact_storage
    self._analysis = analysis
    self._config_builder = config_builder
    self.dataset_metadata = dataset_metadata
    self.task_traces: list[TaskTrace] = task_traces or []

count_records()

Return the total number of records in the generated dataset.

Counts rows by reading Parquet file metadata only — no data pages are loaded, so memory usage is constant regardless of dataset size.

Returns:

- int: Total row count across all batch parquet files.

Source code in packages/data-designer/src/data_designer/interface/results.py
def count_records(self) -> int:
    """Return the total number of records in the generated dataset.

    Counts rows by reading Parquet file metadata only — no data pages are
    loaded, so memory usage is constant regardless of dataset size.

    Returns:
        Total row count across all batch parquet files.
    """
    batch_files = sorted(self.artifact_storage.final_dataset_path.glob("batch_*.parquet"))
    return sum(lazy.pq.read_metadata(f).num_rows for f in batch_files)
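
A quick sketch contrasting count_records() with loading the full dataset; both yield the same count, but count_records() reads only parquet metadata:

n_records = results.count_records()
assert n_records == len(results.load_dataset())  # same count, without loading data pages into memory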

export(path, *, format=None)

Export the generated dataset to a single file by streaming batch files.

The output format is inferred from the file extension when format is omitted. Pass format explicitly to override the extension (e.g. write a .txt file as JSONL).

Unlike load_dataset(), this method never materialises the full dataset in memory — it reads batch parquet files one at a time and appends each to the output file, keeping peak memory proportional to a single batch.

Parameters:

- path (Path | str, required): Output file path. The exact path is used as-is; the extension is not rewritten.
- format (ExportFormat | None, default None): Output format. One of 'jsonl', 'csv', or 'parquet'. When omitted, the format is inferred from the file extension.

Returns:

- Path: Path to the written file.

Raises:

- InvalidFileFormatError: If the format cannot be determined or is not one of the supported values.
- ArtifactStorageError: If no batch parquet files are found.

Example

>>> results = data_designer.create(config, num_records=1000)
>>> results.export("output.jsonl")
PosixPath('output.jsonl')
>>> results.export("output.csv")
PosixPath('output.csv')
>>> results.export("output.txt", format="jsonl")
PosixPath('output.txt')

Source code in packages/data-designer/src/data_designer/interface/results.py
def export(self, path: Path | str, *, format: ExportFormat | None = None) -> Path:
    """Export the generated dataset to a single file by streaming batch files.

    The output format is inferred from the file extension when *format* is
    omitted.  Pass *format* explicitly to override the extension (e.g. write a
    ``.txt`` file as JSONL).

    Unlike :meth:`load_dataset`, this method never materialises the full dataset
    in memory — it reads batch parquet files one at a time and appends each to
    the output file, keeping peak memory proportional to a single batch.

    Args:
        path: Output file path. The exact path is used as-is; the extension is
            not rewritten.
        format: Output format. One of ``'jsonl'``, ``'csv'``, or ``'parquet'``.
            When omitted, the format is inferred from the file extension.

    Returns:
        Path to the written file.

    Raises:
        InvalidFileFormatError: If the format cannot be determined or is not
            one of the supported values.
        ArtifactStorageError: If no batch parquet files are found.

    Example:
        >>> results = data_designer.create(config, num_records=1000)
        >>> results.export("output.jsonl")
        PosixPath('output.jsonl')
        >>> results.export("output.csv")
        PosixPath('output.csv')
        >>> results.export("output.txt", format="jsonl")
        PosixPath('output.txt')
    """
    path = Path(path)
    resolved_format: str = format if format is not None else path.suffix.lstrip(".").lower()
    if resolved_format not in SUPPORTED_EXPORT_FORMATS:
        raise InvalidFileFormatError(
            f"Unsupported export format: {resolved_format!r}. Choose one of: {', '.join(SUPPORTED_EXPORT_FORMATS)}."
        )
    batch_files = sorted(self.artifact_storage.final_dataset_path.glob("batch_*.parquet"))
    if not batch_files:
        raise ArtifactStorageError("No batch parquet files found to export.")
    if resolved_format == "jsonl":
        _export_jsonl(batch_files, path)
    elif resolved_format == "csv":
        _export_csv(batch_files, path)
    elif resolved_format == "parquet":
        _export_parquet(batch_files, path)
    return path

get_path_to_processor_artifacts(processor_name)

Get the path to the artifacts generated by a processor.

Parameters:

- processor_name (str, required): The name of the processor to load the artifact from.

Returns:

- Path: The path to the artifacts.

Source code in packages/data-designer/src/data_designer/interface/results.py
def get_path_to_processor_artifacts(self, processor_name: str) -> Path:
    """Get the path to the artifacts generated by a processor.

    Args:
        processor_name: The name of the processor to load the artifact from.

    Returns:
        The path to the artifacts.
    """
    if not self.artifact_storage.processors_outputs_path.exists():
        raise ArtifactStorageError(f"Processor {processor_name} has no artifacts.")
    return self.artifact_storage.processors_outputs_path / processor_name
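
A usage sketch; the processor name "my_processor" is hypothetical and must match a processor configured for the run:

artifacts_dir = results.get_path_to_processor_artifacts("my_processor")  # hypothetical processor name
for artifact in sorted(artifacts_dir.iterdir()):
    print(artifact.name)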

load_analysis()

Load the profiling analysis results for the generated dataset.

Returns:

- DatasetProfilerResults: Statistical analysis and quality metrics for configured columns in the generated dataset.

Source code in packages/data-designer/src/data_designer/interface/results.py
def load_analysis(self) -> DatasetProfilerResults:
    """Load the profiling analysis results for the generated dataset.

    Returns:
        DatasetProfilerResults containing statistical analysis and quality metrics
            for configured columns in the generated dataset.
    """
    return self._analysis

load_dataset()

Load the generated dataset as a pandas DataFrame.

Returns:

- DataFrame: A pandas DataFrame containing the full generated dataset.

Source code in packages/data-designer/src/data_designer/interface/results.py
def load_dataset(self) -> pd.DataFrame:
    """Load the generated dataset as a pandas DataFrame.

    Returns:
        A pandas DataFrame containing the full generated dataset.
    """
    return self.artifact_storage.load_dataset()
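
A usage sketch; the columns depend entirely on your configuration, so only the shape and column listing are shown:

df = results.load_dataset()
print(df.shape)             # (num_records, num_columns)
print(df.columns.tolist())  # column names defined by your config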

load_processor_dataset(processor_name)

Load the dataset generated by a processor.

This only works for processors that write their artifacts in Parquet format.

Parameters:

- processor_name (str, required): The name of the processor to load the dataset from.

Returns:

- DataFrame: A pandas DataFrame containing the dataset generated by the processor.

Source code in packages/data-designer/src/data_designer/interface/results.py
def load_processor_dataset(self, processor_name: str) -> pd.DataFrame:
    """Load the dataset generated by a processor.

    This only works for processors that write their artifacts in Parquet format.

    Args:
        processor_name: The name of the processor to load the dataset from.

    Returns:
        A pandas DataFrame containing the dataset generated by the processor.
    """
    return self.artifact_storage.load_processor_dataset(processor_name)
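
A usage sketch; the processor name "my_processor" is hypothetical, and the processor must write its artifacts in Parquet format:

processor_df = results.load_processor_dataset("my_processor")  # hypothetical processor name
print(processor_df.head())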

push_to_hub(repo_id, description, *, token=None, private=False, tags=None)

Push dataset to HuggingFace Hub.

Uploads all artifacts, including:

- Main parquet batch files (data subset)
- Processor output batch files ({processor_name} subsets)
- Configuration (builder_config.json)
- Metadata (metadata.json)
- Auto-generated dataset card (README.md)

Parameters:

- repo_id (str, required): HuggingFace repo ID (e.g., "username/my-dataset").
- description (str, required): Custom description text for the dataset card. Appears after the title.
- token (str | None, default None): HuggingFace API token. If None, the token is automatically resolved from the HF_TOKEN environment variable or cached credentials from hf auth login.
- private (bool, default False): Whether to create the repo as private.
- tags (list[str] | None, default None): Additional custom tags for the dataset.

Returns:

- str: URL to the uploaded dataset.

Example

>>> results = data_designer.create(config, num_records=1000)
>>> description = "This dataset contains synthetic conversations for training chatbots."
>>> results.push_to_hub("username/my-synthetic-dataset", description, tags=["chatbot", "conversation"])
'https://huggingface.co/datasets/username/my-synthetic-dataset'
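
A further sketch passing the keyword-only arguments explicitly; the environment variable name MY_HF_TOKEN and the repo ID are illustrative:

import os

url = results.push_to_hub(
    "username/my-dataset",
    "Synthetic dataset generated with Data Designer.",
    token=os.environ.get("MY_HF_TOKEN"),  # otherwise resolved from HF_TOKEN or cached credentials
    private=True,
    tags=["synthetic"],
)
print(url)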

Source code in packages/data-designer/src/data_designer/interface/results.py
def push_to_hub(
    self,
    repo_id: str,
    description: str,
    *,
    token: str | None = None,
    private: bool = False,
    tags: list[str] | None = None,
) -> str:
    """Push dataset to HuggingFace Hub.

    Uploads all artifacts including:
    - Main parquet batch files (data subset)
    - Processor output batch files ({processor_name} subsets)
    - Configuration (builder_config.json)
    - Metadata (metadata.json)
    - Auto-generated dataset card (README.md)

    Args:
        repo_id: HuggingFace repo ID (e.g., "username/my-dataset")
        description: Custom description text for the dataset card.
            Appears after the title.
        token: HuggingFace API token. If None, the token is automatically
            resolved from HF_TOKEN environment variable or cached credentials
            from `hf auth login`.
        private: Create private repo
        tags: Additional custom tags for the dataset.

    Returns:
        URL to the uploaded dataset

    Example:
        >>> results = data_designer.create(config, num_records=1000)
        >>> description = "This dataset contains synthetic conversations for training chatbots."
        >>> results.push_to_hub("username/my-synthetic-dataset", description, tags=["chatbot", "conversation"])
        'https://huggingface.co/datasets/username/my-synthetic-dataset'
    """
    client = HuggingFaceHubClient(token=token)
    return client.upload_dataset(
        repo_id=repo_id,
        base_dataset_path=self.artifact_storage.base_dataset_path,
        private=private,
        description=description,
        tags=tags,
    )