Dataset Creation Results

DatasetCreationResults is returned by DataDesigner.create(). It provides access to persisted creation artifacts, including the generated dataset, profiling analysis, processor outputs, task traces, and dataset metadata, and it supports uploading the dataset to the Hugging Face Hub.

For preview runs, DataDesigner.preview() returns the in-memory data_designer.config.preview_results.PreviewResults object; persisted dataset creation via DataDesigner.create() returns DatasetCreationResults.
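
A minimal usage sketch, assuming data_designer is an initialized DataDesigner instance and config is a valid configuration; the record count is illustrative, and only methods documented on this page are used:

results = data_designer.create(config, num_records=1000)
print(results.count_records())      # row count read from parquet metadata
df = results.load_dataset()         # full dataset as a pandas DataFrame
analysis = results.load_analysis()  # DatasetProfilerResults for the run
results.export("output.jsonl")      # stream batches into a single JSONL file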

DatasetCreationResults

Bases: WithRecordSamplerMixin

Results container for a Data Designer dataset creation run.

This class provides access to the generated dataset, profiling analysis, and visualization utilities. It is returned by the DataDesigner.create() method and implements the ResultsProtocol of the DataDesigner interface.

Creates a new instance with results based on a dataset creation run.

Parameters:

- artifact_storage (ArtifactStorage, required): Storage manager for accessing generated artifacts.
- analysis (DatasetProfilerResults, required): Profiling results for the generated dataset.
- config_builder (DataDesignerConfigBuilder, required): Configuration builder used to create the dataset.
- dataset_metadata (DatasetMetadata, required): Metadata about the generated dataset (e.g., seed column names).
- task_traces (list[TaskTrace] | None, default None): Optional list of TaskTrace objects from the async scheduler.

Methods:

- count_records: Return the total number of records in the generated dataset.
- export: Export the generated dataset to a single file by streaming batch files.
- get_path_to_processor_artifacts: Get the path to the artifacts generated by a processor.
- load_analysis: Load the profiling analysis results for the generated dataset.
- load_dataset: Load the generated dataset as a pandas DataFrame.
- load_processor_dataset: Load the dataset generated by a processor.
- push_to_hub: Push dataset to HuggingFace Hub.

Source code in packages/data-designer/src/data_designer/interface/results.py
def __init__(
    self,
    *,
    artifact_storage: ArtifactStorage,
    analysis: DatasetProfilerResults,
    config_builder: DataDesignerConfigBuilder,
    dataset_metadata: DatasetMetadata,
    task_traces: list[TaskTrace] | None = None,
):
    """Creates a new instance with results based on a dataset creation run.

    Args:
        artifact_storage: Storage manager for accessing generated artifacts.
        analysis: Profiling results for the generated dataset.
        config_builder: Configuration builder used to create the dataset.
        dataset_metadata: Metadata about the generated dataset (e.g., seed column names).
        task_traces: Optional list of TaskTrace objects from the async scheduler.
    """
    self.artifact_storage = artifact_storage
    self._analysis = analysis
    self._config_builder = config_builder
    self.dataset_metadata = dataset_metadata
    self.task_traces: list[TaskTrace] = task_traces or []

count_records()

Return the total number of records in the generated dataset.

Counts rows by reading Parquet file metadata only — no data pages are loaded, so memory usage is constant regardless of dataset size.

Returns:

- int: Total row count across all batch parquet files.

Source code in packages/data-designer/src/data_designer/interface/results.py
def count_records(self) -> int:
    """Return the total number of records in the generated dataset.

    Counts rows by reading Parquet file metadata only — no data pages are
    loaded, so memory usage is constant regardless of dataset size.

    Returns:
        Total row count across all batch parquet files.
    """
    batch_files = sorted(self.artifact_storage.final_dataset_path.glob("batch_*.parquet"))
    return sum(lazy.pq.read_metadata(f).num_rows for f in batch_files)
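
A quick sketch contrasting count_records() with loading the full dataset; both yield the same count, but count_records() reads only parquet metadata:

n_records = results.count_records()
assert n_records == len(results.load_dataset())  # same count, without loading data pages into memory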

export(path, *, format=None)

Export the generated dataset to a single file by streaming batch files.

The output format is inferred from the file extension when format is omitted. Pass format explicitly to override the extension (e.g. write a .txt file as JSONL).

Unlike load_dataset(), this method never materialises the full dataset in memory — it reads batch parquet files one at a time and appends each to the output file, keeping peak memory proportional to a single batch.

Parameters:

- path (Path | str, required): Output file path. The exact path is used as-is; the extension is not rewritten.
- format (ExportFormat | None, default None): Output format. One of 'jsonl', 'csv', or 'parquet'. When omitted, the format is inferred from the file extension.

Returns:

- Path: Path to the written file.

Raises:

- InvalidFileFormatError: If the format cannot be determined or is not one of the supported values.
- ArtifactStorageError: If no batch parquet files are found.

Example

>>> results = data_designer.create(config, num_records=1000)
>>> results.export("output.jsonl")
PosixPath('output.jsonl')
>>> results.export("output.csv")
PosixPath('output.csv')
>>> results.export("output.txt", format="jsonl")
PosixPath('output.txt')

Source code in packages/data-designer/src/data_designer/interface/results.py
def export(self, path: Path | str, *, format: ExportFormat | None = None) -> Path:
    """Export the generated dataset to a single file by streaming batch files.

    The output format is inferred from the file extension when *format* is
    omitted.  Pass *format* explicitly to override the extension (e.g. write a
    ``.txt`` file as JSONL).

    Unlike :meth:`load_dataset`, this method never materialises the full dataset
    in memory — it reads batch parquet files one at a time and appends each to
    the output file, keeping peak memory proportional to a single batch.

    Args:
        path: Output file path. The exact path is used as-is; the extension is
            not rewritten.
        format: Output format. One of ``'jsonl'``, ``'csv'``, or ``'parquet'``.
            When omitted, the format is inferred from the file extension.

    Returns:
        Path to the written file.

    Raises:
        InvalidFileFormatError: If the format cannot be determined or is not
            one of the supported values.
        ArtifactStorageError: If no batch parquet files are found.

    Example:
        >>> results = data_designer.create(config, num_records=1000)
        >>> results.export("output.jsonl")
        PosixPath('output.jsonl')
        >>> results.export("output.csv")
        PosixPath('output.csv')
        >>> results.export("output.txt", format="jsonl")
        PosixPath('output.txt')
    """
    path = Path(path)
    resolved_format: str = format if format is not None else path.suffix.lstrip(".").lower()
    if resolved_format not in SUPPORTED_EXPORT_FORMATS:
        raise InvalidFileFormatError(
            f"Unsupported export format: {resolved_format!r}. Choose one of: {', '.join(SUPPORTED_EXPORT_FORMATS)}."
        )
    batch_files = sorted(self.artifact_storage.final_dataset_path.glob("batch_*.parquet"))
    if not batch_files:
        raise ArtifactStorageError("No batch parquet files found to export.")
    if resolved_format == "jsonl":
        _export_jsonl(batch_files, path)
    elif resolved_format == "csv":
        _export_csv(batch_files, path)
    elif resolved_format == "parquet":
        _export_parquet(batch_files, path)
    return path

get_path_to_processor_artifacts(processor_name)

Get the path to the artifacts generated by a processor.

Parameters:

- processor_name (str, required): The name of the processor to load the artifact from.

Returns:

- Path: The path to the artifacts.

Source code in packages/data-designer/src/data_designer/interface/results.py
def get_path_to_processor_artifacts(self, processor_name: str) -> Path:
    """Get the path to the artifacts generated by a processor.

    Args:
        processor_name: The name of the processor to load the artifact from.

    Returns:
        The path to the artifacts.
    """
    if not self.artifact_storage.processors_outputs_path.exists():
        raise ArtifactStorageError(f"Processor {processor_name} has no artifacts.")
    return self.artifact_storage.processors_outputs_path / processor_name
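
A usage sketch; the processor name "my_processor" is hypothetical and must match a processor configured for the run:

artifacts_dir = results.get_path_to_processor_artifacts("my_processor")  # hypothetical processor name
for artifact in sorted(artifacts_dir.iterdir()):
    print(artifact.name)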

load_analysis()

Load the profiling analysis results for the generated dataset.

Returns:

- DatasetProfilerResults: Statistical analysis and quality metrics for configured columns in the generated dataset.

Source code in packages/data-designer/src/data_designer/interface/results.py
def load_analysis(self) -> DatasetProfilerResults:
    """Load the profiling analysis results for the generated dataset.

    Returns:
        DatasetProfilerResults containing statistical analysis and quality metrics
            for configured columns in the generated dataset.
    """
    return self._analysis

load_dataset()

Load the generated dataset as a pandas DataFrame.

Returns:

- DataFrame: A pandas DataFrame containing the full generated dataset.

Source code in packages/data-designer/src/data_designer/interface/results.py
def load_dataset(self) -> pd.DataFrame:
    """Load the generated dataset as a pandas DataFrame.

    Returns:
        A pandas DataFrame containing the full generated dataset.
    """
    return self.artifact_storage.load_dataset()
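
A usage sketch; the columns depend entirely on your configuration, so only the shape and column listing are shown:

df = results.load_dataset()
print(df.shape)             # (num_records, num_columns)
print(df.columns.tolist())  # column names defined by your config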

load_processor_dataset(processor_name)

Load the dataset generated by a processor.

This only works for processors that write their artifacts in Parquet format.

Parameters:

- processor_name (str, required): The name of the processor to load the dataset from.

Returns:

- DataFrame: A pandas DataFrame containing the dataset generated by the processor.

Source code in packages/data-designer/src/data_designer/interface/results.py
def load_processor_dataset(self, processor_name: str) -> pd.DataFrame:
    """Load the dataset generated by a processor.

    This only works for processors that write their artifacts in Parquet format.

    Args:
        processor_name: The name of the processor to load the dataset from.

    Returns:
        A pandas DataFrame containing the dataset generated by the processor.
    """
    return self.artifact_storage.load_processor_dataset(processor_name)
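
A usage sketch; the processor name "my_processor" is hypothetical, and the processor must write its artifacts in Parquet format:

processor_df = results.load_processor_dataset("my_processor")  # hypothetical processor name
print(processor_df.head())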

push_to_hub(repo_id, description, *, token=None, private=False, tags=None)

Push dataset to HuggingFace Hub.

Uploads all artifacts, including:

- Main parquet batch files (data subset)
- Processor output batch files ({processor_name} subsets)
- Configuration (builder_config.json)
- Metadata (metadata.json)
- Auto-generated dataset card (README.md)

Parameters:

- repo_id (str, required): HuggingFace repo ID (e.g., "username/my-dataset").
- description (str, required): Custom description text for the dataset card. Appears after the title.
- token (str | None, default None): HuggingFace API token. If None, the token is automatically resolved from the HF_TOKEN environment variable or cached credentials from hf auth login.
- private (bool, default False): Whether to create the repo as private.
- tags (list[str] | None, default None): Additional custom tags for the dataset.

Returns:

- str: URL to the uploaded dataset.

Example

>>> results = data_designer.create(config, num_records=1000)
>>> description = "This dataset contains synthetic conversations for training chatbots."
>>> results.push_to_hub("username/my-synthetic-dataset", description, tags=["chatbot", "conversation"])
'https://huggingface.co/datasets/username/my-synthetic-dataset'
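
A further sketch passing the keyword-only arguments explicitly; the environment variable name MY_HF_TOKEN and the repo ID are illustrative:

import os

url = results.push_to_hub(
    "username/my-dataset",
    "Synthetic dataset generated with Data Designer.",
    token=os.environ.get("MY_HF_TOKEN"),  # otherwise resolved from HF_TOKEN or cached credentials
    private=True,
    tags=["synthetic"],
)
print(url)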

Source code in packages/data-designer/src/data_designer/interface/results.py
def push_to_hub(
    self,
    repo_id: str,
    description: str,
    *,
    token: str | None = None,
    private: bool = False,
    tags: list[str] | None = None,
) -> str:
    """Push dataset to HuggingFace Hub.

    Uploads all artifacts including:
    - Main parquet batch files (data subset)
    - Processor output batch files ({processor_name} subsets)
    - Configuration (builder_config.json)
    - Metadata (metadata.json)
    - Auto-generated dataset card (README.md)

    Args:
        repo_id: HuggingFace repo ID (e.g., "username/my-dataset")
        description: Custom description text for the dataset card.
            Appears after the title.
        token: HuggingFace API token. If None, the token is automatically
            resolved from HF_TOKEN environment variable or cached credentials
            from `hf auth login`.
        private: Create private repo
        tags: Additional custom tags for the dataset.

    Returns:
        URL to the uploaded dataset

    Example:
        >>> results = data_designer.create(config, num_records=1000)
        >>> description = "This dataset contains synthetic conversations for training chatbots."
        >>> results.push_to_hub("username/my-synthetic-dataset", description, tags=["chatbot", "conversation"])
        'https://huggingface.co/datasets/username/my-synthetic-dataset'
    """
    client = HuggingFaceHubClient(token=token)
    return client.upload_dataset(
        repo_id=repo_id,
        base_dataset_path=self.artifact_storage.base_dataset_path,
        private=private,
        description=description,
        tags=tags,
    )