Seed Readers

Seed readers are engine-side adapters that turn a configured seed source into tabular seed rows. The engine attaches a SeedSource and secret resolver, asks the reader for column names and dataset size, then streams batches into generation.

Related pages: seeds, Seed Datasets, and Build Your Own.

Core Contracts

`SeedReader`

Bases: ABC, Generic[SourceT]

Base class for reading a seed dataset.

Seeds are read using DuckDB. Reader implementations define the DuckDB connection setup and how to obtain a dataset URI that DuckDB can query (that is, the target of "... FROM ...").

Through attach(...), the Data Designer engine automatically supplies the appropriate SeedSource along with a SecretResolver for any secret fields in the config. Subclasses that need per-attachment setup can override on_attach(...); because the base implementation is a no-op, they do not need to call super().

Methods:

Name	Description
`attach`	Attach a source and secret resolver to the instance.
`create_filesystem_context`	Create a rooted filesystem context for directory-backed seed readers.
`get_column_names`	Return the seed dataset's column names.
`get_seed_type`	Return the seed_type of the source class this reader is generic over.
`on_attach`	Hook for subclasses that need per-attachment setup.

`attach(source, secret_resolver)`

Attach a source and secret resolver to the instance.

This is called internally by the engine so that these objects do not need to be provided in the reader's constructor.

Source code in packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py

def attach(self, source: SourceT, secret_resolver: SecretResolver) -> None:
    """Attach a source and secret resolver to the instance.

    This is called internally by the engine so that these objects do not
    need to be provided in the reader's constructor.
    """
    self._reset_attachment_state()
    self.source = source
    self.secret_resolver = secret_resolver
    self.on_attach()

`create_filesystem_context(root_path)`

Create a rooted filesystem context for directory-backed seed readers.

Source code in packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py

def create_filesystem_context(self, root_path: Path | str) -> SeedReaderFileSystemContext:
    """Create a rooted filesystem context for directory-backed seed readers."""
    resolved_root_path = Path(root_path).expanduser().resolve()
    rooted_fs = DirFileSystem(path=str(resolved_root_path), fs=LocalFileSystem())
    return SeedReaderFileSystemContext(fs=rooted_fs, root_path=resolved_root_path)

`get_column_names()`

Return the seed dataset's column names.

Source code in packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py

def get_column_names(self) -> list[str]:
    """Returns the seed dataset's column names"""
    self._ensure_attached()
    conn = self._get_duckdb_connection()
    describe_query = f"DESCRIBE SELECT * FROM '{self.get_dataset_uri()}'"
    column_descriptions = conn.execute(describe_query).fetchall()
    return [col[0] for col in column_descriptions]

`get_seed_type()`

Return the seed_type of the source class this reader is generic over.

Source code in packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py

def get_seed_type(self) -> str:
    """Return the seed_type of the source class this reader is generic over."""
    # Get the generic type arguments from the reader class
    # Check __orig_bases__ for the generic base class
    for base in getattr(type(self), "__orig_bases__", []):
        origin = get_origin(base)
        if isinstance(origin, type) and issubclass(origin, SeedReader):
            args = get_args(base)
            if args:
                source_cls = get_origin(args[0]) or args[0]
                # Extract seed_type from the source class
                if hasattr(source_cls, "model_fields") and "seed_type" in source_cls.model_fields:
                    field = source_cls.model_fields["seed_type"]
                    default_value = field.default
                    if isinstance(default_value, str):
                        return default_value

    raise SeedReaderError("Reader does not have a valid generic source type with seed_type")
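The lookup above relies on standard `typing` introspection. The following is a minimal, standard-library-only sketch of the same idea. `Reader`, `CsvSource`, `CsvReader`, and `resolve_seed_type` are hypothetical stand-ins, and the plain `seed_type` class attribute is a simplification of the `model_fields` default that the real method inspects.

```python
from typing import Generic, TypeVar, get_args, get_origin

SourceT = TypeVar("SourceT")


class Reader(Generic[SourceT]):
    """Hypothetical stand-in for SeedReader (not the real class)."""


class CsvSource:
    # Simplification: the real method reads the default of a "seed_type"
    # entry in the source class's model_fields instead of a class attribute.
    seed_type = "csv"


class CsvReader(Reader[CsvSource]):
    """Hypothetical reader that is generic over CsvSource."""


def resolve_seed_type(reader: Reader) -> str:
    # __orig_bases__ keeps the parameterized base, e.g. Reader[CsvSource].
    for base in getattr(type(reader), "__orig_bases__", []):
        origin = get_origin(base)  # Reader for Reader[CsvSource]
        if isinstance(origin, type) and issubclass(origin, Reader):
            args = get_args(base)  # (CsvSource,)
            if args:
                source_cls = get_origin(args[0]) or args[0]
                value = getattr(source_cls, "seed_type", None)
                if isinstance(value, str):
                    return value
    raise TypeError("Reader has no generic source type with a seed_type")


print(resolve_seed_type(CsvReader()))
```

An unparameterized `Reader()` has only `Generic[SourceT]` in `__orig_bases__`, so the loop finds no matching base and the function raises, mirroring the error path in the method above.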

`on_attach()`

Hook for subclasses that need per-attachment setup.
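As an illustration of how the hook fits the `attach(...)` flow, here is a minimal sketch that uses stand-in classes rather than the real `SeedReader`. `FakeSource`, `MiniReader`, and `CachingReader` are hypothetical names, and the sketch omits the `_reset_attachment_state()` call that the real `attach` performs first.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class FakeSource:
    """Hypothetical stand-in for a SeedSource."""
    path: str


class MiniReader:
    """Hypothetical stand-in that mirrors the attach()/on_attach() flow."""

    def attach(self, source: FakeSource, secret_resolver: Any) -> None:
        # The real attach() also resets attachment state before this point.
        self.source = source
        self.secret_resolver = secret_resolver
        self.on_attach()

    def on_attach(self) -> None:
        """No-op by default, so overrides do not need to call super()."""


class CachingReader(MiniReader):
    def __init__(self) -> None:
        self.attach_count = 0
        self.cache: dict[str, Any] = {}

    def on_attach(self) -> None:
        # Per-attachment setup: drop anything derived from the old source.
        self.cache = {}
        self.attach_count += 1


reader = CachingReader()
reader.attach(FakeSource(path="a.csv"), secret_resolver=None)
reader.cache["columns"] = ["id", "text"]
reader.attach(FakeSource(path="b.csv"), secret_resolver=None)
print(reader.source.path, reader.attach_count, reader.cache)
```

Keeping the setup in the hook rather than the constructor means the same reader instance can be attached to a new source without carrying over stale state.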

Source code in packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py

def on_attach(self) -> None:
    """Hook for subclasses that need per-attachment setup."""

`FileSystemSeedReader`

Bases: SeedReader[FileSystemSourceT], ABC

Base class for filesystem-derived seed readers.

Plugin authors implement build_manifest(...) to describe the cheap logical rows available under the configured filesystem root. Readers that need expensive enrichment can optionally override hydrate_row(...) to emit one record dict or an iterable of record dicts per manifest row. When emitted records change the manifest schema, output_columns must declare the exact hydrated output schema for each emitted record. The framework owns attachment-scoped filesystem context reuse, manifest sampling, partitioning, randomization, batching, and DuckDB registration details.
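The split between a cheap manifest and optional hydration can be sketched with plain functions. This is not the real plugin interface: the signatures of `build_manifest` and `hydrate_row` below, the `OUTPUT_COLUMNS` constant, and the column names are all assumptions made for illustration, since this page does not show the actual method signatures.

```python
import tempfile
from collections.abc import Iterable
from pathlib import Path
from typing import Any

# Hypothetical declared schema for the hydrated records.
OUTPUT_COLUMNS = ["relative_path", "line_number", "text"]


def build_manifest(root: Path) -> list[dict[str, Any]]:
    """Cheap pass: one logical row per file, without reading contents."""
    return [
        {"relative_path": path.relative_to(root).as_posix()}
        for path in sorted(root.rglob("*.txt"))
    ]


def hydrate_row(root: Path, manifest_row: dict[str, Any]) -> Iterable[dict[str, Any]]:
    """Expensive pass: read the file and emit one record per non-empty line."""
    relative_path = manifest_row["relative_path"]
    lines = (root / relative_path).read_text(encoding="utf-8").splitlines()
    for line_number, text in enumerate(lines, start=1):
        if text.strip():
            yield {
                "relative_path": relative_path,
                "line_number": line_number,
                "text": text,
            }


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_text("first\nsecond\n", encoding="utf-8")
    (root / "b.txt").write_text("only\n", encoding="utf-8")

    manifest = build_manifest(root)
    records = [record for row in manifest for record in hydrate_row(root, row)]

print(len(manifest), len(records))
```

Here hydration fans out one manifest row into several records and adds columns that the manifest lacks, which is the case where the emitted schema has to be declared up front.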

`SeedReaderFileSystemContext`

Filesystem and root path available to filesystem seed-reader plugins.

`SeedReaderBatch`

Bases: Protocol

Batch object returned by seed readers and convertible to a DataFrame.

`SeedReaderBatchReader`

Bases: Protocol

Reader that yields seed batches until exhausted.

`PandasSeedReaderBatch`

Seed-reader batch backed by an in-memory pandas DataFrame.

Methods:

Name	Description
`to_pandas`	Return the batch as a pandas DataFrame.

`to_pandas()`

Return the batch as a pandas DataFrame.

Source code in packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py

def to_pandas(self) -> pd.DataFrame:
    """Return the batch as a pandas DataFrame."""
    return self.dataframe

`create_seed_reader_output_dataframe`

Create a DataFrame and verify hydrated records match the declared output schema.

Source code in packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py

def create_seed_reader_output_dataframe(
    *,
    records: list[dict[str, Any]],
    output_columns: list[str],
) -> pd.DataFrame:
    """Create a DataFrame and verify hydrated records match the declared output schema."""
    if not records:
        return lazy.pd.DataFrame(records, columns=output_columns)

    expected_columns = set(output_columns)
    for row_index, record in enumerate(records):
        record_columns = set(record)
        extra_columns = sorted(record_columns - expected_columns)
        missing_columns = [column for column in output_columns if column not in record]
        if not extra_columns and not missing_columns:
            continue

        message_parts: list[str] = [
            f"Hydrated record at index {row_index} does not match output_columns {output_columns!r}."
        ]
        if missing_columns:
            message_parts.append(f"Missing columns: {missing_columns!r}.")
        if extra_columns:
            message_parts.append(f"Undeclared columns: {extra_columns!r}.")
        message_parts.append("Ensure each record emitted by hydrate_row() matches the declared output schema.")
        raise SeedReaderError(" ".join(message_parts))

    return lazy.pd.DataFrame(records, columns=output_columns)
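To show what the check reports without depending on pandas, the sketch below reproduces only the comparison and message-building steps from the function above. `describe_schema_mismatch` is a hypothetical helper written for this example; it returns the message instead of raising `SeedReaderError`.

```python
from __future__ import annotations

from typing import Any


def describe_schema_mismatch(
    records: list[dict[str, Any]],
    output_columns: list[str],
) -> str | None:
    """Return the mismatch message for the first bad record, or None."""
    expected_columns = set(output_columns)
    for row_index, record in enumerate(records):
        extra_columns = sorted(set(record) - expected_columns)
        missing_columns = [c for c in output_columns if c not in record]
        if not extra_columns and not missing_columns:
            continue
        parts = [
            f"Hydrated record at index {row_index} does not match "
            f"output_columns {output_columns!r}."
        ]
        if missing_columns:
            parts.append(f"Missing columns: {missing_columns!r}.")
        if extra_columns:
            parts.append(f"Undeclared columns: {extra_columns!r}.")
        return " ".join(parts)
    return None


good = [{"path": "a.txt", "text": "hi"}]
bad = good + [{"path": "b.txt", "size": 3}]

print(describe_schema_mismatch(good, ["path", "text"]))
print(describe_schema_mismatch(bad, ["path", "text"]))
```

The check is strict in both directions: a record fails if it omits a declared column or carries an undeclared one, and only the first offending record is reported.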

Built-In Readers

`LocalFileSeedReader`

Bases: SeedReader[LocalFileSeedSource]

`HuggingFaceSeedReader`

Bases: SeedReader[HuggingFaceSeedSource]

`DataFrameSeedReader`

Bases: SeedReader[DataFrameSeedSource]

`DirectorySeedReader`

Bases: FileSystemSeedReader[DirectorySeedSource]

`FileContentsSeedReader`

Bases: FileSystemSeedReader[FileContentsSeedSource]

`AgentRolloutSeedReader`

Bases: FileSystemSeedReader[AgentRolloutSeedSource]

Registry and Errors

`SeedReaderRegistry`

Source code in packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py

def __init__(self, readers: Sequence[SeedReader]):
    self._readers: dict[str, SeedReader] = {}
    for reader in readers:
        self.add_reader(reader)
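The constructor above stores readers in a string-keyed dict. One plausible reading, sketched below with stand-in classes, is that each reader is keyed by its `get_seed_type()` value. That keying, the duplicate check, and the `get_reader` method are assumptions not confirmed by this page, and `StubReader` and `MiniRegistry` are hypothetical names.

```python
from collections.abc import Sequence


class StubReader:
    """Hypothetical reader exposing only get_seed_type()."""

    def __init__(self, seed_type: str) -> None:
        self._seed_type = seed_type

    def get_seed_type(self) -> str:
        return self._seed_type


class MiniRegistry:
    """Hypothetical registry; not the real SeedReaderRegistry."""

    def __init__(self, readers: Sequence[StubReader]) -> None:
        self._readers: dict[str, StubReader] = {}
        for reader in readers:
            self.add_reader(reader)

    def add_reader(self, reader: StubReader) -> None:
        seed_type = reader.get_seed_type()
        # Assumption: two readers for the same seed type are rejected.
        if seed_type in self._readers:
            raise ValueError(f"A reader for seed_type {seed_type!r} already exists")
        self._readers[seed_type] = reader

    def get_reader(self, seed_type: str) -> StubReader:
        # Assumption: lookup is by the seed_type string of the source.
        try:
            return self._readers[seed_type]
        except KeyError:
            raise ValueError(f"No reader registered for seed_type {seed_type!r}") from None


local_reader = StubReader("local")
registry = MiniRegistry([local_reader, StubReader("hf")])
print(registry.get_reader("local") is local_reader)
```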

`SeedReaderError`

Bases: DataDesignerError