Seed Readers

Seed readers are engine-side adapters that turn a configured seed source into tabular seed rows. The engine attaches a SeedSource and secret resolver, asks the reader for column names and dataset size, then streams batches into generation.

Related pages: seeds, Seed Datasets, and Build Your Own.

Core Contracts

SeedReader

Bases: ABC, Generic[SourceT]

Base class for reading a seed dataset.

Seeds are read using DuckDB. Reader implementations define the DuckDB connection setup and how to obtain a URI that DuckDB can query (i.e., the target of a "... FROM ..." clause).

The Data Designer engine automatically supplies the appropriate SeedSource and a SecretResolver to use for any secret fields in the config via attach(...). Subclasses that need per-attachment setup can override on_attach(...) without needing to call super().

Methods:

Name                       Description
attach                     Attach a source and secret resolver to the instance.
create_filesystem_context  Create a rooted filesystem context for directory-backed seed readers.
get_column_names           Return the seed dataset's column names.
get_seed_type              Return the seed_type of the source class this reader is generic over.
on_attach                  Hook for subclasses that need per-attachment setup.

attach(source, secret_resolver)

Attach a source and secret resolver to the instance.

This is called internally by the engine so that these objects do not need to be provided in the reader's constructor.

Source code in packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py
def attach(self, source: SourceT, secret_resolver: SecretResolver) -> None:
    """Attach a source and secret resolver to the instance.

    This is called internally by the engine so that these objects do not
    need to be provided in the reader's constructor.
    """
    self._reset_attachment_state()
    self.source = source
    self.secret_resolver = secret_resolver
    self.on_attach()
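
The attach(...) / on_attach(...) split is a template-method hook: the base class does the bookkeeping and subclasses only fill in the per-attachment step. A minimal sketch of that pattern, using stand-in classes rather than the real SeedReader. ToyReader, TokenReader, the dict-shaped source, and the callable resolver are all hypothetical, and because this page does not show what _reset_attachment_state() clears, the cache reset below is an assumption:

```python
from typing import Any, Callable


class ToyReader:
    """Stand-in for a reader base class; not the real SeedReader."""

    def __init__(self) -> None:
        self.source: dict[str, Any] | None = None
        self.secret_resolver: Callable[[str], str] | None = None
        self._cache: dict[str, Any] = {}

    def attach(self, source: dict[str, Any], secret_resolver: Callable[[str], str]) -> None:
        # Assumption: state derived from a previous attachment is dropped first.
        self._cache.clear()
        self.source = source
        self.secret_resolver = secret_resolver
        self.on_attach()

    def on_attach(self) -> None:
        """Default hook does nothing, so overrides need not call super()."""


class TokenReader(ToyReader):
    def on_attach(self) -> None:
        # Per-attachment setup: resolve a secret named by the source config.
        assert self.source is not None and self.secret_resolver is not None
        self._cache["token"] = self.secret_resolver(self.source["token_ref"])


reader = TokenReader()
reader.attach({"token_ref": "HF_TOKEN"}, lambda name: f"resolved:{name}")
first_token = reader._cache["token"]
reader.attach({"token_ref": "OTHER_TOKEN"}, lambda name: f"resolved:{name}")
second_token = reader._cache["token"]
```

Re-attaching the same instance replaces both the source and the derived token, which is why the constructor never needs to receive either object.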

create_filesystem_context(root_path)

Create a rooted filesystem context for directory-backed seed readers.

Source code in packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py
def create_filesystem_context(self, root_path: Path | str) -> SeedReaderFileSystemContext:
    """Create a rooted filesystem context for directory-backed seed readers."""
    resolved_root_path = Path(root_path).expanduser().resolve()
    rooted_fs = DirFileSystem(path=str(resolved_root_path), fs=LocalFileSystem())
    return SeedReaderFileSystemContext(fs=rooted_fs, root_path=resolved_root_path)

get_column_names()

Return the seed dataset's column names.

Source code in packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py
def get_column_names(self) -> list[str]:
    """Returns the seed dataset's column names"""
    self._ensure_attached()
    conn = self._get_duckdb_connection()
    describe_query = f"DESCRIBE SELECT * FROM '{self.get_dataset_uri()}'"
    column_descriptions = conn.execute(describe_query).fetchall()
    return [col[0] for col in column_descriptions]

get_seed_type()

Return the seed_type of the source class this reader is generic over.

Source code in packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py
def get_seed_type(self) -> str:
    """Return the seed_type of the source class this reader is generic over."""
    # Get the generic type arguments from the reader class
    # Check __orig_bases__ for the generic base class
    for base in getattr(type(self), "__orig_bases__", []):
        origin = get_origin(base)
        if isinstance(origin, type) and issubclass(origin, SeedReader):
            args = get_args(base)
            if args:
                source_cls = get_origin(args[0]) or args[0]
                # Extract seed_type from the source class
                if hasattr(source_cls, "model_fields") and "seed_type" in source_cls.model_fields:
                    field = source_cls.model_fields["seed_type"]
                    default_value = field.default
                    if isinstance(default_value, str):
                        return default_value

    raise SeedReaderError("Reader does not have a valid generic source type with seed_type")
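
The lookup above relies on Python keeping the parameterized base class in __orig_bases__. A self-contained sketch of the same introspection with only the standard library may make the mechanism easier to follow. ToySource, ToyReader, and ToyCsvReader are hypothetical stand-ins; the real source classes are Pydantic models read through model_fields, whereas this sketch reads a plain class attribute:

```python
from typing import Generic, TypeVar, get_args, get_origin

SourceT = TypeVar("SourceT")


class ToySource:
    """Hypothetical source class; stands in for a Pydantic source model."""

    seed_type = "toy"


class ToyReader(Generic[SourceT]):
    def get_seed_type(self) -> str:
        # __orig_bases__ keeps the parameterized base, e.g. ToyReader[ToySource].
        for base in getattr(type(self), "__orig_bases__", ()):
            origin = get_origin(base)
            if isinstance(origin, type) and issubclass(origin, ToyReader):
                args = get_args(base)
                if args:
                    # Unwrap a parameterized source type if one was given.
                    source_cls = get_origin(args[0]) or args[0]
                    seed_type = getattr(source_cls, "seed_type", None)
                    if isinstance(seed_type, str):
                        return seed_type
        raise TypeError("Reader is not parameterized with a source that defines seed_type")


class ToyCsvReader(ToyReader[ToySource]):
    pass


resolved = ToyCsvReader().get_seed_type()
```

An unparameterized ToyReader() falls through the loop and raises, mirroring the SeedReaderError path in the listing above.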

on_attach()

Hook for subclasses that need per-attachment setup.

Source code in packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py
def on_attach(self) -> None:
    """Hook for subclasses that need per-attachment setup."""

FileSystemSeedReader

Bases: SeedReader[FileSystemSourceT], ABC

Base class for filesystem-derived seed readers.

Plugin authors implement build_manifest(...) to describe the cheap logical rows available under the configured filesystem root. Readers that need expensive enrichment can optionally override hydrate_row(...) to emit one record dict or an iterable of record dicts per manifest row. When emitted records change the manifest schema, output_columns must declare the exact hydrated output schema for each emitted record. The framework owns attachment-scoped filesystem context reuse, manifest sampling, partitioning, randomization, batching, and DuckDB registration details.
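
The manifest/hydration split can be illustrated without the framework. The sketch below is not the FileSystemSeedReader API: the free functions build_manifest and hydrate_row, their signatures, and the column names are hypothetical, chosen only to show a cheap listing pass followed by an expensive pass that fans one manifest row out into several records with a different schema:

```python
import tempfile
from pathlib import Path
from typing import Any, Iterable


def build_manifest(root: Path) -> list[dict[str, Any]]:
    """Cheap pass: list files under the root without opening them."""
    return [
        {"relative_path": str(path.relative_to(root)), "size_bytes": path.stat().st_size}
        for path in sorted(root.rglob("*.txt"))
    ]


def hydrate_row(root: Path, manifest_row: dict[str, Any]) -> Iterable[dict[str, Any]]:
    """Expensive pass: read the file and emit one record per non-empty line."""
    text = (root / manifest_row["relative_path"]).read_text(encoding="utf-8")
    for line_number, line in enumerate(text.splitlines(), start=1):
        if line.strip():
            yield {
                "relative_path": manifest_row["relative_path"],
                "line_number": line_number,
                "line": line,
            }


# Hydration changes the manifest schema, so the output columns are declared explicitly.
output_columns = ["relative_path", "line_number", "line"]

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_text("alpha\nbeta\n", encoding="utf-8")
    (root / "b.txt").write_text("gamma\n", encoding="utf-8")
    manifest = build_manifest(root)
    records = [record for row in manifest for record in hydrate_row(root, row)]
```

Here two manifest rows become three hydrated records, and every record carries exactly the declared output columns rather than the manifest's relative_path and size_bytes.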

SeedReaderFileSystemContext

Filesystem and root path available to filesystem seed-reader plugins.

SeedReaderBatch

Bases: Protocol

Batch object returned by seed readers and convertible to a DataFrame.

SeedReaderBatchReader

Bases: Protocol

Reader that yields seed batches until exhausted.

PandasSeedReaderBatch

Seed-reader batch backed by an in-memory pandas DataFrame.

Methods:

Name       Description
to_pandas  Return the batch as a pandas DataFrame.

to_pandas()

Return the batch as a pandas DataFrame.

Source code in packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py
def to_pandas(self) -> pd.DataFrame:
    """Return the batch as a pandas DataFrame."""
    return self.dataframe

create_seed_reader_output_dataframe

Create a DataFrame and verify hydrated records match the declared output schema.

Source code in packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py
def create_seed_reader_output_dataframe(
    *,
    records: list[dict[str, Any]],
    output_columns: list[str],
) -> pd.DataFrame:
    """Create a DataFrame and verify hydrated records match the declared output schema."""
    if not records:
        return lazy.pd.DataFrame(records, columns=output_columns)

    expected_columns = set(output_columns)
    for row_index, record in enumerate(records):
        record_columns = set(record)
        extra_columns = sorted(record_columns - expected_columns)
        missing_columns = [column for column in output_columns if column not in record]
        if not extra_columns and not missing_columns:
            continue

        message_parts: list[str] = [
            f"Hydrated record at index {row_index} does not match output_columns {output_columns!r}."
        ]
        if missing_columns:
            message_parts.append(f"Missing columns: {missing_columns!r}.")
        if extra_columns:
            message_parts.append(f"Undeclared columns: {extra_columns!r}.")
        message_parts.append("Ensure each record emitted by hydrate_row() matches the declared output schema.")
        raise SeedReaderError(" ".join(message_parts))

    return lazy.pd.DataFrame(records, columns=output_columns)
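
The per-record check in this function reduces to two comparisons against the declared schema. A standard-library sketch of just that comparison, leaving out the DataFrame construction and error raising, is shown below; find_schema_mismatch and the sample records are hypothetical and not part of the engine:

```python
from typing import Any


def find_schema_mismatch(
    record: dict[str, Any], output_columns: list[str]
) -> tuple[list[str], list[str]]:
    """Return (missing, extra) columns for one record against the declared schema."""
    expected = set(output_columns)
    # Missing columns follow the declared order; extras are sorted for stable messages.
    missing = [column for column in output_columns if column not in record]
    extra = sorted(set(record) - expected)
    return missing, extra


output_columns = ["path", "text"]
ok = find_schema_mismatch({"path": "a.txt", "text": "hello"}, output_columns)
bad = find_schema_mismatch({"path": "b.txt", "size": 12, "mode": "r"}, output_columns)
```

A record passes only when both lists are empty; in the listing above, any non-empty list produces a SeedReaderError naming the record index and the offending columns.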

Built-In Readers

LocalFileSeedReader

Bases: SeedReader[LocalFileSeedSource]

HuggingFaceSeedReader

Bases: SeedReader[HuggingFaceSeedSource]

DataFrameSeedReader

Bases: SeedReader[DataFrameSeedSource]

DirectorySeedReader

Bases: FileSystemSeedReader[DirectorySeedSource]

FileContentsSeedReader

Bases: FileSystemSeedReader[FileContentsSeedSource]

AgentRolloutSeedReader

Bases: FileSystemSeedReader[AgentRolloutSeedSource]

Registry and Errors

SeedReaderRegistry

Source code in packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py
def __init__(self, readers: Sequence[SeedReader]):
    self._readers: dict[str, SeedReader] = {}
    for reader in readers:
        self.add_reader(reader)

SeedReaderError

Bases: DataDesignerError