Transform Function

Parameters

When you define a transform function, DataSpace automatically populates each of its parameters with the corresponding dataframe, passed as a polars.LazyFrame.

# input is of type LazyFrame
def transform(input):
    ...

Return Value

A transform function must return a value. The return type can be one of the following:

  • polars.LazyFrame

  • polars.DataFrame

  • pandas.DataFrame


The recommended return type is polars.LazyFrame.

A LazyFrame can be optimized by the query planner and can use the streaming engine to perform out-of-core computations. This typically makes execution significantly faster than the eager mode of a polars.DataFrame, and it can be orders of magnitude faster than pandas.DataFrame.

Additionally, DataSpace uses the query plan of the LazyFrame to deduce column lineage, something that’s only possible when returning a LazyFrame.

Metadata

Input parameters also expose a special attribute called ds_meta. This attribute contains metadata about the DataSnapshot being used. The following fields are available through this attribute:

| Attribute        | Type | Description                                       |
|------------------|------|---------------------------------------------------|
| transform_id     | str  | The Transform ID of the dataframe                 |
| artifact_dir     | str  | The directory path where the artifacts are stored |
| data_snapshot_id | str  | The DataSnapshot ID of the dataframe              |
| build_id         | str  | The Build ID of the dataframe                     |
| row_count        | int  | Number of rows                                    |
| column_count     | int  | Number of columns                                 |
| file_size        | int  | The size of the parquet file                      |
| creation_date    | str  | The date when the dataset was created             |
| columns          | list | The columns of the dataframe                      |

This is useful when pulling artifacts from upstream transforms: take the input parameter that corresponds to the upstream transform and read artifact_dir from its ds_meta.
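The lookup described above might be sketched as follows, assuming that ds_meta exposes its fields through plain attribute access (for example upstream.ds_meta.artifact_dir). The parameter name upstream, the helper upstream_artifact, and the file name model.json are hypothetical. A SimpleNamespace stands in for the real input so the snippet runs outside DataSpace.

```python
from pathlib import Path
from types import SimpleNamespace


def upstream_artifact(frame, filename: str) -> Path:
    """Build the path of an artifact stored by the transform that produced `frame`."""
    # Assumption: ds_meta fields are read with attribute access.
    return Path(frame.ds_meta.artifact_dir) / filename


# Stand-in for an input parameter; in a real build DataSpace supplies a
# LazyFrame that carries ds_meta.
upstream = SimpleNamespace(
    ds_meta=SimpleNamespace(artifact_dir="/tmp/artifacts/upstream", row_count=3)
)

model_path = upstream_artifact(upstream, "model.json")
```

Inside a transform, the same helper would be called with the input parameter itself, and the resulting path opened with whatever reader suits the artifact format.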

Environment Variables

DataSpace injects the following environment variables to tell the runner where to store the files produced by a build.

| Name            | Description                                                                                          |
|-----------------|------------------------------------------------------------------------------------------------------|
| TRANSFORM_ID    | The transformId of the current transform                                                             |
| ARTIFACT_FOLDER | The artifact folder of the current build. Should be used to persist artifacts after build            |
| META_FOLDER     | The meta folder of the current build. Will be populated with metadata about the dataset if generated |
| DATASET_FOLDER  | The dataset folder of the current build. Will be populated with the generated parquet file           |
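One way a transform could persist an artifact is to resolve ARTIFACT_FOLDER from the environment and write into it. The sketch below assumes the variable holds a writable directory path; the helper save_artifact and the file name stats.json are hypothetical. Outside DataSpace the variable is not set, so the example points it at a temporary directory.

```python
import json
import os
import tempfile
from pathlib import Path


def save_artifact(name: str, payload: dict) -> Path:
    """Write `payload` as JSON into the folder named by ARTIFACT_FOLDER."""
    folder = Path(os.environ["ARTIFACT_FOLDER"])
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / name
    target.write_text(json.dumps(payload))
    return target


# Only for running this sketch locally: DataSpace injects the real value
# during a build, and setdefault leaves an existing value untouched.
os.environ.setdefault("ARTIFACT_FOLDER", tempfile.mkdtemp())

written = save_artifact("stats.json", {"rows": 3})
```

Reading the variable with os.environ[...] rather than os.environ.get(...) makes a missing variable fail loudly instead of silently writing to an unintended location.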
