Transform Function
Parameters
When you define a transform function, each of its parameters is automatically populated with the corresponding dataframe, passed as a polars.LazyFrame.
# input is of type polars.LazyFrame
def transform(input):
    ...
Return Value
A transform function must return a value. The return type can be one of the following:
polars.LazyFrame
polars.DataFrame
pandas.DataFrame
Metadata
Input parameters also expose a special attribute called ds_meta. This attribute contains metadata about the DataSnapshot being used. The following fields are available through this attribute:
transform_id (str): The Transform ID of the dataframe
artifact_dir (str): The directory path where the artifacts are stored
data_snapshot_id (str): The DataSnapshot ID of the dataframe
build_id (str): The Build ID of the dataframe
row_count (int): Number of rows
column_count (int): Number of columns
file_size (int): The size of the parquet file
creation_date (str): The date when the dataset was created
columns (list): The columns of the dataframe
The artifact_dir field is useful when pulling artifacts from upstream transforms. In this case, declare the upstream transform as an input parameter and read its ds_meta.artifact_dir.
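A minimal sketch of reading an upstream transform's artifacts through ds_meta.artifact_dir is shown here. The helper name list_upstream_artifacts and the parameter name upstream are hypothetical; the only behavior taken from this page is that an input parameter exposes ds_meta.artifact_dir as a str path.

```python
from pathlib import Path


def list_upstream_artifacts(upstream):
    """Return the sorted file names in an upstream transform's artifact directory."""
    # ds_meta.artifact_dir is documented as a str, so wrap it in Path.
    artifact_dir = Path(upstream.ds_meta.artifact_dir)
    if not artifact_dir.is_dir():
        # Assumption: an upstream build that stored no artifacts
        # may not have a directory at all.
        return []
    return sorted(p.name for p in artifact_dir.iterdir() if p.is_file())
```

Inside a transform function, this could be called as list_upstream_artifacts(upstream), where upstream is the parameter populated with the upstream transform's dataframe.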
Environment Variables
DataSpace injects the following system environment variables to tell the runner where to store the files produced by a build.
TRANSFORM_ID: The transformId of the current transform.
ARTIFACT_FOLDER: The artifact folder of the current build. Use it to persist artifacts after the build.
META_FOLDER: The meta folder of the current build. It is populated with metadata about the dataset, if any is generated.
DATASET_FOLDER: The dataset folder of the current build. It is populated with the generated parquet file.
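The following sketch shows one way a transform might persist an artifact using ARTIFACT_FOLDER. The helper name save_artifact, the JSON format, and the fallback to a local artifacts directory when the variable is unset are assumptions made for illustration; only the variable name and its purpose come from the list above.

```python
import json
import os
from pathlib import Path


def save_artifact(name, payload):
    """Write payload as JSON into the current build's artifact folder."""
    # ARTIFACT_FOLDER is injected by DataSpace during a build.
    # Assumption: fall back to ./artifacts when running outside DataSpace.
    folder = Path(os.environ.get("ARTIFACT_FOLDER", "artifacts"))
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / name
    target.write_text(json.dumps(payload))
    return target
```

A downstream transform could then locate the same file through the ds_meta.artifact_dir attribute of the corresponding input parameter.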