# Transform Function

### Parameters

When you define a transform function, DataSpace automatically populates each of its parameters with the appropriate input dataframe, passed as a `polars.LazyFrame`.

```python
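import polars as pl

# DataSpace passes `input` as a polars.LazyFrame
def transform(input: pl.LazyFrame) -> pl.LazyFrame:
    return input  # pass-through placeholder; a transform must return a value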
```

### Return Value

A transform function must return a value. The return type can be one of the following:

* `polars.LazyFrame`
* `polars.DataFrame`
* `pandas.DataFrame`

{% hint style="info" %}
The recommended return type is `polars.LazyFrame`.

A `LazyFrame` can be optimized by the query planner and can use the streaming engine to perform out-of-core computations. This is typically much faster than the eager execution of a `polars.DataFrame`, and can be orders of magnitude faster than a `pandas.DataFrame`.

Additionally, DataSpace uses the query plan of the `LazyFrame` to deduce column lineage, something that’s only possible when returning a `LazyFrame`.
{% endhint %}

### Metadata

Input parameters also expose a special attribute called `ds_meta`. This attribute contains metadata about the `DataSnapshot` being used. The following fields are available through this attribute:

<table><thead><tr><th width="195.82421875">Attribute</th><th width="94.84375">Type</th><th>Description</th></tr></thead><tbody><tr><td>transform_id</td><td>str</td><td>The Transform ID of the dataframe</td></tr><tr><td>artifact_dir</td><td>str</td><td>The directory path where the artifacts are stored</td></tr><tr><td>data_snapshot_id</td><td>str</td><td>The DataSnapshot ID of the dataframe</td></tr><tr><td>build_id</td><td>str</td><td>The Build ID of the dataframe</td></tr><tr><td>row_count</td><td>int</td><td>Number of rows</td></tr><tr><td>column_count</td><td>int</td><td>Number of columns</td></tr><tr><td>file_size</td><td>int</td><td>The size of the parquet file</td></tr><tr><td>creation_date</td><td>str</td><td>The date when the dataset was created</td></tr><tr><td>columns</td><td>list</td><td>The columns of the dataframe</td></tr></tbody></table>

This is useful when pulling artifacts from upstream transforms. In this case, declare the upstream transform as a parameter and read its `artifact_dir` attribute:

```python
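def transform(excel_ingest):
    # Directory where the upstream `excel_ingest` transform stored its artifacts
    artifact_path = excel_ingest.ds_meta.artifact_dir
    return excel_ingest  # a transform must return a value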
```
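
A minimal sketch of how `artifact_dir` might be used to locate a specific file follows. The helper `find_artifact` and the file name `lookup.csv` are hypothetical, and a `SimpleNamespace` stands in for the input object so the example runs outside DataSpace; only the `ds_meta.artifact_dir` attribute comes from the table above.

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace


def find_artifact(upstream, filename: str) -> Path:
    """Return the path of `filename` inside the upstream artifact directory."""
    path = Path(upstream.ds_meta.artifact_dir) / filename
    if not path.is_file():
        raise FileNotFoundError(f"{filename} not found in {path.parent}")
    return path


# Local stand-in for an upstream input: only `ds_meta.artifact_dir` is emulated.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "lookup.csv").write_text("code,label\nA,Alpha\n")
    fake_input = SimpleNamespace(ds_meta=SimpleNamespace(artifact_dir=tmp))
    found = find_artifact(fake_input, "lookup.csv")
    content = found.read_text()
```

Inside a real transform, the same helper could be called with the upstream parameter itself, for example `find_artifact(excel_ingest, "lookup.csv")`.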

### Environment Variables

DataSpace injects the following environment variables to tell the runner where to read and store the files of the current and previous builds.

| Name                       | Description                                                                                          |
| -------------------------- | ---------------------------------------------------------------------------------------------------- |
| TRANSFORM\_ID              | The Transform ID of the current transform                                                            |
| ARTIFACT\_FOLDER           | The artifact folder of the current build. Should be used to persist artifacts after build            |
| META\_FOLDER               | The meta folder of the current build. Will be populated with metadata about the dataset if generated |
| DATASET\_FOLDER            | The dataset folder of the current build. Will be populated with the generated parquet file           |
| PREVIOUS\_BUILD\_FOLDER    | The current transform's previous build folder                                                        |
| PREVIOUS\_DATASET\_PATH    | The file path of the dataset generated by the current transform's previous build                    |
| PREVIOUS\_ARTIFACT\_FOLDER | The artifact folder of the current transform's previous build                                        |
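
The variables above can be read with the standard library, as in the sketch below. The helper names `save_artifact` and `previous_dataset_path` and the file name `summary.txt` are hypothetical. The source does not say whether the `PREVIOUS_*` variables are unset or empty on a first build, so the code treats both cases as "no previous build"; that handling is an assumption.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_artifact(filename: str, text: str) -> Path:
    """Write a text file into the current build's artifact folder."""
    folder = Path(os.environ["ARTIFACT_FOLDER"])
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / filename
    target.write_text(text)
    return target


def previous_dataset_path() -> Optional[Path]:
    """Return the previous build's dataset path, or None if unset or empty."""
    value = os.environ.get("PREVIOUS_DATASET_PATH")
    return Path(value) if value else None


# Outside DataSpace the variable is not injected, so point it at a temp folder.
if "ARTIFACT_FOLDER" not in os.environ:
    os.environ["ARTIFACT_FOLDER"] = tempfile.mkdtemp()

written = save_artifact("summary.txt", "rows=3")
```

Reading `ARTIFACT_FOLDER` with `os.environ[...]` fails loudly if the variable is missing, while the optional `PREVIOUS_DATASET_PATH` is read with `.get()` so that the absence of a previous build does not raise.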
