# Transform Function

### Parameters

When you define a transform function, DataSpace automatically populates each of its parameters with the corresponding dataframe, passed as a `polars.LazyFrame`.

```python
# input is of type LazyFrame
def transform(input):
    ...
```

### Return Value

A transform function must return a value. The return type can be one of the following:

* `polars.LazyFrame`
* `polars.DataFrame`
* `pandas.DataFrame`

{% hint style="info" %}
The recommended return type is `polars.LazyFrame`.

A `LazyFrame` can be optimized by the query planner and can use the streaming engine for out-of-core computation. This typically makes execution significantly faster than the eager mode of a `polars.DataFrame`, and often orders of magnitude faster than a `pandas.DataFrame`.

Additionally, DataSpace uses the query plan of the `LazyFrame` to deduce column lineage, something that’s only possible when returning a `LazyFrame`.
{% endhint %}

### Metadata

Input parameters also expose a special attribute called `ds_meta`. This attribute contains metadata about the `DataSnapshot` being used. The following fields are available through this attribute:

<table><thead><tr><th width="195.82421875">Attribute</th><th width="94.84375">Type</th><th>Description</th></tr></thead><tbody><tr><td>transform_id</td><td>str</td><td>The Transform ID of the dataframe</td></tr><tr><td>artifact_dir</td><td>str</td><td>The directory path where the artifacts are stored</td></tr><tr><td>data_snapshot_id</td><td>str</td><td>The DataSnapshot ID of the dataframe</td></tr><tr><td>build_id</td><td>str</td><td>The Build ID of the dataframe</td></tr><tr><td>row_count</td><td>int</td><td>Number of rows</td></tr><tr><td>column_count</td><td>int</td><td>Number of columns</td></tr><tr><td>file_size</td><td>int</td><td>The size of the parquet file</td></tr><tr><td>creation_date</td><td>str</td><td>The date when the dataset was created</td></tr><tr><td>columns</td><td>list</td><td>The columns of the dataframe</td></tr></tbody></table>

This is useful when pulling artifacts from upstream transforms: declare the upstream transform as a parameter and read `artifact_dir` from its `ds_meta` attribute:

```python
def transform(excel_ingest):
    # Directory where the upstream transform stored its artifacts
    artifact_path = excel_ingest.ds_meta.artifact_dir
    return excel_ingest
```
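Building on this, the following sketch lists the files found in an upstream artifact directory. The helper `list_upstream_artifacts` and the `*.xlsx` pattern are hypothetical; only `ds_meta.artifact_dir` comes from the table above.

```python
from pathlib import Path


def list_upstream_artifacts(upstream, pattern="*"):
    """Return the sorted files in the upstream transform's artifact directory."""
    # artifact_dir is documented as a str, so wrap it in a Path for globbing.
    artifact_dir = Path(upstream.ds_meta.artifact_dir)
    return sorted(p for p in artifact_dir.glob(pattern) if p.is_file())


def transform(excel_ingest):
    # Hypothetical use: find workbooks the upstream transform saved as artifacts.
    workbooks = list_upstream_artifacts(excel_ingest, "*.xlsx")
    print(f"Found {len(workbooks)} upstream workbook(s)")
    return excel_ingest
```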

### Environment Variables

DataSpace injects the following environment variables to tell the runner where files for the current build, and for the previous build of the same transform, are stored.

| Name                       | Description                                                                                          |
| -------------------------- | ---------------------------------------------------------------------------------------------------- |
| TRANSFORM\_ID              | The Transform ID of the current transform                                                            |
| ARTIFACT\_FOLDER           | The artifact folder of the current build. Use it to persist artifacts after the build                |
| META\_FOLDER               | The meta folder of the current build. Populated with metadata about the dataset, if generated        |
| DATASET\_FOLDER            | The dataset folder of the current build. Populated with the generated Parquet file                   |
| PREVIOUS\_BUILD\_FOLDER    | The previous build folder of the current transform                                                   |
| PREVIOUS\_DATASET\_PATH    | The file path of the dataset generated by the current transform's previous build                     |
| PREVIOUS\_ARTIFACT\_FOLDER | The artifact folder generated by the current transform's previous build                              |
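
A minimal sketch of reading these variables from inside a transform. The helper names `save_artifact` and `previous_dataset_path` are hypothetical. The fallback to the current directory, and the handling of an unset `PREVIOUS_DATASET_PATH`, are assumptions made for local runs and are not documented DataSpace behavior.

```python
import json
import os
from pathlib import Path


def save_artifact(name, payload):
    """Write a JSON artifact into the current build's artifact folder."""
    # ARTIFACT_FOLDER is injected by DataSpace; "." is an assumed local fallback.
    folder = Path(os.environ.get("ARTIFACT_FOLDER", "."))
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / name
    target.write_text(json.dumps(payload))
    return target


def previous_dataset_path():
    """Return the previous build's dataset path, or None if it is not set."""
    # Assumption: the variable may be missing or empty, e.g. on a first build.
    value = os.environ.get("PREVIOUS_DATASET_PATH")
    return Path(value) if value else None
```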


