Dataset Version Processing Pipeline

This guide explains how to create, customize, test, and deploy a processing pipeline that runs on a dataset version, using the pxl-pipeline CLI with the dataset_version template.

These pipelines receive a dataset version as their target and can apply transformations, filters, or any custom logic to the data.


1. Initialize your pipeline

pxl-pipeline init my_dataset_pipeline --type processing --template dataset_version

This generates a pipeline folder with standard files. See project structure for details.
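
As a rough illustration (limited to the files this guide refers to; the template may generate more), the folder looks like:

my_dataset_pipeline/
├── pipeline.py          # entrypoint, runnable with --mode local
├── steps.py             # the process() step
├── pyproject.toml       # dependencies, managed with uv
├── run_config.toml      # target, inputs, and parameters for local runs
└── utils/
    ├── parameters.py    # custom ProcessingParameters
    └── inputs.py        # declared ProcessingInputs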

2. Customize your pipeline logic

steps.py

Contains the process() step where your core logic lives. The context gives you access to the target dataset version, parameters, and any inputs you've declared.

@step
def process():
    context: PicselliaDatasetProcessingContext = Pipeline.get_active_context()
    parameters = context.processing_parameters
    dataset_version = context.target

    # If you want to process only selected assets:
    asset_ids_to_process = context.asset_ids

    # Your logic goes here ...
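
As an illustration, here is a minimal sketch that narrows the run to the selected assets and downloads each one before transforming it. It assumes the Picsellia SDK methods dataset_version.list_assets() and asset.download(), so check the names against your SDK version:

@step
def process():
    context: PicselliaDatasetProcessingContext = Pipeline.get_active_context()
    dataset_version = context.target

    # List every asset in the target dataset version (Picsellia SDK call)
    assets = dataset_version.list_assets()

    # Honor a per-asset selection when the job was launched on a subset
    if context.asset_ids:
        selected = {str(asset_id) for asset_id in context.asset_ids}
        assets = [asset for asset in assets if str(asset.id) in selected]

    for asset in assets:
        asset.download("./downloads")  # fetch the image locally
        # ... apply your transformation to the downloaded file ...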

utils/parameters.py

Define custom parameters using a class that inherits from Parameters:

class ProcessingParameters(Parameters):
    def __init__(self, log_data):
        super().__init__(log_data)
        self.example_parameter = self.extract_parameter(
            ["example_parameter"], expected_type=str, default="default"
        )
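
extract_parameter also handles typed values. A sketch adding a numeric parameter alongside the default one (confidence_threshold is a hypothetical name used purely for illustration):

class ProcessingParameters(Parameters):
    def __init__(self, log_data):
        super().__init__(log_data)
        self.example_parameter = self.extract_parameter(
            ["example_parameter"], expected_type=str, default="default"
        )
        # Hypothetical numeric parameter, to illustrate typed extraction
        self.confidence_threshold = self.extract_parameter(
            ["confidence_threshold"], expected_type=float, default=0.5
        )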

See Working with pipeline parameters for more.

utils/inputs.py

Define the inputs your processing expects. Inputs are registered on the Picsellia platform when you deploy and are validated at launch time.

from picsellia.types.enums import ProcessingInputType
from picsellia_pipelines_cli.utils.inputs import PipelineInputs


class ProcessingInputs(PipelineInputs):
    def __init__(self):
        super().__init__()
        self.define_input(
            name="example_input",
            input_type=ProcessingInputType.TEXT,
            required=True,
        )
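
Inputs do not have to be required. A sketch declaring a second, optional TEXT input (extra_input is a hypothetical name):

class ProcessingInputs(PipelineInputs):
    def __init__(self):
        super().__init__()
        self.define_input(
            name="example_input",
            input_type=ProcessingInputType.TEXT,
            required=True,
        )
        # Hypothetical optional input: launch proceeds even if it is empty
        self.define_input(
            name="extra_input",
            input_type=ProcessingInputType.TEXT,
            required=False,
        )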

See Working with pipeline inputs for the full guide.

3. Configure run_config.toml for local testing

When you run pxl-pipeline init, a run_config.toml is generated. It contains the target, inputs, and parameters needed to run locally:

override_outputs = true
target_id = ""

[job]
type = "DATASET_VERSION_CREATION"

[inputs]
example_input = "example_value"

[parameters]
example_parameter = "default"

Fill in the target_id with the UUID of the dataset version you want to process.
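
For example (the UUID below is a placeholder, substitute your own):

target_id = "00000000-0000-0000-0000-000000000000"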

4. Manage dependencies with uv

uv add opencv-python --project my_dataset_pipeline

Dependencies are declared in pyproject.toml. See dependency management with uv.
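
After the command above, uv records the dependency in the [project] table of pyproject.toml, roughly like this (the version pin is illustrative):

[project]
name = "my_dataset_pipeline"
dependencies = [
    "opencv-python>=4.10",
]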

5. Test your pipeline locally

pxl-pipeline test my_dataset_pipeline

This will:

  • Prompt for the target dataset version if not set in the run config
  • Scaffold any missing inputs with empty defaults
  • Run the pipeline via pipeline.py --mode local
  • Save everything under runs/runX/

To reuse the same folder and avoid re-downloading assets:

pxl-pipeline test my_dataset_pipeline --reuse-dir

See how the runs/ folder works for more details.

6. Deploy to Picsellia

pxl-pipeline deploy my_dataset_pipeline

This will:

  • Build and push the Docker image
  • Register the pipeline in Picsellia
  • Sync the declared inputs to the platform (add new inputs, update existing ones, remove stale ones)

See deployment lifecycle.