Day4 - ML Pipelines#
Pipeline#
It is a declarative description of a workflow or process, typically in code or configuration, that specifies:
- What tasks should be run
- How they are connected or ordered
- What inputs and outputs are involved
- Any conditions, retries, or resources needed
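For illustration only (not tied to any particular orchestrator's API), such a declarative description can be sketched as plain Python data, plus a small function that derives the execution order from the dependencies; the task names and file names here are made-up examples:

```python
# A hypothetical pipeline description: tasks, their inputs/outputs,
# retry settings, and how they are connected. Real orchestrators
# (Dagster, Airflow, Prefect, ...) offer richer versions of this idea.
pipeline = {
    "tasks": {
        "extract": {"outputs": ["raw.csv"], "retries": 3},
        "transform": {"inputs": ["raw.csv"], "outputs": ["clean.csv"], "retries": 1},
        "train": {"inputs": ["clean.csv"], "outputs": ["model.bin"], "retries": 0},
    },
    # Edges encode "how tasks are connected or ordered"
    "dependencies": {"transform": ["extract"], "train": ["transform"]},
}


def run_order(spec):
    """Topologically sort tasks so each runs after its dependencies."""
    deps = {t: set(spec["dependencies"].get(t, ())) for t in spec["tasks"]}
    order = []
    while deps:
        ready = [t for t, d in deps.items() if not d]
        if not ready:
            raise ValueError("cycle in pipeline")
        for t in sorted(ready):
            order.append(t)
            del deps[t]
        for d in deps.values():
            d.difference_update(ready)
    return order
```

Running `run_order(pipeline)` yields `["extract", "transform", "train"]` — the orchestrator's job is then to execute that order while honoring retries and resources.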
Orchestrator#
An orchestrator is a system or tool that manages and coordinates the execution of complex workflows or tasks, especially when those tasks involve multiple steps, dependencies, and resources.
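To make that concrete, here is a toy orchestrator (purely illustrative; real tools add scheduling, persistence, observability, and distributed execution) that runs callables in dependency order and retries failures:

```python
import time


def orchestrate(tasks, dependencies, max_retries=2, delay=0.0):
    """Run `tasks` (name -> callable) respecting `dependencies`
    (name -> list of prerequisite names), retrying failed tasks."""
    done, results = set(), {}
    while len(done) < len(tasks):
        progressed = False
        for name, fn in tasks.items():
            if name in done or any(d not in done for d in dependencies.get(name, [])):
                continue  # not ready yet
            for attempt in range(max_retries + 1):
                try:
                    # Feed each task the outputs of its prerequisites
                    results[name] = fn(*[results[d] for d in dependencies.get(name, [])])
                    break
                except Exception:
                    if attempt == max_retries:
                        raise
                    time.sleep(delay)
            done.add(name)
            progressed = True
        if not progressed:
            raise ValueError("unsatisfiable dependencies")
    return results


# Example: a two-step flow
results = orchestrate(
    {"get_numbers": lambda: [1, 2, 3], "multiply": lambda xs: [x * 10 for x in xs]},
    {"multiply": ["get_numbers"]},
)
```

Everything an orchestrator like Dagster or Airflow does — queuing, retries, passing outputs between steps — is an industrial-strength version of this loop.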
Comparison Between Popular ML Pipeline Orchestrators#
| Feature | Dagster | Airflow | Kubeflow | Prefect | Metaflow |
|---|---|---|---|---|---|
| Primary Language | Python | Python | Python | Python | Python |
| Designed For | Data & ML pipelines | General workflow orchestration | ML pipelines on Kubernetes | General workflow orchestration | ML pipelines, prototyping to prod |
| ML Native Features | ✅ Strong ML support (IO management, type systems) | ❌ Limited ML support | ✅ Tight integration with TF, K8s | ⚠️ Minimal built-in ML features | ✅ ML-focused abstractions |
| Kubernetes Native | ✅ (via Dagster K8s executor) | ✅ (with Helm, K8sExecutor) | ✅ Fully K8s-native | ✅ Optional | ✅ Optional |
| Local Dev Experience | ✅ Very good (CLI & UI) | ⚠️ Okay but clunky | ❌ Heavyweight (needs K8s) | ✅ Excellent (easy local → cloud) | ✅ Excellent (local-first) |
| UI / Observability | ✅ Excellent UI & asset tracking | ✅ Basic but mature | ✅ Full UI | ✅ Good (flow run UI) | ✅ Great (incl. lineage, retry) |
| Type Safety / IO Mgmt | ✅ Strong typing & asset materialization | ❌ Minimal support | ⚠️ Basic through component specs | ⚠️ Basic typing | ✅ Simple but effective |
| Data Lineage | ✅ First-class asset lineage | ⚠️ Custom plugins needed | ✅ via ML Metadata store | ⚠️ Partial | ✅ Built-in |
| Execution Flexibility | ✅ Local, multiprocess, K8s, etc. | ✅ Executors: Celery, K8s, etc. | ❌ K8s only | ✅ Cloud or local agents | ✅ AWS, K8s, local |
| Extensibility | ✅ Modular, Pythonic design | ✅ Strong DAG customization | ⚠️ Custom container components | ✅ Flows as Python code | ✅ Highly extensible Python code |
| Community / Maturity | ⭐⭐ Growing fast | ⭐⭐⭐ Very mature (but older) | ⭐⭐ Kubernetes/Google centric | ⭐⭐ Fast-growing | ⭐⭐ Strong in enterprise/data science |
| Best Use Case | ML + data pipelines, asset-driven | ETL, batch jobs | Large-scale ML on K8s | Lightweight orchestration | ML workflows from notebook to prod |
🟦 Dagster
Pros: Modern, type-safe, asset-centric, good dev UX, great observability.
Cons: Learning curve around asset concepts.
Best for: ML teams looking for data lineage and strong development ergonomics.
🟩 Airflow
Pros: Battle-tested, huge ecosystem, flexible.
Cons: DSL is clunky for ML; weak typing; hard to trace ML artifacts.
Best for: Traditional ETL and teams with existing Airflow setups.
🟥 Kubeflow
Pros: Cloud-native, scalable, integrates well with K8s and TensorFlow ecosystem.
Cons: Complex setup, Kubernetes-only, poor local dev UX.
Best for: Teams deploying ML at scale on Kubernetes.
🟨 Prefect
Pros: Simple, Pythonic, cloud-native agents, strong developer experience.
Cons: Not ML-specific; less focus on artifact tracking.
Best for: Lightweight workflows, dataops, hybrid cloud/local orchestration.
🟪 Metaflow
Pros: Very ML-friendly, notebook integration, supports branching, versioning, retries.
Cons: Less customizable for general workflows.
Best for: ML teams needing reproducibility from notebook → prod.
Dagster#

An open-source data orchestrator for ML, analytics, and ETL (Extract, Transform, Load). It enables you to build, run, and monitor complex data pipelines. Dagster offers:
- Declarative pipeline definitions (data dependencies and configuration).
- Type-safe operations.
- Native support for assets, schedules, and sensors.
- Integration with popular data tools (e.g., dbt, Spark, MLflow).
Core Concepts#
| Concept | Description |
|---|---|
| Op | A function that performs a unit of work |
| Job | A directed graph of ops |
| Asset | A first-class, versioned data product |
| Graph | A reusable composition of ops |
| Resource | External dependency like S3, DB, API |
| Schedule / Sensor | Triggers jobs by time/event |
Getting Started#
Install Dagster:

```shell
pip install dagster dagit
```

Initialize a new Dagster project:

```shell
dagster project scaffold --name dagster_tutorial
cd dagster_tutorial
```

Run the Dagster development server:

```shell
dagster dev
```

Open the Dagit UI in your browser at http://localhost:3000.

Create a new file `ops.py` in the subdirectory `dagster_tutorial` and add the following code:

```python
from dagster import op


@op
def get_numbers():
    return [1, 2, 3]


@op
def multiply(numbers):
    return [x * 10 for x in numbers]
```

Create a new file `jobs.py` in the subdirectory `dagster_tutorial` and add the following code:

```python
from dagster import job

from .ops import get_numbers, multiply


@job
def process_job():
    multiply(get_numbers())
```

In the `definitions.py` file, import the job and add it to the `Definitions` object:

```python
from dagster import Definitions, load_assets_from_modules

from dagster_tutorial import assets  # noqa: TID252
from dagster_tutorial.jobs import process_job

all_assets = load_assets_from_modules([assets])

defs = Definitions(
    assets=all_assets,
    jobs=[process_job],
)
```

Modify the `multiply` function in `ops.py` to accept runtime config:

```python
from typing import List

from dagster import Config, op


class MultiplyConfig(Config):
    factor: int


@op
def multiply(config: MultiplyConfig, numbers: List[int]):
    return [x * config.factor for x in numbers]
```

In the launchpad, you can now run the `process_job` with a configuration:

```yaml
ops:
  multiply:
    config:
      factor: 10
```

You can enable logging in your Dagster project by adding this anywhere you want to log:

```python
from dagster import get_dagster_logger

logger = get_dagster_logger()
logger.info("This is an info message")
```

You can also use assets to define, persist, and version your data products. For example, create a new file `assets.py` in the subdirectory `dagster_tutorial` and add the following code:

```python
from dagster import asset


@asset
def raw_data():
    return [1, 2, 3, 4]


@asset
def squared_data(raw_data):
    return [x**2 for x in raw_data]
```

You can add a schedule to run a job at a specific time. To do that, add the following code to the `definitions.py` file:

```python
from dagster import ScheduleDefinition

hourly_schedule = ScheduleDefinition(
    job=process_job,
    cron_schedule="0 * * * *",  # every hour
)

defs = Definitions(
    assets=all_assets,
    jobs=[process_job],
    schedules=[hourly_schedule],
)
```
MLflow#

Open-source platform for managing the ML lifecycle, including experimentation, reproducibility, and deployment.
It provides a central repository for tracking experiments, packaging code into reproducible runs, and sharing and deploying models.
It has four main components:
- Tracking: Log and query experiments.
- Projects: Package code in a reusable and reproducible way.
- Models: Manage and deploy models from various ML libraries.
- Registry: Store and manage models in a central repository.
Dagster can be used to orchestrate ML workflows and integrate with MLflow for tracking experiments and managing models.
You can use Dagster to define and run ML pipelines, and use MLflow to log and track experiments, models, and artifacts.
Any of Dagster's building blocks — `@op`, `@job`, `@asset`, `@schedule`, and `@sensor` — can wrap MLflow calls: inside them, use MLflow's Python API to log and track experiments, models, and artifacts.
To test MLflow access, you can run the following Python code:

```python
import os

from dotenv import load_dotenv
import mlflow

load_dotenv("../../.env")

MLFLOW_SERVER_URL = os.getenv("MLFLOW_SERVER_URL", "http://localhost:5000")
MLFLOW_TRACKING_USERNAME = os.getenv("MLFLOW_TRACKING_USERNAME")
MLFLOW_TRACKING_PASSWORD = os.getenv("MLFLOW_TRACKING_PASSWORD")
MY_PREFIX = "mohanad-experiment"

# Only set the credentials if they are present in the environment
if MLFLOW_TRACKING_USERNAME and MLFLOW_TRACKING_PASSWORD:
    os.environ["MLFLOW_TRACKING_USERNAME"] = MLFLOW_TRACKING_USERNAME
    os.environ["MLFLOW_TRACKING_PASSWORD"] = MLFLOW_TRACKING_PASSWORD

mlflow.set_tracking_uri(MLFLOW_SERVER_URL)
mlflow.set_experiment(f"/{MY_PREFIX}/classification")

with mlflow.start_run():
    mlflow.log_metric("metric1", 1.0)
```
Dagster + Xarray + Dask to Train ERA5 Forecasting Model#
You can run the advanced Dagster project era5_forecast, which integrates xarray and Dask with Dagster.
Instructions#
1- Create the Dagster project:

```shell
dagster project scaffold -n era5_forecast
cd era5_forecast
```

2- Assuming the dependencies are installed, run the Dagster server:

```shell
dagster dev
```

3- Open the Dagit UI in your browser at http://localhost:3000.
4- On the Jobs tab, select the `training_pipeline` job and click the Launchpad button to run it.
5- Open the Dask dashboard in your browser at http://localhost:8787/status.
ops.py

```python
import os

import dask.array
import numpy as np
import xarray as xr
from dagster import Out, get_dagster_logger, op
from xgboost.dask import DaskDMatrix, train

from .resources import DaskResource

logger = get_dagster_logger()

# ----------- CONFIG -----------
ZARR_URL = "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3"
LAGS = list(range(1, 7))  # 6 days of autoregressive lag features


def create_lagged_features(da, lags):
    lagged = [da.shift(time=lag).rename(f"lag_{lag}") for lag in lags]
    return xr.merge(lagged + [da.rename("target")])


# ----------- OPS -----------
@op
def load_and_preprocess_data():
    ds = xr.open_zarr(
        ZARR_URL,
        chunks={"time": 1},  # type: ignore
        storage_options={"token": "anon"},
    )
    temp = ds["2m_temperature"].sel(
        time=slice("2018-01-01", "2018-01-31"),
        # latitude=slice(40, 50),
        # longitude=slice(0, 10),
    )
    daily = temp.resample(time="1D").mean()
    daily = daily.persist()
    return daily


@op(out={"features": Out(), "labels": Out()})
def generate_training_data(daily):
    ds_lagged = create_lagged_features(daily, lags=LAGS)
    X = xr.concat([ds_lagged[f"lag_{lag}"] for lag in LAGS], dim="feature")
    X = (
        X.stack(sample=("time", "latitude", "longitude"))
        .transpose("sample", "feature")
        .data
    )
    y = ds_lagged["target"].chunk({"time": -1, "latitude": 10, "longitude": 10})
    y = y.stack(sample=("time", "latitude", "longitude")).data
    return X, y


@op
def train_model(my_dask_resource: DaskResource, X, y):
    client = my_dask_resource.make_dask_cluster()
    logger.info("Dask dashboard link %s", client.dashboard_link)

    # Subsample before rechunking to keep memory usage manageable
    sample_frac = 0.05
    logger.info(f"Sampling {sample_frac * 100:.0f}% of data before rechunking")

    n_samples = X.shape[0]
    sample_size = int(n_samples * sample_frac)
    random_indices = np.random.permutation(n_samples)[:sample_size]

    # Turn indices into a Dask array for indexing
    random_indices = dask.array.from_array(random_indices, chunks=(sample_size,))

    # Sample X and y using Dask indexing
    X = X[random_indices]
    y = y[random_indices]

    logger.info("X.shape: %s", X.shape)
    logger.info("y.shape: %s", y.shape)

    logger.info("Splitting sampled data into train/val sets using Dask...")
    val_frac = 0.2
    val_size = int(sample_size * val_frac)
    train_size = sample_size - val_size

    X_train = X[:train_size]
    y_train = y[:train_size]
    X_val = X[train_size:]
    y_val = y[train_size:]

    chunk_size = 32
    X_train = X_train.rechunk((chunk_size, -1))
    y_train = y_train.rechunk((chunk_size,))
    X_val = X_val.rechunk((chunk_size, -1))
    y_val = y_val.rechunk((chunk_size,))

    logger.info("Creating DaskDMatrix for training and validation...")
    dtrain = DaskDMatrix(client, X_train, y_train)
    dval = DaskDMatrix(client, X_val, y_val)

    logger.info("Starting training with validation...")
    output = train(
        client,
        {
            "verbosity": 2,
            "tree_method": "hist",
            "objective": "reg:squarederror",
        },
        dtrain,
        num_boost_round=100,
        evals=[(dtrain, "train"), (dval, "validation")],
        early_stopping_rounds=4,
    )
    booster = output["booster"]
    best_iteration = booster.best_iteration
    logger.info(f"Training stopped at iteration: {best_iteration}")
    os.makedirs("models", exist_ok=True)
    booster.save_model("models/xgb_model.json")
    logger.info("Model saved to models/xgb_model.json")

    client.close()
```
resources.py

```python
from dagster import ConfigurableResource
from dask.distributed import Client, LocalCluster


class DaskResource(ConfigurableResource):
    n_workers: int

    def make_dask_cluster(self) -> Client:
        client = Client(LocalCluster(n_workers=self.n_workers))
        return client
```
jobs.py

```python
from dagster import job

from .ops import generate_training_data, load_and_preprocess_data, train_model


@job
def training_pipeline():
    daily = load_and_preprocess_data()
    X, y = generate_training_data(daily)
    train_model(X, y)  # pylint:disable=E1120
```
definitions.py

```python
from dagster import Definitions, load_assets_from_modules

from era5_forecast import assets  # noqa: TID252
from era5_forecast.jobs import training_pipeline
from era5_forecast.resources import DaskResource  # noqa: TID252

all_assets = load_assets_from_modules([assets])

defs = Definitions(
    assets=all_assets,
    jobs=[training_pipeline],
    resources={"my_dask_resource": DaskResource(n_workers=4)},
)
```
Dagster + Stackstac + Dask + MLflow to Train Sentinel-2 Land Cover Classification Model#
You can run the advanced Dagster project esa_worldcover_classification, which integrates stackstac, xarray, and Dask with Dagster and MLflow.
Note
The ESA WorldCover dataset is a global land cover map produced by the European Space Agency, offering 10-meter resolution classification based on Sentinel-1 and Sentinel-2 satellite imagery. Released in 2021, it provides detailed and consistent information on land cover types such as forests, croplands, urban areas, and water bodies. Designed to support environmental monitoring, climate change studies, and sustainable land management, WorldCover is freely accessible and regularly updated, making it a valuable resource for researchers, policymakers, and Earth observation applications worldwide. The ESA WorldCover dataset includes 11 land cover classes, based on the UN FAO Land Cover Classification System (LCCS). Here’s the list of classes:
| Class ID | Land Cover Class |
|---|---|
| 10 | Tree cover |
| 20 | Shrubland |
| 30 | Grassland |
| 40 | Cropland |
| 50 | Built-up |
| 60 | Bare / sparse vegetation |
| 70 | Snow and ice |
| 80 | Permanent water bodies |
| 90 | Herbaceous wetland |
| 95 | Mangroves |
| 100 | Moss and lichen |
Instructions#
1- Run the MLflow server:

```shell
mlflow server \
  --backend-store-uri sqlite:///mlflow.db \
  --default-artifact-root ./mlruns \
  --host 0.0.0.0 \
  --port 8000
```

2- Create the Dagster project:

```shell
dagster project scaffold -n esa_worldcover_classification
cd esa_worldcover_classification
```

3- Assuming the dependencies are installed, run the Dagster server:

```shell
dagster dev
```

4- Open the Dagit UI in your browser at http://localhost:3000.
5- On the Jobs tab, select the esa_worldcover_classification job and click the Launchpad button to run it. Use the following configuration:
Note
Replace the MLflow experiment name with your own, e.g. /mohanad_s2_classification. Make sure the MLflow server URL, username, and password are set in the .env file.
```yaml
ops:
  fetch_s2_stack:
    config:
      bbox: [21.0, 38.0, 21.5, 38.5]
      time_range: "2020-01-01/2020-01-31"
  fetch_worldcover_stack:
    config:
      bbox: [21.0, 38.0, 21.5, 38.5]
      time_range: "2020-01-01/2020-01-31"
  save_to_zarr:
    config:
      zarr_cache_dir: "cache"
  train_unet:
    config:
      patch_size: 64
      stride: 32
      batch_size: 16
      model: "unet"
      epochs: 1
      learning_rate: 0.001
      loss: "cross_entropy"
      num_workers: 4
      in_channels: 5
      out_classes: 4
      mlflow_tracking_uri: null
      mlflow_experiment_name: "s2_classification"
      model_path: "models/unet_model.pth"
      zarr_cache_dir: "cache"
      device: "mps"
```
6- Open the MLflow UI in your browser at http://localhost:8000.
ops.py

```python
import gc
import os
from typing import Optional

from dagster import Config, Out, get_dagster_logger, op
from pystac_client import Client
import planetary_computer
import requests
import stackstac
import xarray as xr
from rasterio.enums import Resampling

from .utils import clean_attrs, clean_coords, map_labels, transform_bbox
from .configurations import (
    AWS_STAC_API,
    CLASS_MAPPING,
    PLANETARY_COMPUTER_STAC_API,
    PLANETARY_COMPUTER_TOKEN_URL,
)
from .train import train_model


logger = get_dagster_logger()


class StackConfig(Config):
    bbox: list[float]
    time_range: str


class ZarrConfig(Config):
    zarr_cache_dir: str


class TrainUNetConfig(Config):
    patch_size: int
    stride: int
    num_workers: int
    batch_size: int
    model: str
    epochs: int
    learning_rate: float
    loss: str
    in_channels: int
    out_classes: int
    mlflow_tracking_uri: Optional[str] = None
    mlflow_experiment_name: str
    model_path: str
    zarr_cache_dir: str
    device: str


@op(out={"pixels_output": Out(), "proj_output": Out(), "valid_mask": Out()})
def fetch_s2_stack(config: StackConfig) -> tuple[xr.DataArray, str, xr.DataArray]:
    catalog = Client.open(AWS_STAC_API)
    search = catalog.search(
        collections=["sentinel-2-l2a"],
        bbox=config.bbox,
        datetime=config.time_range,
        query={
            "eo:cloud_cover": {"lt": 10},
            "s2:nodata_pixel_percentage": {"lt": 10},
        },
        max_items=5000,
    )
    all_items = search.item_collection()
    items = []
    granules = []
    for item in all_items:
        if item.properties["s2:granule_id"] not in granules:
            items.append(item)
            granules.append(item.properties["s2:granule_id"])

    logger.info("Found %d Sentinel-2-L2A items", len(items))
    proj = items[0].properties["proj:code"]
    bbox_utm = transform_bbox(config.bbox, items[0].properties["proj:code"])
    assets = ["blue", "green", "red", "nir", "swir16", "scl"]
    ds = stackstac.stack(
        items,
        assets=assets,
        bounds=bbox_utm,  # pyright: ignore
        resolution=20,
        epsg=int(proj.split(":")[-1]),
        dtype="float64",  # pyright: ignore
        rescale=False,
        snap_bounds=True,
        resampling=Resampling.nearest,
        chunksize=(1, 1, 512, 512),
    )

    cloud_values = [0, 1, 2, 3, 8, 9, 10]  # cloud, shadows, cirrus, etc.
    scl_mask = ds.sel(band="scl")
    bands = ds.sel(band=["blue", "green", "red", "nir", "swir16"])
    valid_mask = ~scl_mask.isin(cloud_values)
    bands_masked = bands.where(valid_mask)
    median_ds = bands_masked.groupby("time.month").median("time", skipna=True)
    valid_mask_median = ~xr.ufuncs.isnan(median_ds.isel(band=0))
    median_ds = median_ds.fillna(0)
    logger.info("Median dataset shape: %s", median_ds.shape)
    return (median_ds, proj, valid_mask_median)


@op
def fetch_worldcover_stack(
    config: StackConfig, proj: str, valid_mask: xr.DataArray
) -> xr.DataArray:
    logger.info("Projection: %s", proj)
    catalog = Client.open(PLANETARY_COMPUTER_STAC_API)
    response = requests.get(
        f"{PLANETARY_COMPUTER_TOKEN_URL}/esa-worldcover", timeout=10
    )

    if response.status_code == 200:
        response = response.json()
        token = response["token"]
        _ = {"Authorization": f"Bearer {token}"}
    else:
        print(f"Failed to get token. Status code: {response.status_code}")
        exit()
    search = catalog.search(
        collections=["esa-worldcover"],
        bbox=config.bbox,
        query={
            "start_datetime": {"eq": "2020-01-01T00:00:00Z"},
            "end_datetime": {"eq": "2020-12-31T23:59:59Z"},
        },
    )
    all_items = search.item_collection()
    items = [planetary_computer.sign_item(item) for item in all_items]
    logger.info("Found %d ESA World Cover items", len(items))
    bbox_utm = transform_bbox(config.bbox, proj)
    ds = stackstac.stack(
        items,
        assets=["map"],
        bounds=bbox_utm,  # pyright: ignore
        resolution=20,
        epsg=int(proj.split(":")[-1]),
        dtype="float64",  # pyright: ignore
        rescale=False,
        snap_bounds=True,
        resampling=Resampling.nearest,
        chunksize=(1, 1, 512, 512),
    )
    ds = ds.sel(time=ds.notnull().any(dim=["x", "y", "band"]))
    ds = ds.where(valid_mask)
    ds = ds.fillna(0)
    logger.info("WorldCover dataset shape: %s", ds.shape)
    return ds


@op
def save_to_zarr(
    config: ZarrConfig, features: xr.DataArray, labels: xr.DataArray
) -> str:
    features = features.squeeze()
    labels = labels.squeeze()
    features = features.drop_vars(
        [dim for dim in ["time", "month"] if dim in features.dims]
    )
    labels = labels.drop_vars([dim for dim in ["time", "month"] if dim in labels.dims])
    features, labels = xr.align(features, labels, join="inner")
    labels = map_labels(labels, CLASS_MAPPING)
    img_zarr_path = os.path.join(config.zarr_cache_dir, "img.zarr")
    mask_zarr_path = os.path.join(config.zarr_cache_dir, "mask.zarr")
    if not os.path.exists(img_zarr_path) and not os.path.exists(mask_zarr_path):
        features = features.chunk({"x": 512, "y": 512})
        labels = labels.chunk({"x": 512, "y": 512})
        features.data = features.data.compute_chunk_sizes()
        labels.data = labels.data.compute_chunk_sizes()
        features = clean_attrs(features)
        features = clean_coords(features)
        features.to_zarr(img_zarr_path)

        labels = clean_attrs(labels)
        labels = clean_coords(labels)
        labels.to_zarr(mask_zarr_path)
    del features
    del labels
    gc.collect()
    return config.zarr_cache_dir


@op
def train_unet(config: TrainUNetConfig, zarr_dir: str):  # pylint:disable=W0613
    train_model(config)
```
model.py

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers.models.segformer import (
    SegformerForSemanticSegmentation,
    SegformerConfig,
)


class UNetBaseLine(nn.Module):
    def __init__(self, in_channels=4, out_classes=11, dropout=0.2):
        super(UNetBaseLine, self).__init__()

        def conv_block(in_ch, out_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
                nn.Conv2d(out_ch, out_ch, 3, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
                nn.Dropout2d(dropout),
            )

        self.enc1 = conv_block(in_channels, 64)
        self.enc2 = conv_block(64, 128)
        self.enc3 = conv_block(128, 256)

        self.bottleneck = conv_block(256, 512)

        self.pool = nn.MaxPool2d(2)

        self.dec2 = conv_block(512 + 128, 128)
        self.dec1 = conv_block(128 + 64, 64)

        self.final = nn.Conv2d(64, out_classes, kernel_size=1)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        e3 = self.enc3(self.pool(e2))
        b = self.bottleneck(self.pool(e3))

        d2 = self._upsample_concat(b, e2)
        d2 = self.dec2(d2)

        d1 = self._upsample_concat(d2, e1)
        d1 = self.dec1(d1)

        return self.final(d1)  # (B, 11, H, W)

    def _upsample_concat(self, x, skip):
        x = F.interpolate(x, size=skip.shape[2:], mode="bilinear", align_corners=False)
        return torch.cat([x, skip], dim=1)


class UNet(nn.Module):
    def __init__(self, in_channels=4, out_classes=11, dropout=0.1, init_weights=True):
        super(UNet, self).__init__()

        def conv_block(in_ch, out_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
                nn.Conv2d(out_ch, out_ch, 3, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
                nn.Dropout2d(dropout),
            )

        # Encoder
        self.enc1 = conv_block(in_channels, 64)
        self.enc2 = conv_block(64, 128)
        self.enc3 = conv_block(128, 256)

        self.pool = nn.MaxPool2d(2)

        # Bottleneck
        self.bottleneck = conv_block(256, 512)

        # Decoder
        self.dec3 = conv_block(512 + 256, 256)
        self.dec2 = conv_block(256 + 128, 128)
        self.dec1 = conv_block(128 + 64, 64)
        self.dec0 = conv_block(64, 32)

        # Final classifier
        self.final = nn.Conv2d(32, out_classes, kernel_size=1)

        if init_weights:
            self._init_weights()

    def _init_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
                if m.bias is not None:
                    nn.init.zeros_(m.bias)

    def forward(self, x):
        e1 = self.enc1(x)  # (B, 64, H, W)
        e2 = self.enc2(self.pool(e1))  # (B, 128, H/2, W/2)
        e3 = self.enc3(self.pool(e2))  # (B, 256, H/4, W/4)
        b = self.bottleneck(self.pool(e3))  # (B, 512, H/8, W/8)

        d3 = self._upsample_concat(b, e3)  # (B, 512+256, H/4, W/4)
        d3 = self.dec3(d3)

        d2 = self._upsample_concat(d3, e2)  # (B, 256+128, H/2, W/2)
        d2 = self.dec2(d2)

        d1 = self._upsample_concat(d2, e1)  # (B, 128+64, H, W)
        d1 = self.dec1(d1)

        d0 = self.dec0(d1)  # (B, 32, H, W)

        return self.final(d0)  # logits: (B, out_classes, H, W)

    def _upsample_concat(self, x, skip):
        x = F.interpolate(x, size=skip.shape[2:], mode="bilinear", align_corners=False)
        return torch.cat([x, skip], dim=1)


class SegformerB0FourBand(nn.Module):
    def __init__(self, in_channels=4, out_classes=11):
        super().__init__()

        # Step 1: Load pretrained model to extract 3-channel conv weights
        pretrained_model = SegformerForSemanticSegmentation.from_pretrained(
            "nvidia/segformer-b0-finetuned-ade-512-512"
        )
        pretrained_conv = pretrained_model.segformer.encoder.patch_embeddings[0].proj
        pretrained_weights = pretrained_conv.weight  # type: ignore # shape: [32, 3, 7, 7]

        # Step 2: Create modified config
        config = SegformerConfig.from_pretrained(
            "nvidia/segformer-b0-finetuned-ade-512-512"
        )
        config.num_channels = in_channels
        config.num_labels = out_classes

        # Step 3: Create your model
        self.model = SegformerForSemanticSegmentation(config)

        # Step 4: Replace first conv with 4-band version and copy weights
        old_conv = self.model.segformer.encoder.patch_embeddings[0].proj
        new_conv = nn.Conv2d(
            4,
            old_conv.out_channels,  # type: ignore
            kernel_size=old_conv.kernel_size,  # type: ignore
            stride=old_conv.stride,  # type: ignore
            padding=old_conv.padding,  # type: ignore
            bias=old_conv.bias is not None,  # type: ignore
        )
        with torch.no_grad():
            new_conv.weight[:, :3] = pretrained_weights  # type: ignore # use original RGB weights
            new_conv.weight[:, 3] = pretrained_weights[:, 0]  # type: ignore # init 4th like Red
            if old_conv.bias is not None:  # type: ignore
                new_conv.bias.copy_(pretrained_conv.bias)  # type: ignore
        self.model.segformer.encoder.patch_embeddings[0].proj = new_conv

        # Step 5: Load other pretrained weights (except the mismatched ones)
        state_dict = pretrained_model.state_dict()
        for key in [
            "segformer.encoder.patch_embeddings.0.proj.weight",
            "decode_head.classifier.weight",
            "decode_head.classifier.bias",
        ]:
            state_dict.pop(key, None)
        self.model.load_state_dict(state_dict, strict=False)

    def forward(self, x):
        logits = self.model(pixel_values=x).logits
        logits = F.interpolate(
            logits, size=x.shape[-2:], mode="bilinear", align_corners=False
        )
        logits = logits.contiguous()
        return logits
```
dataset.py

```python
import os

import torch
from torch.utils.data import Dataset
import xarray as xr

from .utils import compute_valid_patch_indices


class XarrayPatchDataset(Dataset):
    def __init__(self, patch_size: int, stride: int, zarr_cache_dir: str, indices=None):
        """
        Dataset for extracting patches from xarray images, using precomputed valid patch indices.

        Args:
            patch_size (int): Size of square patches.
            stride (int): Stride of sliding window.
            zarr_cache_dir (str): Directory to store the cached data in Zarr format.
            indices (list, optional): Precomputed patch indices; computed from the mask when None.
        """
        self.patch_size = patch_size
        self.stride = stride
        self.zarr_cache_dir = zarr_cache_dir
        self.img_zarr_path = os.path.join(zarr_cache_dir, "img.zarr")
        self.mask_zarr_path = os.path.join(zarr_cache_dir, "mask.zarr")
        self.img = xr.open_zarr(self.img_zarr_path)
        self.img = next(iter(self.img.data_vars.values()))
        self.mask = xr.open_zarr(self.mask_zarr_path)
        self.mask = next(iter(self.mask.data_vars.values()))
        self.mask.data.compute_chunk_sizes()
        if indices is None:
            indices = compute_valid_patch_indices(
                self.mask, patch_size, stride, threshold=0.1
            )
        self.valid_indices = indices

    def __len__(self):
        """Return the number of valid patches."""
        return len(self.valid_indices)

    def __getitem__(self, idx):
        """Get the patch at the given index."""
        # Get the top-left coordinates of the patch
        i, j = self.valid_indices[idx]

        # Use Dask arrays directly for patch extraction
        img_patch = self.img[:, i : i + self.patch_size, j : j + self.patch_size]
        mask_patch = self.mask[i : i + self.patch_size, j : j + self.patch_size]

        # Convert to Dask arrays (lazily loaded)
        img_patch = img_patch.data  # Dask array
        mask_patch = mask_patch.data  # Dask array

        # Compute and convert to PyTorch tensors
        x = torch.tensor(img_patch.compute(), dtype=torch.float32)
        y = torch.tensor(mask_patch.compute(), dtype=torch.long)

        # Scale reflectance values and clip to [0, 1]
        x = x / 10000.0
        x = x.clip(min=0.0, max=1.0)

        return x, y
```
train.py
import os
from tqdm import tqdm
import torch
from torch.utils.data import DataLoader
import mlflow
from dagster import get_dagster_logger
import xarray as xr
from sklearn.model_selection import train_test_split
from torchmetrics import JaccardIndex
from torchmetrics.segmentation import DiceScore, MeanIoU
from dotenv import load_dotenv
from .loss import FocalLoss
from .model import UNet, UNetBaseLine, SegformerB0FourBand
from .dataset import XarrayPatchDataset
from .utils import compute_valid_patch_indices

load_dotenv("../../.env")

MLFLOW_SERVER_URL = os.getenv("MLFLOW_SERVER_URL")
MLFLOW_TRACKING_USERNAME = os.getenv("MLFLOW_TRACKING_USERNAME")
MLFLOW_TRACKING_PASSWORD = os.getenv("MLFLOW_TRACKING_PASSWORD")
S3_ACCESS_KEY = os.getenv("S3_ACCESS_KEY")
S3_SECRET_ACCESS_KEY = os.getenv("S3_SECRET_ACCESS_KEY")
S3_END_POINT = os.getenv("S3_END_POINT")
logger = get_dagster_logger()


def train_model(config):
    os.makedirs(os.path.dirname(config.model_path), exist_ok=True)
    if not config.mlflow_tracking_uri:
        mlflow_tracking_uri = MLFLOW_SERVER_URL
    else:
        mlflow_tracking_uri = config.mlflow_tracking_uri
    if config.model == "unet":
        model = UNet(
            in_channels=config.in_channels,
            out_classes=config.out_classes,
            dropout=0.2,
        )
    elif config.model == "segformer":
        model = SegformerB0FourBand(
            in_channels=config.in_channels,
            out_classes=config.out_classes,
        )
    else:
        model = UNetBaseLine(
            in_channels=config.in_channels, out_classes=config.out_classes
        )
    mask = xr.open_zarr(os.path.join(config.zarr_cache_dir, "mask.zarr"))
    mask = next(iter(mask.data_vars.values()))
    full_indices = compute_valid_patch_indices(
        mask,
        config.patch_size,
        config.stride,
        threshold=0.1,
    )
    # Shuffle + split
    train_indices, val_indices = train_test_split(
        full_indices, test_size=0.2, random_state=42
    )
    train_dataset = XarrayPatchDataset(
        config.patch_size,
        config.stride,
        config.zarr_cache_dir,
        indices=train_indices,
    )
    valid_dataset = XarrayPatchDataset(
        config.patch_size,
        config.stride,
        config.zarr_cache_dir,
        indices=val_indices,
    )
    if config.loss == "focal":
        criterion = FocalLoss(alpha=None, gamma=2.0, ignore_index=0)
    else:
        criterion = torch.nn.CrossEntropyLoss(ignore_index=0)
    optimizer = torch.optim.Adam(model.parameters(), lr=config.learning_rate)
    jaccard = JaccardIndex(
        task="multiclass",
        num_classes=config.out_classes,
        average="weighted",
        zero_division=0.0,
        ignore_index=0,
    ).to(config.device)
    dice = DiceScore(
        num_classes=config.out_classes, input_format="index", zero_division=0.0
    ).to(config.device)
    iou = MeanIoU(num_classes=config.out_classes, input_format="index").to(
        config.device
    )
    model.to(config.device)
    model.train()
    train_loader = DataLoader(
        train_dataset,
        batch_size=config.batch_size,
        shuffle=True,
        num_workers=config.num_workers,
        prefetch_factor=2,
    )
    val_loader = DataLoader(
        valid_dataset,
        batch_size=config.batch_size,
        shuffle=False,
        num_workers=config.num_workers,
        prefetch_factor=2,
    )
    n_train = len(train_loader)
    n_val = len(val_loader)
    best_val_loss = float("inf")
    best_model_path = os.path.join(os.path.dirname(config.model_path), "best_model.pth")
    os.environ["MLFLOW_TRACKING_USERNAME"] = MLFLOW_TRACKING_USERNAME
    os.environ["MLFLOW_TRACKING_PASSWORD"] = MLFLOW_TRACKING_PASSWORD
    os.environ["AWS_ACCESS_KEY_ID"] = S3_ACCESS_KEY
    os.environ["AWS_SECRET_ACCESS_KEY"] = S3_SECRET_ACCESS_KEY
    os.environ["MLFLOW_S3_ENDPOINT_URL"] = S3_END_POINT
    mlflow.set_tracking_uri(mlflow_tracking_uri)
    mlflow.set_experiment(config.mlflow_experiment_name)
    with mlflow.start_run():
        mlflow.log_params(
            {
                "model": config.model,
                "num_epochs": config.epochs,
                "learning_rate": config.learning_rate,
                "loss": config.loss,
                "batch_size": config.batch_size,
                "patch_size": config.patch_size,
                "stride": config.stride,
                "num_workers": config.num_workers,
                "in_channels": config.in_channels,
                "out_classes": config.out_classes,
                "zarr_cache_dir": config.zarr_cache_dir,
                "model_path": config.model_path,
                "best_model_path": best_model_path,
            }
        )
        for epoch in range(config.epochs):
            progress_bar = tqdm(train_loader, desc=f"Epoch {epoch+1}", leave=False)
            total_train_loss = 0
            total_train_dice = 0
            total_train_jaccard = 0
            total_train_iou = 0
            total_val_loss = 0
            total_val_dice = 0
            total_val_jaccard = 0
            total_val_iou = 0
            model.train()
            for _, (X_batch, y_batch) in enumerate(progress_bar):
                X_batch = X_batch.to(config.device)
                y_batch = y_batch.to(config.device)
                out = model(X_batch)
                loss = criterion(out, y_batch)
                preds = torch.argmax(out, dim=1)
                dice_coef = dice(preds, y_batch)
                jaccard_index = jaccard(preds, y_batch)
                iou_index = iou(preds, y_batch)
                optimizer.zero_grad()
                # torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
                loss.backward()
                optimizer.step()
                total_train_loss += loss.item()
                total_train_dice += dice_coef.item()
                total_train_jaccard += jaccard_index.item()
                total_train_iou += iou_index.item()

            mlflow.log_metric("loss", total_train_loss / n_train, step=epoch)
            mlflow.log_metric("dice", total_train_dice / n_train, step=epoch)
            mlflow.log_metric("jaccard", total_train_jaccard / n_train, step=epoch)
            mlflow.log_metric("iou", total_train_iou / n_train, step=epoch)
            model.eval()
            with torch.no_grad():
                for X_batch, y_batch in val_loader:
                    X_batch = X_batch.to(config.device)
                    y_batch = y_batch.to(config.device)
                    out = model(X_batch)
                    loss = criterion(out, y_batch)
                    preds = torch.argmax(out, dim=1)
                    total_val_loss += loss.item()
                    total_val_dice += dice(preds, y_batch).item()
                    total_val_jaccard += jaccard(preds, y_batch).item()
                    total_val_iou += iou(preds, y_batch).item()
            avg_val_loss = total_val_loss / n_val
            mlflow.log_metric("val_loss", avg_val_loss, step=epoch)
            mlflow.log_metric("val_dice", total_val_dice / n_val, step=epoch)
            mlflow.log_metric("val_jaccard", total_val_jaccard / n_val, step=epoch)
            mlflow.log_metric("val_iou", total_val_iou / n_val, step=epoch)
            if avg_val_loss < best_val_loss:
                best_val_loss = avg_val_loss
                torch.save(model.state_dict(), best_model_path)
                logger.info(f"Saved new best model at epoch {epoch+1} with val_loss={avg_val_loss:.4f}")
        torch.save(model.state_dict(), config.model_path)
        input_example = X_batch[0].unsqueeze(0).cpu().numpy()
        mlflow.pytorch.log_model(model, artifact_path="last_model", input_example=input_example)
        model.load_state_dict(torch.load(best_model_path))
        mlflow.pytorch.log_model(model, artifact_path="best_model", input_example=input_example)
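`train_model` reads many attributes off a single `config` object, but the fields are never declared in one place. A minimal sketch of what such a config could look like as a dataclass — every field name comes from the code above, while all default values are illustrative assumptions, not the project's settings:

```python
from dataclasses import dataclass


@dataclass
class TrainConfig:
    # Field names mirror the config attributes used in train_model;
    # the defaults here are placeholders, not the project's values.
    model: str = "unet"            # "unet" | "segformer" | anything else -> baseline
    loss: str = "focal"            # "focal" | anything else -> cross-entropy
    epochs: int = 10
    learning_rate: float = 1e-3
    batch_size: int = 8
    patch_size: int = 256
    stride: int = 128
    num_workers: int = 4
    in_channels: int = 4
    out_classes: int = 11
    device: str = "cuda"
    zarr_cache_dir: str = "data/zarr_cache"
    model_path: str = "models/last_model.pth"
    mlflow_tracking_uri: str = ""  # empty -> falls back to MLFLOW_SERVER_URL
    mlflow_experiment_name: str = "esa-worldcover"


cfg = TrainConfig(model="segformer", epochs=25)
print(cfg.model, cfg.epochs)  # segformer 25
```

Collecting these knobs in one typed object also makes them trivial to log with `mlflow.log_params`, as the training loop does.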
utils.py
import logging
import xarray as xr
import numpy as np
from pyproj import Transformer
import dask.array as da
from numpy.lib.stride_tricks import sliding_window_view


def transform_bbox(bbox_wgs84, dest_crs):
    """Reproject a [minx, miny, maxx, maxy] WGS84 bounding box into `dest_crs`."""
    transformer = Transformer.from_crs("epsg:4326", dest_crs, always_xy=True)
    minx, miny = transformer.transform(bbox_wgs84[0], bbox_wgs84[1])
    maxx, maxy = transformer.transform(bbox_wgs84[2], bbox_wgs84[3])
    bbox_utm = [minx, miny, maxx, maxy]
    return bbox_utm


def map_labels(labels: xr.DataArray, mapping: dict[int, int]) -> xr.DataArray:
    """Remap label values according to `mapping`; unmapped values become 0."""
    data = labels.values
    mapped = np.full_like(data, fill_value=0, dtype=np.int32)
    for original, new in mapping.items():
        mapped[data == original] = new
    return xr.DataArray(mapped, coords=labels.coords, dims=labels.dims)


def clean_attrs(da_or_ds):
    # Only keep simple, JSON-serializable entries in attrs
    da_or_ds.attrs = {
        k: v
        for k, v in da_or_ds.attrs.items()
        if isinstance(v, (str, int, float, list, dict, bool, type(None)))
    }
    return da_or_ds


def clean_coords(ds: xr.Dataset, keep: list[str] | None = None) -> xr.Dataset:
    """
    Remove non-dimension coordinates from an xarray Dataset,
    except those explicitly listed in `keep`.

    Parameters
    ----------
    ds : xr.Dataset
        The dataset to clean.
    keep : list of str, optional
        Coordinate names to keep, even if they are non-dimension coords.

    Returns
    -------
    xr.Dataset
        Cleaned dataset with only dimension coordinates and selected metadata.
    """
    keep = keep or []

    # Get all dimension coordinates
    dim_coords = list(ds.dims)

    # Identify coordinates to drop (non-dim, not explicitly kept)
    drop_coords = [c for c in ds.coords if c not in dim_coords and c not in keep]

    # Drop extra coordinates
    ds = ds.drop_vars(drop_coords)

    return ds


def compute_valid_patch_indices(
    mask: xr.DataArray, patch_size: int, stride: int, threshold: float = 0.1
):
    """
    Compute top-left coordinates of valid patches based on the non-NaN ratio in each patch.

    Args:
        mask (xr.DataArray): 2D xarray DataArray (e.g. land cover labels or cloud mask).
        patch_size (int): Size of square patches.
        stride (int): Stride of the sliding window.
        threshold (float): Minimum fraction of valid (non-NaN) pixels in a patch.

    Returns:
        List of (i, j) tuples (top-left corners of valid patches).
    """
    # Binarize lazily as a Dask array: 1 = valid pixel, 0 = NaN
    binary_mask = (~da.isnan(mask.data)).astype("uint8")  # type: ignore
    logging.info("Shape of binary mask: %s", binary_mask.shape)

    # NumPy's sliding_window_view dispatches to Dask via the array-function
    # protocol, so this stays lazy. Window shape:
    # (num_patches_y, num_patches_x, patch_size, patch_size)
    sw = sliding_window_view(binary_mask, (patch_size, patch_size))[::stride, ::stride]

    # Mean valid fraction per patch (still lazy)
    patch_valid_fraction = sw.mean(axis=(-1, -2))

    # Trigger the actual computation
    patch_valid_fraction = patch_valid_fraction.compute()

    # Keep patches whose valid fraction meets the threshold
    valid_coords = np.argwhere(patch_valid_fraction >= threshold)

    # Map patch indices back to full-image pixel coordinates
    indices = [(i * stride, j * stride) for i, j in valid_coords]

    return indices
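To sanity-check the sliding-window logic in `compute_valid_patch_indices`, here is a standalone NumPy-only re-implementation of the same steps on a toy mask — no Dask or xarray, purely for illustration:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def valid_patch_indices(mask, patch_size, stride, threshold=0.1):
    # Same steps as compute_valid_patch_indices, on a plain NumPy array:
    # binarize (1 = valid, 0 = NaN), window, average, threshold, rescale.
    binary = (~np.isnan(mask)).astype("uint8")
    sw = sliding_window_view(binary, (patch_size, patch_size))[::stride, ::stride]
    frac = sw.mean(axis=(-1, -2))
    coords = np.argwhere(frac >= threshold)
    return [(i * stride, j * stride) for i, j in coords]


mask = np.zeros((8, 8))
mask[:, 4:] = np.nan  # right half of the image is invalid

# Only the two patches covering the left half pass a 50% validity threshold
print(valid_patch_indices(mask, patch_size=4, stride=4, threshold=0.5))
# → [(0, 0), (4, 0)]
```

The multiplication by `stride` at the end is what converts window-grid indices back into pixel coordinates in the full image.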
jobs.py
from dagster import job
from .ops import fetch_s2_stack, fetch_worldcover_stack, save_to_zarr, train_unet


@job
def s2_worldcover_landcover_classification_pipeline():
    features, proj, mask = fetch_s2_stack()  # pylint:disable=E1120
    labels = fetch_worldcover_stack(proj, mask)  # pylint:disable=E1120
    zarr_dir = save_to_zarr(features, labels)  # pylint:disable=E1120
    train_unet(zarr_dir)  # pylint:disable=E1120
definitions.py
from dagster import Definitions, load_assets_from_modules

from esa_worldcover_classification import assets  # noqa: TID252
from esa_worldcover_classification.jobs import (
    s2_worldcover_landcover_classification_pipeline,
)

all_assets = load_assets_from_modules([assets])

defs = Definitions(
    assets=all_assets,
    jobs=[s2_worldcover_landcover_classification_pipeline],
)