Create Cloud Native Geospatial Formats#

This notebook provides step-by-step instructions for converting geospatial data into cloud-native formats, which are optimized for efficient storage, access, and processing in cloud environments. The notebook covers the following workflows:

  1. Convert JPEG 2000 (JP2) to Cloud Optimized GeoTIFF (COG): Demonstrates how to use the GDAL library to transform raster data from the JPEG 2000 format into the COG format, whose internal tiling and overviews let cloud-based applications read only the parts of an image they need.

  2. Convert Shapefile (SHP) to GeoParquet: Explains how to use GeoPandas to convert vector data from the Shapefile format to GeoParquet, a columnar format that supports efficient querying and storage.

  3. Convert NetCDF to Zarr: Shows how to use the xarray library to convert NetCDF files into the Zarr format, a cloud-friendly format designed for scalable and efficient data storage.

Convert a JPEG 2000 (JP2) raster to Cloud Optimized GeoTIFF (COG) using GDAL#

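import glob
from osgeo import gdal

# Raise Python exceptions on GDAL errors (this becomes the default in GDAL 4.0)
gdal.UseExceptions()

# Collect every JPEG 2000 file in the data directory
list_input_jp2 = glob.glob("../data/*.jp2")
print(list_input_jp2)
for input_jp2 in list_input_jp2:
    print(input_jp2)
    # Write the COG next to its source file, with a "_cog.tif" suffix
    output_cog = input_jp2.replace('.jp2', '_cog.tif')
    print(output_cog)
    options = gdal.TranslateOptions(
        format='COG',
        creationOptions=[
            'COMPRESS=LZW',              # Lossless compression
            'BLOCKSIZE=512',             # 512x512 internal tiles for better cloud access
            'OVERVIEWS=IGNORE_EXISTING'  # Rebuild overviews instead of reusing any in the source
        ]
    )
    gdal.Translate(destName=output_cog, srcDS=input_jp2, options=options)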
['../data/T34SGH_20240608T090601_TCI_10m.jp2', '../data/T34SFG_20240621T092031_TCI_10m.jp2', '../data/T34SEG_20240601T092031_TCI_10m.jp2', '../data/T34SFF_20220209T091131_TCI_10m.jp2']
../data/T34SGH_20240608T090601_TCI_10m.jp2
../data/T34SGH_20240608T090601_TCI_10m_cog.tif
../data/T34SFG_20240621T092031_TCI_10m.jp2
../data/T34SFG_20240621T092031_TCI_10m_cog.tif
../data/T34SEG_20240601T092031_TCI_10m.jp2
../data/T34SEG_20240601T092031_TCI_10m_cog.tif
../data/T34SFF_20220209T091131_TCI_10m.jp2
../data/T34SFF_20220209T091131_TCI_10m_cog.tif
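
The conversion above prints only file names, so it can be useful to confirm that each output really has the COG layout. The following is a minimal sketch, not part of the original notebook: the helper names `read_cog_info` and `looks_like_cog` are hypothetical, and the check assumes that GDAL reports `LAYOUT=COG` in the `IMAGE_STRUCTURE` metadata domain for files written by its COG driver. Very small rasters may legitimately have no overviews, so the overview check is an assumption suited to large tiles like the ones above.

```python
def read_cog_info(path):
    """Collect the properties that matter for cloud access from a GeoTIFF."""
    # Imported lazily so that looks_like_cog() can be used without GDAL installed
    from osgeo import gdal

    gdal.UseExceptions()
    ds = gdal.Open(path)
    band = ds.GetRasterBand(1)
    block_x, block_y = band.GetBlockSize()
    return {
        "layout": ds.GetMetadataItem("LAYOUT", "IMAGE_STRUCTURE"),
        "compression": ds.GetMetadataItem("COMPRESSION", "IMAGE_STRUCTURE"),
        "block_size": (block_x, block_y),
        "overview_count": band.GetOverviewCount(),
    }


def looks_like_cog(info, expected_block=512):
    """Return True if the collected properties match the options used above."""
    return (
        info["layout"] == "COG"
        and info["block_size"] == (expected_block, expected_block)
        and info["overview_count"] > 0
    )


# Example usage (requires GDAL and one of the converted files):
# info = read_cog_info("../data/T34SGH_20240608T090601_TCI_10m_cog.tif")
# print(info, looks_like_cog(info))
```

Keeping the file access in one function and the decision in another means the rule for what counts as acceptable can be adjusted, for example a different block size, without touching the GDAL calls.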

Convert a Shapefile (SHP) to GeoParquet using GeoPandas (copy and paste the code into a .py file)#

# import geopandas as gpd


# input_shp = "../data/sentinel-2-tiles-greece.shp"
# output_geoparquet = "../data/test.parquet"

# gdf = gpd.read_file(input_shp)
# gdf.to_parquet(output_geoparquet, engine="pyarrow")

# geoparquet = gpd.read_parquet(output_geoparquet)
# print(geoparquet.head())
# print(geoparquet.columns)
# print(geoparquet.dtypes)
# print(geoparquet.crs)
# print(geoparquet.geometry.head())
# print(geoparquet.geometry.dtypes)
# print(geoparquet.geometry.iloc[0].wkt)
# print(geoparquet.geometry.iloc[0].geom_type)
# print(geoparquet.geometry.iloc[0].bounds)
# print(geoparquet.geometry.iloc[0].area)
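
The same conversion can be wrapped in a function that also checks the result. This is a hedged sketch rather than the notebook's own method: `default_parquet_path` and `shapefile_to_geoparquet` are hypothetical names, and the checks (row count and CRS) are only a basic sanity test. It reads back just the geometry column to illustrate that a columnar format allows loading a subset of columns.

```python
from pathlib import Path


def default_parquet_path(input_shp):
    """Return the input path with its extension swapped for .parquet."""
    return Path(input_shp).with_suffix(".parquet")


def shapefile_to_geoparquet(input_shp, output_parquet=None):
    """Convert a Shapefile to GeoParquet and run basic round-trip checks."""
    # Imported lazily; requires geopandas and pyarrow to be installed
    import geopandas as gpd

    output_parquet = Path(output_parquet or default_parquet_path(input_shp))

    gdf = gpd.read_file(input_shp)
    gdf.to_parquet(output_parquet, engine="pyarrow")

    # Parquet is columnar, so only the requested column is read from disk
    geometry_only = gpd.read_parquet(output_parquet, columns=[gdf.geometry.name])

    if len(geometry_only) != len(gdf):
        raise ValueError("row count changed during conversion")
    if geometry_only.crs != gdf.crs:
        raise ValueError("CRS changed during conversion")
    return output_parquet


# Example usage (requires the Shapefile used above):
# shapefile_to_geoparquet("../data/sentinel-2-tiles-greece.shp")
```

When no output path is given, the file is written next to the input with a `.parquet` extension.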

Convert NetCDF to Zarr using xarray#

import xarray as xr

# Open the source NetCDF file
netcdf_file = "../data/era5.nc"
ds = xr.open_dataset(netcdf_file)

# Write the dataset to a Zarr store; mode='w' overwrites an existing store
zarr_file = "../data/era5.zarr"
ds.to_zarr(zarr_file, mode='w')

print(f"Conversion complete: {zarr_file}")

# Reopen the store to confirm that it can be read back
ds_zarr = xr.open_zarr(zarr_file)

# Check metadata and structure
print("dataset", ds_zarr)
print("dimensions", ds_zarr.dims)
print("variables", ds_zarr.variables)
/Users/syam/virtualenvs/myvenv/lib/python3.13/site-packages/zarr/codecs/vlen_utf8.py:44: UserWarning: The codec `vlen-utf8` is currently not part in the Zarr format 3 specification. It may not be supported by other zarr implementations and may change in the future.
  return cls(**configuration_parsed)
/Users/syam/virtualenvs/myvenv/lib/python3.13/site-packages/zarr/core/array.py:3989: UserWarning: The dtype `<U4` is currently not part in the Zarr format 3 specification. It may not be supported by other zarr implementations and may change in the future.
  meta = AsyncArray._create_metadata_v3(
/Users/syam/virtualenvs/myvenv/lib/python3.13/site-packages/zarr/api/asynchronous.py:203: UserWarning: Consolidated metadata is currently not part in the Zarr format 3 specification. It may not be supported by other zarr implementations and may change in the future.
  warnings.warn(
Conversion complete: ../data/era5.zarr
dataset <xarray.Dataset> Size: 4MB
Dimensions:         (pressure_level: 1, valid_time: 1, latitude: 721,
                     longitude: 1440)
Coordinates:
  * pressure_level  (pressure_level) float64 8B 1e+03
    number          int64 8B ...
  * latitude        (latitude) float64 6kB 90.0 89.75 89.5 ... -89.75 -90.0
    expver          object 8B ...
  * longitude       (longitude) float64 12kB -180.0 -179.8 ... 179.5 179.8
  * valid_time      (valid_time) datetime64[ns] 8B 2023-02-01T13:00:00
Data variables:
    z               (valid_time, pressure_level, latitude, longitude) float32 4MB dask.array<chunksize=(1, 1, 181, 720), meta=np.ndarray>
Attributes:
    GRIB_centre:             ecmf
    GRIB_centreDescription:  European Centre for Medium-Range Weather Forecasts
    GRIB_subCentre:          0
    Conventions:             CF-1.7
    institution:             European Centre for Medium-Range Weather Forecasts
    history:                 2025-04-11T08:09 GRIB to CDM+CF via cfgrib-0.9.1...
dimensions FrozenMappingWarningOnValuesAccess({'pressure_level': 1, 'valid_time': 1, 'latitude': 721, 'longitude': 1440})
variables Frozen({'pressure_level': <xarray.IndexVariable 'pressure_level' (pressure_level: 1)> Size: 8B
array([1000.])
Attributes:
    long_name:         pressure
    units:             hPa
    positive:          down
    stored_direction:  decreasing
    standard_name:     air_pressure, 'number': <xarray.Variable ()> Size: 8B
[1 values with dtype=int64]
Attributes:
    long_name:      ensemble member numerical id
    units:          1
    standard_name:  realization, 'z': <xarray.Variable (valid_time: 1, pressure_level: 1, latitude: 721,
                  longitude: 1440)> Size: 4MB
dask.array<open_dataset-z, shape=(1, 1, 721, 1440), dtype=float32, chunksize=(1, 1, 181, 720), chunktype=numpy.ndarray>
Attributes: (12/31)
    GRIB_paramId:                             129
    GRIB_dataType:                            an
    GRIB_numberOfPoints:                      1038240
    GRIB_typeOfLevel:                         isobaricInhPa
    GRIB_stepUnits:                           1
    GRIB_stepType:                            instant
    ...                                       ...
    GRIB_shortName:                           z
    GRIB_totalNumber:                         0
    GRIB_units:                               m**2 s**-2
    long_name:                                Geopotential
    units:                                    m**2 s**-2
    standard_name:                            geopotential, 'latitude': <xarray.IndexVariable 'latitude' (latitude: 721)> Size: 6kB
array([ 90.  ,  89.75,  89.5 , ..., -89.5 , -89.75, -90.  ], shape=(721,))
Attributes:
    units:             degrees_north
    standard_name:     latitude
    long_name:         latitude
    stored_direction:  decreasing, 'expver': <xarray.Variable ()> Size: 8B
[1 values with dtype=object], 'longitude': <xarray.IndexVariable 'longitude' (longitude: 1440)> Size: 12kB
array([-180.  , -179.75, -179.5 , ...,  179.25,  179.5 ,  179.75],
      shape=(1440,))
Attributes:
    units:          degrees_east
    standard_name:  longitude
    long_name:      longitude, 'valid_time': <xarray.IndexVariable 'valid_time' (valid_time: 1)> Size: 8B
array(['2023-02-01T13:00:00.000000000'], dtype='datetime64[ns]')
Attributes:
    long_name:      time
    standard_name:  time})
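
Chunk layout largely determines how efficiently a Zarr store can be read from cloud storage, and the conversion above leaves it to the library defaults. The sketch below shows one way to set it explicitly before writing. The names `capped_chunks` and `netcdf_to_chunked_zarr` are hypothetical, the limits in the usage comment simply mirror the chunk sizes reported in the output above, and suitable values depend on the access pattern. It assumes dask is installed, since `Dataset.chunk` relies on it.

```python
def capped_chunks(sizes, limits):
    """Cap each dimension's chunk length at the given limit, if one is set."""
    return {dim: min(size, limits.get(dim, size)) for dim, size in sizes.items()}


def netcdf_to_chunked_zarr(netcdf_file, zarr_file, limits):
    """Write a NetCDF file to Zarr with explicit chunk sizes and reopen it."""
    # Imported lazily; requires xarray, dask, zarr and a NetCDF backend
    import xarray as xr

    ds = xr.open_dataset(netcdf_file)
    chunks = capped_chunks(dict(ds.sizes), limits)

    # Rechunk first so that the Zarr chunks follow the requested layout
    ds.chunk(chunks).to_zarr(zarr_file, mode="w")

    ds_zarr = xr.open_zarr(zarr_file)
    if dict(ds_zarr.sizes) != dict(ds.sizes):
        raise ValueError("dimension sizes changed during conversion")
    return ds_zarr


# Example usage (requires the ERA5 file used above):
# ds_zarr = netcdf_to_chunked_zarr(
#     "../data/era5.nc",
#     "../data/era5.zarr",
#     limits={"latitude": 181, "longitude": 720},
# )
```

Dimensions without an entry in `limits` are kept in a single chunk, which suits the length-1 `valid_time` and `pressure_level` dimensions of this dataset.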