Using Zarr with Xarray for Scalable Scientific Data#

1. What is Zarr and Why Use It?#

Zarr is a format for storing N-dimensional arrays in a chunked, compressed, and lazily-loadable manner. Unlike NetCDF:

  • Zarr stores data as a directory tree of many small compressed binary chunk files (see the sketch after this list).

  • Designed for parallel and cloud-friendly access.

  • Works seamlessly with Xarray + Dask for large out-of-core datasets.
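
To get a feel for the layout, list the files inside a store. A minimal sketch, assuming the example.zarr store written in section 2 ("temp" is an illustrative variable name):

import os

# Each variable becomes a directory holding metadata plus compressed chunk files
for root, dirs, files in os.walk("example.zarr"):
    for name in files:
        print(os.path.join(root, name))  # e.g. example.zarr/temp/0.0.0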

Use Zarr if:

  • You’re working with larger-than-memory datasets.

  • You want to scale across HPC or cloud storage (e.g., S3).

  • You need faster writes and flexible chunking.


2. Creating a Zarr Store from NetCDF#

import xarray as xr

# Open NetCDF file
ds = xr.open_dataset("example.nc", chunks={"time": 100})  # use dask chunks!

# Save to Zarr
ds.to_zarr("example.zarr", mode="w")

To use compression:

import numcodecs

compressor = numcodecs.Blosc(cname="zstd", clevel=5, shuffle=numcodecs.Blosc.SHUFFLE)
encoding = {var: {"compressor": compressor} for var in ds.data_vars}

ds.to_zarr("compressed.zarr", encoding=encoding, mode="w")

3. Opening Zarr Stores with open_zarr#

ds = xr.open_zarr("example.zarr", chunks={})  # Auto chunk
print(ds)

You can control chunks:

ds = xr.open_zarr("example.zarr", chunks={"time": 500, "lat": 100, "lon": 100})

open_zarr() is lazy and uses Dask under the hood for parallelism.
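
Laziness means no array bytes are read until you ask for values. A minimal sketch, where "temp" stands in for one of your variable names:

subset = ds["temp"].isel(time=0)  # builds a dask graph; nothing is read yet
print(subset)                     # shows chunk layout, still no data loaded
values = subset.compute()         # only now are the required chunks read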


4. Writing Data to Zarr in Pieces (Streaming / Appending)#

To write data over time slices:

ds_1980 = xr.open_dataset("data_1980.nc", chunks={"time": 10})
ds_1980.to_zarr("multi_year.zarr", mode="w")  # create the store first (no append_dim)

ds_1981 = xr.open_dataset("data_1981.nc", chunks={"time": 10})
ds_1981.to_zarr("multi_year.zarr", append_dim="time", mode="a")  # append

This avoids loading full datasets into memory.
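
The same pattern scales to many files in a loop. A minimal sketch, assuming yearly files named data_<year>.nc (illustrative names):

for i, year in enumerate(range(1980, 1990)):
    ds_year = xr.open_dataset(f"data_{year}.nc", chunks={"time": 10})
    if i == 0:
        ds_year.to_zarr("multi_year.zarr", mode="w")           # create the store
    else:
        ds_year.to_zarr("multi_year.zarr", append_dim="time")  # extend along time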


5. Opening and Combining Multiple Zarr Stores#

If you’ve partitioned your dataset into multiple Zarrs (e.g., by year):

import xarray as xr
from glob import glob

paths = sorted(glob("data/*.zarr"))

datasets = [xr.open_zarr(p, chunks={"time": 100}) for p in paths]
combined = xr.combine_by_coords(datasets)

For consistent time ordering, sort paths and check for duplicate coordinates.
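
A quick sanity check after combining: the merged time index should be unique and sorted. A minimal sketch:

time_index = combined.get_index("time")
assert time_index.is_unique, "duplicate time values across stores"
assert time_index.is_monotonic_increasing, "stores were combined out of order"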


6. Performance Tips (HPC / Cloud)#

  • Always chunk your data! Use ds.chunk() and match chunks to your access pattern (see the rechunking sketch after the code below).

  • Use numcodecs.Blosc for fast compression; Zstd is a good default.

  • Avoid too many small variables or chunk files; each one adds metadata and per-file I/O overhead.

  • If running in the cloud (e.g., S3), use fsspec together with the matching filesystem package (s3fs for S3, gcsfs for GCS), for example:

import fsspec

fs_map = fsspec.get_mapper("s3://my-bucket/data.zarr", anon=False)
ds = xr.open_zarr(fs_map, consolidated=True)
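
To match chunks to an access pattern (the first tip above), rechunk before writing. A minimal sketch for a time-series workload that reads long runs along time (dimension names are illustrative):

# Favor long chunks along time, small chunks in space, for time-series reads
ds = ds.chunk({"time": 1000, "lat": 50, "lon": 50})
ds.to_zarr("rechunked.zarr", mode="w")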

7. Advanced Tools: Kerchunk + Zarr v3#

Kerchunk builds a virtual Zarr index over existing NetCDF/HDF5 files, which is great for cloud or read-only stores.
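
The reference file is generated once by scanning the source files. A minimal sketch, assuming kerchunk is installed and a local example.nc (file names are illustrative):

import json
from kerchunk.hdf import SingleHdf5ToZarr

# Scan one HDF5/NetCDF4 file and collect byte-range references to its chunks
refs = SingleHdf5ToZarr("example.nc").translate()

with open("reference.json", "w") as f:
    json.dump(refs, f)

With the reference file in hand, open it through fsspec's reference filesystem: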

import fsspec
import xarray as xr

reference_path = "reference.json"  # created by kerchunk
fs = fsspec.filesystem("reference", fo=reference_path)
mapper = fs.get_mapper("")

ds = xr.open_dataset(mapper, engine="zarr", backend_kwargs={"consolidated": False})

Zarr v3 (experimental): Enables sharding and better cloud compatibility. You can enable it in recent versions of zarr-python and xarray.
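
A minimal sketch of opting in when writing; the keyword name has changed across releases (older xarray used zarr_version), so treat this as an assumption to check against your installed versions:

# Request the v3 on-disk format (requires zarr-python v3 and a recent xarray)
ds.to_zarr("example_v3.zarr", mode="w", zarr_format=3)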


Summary: Zarr with Xarray#

| Task | Method |
| --- | --- |
| Open Zarr | xr.open_zarr() |
| Save Zarr | ds.to_zarr() |
| Compress | numcodecs compressor in encoding |
| Combine Zarrs | xr.combine_by_coords() |
| Use with cloud | fsspec.get_mapper() |
| Virtual Zarr from NetCDF | kerchunk |
| Parallel I/O | Dask via chunks |