# Using Zarr with Xarray for Scalable Scientific Data
## 1. What is Zarr and Why Use It?
Zarr is a format for storing N-dimensional arrays in a chunked, compressed, and lazily loadable manner. Unlike NetCDF:

- Zarr stores data as a directory structure containing many small `.npy`-like chunk files (see the sketch below).
- It is designed for parallel and cloud-friendly access.
- It works seamlessly with Xarray + Dask for large out-of-core datasets.
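To make the first point concrete, here is a minimal sketch that writes a tiny dataset to Zarr and lists the resulting files; the variable name, sizes, and store path are made up for illustration:

```python
import os

import numpy as np
import xarray as xr

# Tiny illustrative dataset (names and sizes are arbitrary)
ds = xr.Dataset(
    {"temperature": (("time", "lat", "lon"), np.random.rand(10, 4, 4))}
)
ds.chunk({"time": 5}).to_zarr("tiny.zarr", mode="w")

# The store is just a directory tree: JSON metadata plus one small
# compressed file per chunk
for root, _, files in os.walk("tiny.zarr"):
    for name in files:
        print(os.path.join(root, name))
```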
Use Zarr if:

- You're working with larger-than-memory datasets.
- You want to scale across HPC or cloud storage (e.g., S3).
- You need faster writes and flexible chunking.
## 2. Creating a Zarr Store from NetCDF
```python
import xarray as xr

# Open the NetCDF file lazily with Dask chunks
ds = xr.open_dataset("example.nc", chunks={"time": 100})

# Save to Zarr
ds.to_zarr("example.zarr", mode="w")
```
To use compression:
```python
import numcodecs

compressor = numcodecs.Blosc(cname="zstd", clevel=5, shuffle=numcodecs.Blosc.SHUFFLE)
encoding = {var: {"compressor": compressor} for var in ds.data_vars}
ds.to_zarr("compressed.zarr", encoding=encoding, mode="w")
```
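To confirm the compressor was applied, you can reopen the store and inspect a variable's encoding; the variable name `temperature` is a placeholder, and the exact encoding keys can differ between zarr-python versions:

```python
# Reopen the compressed store and check the stored compressor settings
ds_check = xr.open_zarr("compressed.zarr")
print(ds_check["temperature"].encoding.get("compressor"))
```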
## 3. Opening Zarr Stores with open_zarr
```python
ds = xr.open_zarr("example.zarr", chunks={})  # auto-chunk based on the on-disk chunks
print(ds)
```
You can control chunks:
```python
ds = xr.open_zarr("example.zarr", chunks={"time": 500, "lat": 100, "lon": 100})
```
`open_zarr()` is lazy and uses Dask under the hood for parallelism.
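A quick way to see that laziness in action; `temperature` is a hypothetical variable name and the monthly-mean reduction is just an arbitrary example:

```python
ds = xr.open_zarr("example.zarr", chunks={"time": 500})

# Variables are Dask arrays; no chunk data has been read yet
print(type(ds["temperature"].data))

# Work only happens when you ask for a concrete result
monthly_mean = ds["temperature"].groupby("time.month").mean()
result = monthly_mean.compute()  # triggers the parallel read and reduction
```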
## 4. Writing Data to Zarr in Pieces (Streaming / Sharded)
To write data incrementally, one time slice (e.g., one year) at a time:
```python
# Create the store with the first year of data
ds_1980 = xr.open_dataset("data_1980.nc", chunks={"time": 10})
ds_1980.to_zarr("multi_year.zarr", mode="w")

# Append subsequent years along the time dimension
ds_1981 = xr.open_dataset("data_1981.nc", chunks={"time": 10})
ds_1981.to_zarr("multi_year.zarr", append_dim="time", mode="a")
```
This avoids loading full datasets into memory.
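Extending the same pattern, a loop over many yearly files might look like the following sketch; the file-name pattern and year range are assumptions for illustration:

```python
years = range(1980, 1990)
for i, year in enumerate(years):
    ds_year = xr.open_dataset(f"data_{year}.nc", chunks={"time": 10})
    if i == 0:
        # First write creates the store
        ds_year.to_zarr("multi_year.zarr", mode="w")
    else:
        # Later writes append along the time dimension
        ds_year.to_zarr("multi_year.zarr", append_dim="time", mode="a")
```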
## 5. Opening and Combining Multiple Zarr Stores
If you’ve partitioned your dataset into multiple Zarr stores (e.g., by year):
```python
import xarray as xr
from glob import glob

paths = sorted(glob("data/*.zarr"))
datasets = [xr.open_zarr(p, chunks={"time": 100}) for p in paths]
combined = xr.combine_by_coords(datasets)
```
For consistent time ordering, sort paths and check for duplicate coordinates.
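One way to perform that check on the combined result, using the pandas index that Xarray exposes for the `time` coordinate:

```python
# Verify the combined time coordinate is unique and sorted
time_index = combined.get_index("time")
assert time_index.is_unique, "duplicate time values across stores"
assert time_index.is_monotonic_increasing, "time coordinate is not sorted"
```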
## 6. Performance Tips (HPC / Cloud)
- Always chunk your data! Use `ds.chunk()` and match chunk sizes to your access pattern (see the rechunking sketch after the cloud example below).
- Use `numcodecs.Blosc` for fast compression; Zstd is a good default.
- Avoid too many small variables/files.
- If running on cloud storage (e.g., S3), use `fsspec` together with `s3fs`, `gcsfs`, etc.:
```python
import fsspec

fs_map = fsspec.get_mapper("s3://my-bucket/data.zarr", anon=False)
ds = xr.open_zarr(fs_map, consolidated=True)
```
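For the first tip, rechunking before writing is often the simplest fix. The chunk sizes below are illustrative only; aim for chunks of roughly 10-100 MB that match how you will read the data:

```python
# Rechunk to match the expected access pattern, then write with
# consolidated metadata for faster opens on object storage.
# Note: if ds was read from another Zarr store, you may need to clear the
# stale ds[var].encoding["chunks"] entries before writing.
ds = ds.chunk({"time": 365, "lat": 100, "lon": 100})
ds.to_zarr("rechunked.zarr", mode="w", consolidated=True)
```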
## 7. Advanced Tools: Kerchunk + Zarr v3
**Kerchunk**: builds a virtual Zarr index over existing NetCDF/HDF5 files, which is great for cloud or read-only stores:
```python
import fsspec
import xarray as xr

reference_path = "reference.json"  # created by kerchunk (see below)
fs = fsspec.filesystem("reference", fo=reference_path)
mapper = fs.get_mapper("")
ds = xr.open_dataset(mapper, engine="zarr", backend_kwargs={"consolidated": False})
```
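The reference file itself can be generated with kerchunk. A minimal sketch for a single local NetCDF/HDF5 file (the input path is a placeholder):

```python
import json

import fsspec
from kerchunk.hdf import SingleHdf5ToZarr

src = "data_1980.nc"  # placeholder input file
with fsspec.open(src, "rb") as f:
    refs = SingleHdf5ToZarr(f, src).translate()

# Save the virtual Zarr index for later use with the "reference" filesystem
with open("reference.json", "w") as out:
    json.dump(refs, out)
```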
**Zarr v3** (experimental): enables sharding and better cloud compatibility. You can enable it in recent versions of `zarr-python` and `xarray`.
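If you want to experiment with the v3 format, a hedged sketch follows; the exact keyword has changed between releases (older xarray versions used `zarr_version=3`, newer ones accept `zarr_format=3`), so check your installed versions first:

```python
# Requires zarr-python v3 and a recent xarray
ds.to_zarr("example_v3.zarr", mode="w", zarr_format=3)
```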
## Summary: Zarr with Xarray
| Task | Method |
|---|---|
| Open Zarr | `xr.open_zarr()` |
| Save Zarr | `ds.to_zarr()` |
| Compress | Use a `numcodecs` compressor in `encoding` |
| Combine Zarrs | `xr.combine_by_coords()` |
| Use with Cloud | Use `fsspec` mappers (`s3fs`, `gcsfs`, etc.) |
| Virtual Zarr from NetCDF | Use `kerchunk` |
| Parallel I/O | Use Dask with chunks |