Best Practices for Storing NetCDF and Zarr Datasets#

Managing large scientific datasets efficiently is essential in climate research and model simulation workflows. Two common formats for multidimensional array data—NetCDF and Zarr—each have their own strengths, but both require careful storage strategies to maximize performance, reduce costs, and preserve system stability.

This post explores:

  • How to store NetCDF and Zarr datasets effectively

  • Why you should avoid storing thousands of tiny NetCDF files

  • How to compress data in both formats

  • How to consolidate data to save inode space and improve scalability


1. Understanding NetCDF vs. Zarr#

| Feature | NetCDF | Zarr |
| --- | --- | --- |
| File format | Binary (HDF5-based for NetCDF4) | Directory-based or ZIP-like |
| Read/write support | Widely supported in C, Fortran, Python | Native to Python; growing support in other languages |
| Parallel I/O | Limited MPI-based I/O (NetCDF4) | Cloud-native parallel access |
| Chunking | Yes | Yes |
| Compression | Yes | Yes (via codecs like Blosc, Zstd) |
| Storage layout | Monolithic file | Nested directories |
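
Whichever format you choose, xarray offers a common interface to both; a minimal sketch (file and store names are illustrative):

import xarray as xr

ds = xr.open_dataset("data.nc")  # NetCDF: a single monolithic file
ds.to_zarr("data.zarr")          # Zarr: a directory of chunked files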


2. Why You Should Avoid Storing Thousands of Tiny NetCDF Files#

A very common anti-pattern in scientific workflows is generating one NetCDF file per timestep—e.g., hourly outputs for multiple years. While this may seem modular and easy to manage, it leads to serious problems:

Problems with Many Small Files:#

  • Inode exhaustion: HPC systems and large file systems have a finite number of inodes (the metadata structures that track files), and creating millions of files exhausts them quickly; the sketch after this list shows how to check usage.

  • Slow file system traversal: Accessing directories with many files is slow for users and programs.

  • NetCDF metadata overhead: Each file has a header, and the cumulative cost of repeatedly reading metadata adds up.

  • Harder to parallelize: Reading many small files creates an I/O bottleneck when parallel workers fight for disk access.
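
To check how close a file system is to inode exhaustion, here is a minimal sketch using only the Python standard library on POSIX systems; the path is illustrative:

import os

st = os.statvfs("/path/to/project")   # file system statistics, including inodes
used = st.f_files - st.f_ffree        # total inodes minus free inodes
print(f"{used} of {st.f_files} inodes in use")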

Better Practice:#

Instead, concatenate individual timesteps into a single file per time period (e.g., one NetCDF file per year or variable), or convert your dataset into a chunked Zarr store for optimized parallel access.


3. Concatenating NetCDF Files#

To merge multiple NetCDF files along the time dimension:

Using Xarray (Python):#

import xarray as xr

# Lazily open all matching files and combine them along shared coordinates
ds = xr.open_mfdataset("file_*.nc", combine='by_coords')
ds.to_netcdf("merged.nc")  # write the combined dataset as a single file

This drastically reduces the number of files, making the dataset easier to manage and more efficient to access.

Using NCO (NetCDF Operators):#

ncrcat file_*.nc merged.nc

Or, to guarantee chronological order, list the files explicitly (-h omits the history attribute, -O overwrites any existing output, -o names the output file):

ncrcat -h -O -o merged.nc file_1980.nc file_1981.nc file_1982.nc
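
Whichever tool you use, a quick sanity check that the merged time axis is ordered can save headaches later; an illustrative snippet, assuming a coordinate named 'time':

import xarray as xr

merged = xr.open_dataset("merged.nc")
# The time index should increase monotonically after concatenation
assert merged.indexes["time"].is_monotonic_increasing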

4. Compressing NetCDF and Zarr Data#

Compression is essential to reduce disk usage and I/O time. Both NetCDF4 and Zarr support chunked compression, which stores data in compressed blocks (chunks).

NetCDF4 Compression via Xarray:#

# Apply zlib compression at level 4 to every data variable
comp = dict(zlib=True, complevel=4)
encoding = {var: comp for var in ds.data_vars}

ds.to_netcdf("compressed.nc", encoding=encoding)

Tips:#

  • Use Zstandard (zstd) for the best speed/compression balance; Zarr supports it through the Blosc codec (see the sketch after this list).

  • Avoid compressing coordinates—they’re small and often accessed.

  • Chunk wisely: too many small chunks = overhead; too few = memory pressure.
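
To illustrate the zstd tip in the Zarr context, here is a minimal sketch assuming the zarr-python 2.x format and the numcodecs package (store name and compression level are illustrative):

import numcodecs

# Blosc wrapping the zstd codec: fast with a good compression ratio
compressor = numcodecs.Blosc(cname="zstd", clevel=4, shuffle=numcodecs.Blosc.SHUFFLE)
encoding = {var: {"compressor": compressor} for var in ds.data_vars}
ds.to_zarr("compressed.zarr", encoding=encoding)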


5. Zarr-Specific Storage Considerations#

Zarr stores each chunk as a separate file. This makes it perfect for parallel cloud or object storage access but can still run into the many-small-file problem on traditional file systems.

Mitigations:#

  • Combine chunks along multiple dimensions to reduce file count (e.g., time × lat)

  • Use Zarr v3 sharding, which groups multiple chunks into larger shard files (this feature is still in beta)

  • Write to a single-file ZIP archive when storing on traditional file systems (e.g., zarr.ZipStore), as shown in the sketch below

ds = ds.chunk({'time': 500, 'lat': 100, 'lon': 100})  # larger chunks = fewer files
ds.to_zarr("store.zarr")  # write the rechunked store (name illustrative)
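
For the archive option above, a minimal sketch using the zarr-python 2.x ZipStore (the file name is illustrative); the entire store then occupies a single file, and a single inode:

import zarr

store = zarr.ZipStore("dataset.zarr.zip", mode="w")
ds.to_zarr(store)
store.close()  # ZipStore must be closed explicitly to finalize the archive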

6. Final Tips and Summary#

| Recommendation | Format | Tool / Method |
| --- | --- | --- |
| Avoid small files | NetCDF | Use ncrcat / xarray.open_mfdataset() |
| Compress chunks | Both | Use zlib / Blosc + zstd |
| Store locally | NetCDF | Prefer single files per year |
| Use shards | Zarr v3 | Coming soon to Zarr v3 / Xarray |