Best Practices for Storing NetCDF and Zarr Datasets#

Managing large scientific datasets efficiently is essential in climate research and model simulation workflows. Two common formats for multidimensional array data—NetCDF and Zarr—each have their own strengths, but both require careful storage strategies to maximize performance, reduce costs, and preserve system stability.

This post explores:

  • How to store NetCDF and Zarr datasets effectively

  • Why you should avoid storing thousands of tiny NetCDF files

  • How to compress data in both formats

  • How to consolidate data to save inode space and improve scalability


1. Understanding NetCDF vs. Zarr#

| Feature | NetCDF | Zarr |
| --- | --- | --- |
| File format | Binary (HDF5-based for NetCDF4) | Directory-based or ZIP-like |
| Read/write support | Widely supported in C, Fortran, Python | Native to Python; growing support in other languages |
| Parallel I/O | Limited MPI-based I/O (NetCDF4) | Cloud-native parallel access |
| Chunking | Yes | Yes |
| Compression | Yes | Yes (via codecs like Blosc, Zstd) |
| Storage layout | Monolithic file | Nested directories |
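
Whichever format you choose, xarray offers a common interface to both; a minimal sketch (file and store names are illustrative):

import xarray as xr

ds = xr.open_dataset("data.nc")  # NetCDF: a single monolithic file
ds.to_zarr("data.zarr")          # Zarr: a directory of chunked files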


2. Why You Should Avoid Storing Thousands of Tiny NetCDF Files#

A very common anti-pattern in scientific workflows is generating one NetCDF file per timestep—e.g., hourly outputs for multiple years. While this may seem modular and easy to manage, it leads to serious problems:

Problems with Many Small Files:#

  • Inode exhaustion: HPC systems and large file systems have a finite number of inodes (the metadata structures that track files), and creating millions of files exhausts them quickly; the sketch after this list shows how to check usage.

  • Slow file system traversal: Accessing directories with many files is slow for users and programs.

  • NetCDF metadata overhead: Each file has a header, and the cumulative cost of repeatedly reading metadata adds up.

  • Harder to parallelize: Reading many small files creates an I/O bottleneck when parallel workers fight for disk access.
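
To check how close a file system is to inode exhaustion, here is a minimal sketch using only the Python standard library on POSIX systems; the path is illustrative:

import os

st = os.statvfs("/path/to/project")   # file system statistics, including inodes
used = st.f_files - st.f_ffree        # total inodes minus free inodes
print(f"{used} of {st.f_files} inodes in use")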

Better Practice:#

Instead, concatenate individual timesteps into a single file per time period (e.g., one NetCDF file per year or variable), or convert your dataset into a chunked Zarr store for optimized parallel access.


3. Concatenating NetCDF Files#

To merge multiple NetCDF files along the time dimension:

Using Xarray (Python):#

import xarray as xr

# Lazily open all matching files and combine them along shared coordinates
ds = xr.open_mfdataset("file_*.nc", combine='by_coords')
ds.to_netcdf("merged.nc")  # write the combined dataset as a single file

This drastically reduces the number of files, making the dataset easier to manage and more efficient to access.

Using NCO (NetCDF Operators):#

ncrcat file_*.nc merged.nc

Or, to guarantee chronological order, list the files explicitly (-h omits the history attribute, -O overwrites any existing output, -o names the output file):

ncrcat -h -O -o merged.nc file_1980.nc file_1981.nc file_1982.nc
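
Whichever tool you use, a quick sanity check that the merged time axis is ordered can save headaches later; an illustrative snippet, assuming a coordinate named 'time':

import xarray as xr

merged = xr.open_dataset("merged.nc")
# The time index should increase monotonically after concatenation
assert merged.indexes["time"].is_monotonic_increasing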

4. Compressing NetCDF and Zarr Data#

Compression is essential to reduce disk usage and I/O time. Both NetCDF4 and Zarr support chunked compression, which stores data in compressed blocks (chunks).

NetCDF4 Compression via Xarray:#

# Apply zlib compression at level 4 to every data variable
comp = dict(zlib=True, complevel=4)
encoding = {var: comp for var in ds.data_vars}

ds.to_netcdf("compressed.nc", encoding=encoding)

Tips:#

  • Use Zstandard (zstd) for the best speed/compression balance; Zarr supports it through the Blosc codec (see the sketch after this list).

  • Avoid compressing coordinates—they’re small and often accessed.

  • Chunk wisely: too many small chunks = overhead; too few = memory pressure.
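
To illustrate the zstd tip in the Zarr context, here is a minimal sketch assuming the zarr-python 2.x format and the numcodecs package (store name and compression level are illustrative):

import numcodecs

# Blosc wrapping the zstd codec: fast with a good compression ratio
compressor = numcodecs.Blosc(cname="zstd", clevel=4, shuffle=numcodecs.Blosc.SHUFFLE)
encoding = {var: {"compressor": compressor} for var in ds.data_vars}
ds.to_zarr("compressed.zarr", encoding=encoding)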


5. Zarr-Specific Storage Considerations#

Zarr stores each chunk as a separate file. This makes it perfect for parallel cloud or object storage access but can still run into the many-small-file problem on traditional file systems.

Mitigations:#

  • Combine chunks along multiple dimensions to reduce file count (e.g., time × lat)

  • Use Zarr v3 sharding, which groups multiple chunks into larger shard files (this feature is still in beta)

  • Write to a single-file ZIP archive when storing on traditional file systems (e.g., zarr.ZipStore), as shown in the sketch below

ds = ds.chunk({'time': 500, 'lat': 100, 'lon': 100})  # larger chunks = fewer files
ds.to_zarr("store.zarr")  # write the rechunked store (name illustrative)
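
For the archive option above, a minimal sketch using the zarr-python 2.x ZipStore (the file name is illustrative); the entire store then occupies a single file, and a single inode:

import zarr

store = zarr.ZipStore("dataset.zarr.zip", mode="w")
ds.to_zarr(store)
store.close()  # ZipStore must be closed explicitly to finalize the archive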

6. Final Tips and Summary#

| Recommendation | Format | Tool / Method |
| --- | --- | --- |
| Avoid small files | NetCDF | Use ncrcat / xarray.open_mfdataset() |
| Compress chunks | Both | Use zlib / Blosc + zstd |
| Store locally | NetCDF | Prefer single files per year |
| Use shards | Zarr v3 | Coming soon to Zarr v3 / Xarray |