Best Practices for Storing NetCDF and Zarr Datasets#
Managing large scientific datasets efficiently is essential in climate research and model simulation workflows. Two common formats for multidimensional array data—NetCDF and Zarr—each have their own strengths, but both require careful storage strategies to maximize performance, reduce costs, and preserve system stability.
This post explores:
- How to store NetCDF and Zarr datasets effectively
- Why you should avoid storing thousands of tiny NetCDF files
- How to compress NetCDF and Zarr data
- How to consolidate data to conserve inodes and improve scalability
1. Understanding NetCDF vs. Zarr#
| Feature | NetCDF | Zarr |
|---|---|---|
| File format | Binary (HDF5-based for NetCDF4) | Directory-based or ZIP-like |
| Read/write support | Widely supported in C, Fortran, Python | Native to Python; growing support in other languages |
| Parallel I/O | NetCDF4: limited MPI-based I/O | Supports cloud-native parallel access |
| Chunking | Yes | Yes |
| Compression | Yes | Yes (via codecs like Blosc, Zstd) |
| Storage layout | Monolithic file | Nested directories |
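Both formats are read through the same xarray interface; here is a minimal sketch (the file paths are hypothetical):

```python
import xarray as xr

# NetCDF: a single binary file
ds_nc = xr.open_dataset("data.nc")
# Zarr: a directory of chunk files
ds_zarr = xr.open_zarr("data.zarr")
```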
2. Why You Should Avoid Storing Thousands of Tiny NetCDF Files#
A very common anti-pattern in scientific workflows is generating one NetCDF file per timestep—e.g., hourly outputs for multiple years. While this may seem modular and easy to manage, it leads to serious problems:
Problems with Many Small Files:#
- Inode exhaustion: HPC and large file systems have a finite number of inodes (the metadata structures that track files). Creating millions of files exhausts them quickly; see the check below this list.
- Slow file system traversal: listing and accessing directories with many files is slow for both users and programs.
- NetCDF metadata overhead: each file carries its own header, and the cumulative cost of repeatedly reading metadata adds up.
- Harder to parallelize: reading many small files creates an I/O bottleneck as parallel workers contend for disk access.
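On most Linux file systems you can check how close you are to the inode limit with `df -i` (the path below is a placeholder):

```bash
# show inode usage for the file system holding the given path;
# watch the IUse% column
df -i /scratch/my_project
```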
Better Practice:#
Instead, concatenate individual timesteps into a single file per time period (e.g., one NetCDF file per year or variable), or convert your dataset into a chunked Zarr store for optimized parallel access.
3. Concatenating NetCDF Files#
To merge multiple NetCDF files along the `time` dimension:
Using Xarray (Python):#
```python
import xarray as xr

# lazily open all matching files and combine them along shared coordinates
ds = xr.open_mfdataset("file_*.nc", combine="by_coords")
ds.to_netcdf("merged.nc")
```
This drastically reduces the number of files, making the dataset easier to manage and more efficient to access.
Using NCO (NetCDF Operators):#
```bash
ncrcat file_*.nc merged.nc
```
Or, to guarantee chronological order, list the files explicitly:

```bash
# -h: don't append to the history attribute; -O: overwrite; -o: output file
ncrcat -h -O -o merged.nc file_1980.nc file_1981.nc file_1982.nc
```
4. Compressing NetCDF and Zarr Data#
Compression is essential to reduce disk usage and I/O time. Both NetCDF4 and Zarr support chunked compression, which stores data in compressed blocks (chunks).
NetCDF4 Compression via Xarray:#
```python
# enable zlib compression at level 4 for every data variable
comp = dict(zlib=True, complevel=4)
encoding = {var: comp for var in ds.data_vars}
ds.to_netcdf("compressed.nc", encoding=encoding)
```
Zarr Compression with Blosc (Recommended for Zarr):#
```python
import numcodecs

# Blosc with the Zstandard codec; byte-shuffle improves compression of floats
compressor = numcodecs.Blosc(cname="zstd", clevel=5, shuffle=numcodecs.Blosc.SHUFFLE)
encoding = {var: {"compressor": compressor} for var in ds.data_vars}
ds.chunk({"time": 100}).to_zarr("output.zarr", encoding=encoding)
```
Here `clevel`/`complevel` ranges from 1 to 9: level 1 is fast with light compression, level 9 is slow with heavy compression. Higher levels take longer to write but require less storage.
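If you are unsure which level to pick, a quick sweep over a representative dataset makes the trade-off concrete. This sketch reuses the `ds` from the examples above and compares output file sizes:

```python
import os

# write the same dataset at several zlib levels and compare sizes;
# timings and savings will vary with the data and the machine
for level in (1, 4, 9):
    encoding = {var: dict(zlib=True, complevel=level) for var in ds.data_vars}
    path = f"test_level{level}.nc"
    ds.to_netcdf(path, encoding=encoding)
    print(f"complevel={level}: {os.path.getsize(path) / 1e6:.1f} MB")
```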
Tips:#
- Use Zstandard (`zstd`) for the best speed/compression balance.
- Avoid compressing coordinates: they are small and accessed often.
- Chunk wisely: too many small chunks add overhead; too few cause memory pressure (see the sizing sketch below).
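A common rule of thumb is to aim for chunks of roughly 10-100 MB. A quick back-of-the-envelope check (the shape and dtype here are hypothetical):

```python
import numpy as np

# estimated size of one (time, lat, lon) chunk of float32 data
chunk_shape = (100, 180, 360)
bytes_per_chunk = np.prod(chunk_shape) * np.dtype("float32").itemsize
print(f"{bytes_per_chunk / 1e6:.1f} MB per chunk")  # ~25.9 MB
```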
5. Zarr-Specific Storage Considerations#
Zarr stores each chunk as a separate file. This makes it perfect for parallel cloud or object storage access but can still run into the many-small-file problem on traditional file systems.
Mitigations:#
- Combine chunks along multiple dimensions to reduce file count (e.g., time × lat)
- Use sharded Zarr v3, which groups multiple chunks into shard files (still in beta)
- Write to ZIP or tar archives when storing on traditional file systems (e.g., `zarr.ZipStore` or `TarStore`); see the sketch after the example below
```python
# larger chunks along several dimensions -> far fewer chunk files on disk
ds.chunk({"time": 500, "lat": 100, "lon": 100})
```
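As one mitigation, the zarr-python v2 API can pack an entire store into a single ZIP archive, so it occupies only one inode; this sketch assumes `ds` is the dataset from the earlier examples:

```python
import xarray as xr
import zarr

# write the whole store into one ZIP archive (one inode on disk)
store = zarr.ZipStore("output.zarr.zip", mode="w")
ds.to_zarr(store)
store.close()

# read it back later
ds_back = xr.open_zarr(zarr.ZipStore("output.zarr.zip", mode="r"))
```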
6. Final Tips and Summary#
| Recommendation | Format | Tool / Method |
|---|---|---|
| Avoid small files | NetCDF | Use `ncrcat` or `xarray.open_mfdataset` to concatenate |
| Compress chunks | Both | Use `zlib` (NetCDF4) or Blosc/Zstd codecs (Zarr) |
| Store locally | NetCDF | Prefer single files per year |
| Use shards | Zarr v3 | Coming soon to Zarr v3/Xarray |