Zarr is an open standard for storing large multidimensional array data. It specifies a protocol and data format, and is designed to be "cloud ready": data are divided into subsets called chunks, which allows random access to parts of an array.[1][2] Zarr can be used from many programming languages, including Python, Java, JavaScript, C++, Rust and Julia.[3] It has been used by organizations such as Google and Microsoft to publish large datasets.[4][5] The first versions of Zarr were released in 2015 by Alistair Miles.[6][7]

Zarr is designed to support high-throughput distributed I/O on different storage systems, a common requirement in cloud computing. A Zarr array can efficiently serve multiple read operations in parallel, or multiple write operations in parallel.[8]
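
As a rough illustration of why chunked storage suits parallel reads, the following Python sketch fetches several chunks concurrently. It does not use any Zarr library: the `store` dictionary, the `chunk/N` key names, the `read_chunk` helper and the choice of zlib compression are all assumptions made for this example, standing in for an object store that holds one compressed blob per chunk.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for an object store: one compressed blob per chunk key.
# The key naming is illustrative only, not Zarr's actual layout.
store = {
    f"chunk/{i}": zlib.compress(bytes([i]) * 1024)
    for i in range(8)
}


def read_chunk(key):
    """Fetch and decompress one chunk; no other key is touched."""
    return key, zlib.decompress(store[key])


# Each chunk is an independent object, so the reads can run concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(read_chunk, sorted(store)))

total_bytes = sum(len(data) for data in results.values())
```

Because no read depends on another chunk, the same pattern could in principle be spread across threads, processes or machines.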

Format description

The main data structure in Zarr is the multidimensional array. To allow parallel access, each array is stored and accessed as a grid of "chunks". The actual on-disk format depends on the compressor and storage plugins selected by the user.[8]
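
The chunk grid can be pictured with a little index arithmetic. The Python sketch below assumes a regular grid in which every chunk has the same shape (chunks at the edges may be partly empty). The function names and the array and chunk shapes are hypothetical, chosen only for illustration, and nothing here is taken from a Zarr implementation.

```python
from itertools import product
from math import ceil


def chunk_grid_shape(shape, chunks):
    """Number of chunks along each dimension (edge chunks may be partial)."""
    return tuple(ceil(s / c) for s, c in zip(shape, chunks))


def locate(index, chunks):
    """Map an element index to (chunk coordinates, offset inside that chunk)."""
    chunk_coords = tuple(i // c for i, c in zip(index, chunks))
    offset = tuple(i % c for i, c in zip(index, chunks))
    return chunk_coords, offset


def chunks_for_slice(start, stop, chunks):
    """Coordinates of every chunk touched by the half-open region [start, stop)."""
    ranges = [range(a // c, (b - 1) // c + 1) for a, b, c in zip(start, stop, chunks)]
    return list(product(*ranges))


shape = (10_000, 10_000)   # hypothetical array shape
chunks = (1_000, 1_000)    # hypothetical chunk shape

grid = chunk_grid_shape(shape, chunks)                        # (10, 10)
where = locate((2_500, 7_200), chunks)                        # ((2, 7), (500, 200))
touched = chunks_for_slice((900, 0), (1_100, 500), chunks)    # [(0, 0), (1, 0)]
```

Reading the region in the last line needs only two of the hundred chunks, which is the sense in which chunking gives random access to part of an array.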

Figure: An illustration of Zarr's chunking data format.

Zarr's design was influenced by that of HDF5, and so it includes similar features for metadata and grouping: arrays can be grouped into named hierarchies, and they can also be annotated with key-value metadata stored alongside the array.[8]
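
The grouping and metadata model described above can be sketched as a small tree of Python objects. This is a simplified illustration, not the API of any Zarr library: the `Group` and `Array` classes, the `lookup` method, and the example names and attribute values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Array:
    shape: tuple
    chunks: tuple
    attrs: dict = field(default_factory=dict)    # key-value metadata


@dataclass
class Group:
    attrs: dict = field(default_factory=dict)    # key-value metadata
    members: dict = field(default_factory=dict)  # name -> Group or Array

    def lookup(self, path):
        """Resolve a slash-separated path such as 'climate/temperature'."""
        node = self
        for name in path.strip("/").split("/"):
            node = node.members[name]
        return node


# Build a two-level hierarchy with one annotated array (illustrative values).
root = Group(attrs={"title": "example dataset"})
root.members["climate"] = Group()
root.members["climate"].members["temperature"] = Array(
    shape=(365, 180, 360),
    chunks=(30, 90, 90),
    attrs={"units": "K"},
)

temperature = root.lookup("climate/temperature")
```

Here `temperature.attrs["units"]` plays the role of the key-value metadata stored alongside an array, and the path `climate/temperature` plays the role of a name within a hierarchy.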

Applications

For bioimaging such as microscopy, a consortium called the Open Microscopy Environment (OME) created a format called "OME-Zarr", based on Zarr with some discipline-specific extensions.[9] Zarr is also used to publish weather and satellite data[10] and energy data,[11] among others.

References

  1. ^ "Zarr - chunked, compressed, N-dimensional arrays". zarr.dev. Retrieved 2024-09-12.
  2. ^ "Cloud-Optimized Geospatial Formats Guide: Zarr". guide.cloudnativegeo.org. Retrieved 2024-09-12.
  3. ^ "Zarr Implementations". zarr.dev. Retrieved 2025-01-09.
  4. ^ "Google Cloud: ERA5 data". cloud.google.com. Retrieved 2024-09-12.
  5. ^ "Microsoft Planetary Computer: Reading Zarr Data". planetarycomputer.microsoft.com. Retrieved 2024-09-12.
  6. ^ "zarr - PyPI". Retrieved 2025-02-10.
  7. ^ Alistair Miles (2016-04-14). "To HDF5 and beyond". Retrieved 2025-02-10.
  8. ^ a b c "Zarr - Tutorial". zarr.readthedocs.io. Retrieved 2024-09-12.
  9. ^ Moore, Josh (2023). "OME-Zarr: a cloud-optimized bioimaging file format with international community support". Histochemistry and Cell Biology. 160 (3). Springer Science and Business Media LLC: 223–251. doi:10.1007/s00418-023-02209-1. hdl:1721.1/151126. ISSN 1432-119X. PMC 10492740. PMID 37428210.
  10. ^ "Lazy loading: Making it easier to access vast datasets of weather & satellite data". openclimatefix.org. Retrieved 2024-09-12.
  11. ^ Sansal, Altay; Kainkaryam, Sribharath; Lasscock, Ben; Valenciano, Alejandro (2023). "MDIO: Open-source format for multidimensional energy data". The Leading Edge. 42 (7). Society of Exploration Geophysicists: 465–473. Bibcode:2023LeaEd..42..465S. doi:10.1190/tle42070465.1. ISSN 1938-3789.

