Improved docs for zarr encoding options. #9987

Open
benritchie opened this issue Jan 26, 2025 · 4 comments
Labels
topic-documentation topic-zarr Related to zarr storage library

Comments

@benritchie

What is your issue?

Hi
I've been trying to set zarr encoding options from xarray (zarr3).

Figuring out how to do this isn't straightforward. It wasn't too hard to get it working for most zarr compressors, but getting it working for the two array-to-bytes codecs (ArrayBytesCodecs), ZFPY and PCodec, was rather harder. It turned out that array-to-bytes codecs need to be specified under the serializer key of the encoding object, rather than as compressors.
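A minimal sketch of that distinction, assuming the three Zarr v3 codec kinds map onto the `filters`, `serializer`, and `compressors` encoding keys (these mirror the keyword names of zarr-python 3's `create_array`). The helper name `encoding_entry` and the mapping constant are hypothetical and not part of xarray or zarr:

```python
# Hypothetical helper: picks the encoding key for a codec based on its kind.
# Assumption: xarray forwards these keys unchanged to zarr-python 3.
ENCODING_KEY_BY_CODEC_KIND = {
    "array-to-array": "filters",      # transforms applied before serialization
    "array-to-bytes": "serializer",   # a single codec per array (ZFPY, PCodec, ...)
    "bytes-to-bytes": "compressors",  # zero or more codecs (Blosc, Zstd, ...)
}


def encoding_entry(kind, codec):
    """Return a per-variable encoding dict placing `codec` under the right key."""
    key = ENCODING_KEY_BY_CODEC_KIND[kind]
    # The serializer slot holds a single codec; the other slots hold sequences.
    return {key: codec} if key == "serializer" else {key: [codec]}
```

For instance, `encoding_entry("array-to-bytes", my_codec)` would place `my_codec` (a placeholder for a codec instance) under `"serializer"`.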

Anyway, to cut to the chase: I think some better documentation of the format of the encoding object would be useful. I've not been able to find any, and resorted to reading the source code to find the above parameter.

I'm happy to help write this if useful, but could use a pointer to the best place to put the docs (I'm new to making xarray changes).

I should say, though: the fact that this works at all just a few days after the zarr3 release is great!

Thanks

Format strings that seem to be working for me are as follows. (Arguably the details of codec naming belong more in zarr land, but the serializer keyword is, as far as I can see, an xarray invention, so it should be documented in xarray.)

For ArrayBytesCodecs:

encoding = {"serializer": numcodecs.zarr3.()}

For BytesBytesCodecs:
if in numcodecs:
encoding = {"compressor": numcodecs.zarr3.()}

and if native zarr3:
(note different codec name format)
encoding = {"compressor": zarr.codecs.ZstdCodec()}
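As a rough sketch of how the native zarr3 form might be applied per variable, the hypothetical helper below builds an encoding dict for `to_zarr()`. It uses the plural `compressors` key with a list, the form reported to work later in this thread, and `zarr.codecs.ZstdCodec` from zarr-python 3. The function name and the default `level=3` are illustrative choices, not anything xarray or zarr prescribes:

```python
def native_zstd_encoding(var_names, level=3):
    """Build a to_zarr() encoding dict that applies Zstd to each named variable."""
    # Imported lazily so this sketch can be loaded without zarr-python 3 installed.
    from zarr.codecs import ZstdCodec

    # One fresh codec instance per variable, passed under "compressors".
    return {name: {"compressors": [ZstdCodec(level=level)]} for name in var_names}


# Usage (assumes `ds` is an xarray.Dataset with a "temperature" variable):
# ds.to_zarr("my_store", mode="w", encoding=native_zstd_encoding(["temperature"]))
```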

@benritchie benritchie added the needs triage Issue that has not been reviewed by xarray team member label Jan 26, 2025

welcome bot commented Jan 26, 2025

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@TomNicholas TomNicholas added topic-documentation topic-zarr Related to zarr storage library and removed needs triage Issue that has not been reviewed by xarray team member labels Jan 27, 2025
@dcherian
Contributor

Thanks for volunteering to contribute!

On the xarray end you can pass whatever is accepted by Zarr. So perhaps the best thing to do is write out a line and an example to that effect, and add more detailed docs over at Zarr.

@NathanCummings

Does the error I am seeing relate to this? It happens when I try to call to_zarr() on a Dataset that was loaded from a Zarr group written using Zarr v2 (if any of that is relevant).
TypeError: Expected a BytesBytesCodec. Got <class 'numcodecs.blosc.Blosc'> instead.

I noticed that the same error appears in the code block in your documentation here.

@lukasbindreiter
Contributor

lukasbindreiter commented Feb 3, 2025

I encountered an issue with that as well just now.
I'm somewhat new to zarr specifically (but experienced with xarray and NetCDF and the way compression works there), and the situation is still quite confusing to me. The documentation on this could certainly be improved.

This was my journey right now in trying to achieve this:

Using this dataset:

import xarray as xr
import numpy as np

ds = xr.Dataset(
    {
        # all zeros to verify by disk size whether it was compressed or not
        "temperature": (("x", "y", "time"), np.zeros((50, 60, 1000))),
    },
    coords={
        "x": np.arange(50),
        "y": np.arange(60),
        "time": np.arange(1000),
    },
)

My initial attempt, based on a quick search on how to do this:

import zarr

ds.to_zarr("my_store", consolidated=False, mode="w", encoding={"temperature": {"compressors": [zarr.Blosc()]}})

gives:
AttributeError: module 'zarr' has no attribute 'Blosc'

Ok, it seems those examples were for zarr2. Fortunately the migration guide mentions this, so here goes the next attempt:

import numcodecs

ds.to_zarr("my_store", consolidated=False, mode="w", encoding={"temperature": {"compressors": [numcodecs.Blosc()]}})

gives the following error:
TypeError: Expected a BytesBytesCodec. Got <class 'numcodecs.blosc.Blosc'> instead.
(which is the same one that also appears in the docs as @NathanCummings has pointed out)

Ok, this is strange. After some research and messing around with the numcodecs library and its exposed API, I found this:

import numcodecs.zarr3

ds.to_zarr("my_store", consolidated=False, mode="w", encoding={"temperature": {"compressors": [numcodecs.zarr3.Blosc()]}})

This does, in fact, write out a zarr store. However, I do get a very confusing UserWarning:

...path/python3.12/site-packages/numcodecs/zarr3.py:132: UserWarning: Numcodecs codecs are not in the Zarr version 3 specification and may not be supported by other zarr implementations.
  super().__init__(**codec_config)

Checking whether it was compressed or not:

!du -h my_store

4.0K	my_store/time/c
8.0K	my_store/time
4.0K	my_store/temperature
4.0K	my_store/x/c
8.0K	my_store/x
4.0K	my_store/y/c
8.0K	my_store/y
 32K	my_store

Yes it was, otherwise temperature wouldn't be 4.0K
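For a check that does not depend on `du`, a small standard-library sketch like the one below could sum the file sizes under a store directory. `dir_size_bytes` is a hypothetical helper name. Note that it reports exact byte counts, whereas `du -h` reports disk usage rounded up to filesystem blocks, which is why small entries all show as 4.0K above:

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of all regular files under `path`, recursively."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total
```

Calling `dir_size_bytes("my_store/temperature")` before and after changing the encoding would give a finer-grained comparison than the block-rounded listing.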

However, if I leave out the compressor

!rm -rf my_store  # remove previous

ds.to_zarr("my_store", consolidated=False, mode="w", encoding={"temperature": {"compressors": []}})

!du -h my_store

I still get the same result: 4.0K my_store/temperature

So I'm assuming a compressor is already used by default if none is specified? But the default compressor doesn't produce that UserWarning, so I assume it must be something other than one of the numcodecs.zarr3.* codecs, right?
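One way to answer that without guessing from disk size might be to read the array metadata directly. The sketch below assumes a local directory store in Zarr v3 layout, where each array keeps its metadata in `<array>/zarr.json` with the codec pipeline listed under `"codecs"`. `stored_codecs` is a hypothetical helper, not an xarray or zarr function:

```python
import json
from pathlib import Path


def stored_codecs(store_path, array_name):
    """Return the codec names recorded in a Zarr v3 array's metadata file."""
    meta_path = Path(store_path) / array_name / "zarr.json"
    metadata = json.loads(meta_path.read_text())
    # Zarr v3 array metadata lists the codec pipeline under "codecs".
    return [codec["name"] for codec in metadata.get("codecs", [])]
```

Calling `stored_codecs("my_store", "temperature")` after each write would show which codecs were actually recorded. It may also be worth noting that the `du` listing above shows no `my_store/temperature/c` entry, so possibly no chunk files were written for `temperature` at all (zarr can skip chunks that equal the fill value). If so, directory size would look the same with or without a compressor. That is an assumption here, not something confirmed in this thread.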

I only found out about the serializer option at all from this GitHub issue.
