Accessing audio dataset value throws Format not recognised error #7276

Open
fawazahmed0 opened this issue Nov 4, 2024 · 3 comments · May be fixed by #7278

Comments

fawazahmed0 commented Nov 4, 2024

Describe the bug

Iterating over an audio dataset raises soundfile.LibsndfileError with the message "Format not recognised" when one of the MP3 files is decoded.

Steps to reproduce the bug

code:

from datasets import load_dataset

dataset = load_dataset("fawazahmed0/bug-audio")

for data in dataset["train"]:
    print(data)

output:

(mypy) C:\Users\Nawaz-Server\Documents\ml>python myest.py
[C:\vcpkg\buildtrees\mpg123\src\0d8db63f9b-3db975bc05.clean\src\libmpg123\layer3.c:INT123_do_layer3():1801] error: dequantization failed!
{'audio': {'path': 'C:\\Users\\Nawaz-Server\\.cache\\huggingface\\hub\\datasets--fawazahmed0--bug-audio\\snapshots\\fab1398431fed1c0a2a7bff0945465bab8b5daef\\data\\Ghamadi\\037135.mp3', 'array': array([ 0.00000000e+00, -2.86519935e-22, -2.56504911e-21, ...,
       -1.94239747e-02, -2.42924765e-02, -2.99104657e-02]), 'sampling_rate': 22050}, 'reciter': 'Ghamadi', 'transcription': 'الا عجوز ا في الغبرين', 'line': 3923, 'chapter': 37, 'verse': 135, 'text': 'إِلَّا عَجُوزࣰ ا فِي ٱلۡغَٰبِرِينَ'}
Traceback (most recent call last):
  File "C:\Users\Nawaz-Server\Documents\ml\myest.py", line 5, in <module>
    for data in dataset["train"]:
                ~~~~~~~^^^^^^^^^
  File "C:\Users\Nawaz-Server\.conda\envs\mypy\Lib\site-packages\datasets\arrow_dataset.py", line 2372, in __iter__
    formatted_output = format_table(
                       ^^^^^^^^^^^^^
  File "C:\Users\Nawaz-Server\.conda\envs\mypy\Lib\site-packages\datasets\formatting\formatting.py", line 639, in format_table
    return formatter(pa_table, query_type=query_type)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Nawaz-Server\.conda\envs\mypy\Lib\site-packages\datasets\formatting\formatting.py", line 403, in __call__
    return self.format_row(pa_table)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Nawaz-Server\.conda\envs\mypy\Lib\site-packages\datasets\formatting\formatting.py", line 444, in format_row
    row = self.python_features_decoder.decode_row(row)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Nawaz-Server\.conda\envs\mypy\Lib\site-packages\datasets\formatting\formatting.py", line 222, in decode_row
    return self.features.decode_example(row) if self.features else row
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Nawaz-Server\.conda\envs\mypy\Lib\site-packages\datasets\features\features.py", line 2042, in decode_example
    column_name: decode_nested_example(feature, value, token_per_repo_id=token_per_repo_id)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Nawaz-Server\.conda\envs\mypy\Lib\site-packages\datasets\features\features.py", line 1403, in decode_nested_example
    return schema.decode_example(obj, token_per_repo_id=token_per_repo_id)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Nawaz-Server\.conda\envs\mypy\Lib\site-packages\datasets\features\audio.py", line 184, in decode_example
    array, sampling_rate = sf.read(f)
                           ^^^^^^^^^^
  File "C:\Users\Nawaz-Server\.conda\envs\mypy\Lib\site-packages\soundfile.py", line 285, in read
    with SoundFile(file, 'r', samplerate, channels,
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Nawaz-Server\.conda\envs\mypy\Lib\site-packages\soundfile.py", line 658, in __init__
    self._file = self._open(file, mode_int, closefd)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Nawaz-Server\.conda\envs\mypy\Lib\site-packages\soundfile.py", line 1216, in _open
    raise LibsndfileError(err, prefix="Error opening {0!r}: ".format(self.name))
soundfile.LibsndfileError: Error opening <_io.BufferedReader name='C:\\Users\\Nawaz-Server\\.cache\\huggingface\\hub\\datasets--fawazahmed0--bug-audio\\snapshots\\fab1398431fed1c0a2a7bff0945465bab8b5daef\\data\\Ghamadi\\037136.mp3'>: Format not recognised.
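
The loop above stops at the first file that fails to decode, so it does not show how many files are affected. A minimal sketch for listing every failing file is given below. The helper names `find_undecodable` and `_read_via_file_object` are hypothetical and are not part of `datasets` or `soundfile`; the default reader assumes that passing an open binary file handle to `sf.read` mirrors the `sf.read(f)` call shown in the traceback.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def _read_via_file_object(path: str):
    """Open the file in binary mode and pass the handle to soundfile.read."""
    import soundfile as sf  # imported lazily so the helper can be exercised with a stub

    with open(path, "rb") as f:
        return sf.read(f)


def find_undecodable(
    paths: Iterable[str],
    read: Optional[Callable[[str], object]] = None,
) -> List[Tuple[str, str]]:
    """Return (path, error message) for every file that `read` cannot decode."""
    read = read or _read_via_file_object
    failures = []
    for path in paths:
        try:
            read(path)
        except Exception as exc:  # broad on purpose: the goal is only to record the failure
            failures.append((path, f"{type(exc).__name__}: {exc}"))
    return failures
```

Calling `find_undecodable` on the MP3 paths in the local dataset snapshot would return an empty list if every file decodes, and one entry per failing file otherwise.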

Expected behavior

Iterating over the dataset should succeed, since reading the problematic audio file directly by path with the soundfile package works:

code:

import soundfile as sf

print(sf.read('C:\\Users\\Nawaz-Server\\.cache\\huggingface\\hub\\datasets--fawazahmed0--bug-audio\\snapshots\\fab1398431fed1c0a2a7bff0945465bab8b5daef\\data\\Ghamadi\\037136.mp3'))

output:

(mypy) C:\Users\Nawaz-Server\Documents\ml>python myest.py
[C:\vcpkg\buildtrees\mpg123\src\0d8db63f9b-3db975bc05.clean\src\libmpg123\layer3.c:INT123_do_layer3():1801] error: dequantization failed!
(array([ 0.00000000e+00, -8.43723821e-22, -2.45370628e-22, ...,
       -7.71464454e-03, -6.90496899e-03, -8.63333419e-03]), 22050)

Environment info

  • datasets version: 3.0.2
  • Platform: Windows-11-10.0.22621-SP0
  • Python version: 3.12.7
  • huggingface_hub version: 0.26.2
  • PyArrow version: 17.0.0
  • Pandas version: 2.2.3
  • fsspec version: 2024.10.0
  • soundfile: 0.12.1
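
The list above gives the Python package versions but not the version of the libsndfile library that soundfile wraps, which may be relevant to an MP3 decoding problem. A small sketch for collecting both follows. The function names are hypothetical; `soundfile.__libsndfile_version__` is read with `getattr` so the sketch degrades to `None` if the attribute or the package is unavailable.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def collect_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def libsndfile_version() -> Optional[str]:
    """Return the libsndfile version string reported by soundfile, if available."""
    try:
        import soundfile as sf
    except (ImportError, OSError):  # OSError: the libsndfile shared library is missing
        return None
    return getattr(sf, "__libsndfile_version__", None)
```

For example, `collect_versions(["datasets", "soundfile", "pyarrow"])` together with `libsndfile_version()` would give the information to attach to a report like this one.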
Member

lhoestq commented Nov 4, 2024

Hi! Can you check whether this works?

import soundfile as sf

with open('C:\\Users\\Nawaz-Server\\.cache\\huggingface\\hub\\datasets--fawazahmed0--bug-audio\\snapshots\\fab1398431fed1c0a2a7bff0945465bab8b5daef\\data\\Ghamadi\\037136.mp3', 'rb') as f:
    print(sf.read(f))

fawazahmed0 linked a pull request Nov 4, 2024 that will close this issue
@fawazahmed0
Author

@lhoestq Same error, here is the output:

(mypy) C:\Users\Nawaz-Server\Documents\ml>python myest.py
Traceback (most recent call last):
  File "C:\Users\Nawaz-Server\Documents\ml\myest.py", line 5, in <module>
    print(sf.read(f))
          ^^^^^^^^^^
  File "C:\Users\Nawaz-Server\.conda\envs\mypy\Lib\site-packages\soundfile.py", line 285, in read
    with SoundFile(file, 'r', samplerate, channels,
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Nawaz-Server\.conda\envs\mypy\Lib\site-packages\soundfile.py", line 658, in __init__
    self._file = self._open(file, mode_int, closefd)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Nawaz-Server\.conda\envs\mypy\Lib\site-packages\soundfile.py", line 1216, in _open
    raise LibsndfileError(err, prefix="Error opening {0!r}: ".format(self.name))
soundfile.LibsndfileError: Error opening <_io.BufferedReader name='C:\\Users\\Nawaz-Server\\.cache\\huggingface\\hub\\datasets--fawazahmed0--bug-audio\\snapshots\\fab1398431fed1c0a2a7bff0945465bab8b5daef\\data\\Ghamadi\\037136.mp3'>: Format not recognised.
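
The two experiments in this thread (reading by path succeeds, reading through an open file object fails) can be combined into one check. The sketch below is illustrative only: `compare_read_modes` is a hypothetical helper, and the default reader assumes `soundfile.read` is called exactly as in the snippets above.

```python
from typing import Callable, Dict, Optional


def _soundfile_read(source):
    """Call soundfile.read on a path or an open binary file object."""
    import soundfile as sf  # imported lazily so the helper can be exercised with a stub

    return sf.read(source)


def compare_read_modes(path: str, read: Optional[Callable] = None) -> Dict[str, str]:
    """Try reading `path` by name and via a file object; report 'ok' or the error."""
    read = read or _soundfile_read
    results = {}

    # Mode 1: pass the path string, as in the "Expected behavior" snippet.
    try:
        read(path)
        results["path"] = "ok"
    except Exception as exc:
        results["path"] = f"{type(exc).__name__}: {exc}"

    # Mode 2: pass an open binary file handle, as in the snippet suggested above.
    try:
        with open(path, "rb") as f:
            read(f)
        results["file_object"] = "ok"
    except Exception as exc:
        results["file_object"] = f"{type(exc).__name__}: {exc}"

    return results
```

On the file from this report, the outputs above suggest the `path` entry would be `ok` while the `file_object` entry would carry the "Format not recognised" error.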

@fawazahmed0
Author

Upstream bug: bastibe/python-soundfile#439
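
Until the upstream issue is resolved, one possible stopgap is to skip the built-in decoding and read each file by path, since that is the mode reported to work in this thread. This is a sketch under assumptions, not the fix proposed in the linked pull request: `decode_by_path` is a hypothetical helper, and it assumes each undecoded example exposes a local file path under `example["audio"]["path"]`, as the decoded output earlier in the thread does.

```python
from typing import Any, Callable, Dict, Optional


def _read_by_path(path: str):
    """Call soundfile.read with a path string instead of a file object."""
    import soundfile as sf  # imported lazily so the helper can be exercised with a stub

    return sf.read(path)


def decode_by_path(
    example: Dict[str, Any],
    read: Optional[Callable[[str], Any]] = None,
) -> Dict[str, Any]:
    """Return a copy of `example` whose audio entry is decoded from its file path."""
    read = read or _read_by_path
    path = example["audio"]["path"]
    array, sampling_rate = read(path)
    decoded_audio = {"path": path, "array": array, "sampling_rate": sampling_rate}
    # Leave every other column (reciter, transcription, ...) untouched.
    return {**example, "audio": decoded_audio}
```

To obtain undecoded examples, the audio column can be cast with `dataset.cast_column("audio", Audio(decode=False))` (with `Audio` imported from `datasets`), after which each row can be passed to `decode_by_path` while iterating.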
