Accessing audio dataset value throws Format not recognised error #7276

fawazahmed0 · 2024-11-04T05:59:13Z

Describe the bug

Accessing audio dataset value throws Format not recognised error

Steps to reproduce the bug

code:

from datasets import load_dataset

dataset = load_dataset("fawazahmed0/bug-audio")

for data in dataset["train"]:
    print(data)

output:

(mypy) C:\Users\Nawaz-Server\Documents\ml>python myest.py
[C:\vcpkg\buildtrees\mpg123\src\0d8db63f9b-3db975bc05.clean\src\libmpg123\layer3.c:INT123_do_layer3():1801] error: dequantization failed!
{'audio': {'path': 'C:\\Users\\Nawaz-Server\\.cache\\huggingface\\hub\\datasets--fawazahmed0--bug-audio\\snapshots\\fab1398431fed1c0a2a7bff0945465bab8b5daef\\data\\Ghamadi\\037135.mp3', 'array': array([ 0.00000000e+00, -2.86519935e-22, -2.56504911e-21, ...,
       -1.94239747e-02, -2.42924765e-02, -2.99104657e-02]), 'sampling_rate': 22050}, 'reciter': 'Ghamadi', 'transcription': 'الا عجوز ا في الغبرين', 'line': 3923, 'chapter': 37, 'verse': 135, 'text': 'إِلَّا عَجُوزࣰ ا فِي ٱلۡغَٰبِرِينَ'}
Traceback (most recent call last):
  File "C:\Users\Nawaz-Server\Documents\ml\myest.py", line 5, in <module>
    for data in dataset["train"]:
                ~~~~~~~^^^^^^^^^
  File "C:\Users\Nawaz-Server\.conda\envs\mypy\Lib\site-packages\datasets\arrow_dataset.py", line 2372, in __iter__
    formatted_output = format_table(
                       ^^^^^^^^^^^^^
  File "C:\Users\Nawaz-Server\.conda\envs\mypy\Lib\site-packages\datasets\formatting\formatting.py", line 639, in format_table
    return formatter(pa_table, query_type=query_type)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Nawaz-Server\.conda\envs\mypy\Lib\site-packages\datasets\formatting\formatting.py", line 403, in __call__
    return self.format_row(pa_table)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Nawaz-Server\.conda\envs\mypy\Lib\site-packages\datasets\formatting\formatting.py", line 444, in format_row
    row = self.python_features_decoder.decode_row(row)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Nawaz-Server\.conda\envs\mypy\Lib\site-packages\datasets\formatting\formatting.py", line 222, in decode_row
    return self.features.decode_example(row) if self.features else row
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Nawaz-Server\.conda\envs\mypy\Lib\site-packages\datasets\features\features.py", line 2042, in decode_example
    column_name: decode_nested_example(feature, value, token_per_repo_id=token_per_repo_id)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Nawaz-Server\.conda\envs\mypy\Lib\site-packages\datasets\features\features.py", line 1403, in decode_nested_example
    return schema.decode_example(obj, token_per_repo_id=token_per_repo_id)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Nawaz-Server\.conda\envs\mypy\Lib\site-packages\datasets\features\audio.py", line 184, in decode_example
    array, sampling_rate = sf.read(f)
                           ^^^^^^^^^^
  File "C:\Users\Nawaz-Server\.conda\envs\mypy\Lib\site-packages\soundfile.py", line 285, in read
    with SoundFile(file, 'r', samplerate, channels,
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Nawaz-Server\.conda\envs\mypy\Lib\site-packages\soundfile.py", line 658, in __init__
    self._file = self._open(file, mode_int, closefd)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Nawaz-Server\.conda\envs\mypy\Lib\site-packages\soundfile.py", line 1216, in _open
    raise LibsndfileError(err, prefix="Error opening {0!r}: ".format(self.name))
soundfile.LibsndfileError: Error opening <_io.BufferedReader name='C:\\Users\\Nawaz-Server\\.cache\\huggingface\\hub\\datasets--fawazahmed0--bug-audio\\snapshots\\fab1398431fed1c0a2a7bff0945465bab8b5daef\\data\\Ghamadi\\037136.mp3'>: Format not recognised.

Expected behavior

Everything should work fine, as loading the problematic audio file directly with soundfile package works fine

code:

import soundfile as sf

print(sf.read('C:\\Users\\Nawaz-Server\\.cache\\huggingface\\hub\\datasets--fawazahmed0--bug-audio\\snapshots\\fab1398431fed1c0a2a7bff0945465bab8b5daef\\data\\Ghamadi\\037136.mp3'))

output:

(mypy) C:\Users\Nawaz-Server\Documents\ml>python myest.py
[C:\vcpkg\buildtrees\mpg123\src\0d8db63f9b-3db975bc05.clean\src\libmpg123\layer3.c:INT123_do_layer3():1801] error: dequantization failed!
(array([ 0.00000000e+00, -8.43723821e-22, -2.45370628e-22, ...,
       -7.71464454e-03, -6.90496899e-03, -8.63333419e-03]), 22050)

Environment info

datasets version: 3.0.2
Platform: Windows-11-10.0.22621-SP0
Python version: 3.12.7
huggingface_hub version: 0.26.2
PyArrow version: 17.0.0
Pandas version: 2.2.3
fsspec version: 2024.10.0
soundfile: 0.12.1

The text was updated successfully, but these errors were encountered:

lhoestq · 2024-11-04T17:38:19Z

Hi ! can you try if this works ?

import soundfile as sf

with open('C:\\Users\\Nawaz-Server\\.cache\\huggingface\\hub\\datasets--fawazahmed0--bug-audio\\snapshots\\fab1398431fed1c0a2a7bff0945465bab8b5daef\\data\\Ghamadi\\037136.mp3', 'rb') as f:
    print(sf.read(f))

fawazahmed0 · 2024-11-04T17:44:35Z

@lhoestq Same error, here is the output:

(mypy) C:\Users\Nawaz-Server\Documents\ml>python myest.py
Traceback (most recent call last):
  File "C:\Users\Nawaz-Server\Documents\ml\myest.py", line 5, in <module>
    print(sf.read(f))
          ^^^^^^^^^^
  File "C:\Users\Nawaz-Server\.conda\envs\mypy\Lib\site-packages\soundfile.py", line 285, in read
    with SoundFile(file, 'r', samplerate, channels,
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Nawaz-Server\.conda\envs\mypy\Lib\site-packages\soundfile.py", line 658, in __init__
    self._file = self._open(file, mode_int, closefd)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Nawaz-Server\.conda\envs\mypy\Lib\site-packages\soundfile.py", line 1216, in _open
    raise LibsndfileError(err, prefix="Error opening {0!r}: ".format(self.name))
soundfile.LibsndfileError: Error opening <_io.BufferedReader name='C:\\Users\\Nawaz-Server\\.cache\\huggingface\\hub\\datasets--fawazahmed0--bug-audio\\snapshots\\fab1398431fed1c0a2a7bff0945465bab8b5daef\\data\\Ghamadi\\037136.mp3'>: Format not recognised.

fawazahmed0 · 2024-11-09T18:51:51Z

upstream bug: bastibe/python-soundfile#439

fawazahmed0 linked a pull request Nov 4, 2024 that will close this issue

Let soundfile directly read local audio files #7278

Open

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Accessing audio dataset value throws Format not recognised error #7276

Accessing audio dataset value throws Format not recognised error #7276

fawazahmed0 commented Nov 4, 2024 •

edited

Loading

lhoestq commented Nov 4, 2024

fawazahmed0 commented Nov 4, 2024

fawazahmed0 commented Nov 9, 2024

Accessing audio dataset value throws Format not recognised error #7276

Accessing audio dataset value throws Format not recognised error #7276

Comments

fawazahmed0 commented Nov 4, 2024 • edited Loading

Describe the bug

Steps to reproduce the bug

Expected behavior

Environment info

lhoestq commented Nov 4, 2024

fawazahmed0 commented Nov 4, 2024

fawazahmed0 commented Nov 9, 2024

fawazahmed0 commented Nov 4, 2024 •

edited

Loading