
Error for table with checkpoint #669
Open
samansmink opened this issue Jan 31, 2025 · 2 comments

Describe the bug

When using kernel 0.6.1 in the DuckDB delta extension, the kernel returns an error from ffi::selection_vector_from_dv: Deletion Vector error: Unknown storage format: ''.
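
For reference, the Delta protocol defines only three valid values for a deletion vector descriptor's storageType field: 'u' (relative path derived from a UUID), 'i' (inline), and 'p' (absolute path). Anything else, including the empty string seen here, is rejected. A minimal Python sketch of that dispatch, illustrative only (the actual implementation in delta-kernel-rs is Rust):

def parse_dv_storage(storage_type: str) -> str:
    # Valid storageType values per the Delta protocol spec.
    if storage_type == "u":
        return "relative path derived from a UUID"
    if storage_type == "i":
        return "inline, encoded in pathOrInlineDv"
    if storage_type == "p":
        return "absolute path"
    # An empty string falls through to here, matching the
    # "Unknown storage format: ''" error in this report.
    raise ValueError(f"Unknown storage format: '{storage_type}'")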

This seems to be caused by checkpoint files, as I can only reproduce it for tables that have a checkpoint.
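
A quick way to confirm that the repro table actually has a checkpoint (this assumes delta-rs writes one automatically once the default checkpoint interval of 100 commits is reached; the path matches the repro below):

import glob

# Both should print non-empty results for a table that reproduces the error.
print(glob.glob("./repro_table/_delta_log/*.checkpoint.parquet"))
print(glob.glob("./repro_table/_delta_log/_last_checkpoint"))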

To Reproduce

Generate a test table in ./repro_table:

import duckdb
from deltalake import write_deltalake

# 100 single-row appends; enough commits for delta-rs to write a checkpoint.
for i in range(100):
    df = duckdb.query("select 1 as a;").df()
    write_deltalake("./repro_table", df, mode="append")

Build the latest main of the DuckDB delta extension.

Then query the table using:

SELECT * FROM delta_scan('./repro_table');

Expected behavior

No response

Additional context

No response

samansmink added the bug (Something isn't working) label on Jan 31, 2025
hntd187 (Collaborator) commented on Jan 31, 2025

Are we sure this isn't a bug with delta-rs? To my understanding, delta-rs doesn't support deletion vectors.

zachschuermann self-assigned this on Jan 31, 2025
samansmink (Author) commented on Feb 4, 2025

@hntd187 good point: doing something similar with PySpark works fine!

Using this to generate a table with a bunch of appends in ./repro_spark:

from pyspark.sql import SparkSession, Row
from pyspark.sql.types import StructType, StructField, StringType
from delta import configure_spark_with_delta_pip

builder = SparkSession.builder.appName("MyApp") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.driver.memory", "8g") \
    .config("spark.driver.host", "127.0.0.1")

spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Seed the table, then append 100 single-row commits so checkpoints get written.
spark.sql("CREATE TABLE test_table USING delta LOCATION './repro_spark' AS SELECT '1' as a;")

for i in range(100):
    df = spark.createDataFrame(
        data=[Row(a=f"{i}")],
        schema=StructType([StructField(name="a", dataType=StringType())])
    )
    df.write.format("delta").mode("append").save("./repro_spark")

This yields a delta table that reads just fine with ./build/debug/duckdb -c "FROM delta_scan('./repro_spark');"

It seems pretty plausible that delta-rs writes the deletion vector storageType field incorrectly in the checkpoint file; one way to check is sketched below.
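
A sketch for inspecting the checkpoint directly (assuming pyarrow is installed and the checkpoint schema exposes an add.deletionVector struct; the file layout matches the deltalake repro above). For a table with no deletes, the descriptor should be null for every add entry; an empty-string storageType would explain the kernel error:

import glob
import pyarrow.parquet as pq

for path in glob.glob("./repro_table/_delta_log/*.checkpoint.parquet"):
    add_actions = pq.read_table(path, columns=["add"]).column("add")
    for row in add_actions.to_pylist():
        if row is None:
            continue  # this checkpoint row holds a different action type
        # None here means "no deletion vector"; a dict with
        # storageType == '' would reproduce the reported failure.
        print(row["path"], "->", row.get("deletionVector"))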
