Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: Unclear how to materialize a table or list materialized tables #7742

Open
1 task done
OlivierBinette opened this issue Dec 13, 2023 · 22 comments
Open
1 task done
Labels
docs Documentation related issues or PRs

Comments

@OlivierBinette
Copy link

Please describe the issue

Context:

I'm using Ibis to do data pre-processing, and then using the name of the materialized table in the database as the input to another pipeline.

Problem:

After spending some time working with Ibis and reading through the docs, I'm still unclear about the relationship between Ibis tables, the tables listed in con.tables, and the views and tables in my database.

Specifically, I think the documentation should clarify:

  • When and how a lazy Ibis table becomes a view or a table in your backend database.
  • How Ibis Tables get_name() function returns a value and how this name relates to the physical table name.
  • How you can list materialized tables, or distinguish between unevaluated Ibis table expressions and database tables.
  • How you can go from an Ibis Table to workinig directly with the underlying table reference (either through SQLAlchemy or your database' official Python client).

Code of Conduct

  • I agree to follow this project's Code of Conduct
@OlivierBinette OlivierBinette added the docs Documentation related issues or PRs label Dec 13, 2023
@cpcloud
Copy link
Member

cpcloud commented Dec 14, 2023

Hi @OlivierBinette 👋🏻!

Thanks for opening an issue about this.

I will write some things down here that can hopefully be used by myself or others to seed a documentation PR.

When and how a lazy Ibis table becomes a view or a table in your backend database.

As for "how", there are generally two methods for doing that:

  • newly_created_table = con.create_table(name, table_expr)
  • newly_created_view = con.create_view(name, table_expr)

Both of newly_created_table and newly_created_view are Ibis table expressions backed by some named physical entity in the database. At the expression level Ibis doesn't distinguish between a view and a table.

As for "when", that's generally when you invoke those methods.

How Ibis Tables get_name() function returns a value and how this name relates to the physical table name.

This name corresponds the physical table's name only in the case that the table is backed by a physical entity in the database like a table or view. Otherwise it's an ephemeral name.

I would try to avoid depending on the return value of this method being anything in particular for an arbitrary table expression.

While get_name() is a public API, it's not entirely clear to me whether it's useful. That said, if we were to remove it it would go through a deprecation cycle (1 or 2 minor/patch releases) and only be explicitly removed in a major release.

How you can list materialized tables

con.list_tables() returns a list of strings, one string for each table or view.

or distinguish between unevaluated Ibis table expressions and database tables.

All Ibis table expressions are unevaluated until you call to_pandas/to_pyarrow/to_pyarrow_batches.

If you're asking how to distinguish between whether something is a leaf in the expression tree or not, there's no public API for doing this right now.

Can you describe what you want to do with this information and/or more about your use case? In practice this hasn't come up IIRC.

How you can go from an Ibis Table to workinig directly with the underlying table reference (either through SQLAlchemy or your database' official Python client).

We don't have a generic API for this, partly because each backend's target representation is different. Some backends work directly with strings and don't have anything analogous to sqlalchemy.Table (for example).

Can you describe what you're trying to do here?

@OlivierBinette
Copy link
Author

OlivierBinette commented Dec 14, 2023

Thanks for your answers!

Follow-up question:

Does con.list_tables() return the name of unmaterialized tables in some cases, or of tables registered in a different schema or database? I'm working with DuckDB in particular, and if I'm not mistaken some of the tables listed in listed_tables might be in the temp rather than main. I'm not 100% sure though as I think there has been some changes with list_tables() recently.

Right now, I use a direct connection to my database (using DuckDB's Python client) to query my database and find whether or not a table has been materialized:

def material_tables_list(self):
        material_tables_data = self.duckdb_connection.sql("show tables").fetchall()
        columns_data = list(zip(*material_tables_data))
        if len(columns_data) > 0:
            return columns_data[0]
        else:
            return []

For your final question question:

The pipeline that I use after pre-processing data with Ibis expects the name of a table in my database or a DuckDBPyRelation object. It could possibly accomodate a sqlalchemy.Table class as well. This requires making sure that the input Ibis Table has been materialized and then fetching its physical name.

For more information, here's what I do:

  1. I have data processing functions that take an Ibis Table as an input and return an Ibis Table as output.
  2. Some of these data processing functions rely on a secondary pipeline which takes a table name (and a connection to my database) as arguments. In these functions, I check whether or not the input Ibis Table has been materialized, materialize it if not, and then use the get_name() function to pass the table name to the secondary pipeline. The output of the pipeline is a new table in my database, and I return an Ibis Table reference to it.
  3. Data scientists organize these data processing functions into a pipeline, by specifying dependencies across steps. An orchestration tool then runs the pipeline, making sure that relevant intermediate outputs are both materialized to the local database and saved to a data warehouse used for persistent storage.

Happy to share code if this is too abstract. I could put together a minimal example.

@OlivierBinette
Copy link
Author

@cpcloud Big picture, I think the key thing is that I am using Ibis Table objects not just for data manipulation, but also as my abstract table interface in my system.

As such, I need to not only be able to create Ibis Tables, but I also need to be able to go from an Ibis Table back to my backend database. The common interface for this might be to materialize a table in a database/schema/namespace of your choice (or even just the default one), and then get the name of the physical table as a string. I think this is mostly supported by Ibis, but the documentation might need to be clarified a bit and that use case could be highlighted.

@NickCrews
Copy link
Contributor

I think the use @OlivierBinette describes here is nearly the exact same use case that I had in #6636.

I also started #7762 if anyone wants to chime in there.

I think better docs would logically get added to https://ibis-project.org/how-to/input-output/basics#data-platforms. @OlivierBinette what parts of that guide are lacking that we should add?

@OlivierBinette
Copy link
Author

@NickCrews Maybe the data-platforms page could include a few points about the following:

  • The use of list_tables() to fetch table names, making sure to highlight that some of the tables listed there may be in different database schemas (e.g., main and temp for DuckDB), and explaining how exactly all of that is managed.
  • Best practices to check if a table is materialized (specifically in the main/default schema, not in a temporary schema)
  • How to materialize a (possibly persistent) table if it is not already materialized.
  • How to get a table's physical name or full location in the backend, when it is has one.

I don't mind whether or not Ibis' Table API directly implements functions to do these things, but I definitely want to have enough information so that I can implement reliable functions that do these things for a specific backend I might be using.

@cpcloud
Copy link
Member

cpcloud commented Dec 16, 2023

@OlivierBinette I'm having a bit of trouble understanding what you mean by "materialize".

Does that mean creating a table in the database from an expression, or executing an expression (getting back in-memory results), or something else?

@OlivierBinette
Copy link
Author

OlivierBinette commented Dec 16, 2023

@cpcloud By "materialize", I mean creating a persistent table in the database from an expression. I.e., the Ibis Table object should point a table in the backend database. Specifically:

  • This table should have a fixed name/location in the database that can be referenced by other, separate connections to the database.
  • The table should be persisted to the database, i.e. not be dropped once Ibis' connection to the database is closed.

Maybe "persisted" would be a better word here.

Here's how I currently check if a table has been "materialized", using a separate direct connection self.duckdb_connection to my DuckDB database:

def material_tables_list(self):
        material_tables = self.duckdb_connection.sql("show tables").fetchall()
        columns_data = list(zip(*material_tables))
        if len(columns_data) > 0:
            return columns_data[0]
        else:
            return []

And here's how I create a "material view" for a given table:

def material_view(self, table: Table, table_name: Optional[str] = None, overwrite=False):
        if table_name is None:
            table_name = table.get_name() # This raises an error if the Ibis table refers to an unnamed expression. A uuid should be created in that case.

        if overwrite or table_name not in self.material_tables_list():
            return self.db.create_table(table_name, table, overwrite=True)
        else:
            return self.db.tables[table_name]

@cpcloud
Copy link
Member

cpcloud commented Dec 16, 2023

Hm,

"material view"

is a confusing way to refer to "disk-backed table", especially since many platforms these days have an object called a "materialized view", which is neither a table nor is it a traditional view 😅

I think we need some documentation that gives some conceptual foundations about what words like "view" and "table" mean. They're independent of ibis, but I think we should all try to get on the same page with respect to database nouns to the extent possible.

For example, both tables and views have the ability to be temporary and overwritten, but only tables have data that takes up storage space.

@cpcloud
Copy link
Member

cpcloud commented Dec 16, 2023

To clarify a couple things:

This table should have a fixed name/location in the database that can be referenced by other, separate connections to the database.

This is the notion of a non-temporary tabular object in a database, i.e., a thing created with either CREATE VIEW or CREATE TABLE, no TEMPORARY.

The table should be persisted to the database, i.e. not be dropped once Ibis' connection to the database is closed.

This is the same notion as the first bullet point. Unless a view or a table was created with temp=True (which maps to the TEMPORARY keyword in SQL), the object will persist beyond the Ibis session.

@OlivierBinette
Copy link
Author

@cpcloud Thanks for the clarifications! I think what confuses things more is that an Ibis "table" can be a database table, a database view, an unevaluated expression, etc. So ultimately the documentation issue is to clarify the relationship between the Ibis and backend concepts, and how to figure out how one maps to the other.

For my application, I often want to make sure that an Ibis table refers to a non-temporary database table, or to make an Ibis table to a non-temporary database table if that's not already the case.

@cpcloud
Copy link
Member

cpcloud commented Dec 16, 2023

For my application, I often want to make sure that an Ibis table refers to a non-temporary database table

I would say that there's no need verify this if you're using create_table. Its result will always be a persistent named table taking up space on disk.

or to make an Ibis table to a non-temporary database table if that's not already the case.

I think create_table is the answer here as well:

t = con.create_table(name, any_table_expression)  # the second argument can be a view or a table

The return value of create_table called this way will always be a table that persists beyond the ibis session.

@OlivierBinette
Copy link
Author

OlivierBinette commented Dec 16, 2023

@cpcloud

I would say that there's no need verify this if you're using create_table. Its result will always be a persistent named table taking up space on disk.

I have functions that take as input Ibis Table objects, and unfortunately I don't know how the user created them. I'm using Ibis as a high-level Table interface abstraction for a piece of software, but I'm not responsible for supplying the input.

Should I do try/catch on con.create_table? This function throws an error if the table already exists, and I can overwrite otherwise. However, the table could already exist and be (for DuckDB) in the "temp" database schema instead of the "main" schema.

There's also a cache() function for tables, but I'm not sure what to expect of it regarding the backend.

Overall, I need a better understanding of how I should inspect Ibis Table objects against their backend.

@cpcloud
Copy link
Member

cpcloud commented Dec 16, 2023

I have functions that take as input Ibis Table objects, and I don't know how the user created them.

Ah, okay, that helps me understand a bit more about what you're doing.

Should I do try/catch on con.create_table? This function throws an error if the table already exists, and I can overwrite otherwise.

Interesting, we actually had a force argument that would generate a CREATE TABLE t IF NOT EXISTS statement which sounds like potentially what you're looking for. The semantics are: create the table as usual if it doesn't exist otherwise return the existing table.

Perhaps we should consider bringing that back, with a better name of course 😄

There's also a cache() function for tables, but I'm not sure what to expect of it regarding the backend.

cache is explicitly designed to be session-bound: any cached tables are temporary and dropped at the end of the session.

@cpcloud
Copy link
Member

cpcloud commented Dec 19, 2023

@OlivierBinette Jumping back in here a bit.

Can you elaborate on why you need to know whether the thing underlying the Ibis table expression exists on disk as a non-temporary table?

Generally speaking if you're providing an API to end users that accepts arbitrary Ibis table expressions your API shouldn't need to know details like whether the table exists on disk. Those expressions could be a physical table, or a 400 element UNION ALL monstrosity.

I'd like to understand the architectural details of your application so that we can get a better idea of whether we need to bring back the CREATE TABLE IF NOT EXISTS functionality or not.

@OlivierBinette
Copy link
Author

@cpcloud Yes for sure. I'm building a Python package that will be used by data scientists to define data processing and entity resolution pipelines. The package has pre-built functions that take Ibis Table as input, and return Ibis Tables.

Some of these pre-built functions rely on external tools for more complex data processing, including Splink. To hand off a table to Splink, I need to give Splink the name of the table in the database. I also need to make sure that the table is non-temporary, as Splink relies on its own, separate connection to the database.

So basically, I want to have interoperability between Ibis and other systems that work with the same backend database.

@OlivierBinette
Copy link
Author

@cpcloud Ideally, I'd be able to achieve the following two things:

  1. Distinguish an Ibis Table that simply points to a database view/table from an Ibis Table that represents some other expression.
  2. For an Ibis Table that points to a database view/table, to be able to get information about the view/table, such as its name, and whether or not it is a temporary table.

@NickCrews
Copy link
Contributor

I am in a very similar boat with needing to pass off tables back and forth with Splink.

@OlivierBinette Are you sure that you actually need to be able to distinguish, or do you really only care that the result fits your needs? ie if ibis had this function and you didn't know the internals, would that work?

def ensure_named(t: Table, name: str | None = None, temp: bool = True, view: bool = True) -> Table:
    """
    Create a temp/permanent view/table in the database if the table isn't already that state.
    """

After this method, it is guaranteed that t.get_name() will give me a name that splink will accept.

@OlivierBinette
Copy link
Author

@NickCrews Yes that would cover my use case. I currently have a function that roughly does this for my DuckDB backend, but it's a bit janky.

The create_table and create_view functions could also be changed to have a force argument as cpcloud was saying, although we'd want to make sure that these functions always return Ibis Tables backed by non-temporary tables.

@cpcloud
Copy link
Member

cpcloud commented Dec 19, 2023

create_table and create_view always create objects that persist, unless you pass temp=True to them.

@cpcloud
Copy link
Member

cpcloud commented Jul 16, 2024

@OlivierBinette Is this still an issue for you?

@OlivierBinette
Copy link
Author

@cpcloud I still don't think the docs are clear regarding this topic.

For instance, does 'list_tables()' only return database tables? No. If I use 'read_csv()', the resulting lazy Ibis table expression shows up in 'list_tables()' despite not being a table in the database.

@gforsyth
Copy link
Member

xref #8382

@OlivierBinette -- we're thinking of adding a ddl accessor with specific list_views, list_materialized_views, methods, etc.

The top-level list_tables lists all abstract table objects

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs Documentation related issues or PRs
Projects
Status: backlog
Development

No branches or pull requests

4 participants