-
Notifications
You must be signed in to change notification settings - Fork 608
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
docs: Unclear how to materialize a table or list materialized tables #7742
Comments
Hi @OlivierBinette 👋🏻! Thanks for opening an issue about this. I will write some things down here that can hopefully be used by myself or others to seed a documentation PR.
As for "how", there are generally two methods for doing that:
Both of As for "when", that's generally when you invoke those methods.
This name corresponds the physical table's name only in the case that the table is backed by a physical entity in the database like a table or view. Otherwise it's an ephemeral name. I would try to avoid depending on the return value of this method being anything in particular for an arbitrary table expression. While
All Ibis table expressions are unevaluated until you call If you're asking how to distinguish between whether something is a leaf in the expression tree or not, there's no public API for doing this right now. Can you describe what you want to do with this information and/or more about your use case? In practice this hasn't come up IIRC.
We don't have a generic API for this, partly because each backend's target representation is different. Some backends work directly with strings and don't have anything analogous to Can you describe what you're trying to do here? |
Thanks for your answers! Follow-up question: Does Right now, I use a direct connection to my database (using DuckDB's Python client) to query my database and find whether or not a table has been materialized:
For your final question question: The pipeline that I use after pre-processing data with Ibis expects the name of a table in my database or a DuckDBPyRelation object. It could possibly accomodate a sqlalchemy.Table class as well. This requires making sure that the input Ibis Table has been materialized and then fetching its physical name. For more information, here's what I do:
Happy to share code if this is too abstract. I could put together a minimal example. |
@cpcloud Big picture, I think the key thing is that I am using Ibis Table objects not just for data manipulation, but also as my abstract table interface in my system. As such, I need to not only be able to create Ibis Tables, but I also need to be able to go from an Ibis Table back to my backend database. The common interface for this might be to materialize a table in a database/schema/namespace of your choice (or even just the default one), and then get the name of the physical table as a string. I think this is mostly supported by Ibis, but the documentation might need to be clarified a bit and that use case could be highlighted. |
I think the use @OlivierBinette describes here is nearly the exact same use case that I had in #6636. I also started #7762 if anyone wants to chime in there. I think better docs would logically get added to https://ibis-project.org/how-to/input-output/basics#data-platforms. @OlivierBinette what parts of that guide are lacking that we should add? |
@NickCrews Maybe the data-platforms page could include a few points about the following:
I don't mind whether or not Ibis' Table API directly implements functions to do these things, but I definitely want to have enough information so that I can implement reliable functions that do these things for a specific backend I might be using. |
@OlivierBinette I'm having a bit of trouble understanding what you mean by "materialize". Does that mean creating a table in the database from an expression, or executing an expression (getting back in-memory results), or something else? |
@cpcloud By "materialize", I mean creating a persistent table in the database from an expression. I.e., the Ibis Table object should point a table in the backend database. Specifically:
Maybe "persisted" would be a better word here. Here's how I currently check if a table has been "materialized", using a separate direct connection def material_tables_list(self):
material_tables = self.duckdb_connection.sql("show tables").fetchall()
columns_data = list(zip(*material_tables))
if len(columns_data) > 0:
return columns_data[0]
else:
return [] And here's how I create a "material view" for a given table: def material_view(self, table: Table, table_name: Optional[str] = None, overwrite=False):
if table_name is None:
table_name = table.get_name() # This raises an error if the Ibis table refers to an unnamed expression. A uuid should be created in that case.
if overwrite or table_name not in self.material_tables_list():
return self.db.create_table(table_name, table, overwrite=True)
else:
return self.db.tables[table_name] |
Hm,
is a confusing way to refer to "disk-backed table", especially since many platforms these days have an object called a "materialized view", which is neither a table nor is it a traditional view 😅 I think we need some documentation that gives some conceptual foundations about what words like "view" and "table" mean. They're independent of ibis, but I think we should all try to get on the same page with respect to database nouns to the extent possible. For example, both tables and views have the ability to be temporary and overwritten, but only tables have data that takes up storage space. |
To clarify a couple things:
This is the notion of a non-temporary tabular object in a database, i.e., a thing created with either
This is the same notion as the first bullet point. Unless a view or a table was created with |
@cpcloud Thanks for the clarifications! I think what confuses things more is that an Ibis "table" can be a database table, a database view, an unevaluated expression, etc. So ultimately the documentation issue is to clarify the relationship between the Ibis and backend concepts, and how to figure out how one maps to the other. For my application, I often want to make sure that an Ibis table refers to a non-temporary database table, or to make an Ibis table to a non-temporary database table if that's not already the case. |
I would say that there's no need verify this if you're using
I think t = con.create_table(name, any_table_expression) # the second argument can be a view or a table The return value of |
I have functions that take as input Ibis Table objects, and unfortunately I don't know how the user created them. I'm using Ibis as a high-level Table interface abstraction for a piece of software, but I'm not responsible for supplying the input. Should I do try/catch on There's also a Overall, I need a better understanding of how I should inspect Ibis Table objects against their backend. |
Ah, okay, that helps me understand a bit more about what you're doing.
Interesting, we actually had a Perhaps we should consider bringing that back, with a better name of course 😄
|
@OlivierBinette Jumping back in here a bit. Can you elaborate on why you need to know whether the thing underlying the Ibis table expression exists on disk as a non-temporary table? Generally speaking if you're providing an API to end users that accepts arbitrary Ibis table expressions your API shouldn't need to know details like whether the table exists on disk. Those expressions could be a physical table, or a 400 element I'd like to understand the architectural details of your application so that we can get a better idea of whether we need to bring back the |
@cpcloud Yes for sure. I'm building a Python package that will be used by data scientists to define data processing and entity resolution pipelines. The package has pre-built functions that take Ibis Table as input, and return Ibis Tables. Some of these pre-built functions rely on external tools for more complex data processing, including Splink. To hand off a table to Splink, I need to give Splink the name of the table in the database. I also need to make sure that the table is non-temporary, as Splink relies on its own, separate connection to the database. So basically, I want to have interoperability between Ibis and other systems that work with the same backend database. |
@cpcloud Ideally, I'd be able to achieve the following two things:
|
I am in a very similar boat with needing to pass off tables back and forth with Splink. @OlivierBinette Are you sure that you actually need to be able to distinguish, or do you really only care that the result fits your needs? ie if ibis had this function and you didn't know the internals, would that work? def ensure_named(t: Table, name: str | None = None, temp: bool = True, view: bool = True) -> Table:
"""
Create a temp/permanent view/table in the database if the table isn't already that state.
""" After this method, it is guaranteed that |
@NickCrews Yes that would cover my use case. I currently have a function that roughly does this for my DuckDB backend, but it's a bit janky. The |
|
@OlivierBinette Is this still an issue for you? |
@cpcloud I still don't think the docs are clear regarding this topic. For instance, does 'list_tables()' only return database tables? No. If I use 'read_csv()', the resulting lazy Ibis table expression shows up in 'list_tables()' despite not being a table in the database. |
xref #8382 @OlivierBinette -- we're thinking of adding a The top-level |
Please describe the issue
Context:
I'm using Ibis to do data pre-processing, and then using the name of the materialized table in the database as the input to another pipeline.
Problem:
After spending some time working with Ibis and reading through the docs, I'm still unclear about the relationship between Ibis tables, the tables listed in
con.tables
, and the views and tables in my database.Specifically, I think the documentation should clarify:
get_name()
function returns a value and how this name relates to the physical table name.Code of Conduct
The text was updated successfully, but these errors were encountered: