Questions about catalog design #8887
-
I don't think there is a good reason. Perhaps you could file a new ticket with the request?
The theory is that using a remote procedure call to list table names (and for the other functions) is likely to perform very poorly; most systems would want to batch access to the remote catalog.
What DataFusion itself does to plan SQL queries is walk over the query to find all schema / table references (in an ...). Specifically, there is a call that gets a snapshot of all references, and it then resolves them all here: https://github.com/apache/arrow-datafusion/blob/eb81ea299aa7e121bbe244e7e1ab56513d4ef800/datafusion/core/src/execution/context/mod.rs#L1678-L1688
This has come up a number of times, and I will make a PR to try to clarify the rationale in the documentation.
3. Why is only the table function of SchemaProvider asynchronous, and the rest of the functions are not asynchronous? Can the catalog and schema be obtained remotely?
This is the same answer as 2.
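To make the "collect all references first, then resolve them in one go" idea concrete, here is a minimal standard-library-only sketch. It is not DataFusion's API: `RemoteCatalog`, `TableMeta`, `Snapshot`, `collect_table_refs`, and `snapshot_for` are hypothetical names invented for illustration, and the whitespace-based reference finder is a toy stand-in for walking a parsed statement.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical stand-in for a table's metadata (not a DataFusion type).
#[derive(Debug, Clone, PartialEq)]
struct TableMeta {
    columns: Vec<String>,
}

/// Hypothetical remote catalog; `round_trips` counts simulated network calls.
struct RemoteCatalog {
    tables: HashMap<String, TableMeta>,
    round_trips: usize,
}

impl RemoteCatalog {
    /// One simulated round trip that resolves many names at once.
    /// Names the catalog does not know are simply absent from the result.
    fn fetch_batch(&mut self, names: &BTreeSet<String>) -> HashMap<String, TableMeta> {
        self.round_trips += 1;
        names
            .iter()
            .filter_map(|n| self.tables.get(n).map(|m| (n.clone(), m.clone())))
            .collect()
    }
}

/// Toy replacement for walking a parsed statement: collect the identifier
/// that follows each FROM or JOIN keyword. A real implementation would
/// visit the SQL AST instead of splitting on whitespace.
fn collect_table_refs(sql: &str) -> BTreeSet<String> {
    let tokens: Vec<&str> = sql.split_whitespace().collect();
    let mut refs = BTreeSet::new();
    for pair in tokens.windows(2) {
        if pair[0].eq_ignore_ascii_case("from") || pair[0].eq_ignore_ascii_case("join") {
            let name = pair[1].trim_end_matches(|c: char| c == ';' || c == ',');
            refs.insert(name.to_string());
        }
    }
    refs
}

/// In-memory snapshot used while planning: lookups are synchronous and local.
struct Snapshot {
    tables: HashMap<String, TableMeta>,
}

impl Snapshot {
    fn table(&self, name: &str) -> Option<&TableMeta> {
        self.tables.get(name)
    }
}

/// Resolve every referenced table up front with a single batched call.
fn snapshot_for(sql: &str, remote: &mut RemoteCatalog) -> Snapshot {
    let refs = collect_table_refs(sql);
    Snapshot {
        tables: remote.fetch_batch(&refs),
    }
}
```

Calling `snapshot_for` on a query that joins two tables costs one simulated round trip, however many tables it references, and every later `Snapshot::table` lookup is a plain in-memory read. That is the property the synchronous planning code relies on.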
-
I think the reasoning is largely historical: SchemaProvider was originally completely sync. #4607 added the necessary async shenanigans so that SchemaProvider::table could be async without forcing the planning machinery to be async as well, which, given planning's highly recursive nature, causes problems. However, this was not extended to other methods, in order to keep the scope of the change down. I think other methods could possibly be made async; it is just a case of working the async through the various traits and methods. Async is infuriatingly viral in this way, so seemingly simple changes can quickly balloon into quite complex undertakings. I would not be surprised if making table_names async also required changes to the other catalog traits. However, as @alamb describes, the actual meat of planning is already decoupled from these traits, so this might not be totally intractable.
-
Here is a PR with more information / documentation about this: #8968
-
Related: #8805
One thing that bothers me is that you're telling programmers who use DataFusion that they should walk the Statement and construct a suitable CatalogProvider, only for DataFusion to walk the Statement again and request things from the CatalogProvider. That seems a little silly.
Also note that registering just the referred-to tables in a SchemaProvider is not enough, because ...
My use case: most likely the schema data is already cached in memory, but fetching it could fail (e.g. data corruption on disk). I still haven't figured out the best design for doing the right amount of work ahead of time, and I'm suffering from potential ...
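One possible shape for the "fetching could fail" case is to do the fallible loading before planning and hand the planner only data that is already in memory. This is a sketch under stated assumptions, not a DataFusion API: `SchemaStore`, `Stored`, `FetchError`, and `prefetch` are hypothetical names, and corruption is simulated with an enum variant.

```rust
use std::collections::HashMap;

/// Hypothetical error for a schema entry that cannot be read back.
#[derive(Debug, PartialEq)]
enum FetchError {
    Corrupt(String),
}

/// Hypothetical on-disk state of one schema entry.
enum Stored {
    Readable(Vec<String>),
    Corrupt,
}

/// Hypothetical local store that may hold damaged entries.
struct SchemaStore {
    entries: HashMap<String, Stored>,
}

impl SchemaStore {
    /// Ok(None) means "no such table"; Err means the entry exists but is unreadable.
    fn load(&self, name: &str) -> Result<Option<Vec<String>>, FetchError> {
        match self.entries.get(name) {
            None => Ok(None),
            Some(Stored::Readable(cols)) => Ok(Some(cols.clone())),
            Some(Stored::Corrupt) => Err(FetchError::Corrupt(name.to_string())),
        }
    }
}

/// Load every referenced table ahead of time. The first failure aborts the
/// whole prefetch, so later lookups against the returned map cannot fail.
fn prefetch(
    store: &SchemaStore,
    names: &[&str],
) -> Result<HashMap<String, Vec<String>>, FetchError> {
    let mut resolved = HashMap::new();
    for name in names {
        if let Some(cols) = store.load(name)? {
            resolved.insert(name.to_string(), cols);
        }
    }
    Ok(resolved)
}
```

With this split, a corrupt entry surfaces as an `Err` from `prefetch` before any planning starts, while an unknown table is just missing from the map and can be reported by the planner as "table not found".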
-
1. Why doesn't CatalogList provide a delete function?
2. Why is only SchemaProvider's table function asynchronous, while SchemaProvider.table_names and CatalogList.catalog_names are not allowed to get data from a remote source?
3. Why is only the table function of SchemaProvider asynchronous, and the rest of the functions are not asynchronous? Can the catalog and schema be obtained remotely?