Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Always use get_src_tbl_names() to fetch tbl Ids within dm_from_con() #1934

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from
Draft
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
22 changes: 12 additions & 10 deletions R/dm_from_con.R
Original file line number Diff line number Diff line change
Expand Up @@ -93,18 +93,20 @@ dm_from_con <- function(con = NULL, table_names = NULL, learn_keys = NULL,
)
}

if (is_null(table_names)) {
src_tbl_names <- get_src_tbl_names(src, ..., names = .names)
} else {
src_tbl_names <- table_names
}
tbl_ids <- get_src_tbl_names(src, ..., names = .names)

# Fetch only the tbls which were specifically requested
if (!is.null(table_names)) {
not_found <- setdiff(table_names, names(tbl_ids))

nms <- purrr::map_chr(src_tbl_names, ~ .x@name[["table"]])
if (length(not_found) > 0) {
cli::cli_warn("Could not find table{?s}: {not_found}")
}

tbl_ids <- tbl_ids[intersect(table_names, names(tbl_ids))]
}

tbls <-
set_names(src_tbl_names, nms) %>%
quote_ids(con) %>%
map(possibly(tbl, NULL), src = src)
tbls <- map(tbl_ids, possibly(tbl, NULL), src = src)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When using schemas this code fails as well, since table_names normally does not include the schema, yet names(tbl_ids) does.
Therefore, I would suggest to replace lines 96-109 with the following:

  tbl_ids <- get_src_tbl_names(src, ..., names = .names)
  # Fetch only the tbls which were specifically requested
  if (!is.null(table_names)) {
    nms <- purrr::map_chr(tbl_ids, ~ .x@name[["table"]])
    bad_table_names <- table_names[!(table_names %in% nms)]
    tbl_ids <- tbl_ids[which(nms %in% table_names)]
  }
  tbls <- map(tbl_ids, possibly(tbl, NULL), src = src)
  bad <- map_lgl(tbls, is_null)
  if (any(bad) || (exists("bad_table_names", inherits = FALSE) && length(bad_table_names) > 0)) {
    if (is_null(table_names)) {
      warn_tbl_access(names(tbls)[bad])
      tbls <- tbls[!bad]
    } else {
      abort_tbl_access(union(names(tbls)[bad], bad_table_names))
    }
  }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many thanks for this!! I hadn't spotted the *_tbl_access() functions, those look super handy 😁 I'll certainly adjust to include those.

Regarding table_names, I've just been thinking about the following situation:

Suppose there is a table name which exists in multiple schemas, e.g. suppose schema1.table1 and schema2.table1 both exist.

Now if a user asks for:

dm_from_con(con, table_names = "table1", schemas = c("schema1", "schema2"))

What should we return in such a case?

One option would be for the user to provide "schema.table"-format names in table_names. The documentation for table_names would perhaps need some adjustment - but:

Or maybe we should include both schema1.table1 and schema2.table1 in the returned dm object; which I think is what your suggested changes will result in? I'm conflicted about this, because I feel like once we start dealing with multiple schemas, we should always be explicit about those schemas - but maybe that's just me being too strict with SQL ideals 😛

To offer one more alternative, we could include only the first table found with that name, and then raising a warning, as described here? #1789 (comment)

I'll happily proceed with whichever you think is best!

Copy link
Member

@TSchiefer TSchiefer Aug 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your reply, good catch about the same table name in different schemas.

In my opinion, if the user provides 1 table name and more than 1 schema name, as in your example, this should lead to an error: either schema is not provided, has length 1 or needs to have a length matching the length of table_names.

names(tbl_ids) is a vector of the resulting names in the dm according to the specification in the .names argument. In case schema is given with multiple schema names, the default for this argument -- as you know -- is "{.schema}.{.table}", which leads to dm-tables just like you suggested (schema1.table1 and schema2.table1), if the user gave table_names = c("table1", "table1") and schema = c("schema1", "schema2").
In general names(tbl_ids) could be anything though and for this reason it cannot be used in any conditions.

I think the error messages need to be adapted to include the schema names of the tables that could not be accessed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okidoki! So based on these comments so far, I think we might need to modify how table_names works, such that it can accept a list in the multiple-schemas case - something like:

# No `schema` arg => use default schema, i.e. same as one-schema case (next example)
dm_from_con(con, table_names = c("table1", "table2"))

# One specified schema => `table_names` is a single vector of names
# Returned dm object contains: schema1.table1, schema1.table2
dm_from_con(con, schema = "schema1", table_names = c("table1", "table2"))

# Multiple schemas => `table_names` is a list of vectors (must it be named?)
# Returned dm object contains: schema1.table1, schema1.table2, schema2.table1, schema2.table3
dm_from_con(
  con,
  schema = c("schema1", "schema2"),
  table_names = list(
    "schema1" = c("table1", "table2"),
    "schema2" = c("table1", "table3")
  )
)

I'll make one more case for leaving things as they are currently implemented, i.e. based on names(tbl_ids), because it avoids adding complexity to table_names.

In general names(tbl_ids) could be anything though

True - but names are always constructed using the .names pattern, which is either the default (which multi-schema users must be aware of already), or specified by the user!

So the user does always know how elements of the returned dm object will be named; and therefore they know how the strings in table_names should look.

# Default schema / one schema
dm_from_con(con, table_names = c("table1", "table2"))
dm_from_con(con, schema = "schema1", table_names = c("table1", "table2"))

# Multiple schemas, default `.names`
dm_from_con(
  con,
  schemas = c("schema1", "schema2"),
  table_names = c("schema1.table1", "schema1.table2", "schema2.table1", "schema2.table3")
)

# Multiple schemas, custom `.names` - user knows that `table_names` should match `.names`
dm_from_con(
  con,
  schemas = c("schema1", "schema2"),
  # The user chooses `.names`...
  .names = "{.schema}__{.table}",
  # ... so, they know what `table_names` should look like!
  table_names = c("schema1__table1", "schema1__table2", "schema2__table1", "schema2__table3")
)

Thinking back to before multiple schema functionality was added, I think one interpretation of table_names was:

"I only want the returned dm object to contain elements with names like I've specified in table_names."

And I think that with this implementation, that interpretation is still valid now in the multi-schema case.


What do you think? For either of those approaches, I reckon I now have a solid specification to go away and implement 😇 many thanks for your patience and guidance with this one!


bad <- map_lgl(tbls, is_null)
if (any(bad)) {
Expand Down