Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FR: allow pick() to rename in distinct() + some thoughts about arrange() allowing renaming. #7028

Open
olivroy opened this issue May 18, 2024 · 1 comment

Comments

@olivroy
Copy link

olivroy commented May 18, 2024

This would be useful in packages to avoid cran warnings as pick() is the new preferred way.

library(tidyverse)
options(lifecycle_verbosity = "error")
mtcars <- utils::head(mtcars, n = 6)
# distinct -----------
# renaming generally makes a lot of sense, as distinct is like `transmute(...)` + `distinct()`
# consider this in EDA
mtcars |> 
  distinct(x = vs)
#>            x
#> Mazda RX4  0
#> Datsun 710 1
# as it is equivalent to: (to avoid cran warnings)
mtcars |> 
  distinct(x = .data$vs)
# Before dplyr 1.1, `across()` was recommended, which works, but .fns = NULL is soft-deprecated.
mtcars |> 
  distinct(across(all_of(c(x = "vs"))))
# The alternative when you may not know variable names in a package
# with `pick()` should therefore work too, but errors.
# I think it should work
mtcars |> 
  distinct(pick(x = "vs"))
#> Error in `distinct()`:
#>  ℹ In argument: `pick(x = "vs")`.
#> Caused by error in `pick()`:
#> ! Can't rename variables in this context.
# Similarly:
mtcars |> 
  distinct(pick(all_of(c(x = "vs"))))

# I would expect distinct() and transmute()  + distinct() to be same as above (and it is!) ---
# transmute() is still very useful as it is mutate + select in a single call)
mtcars |> 
  transmute(x = vs) |> 
  distinct()
#>            x
#> Mazda RX4  0
#> Datsun 710 1
mtcars |> 
  transmute(x = .data$vs) |> 
  distinct()
# I think this should be allowed.
mtcars |> 
  transmute(across(all_of(c(x = "vs")))) |> 
  distinct()
#> Error in `transmute()`:
#> ℹ In argument: `pick(all_of(c(x = "vs")))`.
#> Caused by error in `pick()`:
#> ! Can't rename variables in this context.

Now comes arrange()...

arrange()

While I believe distinct() should accept renaming under any circonstances, I believe that accepting renaming in arrange() is inconsistent, as arrange() ignores it silently. renaming (or attempt to do so) should error in all cases for arrange().

However, it behaves exactly the same as distinct(). (so it is consistent in that sense, but ignores )

# arrange() is same as distinct() but silently ignores renaming...
# no x variable
mtcars |> 
  arrange(x = vs)
# no x variable
mtcars |> 
  arrange(x = .data$vs)
# no x variable
mtcars |> 
  arrange(across(all_of(c(x = "vs"))))
# this rightfully errors!
mtcars |> 
  arrange(pick(x = "vs"))
#> Error in `arrange()`:
#>  ℹ In argument: `..1 = pick(x = "vs")`.
#> Caused by error in `pick()`:
#> ! Can't rename variables in this context.

Maybe a soft-deprecation warning would be desirable for attempting to rename in arrange()
Something like names ignored, use unnamed, or remove name, or create this variable with mutate() to keep it.

#  A beginner may think that this will create the variable as nothing is stopping them.
y <- mtcars |> arrange(x = vs, y = desc(disp))

## later tries to access the variable and gets a warning or even worse. This confusing data frame.

y |> mutate(z = y **2) # gets very bad output!

Suggestions for signaling that arrange() ignores renaming.

  1. (preferred) soft-deprecation warning

something like: named input is ignored in arrange(). Please omit them.
in all_of(named_external_selection) the adjustment required would just be all_of(unname(named_external_selection))

# before
x |> arrange(new_var = var)
# now
x |> arrange(var)

if you want to create a new variable or rename, use rename() or mutate(), as arrange silently ignores names provided.

using the case of tidyselect deprecation of external selection as an example

it seems pretty similar as the deprecation of select(.data$vs) in favour of select("vs") + requiring any_of() all_of(). it ended up just increasing
general consitency of code base, and improve clarity.

# Before when you read this in someone else's code (i.e. with no context)
mtcars |>
  select(vs)
# you had no idea if the data had a `vs` column, or `vs` was an external vector
# the alternative made everything much clearer.
mtcars |>
  select(all_of(vs))

My proposed change would force users to remove this potentially misleading code
have the benefit that no one would question if the resulting data frame has a x column or not.

#6980 would be addressed automatically as part of this idea. i.e no longer allowing named selection would act a bit like a check_dots_used() condition

# before
mtcars |>
  arrange(x = vs)
# this code is clearer about what it does (and gives the same result as above)
# proposed
mtcars |> 
  arrange(vs)
  1. Breaking change (respect renaming in arrange() i.e. arrange() == mutate() + arrange()), Downsides probably outweigh benefits here at this point..

Quoting the 1.1 release notes:

across(), c_across(), if_any(), and if_all() now require the .cols and .fns arguments. In general, we now recommend that you use pick() instead of an empty across() call or across() with no .fns (e.g. across(c(x, y)). (#6523).

Relying on the previous default of .fns = NULL is not yet formally soft-deprecated, because there was no good alternative until now, but it is discouraged and will be soft-deprecated in the next minor release.

I think that allowing renaming in pick() would allow you to soft-deprecate .fns = NULL of across and warn on distinct() + across().

Edit: pick() should also allow renaming in group_by()

@yjunechoe
Copy link

+1 I agree that the current renaming behavior in pick() via pick(new = old) is underspecified, and I struggle to develop a good mental model of when/how it should work.

To revive the discussion, can I also raise an alternative that's similar in spirit but at the opposite extreme in implementation? What if we disallow renaming inside pick() via new = old entirely (so, tidyselect with allow_rename = FALSE), but port the .names argument from across()? For one, I don't think renaming with pick(new = old) is a common pattern that people have picked up on yet, so it doesn't feel as breaking (I'm not even sure whether this behavior is acknowledged or encouraged from reading the docs - it seems to be supported through technicality). More importantly, I think renaming via .names instead would greatly simplify the problem since then it just becomes a question of how different dplyr verbs consume data frames (which sometimes contain new column names), and the behavior for that is already well defined and familiar (e.g., users can translate their existing experience with across(.names)). So I think .names and the act of "passing entire dataframes to dplyr verbs" is a bit more concrete and accessible than having to reason about which verbs invoke a renaming context, which feels a bit more abstract.

Some consequences of what that would look like:

library(dplyr)

# Assume a `pick2()` implementation with `.names`
pick2 <- function(..., .names) {
  across(..., .fns = identity, .names = .names)
}

df <- data.frame(a = c(1,1,2), b = c("a", "a", "b"))
df
#>   a b
#> 1 1 a
#> 2 1 a
#> 3 2 b

df %>% 
  mutate(pick2(a:b, .names = "{toupper(.col)}"))
#>   a b A B
#> 1 1 a 1 a
#> 2 1 a 1 a
#> 3 2 b 2 b

df |> 
  transmute(pick2(a, .names = "x"))
#>   x
#> 1 1
#> 2 1
#> 3 2

df |> 
  distinct(pick2(a, .names = "x"))
#>   x
#> 1 1
#> 2 2

# Solves "renaming in `group_by()`" for free
df |>
  group_by(pick2(a, .names = "x"))
#> # A tibble: 3 × 3
#> # Groups:   x [2]
#>       a b         x
#>   <dbl> <chr> <dbl>
#> 1     1 a         1
#> 2     1 a         1
#> 3     2 b         2

Note that the oddity of the arrange() case still remains a separate problem. In the .names approach, it's just treated like sorting with an external vector. But with .names I feel like there's less of an expectation that you'd create a new column - I think the special .names syntax should sufficiently discourage people into expecting a new column to be created (vs. the new = old syntax).

df |> 
  arrange(pick2(a, .names = "x"))
#>   a b
#> 1 1 a
#> 2 1 a
#> 3 2 b

One obvious downside is that pick() would lose the select+rename combo via any_of(<named vector>), which admittedly would be pretty annoying for users who've been enjoying that lean syntax (though I suspect it's not frequently used for pick()).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants