FR: allow `pick()` to rename in `distinct()` + some thoughts about `arrange()` allowing renaming. #7028

olivroy · 2024-05-18T17:37:23Z

This would be useful in packages to avoid CRAN warnings, since pick() is now the preferred replacement for across() without .fns.

library(tidyverse)
options(lifecycle_verbosity = "error")
mtcars <- utils::head(mtcars, n = 6)
# distinct -----------
# Renaming generally makes sense here, since distinct(...) behaves like `transmute(...)` + `distinct()`.
# Consider this during EDA:
mtcars |> 
  distinct(x = vs)
#>            x
#> Mazda RX4  0
#> Datsun 710 1
# This is equivalent to the following (which avoids CRAN warnings):
mtcars |> 
  distinct(x = .data$vs)
# Before dplyr 1.1, `across()` was recommended. It still works, but relying on .fns = NULL is now discouraged.
mtcars |> 
  distinct(across(all_of(c(x = "vs"))))
# The alternative for package code, where variable names may not be known in advance,
# is `pick()`. It should therefore work too, but it errors.
# I think it should work.
mtcars |> 
  distinct(pick(x = "vs"))
#> Error in `distinct()`:
#>  ℹ In argument: `pick(x = "vs")`.
#> Caused by error in `pick()`:
#> ! Can't rename variables in this context.
# Similarly:
mtcars |> 
  distinct(pick(all_of(c(x = "vs"))))

# I would expect distinct() and transmute() + distinct() to give the same result as above (and they do!) ---
# (transmute() is still very useful, as it is mutate() + select() in a single call.)
mtcars |> 
  transmute(x = vs) |> 
  distinct()
#>            x
#> Mazda RX4  0
#> Datsun 710 1
mtcars |> 
  transmute(x = .data$vs) |> 
  distinct()
# I think this should be allowed.
mtcars |> 
  transmute(pick(all_of(c(x = "vs")))) |> 
  distinct()
#> Error in `transmute()`:
#> ℹ In argument: `pick(all_of(c(x = "vs")))`.
#> Caused by error in `pick()`:
#> ! Can't rename variables in this context.

Now comes arrange()...

arrange()

While I believe distinct() should accept renaming under any circumstances, I believe that accepting renaming in arrange() is inconsistent, because arrange() silently ignores it. Renaming (or an attempt to do so) should error in all cases for arrange().

However, arrange() currently accepts exactly the same inputs as distinct() (so it is consistent in that sense, but it silently ignores the new names).

# arrange() accepts the same inputs as distinct() but silently ignores renaming...
# no x variable
mtcars |> 
  arrange(x = vs)
# no x variable
mtcars |> 
  arrange(x = .data$vs)
# no x variable
mtcars |> 
  arrange(across(all_of(c(x = "vs"))))
# this rightfully errors!
mtcars |> 
  arrange(pick(x = "vs"))
#> Error in `arrange()`:
#>  ℹ In argument: `..1 = pick(x = "vs")`.
#> Caused by error in `pick()`:
#> ! Can't rename variables in this context.

Maybe a soft-deprecation warning would be desirable for attempts to rename in arrange().
Something like: names are ignored; remove the name, or create this variable with mutate() to keep it.

# A beginner may think that this creates the variables, as nothing stops them.
y <- mtcars |> arrange(x = vs, y = desc(disp))

## Later, they try to access the variable and get a warning or, even worse, a confusing data frame.

y |> mutate(z = y^2) # gets very bad output!

Suggestions for signaling that `arrange()` ignores renaming.

(preferred) soft-deprecation warning

Something like: named inputs are ignored in arrange(). Please omit the names.
For all_of(named_external_selection), the required adjustment would just be all_of(unname(named_external_selection)).

# before
x |> arrange(new_var = var)
# now
x |> arrange(var)
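
As a minimal sketch of the all_of(unname(...)) adjustment (assuming dplyr >= 1.1.0 for pick(); `cars6` and `sorted` are illustrative names, not from the original post), dropping the names from the external selection leaves arrange() with nothing to ignore:

```r
library(dplyr)

# Illustrative data: the first six rows of mtcars, as in the reprex above.
cars6 <- utils::head(mtcars, n = 6)

# A named external selection, e.g. built programmatically in a package.
named_external_selection <- c(x = "vs")

# Strip the names before selecting, so no renaming is attempted.
sorted <- cars6 |>
  arrange(pick(all_of(unname(named_external_selection))))

# The result is sorted by `vs` and has no `x` column.
names(sorted)
```

The same unname() call should also work with across() in place of pick() on older dplyr versions.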

If you want to create a new variable or rename one, use mutate() or rename(), as arrange() silently ignores the names provided.
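
For readers who do want the new column, a small sketch of the explicit route (`cars6` and `with_x` are illustrative names): create the variable with mutate() first, then sort by it.

```r
library(dplyr)

# Illustrative data: the first six rows of mtcars.
cars6 <- utils::head(mtcars, n = 6)

# Create `x` explicitly, then sort by it; the column is kept in the result.
with_x <- cars6 |>
  mutate(x = vs) |>
  arrange(x)

"x" %in% names(with_x)
```

Written this way, the code states plainly that `x` is part of the output, which is the intent that `arrange(x = vs)` only appears to express.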

using the case of tidyselect deprecation of external selection as an example

It seems pretty similar to the deprecation of select(.data$vs) in favour of select("vs"), plus requiring any_of() / all_of() for external vectors. That ended up increasing
the general consistency of the code base and improving clarity.

# Before when you read this in someone else's code (i.e. with no context)
mtcars |>
  select(vs)
# you had no idea if the data had a `vs` column, or `vs` was an external vector
# the alternative made everything much clearer.
mtcars |>
  select(all_of(vs))

My proposed change would force users to remove this potentially misleading code,
with the benefit that no one would have to wonder whether the resulting data frame has an x column.

#6980 would be addressed automatically as part of this idea, i.e. no longer allowing named selection would act a bit like a check_dots_used() condition.
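
To make the idea concrete, here is a rough user-level sketch of such a check. `arrange_strict()` is a hypothetical wrapper written for illustration; it is not part of dplyr, and the warning text is only a suggestion. It assumes rlang is installed (it is a dplyr dependency).

```r
library(dplyr)

# Hypothetical wrapper for illustration only; dplyr has no arrange_strict().
arrange_strict <- function(.data, ...) {
  # Capture the inputs only to inspect their names.
  dot_names <- rlang::names2(rlang::enquos(...))
  named <- dot_names[dot_names != ""]
  if (length(named) > 0) {
    warning(
      "Named inputs are ignored in arrange(). Please omit the names: ",
      paste(named, collapse = ", "),
      call. = FALSE
    )
  }
  # arrange() itself ignores the names, as described above.
  arrange(.data, ...)
}

cars6 <- utils::head(mtcars, n = 6)

# Unnamed input: sorts as usual, no warning.
sorted <- arrange_strict(cars6, vs)
```

Calling `arrange_strict(cars6, x = vs)` returns the same sorted data but warns that `x` is ignored. An actual change inside dplyr would presumably go through the lifecycle deprecation helpers rather than a plain warning().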

# before
mtcars |>
  arrange(x = vs)
# proposed: this code is clearer about what it does (and gives the same result as above)
mtcars |> 
  arrange(vs)

Breaking change (respect renaming in arrange(), i.e. arrange() == mutate() + arrange()): the downsides probably outweigh the benefits at this point.

Quoting the 1.1 release notes:

across(), c_across(), if_any(), and if_all() now require the .cols and .fns arguments. In general, we now recommend that you use pick() instead of an empty across() call or across() with no .fns (e.g. across(c(x, y)). (#6523).

Relying on the previous default of .fns = NULL is not yet formally soft-deprecated, because there was no good alternative until now, but it is discouraged and will be soft-deprecated in the next minor release.

I think that allowing renaming in pick() would make it possible to soft-deprecate .fns = NULL in across() and to warn on distinct() + across().

Edit: pick() should also allow renaming in group_by()

yjunechoe · 2024-08-20T14:46:10Z

+1 I agree that the current renaming behavior in pick() via pick(new = old) is underspecified, and I struggle to develop a good mental model of when/how it should work.

To revive the discussion, can I also raise an alternative that's similar in spirit but at the opposite extreme in implementation? What if we disallow renaming inside pick() via new = old entirely (so, tidyselect with allow_rename = FALSE), but port the .names argument from across()? For one, I don't think renaming with pick(new = old) is a common pattern that people have picked up on yet, so it doesn't feel as breaking (I'm not even sure whether this behavior is acknowledged or encouraged from reading the docs - it seems to be supported through technicality). More importantly, I think renaming via .names instead would greatly simplify the problem since then it just becomes a question of how different dplyr verbs consume data frames (which sometimes contain new column names), and the behavior for that is already well defined and familiar (e.g., users can translate their existing experience with across(.names)). So I think .names and the act of "passing entire dataframes to dplyr verbs" is a bit more concrete and accessible than having to reason about which verbs invoke a renaming context, which feels a bit more abstract.

Some consequences of what that would look like:

library(dplyr)

# Assume a `pick2()` implementation with `.names`
pick2 <- function(..., .names) {
  across(..., .fns = identity, .names = .names)
}

df <- data.frame(a = c(1,1,2), b = c("a", "a", "b"))
df
#>   a b
#> 1 1 a
#> 2 1 a
#> 3 2 b

df %>% 
  mutate(pick2(a:b, .names = "{toupper(.col)}"))
#>   a b A B
#> 1 1 a 1 a
#> 2 1 a 1 a
#> 3 2 b 2 b

df |> 
  transmute(pick2(a, .names = "x"))
#>   x
#> 1 1
#> 2 1
#> 3 2

df |> 
  distinct(pick2(a, .names = "x"))
#>   x
#> 1 1
#> 2 2

# Solves "renaming in `group_by()`" for free
df |>
  group_by(pick2(a, .names = "x"))
#> # A tibble: 3 × 3
#> # Groups:   x [2]
#>       a b         x
#>   <dbl> <chr> <dbl>
#> 1     1 a         1
#> 2     1 a         1
#> 3     2 b         2

Note that the oddity of the arrange() case still remains a separate problem. In the .names approach, it's just treated like sorting with an external vector. But with .names I feel like there's less of an expectation that you'd create a new column - I think the special .names syntax should sufficiently discourage people from expecting a new column to be created (vs. the new = old syntax).

df |> 
  arrange(pick2(a, .names = "x"))
#>   a b
#> 1 1 a
#> 2 1 a
#> 3 2 b

One obvious downside is that pick() would lose the select+rename combo via any_of(<named vector>), which admittedly would be pretty annoying for users who've been enjoying that lean syntax (though I suspect it's not frequently used for pick()).
