Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

POC: Set focus to new and used columns after mutate() #6252

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

krlmlr
Copy link
Member

@krlmlr krlmlr commented Apr 30, 2022

Needs r-lib/pillar#549.

In the example below, speed, air_time and distance are shown because they played a role in the mutate() . Normally none of these columns are visible.

The focus columns are highlighted with an underline, this formatting can't be shown in a reprex.

library(dplyr)
nycflights13::flights %>%
  mutate(speed = air_time / distance)
#> # A tibble:      336,776 × 20
#> # Focus columns: speed, air_time, distance
#>     year month   day dep…¹ sch…² dep…³ arr…⁴ sch…⁵ arr…⁶ air_time distance speed
#>    <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl>    <dbl>    <dbl> <dbl>
#>  1  2013     1     1   517   515     2   830   819    11      227     1400 0.162
#>  2  2013     1     1   533   529     4   850   830    20      227     1416 0.160
#>  3  2013     1     1   542   540     2   923   850    33      160     1089 0.147
#>  4  2013     1     1   544   545    -1  1004  1022   -18      183     1576 0.116
#>  5  2013     1     1   554   600    -6   812   837   -25      116      762 0.152
#>  6  2013     1     1   554   558    -4   740   728    12      150      719 0.209
#>  7  2013     1     1   555   600    -5   913   854    19      158     1065 0.148
#>  8  2013     1     1   557   600    -3   709   723   -14       53      229 0.231
#>  9  2013     1     1   557   600    -3   838   846    -8      140      944 0.148
#> 10  2013     1     1   558   600    -2   753   745     8      138      733 0.188
#> # … with 336,766 more rows, abbreviated variable names ¹​dep_time,
#> #   ²​sched_dep_time, ³​dep_delay, ⁴​arr_time, ⁵​sched_arr_time, ⁶​arr_delay, and 8
#> #   more variables: carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
#> #   dest <chr>, hour <dbl>, minute <dbl>, time_hour <dttm>

Created on 2022-04-30 by the reprex package (v2.0.1)

@krlmlr krlmlr requested review from hadley and DavisVaughan April 30, 2022 15:52
@hadley
Copy link
Member

hadley commented Aug 4, 2022

Hmmmm I like the way this looks, but adding the extra attribute is makes a lot of tests fail. I'm confident we can work around this for dplyr itself (see current hack in utils.R) but we'd need to think about how this will play out for other packages. Maybe if is_testing() is TRUE we could instead set waldo_opts? But that'll only help for packages using testthat 3e. And we might need to figure out some way to avoid the small expensive of an env var check for every mutate.

@krlmlr
Copy link
Member Author

krlmlr commented Aug 18, 2022

Crazy thought: what if we reversed responsibility here? If each mutate() registered the resulting data frame together with its focus columns in a global variable (collection?) so that attributes remain unchanged but the focus columns are still known? We could purge old data frames from that container after a relatively brief interval (seconds? minutes? defined with an option?) to avoid hidden state kept indefinitely.

Otherwise: can we set waldo_opts in on_load(), but only when testing?

@DavisVaughan
Copy link
Member

Concrete example of what I don't particularly love about focus columns. I'm explicitly requesting that the new column come before time_hour, but the focus column makes it look like it came after distance (which also doesn’t even really exist there either)

library(dplyr)

nycflights13::flights %>%
  mutate(speed = air_time / distance, .before = time_hour)
#> # A tibble:      336,776 × 20
#> # Focus columns: air_time, distance, speed
#>     year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ air_time distance speed
#>    <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl>    <dbl>    <dbl> <dbl>
#>  1  2013     1     1      517      515       2     830     819      11      227     1400 0.162
#>  2  2013     1     1      533      529       4     850     830      20      227     1416 0.160
#>  3  2013     1     1      542      540       2     923     850      33      160     1089 0.147
#>  4  2013     1     1      544      545      -1    1004    1022     -18      183     1576 0.116
#>  5  2013     1     1      554      600      -6     812     837     -25      116      762 0.152
#>  6  2013     1     1      554      558      -4     740     728      12      150      719 0.209
#>  7  2013     1     1      555      600      -5     913     854      19      158     1065 0.148
#>  8  2013     1     1      557      600      -3     709     723     -14       53      229 0.231
#>  9  2013     1     1      557      600      -3     838     846      -8      140      944 0.148
#> 10  2013     1     1      558      600      -2     753     745       8      138      733 0.188
#> # … with 336,766 more rows, 8 more variables: carrier <chr>, flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated
#> #   variable names ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

I also don't really love that it makes the column look like its in the 12th position, like I could access it with df[[12]], but it isn't really there and it is somewhat hard to figure out where it actually is

@DavisVaughan
Copy link
Member

Also why is air_time showing up twice? Once as arr_t…³ and once as air_time?

@krlmlr
Copy link
Member Author

krlmlr commented Aug 26, 2022

Thanks for your feedback.

I need to look into the confused column order, something doesn't seem to be right.

I have thought about adding a vertical separator if columns are omitted, perhaps |. Would that help?

air_time is showing only once, the other column is arr_time .

@DavisVaughan
Copy link
Member

DavisVaughan commented Aug 26, 2022

a vertical separator if columns are omitted,

I'm not sure I follow, can you mock an example?

@krlmlr
Copy link
Member Author

krlmlr commented Aug 28, 2022

The column order is correct. We have three focus column as a result of the operation: the two input columns, and the output column.

The focus columns are shown in the header. Is this helpful at all?

I have tweaked the reprex to show what the vertical separator could look like.

Should we treat focus columns differently in non-interactive mode, e.g., ignore them for now?

library(conflicted)
library(dplyr)

options(max.print = 20)

nycflights13::flights %>%
  mutate(speed = air_time / distance, .before = time_hour) %>%
  attributes()
#> $names
#>  [1] "year"           "month"          "day"            "dep_time"      
#>  [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
#>  [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
#> [13] "origin"         "dest"           "air_time"       "distance"      
#> [17] "hour"           "minute"         "speed"          "time_hour"     
#> 
#> $row.names
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
#>  [ reached getOption("max.print") -- omitted 336756 entries ]
#> 
#> $class
#> [1] "tbl_df"     "tbl"        "data.frame"
#> 
#> $pillar_focus
#> [1] "speed"    "air_time" "distance"
nycflights13::flights %>%
  mutate(speed = air_time / distance, .before = time_hour)
#> # A tibble:      336,776 × 20
#> # Focus columns: speed, air_time, distance
#>     year month   day dep_time sched_de…¹ dep_d…² arr_t…³|air_time distance speed
#>    <int> <int> <int>    <int>      <int>   <dbl>   <int>|   <dbl>    <dbl> <dbl>
#>  1  2013     1     1      517        515       2     830      227     1400 0.162
#>  2  2013     1     1      533        529       4     850      227     1416 0.160
#>  3  2013     1     1      542        540       2     923      160     1089 0.147
#>  4  2013     1     1      544        545      -1    1004      183     1576 0.116
#>  5  2013     1     1      554        600      -6     812      116      762 0.152
#>  6  2013     1     1      554        558      -4     740      150      719 0.209
#>  7  2013     1     1      555        600      -5     913      158     1065 0.148
#>  8  2013     1     1      557        600      -3     709       53      229 0.231
#>  9  2013     1     1      557        600      -3     838      140      944 0.148
#> 10  2013     1     1      558        600      -2     753      138      733 0.188
#> # … with 336,766 more rows, 10 more variables: sched_arr_time <int>,
#> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
#> #   dest <chr>, hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated
#> #   variable names ¹​sched_dep_time, ²​dep_delay, ³​arr_time

Created on 2022-08-28 by the reprex package (v2.0.1)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants