Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

POC bin breaks derived from scale breaks #6174

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

arcresu
Copy link
Contributor

@arcresu arcresu commented Nov 1, 2024

This is a proof of concept of the minimal changes necessary to fix #6159. If you're willing to consider this approach I'll finish it off with documentation, tests, and the outstanding TODOs below.

The first part is essentially the same as the extension discussed in the issue: i.e. the follow.scale param on stat_bin causes it to inherit bins from the scale. As noted, that only works if the scale doesn't get new breaks during the final retraining, i.e. provide fixed breaks, or disable scale expansion and hope other layers don't cause issues. In this example the bins don't align with the final breaks because the scale expands after the binning, causing the breaks to move.

(TODO: add follow.scale to the other binning stats. Suppress the default binning warning when follow.scale = TRUE. Add a value like follow.scales = "minor" to allow inheriting major and minor breaks?)

devtools::load_all("~/code/ggplot2")
#> ℹ Loading ggplot2

set.seed(2024)
df <- data.frame(
  date = as.Date("2024-01-01") + rnorm(100, 0, 5),
  z = sample(c("a", "b"), 100, replace = TRUE)
)

ggplot(df, aes(date)) +
  geom_histogram(follow.scale = TRUE)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The fix is to tell the scale that we want the breaks to be "frozen" before the stats are computed. Subsequent retraining is free to change the limits, which affects which breaks are shown, but once the breaks are frozen it acts as though they had been passed in as an explicit breaks vector.

(TODO: Add a param to the continuous scale constructor and scale_{x,y}_{continuous,date,datetime}. Maybe come up with a better name than freezing, like breaks_computation = c("auto", "before_stat"))

ggplot(df, aes(date)) +
  geom_histogram(follow.scale = TRUE) +
  ggproto(NULL, scale_x_date(), freeze_breaks = TRUE)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

It also looks reasonable when there are multiple facets:

ggplot(df, aes(date)) +
  geom_histogram(follow.scale = TRUE) +
  facet_wrap(vars(z)) +
  ggproto(NULL, scale_x_date(), freeze_breaks = TRUE)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Adding a distant data point and setting the scales to free, we can see that the binning is done independently for different facets:

rbind(df, data.frame(date = as.Date("2025-03-01") , z = "a")) |>
  ggplot(aes(date)) +
  geom_histogram(follow.scale = TRUE) +
  facet_wrap(vars(z), scales = "free_x") +
  ggproto(NULL, scale_x_date(), freeze_breaks = TRUE, guide = guide_axis(angle = 90)) 
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Created on 2024-11-01 with reprex v2.1.1

This seems probably desirable behaviour since we did explicitly request free scales here. Changing it to make the binning consistent across panels would also be a bit complicated because I think the facets clone the scales before the first time breaks are computed.

To make the combination of settings more discoverable, it's probably reasonable to add a warning when using follow.scale with a scale that doesn't have freeze_breaks = TRUE.

Please let me know if I've overlooked some way that these changes will cause problems with other parts of ggplot!

@arcresu arcresu marked this pull request as draft November 1, 2024 06:17
@thomasp85
Copy link
Member

I'm a bit weary of adding this kind of link between scales and stats tbh. It makes the code much harder to reason about. Is there anything in this you couldn't achieve with binned scale and geom_bar()

@teunbrand
Copy link
Collaborator

From the linked issue:

I can't for example add a geom_vline() to mark a specific date on the axis, since the vertical line would then be snapped into a bin by the scale transform.

@arcresu
Copy link
Contributor Author

arcresu commented Nov 1, 2024

Thanks, yes it's exactly that. A typical use case I'm working with is epidemic curves that are annotated with contextual events. To pick a random example online, something like this (but sometimes the date ranges incolved are shorter so snapping to the bin centre is a more significant change):

Epidemic-curve-of-outbreaks-cases-and-deaths-due-to-confirmed-ILI-in-older-people

The main data is a histogram with date bins that are either epidemiological weeks (i.e. weeks with the boundary fixed to a specific day of the week depending on local concentions) or calendar months/years. Annotations refer to events on a specific day or are time series that might be binned differently. A binned scale would force all layers to use the same bins.

The follow.scale part of this change by itself could easily live in an extension, but the frozen breaks part is hard to do outside of ggplot.

@thomasp85
Copy link
Member

Ah, I see... Sorry for driving by with a half-informed suggestion 🙈

@thomasp85
Copy link
Member

I'm still very much against geoms/stats being able to control aspects of the scaling. This kind of flow of control could lead to incompatibility between layers etc (and make code harder to reason about)

@arcresu
Copy link
Contributor Author

arcresu commented Nov 1, 2024

The change I proposed is the opposite: scales influence stats. I get that it stretches the abstractions a bit unnaturally, though.

The status quo solution is to separately tell the scale and the stat to use the same bin/break width and offset. It's certainly doable when you understand the issue, but it tends to be brittle.

So I suppose a compromise could be tweaks to the way breaks and bins are specified (especially with date scales) so that there's a more discoverable way to control them with the same syntax? I've see a lot of cases where people worked out how to synchronise bin (binwidth=7) and break widths (date_breaks="week") but got stuck with the offsets.

@thomasp85
Copy link
Member

My reading skills are obviously impaired this morning 🙈

@arcresu
Copy link
Contributor Author

arcresu commented Nov 6, 2024

I've found a different way to get the result I'm after without needing any changes to the scale internals. Essentially rather than the scales going out of their way to avoid updating breaks after the first time, the breaks function is memoised by wrapping it along with an environment. I previously thought that interior mutability in a function wouldn't be possible.

library(ggplot2)
#> 
#> Attaching package: 'ggplot2'
#> The following object is masked from 'package:base':
#> 
#>     is.element

set.seed(2024)
df <- data.frame(date = as.Date("2024-01-01") + rnorm(100, 0, 5))

StatBin2 <- ggproto(
  "StatBin2", StatBin,
  compute_panel = function(self, data, scales, breaks = NULL, ...) {
    breaks <- breaks %||% scales$x$get_transformation()$inverse(scales$x$get_breaks())
    ggproto_parent(StatBin, self)$compute_panel(data, scales, breaks = breaks, ...)
  }
)

p <- ggplot(df, aes(date)) + geom_histogram(stat = StatBin2)

p + scale_x_date(breaks = scales::breaks_width("2 week"))
#> `stat_bin2()` using `bins = 30`. Pick better value with `binwidth`.

(this bit as before, with breaks undesirably recomputed after scales are expanded)

breaks_cached <- function(breaks) {
  ggplot2::ggproto(
    "BreaksCached", NULL,
    fn = breaks,
    cached = NULL,
    get_breaks = function(self, limits) {
      if (is.null(self$cached)) self$cached <- self$fn(limits)
      self$cached
    }
  )$get_breaks
}

p + scale_x_date(breaks = breaks_cached(scales::breaks_width("2 week")))
#> `stat_bin2()` using `bins = 30`. Pick better value with `binwidth`.

Created on 2024-11-06 with reprex v2.1.1

It's feasible for an extension to handle everything by providing its own version of each binning stat (which is a minor hassle to substitute in for the normal stats) along with breaks_cached().

Alternatively there could be a less invasive ggplot implementation with follow.scale on the binning stats and breaks_cached() in {scales}. It doesn't have to use ggproto - this also works:

breaks_cached <- function(f) {
  fn <- function(...) {
    if (is.null(.cache.env$cached)) {
      .cache.env$cached <- .cache.env$inner(...)
    }
    .cache.env$cached
  }
  e <- new.env()
  e$inner <- f
  e$cached <- NULL
  rlang::fn_env(fn)$.cache.env <- e
  fn
}

Does this approach feel better for ggplot?

@teunbrand
Copy link
Collaborator

I agree with Thomas that anything that alters the state of the scale is undesirable. While caching breaks technically alters state, I feel there is not really a way ggplot2 could enforce this, even if it wanted to, because a breaks function is entirely up to users to specify. For this reason, I think it may be feasible to give StatBin read-only access to the breaks (e.g. with get_breaks()). If users then use a cached function for breaks, we could assume they're doing this deliberately and take ownership of any consequences of that action (good and bad).

@arcresu
Copy link
Contributor Author

arcresu commented Dec 16, 2024

I've updated this based on my previous comment and fleshed it out a little with documentation. The breaks_cached helper could live in scales (using an alternative implementation) or left to an extension.

Does the follow.scale part look like something you'd take? It's useful, if a bit niche, even without breaks_cached as long as the scale's breaks are fixed. I would just need to add the equivalent feature to the other binning stats.

@arcresu arcresu force-pushed the bin_from_scale_breaks branch from 63c0ab1 to d6e536e Compare December 16, 2024 01:29
@teunbrand
Copy link
Collaborator

Thanks for the updates! I think aside from off/major/minor there should be a 4th option that would be the union of major and minor breaks. Do we need a separate follow.scale argument? Can't we use keywords like breaks = "major" that are invoked when is_string(breaks)?. I'm not convinced the cached breaks needs to live in ggplot2, this is perhaps more in the realm of the scales package. In addition, I'm unsure why the ggproto would be necessary there, as I imagine you can just take advantage of the closure property of functions instead.

@arcresu
Copy link
Contributor Author

arcresu commented Dec 16, 2024

Thanks for taking a look!

For sure, the param can be folded into the existing one if that isn't confusing.

I'm happy to see if the breaks helper fits in scales. Since it caches any function it's not even really specific to breaks, so maybe there's a better home for it (rlang?), just as long as ggplot can refer to it later in docs. As for why it uses ggproto... that was just the first thing I tried that worked. There's an alternative version in my comment above without ggproto that could probably be further simplified if I understood a bit better how closures work in R :)

Since it sounds like you're open to the general idea of this, I'll extend it to the other binning stats and clean it up as a proper PR.

@teunbrand
Copy link
Collaborator

teunbrand commented Dec 16, 2024

Before you extend this to other bin calculations, I'd wait for #6212 to be merged.
Closures work by knowing about the environment they were created in. So in the following example, the inner function knows about the cache variable. For reference: https://adv-r.hadley.nz/environments.html#function-environments

cache_breaks <- function(breaks_fun) {
  cache <- NULL
  function(...) {
    if (is.null(cache)) {
      cache <<- breaks_fun(...)
    }
    cache
  }
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Way to request stat_bin() to inherit breaks from the scale
3 participants