POC bin breaks derived from scale breaks #6174

arcresu · 2024-11-01T06:14:59Z

This is a proof of concept of the minimal changes necessary to fix #6159. If you're willing to consider this approach I'll finish it off with documentation, tests, and the outstanding TODOs below.

The first part is essentially the same as the extension discussed in the issue: i.e. the follow.scale param on stat_bin causes it to inherit bins from the scale. As noted, that only works if the scale doesn't get new breaks during the final retraining, i.e. provide fixed breaks, or disable scale expansion and hope other layers don't cause issues. In this example the bins don't align with the final breaks because the scale expands after the binning, causing the breaks to move.

(TODO: add follow.scale to the other binning stats. Suppress the default binning warning when follow.scale = TRUE. Add a value like follow.scales = "minor" to allow inheriting major and minor breaks?)

devtools::load_all("~/code/ggplot2")
#> ℹ Loading ggplot2

set.seed(2024)
df <- data.frame(
  date = as.Date("2024-01-01") + rnorm(100, 0, 5),
  z = sample(c("a", "b"), 100, replace = TRUE)
)

ggplot(df, aes(date)) +
  geom_histogram(follow.scale = TRUE)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The fix is to tell the scale that we want the breaks to be "frozen" before the stats are computed. Subsequent retraining is free to change the limits, which affects which breaks are shown, but once the breaks are frozen it acts as though they had been passed in as an explicit breaks vector.

(TODO: Add a param to the continuous scale constructor and scale_{x,y}_{continuous,date,datetime}. Maybe come up with a better name than freezing, like breaks_computation = c("auto", "before_stat"))

ggplot(df, aes(date)) +
  geom_histogram(follow.scale = TRUE) +
  ggproto(NULL, scale_x_date(), freeze_breaks = TRUE)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

It also looks reasonable when there are multiple facets:

ggplot(df, aes(date)) +
  geom_histogram(follow.scale = TRUE) +
  facet_wrap(vars(z)) +
  ggproto(NULL, scale_x_date(), freeze_breaks = TRUE)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Adding a distant data point and setting the scales to free, we can see that the binning is done independently for different facets:

rbind(df, data.frame(date = as.Date("2025-03-01") , z = "a")) |>
  ggplot(aes(date)) +
  geom_histogram(follow.scale = TRUE) +
  facet_wrap(vars(z), scales = "free_x") +
  ggproto(NULL, scale_x_date(), freeze_breaks = TRUE, guide = guide_axis(angle = 90)) 
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

^{Created on 2024-11-01 with reprex v2.1.1}

This seems probably desirable behaviour since we did explicitly request free scales here. Changing it to make the binning consistent across panels would also be a bit complicated because I think the facets clone the scales before the first time breaks are computed.

To make the combination of settings more discoverable, it's probably reasonable to add a warning when using follow.scale with a scale that doesn't have freeze_breaks = TRUE.

Please let me know if I've overlooked some way that these changes will cause problems with other parts of ggplot!

thomasp85 · 2024-11-01T07:57:31Z

I'm a bit weary of adding this kind of link between scales and stats tbh. It makes the code much harder to reason about. Is there anything in this you couldn't achieve with binned scale and geom_bar()

teunbrand · 2024-11-01T07:59:16Z

From the linked issue:

I can't for example add a geom_vline() to mark a specific date on the axis, since the vertical line would then be snapped into a bin by the scale transform.

arcresu · 2024-11-01T08:44:26Z

Thanks, yes it's exactly that. A typical use case I'm working with is epidemic curves that are annotated with contextual events. To pick a random example online, something like this (but sometimes the date ranges incolved are shorter so snapping to the bin centre is a more significant change):

The main data is a histogram with date bins that are either epidemiological weeks (i.e. weeks with the boundary fixed to a specific day of the week depending on local concentions) or calendar months/years. Annotations refer to events on a specific day or are time series that might be binned differently. A binned scale would force all layers to use the same bins.

The follow.scale part of this change by itself could easily live in an extension, but the frozen breaks part is hard to do outside of ggplot.

thomasp85 · 2024-11-01T08:59:21Z

Ah, I see... Sorry for driving by with a half-informed suggestion 🙈

thomasp85 · 2024-11-01T09:01:09Z

I'm still very much against geoms/stats being able to control aspects of the scaling. This kind of flow of control could lead to incompatibility between layers etc (and make code harder to reason about)

arcresu · 2024-11-01T09:29:37Z

The change I proposed is the opposite: scales influence stats. I get that it stretches the abstractions a bit unnaturally, though.

The status quo solution is to separately tell the scale and the stat to use the same bin/break width and offset. It's certainly doable when you understand the issue, but it tends to be brittle.

So I suppose a compromise could be tweaks to the way breaks and bins are specified (especially with date scales) so that there's a more discoverable way to control them with the same syntax? I've see a lot of cases where people worked out how to synchronise bin (binwidth=7) and break widths (date_breaks="week") but got stuck with the offsets.

thomasp85 · 2024-11-01T09:33:28Z

My reading skills are obviously impaired this morning 🙈

arcresu · 2024-11-06T03:39:10Z

I've found a different way to get the result I'm after without needing any changes to the scale internals. Essentially rather than the scales going out of their way to avoid updating breaks after the first time, the breaks function is memoised by wrapping it along with an environment. I previously thought that interior mutability in a function wouldn't be possible.

library(ggplot2)
#> 
#> Attaching package: 'ggplot2'
#> The following object is masked from 'package:base':
#> 
#>     is.element

set.seed(2024)
df <- data.frame(date = as.Date("2024-01-01") + rnorm(100, 0, 5))

StatBin2 <- ggproto(
  "StatBin2", StatBin,
  compute_panel = function(self, data, scales, breaks = NULL, ...) {
    breaks <- breaks %||% scales$x$get_transformation()$inverse(scales$x$get_breaks())
    ggproto_parent(StatBin, self)$compute_panel(data, scales, breaks = breaks, ...)
  }
)

p <- ggplot(df, aes(date)) + geom_histogram(stat = StatBin2)

p + scale_x_date(breaks = scales::breaks_width("2 week"))
#> `stat_bin2()` using `bins = 30`. Pick better value with `binwidth`.

(this bit as before, with breaks undesirably recomputed after scales are expanded)

breaks_cached <- function(breaks) {
  ggplot2::ggproto(
    "BreaksCached", NULL,
    fn = breaks,
    cached = NULL,
    get_breaks = function(self, limits) {
      if (is.null(self$cached)) self$cached <- self$fn(limits)
      self$cached
    }
  )$get_breaks
}

p + scale_x_date(breaks = breaks_cached(scales::breaks_width("2 week")))
#> `stat_bin2()` using `bins = 30`. Pick better value with `binwidth`.

^{Created on 2024-11-06 with reprex v2.1.1}

It's feasible for an extension to handle everything by providing its own version of each binning stat (which is a minor hassle to substitute in for the normal stats) along with breaks_cached().

Alternatively there could be a less invasive ggplot implementation with follow.scale on the binning stats and breaks_cached() in {scales}. It doesn't have to use ggproto - this also works:

breaks_cached <- function(f) {
  fn <- function(...) {
    if (is.null(.cache.env$cached)) {
      .cache.env$cached <- .cache.env$inner(...)
    }
    .cache.env$cached
  }
  e <- new.env()
  e$inner <- f
  e$cached <- NULL
  rlang::fn_env(fn)$.cache.env <- e
  fn
}

Does this approach feel better for ggplot?

teunbrand · 2024-11-12T15:17:48Z

I agree with Thomas that anything that alters the state of the scale is undesirable. While caching breaks technically alters state, I feel there is not really a way ggplot2 could enforce this, even if it wanted to, because a breaks function is entirely up to users to specify. For this reason, I think it may be feasible to give StatBin read-only access to the breaks (e.g. with get_breaks()). If users then use a cached function for breaks, we could assume they're doing this deliberately and take ownership of any consequences of that action (good and bad).

arcresu · 2024-12-16T00:58:41Z

I've updated this based on my previous comment and fleshed it out a little with documentation. The breaks_cached helper could live in scales (using an alternative implementation) or left to an extension.

Does the follow.scale part look like something you'd take? It's useful, if a bit niche, even without breaks_cached as long as the scale's breaks are fixed. I would just need to add the equivalent feature to the other binning stats.

teunbrand · 2024-12-16T08:16:00Z

Thanks for the updates! I think aside from off/major/minor there should be a 4th option that would be the union of major and minor breaks. Do we need a separate follow.scale argument? Can't we use keywords like breaks = "major" that are invoked when is_string(breaks)?. I'm not convinced the cached breaks needs to live in ggplot2, this is perhaps more in the realm of the scales package. In addition, I'm unsure why the ggproto would be necessary there, as I imagine you can just take advantage of the closure property of functions instead.

arcresu · 2024-12-16T08:42:02Z

Thanks for taking a look!

For sure, the param can be folded into the existing one if that isn't confusing.

I'm happy to see if the breaks helper fits in scales. Since it caches any function it's not even really specific to breaks, so maybe there's a better home for it (rlang?), just as long as ggplot can refer to it later in docs. As for why it uses ggproto... that was just the first thing I tried that worked. There's an alternative version in my comment above without ggproto that could probably be further simplified if I understood a bit better how closures work in R :)

Since it sounds like you're open to the general idea of this, I'll extend it to the other binning stats and clean it up as a proper PR.

teunbrand · 2024-12-16T08:50:31Z

Before you extend this to other bin calculations, I'd wait for #6212 to be merged.
Closures work by knowing about the environment they were created in. So in the following example, the inner function knows about the cache variable. For reference: https://adv-r.hadley.nz/environments.html#function-environments

cache_breaks <- function(breaks_fun) {
  cache <- NULL
  function(...) {
    if (is.null(cache)) {
      cache <<- breaks_fun(...)
    }
    cache
  }
}

POC bin breaks derived from scale breaks

f9618ac

arcresu marked this pull request as draft November 1, 2024 06:17

arcresu added 3 commits December 16, 2024 09:06

Revert break freezing functionality

93ff456

Add breaks_cached() implementation

0b601ee

Complete follow.scale feature for stat_bin()

4a7ce18

Fix default follow.scale inconsistency

d6e536e

arcresu force-pushed the bin_from_scale_breaks branch from 63c0ab1 to d6e536e Compare December 16, 2024 01:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

POC bin breaks derived from scale breaks #6174

POC bin breaks derived from scale breaks #6174

arcresu commented Nov 1, 2024

thomasp85 commented Nov 1, 2024

teunbrand commented Nov 1, 2024

arcresu commented Nov 1, 2024

thomasp85 commented Nov 1, 2024

thomasp85 commented Nov 1, 2024

arcresu commented Nov 1, 2024

thomasp85 commented Nov 1, 2024

arcresu commented Nov 6, 2024 •

edited

Loading

teunbrand commented Nov 12, 2024

arcresu commented Dec 16, 2024

teunbrand commented Dec 16, 2024

arcresu commented Dec 16, 2024

teunbrand commented Dec 16, 2024 •

edited

Loading

POC bin breaks derived from scale breaks #6174

Are you sure you want to change the base?

POC bin breaks derived from scale breaks #6174

Conversation

arcresu commented Nov 1, 2024

thomasp85 commented Nov 1, 2024

teunbrand commented Nov 1, 2024

arcresu commented Nov 1, 2024

thomasp85 commented Nov 1, 2024

thomasp85 commented Nov 1, 2024

arcresu commented Nov 1, 2024

thomasp85 commented Nov 1, 2024

arcresu commented Nov 6, 2024 • edited Loading

teunbrand commented Nov 12, 2024

arcresu commented Dec 16, 2024

teunbrand commented Dec 16, 2024

arcresu commented Dec 16, 2024

teunbrand commented Dec 16, 2024 • edited Loading

arcresu commented Nov 6, 2024 •

edited

Loading

teunbrand commented Dec 16, 2024 •

edited

Loading