Detect a slow raidz child during reads #16900
base: master
Conversation
A single slow-responding disk can affect the overall read performance of a raidz group. When a raidz child disk is determined to be a persistent slow outlier, it sits out during reads for a period of time, and the raidz group uses parity to reconstruct the data that was skipped.

Each time a slow disk is placed into a sit out period, its `vdev_stat.vs_slow_ios` count is incremented and a zevent of class `ereport.fs.zfs.delay` is posted.

The length of the sit out period can be changed using the `raid_read_sit_out_secs` module parameter. Setting it to zero disables slow outlier detection.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Don Brady <[email protected]>
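As a rough illustration of the sit-out gate described above, a child could be skipped for reads while its sit-out window has not yet expired. This is a minimal sketch only: the expiry field name vdev_read_sit_out_expire is hypothetical, while raid_read_sit_out_secs is the tunable named in the description.

	/*
	 * Minimal sketch of the sit-out gate: skip a child for reads while
	 * its sit-out window is still open. vdev_read_sit_out_expire is a
	 * hypothetical field name; raid_read_sit_out_secs is the tunable
	 * described in the PR text.
	 */
	static boolean_t
	vdev_sit_out_reads(vdev_t *vd)
	{
		if (raid_read_sit_out_secs == 0)
			return (B_FALSE);	/* detection disabled */

		return (vd->vdev_read_sit_out_expire != 0 &&
		    vd->vdev_read_sit_out_expire > gethrtime());
	}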
	    zio->io_size >= two_sectors && zio->io_delay != 0) {
		vdev_t *vd = zio->io_vd;

		atomic_store_64(&vd->vdev_recent_latency, zio->io_delay);
hrtime_t is not uint64_t, it is long long in variations. Clang on FreeBSD does not appreciate this.
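One minimal way to address this (a sketch, not necessarily the PR's final fix) is an explicit cast when storing the hrtime_t delay into the uint64_t field:

	/*
	 * zio->io_delay is an hrtime_t (a long long variant), while
	 * atomic_store_64() operates on uint64_t. An explicit cast avoids
	 * the sign/width mismatch Clang on FreeBSD complains about;
	 * io_delay is non-negative for a completed I/O, so the
	 * conversion is safe.
	 */
	atomic_store_64(&vd->vdev_recent_latency, (uint64_t)zio->io_delay);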
	 */
	uint64_t two_sectors = 2ULL << zio->io_vd->vdev_top->vdev_ashift;
	if (zio->io_type == ZIO_TYPE_READ && zio->io_error == 0 &&
	    zio->io_size >= two_sectors && zio->io_delay != 0) {
Could you explain why we care about the two sectors (all data columns) here?

Not accounting for aggregated ZIOs makes this algorithm even more random than periodic sampling alone would be. With RAIDZ splitting ZIOs between vdevs into smaller ones, they are good candidates for aggregation.
	if (parity_avail > 0 &&
	    c >= rr->rr_firstdatacol &&
	    rr->rr_missingdata == 0 &&
	    vdev_skip_latency_outlier(cvd, zio->io_flags)) {
This seems to be O(n^2): for each vdev you take a lock and compare it to all other vdevs.
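A sketch of one way to avoid that pattern: snapshot all child latencies and compute the fence once per periodic scan, so each child then needs only an O(1) comparison against the cached result. The helper names latency_sort() and latency_quartiles_fence() are the PR's own; the wrapper function itself is hypothetical.

	static uint64_t
	raidz_latency_fence(vdev_t *raidvd)
	{
		uint64_t n = raidvd->vdev_children;
		uint64_t *lat = kmem_alloc(n * sizeof (uint64_t), KM_SLEEP);

		/* snapshot each child's most recent read latency */
		for (uint64_t c = 0; c < n; c++)
			lat[c] = raidvd->vdev_child[c]->vdev_recent_latency;

		latency_sort(lat, n);
		uint64_t fence = latency_quartiles_fence(lat, n);

		kmem_free(lat, n * sizeof (uint64_t));
		return (fence);
	}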
	/*
	 * Calculate how much parity is available for sitting out reads
	 */
	int parity_avail = rr->rr_firstdatacol;
	for (int p = 0; p < rr->rr_firstdatacol; p++) {
		raidz_col_t *rc = &rr->rr_col[p];
		if (rc->rc_size > 0 &&
		    !vdev_readable(vd->vdev_child[rc->rc_devidx])) {
			parity_avail--;
		}
	}
This does not take into account the vdev_dtl_contains() check below. Some disks, despite being readable, might not have valid data.
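A sketch of how the parity-availability loop above might also consult the DTL, per this comment. It uses the real vdev_dtl_contains() interface; whether zio->io_txg and a size of 1 are the right arguments here depends on the surrounding code.

	int parity_avail = rr->rr_firstdatacol;
	for (int p = 0; p < rr->rr_firstdatacol; p++) {
		raidz_col_t *rc = &rr->rr_col[p];
		vdev_t *cvd = vd->vdev_child[rc->rc_devidx];

		/*
		 * A parity column is unusable both when its vdev is
		 * unreadable and when the DTL says it may be missing
		 * valid data for this txg.
		 */
		if (rc->rc_size > 0 &&
		    (!vdev_readable(cvd) ||
		    vdev_dtl_contains(cvd, DTL_MISSING, zio->io_txg, 1)))
			parity_avail--;
	}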
	/* Periodically check for a read outlier */
	if (zio->io_type == ZIO_TYPE_READ)
		vdev_child_slow_outlier(zio);
Why is this within the loop on rm_nrows if it does not take it as an argument?
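A sketch of the hoisting this comment suggests, assuming the call really is row-independent:

	/* Periodically check for a read outlier -- once per zio, not per row */
	if (zio->io_type == ZIO_TYPE_READ)
		vdev_child_slow_outlier(zio);

	for (int i = 0; i < rm->rm_nrows; i++) {
		raidz_row_t *rr = rm->rm_row[i];
		/* ... issue the per-row child reads as before ... */
	}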
	latency_sort(lat_data, samples);
	uint64_t fence = latency_quartiles_fence(lat_data, samples);
	if (lat_data[samples - 1] > fence) {
		/*
		 * Keep track of how many times this child has had
		 * an outlier read. A disk that persistently has a
		 * higher outlier count than its peers will be
		 * considered a slow disk.
		 */
		atomic_add_64(&svd->vdev_outlier_count, 1);
With a small number of children, and only one random sample from each, I really doubt this math can be statistically meaningful.
Motivation and Context
There is a concern, and it has been observed in practice, that a slow disk can drag down the overall read performance of a raidz group. Currently in ZFS, a slow disk is detected by comparing its read latency to a fixed threshold value, such as 30 seconds. This can be tuned to a lower threshold, but that requires understanding the context in which it will be applied, and hybrid pools can have a wide range of expected disk latencies.

A better approach might be to identify a slow disk outlier based on its latency distance from the latencies of its peers. This offers a more dynamic solution that can adapt to different types of media and workloads.
Description
The solution proposed here comes in two parts:
Detecting Outliers
The most recent latency value for each child is saved in the vdev_t. Periodically, the samples from all the children are sorted, and a statistical outlier, if present, is detected. The code uses a Tukey's fence with K = 2 for detecting extreme outliers. This rule defines extreme outliers as data points beyond the third quartile plus two times the interquartile range (IQR), the distance between the first and third quartiles.
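A minimal sketch of that fence computation, assuming a sorted sample array and simple index-based quartiles (the PR's latency_quartiles_fence() may interpolate differently):

	static uint64_t
	latency_fence_sketch(const uint64_t *sorted, int n)
	{
		uint64_t q1 = sorted[n / 4];		/* first quartile */
		uint64_t q3 = sorted[(3 * n) / 4];	/* third quartile */
		uint64_t iqr = q3 - q1;			/* interquartile range */

		return (q3 + 2 * iqr);			/* Q3 + K * IQR, K = 2 */
	}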
Sitting Out
After a vdev has encountered multiple outlier detections (> 50), it is marked for a sit out period that by default lasts for 10 minutes.
Each time a slow disk is placed into a sit out period, its vdev_stat.vs_slow_ios count is incremented and a zevent of class ereport.fs.zfs.delay is posted.

The length of the sit out period can be changed using the raid_read_sit_out_secs module parameter. Setting it to zero disables slow outlier detection.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
How Has This Been Tested?
Tested with various configs, including dRAID.
For an extreme example, an HDD was used in an 8-wide SSD raidz2, and the results were compared to taking the HDD offline. The test used a fio(1) streaming read workload across 4 threads to 20 GB files. Both the record size and the I/O request size were 1 MB.

Also measured was the cost over time of vdev_child_slow_outlier(), where the statistical analysis occurs (every 20 ms).
Types of changes
Checklist:
All commit messages are properly formatted and contain Signed-off-by.