This is Google's cache of https://ext4.wiki.kernel.org/index.php/Ext4_Metadata_Checksums. It is a snapshot of the page as it appeared on Sep 2, 2011 00:09:49 GMT.

Ext4 Metadata Checksums

Overview

TLDR: Add crc32c to ext4 superblock, inode, block and inode bitmap, extent tree, directory block, htree block, and extended attribute objects with as few disk layout adjustments as possible.

Regular: As much as we wish our storage hardware were 100% reliable, it is still quite possible for data to be corrupted on disk, corrupted during transfer over a wire, or written to the wrong places. To protect against this sort of non-hostile corruption, it is desirable to store checksums of metadata objects on the filesystem so that broken metadata is detected before it can shred the filesystem. In theory, btrfs has stronger guarantees against corruption (uniform checksums on _all_ metadata blocks, redundant copies of all metadata, etc.), but this retrofit to ext4 will provide stronger protection for users who want to stay on ext4, at the fairly low cost of a single tune2fs/e2fsck run.

This document is intended to record Darrick's metadata checksum design as he works on writing the necessary patches.

Algorithm

The popular sentiment is that a CRC will suffice to detect bit flips and various other kinds of corruption. The existing block group checksum uses the ANSI CRC16 polynomial (0x8005), which probably suffices for 32-byte block group descriptors. However, this crc16 may not be the most desirable function for the other metadata objects; longer CRCs are generally better at detecting errors as the data being checksummed gets larger. That is expected to matter here, since the bitmaps and the directory blocks are generally 4KiB in size.

The CRC32c polynomial (0x1EDC6F41) seems to have stronger error detection abilities than regular CRC32 (0x04C11DB7). It is implemented in hardware on Intel Core i7 CPUs and can be made to run reasonably quickly on other processors. Therefore, it seems desirable to use it. Further study is required to determine which CRCs (and which implementations) are fastest.
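
To make the polynomial above concrete, here is a minimal bitwise sketch of CRC32c in Python. It assumes the common reflected convention (initial value and final XOR of 0xFFFFFFFF); other implementations, including kernel helpers, may leave seeding and inversion to the caller, so treat the conventions here as an assumption rather than the format this design will use.

```python
# Bit-reversed (reflected) form of the CRC32c polynomial 0x1EDC6F41.
CRC32C_POLY_REFLECTED = 0x82F63B78

def reflect32(value: int) -> int:
    """Reverse the bit order of a 32-bit value."""
    return int(f"{value:032b}"[::-1], 2)

def crc32c(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC32c; pass a previous result as `crc` to continue a stream."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (CRC32C_POLY_REFLECTED if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF
```

This is far too slow for real use; it is only meant as a reference against which faster table-driven or hardware versions can be checked.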

CRC Stuffing

For the space-constrained block groups (at least in standard 32-bit mode), it has been suggested that because CRC16 is implemented in software, we should find a way to use the fast crc32c function yet somehow shrink the checksum to fit in 16 bits.

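For the bitmap checksums it seems possible to take advantage of the property crc32(a ^ b) = crc32(a) ^ crc32(b), which holds for equal-length inputs when the CRC is computed with a zero initial value and no final XOR.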
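
The XOR property can be checked with a short Python sketch. The helper names and test buffers are made up for illustration. With a zero initial value and no final XOR the identity holds as written; with the usual conventions (as in zlib.crc32) it holds only after also XORing in the CRC of an all-zero buffer of the same length.

```python
import zlib

POLY = 0xEDB88320  # reflected form of the CRC32 polynomial 0x04C11DB7

def raw_crc32(data: bytes) -> int:
    """CRC32 with initial value 0 and no final XOR (purely linear)."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (POLY if crc & 1 else 0)
    return crc

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length buffers byte by byte."""
    return bytes(x ^ y for x, y in zip(a, b))

# Two arbitrary 4KiB buffers standing in for bitmap blocks.
a = bytes((i * 7 + 3) & 0xFF for i in range(4096))
b = bytes((i * 13 + 5) & 0xFF for i in range(4096))
zeros = bytes(4096)

# Linear form: crc(a ^ b) == crc(a) ^ crc(b).
raw_holds = raw_crc32(xor_bytes(a, b)) == raw_crc32(a) ^ raw_crc32(b)

# Standard form: the seed and final XOR add a length-dependent constant.
std_holds = zlib.crc32(xor_bytes(a, b)) == (
    zlib.crc32(a) ^ zlib.crc32(b) ^ zlib.crc32(zeros)
)
```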

Benchmarking

I culled crc code from the Linux kernel and e2fsprogs, and linked it all into a big dumb program that crcs a large block of data. Following are bandwidth results (K/s) from various machines and a block size of 512MB:

| machine | clock | crc16 | crc16-t10dif | crc32-kern-be | crc32-kern-le | crc32-e2fs-be | crc32-e2fs-le | crc32c | crc32c-intel | crc32c-by8-be | crc32c-by8-le | crc32c-intelby8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Xeon X5650 | 2.67GHz | 381,856 | 293,951 | 1,039,389 | 1,059,679 | 454,377 | 454,133 | 419,964 | 4,447,431 | 1,684,071 | 1,698,309 | 1,843,101 |
| Core i7-950 | 3.06GHz | 363,599 | 279,431 | 996,363 | 994,851 | 429,477 | 428,275 | 398,382 | 4,131,127 | 1,573,210 | 1,593,776 | 1,714,893 |
| 3.6GHz Pentium 4 | 3.6GHz | 391,433 | 345,666 | 915,502 | 925,717 | 512,564 | 511,035 | 437,946 | n/a | 1,097,068 | 1,146,004 | 1,099,856 |
| Core2 6700 | 2.67GHz | 332,726 | 320,891 | 933,688 | 937,826 | 453,658 | 453,377 | 390,229 | n/a | 1,653,483 | 1,652,018 | 1,362,838 |
| 1.5GHz POWER5+ | 1.5GHz | 160,096 | 111,396 | 285,927 | 314,650 | 169,446 | 169,447 | 160,106 | n/a | 620,102 | 624,184 | 599,048 |
| 1.9GHz POWER5+ | 1.9GHz | 202,266 | 140,844 | 360,555 | 396,713 | 214,207 | 214,202 | 202,224 | n/a | 807,723 | 808,243 | 775,657 |
| Athlon64 X2 4200+ | 2.2GHz | 261,927 | 298,252 | 767,435 | 767,469 | 392,507 | 392,520 | 337,204 | n/a | 1,193,278 | 1,102,328 | 1,136,237 |
| 3GHz Pentium 4 | 3GHz | 360,264 | 307,781 | 793,679 | 790,873 | 421,749 | 421,491 | 393,766 | n/a | 935,662 | 942,952 | 910,220 |
| 1GHz Pentium 3 | 1GHz | 67,448 | 68,429 | 157,668 | 157,609 | 116,705 | 116,294 | 107,558 | n/a | | | |
| VIA C7 | 2GHz | 133,243 | 132,670 | 296,732 | 296,757 | 228,180 | 228,417 | 153,906 | n/a | 339,504 | 343,237 | 327,777 |
| VIA C7 | 800MHz | 52,759 | 52,765 | 118,037 | 118,832 | 90,874 | 90,483 | 60,962 | n/a | 138,600 | 137,445 | 132,069 |
| Opteron 8218 | 2.6GHz | 304,453 | 346,510 | 888,013 | 890,044 | 454,597 | 454,210 | 391,157 | n/a | 1,189,312 | 1,176,844 | 1,176,380 |
| Xeon E5450 | 3GHz | 405,184 | 326,124 | 1,052,806 | 1,055,434 | 511,349 | 510,867 | 421,542 | n/a | 1,675,781 | 1,686,921 | 1,816,082 |
| P4 Xeon MP | 2.7GHz | 174,024 | 150,326 | 267,248 | 267,390 | 175,788 | 176,342 | 185,110 | n/a | 319,609 | 320,717 | 270,821 |
| Xeon E3110 | 3GHz | 406,181 | 326,324 | 1,055,929 | 1,057,013 | 518,032 | 516,353 | 422,631 | n/a | 1,676,384 | 1,696,455 | 1,831,592 |
| 500MHz PIII | 500MHz | 34,034 | 34,778 | 93,968 | 96,528 | 62,248 | 62,896 | 55,315 | n/a | 121,693 | 121,570 | 116,931 |
| Core2 T7400 | 2.16GHz | 277,295 | 261,794 | 758,097 | 758,311 | 367,066 | 366,937 | 316,754 | n/a | 1,329,832 | 1,328,357 | 1,088,756 |
| Core2 T2300 | 1.66GHz | 210,691 | 232,884 | 586,950 | 587,660 | 298,031 | 297,973 | 239,845 | n/a | 855,838 | 855,600 | 763,868 |
| Core2 T7500 | 2.2GHz | 304,027 | 286,315 | 835,736 | 836,694 | 400,011 | 400,388 | 348,750 | n/a | 1,465,904 | 1,467,464 | 1,181,531 |
| Xeon X5550 | 2.67GHz | 385,203 | 296,862 | 1,053,178 | 1,054,078 | 455,272 | 455,312 | 422,926 | 4,351,392 | 1,667,016 | 1,676,230 | 1,822,632 |
| PowerMac G5 | 2GHz | 212,214 | 147,982 | 377,590 | 417,308 | 225,339 | 225,339 | 212,190 | n/a | 738,237 | 736,327 | 728,993 |
| Xeon X5570 | 2.93GHz | 384,908 | 259,286 | 855,428 | 855,416 | 421,520 | 421,524 | 406,596 | 4,283,526 | 1,818,824 | 1,818,756 | 1,632,126 |
| Xeon X7560 | 2.3GHz | 197,739 | 140,100 | 427,931 | 427,931 | 213,622 | 224,348 | 204,148 | 2,143,132 | 898,852 | 889,125 | 863,381 |
| Opteron 8354 | 2.2GHz | 257,997 | 258,429 | 650,962 | 650,855 | 369,342 | 367,794 | 337,798 | n/a | 984,548 | 984,264 | 996,814 |
| Core i7?? | 2.6GHz | 241,697 | 193,481 | 597,500 | 597,550 | 267,273 | 267,266 | 264,275 | 3,249,929 | 1,257,160 | 1,257,236 | 1,219,009 |
| Xeon E5335 | 2GHz | 268,751 | 216,926 | 696,639 | 695,757 | 344,071 | 342,600 | 280,610 | n/a | 1,115,085 | 1,124,472 | 1,217,095 |
| P7? | 3.3GHz | 321,760 | 205,227 | 417,772 | 460,069 | 257,957 | 258,019 | 320,922 | n/a | 933,644 | 902,453 | 929,237 |
| P6? | 4GHz | 409,852 | 388,417 | 815,203 | 910,645 | 471,649 | 486,732 | 431,875 | n/a | 1,233,702 | 1,290,849 | 1,239,037 |

Here is a description of the various CRC implementations tested:

| algorithm | description |
|---|---|
| crc16 | ANSI CRC16 algorithm in kernel (Sarwate) |
| crc16-t10dif | T10 CRC16 used for DIF in kernel (Sarwate) |
| crc32-kern-be | BE CRC32 in kernel (slice by 4) |
| crc32-kern-le | LE CRC32 in kernel (slice by 4) |
| crc32-e2fs-be | BE CRC32 in e2fsprogs 1.41 (slice by 4) |
| crc32-e2fs-le | LE CRC32 in e2fsprogs 1.41 (slice by 4) |
| crc32c | Default CRC32C in kernel (Sarwate) |
| crc32c-intel | Accelerated CRC32C on Intel Core i7 |
| crc32c-by8-be | Bob Pearson's updated BE CRC32 algorithm, but with CRC32C polynomial (slice by 8) |
| crc32c-by8-le | Bob Pearson's updated LE CRC32 algorithm, but with CRC32C polynomial (slice by 8) |
| crc32c-intelby8 | Intel's CRC32C algorithm http://prdownloads.sourceforge.net/slicing-by-8/ (slice by 8) |

At a 4K block size the time slices are so tiny that it's difficult to identify any clear trends.
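
One way to get a usable number at small block sizes is to checksum many blocks back to back and time the whole run. The sketch below is not the benchmark program described above (that one linked C code from the kernel and e2fsprogs); it is a minimal Python harness, with zlib.crc32 as a stand-in checksum and a made-up function name, that reports throughput in KiB/s.

```python
import time
import zlib

def throughput_kib_per_s(checksum, block_size: int, total_bytes: int) -> float:
    """Checksum `total_bytes` of data in `block_size` chunks; return KiB/s.

    `checksum` must accept (data, previous_crc) like zlib.crc32 does.
    """
    buf = bytes(block_size)
    rounds = max(1, total_bytes // block_size)
    crc = 0
    start = time.perf_counter()
    for _ in range(rounds):
        crc = checksum(buf, crc)
    # Guard against a zero interval on very fast runs.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return (rounds * block_size / 1024) / elapsed
```

For example, throughput_kib_per_s(zlib.crc32, 4096, 64 * 1024 * 1024) times 16,384 consecutive 4KiB blocks instead of a single one, which smooths out timer granularity.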

It is well known that Sarwate's algorithm has been superseded (performance-wise) by the slice-by-N implementations; these results support that conclusion. All slice-by-N implementations had #define'd a polynomial, making it trivially easy to port the code to the "default" CRC32C implementation. Obviously, the hardware solution eats all the others for lunch, though it only exceeds the slice-by-8 algorithm by a factor of ~2.5x and the slice-by-4 algorithms by a factor of ~4x. Either way, 1.5GB/s of _metadata_ updates is quite a lot, so the performance hit might not be too severe provided we can replace the current software crc32c code with one of these slice-by algorithms.
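
To illustrate the difference between the two software approaches, here is a small Python sketch of a byte-at-a-time table lookup (Sarwate) next to a slice-by-4 loop, both for the reflected CRC32c polynomial. It is a toy written for this page, not the kernel or e2fsprogs code, and Python will not reproduce the speed ratios measured above; the point is only that the slice-by-4 loop consumes four input bytes per iteration using four tables and yields the same result.

```python
POLY = 0x82F63B78  # reflected CRC32c polynomial

def make_tables(count: int):
    """Build `count` 256-entry tables; table k advances a byte past k extra zero bytes."""
    t0 = []
    for i in range(256):
        c = i
        for _ in range(8):
            c = (c >> 1) ^ (POLY if c & 1 else 0)
        t0.append(c)
    tables = [t0]
    for _ in range(1, count):
        prev = tables[-1]
        tables.append([(v >> 8) ^ t0[v & 0xFF] for v in prev])
    return tables

T = make_tables(4)

def crc32c_sarwate(data: bytes, crc: int = 0) -> int:
    """One table lookup per input byte."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc = (crc >> 8) ^ T[0][(crc ^ byte) & 0xFF]
    return crc ^ 0xFFFFFFFF

def crc32c_slice4(data: bytes, crc: int = 0) -> int:
    """Four table lookups per 32-bit little-endian word of input."""
    crc ^= 0xFFFFFFFF
    i, n = 0, len(data)
    while n - i >= 4:
        crc ^= int.from_bytes(data[i:i + 4], "little")
        crc = (T[3][crc & 0xFF] ^ T[2][(crc >> 8) & 0xFF]
               ^ T[1][(crc >> 16) & 0xFF] ^ T[0][crc >> 24])
        i += 4
    while i < n:  # leftover bytes, handled one at a time
        crc = (crc >> 8) ^ T[0][(crc ^ data[i]) & 0xFF]
        i += 1
    return crc ^ 0xFFFFFFFF
```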

As a side note, it is also desirable to optimize the crc16-t10dif algorithm, not for ext4 but for DIF disks.

Also, I hear that the upcoming SPARC T4 will have hardware CRC32c acceleration.

Existing Metadata Checksumming

Block Groups

The block group descriptor is protected by a CRC16. On a 64-bit filesystem, it may be possible either to extend the field to 32 bits or to stuff a 32-bit CRC into 16 bits per the "CRC Stuffing" section above.
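
As a rough illustration of a descriptor checksum that skips its own storage, here is a Python sketch using the ANSI CRC16 polynomial. Several details are assumptions not stated on this page: the checksum field offset (0x1E), seeding the CRC with 0xFFFF, and mixing in the filesystem UUID and group number. They are modeled on how the existing code is understood to behave and should be checked against the kernel source before being relied on.

```python
import struct

def crc16(crc: int, data: bytes) -> int:
    """ANSI CRC16 (polynomial 0x8005, processed in reflected form 0xA001)."""
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0xA001 if crc & 1 else 0)
    return crc

CSUM_OFFSET = 0x1E  # assumed position of the 16-bit checksum field

def group_desc_csum(uuid: bytes, group: int, desc: bytes) -> int:
    """Checksum a descriptor, skipping the two bytes that hold the checksum."""
    crc = crc16(0xFFFF, uuid)                     # assumed: seed with UUID
    crc = crc16(crc, struct.pack("<I", group))    # assumed: mix in group number
    crc = crc16(crc, desc[:CSUM_OFFSET])
    crc = crc16(crc, desc[CSUM_OFFSET + 2:])      # empty for 32-byte descriptors
    return crc
```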

Journal

jbd2 has a (probably infrequently used) journal_checksum feature that ensures the integrity of the journal contents. Nominally it supports CRC32, MD5, or SHA1 checksums, though as of Linux 3.0 only CRC32 seems to be implemented. This can be easily switched over to CRC32c.

On-Disk Structure Modifications

Darrick will try to implement this without requiring an on-disk format change. Basically, that means that we have to find places where checksums can be crammed into existing data structures.

Superblock

Andi Kleen posted a patch to checksum the superblock. Darrick plans to massage this patch a little bit; the crc32c will be pasted into the superblock at offset 0x24C.
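
A minimal Python sketch of storing and checking a superblock checksum at offset 0x24C follows. The 1024-byte superblock size, the little-endian encoding, and the choice to zero the checksum field while computing are assumptions made for illustration; the actual patch may exclude the field differently.

```python
import struct

def crc32c(data: bytes) -> int:
    """Bitwise CRC32c (reflected polynomial 0x82F63B78)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

SB_SIZE = 1024         # assumed superblock size
SB_CSUM_OFFSET = 0x24C  # offset given in the text

def superblock_csum(sb: bytes) -> int:
    """Checksum the superblock with the checksum field treated as zero."""
    scratch = bytearray(sb)
    scratch[SB_CSUM_OFFSET:SB_CSUM_OFFSET + 4] = bytes(4)
    return crc32c(bytes(scratch))

def set_superblock_csum(sb: bytearray) -> None:
    """Compute the checksum and store it in place."""
    struct.pack_into("<I", sb, SB_CSUM_OFFSET, superblock_csum(bytes(sb)))

def verify_superblock(sb: bytes) -> bool:
    """Return True if the stored checksum matches the contents."""
    (stored,) = struct.unpack_from("<I", sb, SB_CSUM_OFFSET)
    return stored == superblock_csum(sb)
```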

Inodes

Inode checksums are only supported on Linux. The checksum is a crc32c field at offset 0x7C, which puts it in the middle of osd2.linux2. The checksum covers the inode and everything else that follows it (afaik in-inode extended attribute blocks).
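
The point that the checksum covers more than the core inode structure can be sketched as follows in Python. The 256-byte on-disk inode record and the zeroing of the checksum field during computation are assumptions for illustration; only the 0x7C offset comes from the text.

```python
import struct

def crc32c(data: bytes) -> int:
    """Bitwise CRC32c (reflected polynomial 0x82F63B78)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

INODE_RECORD_SIZE = 256   # assumed on-disk inode size (core inode + extra space)
INODE_CSUM_OFFSET = 0x7C  # offset given in the text

def inode_csum(record: bytes) -> int:
    """Checksum the whole inode record, including the space after the core inode."""
    scratch = bytearray(record)
    scratch[INODE_CSUM_OFFSET:INODE_CSUM_OFFSET + 4] = bytes(4)
    return crc32c(bytes(scratch))

def set_inode_csum(record: bytearray) -> None:
    struct.pack_into("<I", record, INODE_CSUM_OFFSET, inode_csum(bytes(record)))

def verify_inode(record: bytes) -> bool:
    (stored,) = struct.unpack_from("<I", record, INODE_CSUM_OFFSET)
    return stored == inode_csum(record)
```

Because the whole record is covered, a flipped bit in the in-inode extended attribute area (for example at byte 200) fails verification just like one in the core fields.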

Inode/Block Bitmap (64-bit)

Each bitmap has its own crc32c checksum; both checksums are stored in the block group descriptor. The inode bitmap checksum is at offset 0x18, and the block bitmap checksum is at offset 0x38. This only works if the 64bit feature is set, unfortunately.
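
A short Python sketch of the 64-bit layout described above: each bitmap block gets its own crc32c, stored in the descriptor at 0x18 (inode bitmap) and 0x38 (block bitmap). The 64-byte descriptor size and little-endian encoding are assumptions; the function names are made up for illustration.

```python
import struct

def crc32c(data: bytes) -> int:
    """Bitwise CRC32c (reflected polynomial 0x82F63B78)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

INODE_BITMAP_CSUM_OFFSET = 0x18  # from the text
BLOCK_BITMAP_CSUM_OFFSET = 0x38  # from the text; needs a descriptor > 32 bytes

def store_bitmap_csums(desc: bytearray, inode_bitmap: bytes, block_bitmap: bytes) -> None:
    """Write both bitmap checksums into the block group descriptor."""
    struct.pack_into("<I", desc, INODE_BITMAP_CSUM_OFFSET, crc32c(inode_bitmap))
    struct.pack_into("<I", desc, BLOCK_BITMAP_CSUM_OFFSET, crc32c(block_bitmap))

def bitmaps_ok(desc: bytes, inode_bitmap: bytes, block_bitmap: bytes) -> bool:
    """Check both bitmaps against the checksums stored in the descriptor."""
    (ino,) = struct.unpack_from("<I", desc, INODE_BITMAP_CSUM_OFFSET)
    (blk,) = struct.unpack_from("<I", desc, BLOCK_BITMAP_CSUM_OFFSET)
    return ino == crc32c(inode_bitmap) and blk == crc32c(block_bitmap)
```

The block bitmap checksum sits beyond byte 32, which is why this layout depends on the larger descriptors that come with the 64bit feature.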

Inode/Block Bitmap (32-bit)

For 32-bit filesystems, Darrick is considering using the 16-bit fields in the block group descriptor at offset 0x18 and 0x20 to store either crc16 or stuffed crc32c values of the inode and block bitmaps. It's probably better to have a slow crc16 over no crc at all.

Extent Tree

Filesystem blocks are always 1024, 2048, or 4096 bytes, and the extent tree header and entry structures are both 12 bytes long. Therefore, because 2^n % 12 >= 4 for these block sizes, there is sufficient space to store a crc32c just past the end of the last struct ext4_extent. The checksum is computed over only the part of the extent block that is in use.
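
The arithmetic behind that claim can be worked through in a few lines of Python. The function name is made up, and placing the tail after the last slot the block can hold (rather than after the last entry currently in use) is an assumption for illustration.

```python
HEADER_SIZE = 12  # extent tree header, per the text
ENTRY_SIZE = 12   # extent tree entry, per the text
CSUM_SIZE = 4     # crc32c

def extent_block_layout(block_size: int):
    """Return (max_entries, tail_offset, spare_bytes) for one extent block."""
    max_entries = (block_size - HEADER_SIZE) // ENTRY_SIZE
    tail_offset = HEADER_SIZE + max_entries * ENTRY_SIZE
    spare_bytes = block_size - tail_offset
    return max_entries, tail_offset, spare_bytes

for size in (1024, 2048, 4096):
    entries, tail, spare = extent_block_layout(size)
    # The leftover bytes equal block_size % 12 and must fit a 4-byte checksum.
    assert spare == size % 12 and spare >= CSUM_SIZE
```

For a 4096-byte block this gives 340 entry slots ending at byte 4092, leaving exactly 4 spare bytes.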

Directory Blocks

Regular directory leaf blocks (i.e. blocks that are not secretly htree nodes) are a semi-packed array of variable-length records. A 12-byte directory entry is created at the end of the block with an inode of 0, to make the entry look unused to old ext4 drivers; a name_len of 0; and a rec_len large enough to hold a crc32c. In a cursory analysis of 250,000 directories, just 29 had blocks that did not have sufficient space to hold the 12-byte tail. tune2fs will advise users to run e2fsck -D to rebuild all directories so that all directory blocks may have a checksum.
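
The fake tail entry can be sketched in Python as below. The field order follows the usual directory entry header (32-bit inode, 16-bit rec_len, 8-bit name_len, 8-bit file type) followed by the 32-bit checksum. Setting the file type byte to 0 and checksumming everything before the tail are assumptions, and the sketch presumes the caller has already freed the last 12 bytes of the block.

```python
import struct

def crc32c(data: bytes) -> int:
    """Bitwise CRC32c (reflected polynomial 0x82F63B78)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

TAIL_SIZE = 12
TAIL_FORMAT = "<IHBBI"  # inode, rec_len, name_len, file_type, checksum

def add_dir_tail(block: bytearray) -> None:
    """Overwrite the last 12 bytes with an unused-looking entry holding the crc32c."""
    csum = crc32c(bytes(block[:-TAIL_SIZE]))
    block[-TAIL_SIZE:] = struct.pack(TAIL_FORMAT, 0, TAIL_SIZE, 0, 0, csum)

def verify_dir_block(block: bytes) -> bool:
    """Check that the tail looks like a checksum entry and that the crc32c matches."""
    inode, rec_len, name_len, _ftype, csum = struct.unpack(
        TAIL_FORMAT, block[-TAIL_SIZE:])
    if inode != 0 or rec_len != TAIL_SIZE or name_len != 0:
        return False
    return csum == crc32c(block[:-TAIL_SIZE])
```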

HTree

The htree root and internal nodes do not hide a checksum in a fake dirent at the end of the block because that would require the removal of two struct dx_entry from each htree block. Instead, the limit count is decreased by 1 and the crc32c stored at the end of the block. Again, tune2fs will advise users to run e2fsck -D to rebuild all directories and perform any necessary htree rebalancing.

Unfortunately, in adding htree checksums to a very very large directory, it is possible to overflow the htree.
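
The space trade-off can be put into numbers with a small Python sketch. It assumes 8-byte struct dx_entry records (consistent with a 12-byte dirent costing two of them), an 8-byte header before the entries in an internal node, and a 32-byte header in the root block; the header sizes are assumptions and should be checked against the actual structures.

```python
DX_ENTRY_SIZE = 8      # assumed size of struct dx_entry
NODE_HEADER_SIZE = 8   # assumed: fake dirent header in an internal node
ROOT_HEADER_SIZE = 32  # assumed: "." and ".." entries plus root info

def entries_lost(tail_bytes: int) -> int:
    """Whole dx_entry slots consumed by a tail of the given size."""
    return -(-tail_bytes // DX_ENTRY_SIZE)  # ceiling division

def dx_limit(block_size: int, header_size: int, with_csum: bool) -> int:
    """Entry limit for an htree block, reduced by one when a checksum is stored."""
    limit = (block_size - header_size) // DX_ENTRY_SIZE
    return limit - 1 if with_csum else limit

def two_level_leaf_capacity(block_size: int, with_csum: bool) -> int:
    """Leaf blocks addressable through a root plus one level of internal nodes."""
    return (dx_limit(block_size, ROOT_HEADER_SIZE, with_csum)
            * dx_limit(block_size, NODE_HEADER_SIZE, with_csum))
```

Under these assumptions a 12-byte fake dirent would cost two slots per block while the end-of-block checksum costs one, and a two-level tree on 4096-byte blocks can address slightly fewer leaf blocks once checksums are enabled, which is how a directory that is already near the limit could overflow.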

Extended Attributes (EAs)

For EAs stored in a separate disk block (i.e. not stored after the inode), there is sufficient space to store a crc32c directly in the header.

For EAs stored in the extra space after the inode, Darrick incorrectly thought that the h_magic field was never checked. That turned out to be untrue, so his new proposal is to follow Andreas Dilger's suggestion and simply extend the inode checksum to cover the extra space after the inode structure. That will require a fair number of changes to e2fsprogs, but not many for the kernel.

Metadata Not Being Upgraded

Direct/indirect/triple-indirect block maps are not targeted for checksums, as this results in a totally incompatible disk format change and reduces the maximum file size considerably. Files should be converted to extents via chattr +e for increased safety and less overhead.

Tool Updates

A user should be able to turn on this feature at mke2fs time simply by specifying -O metadata_csum. Because the 64bit feature allows block group descriptors that are large enough to hold crc32c checksums for the bitmaps, mke2fs should warn the user if the feature set is metadata_csum,^64bit, once the 64bit feature has been tested thoroughly.

It should be possible to convert existing filesystems with a simple tune2fs -O metadata_csum. tune2fs will apply checksums to all metadata structures that can trivially take them, and tell the user to run e2fsck -D if necessary. e2fsck will gain the ability to reorganize directory tree blocks to accommodate the checksum fields. Obviously, 64bit mode cannot (currently) be enabled on existing filesystems.

It should be possible to disable metadata checksumming on an existing filesystem with tune2fs -O ^metadata_csum, with the same conditions outlined for enabling checksums on an existing filesystem.

debugfs should try to display checksums whenever possible.

It should NOT be possible for old fs code to write to a filesystem with metadata checksums enabled. The metadata_csum flag is implemented as a ROCOMPAT flag, which should keep (non-malicious) old programs from messing things up.
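
The effect of an RO_COMPAT flag can be sketched as a simple mount-time check: a driver that sees a read-only-compatible feature bit it does not understand may still mount the filesystem, but only read-only. The bit values and names below are placeholders chosen for illustration, not values taken from this page.

```python
# Hypothetical feature bits, for illustration only.
RO_COMPAT_SPARSE_SUPER = 0x0001
RO_COMPAT_LARGE_FILE = 0x0002
RO_COMPAT_METADATA_CSUM = 0x0400  # assumed value for the new flag

# Features an "old" driver in this example understands.
OLD_DRIVER_KNOWN = RO_COMPAT_SPARSE_SUPER | RO_COMPAT_LARGE_FILE

def may_mount(fs_ro_compat: int, driver_known: int, read_only: bool) -> bool:
    """Allow read-write mounts only if every RO_COMPAT bit is understood."""
    unknown = fs_ro_compat & ~driver_known
    return read_only or unknown == 0
```

With metadata_csum set, an old driver in this model can still read the filesystem but is refused write access, so it cannot leave stale checksums behind.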

Stuff Darrick Hasn't Thought Hard Enough About

  • Other filesystems' use of checksums??
  • Other ext4 features being concurrently developed?
  • Value-adds that use some ext4 fields without noting it in the ext4 documentation.
  • Defensive programming when we have to parse the metadata that is being checksummed (extent tree? dir blocks? htree blocks?)