Commit 76e084a
btrfs: fallback to buffered IO if the data profile has duplication
[BACKGROUND]
This is inspired by a recent kernel bug report about direct IO buffer
modification during writeback, which leads to content mismatches between
different RAID1 mirrors.
[CAUSE AND PROBLEMS]
The root cause is exactly the same as the one explained in commit 968f19c
("btrfs: always fallback to buffered write if the inode requires
checksum"): we can not trust the direct IO buffer, as it can be modified
halfway through writeback.
Unlike the data checksum case, if this happens on an inode without data
checksum whose data has extra mirrors, it leads to a silent data mismatch
between the different mirrors.
This is much harder to detect without data checksum.
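To make the race concrete, below is a minimal userspace sketch (not part
of this change; the path, sizes and thread layout are arbitrary choices
for illustration) that submits an O_DIRECT write while another thread
keeps scribbling on the same buffer:

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <pthread.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  #define BUF_SIZE 65536   /* 64K, matches the diagram below */

  static char *buf;
  static volatile int writing = 1;

  static void *scribbler(void *arg)
  {
          /* Keep rewriting the first 32K while the write is in flight. */
          while (writing)
                  memset(buf, rand() & 0xff, 32768);
          return NULL;
  }

  int main(void)
  {
          pthread_t thr;
          int fd;

          /* O_DIRECT requires an aligned buffer. */
          if (posix_memalign((void **)&buf, 4096, BUF_SIZE))
                  return 1;
          memset(buf, 'A', BUF_SIZE);

          fd = open("/mnt/btrfs/testfile",
                    O_WRONLY | O_CREAT | O_DIRECT, 0644);
          if (fd < 0)
                  return 1;

          pthread_create(&thr, NULL, scribbler, NULL);
          /* The kernel may read buf at any point while this is in flight,
           * so each mirror may see a different mix of old and new bytes. */
          if (write(fd, buf, BUF_SIZE) < 0)
                  return 1;
          writing = 0;
          pthread_join(thr, NULL);

          fsync(fd);
          close(fd);
          free(buf);
          return 0;
  }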
Furthermore, for RAID56 we can even have data without checksum and data
with checksum mixed inside the same full stripe.
In that case, if the direct IO buffer of the nodatasum part gets changed
halfway through, the data with checksum immediately loses its ability to
be recovered, e.g.:
" " = Good old data or parity calculated using good old data
"X" = Data modified during writeback
       0               32K              64K
Data 1 |                                  | Has csum
Data 2 |XXXXXXXXXXXXXXXX                  | No csum
Parity |                                  |
In the above case, the parity is calculated using data 1 (has csum, comes
from the page cache, won't change during writeback) and the old data 2
(has no csum, direct IO write).
After the parity is calculated, but before submission to the storage, the
direct IO buffer of data 2 is modified, so the range [0, 32K) of data 2
now has different content.
Now all data is submitted to the storage, and the fs is fully synced.
Then the device holding data 1 is lost and has to be rebuilt from data 2
and the parity. But since data 2 contains modified data while the parity
was calculated using the old data, the recovered data no longer matches
the original data 1, causing a data checksum mismatch.
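The failure can be reduced to plain XOR arithmetic, the parity RAID5
uses. The following standalone snippet, purely for illustration and with
made-up byte values, shows that rebuilding data 1 from the parity and the
modified data 2 no longer yields the original data 1:

  #include <assert.h>
  #include <stdio.h>

  int main(void)
  {
          unsigned char data1     = 0xAA; /* has csum, from page cache   */
          unsigned char data2_old = 0x55; /* nodatasum, direct IO buffer */
          /* Parity is calculated from the old contents. */
          unsigned char parity    = data1 ^ data2_old;

          /* Direct IO buffer is modified after the parity was calculated,
           * so this is what actually reaches the disk. */
          unsigned char data2_new = 0x5F;

          /* Device holding data 1 is lost; rebuild it from data 2 + parity. */
          unsigned char rebuilt = parity ^ data2_new;

          printf("original data1: 0x%02X, rebuilt data1: 0x%02X\n",
                 data1, rebuilt);
          assert(rebuilt != data1); /* recovery no longer matches the csum */
          return 0;
  }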
[FIX]
Fix the problem by checking the data allocation profile.
If our data allocation profile is either RAID0 or SINGLE, we can allow
true zero-copy direct IO and the end user is fully responsible for any
race.
However this does not fix all situations, as it's still possible to race
with balance, where the fs gets a new data profile after the data
allocation profile check.
But this fix should still greatly reduce the window for the original bug.
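The shape of the new check can be sketched with standalone types; the
real kernel helpers and profile flags are deliberately not named here,
this is only a model of the decision, not the actual code:

  #include <stdbool.h>
  #include <stdio.h>

  enum data_profile {
          PROFILE_SINGLE,
          PROFILE_RAID0,
          PROFILE_DUP,
          PROFILE_RAID1,
          PROFILE_RAID10,
          PROFILE_RAID5,
          PROFILE_RAID6,
  };

  /*
   * Zero-copy direct IO is only allowed when no extra copy of the data
   * exists that could end up differing from the others: SINGLE and RAID0
   * keep exactly one copy, so a torn buffer only hurts the writer itself.
   */
  static bool can_do_zero_copy_dio(enum data_profile profile)
  {
          return profile == PROFILE_SINGLE || profile == PROFILE_RAID0;
  }

  int main(void)
  {
          printf("RAID1:  fall back to buffered = %d\n",
                 !can_do_zero_copy_dio(PROFILE_RAID1));
          printf("SINGLE: fall back to buffered = %d\n",
                 !can_do_zero_copy_dio(PROFILE_SINGLE));
          return 0;
  }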
Link: https://bugzilla.kernel.org/show_bug.cgi?id=99171
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>