
Commit 76e084a

adam900710 authored and kdave committed
btrfs: fallback to buffered IO if the data profile has duplication
[BACKGROUND]
Inspired by a recent kernel bug report about direct IO buffer
modification during writeback, which leads to contents mismatch between
different RAID1 mirrors.

[CAUSE AND PROBLEMS]
The root cause is exactly the one explained in commit 968f19c ("btrfs:
always fallback to buffered write if the inode requires checksum"): we
cannot trust a direct IO buffer, as it can be modified halfway through
writeback.

Unlike the data checksum case, if this happens on an inode without data
checksum whose data has extra mirrors, it leads to a stealth data
mismatch between the mirrors, which is much harder to detect without
data checksums.

Furthermore, for RAID56 we can even have data without checksum and data
with checksum mixed inside the same full stripe. In that case, if the
direct IO buffer is changed halfway through for the nodatasum part, the
data with checksum immediately loses its ability to be recovered, e.g.:

" " = Good old data, or parity calculated using good old data
"X" = Data modified during writeback

        0               32K             64K
Data 1  |                                | Has csum
Data 2  |XXXXXXXXXXXXXXXX                | No csum
Parity  |                                |

In the above case, the parity is calculated using data 1 (has csum,
from the page cache, won't change during writeback) and the old data 2
(no csum, direct IO write). After the parity is calculated, but before
submission to the storage, the direct IO buffer of data 2 is modified,
so the range [0, 32K) of data 2 has different contents.

Now all data is submitted to the storage and the fs is fully synced.
Then the device of data 1 is lost and has to be rebuilt from data 2 and
parity. But since data 2 contains modified data while the parity was
calculated using the old data, the recovered data is not the same as
data 1, causing a data checksum mismatch.

[FIX]
Fix the problem by checking the data allocation profile. If our data
allocation profile is either RAID0 or SINGLE, we can allow true
zero-copy direct IO, and the end user is fully responsible for any
race.

However, this is not going to fix all situations, as it's still
possible to race with balance, where the fs gets a new data profile
after the data allocation profile check. But this fix should still
greatly reduce the window of the original bug.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=99171
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
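To make the rebuild failure concrete, here is a minimal userspace C sketch of the XOR parity math described above (not kernel code; the stripe size and buffer contents are invented for illustration). Parity is computed from a snapshot of both stripes, the no-csum buffer changes before submission, and rebuilding the lost csum stripe from parity then yields different data:

/*
 * Userspace sketch of the RAID5 failure mode: parity is computed from
 * old data, the nodatasum buffer is modified before submission, and a
 * later rebuild of the lost stripe from parity returns wrong contents.
 */
#include <stdio.h>
#include <string.h>

#define STRIPE_LEN 8

int main(void)
{
	unsigned char data1[STRIPE_LEN] = "AAAAAAA"; /* has csum, page cache, stable */
	unsigned char data2[STRIPE_LEN] = "BBBBBBB"; /* no csum, direct IO buffer    */
	unsigned char parity[STRIPE_LEN];
	unsigned char rebuilt[STRIPE_LEN];

	/* Parity calculated from the old contents of both stripes. */
	for (int i = 0; i < STRIPE_LEN; i++)
		parity[i] = data1[i] ^ data2[i];

	/* Direct IO buffer modified after parity calc, before submission. */
	memcpy(data2, "XXXXXXX", STRIPE_LEN);

	/* Device holding data1 is lost; rebuild it from data2 ^ parity. */
	for (int i = 0; i < STRIPE_LEN; i++)
		rebuilt[i] = data2[i] ^ parity[i];

	/* rebuilt != data1, so the stored checksum for data1 now fails. */
	printf("rebuilt matches data1: %s\n",
	       memcmp(rebuilt, data1, STRIPE_LEN) ? "no" : "yes");
	return 0;
}

Running it prints "rebuilt matches data1: no", which is exactly the checksum mismatch the commit message describes.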
1 parent b120eb9 commit 76e084a

File tree

1 file changed: +12 −0 lines

fs/btrfs/direct-io.c

Lines changed: 12 additions & 0 deletions
@@ -814,6 +814,8 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	ssize_t ret;
 	unsigned int ilock_flags = 0;
 	struct iomap_dio *dio;
+	const u64 data_profile = btrfs_data_alloc_profile(fs_info) &
+				 BTRFS_BLOCK_GROUP_PROFILE_MASK;
 
 	if (iocb->ki_flags & IOCB_NOWAIT)
 		ilock_flags |= BTRFS_ILOCK_TRY;
@@ -827,6 +829,16 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	if (iocb->ki_pos + iov_iter_count(from) <= i_size_read(inode) && IS_NOSEC(inode))
 		ilock_flags |= BTRFS_ILOCK_SHARED;
 
+	/*
+	 * If our data profile has duplication (either extra mirrors or RAID56),
+	 * we can not trust the direct IO buffer, the content may change during
+	 * writeback and cause different contents written to different mirrors.
+	 *
+	 * Thus only RAID0 and SINGLE can go true zero-copy direct IO.
+	 */
+	if (data_profile != BTRFS_BLOCK_GROUP_RAID0 && data_profile != 0)
+		goto buffered;
+
 relock:
 	ret = btrfs_inode_lock(BTRFS_I(inode), ilock_flags);
 	if (ret < 0)