Phenomenon:
On an ext3/ext4 file system, 4KB synchronous write performance is about 50MB/s. However, the Huawei ES3000 PCI-E SSD delivers almost 220MB/s for 4KB synchronous writes. There is a performance gap between the raw PCI-E SSD device and a file system on top of it.
Root Cause Analysis:
1. The 50MB/s figure for 4KB synchronous writes is a consequence of how ext3/ext4 work. The amount of data actually written to the storage device is larger than the user's data, so the effective write bandwidth is lower than the maximum throughput of the PCI-E SSD.
On the raw device (no file system), a single process doing 4KB synchronous writes can reach 220MB/s on the PCI-E SSD card.
On the file system, during 4KB synchronous writes, the iostat system command shows the SSD write bandwidth reaching up to 200MB/s.
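The single-process 4KB synchronous write test can be approximated with a short Python sketch. This is an assumption about how such a measurement might be made, not the tool used for the figures above; the function name sync_write_throughput, the target path, and the block count are hypothetical. It relies on O_SYNC, which makes each write() return only after the data has reached the device.

```python
import os
import time

def sync_write_throughput(path, block_size=4096, count=1000):
    """Write `count` blocks with O_SYNC and return the throughput in MB/s."""
    buf = b"\xa5" * block_size
    # O_SYNC: each write() returns only after the data is on stable storage.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o644)
    try:
        start = time.perf_counter()
        for _ in range(count):
            written = 0
            while written < block_size:  # handle short writes
                written += os.write(fd, buf[written:])
        elapsed = max(time.perf_counter() - start, 1e-9)
    finally:
        os.close(fd)
    return (block_size * count) / elapsed / 1e6

if __name__ == "__main__":
    # Hypothetical target: a file on the ext4 mount, or the raw block device.
    print(f"{sync_write_throughput('/data/testfile'):.1f} MB/s")
```

Running it once against a file on the file system and once against the raw device (which destroys the data on that device) would give the two numbers being compared, while iostat in a second terminal shows the bandwidth the device itself sees.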
2. Why do ext3/ext4 file systems have write amplification?
A journaled file system allocates a special area, the journal, in which it records the changes it will make ahead of time.
Because ext3/ext4 use journaling, when the user writes a 4KB I/O to storage, the file system first updates the corresponding journal block and then writes the data to storage.
Each journal size is fixed, usually 32KB (the user can change the journal length when formatting the file system). That is, regardless of the size of the upper-layer I/O, writing one block of data requires updating the corresponding 32KB of journal data and 4KB of metadata.
In addition, the ext3/ext4 journaling system divides the 32KB of journal data into 5 pieces, issued as 5 individual I/Os. These smaller I/Os reduce storage performance even further.
The following code is from the file fs/jbd2/commit.c in kernel source 2.6.32-358. Lines 638~654 perform the journal data update: one journal (typical size 32KB) is divided into journal->j_wbufsize individual I/Os (j_wbufsize is typically 5).
00304: /*
00305:  * jbd2_journal_commit_transaction
00306:  *
00307:  * The primary function for committing a transaction to the log.  This
00308:  * function is called by the journal thread to begin a complete commit.
00309:  */
00310: void jbd2_journal_commit_transaction(journal_t *journal)
00311: {
00312:     struct transaction_stats_s stats;
00313:     transaction_t *commit_transaction;
00314:     struct journal_head *jh, *new_jh, *descriptor;
00315:     struct buffer_head **wbuf = journal->j_wbuf;
… …
00622:         /* If there's no more to do, or if the descriptor is full,
00623:            let the IO rip! */
00624:
00625:         if (bufs == journal->j_wbufsize ||
00626:             commit_transaction->t_buffers == NULL ||
00627:             space_left < tag_bytes + 16) {
00628:
00629:             jbd_debug(4, "JBD: Submit %d IOs\n", bufs);
00630:
00631:             /* Write an end-of-descriptor marker before
00632:                submitting the IOs.  "tag" still points to
00633:                the last tag we set up. */
00634:
00635:             tag->t_flags |= cpu_to_be32(JBD2_FLAG_LAST_TAG);
00636:
00637: start_journal_io:
00638:             for (i = 0; i < bufs; i++) {
00639:                 struct buffer_head *bh = wbuf[i];
00640:                 /*
00641:                  * Compute checksum.
00642:                  */
00643:                 if (JBD2_HAS_COMPAT_FEATURE(journal,
00644:                         JBD2_FEATURE_COMPAT_CHECKSUM)) {
00645:                     crc32_sum =
00646:                         jbd2_checksum_data(crc32_sum, bh);
00647:                 }
00648:
00649:                 lock_buffer(bh);
00650:                 clear_buffer_dirty(bh);
00651:                 set_buffer_uptodate(bh);
00652:                 bh->b_end_io = journal_end_buffer_io_sync;
00653:                 submit_bh(write_op, bh);
00654:             }
00655:             cond_resched();
00656:             stats.run.rs_blocks_logged += bufs;
For example:
When the data block size is 4KB, the effective share of the data written to the PCI-E SSD is only 4KB / (4KB + 32KB + 4KB) = 10%.
For a data block size of 32KB, the effective share of the data written to the PCI-E SSD is 32KB / (32KB + 32KB + 4KB) = 47%.
For a data block size of 512KB, the effective share of the data written to the PCI-E SSD is 512KB / (512KB + 32KB + 4KB) = 93.4%.
Note: This is only the proportion of valid data; the effective bandwidth does not follow the same ratio exactly.
That is, with 4KB synchronous writes, only about 10% of the data written to the SSD is user data.
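The percentages above follow from one formula, sketched here in Python. The function name effective_ratio is hypothetical, and the 32KB journal and 4KB metadata defaults are simply the figures assumed in this article, not values read from a real file system.

```python
def effective_ratio(data_kb, journal_kb=32, metadata_kb=4):
    """Fraction of the bytes written to the device that are user data."""
    return data_kb / (data_kb + journal_kb + metadata_kb)

if __name__ == "__main__":
    for size_kb in (4, 32, 512):
        print(f"{size_kb:>4} KB: {effective_ratio(size_kb):.1%}")
    # prints:
    #    4 KB: 10.0%
    #   32 KB: 47.1%
    #  512 KB: 93.4%
```

The journal and metadata overhead is constant per write in this model, so the ratio approaches 100% as the user block size grows.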
3. Is there any method to improve the performance?
Since the write amplification is caused by the file system's journal and metadata updates, the effective way to improve performance is to minimize those updates.
(1) The most promising method is to not use O_TRUNC when opening the file.
In our tests, opening the file without this flag improved performance by about 80%.
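A minimal Python sketch of the two ways of opening the file follows. The helper names open_keep_blocks and open_truncating are hypothetical. The explanation in the comments is an assumption consistent with the reasoning above: truncation discards the file's existing blocks, so later writes have to allocate blocks again, which means more metadata and journal updates.

```python
import os

def open_keep_blocks(path):
    """Open for synchronous writes WITHOUT O_TRUNC: existing blocks are kept,
    so writes to already-allocated offsets overwrite in place."""
    return os.open(path, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o644)

def open_truncating(path):
    """Open for synchronous writes WITH O_TRUNC: the file is cut to length 0,
    so every subsequent write has to allocate blocks again."""
    return os.open(path, os.O_WRONLY | os.O_CREAT | os.O_SYNC | os.O_TRUNC, 0o644)

if __name__ == "__main__":
    path = "/data/testfile"  # hypothetical file on the ext4 mount
    fd = open_keep_blocks(path)
    os.write(fd, b"\0" * 4096)  # overwrites the first 4KB, rest of file intact
    os.close(fd)
```

The benefit only applies when the test file already exists at its full size; on a newly created file both variants have to allocate blocks.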
Other possible methods to improve performance:
(2) Adjust the journal size when formatting the file system.
For example, the journal size can be set at format time (note that mke2fs documents the -J size value in megabytes):
mkfs.ext4 /dev/hioa -J size=64
We tested journal lengths from 4K to 64K and saw no obvious performance difference; the variation was within about 5%.
(3) Use the writeback journal mode when mounting the file system.
mount /dev/hioa /data -t ext4 -o data=writeback
With the writeback data mode, performance improves by about 3%.
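Before comparing numbers, it can help to confirm which data mode a mount is actually using. The Python sketch below parses text in the format of /proc/mounts; the function name ext_data_mode and the /data mount point are hypothetical, and a result of None only means that no data= option is listed (some kernels omit default options), not that journaling is off.

```python
def ext_data_mode(mounts_text, mount_point):
    """Return the data= journal mode listed for `mount_point`, or None."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            for opt in fields[3].split(","):
                if opt.startswith("data="):
                    return opt[len("data="):]
    return None

if __name__ == "__main__":
    with open("/proc/mounts") as f:
        print(ext_data_mode(f.read(), "/data"))
```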
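I wrote a quantitative analysis of write amplification; it has not been published yet.
Thanks. Looking forward to you sharing it.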