Question: I used the Intel SSD Data Center Tool to check my NVMe drive's information, as shown below:
– Intel Optane(TM) SSD DC P4800X FUKS7175003R375AGN -
…
Bootloader : EB3B0213
DevicePath : /dev/nvme0n1
DeviceStatus : Healthy
Firmware : E2010211
IntelNVMe : True
LBAFormat : 0
NativeMaxLBA : 732585167
NumErrorLogPageEntries : 63
NumLBAFormats : 6
PhySpeed : The selected drive does not support this feature.
PhysicalSectorSize : The selected drive does not support this feature.
PhysicalSize : 375083606016
PowerGovernorAveragePower : The desired feature is not supported….
SMBusAddress : 256
SectorSize : 512
SerialNumber : FUKS7175003R375AGN
TCGSupported : False
…
As you can see, SectorSize = 512. However, when I test with fio, blocksize=4096 is a lot faster than blocksize=512. I know that a page in an SSD needs to be erased before it can be written again, but here 512 should be exactly the size of a page, so it should be fast. Why does this happen?
Answer: With SSDs, the sector size presented to upper layers is nowhere near the flash's internal page or erase-block size, but 4096 bytes IS closer to a typical page size than 512 bytes. Further, if you send data down in "clumps" of 4096 bytes rather than 512 bytes, every layer has less work to do for the same total I/O (one eighth as many requests), and the I/O will more often be aligned to the page size. In fact, you will probably find things are faster again with a 64k block size: the minimum block size is not the same as the optimal block size! See http://codecapsule.com/2014/02/12/coding-for-ssds-part-2-architecture-of-an-ssd-and-benchmarking/ (especially the section about NAND-flash pages and blocks) and http://codecapsule.com/2014/02/12/coding-for-ssds-part-3-pages-blocks-and-the-flash-translation-layer/ for details.
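To put rough numbers on the "less work for the same total I/O" point, here is a minimal Python sketch. It is only arithmetic, not a benchmark. The 1 GiB total, the 4096-byte internal unit (`ASSUMED_UNIT`), and the helper names `requests_needed` and `aligned_fraction` are all assumptions made up for illustration; the real internal unit of this drive is not reported in the output above.

```python
# Illustrative arithmetic only: how many requests the same amount of data
# takes at different block sizes, and how often sequential request offsets
# land on an internal boundary.

TOTAL_BYTES = 1 << 30   # 1 GiB of total I/O (arbitrary choice)
ASSUMED_UNIT = 4096     # ASSUMPTION: hypothetical internal unit, not from the tool output


def requests_needed(total_bytes: int, block_size: int) -> int:
    """Number of I/O requests needed to move total_bytes (ceiling division)."""
    return -(-total_bytes // block_size)


def aligned_fraction(block_size: int, unit: int = ASSUMED_UNIT, count: int = 1024) -> float:
    """Fraction of sequential request offsets (starting at 0) that are multiples of unit."""
    aligned = sum(1 for i in range(count) if (i * block_size) % unit == 0)
    return aligned / count


for bs in (512, 4096, 65536):
    print(f"bs={bs:>6}: {requests_needed(TOTAL_BYTES, bs):>8} requests, "
          f"{aligned_fraction(bs):.3f} of offsets aligned to {ASSUMED_UNIT}")
```

For 1 GiB this works out to 2,097,152 requests at 512 bytes, 262,144 at 4096 bytes, and 16,384 at 64k. Every request carries some fixed cost in the kernel, the driver, and the controller, so fewer and larger requests usually win until some other limit takes over. Only an actual fio run on the drive will show where that point is.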