Lesson for the day: IOPS and latency matter….
The NAS that stores VM disk images has had SSDs in it for a while. No real reason, except that when it came time to replace some dying spinning rust, SSDs were cheap enough, so I put SATA SSDs in. The NAS itself is not blazingly fast, and I figured the Gbit Ethernet the NFS is served over would be the real limiting factor anyway.
And overall, I was right. The network did limit the throughput. I wasn’t booting servers any faster; things were just ticking over, barely touching the SSDs no matter how hard the VMs were working….
Life goes on, servers change and expand and things just keep working….
And then 8 months later I have both SSDs start having performance issues in parallel. Nothing obvious, and there are already some known firmware bugs on them, so I have to take them out and run them through the Samsung software under Windows. So I migrate the VMs to the RAID10 spinning rust on the other NAS, and swap RAID1 spinning rust into the original NAS…
And the gates of hell opened up.
The throughput might not have changed, but boy did the latency/IOPS change. The spinning rust just could not keep up. I split things over the two NASes, and that keeps things mostly OK, but right on the edge.
Then serendipity… at one point I had just stuffed a 32G SATA SSD into the DVD slot on the NAS. It’s not configured, and it shows up as an ATA device rather than an AHCI device, but it’s there. And it’s SSD.
Partition it into a 16G and a 1G partition. Tell ZFS to use the 16G partition as cache (L2ARC) and the 1G partition as log (SLOG).
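The ZFS side of this is only a couple of commands. A rough sketch, assuming the pool is called `tank` and the SSD shows up as `/dev/sdX` (both names are placeholders for whatever your system uses):

```shell
# Partition the SSD: a 16G partition for cache and a 1G one for log.
# (GPT label via parted; /dev/sdX is a placeholder device name)
parted -s /dev/sdX mklabel gpt \
  mkpart cache 1MiB 16GiB \
  mkpart log 16GiB 17GiB

# Add the 16G partition as L2ARC read cache and the 1G one as a
# separate log device (SLOG) for the ZFS intent log.
zpool add tank cache /dev/sdX1
zpool add tank log /dev/sdX2

# Verify: the new devices show up under "cache" and "logs" sections.
zpool status tank
```

Both vdev types can also be removed later with `zpool remove`, so there is little risk in experimenting with a spare SSD this way.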
And suddenly all the performance woes just ease dramatically. Most of the writes stay in memory or go to the SSD, and only get flushed to hard disk periodically, and that flush doesn’t slow anything else down. So writes are much more performant. The cache part isn’t as effective, but it’s nothing to ignore either, especially as it takes load off the hard disks on memory cache misses.
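You can actually watch how much work the cache and log devices are absorbing. A sketch, again assuming the pool is named `tank`:

```shell
# Per-vdev I/O statistics, refreshed every 5 seconds. The cache and
# log devices get their own rows, so you can see reads being served
# from L2ARC and sync writes landing on the SLOG instead of the HDDs.
zpool iostat -v tank 5

# ARC/L2ARC hit rates over time, if the arcstat utility is installed.
arcstat 5
```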
So… I ordered a small SSD for the other NAS, which unfortunately does not have one.