Posted by John Kim
Finely tuning buffers, queues and caches can make your storage system hum. And that’s exactly what we discussed in our recent SNIA Ethernet Storage Forum webcast, ““Everything You Wanted to Know About Storage But Were Too Proud To Ask – Part Teal: The Buffering Pod.” If you missed it, it’s now available on-demand.
In this blog, you’ll find detailed answers from our panel of experts to all the great questions we received during the live event. I also encourage you to check out the other on-demand webcasts in this “Too Proud To Ask” series here and stay informed on upcoming events in this series by following us on Twitter @SNIAESF.
Q. Question on cache – What would be the right size of cache at each point (clients / Front-end connect / Storage controller / Back end connect / Physical storage).
A. Great question! The main consideration for cache sizing at any point is the workload. If the workload is conducive to cache benefits, then the more cache the merrier! However, when workload is not conducive to cache, adding more cache capacity won’t be beneficial. For example, if the workload is 100% sequential reads of small 4K IOs, having the data be pre-loaded into cache is going to be extremely helpful, and increasing the size of such cache at the end-point will be good. If the workload is random, and the IO size is changing, pre-fetching data into cache may be not a good idea. Similarly, with write cache, the benefit is realized two-fold: first, when the write is stored in cache and ack’ed back to the host (such write is typically called “dirty”, because it hasn’t been flashed back to the disk) and second, when the dirty write is overwritten by the host before it is flashed. Any other combination of workloads and IO will only get partial benefit from the cache. Sizing cache is a very difficult exercise and there are no universal answers. Every implementation has its own pluses and minuses.
Q. Isn’t a higher queue depth increasing latency as well, so applications would run slower as they are waiting longer for IO to complete?
A. The answer to this is very dependent on the environment. In general having more outstanding operations would increase the load on the interconnects and storage media which would result in the per-IO latency increasing. The alternative is having a small queue depth which may produce consistently lower per IOP latency at the expense of less throughput and IOPS. There are numerous techniques for dealing with mixed storage traffic, low-latency and high throughput, such as multi-queues, out-of-order completions, immediate and delayed data transfers in-line, ready to transfer, and policies. The NVM media latency roadmap is also helping with these types of latency vs. throughput decisions by enabling devices that achieve full-throughput at very low queue depths.
Q. Does SCSI protocol have a max queue depth of 32?
A. No, the SCSI Architecture Model allows for up to 64 bits for the command identifier field and each of the SCSI transports (iSCSI, SAS, …) defines a maximum within that range. There may be implementation-dependent SCSI endpoints that define smaller ranges.
Q. How would a distributed software defined storage technology deal with queue depth and how can this be advantageous or not advantageous?
A. Interesting question. Distributed software defined storage is by definition made up of multiple autonomous layers of software components orchestrated to provide stable storage. These types of systems will have many outstanding operations (queue depth) at multiple-stages and layers. It’s also not uncommon to see SDS file systems front-ended with block-based protocols, such as iSCSI, which enable the initiators to build up large queue depths of operations.
Q. Are queue depth and buffer the same?
A. No, queue refers to command and response queues, buffers refer to in-flight data buffers. Command and response queues often contain pointers to these buffers embedded in the read or write commands.
Q. Are caches and buffers made of the same silicon that makes up SSD disks? Which one is faster?
A. As a general idea, yes, SSDs, RAMs, caches, and buffers are all made from the silicon. If we dig a little deeper, device caches and buffers are typically made of high-speed static random access memory (SRAM), which is faster than the slower and cheaper dynamic RAM (DRAM), used for main memory. Modern SSDs are utilizing an even slower memory, which is commonly known as Flash memory, and we differentiate that type of storage by its structure: Single-Level Cell (SLC), Multi-Level Cell (MLC), etc. Although, there are some SSDs that are made out of DRAM, too. And then there are some newer technologies, like NVDIMM, 3D XPoint, etc. So, while the underlying physical material is still the same silicon, it’s the architecture that makes all the difference.
Q. In PFC.. If there are pending items in P1… can P2 or P3 etc. go ahead?
A. Yes. Priority Flow Control (PFC, also called Per Priority Pause, though rarely) is designed specifically to only pause traffic on one priority, allowing the remaining priority Classes of Service to work according to their configurations. So, for example, if PFC were to pause Priority Queue 1, and Priority Queue 3 also had a “no-drop” configuration but was not having any issues, PFC on Queue 1 would be triggered but PFC on Queue 3 would not. In reality, having more than one no-drop lane on a link is very, very rare, but it does illustrate that PFC operates on a per-priority basis, not on the whole link.
Q. Do all Ethernet based NVMe-oF (NVMe over Fabrics) implementations require some form of Data Center Bridging (DCB)? Or, are there versions of Ethernet based NVMe-oF (RoCE & iWARP) that run over standard Ethernet without needing DCB?
A. Yes, both iWARP and RoCE can be run without DCB. To maintain peak performance either DCB or other flow control mechanisms like ECN are recommended.
Q. Do server devices automatically honor the pause frame or does it require configuration?
A. I am assuming “server devices” refers to Ethernet ports on a server. It depends on the default settings of the NIC or LOM or those loaded by the driver during initialization. Generally speaking NIC devices that support PFC also support DCBX (Data Center Bridging Exchange). DCBX is a protocol that allows an end device, like a NIC, to get its proper configuration settings from the switch. That means that in an environment where PFC needs to be assigned to a specific Class of Service (CoS), the switch will send the NIC the proper settings during the setup configuration.
Q. Is it mandatory for all devices in network, host and storage to have same speed ports?
Q. What are the theoretical devices for modeling and analyzing cache, buffer or queue behaviors?
A. Computers with software 🙂
Q. What if I have really large sized writes and they fill up the cache quickly? Is there a way to bypass the large sized writes?
A. The time of the presentation limited the amount of material we were able to share. One of the subjects we didn’t talk about was the cache software algorithm. Most storage vendors manage the cache by not letting extremely large IOs to be cached. Back in the spinning storage era, an IO of 2MB would typically be considered too large to be cached, and would be sent directly to disk.
Q. What will be the use of cache in all flash storage please? As flash is the highest performance disk.
A. See the answer to question above “Are caches and buffers made of the same silicon that makes up SSD disks? Which one is faster?” Hardware cache and buffers are typically made out of the fastest memory, then comes RAM and last are the SSDs aka flash disks. Therefore, storing data on a faster layer is still beneficial to the performance.
Q. Does the LUN Queue Depth includes the Queue Depth discussed here?
A. Yes, SCSI LUN queue depth enables the initiator(s) to have multi outstanding I/O operations in flight.
Q. Will you use a queuing algorithm to manage IO queue? If your answer is yes, which algorithm will you use?
A. There are several storage protocols that define mechanisms for a target to dynamically adjust the queue depth available to the initiator through various forms of credit exchanges. Having these types of mechanism enables the target to implement multi-initiator load balancing across targe