Storage subsystem architectures can be split into a number of components: the Host adapters, which carry the communication channels to the outside world; the Device adapters, which communicate with the real, physical disks; the Cache, which stores data electronically to speed up performance; and the Processors, which manage all the other components. The subsystem also needs a communications architecture to connect all the parts together, and of course power systems and cooling fans to drive it. The main components are discussed in the sections below. The important point is that the architecture should be non-blocking, as far as possible. This means that all requests can happen simultaneously, without any need to queue for resources.
There are three main connectivity architectures, all discussed below. The main difference between them is the number of parallel communications operations they can support, which in turn determines the overall subsystem bandwidth. However, be aware that when suppliers quote overall bandwidth numbers, these are maximum theoretical figures that you would never see in real life.
Traditionally, subsystem components were connected together by a bus. Only one device can talk over the bus at a time, so other devices have to wait in a queue. Storage subsystems have several buses; control information and data are usually segregated onto separate bus structures. However, bus technology limits internal bandwidth to under 2 GB/s. Bus technology is simple and cheap, but is difficult to scale up.
The bus architecture is illustrated in the gif below. Even though there are 16 host paths into the device, only 4 concurrent IOs are possible, as the connection bus has only 4 paths. If more than 4 IOs are scheduled, subsequent IOs are blocked, as shown in red, until a bus becomes free. If the device is driven within its capability then blocking is not really a problem, but the point is that a bus architecture can be overloaded.
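The blocking behaviour described above can be sketched in a few lines. This is a minimal illustration using the figures from the text (16 host paths, 4 bus paths); it is not modelled on any specific product.

```python
# Minimal sketch of bus contention: 16 host paths feed a bus with only
# 4 internal paths, so a batch of simultaneous IOs may partly block.
HOST_PATHS = 16
BUS_PATHS = 4

def schedule_ios(requested):
    """Return (started, blocked) for a batch of simultaneous IO requests."""
    started = min(requested, BUS_PATHS)
    blocked = max(0, requested - BUS_PATHS)
    return started, blocked

# If 7 IOs arrive at once, only 4 proceed; 3 must queue for a free bus.
print(schedule_ios(7))   # (4, 3)
```

As long as the workload stays at or below 4 concurrent IOs, nothing blocks, which is why a bus design is acceptable when the device is driven within its capability.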
Write IOs are shown as blue. When they reach the cache, they are effectively complete as far as the application is concerned. The staging down to disk happens in the background.
Read IOs are shown as green. Some of these are sequential IOs, and so are pre-staged into cache to help application performance.
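The write-behind behaviour described above can be sketched as follows. This is an illustrative toy model, not a real controller implementation; the class and method names are invented for the example.

```python
# Hedged sketch of write-behind caching: a write "completes" for the
# application once it reaches cache; destaging to disk happens later.
class WriteBehindCache:
    def __init__(self):
        self.dirty = {}   # blocks held in cache, waiting to be destaged
        self.disk = {}    # the backing physical disk

    def write(self, block, data):
        # The application sees the IO as complete at this point.
        self.dirty[block] = data

    def destage(self):
        # Background task: flush dirty cache blocks down to disk.
        self.disk.update(self.dirty)
        self.dirty.clear()

c = WriteBehindCache()
c.write(1, b"data")
print(1 in c.disk)    # False: acknowledged to the host, but not yet on disk
c.destage()
print(1 in c.disk)    # True: persisted by the background destage
```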
Most large storage subsystems use a Switched Architecture, which could be considered as a SAN in a box. Components are attached by fibre or copper links to switches. As new components are added, more links are added to cope.
Switched architecture is usually implemented as a PCIe configuration. The paths between the components are called 'lanes' and are full duplex; that is, they can communicate in both directions at once.
The gif below illustrates the principle behind a switched architecture. It is obviously simplified to make it reasonable to draw; a real switch has up to 64 connections on each side. The architecture will be non-blocking as long as the fan-out to fan-in ratio on the switches is 1:1.
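The 1:1 condition above amounts to a simple check: a switch stage cannot block if it has at least as many links out as links in. A minimal sketch, with illustrative figures:

```python
# A switch stage is non-blocking when its fan-out to fan-in ratio is 1:1,
# i.e. every inbound link can be matched to an outbound link.
def is_non_blocking(links_in, links_out):
    return links_out >= links_in

print(is_non_blocking(64, 64))  # True: 1:1, all inputs can proceed at once
print(is_non_blocking(16, 4))   # False: 4:1 fan-in, requests must queue
```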
EMC's Direct Matrix technology uses dedicated fibre links to connect components to every other component they need to talk to. So each HBA is connected to every cache segment with a dedicated link, and every cache segment is connected to every disk adapter with a dedicated link. If new cache segments or new disk adapters are added, then a new set of dedicated links is added. That makes the subsystem fully scalable, without sacrificing any internal performance.
The gif below illustrates how the matrix can always cope with any data going to and from the adapters to the cache. Again, this is a very simplified diagram; the real thing has a lot more connections.
Host adapters (HA) are used to connect the external communications channels (Fibre, SAS, SATA, SCSI, ESCON, FICON) to the communications web within the storage subsystem. Typical terminology includes HBA (Host Bus Adapter) or HB (Host Bay). The Host Bay generally connects several external channels to several internal channels. For example, an HB might have 8 incoming Fibre Channel ports which connect to 4 internal buses. In this case, it is evident that this is a blocking architecture, as the data channels funnel down in a 2:1 ratio. If the architecture is n:n, then it is a good idea to check with the vendor that this means there can be n simultaneous data operations, without any blocking.
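The oversubscription check described above is just a ratio. A short sketch using the hypothetical host bay from the text (8 external ports into 4 internal buses):

```python
from fractions import Fraction

# Oversubscription ratio of a host bay: a ratio above 1 means the
# adapter can block when all external ports are busy at once.
def oversubscription(external_ports, internal_paths):
    return Fraction(external_ports, internal_paths)

print(oversubscription(8, 4))   # 2, i.e. a blocking 2:1 architecture
print(oversubscription(4, 4))   # 1, i.e. a potentially non-blocking n:n design
```

Note that an n:n port count alone does not prove the adapter is non-blocking internally, which is why the text suggests confirming the simultaneous-operation figure with the vendor.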
Most enterprise disk communication is now either Switched Fabric or Point-to-Point. The protocol options are:
Non Volatile Memory Express, or NVMe, is an open standards protocol specifically designed for communications between servers and non-volatile memory, or Flash storage. Once Flash disks replaced spinning disks, it became obvious that SCSI was not fast enough to handle them. NVMe was specifically designed to connect Flash storage via PCIe connectors on a motherboard. You can read more about it at the NVMe page.
SCSI (Small Computer Systems Interface) is parallel bus based, and can support up to 15 devices. Only two devices can communicate on the bus at a time. Faster variants of SCSI have been introduced, including SCSI-2, Ultra2 SCSI, and SCSI Express. SCSI Express can reach up to 985 MB/s burst transfers.
The bottom line for application performance used to be the physical disks, but now that flash disks have replaced them (more or less), the bottleneck is with the cache and processors. As far as the application is concerned, all write IOs usually terminate in the cache now, and the final write out to disk is asynchronous. The cache is therefore very important in large disk subsystems and can be a blocking point. To reduce the impact of this it should be segmented, and have several data paths through it. A typical configuration would have 4 cache segments, each with 16 data paths, allowing 64 concurrent IO operations. A segmented cache is also more resilient, as then the subsystem can survive a cache failure.
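The arithmetic behind the typical configuration above can be worked through directly. The segment and path counts are the article's example figures, not those of a specific product:

```python
# Worked figures from the text: 4 cache segments, each with 16 data paths.
CACHE_SEGMENTS = 4
PATHS_PER_SEGMENT = 16

concurrent_ios = CACHE_SEGMENTS * PATHS_PER_SEGMENT
print(concurrent_ios)        # 64 concurrent IO operations

# Segmentation also buys resilience: losing one segment removes only
# its own paths, and the subsystem carries on with the remainder.
after_one_failure = (CACHE_SEGMENTS - 1) * PATHS_PER_SEGMENT
print(after_one_failure)     # 48 data paths still available
```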
The processors are at the heart of the subsystem. These are the CPUs which contain all the microcode describing the subsystem emulation, the RAID set-up, the channel connectivity and more. Terms used include 'cluster' and 'ACP'. You need at least 2 processors to make concurrent microcode changes, as then you can switch one processor off while you alter it. The subsystem will run on the remaining clusters, though performance may be degraded.
A number of storage software extras were developed over the years to use spare processor capacity, for example for RAID, snapshot copy and remote mirroring. These products could use spare processor cycles while the subsystem was waiting for the spinning disks to deliver data. With the advent of Flash storage and NVMe, the data is delivered faster and there are not as many spare processor cycles available. This means that the processor can become a bottleneck now.