How do virtual SANs, ILM and Tiered Storage fit together? Can they deliver real benefit to your business?
Tiered storage is a new term to describe an old concept. The idea is that you have storage devices with different capacity, throughput, access speeds, connectivity and price. Expensive storage is fast and highly available. Cheap storage is slow and may be less reliable. The storage tiers will include different types of disk, tape and possibly optical media. This is sometimes called the Storage Continuum and was called the Storage Triangle several years ago.
ILM is a combination of storage tiers, software to manage the storage and rules to decide how the management will work as the value of data changes. This section is about the virtualization hardware that enables ILM. The ILM section describes the software, but remember that both are needed for good data management. SNIA is currently agreeing specifications for consistent ILM rules and metadata, which will integrate ILM with physical storage tiers.
Virtualization Defined
Few organisations or vendors can agree on a definition of Storage Virtualization.
SNIA came up with the following: "The act of integrating one or more (back end) services or functions with additional (front end) functionality for the purpose of providing useful abstractions. Typically virtualization hides some of the back end complexity, or adds or integrates new functionality with existing back end services."
The Enterprise Strategy Group defines Virtualization as "a technology
that gathers data location information from physical storage devices,
network services and applications, and then abstracts the locations into
logical views for end users".
Taking these definitions together, combining them with others, and not
worrying too much about rigorous formality, Virtualization can be summarised
into the following points.
Storage Virtualization can simplify the physical implementation of different
devices by standardising them into one logical view.
Virtualization divorces the storage from the server and lets the server concentrate
on processing application requests.
Virtualization removes the requirement for technicians to have a good knowledge
of each server platform to be able to manage and configure physical
storage.
Virtualization can add value to a physical implementation by providing extra
functionality.
The advantages of Virtualization
Common tool set
Storage software can now cost more than the hardware it supports. Each
storage vendor has its own brand of software and requires a different
skill set to manage it. Virtualization could permit centralised and consistent
management of all volumes within a data centre using the same methods
and products, no matter what the platform. This will result in a reduction
in staff costs, as the process of managing and relocating the data would
be standardised. Virtualization can cut the cost of replication software
as it uses one set of software tools to manage all storage. If the mirroring
and copying functionality moves to the virtualization server then there
is no requirement for storage subsystem based products. The predicted cost
savings point to a payback within two years.
Data placement within storage tiers
Tiered storage can be defined as a set of storage pools having different
performance and availability characteristics, the most important point
being that each tier has a different cost. In the past, most sites adopted
a 'one size fits all' approach, and either used expensive storage for
all their data, or went for a middle ground compromise. The ultimate goal
of virtualization is to consolidate multiple storage devices from different
vendors into a set of storage pools or tiers. One can then place data into the
correct pool, depending on the value of that data. The real trick is to
recognise that the value of data changes with time, and so move the data
around the tiers as its value changes. The problem with any large scale
adoption of storage tiers was a combination of finding some way to connect
all the tiers together, some way of moving the data between the tiers,
and some way to decide which tier to use for a piece of data at a point
in its lifecycle. SANs fixed the connectivity problem and Virtualization
should fix the data movement issue. The third issue should be resolved
by ILM. ILM will add the final building block to virtualization by providing
the 'business policies' which describe what sort of management different
classes of data will get. The policies will determine which tier is used
to hold the data. The policies will also determine what sort of backup
that class of data will get, and if it will be moved to less expensive
media if it becomes inactive. Holding the policies centrally should simplify
enterprise storage management, and allow a single administrator to manage
a lot more storage.
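
The idea of centrally held policies that choose a tier as data ages can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tier names, the data classes, the inactivity periods and the `choose_tier` function are all hypothetical and do not come from any ILM product or from the SNIA specifications.

```python
from dataclasses import dataclass

# Hypothetical tier names, ordered from most to least expensive.
TIERS = ["tier1_fc", "tier2_sata", "tier3_tape"]


@dataclass(frozen=True)
class Policy:
    """One business policy for a class of data (all fields are illustrative)."""
    initial_tier: str       # tier used when the data is new and valuable
    demote_after_days: int  # days of inactivity before moving down one tier
    backup: str             # the sort of backup this class of data gets


# Hypothetical policy table, held in one central place.
POLICIES = {
    "order_database": Policy("tier1_fc", 90, "snapshot plus nightly tape"),
    "email_archive": Policy("tier2_sata", 365, "weekly tape"),
}


def choose_tier(data_class: str, days_inactive: int) -> str:
    """Return the tier that should hold this data at this point in its life."""
    policy = POLICIES[data_class]
    start = TIERS.index(policy.initial_tier)
    # Move down one tier per full inactivity period, stopping at the cheapest.
    steps = days_inactive // policy.demote_after_days
    return TIERS[min(start + steps, len(TIERS) - 1)]
```

With this table, an order database that has been idle for 100 days would be placed on the middle tier, while the same data in active use would stay on the top tier. The policy decides the placement; the virtualization layer would do the actual movement.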
Data Replication
Data replication involves taking an instant copy of a disk, a file space, or
maybe a file. The benefits of this are a combination of fast backup to
eliminate the backup window and a point-in-time disk copy for fast recovery
from virus attacks and rapid creation of test data. It has been possible
to do this for some time with host and subsystem based Virtualization
systems to disks of the same type. The Replication section has more details.
Virtualization combines this function with tiered storage and enables
replication from expensive disk to cheap disk. Cheap SATA disk costs a
lot less than fibre-channel disk, is suitable for disk based backups,
and may be adequate for test systems. The financial benefits of using
cheaper storage tiers for these applications can be considerable.
Cross-site replication has always required two disk subsystems of similar
type and cost. Virtualization permits cross-site replication to unlike,
and potentially much cheaper devices.
Data Migration
Data migration is similar in principle to replication, except that the data
is moved between devices, not copied. Because the file spaces seen by
applications are virtualized and then mapped to a physical implementation,
that physical implementation can be changed by copying the data then adjusting
the mapping pointers. If you need to free up older devices for disposal,
the data can be moved off the old devices transparently with no need to
stop the applications. Also, if a file space has a performance problem
it can be moved to a quieter disk without any need to stop applications,
a process known as 'hot file reallocation'. The benefits are reduced application
downtime and fewer requirements to work unsocial hours.
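
The 'copy the data, then adjust the mapping pointers' step can be shown with a toy model. This is a minimal sketch, assuming the devices are plain dictionaries and that nothing writes to the volume while it is being copied; a real product has to cope with updates that arrive during the copy. The names `devices`, `mapping`, `read` and `migrate` are hypothetical.

```python
# Physical devices modelled as dictionaries of {volume_name: data}.
devices = {
    "old_array": {"vol_payroll": b"payroll records"},
    "new_array": {},
}

# The virtualization layer's mapping: virtual volume -> physical device.
mapping = {"vol_payroll": "old_array"}


def read(volume):
    """What an application sees: it names the volume, never the device."""
    return devices[mapping[volume]][volume]


def migrate(volume, target):
    """Move a volume to another device without changing its virtual name."""
    source = mapping[volume]
    devices[target][volume] = devices[source][volume]  # 1. copy the data
    mapping[volume] = target                           # 2. adjust the pointer
    del devices[source][volume]                        # 3. free the old space


before = read("vol_payroll")
migrate("vol_payroll", "new_array")
after = read("vol_payroll")
```

The application calls `read` with the same volume name before and after the move and gets the same data back, which is why the old device can be emptied without stopping anything.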
Capacity management
The process of managing LUNs, volumes and files in the Open Systems area is still a major issue. If it is necessary to change a LUN size or add new servers, this requires a lot of complicated manual effort. The processes are different depending on the platform and that makes the issue worse. To make the effort more manageable, storage managers often make LUNs big enough to cope with 12 months or so of growth and that can be expensive.
The physical / logical separation that virtualization introduces means you can expand, define and delete LUNs without affecting the rest of the system, and use up free space more efficiently.
Virtualization facilitates more efficient capacity management as it allows
you to share physical disk capacity between different server platforms.
It should mean the end of the days when you had gigabytes of free space
for Windows, while your Unix systems were struggling to find space. Virtualization
should also make capacity management easier, as it can automatically expand
disks, file systems and databases when they hit a space threshold.
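
Threshold driven expansion from a shared pool might look something like the following sketch. The 80% threshold, the 10 GB step and the class names are assumptions chosen for illustration, not values taken from any product.

```python
class Pool:
    """Free physical capacity shared by every server platform."""

    def __init__(self, free_gb):
        self.free_gb = free_gb


class Volume:
    """A virtual volume presented to one server."""

    def __init__(self, name, size_gb, used_gb):
        self.name = name
        self.size_gb = size_gb
        self.used_gb = used_gb


def expand_if_needed(volume, pool, threshold=0.8, step_gb=10):
    """Grow the volume in steps while it is at or over the space threshold.

    Returns the number of gigabytes taken from the shared pool.
    """
    added = 0
    while (volume.used_gb / volume.size_gb >= threshold
           and pool.free_gb >= step_gb):
        pool.free_gb -= step_gb
        volume.size_gb += step_gb
        added += step_gb
    return added
```

Because both the Windows and the Unix volumes draw on the same pool, free space is given to whichever volume crosses its threshold, rather than sitting idle on one platform while another runs short.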
Another benefit of virtualization is that the physical storage implementation is separated from the logical view at the file servers. This should also allow you to mix and match many kinds of physical storage, from several vendors, while hiding the detail of this from the application servers.
Single Subsystem effect
It should be noted here that there is a trend towards consolidating all of your data into one large, multi-terabyte disk subsystem, or maybe two, remote mirrored subsystems. Most of these subsystems will support different classes of disks internally. If you install a single subsystem, then you get most of the above advantages without needing virtualisation.
The three types of Virtualization
While there are lots of different virtualization products around, they
are all variants of three basic architectures: Host based, Controller
Based and Network Based. Network based virtualization has variants and
combinations of In-band and Out-band, and Switch based and Server based.
Some of the various architectures are mutually exclusive and some are
complementary. The following summaries may help master the confusion.
Host-based Virtualization
Host based virtualization has been around for years and involves splitting
up a physical volume into virtual disks or LUNs using volume manager type
software. Examples are Logical Volume Manager for UNIX, Veritas Storage
Foundation for Windows or VMware. The function of the LVM is to intercept
IO requests from applications, work out which physical storage subsystem
they will be directed to, then translate the IOs into a format that makes
sense to those physical boxes. As this Virtualization runs on the host,
it will consume host CPU. Host based Virtualization may be ideal if only
a single host is involved, but it can cause major problems if disks are
shared between servers as the servers could access the same physical space
on a disk. Every server needs to know about the other servers' virtualization
maps to ensure data integrity. This issue, and the ongoing requirement to keep on top of server patches and licensing, is driving storage vendors off host based virtualization and onto switches.
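
The translation that a volume manager performs can be pictured as a lookup in an extent table. The sketch below is a simplified illustration, not the on-disk format or API of any real volume manager; the extent size, the physical volume names and the `translate` function are hypothetical.

```python
EXTENT_BLOCKS = 1024  # hypothetical extent size, in blocks

# One entry per logical extent of the virtual disk:
# (physical volume name, physical extent number on that volume).
extent_map = [
    ("pv_a", 7),
    ("pv_a", 8),
    ("pv_b", 0),
]


def translate(logical_block):
    """Turn a block address on the virtual disk into (physical volume, block)."""
    extent, offset = divmod(logical_block, EXTENT_BLOCKS)
    if not 0 <= extent < len(extent_map):
        raise ValueError("block is outside the virtual disk")
    volume, physical_extent = extent_map[extent]
    return volume, physical_extent * EXTENT_BLOCKS + offset
```

Every host that shares the physical volumes would need an identical copy of `extent_map`, which is the data integrity problem described above: if two hosts hold different maps, they can both write to the same physical blocks.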
Controller-based Virtualization
Application servers usually write data out to file spaces. Controller-based
virtualization creates virtual images of those file spaces in the storage
subsystem and maps them to pools of physical disks.
Virtualization in the controllers or storage subsystems began with cached
and RAID storage controllers. Neither of these simplified storage subsystems,
but they both added value: faster responses and increased resilience.
Recent developments have introduced very large storage controllers with
hundreds of terabytes of internal capacity and petabytes of external
storage connection. The application servers are connected through a SAN
to the virtualization controller that will host the top-tier storage directly,
then third party mid and low tier disks can be attached to it. This is
an in-band solution with the architecture scaled to a single subsystem
managing petabytes of storage. Even though it will have lots of redundant
internal components, the subsystem will be a Single Point Of Failure and
it might need lots of channels for performance.
The main problem with controller based virtualization is that it
is difficult to share an FC SAN between different controllers, especially
from different vendors. Controller based virtualization will pretty
much lock you into one vendor.
An example is the HDS Universal Storage Platform. See the USP section for details.
Network-based Virtualization
If host based virtualization lies in the host and storage based virtualization
lies in the subsystem it should be obvious that Network based virtualization
lies in the SAN that connects the Hosts to the Storage Servers. However
there are a few different implementations and to some extent these depend
on how they manage the application data.
There are two types of information that pass between the hosts and the storage: data and metadata. The data is the blocks of information that make up files and records; the metadata contains information about the data, including the location of the blocks of data.
If the virtualization is 'in-band' then it lies in the data path so
all the data and the metadata pass through it. The virtualization 'appliance'
will create and allocate virtual volumes on the storage subsystems as
required. It presents these back to the hosts, and when it receives an
I/O request from an application the virtualization server will translate
the IO from the file system virtual request to the physical disk IO and
pass it on to the correct disk array.
If the virtualization is out-band then it will trap the metadata IO and
use that to set up a path for the data IO. Once the path is defined, the
appliance takes no further part in the operation so the metadata
passes through the virtualization appliance, but the data does not. The
virtualization appliance will create and allocate virtual volumes on the
storage subsystems, but it requires agent software in the data path to
do the virtual/physical IO translation and to present virtual volume information
to the operating system. If an application requests an I/O then the software
agent performs the virtual/physical translation and directs the I/O request
to the appropriate storage sub-system.
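
The difference between the two approaches comes down to what passes through the appliance. The following sketch counts metadata lookups and data bytes for each case. It is a toy model under stated assumptions: the `Appliance` and `Agent` classes are hypothetical, and the agent is assumed to remember a location once it has asked for it.

```python
class Appliance:
    """Holds the virtual-to-physical map and counts what passes through it."""

    def __init__(self, volume_map):
        self.volume_map = volume_map  # virtual volume -> physical array
        self.metadata_requests = 0
        self.data_bytes = 0

    def lookup(self, volume):
        """Metadata request: where does this virtual volume live?"""
        self.metadata_requests += 1
        return self.volume_map[volume]

    def in_band_write(self, arrays, volume, payload):
        """In-band: the appliance translates AND forwards the data itself."""
        target = self.lookup(volume)
        self.data_bytes += len(payload)
        arrays[target].append(payload)


class Agent:
    """Out-band: agent software in the data path does the translation."""

    def __init__(self, appliance):
        self.appliance = appliance
        self.known_paths = {}

    def write(self, arrays, volume, payload):
        # Ask the appliance only once, then send data straight to the array.
        if volume not in self.known_paths:
            self.known_paths[volume] = self.appliance.lookup(volume)
        arrays[self.known_paths[volume]].append(payload)
```

After a run of writes through `in_band_write`, the appliance's `data_bytes` counter equals everything that was written. After the same writes through an `Agent`, the counter stays at zero and only a single metadata lookup has been recorded.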
The virtualization code can either run on a dedicated server or 'appliance',
or it can run inside the SAN switches. It might seem intuitive that virtualization
in the SAN switches must be in-band, but in fact the virtualization software
can run on blades inside the switch and be out-band.
The IBM SAN Volume Controller (SVC) is in-band virtualization that either runs on xSeries servers, or can run as embedded software within a Cisco MDS 9000 switch. More detail can be found in the SVC section.
IBM used to sell an out band solution called SAN File System (SFS) but they have now withdrawn this from marketing.
EMC Invista runs on an out-band server with agent software in the SAN switches. It supports replication between devices, dynamic volume movement and volume pooling. See the Invista page for more details.
'Grid Storage' is yet another virtualization buzz word. Grid storage
can be fitted together and expanded easily a bit like Lego blocks. Virtualization
combines this modular storage with common management software. An example
is HP Storage Grid, but early indications are that they will only support
HP hardware. The Storage Grid page has more details.
Conclusion
It appears that storage virtualization has a lot to offer in terms of
simplifying the management of storage from disparate platforms and different
subsystems, but maybe not so much if you run all your disk storage on one subsystem.
The cost case for consolidation of storage management software
also looks compelling. There is a confusing choice of virtualization solutions
and it looks like it will be difficult to switch to a different solution
after implementation. The key appears to be a thorough understanding of
your existing storage and also a good prediction about how it will develop
in future. Based on that, you should be able to draw up a list of requirements
to help you to decide if virtualization is suitable for you, and then
select the appropriate solution for your environment.
There is a perception in the industry that virtualization is moving towards
intelligent switches or gateway devices. The future vision is that these devices will be application
aware, and will be able to deliver storage based on service requirements rather than based
on hardware capability. A large SAN director will be able to handle Fibre Channel, iSCSI
and Ethernet concurrently, and so bring all parts of the storage continuum together.
Add an intelligent blade to the director which can hold ILM based data management rules,
and you have almost got storage utopia. Only time will tell if the technology will live
up to this promise.