HSM on Intel systems
What is HSM?
The concept behind Hierarchical Storage Management (HSM) is simple. When you create a file, you allocate it on
expensive disk. When a file is new, you would expect to be updating
or accessing it quite often, but after a while the file becomes 'stale'
and is hardly ever looked at. The data is still important, but it does
not need to be kept on expensive primary disk. An HSM product moves
stale data automatically to less expensive, slower secondary storage
devices like cheap SATA disk, tape libraries or optical jukeboxes.
When you access the file again, it is automatically and transparently recalled to the expensive primary disk. In theory, users never run out of storage and have constant access to their data regardless of where it is stored.
The HSM principle is explained in a bit more detail in the GIF below.
The customer sees the left-hand disk, in which the dark blue boxes represent normal data files that are currently being used. In this context, 'currently' probably means that these files have been looked at in the past 3 months.
The disk is also full of older files, shown in yellow, which typically use 80% of the disk space. The HSM product migrates the older files off to cheaper storage, but leaves a small green stub file behind in the same directory as the original. This means that it's easy to find the file, as the stub is where you expect it to be.
The stub files use a lot less space than the original, so they represent a considerable space saving. If you try to access the stub file, the HSM software intercepts the open request, and holds it, while it goes off to retrieve the file from the near-line storage. Once it copies the data back to the on-line disk it releases the open request, and you get your file. This typically takes a few seconds for a recall from cheap disk, while a recall from tape can take a minute or two. Some products give you a warning message that the recall is in progress, and some give you the option to cancel the recall.
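The stub-and-recall behaviour described above can be illustrated with a toy in-memory model. This is only a sketch: the `ToyHsm` and `Stub` names are hypothetical, two dictionaries stand in for the primary and secondary storage, and it assumes the secondary copy is kept after a recall, which real products may or may not do.

```python
from dataclasses import dataclass, field


@dataclass
class Stub:
    """Small placeholder left on primary disk in place of a migrated file."""
    original_size: int


@dataclass
class ToyHsm:
    primary: dict = field(default_factory=dict)    # path -> bytes or Stub
    secondary: dict = field(default_factory=dict)  # path -> bytes

    def migrate(self, path):
        """Copy the file to secondary storage and leave a stub behind."""
        data = self.primary[path]
        if isinstance(data, Stub):
            return  # already migrated, nothing to do
        self.secondary[path] = data
        self.primary[path] = Stub(original_size=len(data))

    def open(self, path):
        """Return the file contents, recalling from secondary if needed."""
        entry = self.primary[path]
        if isinstance(entry, Stub):
            # A real product would hold the open request here while the
            # data is copied back from near-line storage.
            entry = self.secondary[path]
            self.primary[path] = entry
        return entry


hsm = ToyHsm()
hsm.primary["/projects/report.doc"] = b"x" * 1000
hsm.migrate("/projects/report.doc")        # primary now holds only a stub
data = hsm.open("/projects/report.doc")    # triggers a transparent recall
```

The caller of `open` never needs to know whether the file was a stub, which is the transparency the text describes.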
HSM can work on two retention parameters: the amount of time data is allowed to live on primary storage before it is moved to secondary storage, and the amount of time it can live on secondary storage before it is deleted. This means that you can use HSM to automate your deletion policies. The big challenge, though, is to analyse your data and decide what the deletion policies should be before you automate them.
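The two retention parameters can be sketched as a small decision function. The function name, the use of the last access date as the clock for primary storage, and the default values (90 days on primary, roughly seven years on secondary) are all assumptions made for illustration, not settings from any particular product.

```python
from datetime import date


def retention_action(today, last_access, migrated_on,
                     primary_days=90, secondary_days=7 * 365):
    """Decide what to do with one file under a two-parameter retention policy.

    migrated_on is None while the file still lives on primary storage.
    Returns "keep", "migrate" or "delete".
    """
    if migrated_on is None:
        # File is on primary disk: migrate once it has gone stale.
        if (today - last_access).days > primary_days:
            return "migrate"
        return "keep"
    # File is on secondary storage: delete once its retention has expired.
    if (today - migrated_on).days > secondary_days:
        return "delete"
    return "keep"
```

For example, a file last accessed on 1 January 2006 and checked on 1 June 2006 has been idle for 151 days, so with the assumed 90-day limit it would be selected for migration.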
Alternatively, HSM can work on disk thresholds. It will start migrating older data if the disk occupancy exceeds a high water threshold, and then stop migrating once the occupancy level falls to a low water threshold. The intention here is that you never run out of disk space.
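A minimal sketch of threshold-driven migration follows. The function name, the 90% and 70% thresholds and the tuple layout are assumptions for illustration; the sketch also ignores the small amount of space taken by the stub files left behind.

```python
def select_for_migration(files, capacity, high=0.90, low=0.70):
    """Pick files to migrate, oldest first, under high/low water thresholds.

    files    : list of (name, size, last_access) tuples, where a smaller
               last_access value means the file was used longer ago
    capacity : total size of the primary disk, in the same units as size
    Returns the names of the files to migrate, in migration order.
    """
    used = sum(size for _, size, _ in files)
    if used <= high * capacity:
        return []  # below the high water mark, nothing to do

    chosen = []
    for name, size, _ in sorted(files, key=lambda f: f[2]):
        if used <= low * capacity:
            break  # occupancy has reached the low water mark
        chosen.append(name)
        used -= size
    return chosen
```

Having two thresholds rather than one means that migration runs in occasional batches instead of being triggered by every new file once the disk is nearly full.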
- The customer gets a 'virtual volume' that more or less has no size limit.
- HSM allows a 'halfway house' between having the data on fast disk, and deleting the data. This allows you to keep data available for longer, without the problems of maintaining old data on expensive primary disk.
- HSM saves money. In 2006 round numbers, a gigabyte of high availability, high performing disk costs about £6, a gigabyte of cheap disk costs about £1, and a gigabyte of tape costs about 30p.
- HSM can also speed up disk recovery. If you need to recover a whole disk, it will take a lot longer if you have to recover all the old, stale data alongside the current data. Even if you protect your disks from hardware failure by data mirroring, this does not protect the data from logical errors. An accidental or deliberate disk reformat, or file corruption caused by a virus, could damage all mirrored copies of the data, and then the disk would have to be recovered from a tape backup. Some products can take up to 24 hours to recover a 200GB disk. If you move all that older data off the primary disk, you could reduce the primary data from 200GB to 40GB, which you could recover much faster.
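The cost and recovery figures quoted above can be worked through with a little arithmetic. The sketch below uses the 2006 prices from the text, held in pence to keep the sums exact. It assumes that recovery time scales linearly with the amount of data, and it ignores stub space and the cost of backing up the secondary copy; the function names are hypothetical.

```python
# 2006 round-number prices from the text, in pence per gigabyte.
PRIMARY_PENCE_PER_GB = 600
CHEAP_DISK_PENCE_PER_GB = 100
TAPE_PENCE_PER_GB = 30


def storage_cost_pounds(primary_gb, secondary_gb, secondary_pence_per_gb):
    """Total cost in pounds of a primary/secondary split of the data."""
    pence = (primary_gb * PRIMARY_PENCE_PER_GB
             + secondary_gb * secondary_pence_per_gb)
    return pence / 100


def recovery_hours(data_gb, reference_gb=200, reference_hours=24):
    """Estimated recovery time, assuming it scales linearly with size."""
    return reference_hours * data_gb / reference_gb


all_primary = storage_cost_pounds(200, 0, CHEAP_DISK_PENCE_PER_GB)
with_cheap_disk = storage_cost_pounds(40, 160, CHEAP_DISK_PENCE_PER_GB)
with_tape = storage_cost_pounds(40, 160, TAPE_PENCE_PER_GB)
print(all_primary, with_cheap_disk, with_tape)   # 1200.0 400.0 288.0
print(recovery_hours(40))                        # about 4.8 hours
```

On these assumptions, moving the stale 160GB off a 200GB primary disk cuts the storage cost from about £1,200 to £400 with cheap disk or £288 with tape, and cuts the worst-case primary recovery from 24 hours to roughly 5.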
What is the difference between HSM Migration and Archiving?
An archive is typically a point-in-time application backup, which is retained
for several years. After a verified successful backup, selected files
can be deleted from hard disk to free up space. They can then be recovered
from archive later if required.
The advantage of archiving is that it can be achieved using existing hardware and software. The only cost is the archive tape.
The disadvantage is that the process is all manual. You may have to delete the files manually, and you will have to keep a manual record of the deleted files and their associated archive tapes. If a file is required again, then a manual restore is needed.
Archiving is suitable for remote file servers connected by low-bandwidth networks. Archiving could also be considered for project-based servers, where an end-of-project backup could be taken, and filed away for reference, before the system is cleaned up for the next project.