The concept behind HSM is simple. When you create a file, you allocate it on expensive disk or maybe SSD. When a file is new, you would expect to be updating or accessing it quite often, but after a while the file becomes 'stale' and is hardly ever looked at. The data is still important, but it does not need to be kept on expensive primary disk. An HSM product moves stale data automatically to less expensive, slower secondary storage devices like cheap SATA disk or tape libraries.
When you access the file again, it is automatically and transparently retrieved back to the expensive primary disk. In theory, users never run out of storage and have constant access to their data regardless of where it is stored.
The issue that HSM software providers face now is that hardware storage tiering provides an HSM-like function without any special software, and works in a much more dynamic fashion. The enterprise disk page has some information about hardware tiering, but in short, 'active' data is held on SSD and once the data becomes inactive, it is moved off to slower spinning disk. This process itself is becoming a bit stale, as many disk subsystems are now all-flash, as opposed to flash / spinning disk hybrids.
What HSM can do is move data between disk subsystems, or even to tape. It can also delete data once it reaches end of life, so for that reason alone it is worth looking at.
The HSM principle is explained in a bit more detail in the GIF below.
The customer sees the left-hand disk, in which the dark blue boxes represent normal data files that are currently being used. In this context, 'currently' probably means that these files have been looked at in the past 3 months.
The disk is full of older files, shown in yellow, and these typically use 80% of the disk space. The HSM product migrates the older files off to cheaper storage, but leaves a small green stub file behind in the same directory as the original. This means that it's easy to find the file, as the stub is where you expect it to be.
The stub files use a lot less space than the original, so they represent a considerable space saving. If you try to access the stub file, the HSM software intercepts the open request, and holds it, while it goes off to retrieve the file from the near-line storage. Once it copies the data back to the on-line disk it releases the open request, and you get your file. This typically takes a few seconds for a recall from cheap disk, while a recall from tape can take a minute or two. Some products give you a warning message that the recall is in progress, and some give you the option to cancel the recall.
HSM can work on two retention parameters, the amount of time data is allowed to live on primary storage before it is moved to secondary storage, and the amount of time it can live on secondary storage before it is deleted. This means that you can use HSM to automate your deletion policies, but to be honest the big challenge is to analyse your data and decide what the deletion policies should be, before you can automate them.
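The two retention parameters can be expressed as a simple age-based policy. The sketch below is a hypothetical example: the 90-day migration age and 7-year deletion age are invented policy values, not defaults from any real product.

```python
import time

# Hypothetical policy values: 90 days on primary, then 7 years on secondary.
MIGRATE_AFTER = 90 * 86400            # seconds before migration to secondary
DELETE_AFTER = 7 * 365 * 86400        # seconds before end-of-life deletion

def classify(last_access, now=None):
    """Return the action an HSM policy would take, given a file's last access time."""
    now = time.time() if now is None else now
    age = now - last_access
    if age >= DELETE_AFTER:
        return "delete"   # end of life: remove from secondary storage
    if age >= MIGRATE_AFTER:
        return "migrate"  # stale: move to secondary storage
    return "keep"         # active: stays on primary disk
```

The hard part, as the text says, is not this logic but deciding what the numbers should be for each class of data.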
Alternatively, HSM can work on disk thresholds. It will start migrating older data if the disk occupancy exceeds a high water threshold, and then stop migrating once the occupancy level hits a low threshold. The intention here is that you never run out of disk space.
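The threshold behaviour can be sketched as well. Here the 80% high and 60% low watermarks are illustrative assumptions, and migrating oldest-first is one plausible selection strategy among several a real product might use.

```python
def threshold_migrate(files, capacity, high=0.80, low=0.60):
    """files is a list of (name, size, last_access) tuples.
    If occupancy exceeds the high watermark, migrate oldest files first
    until occupancy drops to the low watermark."""
    used = sum(size for _, size, _ in files)
    migrated = []
    if used / capacity <= high:
        return migrated                  # below the high watermark: do nothing
    for name, size, _ in sorted(files, key=lambda f: f[2]):  # oldest first
        if used / capacity <= low:
            break                        # reached the low watermark: stop
        migrated.append(name)
        used -= size
    return migrated
```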
An archive is typically a point in time application backup, which is retained for several years. After a verified successful backup, selected files can be deleted from hard disk to free up space. They can then be recovered from archive later if required.
The advantage of archiving is that it can be achieved using existing hardware and software. The only cost is the archive tape.
The disadvantage is that the process is all manual. You may have to delete the files manually, and you will have to keep a manual record of the deleted files and associated archive tapes. If a file is required again, then a manual restore is needed.
Archiving is suitable for remote file servers connected by narrow bandwidth networks. Archiving could also be considered for project based servers, where an end of project backup could be taken, and filed away for reference, before the system is cleaned up for the next project.