The main HSM components are illustrated below.
The white boxes are the data functions, the yellow ellipses are the control datasets, the blue boxes are the data stores, and the green boxes are the recovery logs and trace datasets.
Backup and Migration are the two main functions of HSM.
HSM recognises data in three places. ML0 (Migration level 0) is the on-line data that is accessed by applications and users. Data is usually moved to ML1 (Migration level 1) first, according to SMS management class rules. ML1 is a dedicated pool of disks, which are non-SMS managed, and the data held there is compressed. The minimum space a dataset can occupy is one track, which is about 53K (assuming 3390 track geometry). A small dataset migrated to ML1 as it stands would still use a whole track, so small datasets are instead held as records in Small Data Set Packing datasets (SDSPs). These are standard KSDS VSAM files.
To use small dataset packing, you have to tell DFHSM how big a small dataset is. You need an entry in your ARCCMDxx dataset like
As a guide, with half track blocking and 3:1 compression, about 160KB of data will compress down to roughly a single 53K track on ML1. If all these assumptions are correct for your site, then 160KB is a reasonable cutoff point, as anything smaller would not occupy a whole track.
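The arithmetic behind that cutoff can be sketched in a few lines of Python. This is only an illustration: the 53K track size and the 3:1 compression ratio are the guide figures quoted in the text, and the function names are made up for the example. Your own compression ratio may well be different.

```python
import math

TRACK_KB = 53          # approximate 3390 track size quoted in the text
COMPRESSION_RATIO = 3  # assumed 3:1 compression on ML1

def ml1_tracks(dataset_kb, track_kb=TRACK_KB, ratio=COMPRESSION_RATIO):
    """Estimate the whole tracks a migrated copy would occupy on ML1."""
    compressed_kb = dataset_kb / ratio
    # A dataset always occupies at least one whole track.
    return max(1, math.ceil(compressed_kb / track_kb))

def sdsp_cutoff_kb(track_kb=TRACK_KB, ratio=COMPRESSION_RATIO):
    """Largest dataset size that still compresses into a single track."""
    return track_kb * ratio

print(sdsp_cutoff_kb())   # 159, i.e. roughly the 160KB guide figure
print(ml1_tracks(40))     # 1, a 40KB dataset still takes a whole track
print(ml1_tracks(400))    # 3
```

Anything below the cutoff wastes part of a track if it is migrated on its own, which is why it is a candidate for an SDSP instead.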
If a dataset continues to be unused, it will eventually be migrated off to ML2 (migration level 2), which is usually high capacity cartridge. Large datasets are often migrated straight off to ML2.
Migrated datasets are given a special catalog entry, with a volser of MIGRAT, to indicate that the dataset is migrated. The MCDS (Migration Control Data Set) keeps a record of what has been migrated, and where the migrated data is held. If you try to access a migrated dataset, it will be automatically recalled back to ML0.
There are claims that ML1 is not relevant anymore. Disk subsystems with tiered Flash / Fast Disk / Slow Disk storage and automatic tiering software provide the same functionality as a compressed disk pool, and arguably manage things better. The mainframe CPU saving in not needing to manage the ML1 data can be considerable, but then again, HSM housekeeping normally runs at quiet times when there are free CPU cycles. I guess that this is something that each site needs to evaluate for themselves, then decide what is best for them.
There are three variations of space management: primary, secondary and interval.
To run primary space management you need to issue commands like these
DEFINE PRIMARYSPMGMTCYCLE(YNNNYNN) CYCLESTARTDATE(2017/01/23)
SETSYS PRIMARYSPMGMTSTART(0100 0300)
This means run primary space management on Mondays and Fridays, starting at 01:00. HSM will not start to process any new volumes after 03:00. The cycle starts on a Monday because January 23rd 2017 was a Monday. You would typically enter this command once when setting up HSM, and then enter it again only if you wanted to change the parameters.
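If you want to check which days a cycle pattern will actually pick, a small Python sketch like this one can help. It is not part of HSM; the function name is hypothetical, and it simply assumes that the first character of the pattern lines up with the cycle start date and that the pattern then repeats.

```python
from datetime import date, timedelta

def cycle_run_days(pattern, cycle_start, first_day, days):
    """Return the dates in the next `days` days on which the pattern says Y."""
    run_dates = []
    for offset in range(days):
        day = first_day + timedelta(days=offset)
        # Position in the repeating cycle, counted from the cycle start date.
        index = (day - cycle_start).days % len(pattern)
        if pattern[index] == "Y":
            run_dates.append(day)
    return run_dates

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

start = date(2017, 1, 23)
for day in cycle_run_days("YNNNYNN", start, start, 14):
    print(day.isoformat(), WEEKDAYS[day.weekday()])
```

For the two weeks from 23rd January 2017 this lists two Mondays and two Fridays, which matches the description above.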
Primary space management does all the space management functions on the primary, or ML0, disks. If allowed by parameters, it will work its way through each ML0 volume, processing the largest dataset first. It deletes temporary and expired datasets, releases unused space, then migrates data to ML1 or ML2 as appropriate, until all volumes are below their SMS thresholds or there are no more datasets eligible for processing.
Secondary space management needs an initial command, similar to primary space management. If you never enter this command, then secondary space management will never run.
DEFINE SECONDARYSPMGMTCYCLE(YNNNYNN) CYCLESTARTDATE(2017/01/23)
SETSYS SECONDARYSPMGMTSTART(0030 0200)
Secondary space management basically looks after the ML1 and ML2 archive pools. If the management class criteria are met, it moves data from ML1 to ML2. It also runs TAPECOPY commands if they are needed, and it deletes expired migrated datasets.
The end of that last sentence needs a bit of expansion. HSM can delete migrated datasets; that is, it can delete data once it is no longer required. It does this based on retention policies set in the DFSMS management class. Normally, HSM will not delete a dataset unless it has a current backup of it. This is an issue if you do not use DFHSM to back up your data, so it is possible to apply a patch to HSM that allows it to delete data that it has not backed up.
So if you are expecting HSM to delete data and this is not happening, one possibility is that HSM requires a backup before it will delete the data. Other things can go wrong too: for SMS volumes, the storage group containing the volumes must be defined with AM=Y, and the HSM parameter 'Scratch expired Data Sets' must be set to YES. If it is not, change it with the command
HSEND SETSYS EXPIREDDATASETS(SCRATCH)
You run interval migration on one LPAR, so for that LPAR you specify
in the ARCCMDxx Parmlib member, and in all other LPARs you specify
Interval migration runs every hour, and checks each volume's occupancy against the SMS threshold settings for the volume's storage pool. If the high threshold is exceeded, then DFHSM will migrate eligible datasets until the low threshold is reached or no more datasets are eligible. It will also delete temporary and expired datasets.
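The threshold logic can be pictured with a short Python sketch. This is a simplified model and not how DFHSM is coded: the function name and the tuple layout are invented for the example, and the choice to take the largest eligible dataset first is an assumption borrowed from the primary space management description.

```python
def interval_migrate(capacity, datasets, high_pct, low_pct):
    """Pick datasets to migrate from one volume.

    datasets: list of (name, size, eligible) tuples, sizes in the same
    unit as capacity. Returns the names chosen, in migration order.
    """
    used = sum(size for _, size, _ in datasets)
    if used * 100 <= high_pct * capacity:
        return []  # high threshold not exceeded, nothing to do
    migrated = []
    # Assumption for this sketch: largest eligible dataset goes first.
    candidates = sorted((d for d in datasets if d[2]),
                        key=lambda d: d[1], reverse=True)
    for name, size, _ in candidates:
        if used * 100 <= low_pct * capacity:
            break  # low threshold reached
        used -= size
        migrated.append(name)
    return migrated

volume = [("A", 400, True), ("B", 300, False), ("C", 150, True), ("D", 50, True)]
print(interval_migrate(1000, volume, high_pct=85, low_pct=60))  # ['A']
print(interval_migrate(1000, volume, high_pct=95, low_pct=60))  # []
print(interval_migrate(1000, volume, high_pct=85, low_pct=30))  # ['A', 'C', 'D']
```

In the first call the volume is 90% full, so it is over the 85% high threshold, and moving dataset A alone takes it down to 50%, below the 60% low threshold. In the second call the high threshold is never exceeded, so nothing is touched.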
You can restrict the number of recall tasks with the following SETSYS parameter.
SETSYS MAXRECALLTASKS(n2) TAPEMAXRECALLTASKS(n1)
The tape recall tasks are a subset of the max recall tasks, so n1 must be smaller than n2.
Data needs to be backed up on a regular basis, in case it is accidentally deleted or corrupted. Hardware failure is very rare these days. HSM can stage backups to ML1, or write them straight to tape. Backups are recorded in the BCDS (Backup Control Data Set). This makes recovery very easy. The OCDS (Offline Control Data Set) keeps a record of all tapes used by HSM, both backup and migration.
To schedule HSM backups to run automatically, you need to add lines like these to your ARCCMDxx member
SETSYS AUTOBACKUPSTART(0100 0200 0600)
What this says is that the backups will start between 01:00 and 02:00, and no new volume backups will start after 06:00. Up to three concurrent backup tasks can run on this host. If you are running in a sysplex with several LPARs, it's best to run several concurrent backup tasks from a single LPAR, rather than spreading the tasks between LPARs.
Fast Replication uses FlashCopy and can create a backup of several complete storage pools in a very short time, so reducing application downtime. Fast replication is designed and optimised for DB2 backups but can be used for any data. You need to define a special SMS storage pool type called a copy pool, which is used to fast replicate all the volumes in one or more storage groups. When you define the copy pool, you just add in the associated source storage groups, not the source volumes. Each copy pool can be associated with up to 256 storage groups and each storage group can be replicated in up to 50 different copy pools, which makes the configuration very flexible.
If you run a FRBACKUP command for a particular copy pool, DFHSM will create a fast replication backup for each volume in every storage group defined within that pool.
These backups are kept as versions on disk, and you can keep up to 85 backup versions. IBM recommends that you keep at least 2 versions on disk, because when the FRBACKUP command is issued, the oldest backup version is deleted to make room for the new one. If you just keep one version, then you will have no backups for a short time. You have the option of dumping the backup data in the copy pool to tape, but the tape backups are actually associated with the source volumes, not the copy volumes.
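The reason for keeping at least two versions is easy to see in a toy model. The Python sketch below is purely illustrative (the function name is hypothetical, and it assumes the oldest version is dropped before the new one is complete, as described above).

```python
from collections import deque

def take_backup(versions, max_versions, new_version):
    """Simulate one FRBACKUP against a copy pool keeping max_versions on disk.

    Returns how many complete versions exist while the new one is being made.
    """
    if len(versions) == max_versions:
        versions.popleft()  # oldest version is deleted to make room
    available_during_backup = len(versions)
    versions.append(new_version)
    return available_during_backup

one_version = deque()
print([take_backup(one_version, 1, n) for n in range(1, 4)])   # [0, 0, 0]

two_versions = deque()
print([take_backup(two_versions, 2, n) for n in range(1, 4)])  # [0, 1, 1]
```

With a single version, every backup run leaves a window with nothing to recover from. With two versions, only the very first run is exposed.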
Recovery from a fast replication backup version can be performed at the copy pool level from a disk copy, at the individual volume level from a disk or tape copy, or at the data set level from a disk or tape copy.
You define a copy pool using option 'P' from the primary DFSMS menu, add some volumes to it, and then associate one or more primary SMS storage pools with it. You do not associate any primary volumes with the copy pool. These are added dynamically, so the definition does not need to be changed as volumes are added to and deleted from the primary pool.
As FRBACKUP is optimised for DB2, it works with the DBMS to ensure buffers are flushed and the entire DB2 configuration is backed up consistently. If you use FRBACKUP for non-DB2 systems, it's up to you to manage the buffers and make sure that all the datasets that you need to recover an application are in the associated primary storage pools. Ideally, the application should be stopped for the short time it takes for the FRBACKUP function to run. Special consideration must be given to multi-volume datasets. Because VTOC enqueues are obtained as each volume is processed, a multi-volume dataset can be updated on one of its volumes while another volume is being fast replicated. To avoid this situation, you must ensure that dataset activity for multi-volume datasets is quiesced during the fast replication process.
A standard FlashCopy relationship is normally withdrawn once all the source tracks are copied to the target, with FlashCopy in either COPY or NOCOPY mode. It is possible to maintain this relationship with persistent FlashCopy so that changes to the primary volumes are recorded in a bit map, then a FlashCopy refresh just needs to copy the changed data.
FRBACKUP uses this principle to create different backup versions on disk, and it is controlled by the 'Number of DASD Fast Replication Backup Versions with Background Copy' parameter in the Copy Pool definition panel in DFSMS, so you need to plan this out in advance. The value can be a number between 0 and 85.
If you specify a number greater than 0, then you use incremental FlashCopy, and a persistent relationship is established. When you run the first backup, all the data in the primary storage pools is copied in the background to the copy pool, and the data in the copy pool will be the same as the data was in the primary pools when the FRBACKUP command was issued. The actual first copy could take some time. Subsequent backups will just copy the data changed since the last backup and will take much less time.
If you specify 0, then HSM uses the NOCOPY FlashCopy option and only one backup version at a time is supported by the copy pool. When you take subsequent backups with the FRBACKUP command, a new FlashCopy relationship is established and it replaces the previous backup version instantaneously. You might do this if you are just using the Fast Replication backup as a staging position before copying it to tape, if you wish to use fast reverse restore, or if you want to use Space Efficient volumes as target volumes.
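The change bitmap idea can be shown with a toy Python model. This is a sketch of the principle only, not of how the disk subsystem implements FlashCopy; the class and method names are invented for the illustration, and a 'track' here is just a list entry.

```python
class IncrementalCopy:
    """Toy model of a persistent copy relationship with a change bitmap."""

    def __init__(self, source):
        self.source = list(source)           # track contents on the primary
        self.target = [None] * len(source)   # track contents on the copy
        self.changed = [True] * len(source)  # first copy must move everything

    def write(self, track, data):
        """Update a primary track and record the change in the bitmap."""
        self.source[track] = data
        self.changed[track] = True

    def refresh(self):
        """Copy only the flagged tracks; return how many were copied."""
        copied = 0
        for track, dirty in enumerate(self.changed):
            if dirty:
                self.target[track] = self.source[track]
                self.changed[track] = False
                copied += 1
        return copied

pair = IncrementalCopy(["a", "b", "c", "d"])
print(pair.refresh())   # 4, the first backup copies every track
pair.write(2, "C")
print(pair.refresh())   # 1, the next backup copies only the changed track
print(pair.target)      # ['a', 'b', 'C', 'd']
```

The first refresh has to move everything, which is why the initial backup can take a while, and later refreshes only move what the bitmap has flagged.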
You use a SETSYS command to change the performance characteristics of FRBACKUP as follows
SETSYS MAXCOPYPOOLTASKS(FRBACKUP(15) FRRECOV(15) DSS(24))
FRBACKUP uses DFDSS as its data mover, and the FRBACKUP and FRRECOV parameters determine how many concurrent invocations of DFDSS can run, and they can be between 1 and 64. The DSS parm is the number of volume pairs that FRBACKUP passes to each DFDSS instance and can be between 1 and 254. The default values are shown in the above command for all 3 values.
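To get a feel for what these numbers mean for a big copy pool, here is a rough Python calculation. It assumes that each DFDSS invocation is handed up to DSS volume pairs and that invocations beyond the task limit simply wait their turn. The real scheduling inside HSM may be different, so treat this as a planning aid only; the function names are hypothetical.

```python
import math

def dss_invocations(volume_pairs, dss=24):
    """DFDSS invocations needed if each one takes up to `dss` volume pairs."""
    return math.ceil(volume_pairs / dss)

def waves(volume_pairs, tasks=15, dss=24):
    """Passes needed if only `tasks` invocations can run concurrently."""
    return math.ceil(dss_invocations(volume_pairs, dss) / tasks)

# Default values from the SETSYS command above: FRBACKUP(15) and DSS(24).
print(dss_invocations(360), waves(360))     # 15 1
print(dss_invocations(1000), waves(1000))   # 42 3
```

On these assumptions the defaults handle 15 x 24 = 360 volume pairs in one pass, and a 1000 volume copy pool would need three.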
In all these examples we will work with a copy pool called CPYPOOL1. You can use the PREPARE keyword to test the FRBACKUP command and make sure the copy pool is set up correctly.
HSEND FRBACKUP CP(CPYPOOL1) PREPARE
You run a backup of all the storage pools associated with the copy pool with the EXECUTE parameter. The value of fast replication DASD backup versions determines if the backup is persistent or not. If this value is 0, DFSMSdss is always called in NOCOPY mode. The second command specifically asks for an incremental backup.
HSEND FRBACKUP CP(CPYPOOL1) EXECUTE
HSEND FRBACKUP CP(CPYPOOL1) FCINCREMENTAL
If you want to copy to tape you need to assign a dumpclass. You can either do this on the copy pool definition, or explicitly with the command. It's best to define a dumpclass that is only used for fast replication. The two available options are
HSEND FRBACKUP CP(CPYPOOL1) DUMP DCLASS(DCFR001)
HSEND FRBACKUP CP(CPYPOOL1) DUMPONLY DCLASS(DCFR001)
The DUMP keyword means create an FR backup in the copy pool then immediately copy it off to tape.
The DUMPONLY keyword means take an existing FR backup and copy that off to tape. By default the latest backup is used, but you can pick out older backups by using one of the DATE, VERSION, TOKEN, or GENERATION parameters.
You can recover all of the volumes in all of the storage pools associated with a copy pool with a single command, and as this is using FlashCopy, it is quite fast. Remember, if you specified VERSIONS=0 and Allow Fast Reverse Restore=N, you cannot recover from the DASD backup copy.
HSEND FRRECOV CP(CPYPOOL1) VERIFY(Y)
If this command hits a problem and a volume or two are missed, you can recover an individual volume with the command
HSEND FRRECOV TOVOLUME(volser) FROMCOPYPOOL(CPYPOOL1)
The CDS files are critical to HSM, so updates to them can be logged. If a CDS fails, it can be recovered from backup, then the logged updates applied to get it back to the point of failure. The PDA (Problem Determination Aid) files are trace datasets. The CDS Recovery section explains how the logs can be used to fix CDS errors.
When several HSM instances run in a Sysplex they are called an HSMplex. You can run more than one HSM instance on a single LPAR, as well as running instances on separate LPARs, with a maximum of 39 instances in a single plex. All the instances must share the MCDS, BCDS, OCDS and JRNL datasets.
As these datasets are shared, you need to consider how you will manage dataset integrity by managing ENQs between instances and LPARs. Check out the available IBM manuals and redbooks, which detail how to do this.
One instance must be defined as the Primary instance, by placing PRIMARY=YES in the startup parms for that instance. The Primary instance is responsible for running the CDS backups, pre-migration backups, moving backups from ML1 to the backup volumes, deleting expired dumps and extra VTOC copy datasets.
If there is no primary instance running, then those functions do not run. You can set several of the other instances to be primary standbys, so they will be automatically eligible to be promoted if the original primary fails. If this happens, then the first standby that manages to take over will become the new primary, and the others revert to being standbys again.
For LPARs that are able to access the same set of data, you can use a Common Recall Queue to balance out recall work between LPARs. If some of your LPARs can access all the data and some cannot, you can exclude the instances that do not share data from the common recall queue with startup parameters like these: