Problems with the HSM started task


Finding the HSM started task and parameter library

The HSM startup JCL will be in one of your JES proclibs. This is site dependent. If you use JES2, SYS1.PROCLIB(JES2) will contain a list of all allocated procedure libraries.

The startup parameters are usually in member ARCCMD00 in SYS1.PARMLIB, but your site may well have changed this. The startup JCL will contain

procname PROC CMD=xx

where xx is the two-character suffix that follows ARCCMD in the member name.
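
As an illustration of the naming rule above, the following minimal Python sketch derives the parameter member name from a PROC statement. The function name arccmd_member and the sample procedure name DFHSM are hypothetical, and falling back to suffix 00 when CMD= is absent is an assumption based on the usual ARCCMD00 member mentioned above.

```python
import re

def arccmd_member(proc_statement: str, default_suffix: str = "00") -> str:
    """Return the ARCCMDxx member name implied by a PROC statement.

    Assumption: if CMD= is not coded, the suffix defaults to 00.
    """
    # Look for CMD=xx where xx is exactly two member-name characters.
    match = re.search(r"\bCMD=([A-Z0-9@#$]{2})(?![A-Z0-9@#$])",
                      proc_statement.upper())
    suffix = match.group(1) if match else default_suffix
    return "ARCCMD" + suffix

# Hypothetical PROC statement, for illustration only.
print(arccmd_member("DFHSM PROC CMD=01,EMERG=NO"))  # ARCCMD01
```

Run against your own startup JCL, this only tells you which member to look for; the library it lives in is still determined by the DD statements in the procedure.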



How do I duplex HSM tapes?

HSM Tape duplexing is controlled by a parameter in the SYS1.PARMLIB member, usually ARCCMD00. The parameter is simply

SETSYS DUPLEX(BACKUP(N) MIGRATION(Y))

In this example, we are duplexing the migration tapes but not the backup tapes.



Checking the activity log

The activity logs are a good place to look for started task problems, but you cannot browse them while HSM holds them for exclusive use. Issue the HSM command

RELEASE HARDCOPY

to make the activity logs that contain data available for viewing (the command has no effect on empty logs). If ACTLOGTYPE is specified as SYSOUT, the logs are printed to the spool.
If the activity logs are allocated to DASD, nothing is printed; instead, the current logs are closed and new logs are allocated to HSM.

To see useful data in the logs, you need to have set the following parameter

SETSYS ACTLOGMSGLVL(FULL)

This shows the results of all commands, both successes and failures, and is the best option, though the logs could be big. The alternatives are EXCEPTIONONLY (failures only) and REDUCED (Level 0 movement only).

You can control where the log goes with the parameter

SETSYS ACTLOGTYPE(SYSOUT(class))
or
SETSYS ACTLOGTYPE(DASD)

The disk logs are named sitehlq.Hhostid.function.Dyyddd.Thhmmss, where hostid is the HSM host ID from the PROC statement and function is CMDLOG (command), BAKLOG (backup), DMPLOG (dump), or MIGLOG (migration).
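
To show how the naming pattern above breaks down, here is a minimal Python sketch that splits a DASD activity log name into its parts and turns the Dyyddd and Thhmmss qualifiers into a timestamp. The function name and the sample data set name (including the HSMACT high-level qualifier and host ID 1) are hypothetical, and the century is an assumption: Python's two-digit year rule is used, which maps 00 to 68 onto 2000 to 2068.

```python
import re
from datetime import datetime

# Pattern follows the layout described above:
# sitehlq.Hhostid.function.Dyyddd.Thhmmss
LOG_NAME = re.compile(
    r"^(?P<hlq>.+)\.H(?P<host>[^.]+)\."
    r"(?P<function>CMDLOG|BAKLOG|DMPLOG|MIGLOG)\."
    r"D(?P<date>\d{5})\.T(?P<time>\d{6})$"
)

def parse_activity_log_name(dsname: str) -> dict:
    """Split an activity log data set name into its components."""
    match = LOG_NAME.match(dsname.strip().upper())
    if match is None:
        raise ValueError(f"not an activity log name: {dsname}")
    # yyddd is a two-digit year plus day of year; hhmmss is the time of day.
    created = datetime.strptime(match["date"] + match["time"], "%y%j%H%M%S")
    return {
        "hlq": match["hlq"],
        "host": match["host"],
        "function": match["function"],
        "created": created,
    }

# Hypothetical data set name, for illustration only.
info = parse_activity_log_name("HSMACT.H1.MIGLOG.D24032.T141530")
print(info["function"], info["created"].isoformat())
```

A helper like this is mainly useful for sorting a catalog listing of logs by function and creation time when you are hunting for the log that covers a particular failure.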



HSM started task abends

There are two ways of getting problem information: from dumps of the started task, or from PDA traces. If HSM abends and is then restarted, VSAM error messages are issued when HSM attempts to open the CDSs. A VERIFY command is issued and the open is retried; this is usually successful, and HSM then continues processing normally.
However, if HSM cannot initialize successfully, this probably means that one of the control data sets, the journal, or both were damaged during the abnormal end. To fix this:

  • Analyze the dump resulting from the abnormal end to determine what is damaged

    If it is a CDS:

    • Recover it from the most recent backup copy with an IDCAMS IMPORT command.
    • Start HSM
    • Issue the UPDATEC command to combine the latest transactions in the journal data set with the restored backup copy of the control data set.

    If it is the journal:

    • Stop HSM
    • Reallocate a new journal data set
    • Restart HSM
    • Back up the CDSs.

If both a CDS and the journal are corrupt, then you probably can't do a clean recovery. You might need to use the AUDIT or FIXCDS commands to correct the records in the CDSs. AUDIT lets the system cross-check the various records concerning data sets and HSM resources. AUDIT can list errors and propose diagnostic actions or, at your option, complete most repairs itself.



Auditing a CDS

Consider using the AUDIT command for the following reasons:

  • After any CDS restore (highly recommended)
  • After an ARC184I message (error when reading/writing DFSMShsm CDS records)
  • Errors on the RECALL or DELETE of migrated data sets
  • Errors on BDELETE or RECOVER of backup data sets
  • DFSMShsm tape-selection problems
  • RACF mismatch messages
  • Power or hardware failure

You can use AUDIT to cross-check the following sources of control information:

  • MCDS or individual migration data set records
  • BCDS or individual backup data set records or ABARS records
  • OCDS or individual DFSMShsm-owned tapes
  • DFSMShsm-owned DASD volumes
  • Migration-volume records
  • Backup-volume records
  • Recoverable-volume records (from dump or incremental backup)
  • Contents of SDSP data sets

It is best to use AUDIT at times of low system activity, as some audit processes can run for quite some time. Be aware that an audit will lock out some HSM activity.

SMS-managed volumes cannot be audited with the AUDIT command, but the process for auditing SMS-managed data sets is similar to that for auditing non-SMS-managed data sets.

If AUDIT is executing and a backup of the HSM CDSs is started (via the BACKVOL CDS command or AUTOBACKUP), all HSM functions on the host that started the backup are halted until both the AUDIT function and the CDS backup have completed.



Diagnostic Patches

You can enable or disable dumps for diagnostics with the following patches.

Dump for Installation Exit Abends

PATCH .MCVT.+2D BITS(.......1)

Tracing OPEN/CLOSE/EOV problems

Tape - PATCH .MCVT.+F2 x'00' /* Activate */
PATCH .MCVT.+F2 x'FF' /* Deactivate */
DASD - PATCH .MCVT.+F3 x'00' /* Activate */
PATCH .MCVT.+F3 x'FF' /* Deactivate */

Causing a DUMP when DSS error occurs

PATCH .MCVT.+454'390' /* ADR390E */

Determining why SMS-managed datasets are not being processed

PATCH .MGCB.+26 x'FF' /* Space Mgt */
PATCH .BGCB.+24 x'FF' /* Backup */



HSM Hanging

If HSM appears to be hanging, check the following before resorting to cancelling HSM.

  • Across all LPARs, check whether there are any outstanding messages that require a response. If all your LPARs are in a sysplex, then the MVS command (which you can enter from SDSF)

    D R,R

    will display all outstanding prompts for all LPARs.

  • Check across all LPARs whether HSM is waiting on tape drives and if so prioritise making them available.
  • Check to see if there are any reserves or enqueues which are causing HSM to queue. How you do this will depend on what monitoring software you have installed onsite.
  • Check across all LPARs for active HSM tasks by issuing the following, where xxxxxx is your HSM started task name:

    F xxxxxx,Q ACT

    This checks whether HSM activity such as AUDIT or TAPECOPY on one LPAR is impacting on activity on another LPAR.

  • Issue F xxxxxx,HOLD ALL across all LPARs (or hold individual functions such as AUDIT or TAPECOPY) and wait a few minutes to see whether HSM activity returns to normal. This could take 10 minutes or longer.



Entering commands with MODIFY

If for any reason you cannot enter HSM commands from TSO, you can use the MVS MODIFY command from the system console. All MODIFY commands start with F followed by your started task name. The commands below assume the task is called HSM.

F HSM,SWAPLOG
F HSM,CANCEL USERID(userid)
F HSM,SETSYS etc
F HSM,CANCEL DATASETNAME(datasetname)
F HSM,CANCEL REQUEST(requestnumber)
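
If you script console input, the F taskname,command pattern above is easy to generate. The following minimal Python sketch builds MODIFY command strings for a given started task name; the function name modify_command is hypothetical, and the 1 to 8 character check is a simple sanity test rather than full validation of task names.

```python
def modify_command(task: str, hsm_command: str) -> str:
    """Build an MVS MODIFY command that passes hsm_command to the HSM task."""
    task = task.strip().upper()
    if not 1 <= len(task) <= 8:
        raise ValueError("started task names are 1 to 8 characters")
    # No blank after the comma: the command text follows it directly.
    return f"F {task},{hsm_command.strip()}"

# The task name HSM matches the examples above; substitute your own.
for command in ("QUERY ACTIVE", "HOLD ALL", "RELEASE HARDCOPY"):
    print(modify_command("hsm", command))
```

The sketch only formats text. How the resulting commands reach the console depends on your automation tooling.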



Using VSAM RLS for the Control Datasets

VSAM Record Level Sharing is explained in more detail in the VSAM RLS section. If you run multiple DFHSM tasks on multiple LPARs, then RLS can help DFHSM performance, because the individual tasks enqueue the CDS files at record level instead of cluster level, which allows concurrent reads and updates. VSAM RLS uses the sysplex Coupling Facility, and to use it the CDS files must be SMS managed.

To use RLS you need to work through the following tasks:

  • The control datasets must be SMS managed.
  • The control datasets cannot be split by key ranges. This technique was commonly used to cope with very large files, especially the MCDS, but is not supported for z/OS release 1.3 and above anyway. Alternative options are to use extended addressing or multi-cluster CDS files.
  • The CDS datasets must also be flagged as non-recoverable by using the IDCAMS ALTER command with the LOG(NONE) parameter.

If you are already running RLS, then you need to check that the cache structure is at least 1MB. Otherwise you need to set up RLS as explained in the link above.

Finally, you need to add CDSSHR=RLS to all the DFHSM catalogued procedures, then recycle them.



Using Large Format Disks for the Journal Files

Each HSM journal file must fit onto one disk and be in a single extent. This means that the files are limited in size by the available disk sizes, which meant no bigger than 8.5 GB if you were using Model 9 disks. This is not big enough for HSMplex systems where several HSM instances could be sharing the same journal files.

Consider using large-format Extended Address Volumes (EAVs), sometimes called 3390-A disks. These can contain up to 223 GB. The actual size of these volumes will vary depending on how they were created on the underlying hardware, but in general they are much bigger than standard volumes, and will resolve the issue of journal files being too small.
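
The two capacity figures quoted above can be reproduced from 3390 geometry. This minimal Python sketch assumes the standard 3390 values of 56,664 bytes per track and 15 tracks per cylinder, a Model 9 of 10,017 cylinders, and an EAV of 262,668 cylinders (the size that corresponds to the 223 GB figure); your volumes may be defined with different cylinder counts.

```python
BYTES_PER_TRACK = 56_664      # 3390 track capacity
TRACKS_PER_CYLINDER = 15      # 3390 geometry

def volume_gb(cylinders: int) -> float:
    """Return the raw capacity of a 3390-format volume in decimal GB."""
    return cylinders * TRACKS_PER_CYLINDER * BYTES_PER_TRACK / 1e9

# Cylinder counts are assumptions; check how your volumes were defined.
print(f"3390-9: {volume_gb(10_017):.1f} GB")        # 8.5 GB
print(f"EAV (3390-A): {volume_gb(262_668):.0f} GB")  # 223 GB
```

These are raw volume capacities. The usable journal size also depends on finding that much free space in a single extent.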
