There can be several reasons why batch jobs run one after another when they could be running in parallel, and running them in parallel saves elapsed time. The first thing that you need to do is to analyse your batch to see if there are any opportunities to improve parallelism. A typical batch system, even in a medium sized shop, can easily run to several thousand jobs per night. To investigate running more work in parallel, you need someone with an intimate knowledge of how the various applications hang together and how they share their data. This task is normally assigned to a Batch Analyst, but a less skilled person can understand complex batch systems with the aid of a picture or two.
Tools such as ASG-Workload Planner (originally Beta44 from Beta Systems) will build up a picture of your schedule and draw it on your PC or printer. These tools will plot out the critical path through the schedule, and can make it easier to spot bottlenecks. It is a good idea to start your tuning exercise with the jobs on the critical path.
This is a nice theoretical statement, but in practice batch jobs have variable run times so the critical path can change from night to night. The key here is to be pragmatic when analysing a schedule and not try to understand it in too much detail.
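To illustrate how the critical path can move from night to night, here is a minimal Python sketch (Python 3.9 or later). It is not how any scheduling product works internally; it simply finds the longest chain of dependent jobs. The job names, dependencies and run times are all made up for the example.

```python
from graphlib import TopologicalSorter

def critical_path(durations, predecessors):
    """Return (total_minutes, [jobs]) for the longest chain of dependent jobs."""
    finish = {}  # earliest possible finish time of each job, in minutes
    via = {}     # the predecessor that determines when each job can start
    # static_order() yields every job after all of its predecessors
    for job in TopologicalSorter(predecessors).static_order():
        preds = predecessors.get(job, ())
        latest = max(preds, key=lambda p: finish[p], default=None)
        start = finish[latest] if latest is not None else 0
        finish[job] = start + durations[job]
        via[job] = latest
    # Walk back from the job that finishes last
    job = max(finish, key=finish.get)
    total, path = finish[job], []
    while job is not None:
        path.append(job)
        job = via[job]
    return total, path[::-1]

# Hypothetical schedule: JOBB and JOBC both wait for JOBA, JOBD waits for both
predecessors = {"JOBB": ["JOBA"], "JOBC": ["JOBA"], "JOBD": ["JOBB", "JOBC"]}
monday = {"JOBA": 10, "JOBB": 40, "JOBC": 25, "JOBD": 15}
tuesday = {"JOBA": 10, "JOBB": 20, "JOBC": 45, "JOBD": 15}

print(critical_path(monday, predecessors))   # path runs through JOBB
print(critical_path(tuesday, predecessors))  # path runs through JOBC
```

With these invented run times the critical path switches from JOBB to JOBC between the two nights, which is why it pays to look at several nights of data before deciding which jobs to tune.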
If you want to analyse a small application, then it is possible to do this using an Excel spreadsheet. First you need to get hold of the start and end times for the jobs in your application. You can either get these from SMF type 30 records (subtype 1 for job start and subtype 5 for job end), or just extract messages 'IEF403I' (job start) and 'IEF404I' (job end) from the SYSLOG. If your batch suite spans midnight, then you need to specify both day and time in the spreadsheet, otherwise Excel would treat a job starting at 01:00 as running before one that started at 23:00.
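If you prefer a script to a spreadsheet, the same idea can be sketched in Python. This is a rough example only: it assumes you have already extracted the IEF403I and IEF404I messages into lines that begin with a full date and time, which is a hypothetical layout and not the native SYSLOG record format. Keeping the date with the time is what makes jobs that run past midnight sort correctly.

```python
import re
from datetime import datetime

# Assumed line layout: "YYYY-MM-DD HH:MM:SS IEF403I jobname - STARTED ..."
LINE = re.compile(
    r"^(?P<stamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<msg>IEF40[34]I)\s+(?P<job>\S+)\s+-\s+(?:STARTED|ENDED)"
)

def job_times(lines):
    """Return {jobname: (start, end)} built from IEF403I/IEF404I lines."""
    starts, times = {}, {}
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue  # ignore anything that is not a start or end message
        when = datetime.strptime(m["stamp"], "%Y-%m-%d %H:%M:%S")
        if m["msg"] == "IEF403I":
            starts[m["job"]] = when
        elif m["job"] in starts:
            # If a job name runs more than once, the last run is kept
            times[m["job"]] = (starts.pop(m["job"]), when)
    return times

# Made-up sample data for a suite that spans midnight
sample = [
    "2024-01-15 23:00:05 IEF403I JOBA - STARTED - TIME=23.00.05",
    "2024-01-16 00:45:05 IEF404I JOBA - ENDED - TIME=00.45.05",
    "2024-01-16 01:00:10 IEF403I JOBB - STARTED - TIME=01.00.10",
    "2024-01-16 01:20:10 IEF404I JOBB - ENDED - TIME=01.20.10",
]

times = job_times(sample)
in_start_order = sorted(times, key=lambda job: times[job][0])
elapsed_minutes = {job: (end - start).total_seconds() / 60
                   for job, (start, end) in times.items()}
print(in_start_order, elapsed_minutes)
```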
There is not much point in tuning jobs that have a very short run time, say 5 minutes or less. A good way to reduce complexity is to exclude these jobs from your Excel chart, or from whatever other method you choose to analyse the data.
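As a rough illustration, the short-job filter takes only a few lines of Python, and the same start and end times can also be used to list quiet periods when nothing at all was running. The job names, times and the 10 minute gap threshold below are assumptions made for the example.

```python
from datetime import datetime, timedelta

def long_running(jobs, minimum=timedelta(minutes=5)):
    """Keep only the jobs that ran for longer than the minimum."""
    return {name: (s, e) for name, (s, e) in jobs.items() if e - s > minimum}

def idle_gaps(jobs, minimum=timedelta(minutes=10)):
    """Return (gap_start, gap_end) periods when no job was running."""
    gaps = []
    busy_until = None
    for start, end in sorted(jobs.values()):
        if busy_until is not None and start - busy_until >= minimum:
            gaps.append((busy_until, start))
        busy_until = end if busy_until is None else max(busy_until, end)
    return gaps

# Made-up run times; dates are included so the run can span midnight
jobs = {
    "JOBA": (datetime(2024, 1, 15, 23, 0), datetime(2024, 1, 15, 23, 40)),
    "JOBB": (datetime(2024, 1, 15, 23, 10), datetime(2024, 1, 15, 23, 13)),
    "JOBC": (datetime(2024, 1, 16, 0, 30), datetime(2024, 1, 16, 1, 15)),
}

worth_tuning = long_running(jobs)  # JOBB ran for 3 minutes and is dropped
gaps = idle_gaps(jobs)             # one gap, from 23:40 to 00:30
print(sorted(worth_tuning), gaps)
```

The gap check is run against all jobs, not just the long ones, so that a period filled with short jobs is not reported as idle.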
You especially need to watch for breaks in the schedule where little or nothing is running. Jobs can be waiting for a start time, a dataset or hardware resources. If you see a delay, find the reason for it so that you can take the appropriate action. For instance:
If JOBB is waiting for a dataset that is used by JOBA, then this may be caused by JOBA coding DISP=OLD on the dataset, which gives it exclusive use. This problem is worse if several jobs want to share a dataset and all of them have DISP=OLD coded. If the dataset is a VSAM file, then consider using VSAM RLS, as this allows the file to be shared between tasks, with updates managed by locking at record level instead of file level. Other options for sharing files are Hyperbatch and Batch Pipes. Multi-step jobs can also be an issue where long-running steps must process sequentially. Consider breaking these jobs up into more jobs with fewer steps, and using Batch Pipes to manage the data flow.
Traditionally, initiators were used to 'throttle back' the flow of work to manage CPU usage. For example, if production batch normally ran in class 'C' initiators, then maybe just 10 class 'C' initiators would be defined to deliberately constrain the workload so that only 10 jobs could run simultaneously. Today, everyone should be using WLM managed initiators to manage the number of initiators based on workload, rather than setting hard limits. Consider having a set of WLM policies designed to favor batch that you can switch to for overnight processing, then switch back to your daytime WLM policies when batch processing completes.
Are the jobs waiting for backups? Almost all database backups are 'online' these days, which means that they can safely run alongside updating batch jobs and updating online users. The only proviso is that you should try to run them at a quiet time, as they could have a small impact on performance. Once a backup completes, the data can be recovered by restoring from the backup, and then applying the logs to add in any updates made since the backup started.
Another issue is where several image copies exist in a batch run, backing up the same database. The extra backups should not be necessary as it should be possible to recover a database to any point in time, by restoring the database from the daily image copy, then rolling forward with the log files.
If your applications include interim backups, then these are prime candidates for FlashCopy or its variants. This is explained in the FlashCopy section. FlashCopy allows application work to run in parallel with backups, which saves dedicated backup time.
Jobs should not really be waiting for tape drives anymore. If you can afford it, buy a VTS tape subsystem as this allows you to define hundreds of logical tape drives to a z/OS system.
Another way to delay your batch is to have lots of critical files archived off by DFHSM or FDRABR. The jobs then hang and wait while the datasets are recalled. This is a particular problem with jobs that only run weekly, monthly or quarterly. You could schedule a 'pre-batch' job whose sole function is to open then close these files, so they get recalled before the batch run starts, or use a product to do this, such as OpenTech Systems’ HSM/Advanced Recall or DTS Software’s MON-PRECALL.
If you use IEFBR14 to delete unwanted files with a dataset disposition of (MOD,DELETE), and a file has been migrated, this will recall the dataset and then delete it, which is a total waste of resource. You could use the z/OS ALLOCxx parmlib option IEFBR14_DELMIGDS(NORECALL) to convert the HSM recall to an HDELETE, or alternatively just use an IDCAMS DELETE command instead, as that does not recall files.