Reliability

A batch job can fail for a variety of reasons: bad input data, program errors, hardware errors and space problems are just some. The real issue is that batch jobs are usually linked together, so if one fails, the rest wait until it is fixed. The odd failure here and there is not too much of an issue, but if your failure rate is 5% or more, then you probably have an underlying problem. One way to investigate this is to use the process below.

  • Analyse the data
    If you record job failures in a problem management system, then try running reports against the database. If not, then you may be reduced to logging failures as they happen in your favorite data repository: an Excel spreadsheet, an Access database, or a piece of paper.
  • Look for trends
    Apart from the simple "do we get a lot of the same abend code" analysis, look for time patterns too. Do you get the same failures at month end or at weekends? If you work on credit card systems, do they get fragile on public holidays?
  • Determine root causes
    So now you've got a trend. Why is it happening? If your credit card systems go wild on public holidays, it's almost certainly because the masses are out spending loads of money. Before you go off on your own holiday -
  • Apply fixes
    To continue with the credit card example, you don't really want to discourage people from using their cards, so make sure you provide enough spare resources on public holidays to cope.

Another example might be cartridge failures which trace back to data written on one particular cartridge drive. Get your maintainer in to fix it!
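
The "analyse the data" and "look for trends" steps above can be sketched in a few lines of Python. This is only an illustration, not a prescribed tool: the job log, the job names and the function names are all made up for the example, and the 5% threshold is simply the rule of thumb quoted earlier. It assumes each run is recorded as a date, a job name and an abend code (or nothing if the job ran clean).

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical job log: (run date, job name, abend code or None if the job ran clean).
# In practice this would come from your problem management system or spreadsheet.
JOB_RUNS = [
    (date(2024, 1, 30), "CARDPOST", None),
    (date(2024, 1, 31), "CARDPOST", "S0C7"),
    (date(2024, 2, 3), "CARDPOST", None),
    (date(2024, 2, 15), "STMTGEN", None),
    (date(2024, 2, 29), "CARDPOST", "S0C7"),
    (date(2024, 3, 9), "STMTGEN", "SB37"),
    (date(2024, 3, 12), "STMTGEN", None),
    (date(2024, 3, 31), "CARDPOST", "S0C7"),
]

FAILURE_THRESHOLD = 0.05  # the 5% rule of thumb from the text


def failure_rate(runs):
    """Return the fraction of runs that ended with an abend code."""
    if not runs:
        return 0.0
    failed = sum(1 for _, _, code in runs if code is not None)
    return failed / len(runs)


def failures_by_code(runs):
    """Count failures per abend code: the 'same abend code' analysis."""
    return Counter(code for _, _, code in runs if code is not None)


def is_month_end(day):
    """True if the day is the last day of its month."""
    return (day + timedelta(days=1)).month != day.month


def classify(day):
    """Label a date with the time pattern it belongs to (month end wins over weekend)."""
    if is_month_end(day):
        return "month end"
    if day.weekday() >= 5:  # Saturday is 5, Sunday is 6
        return "weekend"
    return "ordinary day"


def failures_by_period(runs):
    """Count failures per time pattern: the 'look for time patterns' analysis."""
    return Counter(classify(day) for day, _, code in runs if code is not None)


if __name__ == "__main__":
    rate = failure_rate(JOB_RUNS)
    print(f"Failure rate: {rate:.0%}")
    if rate >= FAILURE_THRESHOLD:
        print("Above the 5% threshold, worth investigating")
    print("By abend code:", failures_by_code(JOB_RUNS).most_common())
    print("By period:", failures_by_period(JOB_RUNS).most_common())
```

A public holiday check would need a holiday calendar for your own country, so it is left out here. The same idea extends to any other field you record, such as the tape drive or the submitting system.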

This all sounds like simple common sense, but there's little point spending a fortune tuning a batch run if all you do is speed it on its way to the next failure.
