# TSM Database and Recovery Log

## TSM Database version 6.1 and later

### Basic database structure

TSM 6.1 uses a DB2 relational database. This means it can be much bigger than the legacy database, does not need database audits and will automatically re-organise itself if required. Software database mirroring is no longer supported.
One thing to note is that you cannot have any other DB2 applications running on the same server as the one hosting your TSM database, though you can have multiple TSM instances on one host server. A TSM 'instance' is everything required to run a TSM server, including database, logs, storage pools etc.

Several of the commands we used to use to manage the database are different. Some of these are :-

| New Command | Effect | Old command |
|---|---|---|
| EXTEND DBSPACE | Increase the size of the database | DEFINE DBVOLUME then EXTEND DB |
| QUERY DBSPACE | Check the database size | QUERY DBVOLUME |
| SET DBRECOVERY | Used to define the database backup device class and must be run before the first backup. | DEFINE DBBACKUPTRIGGER |
| SET DBREPORTMODE | A new command used to decide how much diagnostic information to report. Options are NONE, FULL, PARTIAL. | None |

### Database sizing and tuning

The database does not consist of a collection of files or 'volumes' like the legacy database. Instead, the database can exist in up to 128 directories or 'containers' to use the correct DB2 term. The data is striped evenly across the directories and unlike the legacy DB, they do not require an initial format before they can be used. The Q DBSPACE output below shows a database striped over 3 containers.

| Location | Total Size of File System (MB) | Space Used (MB) | Free Space Available (MB) |
|---|---|---|---|
| /tsm/tivoli/dbdir001 | 102,144 | 6,919.22 | 98,224 |
| /tsm/tivoli/dbdir002 | 102,144 | 6,919.22 | 98,224 |
| /tsm/tivoli/dbdir003 | 102,144 | 6,919.22 | 98,224 |


When a TSM database is initially created on multiple file systems, the database is spread equally over all of them. However, if you add an extra file system to the database space using the 'extend dbs' command, DB2 will not rebalance the database to spread the data equally. This means that if some of the original file systems were 100% full, they will still be 100% full after the new file system is added, and this could cause the TSM server to stop.
If you are running TSM Server V6.2 or above you can rebalance the database dynamically using DB2 commands. I suggest that you look up the IBM technote about this, and also contact IBM for advice before trying this.

The database can be sized anywhere between 2.2GB and 1TB. The DB2 database will be between 35% and 50% bigger than the equivalent legacy database, partly because it holds sort space for SQL queries. The DB2 database is largely self tuning, so there is no requirement for DB2 tuning skills. A new parameter, DBMEMPERCENT, replaces the old BUFFPOOLSIZE. This set of buffers contains much more data than the old buffer so the recommendation is to set its size to unlimited. In fact, TSM/DB2 will try to change it to unlimited on startup.
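As a rough illustration of the 35% to 50% growth figure, the sketch below turns a legacy database size into a low and high estimate for the DB2 database. The function name and the 80GB input are made up for the example; only the two percentages come from the text above.

```python
def estimate_db2_size_gb(legacy_size_gb):
    """Return a (low, high) estimate in GB for the DB2 database.

    Uses the 35%-50% growth range quoted in the text; the real figure
    depends on the workload, so treat the result as a planning range.
    """
    if legacy_size_gb <= 0:
        raise ValueError("legacy_size_gb must be positive")
    return legacy_size_gb * 1.35, legacy_size_gb * 1.50


# Hypothetical 80GB legacy database
low, high = estimate_db2_size_gb(80)
print(f"Plan for roughly {low:.0f}GB to {high:.0f}GB")
```

Remember that the result still has to fit inside the 2.2GB to 1TB range that the database supports.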

Two other legacy features are not required now; database audits and indexed tables.
The database uses DB2 relational consistency rules to prevent incorrect data from entering, and is self auditing. The database will also run automatic 'runstats' from time to time. This is a DB2 feature that optimises access paths through the database to improve performance.
The database also uses relational indices, so it does not require special index tables to speed up SQL queries.

### Recovery log sizing and tuning

TSM 6.1 has three recovery logs.
The Active log contains updates that have not been committed to disk yet and is used for roll-forward or roll-back in case of problems. Once a transaction is committed, the data is moved to the archive log. The default size for the Active log is 2GB and the size can be increased in increments of 512MB right up to 128GB.
The Archive log contains committed transaction data and is used for PIT recovery of the database. The Archive log is cleared out by a full database backup. However it retains all data updates applied right back to the second last backup, so you need to size your archive log with that in mind.
The Failover Archive log is an optional overflow location that is used if the Archive log fills up.
TSM collectively calls these three logs the 'recovery log', but a DB2 DBA would just call them 'transaction logs'.

The log files form part of the TSM database, and unlike the legacy TSM database there is no need to create and format log volumes. The logmode is equivalent to legacy roll-forward. In DB2 terms, these are archive logs, not circular logs. This means that the log files can fill up, so log file management is still required. You can specify a failover log for the Archive log to help prevent this, but the Active log cannot failover and the size is fixed between 2GB and 128GB, so don't allocate all the space that you have available for the Active log, keep some in reserve for emergencies.
It is highly recommended that FailoverArchiveLog space be set aside for possible emergency use. You can use slower disks for FailoverArchiveLog space.
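The Active log limits quoted above (2GB minimum, 512MB increments, 128GB maximum) can be sketched as a small helper that rounds a requested size up to the next valid value. The function name is hypothetical, and expressing the size in megabytes is an assumption, so check the units that your server level expects before putting a value in dsmserv.opt.

```python
MIN_MB = 2 * 1024     # 2GB default and minimum, from the text
MAX_MB = 128 * 1024   # 128GB ceiling, from the text
STEP_MB = 512         # increment size, from the text


def round_active_log_mb(requested_mb):
    """Round a requested Active log size (MB) up to the next valid size."""
    if requested_mb > MAX_MB:
        raise ValueError("the Active log cannot be larger than 128GB")
    clamped = max(requested_mb, MIN_MB)
    # Ceiling division: number of 512MB steps needed above the minimum
    steps = -(-(clamped - MIN_MB) // STEP_MB)
    return MIN_MB + steps * STEP_MB


print(round_active_log_mb(3000))
```

Because the Active log cannot fail over, it is worth running the numbers like this before an emergency, so you know how much headroom you have left on the file system.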

If the Active log fills up and the server stops, the process to get your TSM server up again is:

DSMSERV DISPLAY LOG - to check the current log status
Update the Active log size parameter in dsmserv.opt
Start the server up

### Best practice for Database and Storage Pool disks

The following are some of the 'Best Practices' recommendations from IBM for setting up DB disk volumes for TSM Servers

Use fast, low latency disks for the Database, use SSD if you can afford it. Avoid the slower internal disks included by default in most AIX servers, and avoid consumer grade PATA/SATA disks. Use faster disks for the Active Logs too. Do not mix active logs with disks containing the DB, archive logs, or system files such as page or swap space. Slower disks for archive logs and failover archive logs can be used, if needed.

Use multiple database containers. For an average size DB, it is recommended to use at least 4 containers initially for the DB. Larger TSM servers, or TSM servers planning on using data deduplication, should have up to 8 containers or more. You should plan for growth with additional containers up front as adding containers later can result in an imbalance of IO and create hot spots.
Place each database container in a different filesystem. This improves performance; DB2 will stripe the database data across the various containers. Tivoli Storage Manager supports up to 128 containers for the DB. Ideally place each container on a different LUN, though this is not so important for high end disks like XIV or VMAX.

There should be a ratio of one database directory, array, or LUN for each inventory expiration process.

The block size for the DB varies depending on the tablespace, most are 16K, but a few are 32K. Segment/strip sizes on disk subsystems should be 64K or 128K.

If you use RAID, then define all your LUNs with the same size and type. Don't mix 4+1 RAID5 and 4+2 RAID6 together. RAID10 will outperform RAID5 for heavy write workloads, but costs twice as much. RAID1 is good for active logs.

Smaller capacity disks are better than larger ones if they have the same rotational speed. Have containers on disks that have the same capacity and IO characteristics. Don't mix 10K and 15K drives for the DB containers.

Cache subsystem "readahead" is good to use for the active logs; it helps in archiving them faster. Disk subsystems detect readahead on a LUN by LUN basis. If you have multiple reads going against a LUN, then this detection fails. So several smaller LUNs are better than a few large ones, but too many LUNs can be harder to manage.

However, it is very difficult to give generic rules about disk configuration as this very much depends on what type of disks you are using.

High end disk subsystems such as the EMC DMX, the HDS VSP and the IBM DS8000 have very large front end cache to speed up performance, and stripe data in a way that makes it difficult to separate data by physical spindle. The IBM XIV takes this virtualisation to a higher level again. To get the best performance from these devices you want enough LUNs to spread the IO and get readahead cache benefit, but not so many that they become difficult to manage. For the XIV, consider using a queue depth of 64 per HBA to get best advantage of the parallelism capabilities.

Don't stripe your data using logical volumes, let the hardware do the striping. As a rule of thumb, consider using 50GB volumes for DISK pools and 25GB volumes for file pools. Define the same number of volumes per LUN as there are data disks in the RAID array that makes up the LUN. For example, with 4+1 RAID5, define 4 x 50GB volumes per LUN; each LUN will then use 250GB of raw disk, with an effective capacity of 200GB.
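The volumes-per-LUN rule of thumb is simple arithmetic, and a minimal sketch of it is shown here. The function and parameter names are invented for the example; the 4+1 RAID5 and 50GB figures are the ones used in the text.

```python
def lun_layout(data_disks, parity_disks, volume_gb):
    """Return (volumes_per_lun, effective_gb, raw_gb) for one LUN.

    One storage pool volume is defined per data disk in the RAID array.
    """
    volumes_per_lun = data_disks
    effective_gb = data_disks * volume_gb
    raw_gb = (data_disks + parity_disks) * volume_gb
    return volumes_per_lun, effective_gb, raw_gb


# 4+1 RAID5 with 50GB DISK pool volumes
volumes, effective, raw = lun_layout(data_disks=4, parity_disks=1, volume_gb=50)
print(f"{volumes} volumes per LUN, {effective}GB effective, {raw}GB raw")
```

The same function covers 4+2 RAID6 by passing `parity_disks=2`, which gives 300GB raw for the same 200GB effective capacity.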

The Unix Tips section contains some detail on how to use VMSTAT and IOSTAT commands to investigate potential disk bottlenecks.

### Using DB2 commands on a TSM server

IBM's design model for TSM v6 and upwards is to store TSM metadata in a DB2 database, without the TSM administrators needing to know anything about DB2 and how to manage it. That design model holds well, but there are a few circumstances where a bit of DB2 comes in useful. On Windows, you start a DB2 command line from Start -> All Programs -> IBM DB2 -> Command Line Tools.

AUTHORISING A USERID TO BE ABLE TO START THE TSM SERVER

The TSM DB2 system is 'owned' by the userid that installed it, and normally only that userid has the administration authority needed to manage the DB2 database, including the ability to start the TSM service. However you can give access to another userid using DB2 commands. Open up a command line as the TSM instance owner by right clicking on it and taking the 'run as' option. You will need the instance owner userid and password to do this. Once you have the command line, type the following commands

 db2==>  connect to tsmdb1



Userid TSM_ADMIN can now be used to stop and start the TSM services

RECOVERING FROM A FULL ARCHIVE LOG

Under TSM 6.x, the archive and active log directories can fill up, and if they do, the server will shut down. To prevent this, you need to make sure you trigger a FULL database backup once the archive log hits a threshold, but if the worst happens and the log files do fill up, you need a recovery process.

If this happens then you cannot use TSM commands to move the logs into bigger directories, as you cannot start TSM. What you need to do is create temporary logs elsewhere, then prune the archive log using native DB2 commands. However, remember that the archive log will hold enough information to wind back through the last 2 full backups, so you need to run 2 full backups to clear it down.

1. Create a temporary directory large enough to hold your active logs. The dsmserv.opt file may contain the log sizes in the ACTIVELOGSIZE parameter, and if not, it will point to the physical log location.

2. Open a DB2 command line and run the commands below to switch the logs to a new location
 Set db2instance=SERVER1
db2start
db2 update db cfg for tsmdb1 using newlogpath path\to\new\logs
db2stop
db2start

3. 'Activate' the database with the following command to copy the log files. This command does not affect the original logs. It may take a while, and success is indicated when you see a command prompt again.

db2 activate db tsmdb1

4. Now you need to back the database up to clear the logs out, and you need to do this on disk, so identify or create a directory with enough space to take a database backup then run the following DB2 commands.
 db2stop
db2start
db2 backup db tsmdb1 to path\to\database\backup\directory

The archive logs will start pruning once you see the ‘Backup Successful’ message, but this could take a while to appear if your database is large. Make a note of the backup timestamp, which will look something like ‘The timestamp for this backup image is: 20120412130821’
5. Find some more space, and run another full DB2 database backup with the command.
 db2 backup db tsmdb1 to path\to\another\database\backup\directory

When this second backup completes, the archive log directory and original active log directory are empty of log files. Make a note of the backup timestamp again, let us call this one 20120412150425

6. Now you need to delete the first backup using these commands – note how you use the timestamp from step 4.
 db2stop
db2start
db2 connect to tsmdb1
db2 PRUNE HISTORY 20120412130821 WITH FORCE OPTION AND DELETE

7. Point DB2 back to the original, empty active log in the original location, you will get this from the ACTIVELOGDIRECTORY parameter in dsmserv.opt
 db2 UPDATE DATABASE CONFIG FOR TSMDB1 USING NEWLOGPATH path\to\activelogdir

8. Connect to the database again, and that will automatically start moving the active logs from the temporary location to original active log location, and again, this can take a while if the logs are big.
 db2 force application all
db2stop
db2start
db2 connect to tsmdb1

9. Now you need to start the TSM server up and run a good backup. You need to start the server in the foreground to do this, so open a normal Windows command line and navigate to the server directory and run dsmserv. If you have more than one TSM server on this machine, you may need to use the -k option to get the right server. This will bring you up a TSM server command line. Disable your client sessions then take 2 full database backups. You need to know your backup device classes to be able to do this.
 Disable sessions
Backup db type=full dev=your_db_devclass

10. Delete the second DB2 database backup as follows, using the database timestamp that you recorded in step 5.
 db2 PRUNE HISTORY 20120412150425 WITH FORCE OPTION AND DELETE

11. Now you can halt your server in the foreground, and start it normally. Remember to enable sessions.
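Steps 4 to 10 depend on noting down the two backup timestamps correctly, so it can help to pull them out of the saved DB2 output rather than retyping them. The sketch below is one way to do that, assuming you have captured the output of each `db2 backup` command as text; the function names are hypothetical, and the message wording and the PRUNE HISTORY command are the ones quoted in the steps above.

```python
import re

# Matches the message quoted in step 4, with or without a space before the colon
TIMESTAMP_RE = re.compile(r"timestamp for this backup image is\s*:\s*(\d{14})")


def backup_timestamp(db2_output):
    """Extract the 14-digit backup timestamp from captured db2 backup output."""
    match = TIMESTAMP_RE.search(db2_output)
    if match is None:
        raise ValueError("no backup timestamp found in the output")
    return match.group(1)


def prune_command(timestamp):
    """Build the PRUNE HISTORY command used in steps 6 and 10."""
    return f"db2 PRUNE HISTORY {timestamp} WITH FORCE OPTION AND DELETE"


captured = "Backup successful. The timestamp for this backup image is: 20120412130821"
print(prune_command(backup_timestamp(captured)))
```

The script only builds the command text. Running the prune is still a manual step, which is deliberate given that it deletes a backup.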

It is possible to query what is happening while a database recovery is in progress with the db2pd utility, a DB2 diagnostic tool that is provided with the TSM server installation code. You simply run this as a command from the shell prompt, like this:

 tsm:~ # db2pd
db2pd> You are running db2pd in interactive mode.
db2pd> If you want command line mode, rerun db2pd with valid options.
db2pd> Type -h or -help for help.
db2pd> Type q to quit.


To check out what is happening with a database recovery, run

 db2pd> -recovery -db tsmdb1


STARTING AND STOPPING AUTO RUNSTATS

Runstats is used to optimise access paths through the TSM tables and should normally be set to run automatically as required. However, if runstats starts automatically when the TSM server is started up after a database upgrade, it can cause performance problems to the extent that no-one can log into the system.
To temporarily suspend auto runstats, before halting the TSM server for an upgrade, submit the following commands to the DB2 instance that is associated with the TSM server:

db2 connect to tsmdb1
db2 update db cfg for TSMDB1 using AUTO_RUNSTATS OFF


Now runstats will not start automatically when you restart TSM server. However you need runstats to keep your database optimised, so once you are happy that your TSM server is up and running, submit the following commands to the DB2 instance for your TSM server and Runstats will resume normal processing.

db2 connect to tsmdb1
db2 update db cfg for TSMDB1 using AUTO_RUNSTATS ON


SOME OTHER POTENTIALLY USEFUL COMMANDS

You can enter any DB2 command from the DB2 command line, including SQL queries and commands that update or delete the database, so be careful. Some of the query commands could be useful for investigating TSM problems

• get instance -   returns the name of the TSM server
• list active databases -   will show the TSM database name as known to DB2, and the path to it.
• get dbm config -   shows the settings for the database manager configuration
• db2start and db2stop -   obvious what these are! They should never be necessary as DB2 should be started automatically as part of the server startup, but if necessary, it can be done manually
• The following commands require you to be logged in with administrator authority and connected to a database:

• get db cfg show detail -   shows the configuration parameters for the database
• list tables -   for the connected database
• describe table table_name -   lists the columns for the specified table

### Investigating Problems with the Server Instance

The first place to start is the TSM activity log, but if you need to go deeper, then the DB2 logs can be useful. However, finding those logs can be a challenge as the location can depend on the OS platform or even the OS release level. The best way to be sure you have the correct log is to check the DIAGPATH variable in DB2.
Start up a DB2 command line, in Windows go to Start->Programs->IBM DB2->Command Line tools->Command Window and in UNIX, su - db2inst1 (db2inst1 is the default instance, if you change the instance name or have multiple instances, you need to su to the correct userid for your instance). You then type 'db2' to open the DB2 command line
From the db2 command line type db2=> get dbm cfg. The command produces a lot of output; look for a line like "Diagnostic data directory path     (DIAGPATH) = /home/E1WT1/e1wt1/sqllib/db2dump", which shows the path to the log files. If the DIAGPATH is blank, look for the default PATH directory instead.
'quit' gets you out of that DB2 command line
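Because 'get dbm cfg' produces so much output, it can be easier to save it to a file and search it with a script. The following is a minimal sketch that assumes the output has been captured as text; the function name is hypothetical and the sample line is the one quoted above.

```python
def find_diagpath(dbm_cfg_output):
    """Return the DIAGPATH value from 'get dbm cfg' output, or None if blank."""
    for line in dbm_cfg_output.splitlines():
        if "(DIAGPATH)" in line:
            # Everything after the first '=' is the configured path
            _, _, value = line.partition("=")
            value = value.strip()
            return value or None
    return None


sample = " Diagnostic data directory path     (DIAGPATH) = /home/E1WT1/e1wt1/sqllib/db2dump"
print(find_diagpath(sample))
```

A return value of `None` corresponds to the blank DIAGPATH case described above, where you need to fall back to the default directory.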

The db2diag.log contains information like database backups, table reorganizations, memory management messages, start and stop of TSM server and hardware information logged at instance start time, as well as error and warning messages.

Sometimes when investigating TSM server problems, the DB2 terminology does not quite match TSM, so the error messages in the DB2 logs can look a bit strange. For example, Tivoli Storage Manager refers to “transactions”, which DB2 calls “units of work” (UOW). Tivoli Storage Manager talks about select statements, which DB2 calls SQL statements and sometimes refers to as DML, or data manipulation language, statements.

Another potentially useful file is the startup trace log dsmupgdx.trc, which is located in c:\program files\tivoli\tsm for Windows or /opt/tivoli/tsm/server for UNIX and Linux servers. If you get database startup problems it's always worth checking the file to see if any useful error messages exist.

Some TSM error messages are very generic and more investigation is needed to pinpoint the problem. An example is 'ANR8503E A failure occurred in writing to volume'. To investigate this you need to enable a TSM server trace. To do this, enter the following commands from the TSM server command line

    trace disable *
trace begin path/name/trace.out
redo the activity that created the error message
trace flush
trace end
trace disable *


Then open the trace file and look at the messages that were written out just before the error.

### ANR0170E on Database Startup

When trying to start up TSM the following error message can appear "ANR0170E - Error detected, database restart required", and you may see errors in the actlog a bit like

ANR0171I dbieval.c(874): Error detected on 3:2, database in evaluation mode.
ANR0170E dbieval.c(935): Error detected on 3:2, database restart required.
ANR0162W Supplemental database diagnostic information: -1:58031:-1034
([IBM][CLI Driver] SQL1034C The database is damaged. The application has been disconnected from the database. All applications processing the database have been stopped.

The resolution is to restart DB2 manually with the RESTART command. Open up a DB2 command line window as explained above then issue the following

 set db2instance=db2inst1 (this is the default instance)
db2 force application all
db2stop
db2 restart database db2inst1


You might have to run the restart command a few times before the issue is resolved. If this does not fix the problem you probably need to contact IBM Support, although you can use the db2dart command to run a database analysis. This generates a report file that would be useful for IBM support.

 db2 force application all
db2stop
db2dart db2inst1 /db


### AIX Maximum Number of Processes

You may see a database backup failing on an AIX server with an error like 'ANR2968E Database backup terminated. DB2 sqlcode: -2033. DB2 sqlerrmc: 292'. If this error is not corrected, the recovery log will fill up and crash the server. You may also see a message like 'Insufficient AIX system resource' in the db2diag log file.

The API error code 292 means that the API was unable to 'fork' or create a process to do its database backup. AIX has a parameter called maxuproc which limits the maximum number of processes that each user is allowed, and this value should be increased.

To see what value is set, use the command

lsattr -l sys0 -E | grep maxuproc


and to change the value use the command below, selecting a value that is suitable for your server.

chdev -l sys0 -a maxuproc='2048'


### Effect of Deduplication on Database size

Deduplication will save a lot of backend storage, but it does this at the expense of increasing the size of the TSM database because the TSM database has to store and track the metadata that is required to manage the deduplication. The exact amount of extra space required is difficult to calculate up front, as it depends on your average 'deduplication chunk size' and this will vary depending on how well your data deduplicates. IBM suggests a typical chunk size of 100,000 bytes, and provides some scripts that you can run to measure your exact average chunk size once you have deduplication working. Each chunk needs 490 bytes of metadata to describe the data in the primary pool, and another 190 bytes for the data in each copy pool.

A starting point is to estimate your database size without deduplication, and to do this you use the formula
db_size = file_count * number_of_backup_copies * 200
To give you an idea of how many backup files exist, you can find the number of backup files that you are holding with the following SQL query on an existing server

 select sum(cast(num_files as bigint)) from occupancy -
where node_name is not null -
and filespace_id is not null


To calculate the deduplication overhead, use the formula below to get the number of chunks
chunk_count = total_backedup_data_in_GB * 10,000 * 2
The doubling factor at the end of the formula is to cater for 'base deduplication chunks' that is, chunks that must remain even after a file is expired and deleted from TSM. The extra database overhead is then
chunk_count * (490 + 190 * extra_backup_copies)

Running this formula on an existing server with a 135GB database predicted an increase of 105GB with deduplication, which is not a trivial amount.
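The two formulas above are easy to wire into a small calculator. This is only a sketch of the arithmetic as given in the text: the function names and the 10,000GB example are made up, and treating the results as bytes is an assumption based on the per-chunk figures being quoted in bytes.

```python
CHUNKS_PER_GB = 10_000           # 1GB divided by the 100,000 byte typical chunk
BASE_CHUNK_FACTOR = 2            # doubling factor for 'base deduplication chunks'
PRIMARY_BYTES_PER_CHUNK = 490    # metadata per chunk in the primary pool
COPY_BYTES_PER_CHUNK = 190       # extra metadata per chunk for each copy pool
BYTES_PER_FILE_COPY = 200        # per-file figure from the db_size formula


def base_db_bytes(file_count, number_of_backup_copies):
    """Database size estimate without deduplication."""
    return file_count * number_of_backup_copies * BYTES_PER_FILE_COPY


def dedup_overhead_bytes(total_backedup_data_in_gb, extra_backup_copies):
    """Extra database space needed to track deduplication chunks."""
    chunk_count = total_backedup_data_in_gb * CHUNKS_PER_GB * BASE_CHUNK_FACTOR
    per_chunk = PRIMARY_BYTES_PER_CHUNK + COPY_BYTES_PER_CHUNK * extra_backup_copies
    return chunk_count * per_chunk


# Hypothetical server: 10,000GB of backed up data and one copy pool
overhead = dedup_overhead_bytes(10_000, 1)
print(f"Deduplication overhead: about {overhead / 1e9:.0f}GB")
```

Once deduplication is running, replace the 10,000 chunks per GB constant with the figure measured by the IBM scripts, as the typical chunk size is only a starting guess.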

## TSM Database version 5.5 and earlier

TSM 5.5 goes out of support at the end of April 2014

### Recovery log processing

The TSM database is quite sophisticated, and uses a transaction log, called the recovery log. Multiple updates are grouped together into 'transactions', which have to be applied to the database as a unit. If the server crashes, say, and the updates in a transaction have not been applied, then the partial updates must be backed out using the log records. This all or nothing approach protects database integrity during updates.

If the server cannot update the recovery log, because it is full, then the server crashes. So it's worth knowing what makes the log fill up, and how to avoid it.

The log has two pointers, a 'head' pointer and a 'tail'. The head marks the position where the next update can take place, new updates are added at the head. The tail marks the position where the oldest transaction is still processing, and also where the last update can take place. Tail movement depends on how the 'logmode' is set up. If you define logmode=rollforward, then the tail will only move when a database backup is run. If you use logmode=normal, then the tail moves when the oldest transaction completes. When the pointers reach the end of the file, they start again from the beginning. Consider the logfile as being a circle, with the head and tail pointers being points on the circumference. The command Q STATUS will tell you which logmode you are using.

The tail is then 'pinned' by the oldest in-flight transaction, and if this is not cleared before the head catches up, then the file is full. Tivoli provided a new command with TSM 4.2.2.1, 'show logpinned', which will identify the transaction which is pinning the log.
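The head and tail behaviour can be illustrated with a toy model. This is purely a teaching sketch and not how TSM implements its log: the class and method names are invented, and positions are kept as ever-increasing counters (the physical position on the circle would be the counter modulo the capacity).

```python
class CircularLog:
    """Toy model of a recovery log with a head and a pinned tail."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.head = 0  # where the next update is written
        self.tail = 0  # start of the oldest in-flight transaction

    def used(self):
        return self.head - self.tail

    def pct_util(self):
        return 100.0 * self.used() / self.capacity

    def write(self, records):
        """Add updates at the head; fail if the head would catch the tail."""
        if self.used() + records > self.capacity:
            raise RuntimeError("log full: head has caught up with the pinned tail")
        self.head += records

    def release(self, new_tail):
        """The oldest transaction completed, so the tail can move forward."""
        if not self.tail <= new_tail <= self.head:
            raise ValueError("tail can only move forward, up to the head")
        self.tail = new_tail


log = CircularLog(capacity=100)
log.write(60)    # a long running transaction starts and pins the tail at 0
log.release(50)  # it finally completes and the tail moves up
log.write(80)    # a fast writer moves the head rapidly
print(f"{log.pct_util():.0f}% used")
```

If the `release` call is removed, the second `write` fails, which is the pinned tail scenario described above.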

The log file usually fills up due to a combination of two events. An old transaction hangs around and 'pins' the tail, while another process is causing the head to move rapidly, so it catches up.
Long running transactions can be caused by very large database backups, or smaller backups running over slow lines. A process which is trying to recover from a tape I/O error can also hang around for a long time.
Rapid head movement is caused by something which is doing large quantities of database updates, very fast. Expire Inventory is a good example of this. There are ways to manage this:

• Don't schedule inventory expiration when large backups are running
• Make the log almost as large as possible, which is about 13GB at the moment. But, leave a bit of free space so you can extend the log if the server crashes.
• Consider clearing out your log before the backups start, by temporarily reducing the dbbackup trigger. UPDATE DBB LOGF=20 should force a backup. However, remember that if you are running with logmode=rollforward, and the tail is pinned, then the database backup will not clear out the log.
• Consider running with a smaller value of dbbackuptrigger during the backup run, to help prevent the log from filling. However, this can cause lots of backups to be triggered, so use with caution.
• Monitor the log utilisation, and alert support staff if the log exceeds, say, 80%. The support staff then need to look for a process which is holding the tail, and cancel it, or look for a process which is rapidly filling up the log and cancel that. Or, to be on the safe side, cancel them both.
• TXNGroupmax (maximum number of files sent to the server in a single transaction) and TXNBytelimit (total number of bytes in a single transaction) are usually set high to speed up backup performance. If you are getting problems with your log filling up, consider reducing these to force more frequent commit points.

Recovery log processing was enhanced in TSM 4.2. If the DB Backup Trigger is set correctly, and the LOGMODE is ROLLFORWARD, then a database backup will start when the log reaches 100% full. If the Recovery log hits 100%, then TSM will stop all processes except the database backup. When the backup completes, TSM issues the message

ANR2104I Activity log process restarted - recovered from an insufficient space condition in the Log or Database.

This should help us avoid some difficult recovery situations.

### Database Defragmentation

This contentious issue applied to legacy databases only. The legacy TSM Server database has a b-tree format, and grows sequentially from the beginning to end. When file entries are deleted by expiration processing or file space/volume delete processing, this leaves spaces or holes in that database. These may be re-used later when new information is added, but they mean that the TSM database is using more space than it needs to. The only way you can compress the database so that the 'holes' left by deleted pages are not present is to use the database unload/reload utility.

The problem is that while the dump takes about an hour, the load utility can take several hours. Does it make a difference? I have seen performance improve after defragmenting a database, and I've also seen an unload/reload make performance worse. A defrag will reduce the physical size of your database.

The Tivoli team supplied a new command with TSM 5.3, called 'ESTIMATE DBREORGSTATS', for you to check what an unload/reload would achieve. This will estimate the amount of space that would be recovered by an unload/reload.

For older releases of TSM use the QUERY DB to see if you need to defrag your TSM DB.


| Available Space (MB) | Assigned Capacity (MB) | Maximum Extension (MB) | Maximum Reduction (MB) | Page Size (bytes) | Total Usable Pages | Used Pages | Pct Util | Max. Pct Util |
|---|---|---|---|---|---|---|---|---|
| 50,208 | 49,004 | 1,204 | 9,412 | 4,096 | 12,545,024 | 9,544,454 | 76.1 | 76.1 |



Here a 49GB database can be reduced by 9.4GB, which is 19%. However, it is only 76% used, so 24% is free, and the 5% difference is space that could be reclaimed by defragging. Some people claim that TSM tries to allocate pages in a way that leaves you with as good as possible performance, and defragging the database will degrade performance. It's also possible that after a defrag, the database will quickly become fragmented again, as it inserts data into the tree. The following formula can be used to see how much space could be reclaimed by an unload/reload.

SELECT CAST((100 - (CAST(MAX_REDUCTION_MB AS FLOAT) * 256 ) /
(CAST(USABLE_PAGES AS FLOAT) - CAST(USED_PAGES AS FLOAT) ) * 100) AS
DECIMAL(4,2)) AS PERCENT_FRAG FROM DB


A high PERCENT_FRAG value can indicate problems. If you think your database needs a defrag, then if possible, take a copy and try that first. That will give you an indication of how much time is needed for the load.
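To see what the SELECT statement is doing, the same arithmetic can be run by hand against the QUERY DB figures shown earlier. The sketch below mirrors the SQL term by term; the function name is made up, and the factor of 256 is the number of 4,096 byte pages in a megabyte.

```python
PAGES_PER_MB = 256  # 1MB divided by the 4,096 byte page size


def percent_frag(max_reduction_mb, usable_pages, used_pages):
    """Percentage of the free pages that cannot be released by a reduction."""
    free_pages = usable_pages - used_pages
    if free_pages <= 0:
        raise ValueError("the database has no free pages")
    reclaimable_pages = max_reduction_mb * PAGES_PER_MB
    return 100 - reclaimable_pages / free_pages * 100


# Figures from the QUERY DB output above
frag = percent_frag(max_reduction_mb=9412, usable_pages=12545024, used_pages=9544454)
print(f"PERCENT_FRAG = {frag:.2f}")
```

For the sample output this works out at a little under 20, meaning roughly a fifth of the free pages are scattered through the database rather than sitting at the end where a reduction could release them.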

### Extending the TSM database under AIX

Create a new file system in AIX using SMITTY

 	make LV
make FS on existing LV
mount new-filesystem


THEN in TSM, define the new volume with DEFINE DBVOLUME and extend the database with EXTEND DB.




If you use incremental database backups, then remember that after an EXTEND DB the next DB backup must be a full backup.

### Formatting the TSM database and log

Legacy TSM database files and log files have to be formatted before they can be used. There are two different commands for this, and it is vitally important that you know the difference. If you want to add a file to the database or recovery log, then you use the DSMFMT command to format the file. The DSMSERV FORMAT command looks similar, but it will format the whole recovery log and database. So just to make things clear, DSMSERV FORMAT will wipe all your existing database and log files, so if you want to make a complete fresh start, that's what you use. DSMFMT will just format the file that you specify. The syntax of DSMFMT is

 dsmfmt -m -log tsmlogvol7 5


Which will format a 5MB log volume called tsmlogvol7. Size options are 'k', 'm' or 'g' and data type options are 'db', 'log' or 'data'.

### Auditing the TSM database

The Audit process only applies to legacy TSM databases.

Richard Sims has correctly pointed out that a database audit with FIX=YES is a dangerous procedure. "Correcting database problems without TSM Support direction can result in worse problems, including data loss. Structural problems and inconsistencies in any database system can be much more complex than a vanilla utility can properly deal with. If one has a reason to believe that their TSM database has problems, they need to contact TSM Support for assistance in dealing with them, rather than attempt amateur surgery. IBM repeatedly advises customers NOT to attempt to fix database problems themselves".
I'd also suggest that if you run an audit, you always make sure you have a full database backup available first.

Database Audits are used to fix inconsistency problems between the database and its storage components. A full database audit can run for several hours, but it is possible to run smaller audits on parts of the database. As a general rule of thumb, a full database audit takes about 3 hours per million pages, and a 4 GB utilised database holds about a million pages. The actual times will mostly depend on the processing power of your server. An audit will write a lot of log records, so if you normally run with your recovery log in 'ROLL FORWARD' mode it is advisable to put the log into 'NORMAL' mode before running an audit, then put it back into 'ROLL FORWARD' mode when the audit completes.

/dsmserv auditdb fix=yes admin detail=yes


Is a very quick check of the admin data

/dsmserv auditdb fix=yes archstorage detail=yes


will audit the archive storage, and runs for 1-2 hours depending on your database size

/dsmserv auditdb fix=yes diskstorage detail=yes


will audit the disk storage pools, and takes about 30 mins, depending on the size of the database, and how full the disk pools are. Best done when all the data is migrated out to tape.

/dsmserv auditdb fix=yes inventory detail=yes


This is the long running one, 8-12 hours.

The following information was supplied by Maureen O'Connor of Fiserv Technology in April 2007. Maureen has provided some excellent detail on how to estimate how long an audit will take, and how to run audits against multiple TSM servers on one AIX server.

Running an audit of the TSM database can be a very long and time-consuming process, and it is not well documented by IBM, so estimations can be difficult to make. Generally speaking, the best way to run the audit is to run it against the whole database, not just a piece of it, but if the db is very large, this can mean an extensive outage, so it should be planned well in advance.

The audit follows 33 steps:

ANR4726I The ICC support module has been loaded.
ANR0990I Server restart-recovery in progress.
ANR0200I Recovery log assigned capacity is 1000 megabytes.
ANR0201I Database assigned capacity is 2500 megabytes.
ANR0306I Recovery log volume mount in progress.
ANR0353I Recovery log analysis pass in progress.
ANR0354I Recovery log redo pass in progress.
ANR0355I Recovery log undo pass in progress.
ANR0352I Transaction recovery complete.
ANR4140I AUDITDB: Database audit process started.
ANR4075I AUDITDB: Auditing policy definitions.
ANR4040I AUDITDB: Auditing client node and administrator definitions.
ANR4135I AUDITDB: Auditing central scheduler definitions.
ANR3470I AUDITDB: Auditing enterprise configuration definitions.
ANR2833I AUDITDB: Auditing license definitions.
ANR4136I AUDITDB: Auditing server inventory.
ANR4138I AUDITDB: Auditing inventory backup objects.
ANR4137I AUDITDB: Auditing inventory file spaces.
ANR2761I AUDITDB: auditing inventory virtual file space mappings.
ANR4307I AUDITDB: Auditing inventory external space-managed objects.
ANR4310I AUDITDB: Auditing inventory space-managed objects.
ANR4139I AUDITDB: Auditing inventory archive objects.
ANR4230I AUDITDB: Auditing data storage definitions.
ANR4264I AUDITDB: Auditing file information.
ANR4265I AUDITDB: Auditing disk file information.
ANR4266I AUDITDB: Auditing sequential file information.
ANR4256I AUDITDB: Auditing data storage definitions for disk volumes.
ANR4263I AUDITDB: Auditing data storage definitions for sequential volumes.
ANR6646I AUDITDB: Auditing disaster recovery manager definitions.
ANR4210I AUDITDB: Auditing physical volume repository definitions.
ANR4446I AUDITDB: Auditing address definitions.
ANR4141I AUDITDB: Database audit process completed.
ANR4134I AUDITDB: Processed 187 entries in database tables and 255998 blocks in bit vectors. Elapsed time is 0:00:10.

Each step is called based on the architecture; the DSMSERV utility runs several concurrently, 5-10 at a time, returning output as each step completes and picking up the next step in order.

Steps 1-9 will finish almost immediately. Steps 10-16 will run next, and will take a slightly longer time, these follow definitions in order of creation. When Step 17 begins, it will trigger Step 33, and depending on how many entries there are in the database, the output from 33 will appear mixed with the output from Steps 18-32. Step 33 is reviewing all client data in the database, this is the longest running step in the audit process.

Typical output from Step 33 (from a large database) will look like this:

ANR4134I AUDITDB: Processed 8260728 entries in database tables and 0 blocks in bit vectors. Elapsed time is 1:05:00.
ANR4134I AUDITDB: Processed 9035641 entries in database tables and 0 blocks in bit vectors. Elapsed time is 1:10:00.
ANR4134I AUDITDB: Processed 9812999 entries in database tables and 0 blocks in bit vectors. Elapsed time is 1:15:00.
ANR4134I AUDITDB: Processed 10663992 entries in database tables and 0 blocks in bit vectors. Elapsed time is 1:20:00.
ANR4134I AUDITDB: Processed 11677212 entries in database tables and 0 blocks in bit vectors. Elapsed time is 1:25:00.
ANR4134I AUDITDB: Processed 12014759 entries in database tables and 0 blocks in bit vectors. Elapsed time is 1:30:00.

Note this output refers to 'entries'. Entries are not a standard reference in TSM, this is a parsed view of data files, part of the occupancy.

To estimate how many entries will be scrolled through the audit, run this formula on a command line within TSM:

select sum(num_files)*3 from occupancy

The '3' refers to the three pieces to a file: the entry, a header for the entry, and an active/inactive flag. Remember that this is only an estimate, the reason for running the audit is possible corruption, there may be pieces missing or mis-filed.

Entries are read anywhere from 500K to 1 million every five minutes, so based on the output from this formula, this is how to estimate the time for the audit to complete.

Audits can be run on pieces of the database instead of the whole - a specific storage pool or the administrative portion - this can be a considerable time-saver, but if it is unknown what part of the database is corrupt, this may not be a worthwhile option.

To run an audit, the TSM server instance must be down.

If there are multiple TSM instances on a server, the DSMSERV executable must be in the primary server directory, but if the audit is running on a secondary instance, for example, parameters must be passed to operating system so the utility will know where it is looking for the database:

AIX# export DSMSERV_DIR=/usr/tivoli/tsm/server/bin
AIX# export DSMSERV_CONFIG=/usr/tivoli/tsm/server/bin//dsmserv.opt

To run an audit just on the administrative portion of the database (the fastest, 10-15 minutes), start the utility this way:

AIX# at now
dsmserv auditdb fix=yes admin detail=yes > /tmp/tsmauditadmin.log
[ctl-D]

The process will run in the background, and a log will be kept; this log can be run with the tail -f command by multiple users to track the progress.

To run the audit on the archived data (1-2 hours, depending on size of archives), enter this:

dsmserv auditdb fix=yes archstorage detail=yes >/tmp/tsmauditarchive.log

To run the audit on the diskpool (very fast if all data is migrated), enter this:

dsmserv auditdb fix=yes diskstorage detail=yes > /tmp/tsmauditdisk.log

To run on the client data only, not including the archives (still the longest running), enter this:

dsmserv auditdb fix=yes inventory detail=yes > /tmp/tsmauditdata.log

Again, running on the inventory, while it can be run separately, it is almost a moot point.

If any data is found to be damaged, location messages as well as the fix (usually a deletion) will output to the log as follows:

ANR1777I afaudit.c(967: Object 0.62882489 is \WINDOWS\INF\DSUP.PNF for node (257), filespace \\\c$ (1).
ANR1777I afaudit.c(967: Object 0.62882490 is \WINDOWS\INF\DSUPT.PNF for node (257), filespace \\\c$ (1).
ANR1777I afaudit.c(967: Object 0.62882491 is \WINDOWS\INF\DVD.PNF for node (257), filespace \\\c$ (1).
ANR4303E AUDITDB: Inventory references for object(0.62882489) deleted.
ANR4303E AUDITDB: Inventory references for object(0.62882490) deleted.
ANR4303E AUDITDB: Inventory references for object(0.62882491) deleted.

Be sure sufficient outage time is scheduled. Once an audit begins, it is not good practice to halt the process, because the current location of the audit is not truly known - a data file could be open, and halting may actually cause further corruption.
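
Maureen's estimate can be worked through as a short calculation. The Python below is a sketch under the assumptions stated in her text: entries are roughly sum(num_files) * 3, and the audit reads between 500,000 and 1,000,000 entries every five minutes. The function names are illustrative only, and the file count you pass in is whatever the select against the occupancy table returns before the multiplication by 3.

```python
# Estimate how long the inventory step of an audit might run.
# Assumptions taken from the text above:
#   - entries ~= sum(num_files) * 3
#   - between 500,000 and 1,000,000 entries are read every five minutes
PIECES_PER_FILE = 3
SLOW_ENTRIES_PER_5_MIN = 500_000
FAST_ENTRIES_PER_5_MIN = 1_000_000


def estimate_entries(total_num_files):
    """Approximate number of entries the audit will scroll through."""
    return total_num_files * PIECES_PER_FILE


def estimate_audit_minutes(total_num_files):
    """Return (best_case_minutes, worst_case_minutes) for the audit."""
    entries = estimate_entries(total_num_files)
    best = entries / FAST_ENTRIES_PER_5_MIN * 5
    worst = entries / SLOW_ENTRIES_PER_5_MIN * 5
    return best, worst


if __name__ == "__main__":
    # Example: 4 million files recorded in the occupancy table.
    best, worst = estimate_audit_minutes(4_000_000)
    print(f"Expect roughly {best / 60:.1f} to {worst / 60:.1f} hours")
```

Because a corrupt database may have pieces missing or mis-filed, the real entry count can differ from the estimate, so it is sensible to plan the outage around the worst-case figure.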

### Legacy Database size and disk setup

The TSM database is critical to open systems backup and recovery. It needs to be 100% available as without it, it is impossible to recover files. The 'incremental forever' philosophy behind TSM means that it is impossible to build a list of files needed to recover a server without the TSM database. If the TSM database setup is not designed correctly then the database will perform badly and this will affect your ability to fit backups within the overnight window.

TSM performance is very much dependent on the size of the database. TSM performance suffers if a database becomes too large, but there are no exact rules on how big too large is. The maximum possible size for a TSM database is 530GB. IBM recommend 120 GB as a general rule, with the caveat that 'when expiration, database restores, and other Tivoli Storage Manager admin processes take too long and client restores become too slow, it is too big'. Database backup and Expire Inventory are both CPU intensive processes that can be used to indicate server performance problems in general. The only sensible answer to 'how big should a TSM database be?' is to let your database grow until these processes start to become an issue. Expire Inventory should really run within 12 hours and should be processing 3 million pages an hour or more. Backups should run in 30 minutes and process 6 million pages per hour or more, but these are just general rules-of-thumb. The actual size will depend on how fast your hosting server is, how good your disks are and what level of service you need to provide.
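
As an illustration of how those rules of thumb might be applied, the Python sketch below checks a measured run against them. The thresholds are the figures quoted above (12 hours and 3 million pages per hour for Expire Inventory, 30 minutes and 6 million pages per hour for a database backup); the function names and the idea of returning a list of warnings are assumptions made for this example, not part of TSM.

```python
# Compare a measured process run against the rules of thumb in the text.
# The thresholds below are the figures quoted in this section.
EXPIRE_MAX_HOURS = 12.0
EXPIRE_MIN_PAGES_PER_HOUR = 3_000_000
BACKUP_MAX_HOURS = 0.5
BACKUP_MIN_PAGES_PER_HOUR = 6_000_000


def pages_per_hour(pages, elapsed_hours):
    """Return the processing rate in pages per hour."""
    if elapsed_hours <= 0:
        raise ValueError("elapsed_hours must be greater than zero")
    return pages / elapsed_hours


def check_process(pages, elapsed_hours, max_hours, min_rate):
    """Return a list of warnings; an empty list means the run looks fine."""
    warnings = []
    if elapsed_hours > max_hours:
        warnings.append(f"ran {elapsed_hours} h, guideline is {max_hours} h")
    rate = pages_per_hour(pages, elapsed_hours)
    if rate < min_rate:
        warnings.append(f"rate {rate:,.0f} pages/h, guideline is {min_rate:,}")
    return warnings


if __name__ == "__main__":
    # Example: Expire Inventory examined 30 million pages in 15 hours.
    for message in check_process(30_000_000, 15,
                                 EXPIRE_MAX_HOURS, EXPIRE_MIN_PAGES_PER_HOUR):
        print("Expire Inventory:", message)
```

If either check starts producing warnings on a regular basis, that is the signal described above that the database may be getting too big for the server that hosts it.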

A TSM Database consists of a number of files, called 'disks'. As TSM will schedule one concurrent operation for each database disk it makes sense to allocate a lot of small disks, rather than a few large ones. A disk file size of 2 GB seems to be about right (The maximum possible size for a disk volume is 8 TB). IBM recommends that these database disk files be spread over as many physical disks as possible. This makes sense for low or mid tier disk subsystems, as this means that multiple disk heads can be seeking, reading, and writing simultaneously, but as high tier subsystems perform most of their I/O in cache this is less of an issue.

Most operating systems allow you to stripe files over logical and physical disks, or partitions, and recommend that this be used for large performance critical files. It is very difficult to get any kind of consensus from the TSM user community on the benefits of disk striping. For example to quote two users:-
USERA; 250GB! database on a high tier EMC DMX disk subsystem. Disk striping introduced and database backup reduced by more than half.
USERB; 80GB database striped on a mid-tier IBM FASTT subsystem striping removed and database converted to RAID5. No impact on database backup times, expire inventory run times or client backup times.

TSM will allocate a default database and log file during an AIX install, usually in the server installation directory /usr/tivoli/tsm/server/bin. These default files should be deleted and re-allocated to your strategic size and location.

### Recovering a 5.x database on a Windows Server

The basic steps you need to take to recover a legacy database are:
Prepare the files you need to do a restore
Format the database and logs
Restore the database
Sort out any storage pool issues

FILE PREPARATION

Obviously, you need a good backup of a TSM database, and you need to know the device class that was used for the backup. For illustration, we will assume the latest database backup is on a tape called T012456L and used a devclass called T_LTO3.
You also need a list of the database and log file names and sizes. If you use DRM then the best place to get this from is the latest prepare. If you don't use DRM, you can get this info when your TSM server is running with the commands
query dbvol f=d
query logvol f=d
That's all very well if you have a planned outage, but what if your database crashes and you don't have a prepare? You can still get the info with dsmserv commands, use
dsmserv display dbvolumes
dsmserv display logvolumes
Create two text files, one called DB.VOLS that contains the Database file names, paths and sizes and one called LOG.VOLS for the log files. The files should look like this, but use your own file names, paths and sizes. The file sizes are in MB, so this is a 20GB database.

DB.VOLS
"H:\TSMDB\EXT01\DB01.DSM"  5000
"H:\TSMDB\EXT02\DB02.DSM"  5000
"H:\TSMDB\EXT03\DB03.DSM"  5000
"H:\TSMDB\EXT04\DB04.DSM"  5000

LOG.VOLS
"H:\TSMRLOG\RLOG01.DSM"  4096
"H:\TSMRLOG\RLOG02.DSM"  4096


Place these files in your c:\program files\tivoli\tsm\server\ directory

FORMAT THE DATABASE

Navigate to the c:\program files\tivoli\tsm\server\ directory and run the following command. Note that the logs are described first, then the database, and that you need to say how many of each type of file you are formatting, so 2 log volumes and 4 database volumes.

 DSMSERV FORMAT 2 FILE:LOG.VOLS 4 FILE:DB.VOLS
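
The counts on the DSMSERV FORMAT command have to match the number of entries in each list file, which is easy to get wrong by hand. The Python sketch below shows one way to derive them. It assumes that each list file contains only the quoted path and size lines shown above (the DB.VOLS and LOG.VOLS headings being labels for the file names, not file content), and the function names are invented for this example.

```python
# Derive the DSMSERV FORMAT command from the two volume list files.
# Assumption: each non-blank line is '"<path>"  <size in MB>'.


def parse_volumes(text):
    """Return a list of (path, size_mb) tuples from a volume list."""
    volumes = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # The size is the last whitespace-separated field on the line.
        path, size_mb = line.rsplit(None, 1)
        volumes.append((path.strip('"'), int(size_mb)))
    return volumes


def build_format_command(log_text, db_text,
                         log_file="LOG.VOLS", db_file="DB.VOLS"):
    """Logs are described first, then the database, each with a count."""
    log_count = len(parse_volumes(log_text))
    db_count = len(parse_volumes(db_text))
    return (f"DSMSERV FORMAT {log_count} FILE:{log_file} "
            f"{db_count} FILE:{db_file}")


if __name__ == "__main__":
    db_text = "\n".join([
        r'"H:\TSMDB\EXT01\DB01.DSM"  5000',
        r'"H:\TSMDB\EXT02\DB02.DSM"  5000',
        r'"H:\TSMDB\EXT03\DB03.DSM"  5000',
        r'"H:\TSMDB\EXT04\DB04.DSM"  5000',
    ])
    log_text = "\n".join([
        r'"H:\TSMRLOG\RLOG01.DSM"  4096',
        r'"H:\TSMRLOG\RLOG02.DSM"  4096',
    ])
    total_mb = sum(size for _, size in parse_volumes(db_text))
    print(f"Database size: {total_mb} MB")
    print(build_format_command(log_text, db_text))
```

The script only builds the command text; it does not run it, since DSMSERV FORMAT wipes the existing database and logs and should be started deliberately.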


When you run a DSMSERV FORMAT on a Windows server, it resets the registry entry for the TSM server, and this must be put back before you attempt the restore. Use REGEDIT and navigate to the correct registry entry for your server. If you just have one TSM server on this Windows box, it will be Server1, otherwise Server2-4 depending on which server you are working with. The Server1 key is HKEY_LOCAL_MACHINE\SOFTWARE\IBM\ADSM\CurrentVersion\Server\Server1 and you need to change the path entry from c:\program files\tivoli\TSM\Server to Server1.

RESTORE THE DATABASE

Navigate to the c:\program files\tivoli\tsm\server\ directory and run the following command. The tape name and devclass are the ones we found before we started the restore, you substitute your own names.

 DSMSERV RESTORE DB VOLUMENAMES=T012456L DEVCLASS=T_LTO3 commit=yes

CLEAN UP

OK, now you have a copy of your TSM database as it was when it was backed up. Your problem now is that you may have data on your disk storage pools that is not recorded in your database, or your database will think data exists on disk that has been moved off. The database has also lost all record of any tape activity that has happened since the backup, so you need to get these two in step again.
Before you start TSM up, go into the dsmserv.opt file and add the lines
NOMIGRRECL
DISABLESCHED YES
EXPINT 0

These three options will prevent migration and reclamation, client schedules and expire inventory from running. Now start your server in the foreground and run the command DISABLE SESSIONS to stop clients from contacting the server.

Audit your disk storage pools using AUDIT Volume volume_name FIX=YES, and that will hopefully fix any problems, but you may need to delete and redefine your disk volumes, discarding faulty data, to get migration to run clean.
Audit your tape library, and that will let TSM know the current location of all tapes.
Check your latest saved volhist file for any tapes that have been deleted or updated since the backup ran. You will need to audit these tapes too. Once you complete the audits, back out the changes you made to the dsmserv.opt file, halt the TSM server, then start it normally and enable sessions again.
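
Adding the three temporary options and backing them out again afterwards is a step that is easy to forget. The Python below is a minimal sketch of that bookkeeping, working on the text of an options file. The function names are invented for this example, and the sample file content in the demo (COMMMETHOD and TCPPORT lines) is illustrative only. It also assumes the original file did not already contain any of the three lines, because the back-out simply removes them.

```python
# Add and later remove the temporary recovery options in dsmserv.opt text.
# Assumption: the original file does not already contain these lines, so
# removing them restores the file to how it was.
RECOVERY_OPTIONS = ("NOMIGRRECL", "DISABLESCHED YES", "EXPINT 0")


def add_recovery_options(opt_text):
    """Append any of the three options that are not already present."""
    lines = opt_text.splitlines()
    present = {line.strip().upper() for line in lines}
    for option in RECOVERY_OPTIONS:
        if option not in present:
            lines.append(option)
    return "\n".join(lines) + "\n"


def remove_recovery_options(opt_text):
    """Back out the temporary options once the audits are complete."""
    kept = [line for line in opt_text.splitlines()
            if line.strip().upper() not in RECOVERY_OPTIONS]
    return "\n".join(kept) + "\n"


if __name__ == "__main__":
    # Illustrative options file content, not taken from a real server.
    original = "COMMMETHOD TCPIP\nTCPPORT 1500\n"
    edited = add_recovery_options(original)
    print(edited)
    # After the audits, the back-out should give the original text again.
    print(remove_recovery_options(edited) == original)
```

Keeping a copy of the original dsmserv.opt before editing is a simpler safeguard, and is worth doing even if you script the change.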

### Database and log Mirroring

There are three levels of mirroring: hardware controlled, operating system controlled and TSM controlled.

Mirroring protects the database from disk failure, and also subsystem or site failure if the mirroring is between subsystems or sites. Mirroring also offers some protection from system failure as the chance that at least one of the mirror writes was successful is much higher. TSM mirroring can detect if a partial write has occurred, and a mirror volume can then be used to construct valid images of the missing pages. TSM mirroring can complement hardware mirroring. It is best to mirror both the database and the recovery log to optimise availability and recoverability.

If you are using automatic database or logfile expansion with mirroring, then this will place both the primary file and the mirrored file in the same directory, as only one directory path can be specified. This means that the primary file and mirrored file could end up on the same disk, so they will need to be separated.

This sounds obvious, but the mirrors need to be on different disks. It is possible to place them on the same disk, but that would be pretty pointless. It is also possible to mirror three ways as well as two ways. With three-way mirroring you get three copies of the data.

Hardware mirroring (RAID1)
Most disk subsystems support RAID1 mirroring, which is expensive as it needs twice as much disk, and will not detect logical errors in the data. All data is mirrored even if it is corrupt.
Operating System Mirroring
IBM state that disk striping is suitable for large sequential-access files that need high performance. AIX supports RAID0, RAID1 and RAID10. RAID0 is not really a good idea, as if a logical volume was spread over five physical volumes, then the logical volume is five times more likely to be affected by a disk crash. If one disk crashes, the whole file is lost. RAID1 is straight disk mirroring with two stripes and requires twice as much disk. RAID10 combines striping and mirroring, and also uses twice as much disk.
If AIX is mirroring raw logical volumes it is possible for it to overwrite some TSM control information, as they both write to the same user area on a disk. The impact would be that TSM would be unable to vary volumes online.
TSM mirroring
Software mirroring just applies to the legacy database. If TSM is managing the mirror and it detects corrupt data during a write, it will not write the corrupt data to the second copy. TSM can then use the good copy to fix the corrupt mirror. TSM also mirrors at transaction level, and hardware at IO level. Hardware will always mirror every IO, but TSM will only mirror complete transactions. This also protects the mirror from corruption.