If your TSM performance is not too hot, there could be lots of reasons why. Here's a list of some of them.
size of files being backed up
number of files being backed up
rate that files can be read from disk
concurrent read datastreams from the same disk
rate that client can send data
network topology between client/server
rate that server can receive data
concurrent data streams into the server
tape drive speed (streaming, start/stop)
bus speed to tape drive
concurrent data streams for multiple tape drives sharing a single bus
compressibility of data
I/O capability of tsm server
cpu speed of tsm server
anti-virus software can slow backups down
There is no easy way to work out exactly what is causing your problem. A good starting point is to find out if the problem is with TSM, with the hardware, or with the network. FTP a big file from the affected client to the TSM server disk, and see how long it takes. FTP will always be a bit faster than TSM, as it has no database overhead. However, if the FTP times are slow, the problem is probably outside TSM. A common network problem is a mixed full duplex/half duplex environment.
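A rough way to put a number on the FTP test is to time the transfer and convert it to MB per second. The sketch below uses Python's standard ftplib module. The host, login and file names are placeholders for your own details, and the two helper names are made up for illustration.

```python
import os
import time
from ftplib import FTP


def throughput_mb_per_sec(num_bytes, seconds):
    """Convert a byte count and an elapsed time into MB per second."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return (num_bytes / 1048576) / seconds


def ftp_upload_rate(host, user, password, local_path, remote_name):
    """Upload one file with FTP and return the observed MB/sec.

    All arguments are placeholders; nothing here is TSM specific.
    """
    size = os.path.getsize(local_path)
    with FTP(host) as ftp, open(local_path, "rb") as fh:
        ftp.login(user, password)
        start = time.perf_counter()
        ftp.storbinary("STOR " + remote_name, fh)  # binary-mode upload
        elapsed = time.perf_counter() - start
    return throughput_mb_per_sec(size, elapsed)
```

Run the upload a few times with a file big enough to take at least a minute, so that connection setup time does not distort the figure.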
If the backup throughput from a Windows server suddenly dies, or if you start to back up a new server and it runs very slowly, a very common cause is that the network speed and duplex settings are wrong. They should be set to 100 Mbps full duplex, not AUTO. On a Windows 2000 server, go into
Start
Settings
Network and dialup connections
Click on the LAN symbol
Click on Properties
Click on Configure
Open the Advanced window
Page down to the 'Speed and Duplex' panel
Check the value
To check this on an NT server, go into
Start
Settings
Network and dialup connections
Hit the 'Configure' button
Take the 'Advanced' window
Select the 'Speed and Duplex' option
Check that the setting in the drop-down window is set to 100 Mbps full duplex
You can focus in on a problem by adding a trace parameter to your dsm.opt file. The parameter is
traceflags instr_client_detail
In addition to the summary information you usually get after a backup, you'll find something like this:
--------------------------------------------------------------------
Final Detailed Instrumentation statistics
Elapsed time: 1502.420 sec
Section          Total Time(sec)  Average Time(msec)  Frequency used
--------------------------------------------------------------------
Client Setup              15.081             15081.0               1
Process Dirs             331.908               191.4            1734
Solve Tree                 0.000                 0.0               0
Compute                    1.021                 0.0           76446
Transaction               43.535                 0.2          244347
BeginTxn Verb              0.000                 0.0             269
File I/O                 733.274                 8.2           89169
Compression                0.000                 0.0               0
Encryption                 0.000                 0.0               0
Delta                      0.000                 0.0               0
Data Verb                474.198                 6.2           76446
Confirm Verb               0.251                16.7              15
EndTxn Verb              195.729               727.6             269
Client Cleanup             1.612              1612.0               1
--------------------------------------------------------------------
You need to bounce DSMCAD to start, stop or change trace parameters.
You can use this information to get an idea of where most of the time is going. If just some of your clients are performing badly, then compare traces between good and bad clients.
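If you have a lot of traces to compare, it can help to pull the numbers out with a script and rank the sections by total time. The following is a minimal sketch that assumes the statistics lines look like the ones in the trace shown earlier; the function names are made up for illustration, and the sample lines are copied from that trace.

```python
def parse_instrumentation(lines):
    """Return {section: (total_sec, avg_msec, frequency)} from trace lines."""
    stats = {}
    for line in lines:
        parts = line.rsplit(None, 3)  # section names may contain spaces
        if len(parts) != 4:
            continue
        name, total, avg, freq = parts
        try:
            stats[name.strip()] = (float(total), float(avg), int(freq))
        except ValueError:
            continue  # header, separator or other non-data line
    return stats


def busiest_sections(stats, top=3):
    """Sort sections by total time, largest first."""
    ranked = sorted(stats.items(), key=lambda item: item[1][0], reverse=True)
    return [(name, values[0]) for name, values in ranked[:top]]


sample = [
    "Section Total Time(sec) Average Time(msec) Frequency used",
    "Process Dirs 331.908 191.4 1734",
    "File I/O 733.274 8.2 89169",
    "Data Verb 474.198 6.2 76446",
    "EndTxn Verb 195.729 727.6 269",
]
for name, seconds in busiest_sections(parse_instrumentation(sample)):
    print(f"{name:<15}{seconds:>10.3f}")
```

Running the same ranking against a good client and a bad client makes it easier to see which section has grown.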
Efficient use of buffers can speed up TSM backups and restores by up to 30%. In this context, buffering means both preloading data into central storage, and grouping data together into larger chunks for efficient transfer. Run a 'q db f=d' command to view the cache hit percentage on your server during high activity. If it is below 98 percent or so, you need to investigate improving your buffering.
The TSM parameters which control buffering are -
SELFTUNEBUFPSIZE, BUFPOOLSIZE These two parameters are complementary. You can set the size of your buffer pool yourself, using BUFPOOLSIZE, or you can let the system decide what to use by setting SELFTUNEBUFPSIZE to 'YES'. These parameters determine how much memory the TSM server uses for its database buffer pool. The recommendation is to use about 10% of physical memory, with a target of keeping the cache hit ratio at 98% or higher. In general, the recommendation is to use SELFTUNEBUFPSIZE.
Beware of setting BUFPOOLSIZE too high, as that can cause TSM to hold so much memory for its own use that the Operating System doesn't have enough. This will result in high paging, and poor TSM performance.
Use the command q db format=detail to see what your database cache hit rate is. If the system is paging, then the results of this command will be misleading. If the cache hit rate is too low, try raising the value of BUFPOOLSIZE, 1MB at a time. You have to restart the server for the values to take effect. Check that the increase has not caused paging, then repeat the q db command to see if the cache hit rate has risen.
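The tuning rule above can be written down as a small sketch. It assumes BUFPOOLSIZE is expressed in kilobytes, and the decision to back off by one step when paging starts is my own reading of the advice rather than an IBM rule. The function names are made up for illustration, and you still have to read the cache hit percentage and the paging state yourself.

```python
def suggested_bufpoolsize_kb(physical_memory_mb, fraction=0.10):
    """Starting value: about 10% of physical memory, expressed in KB."""
    return int(physical_memory_mb * 1024 * fraction)


def next_bufpoolsize_kb(current_kb, cache_hit_pct, paging,
                        target_pct=98.0, step_kb=1024):
    """Apply one round of the tuning rule and return the new size.

    Raise by 1 MB while the hit rate is under target; back off by one
    step (an assumption) if the system has started paging.
    """
    if paging:
        return max(current_kb - step_kb, step_kb)
    if cache_hit_pct < target_pct:
        return current_kb + step_kb
    return current_kb
```

For example, a server with 2048 MB of memory gets a starting value of 209715 KB, and a hit rate of 95% with no paging suggests going up by another 1024 KB.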
TCPNOdelay Set this to YES
USELARGEBUFFERS The default setting is USELARGEBUFFERS YES; make sure it's set on both the server and the clients.
DISKBUFFSIZE, LARGECOMMBUFFERS LARGECOMMBUFFERS is a client parameter which seems to be interchangeable with USELARGEBUFFERS; it should be set to YES. If only life were so simple. For the large buffers to take effect, every single link in your network must also be configured for large buffers. If you have fast ethernet, then make sure you explicitly configure the speeds on the switch ports rather than setting them to autodetect, to prevent transfer size mismatches.
LARGECOMMBUFFERS=YES was replaced by DISKBUFFSIZE=nn in TSM 5.3. DISKBUFFSIZE=32 seems to work well for Windows clients at least.
TXNGROUPMAX
TXNBYTELIMIT This pair of parameters is used to batch up small file transfers, so the transfer overhead on an individual file is shared out. The default sizes are quite small. You should try experimenting with larger sizes, to find the optimum for your configuration. IBM recommends setting TXNGROUPMAX to 256 and TXNBYTELIMIT to 2048 when the primary storage pool is on disk. For tape, TXNBYTELIMIT=2097152 seems to work well for LTO, DLT and 9940 drives, while 25600 seems best for 9840 and 3590 devices.
If you increase TXNGROUPMAX and TXNBYTELIMIT, keep an eye on your recovery logs, as they will need more space. If you find that performance actually gets worse, it is possible that this is due to faults on your network, which are causing a lot of retries. Retries will take longer with bigger data chunks, which can totally offset the benefits of lower transport overheads.
TCPWindowsize
TCPBUFFSIZE The setting depends on your TSM server platform. 63 is best for Windows servers, and 64 for UNIX servers. If a Windows 2000 server is communicating with Windows 2000 clients only, then the TCPWindowsize parameter can be larger, as Win2k supports TCP window scaling. Try a value of 512 for TCPBUFFSIZE; this seems to work well for Win2k clients.
RESOURCEUTILIZATION is a flag which you set in the client options file, which enables multiple backup streams. The resources are the number of control sessions (sessions that figure out what to back up) and the number of transfer sessions (sessions that actually back up or archive the data). If you set RESOURCEUTILIZATION to 8 on a client, then it will not necessarily use 4 concurrent data transfer sessions and 4 control sessions. RESOURCEUTILIZATION just provides a guideline for the number of resources the client should use. The number of concurrent sessions you get will be based on the real-time performance characteristics of the client, and the value of RESOURCEUTILIZATION. The higher the RESOURCEUTILIZATION value, the more producer/consumer sessions the client may use, but if the system is starved for other resources, or the number of files to process does not warrant it, then a larger number of sessions may not be used, even with a large RESOURCEUTILIZATION value.
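If the producer/consumer idea is unfamiliar, the toy model below may help. It is not how the TSM client is implemented; it just shows one producer thread (playing the control session that finds work) feeding several consumer threads (playing the transfer sessions) through a queue. All names are made up for illustration.

```python
import queue
import threading


def backup_with_sessions(files, consumers=2):
    """One producer finds work, several consumers 'transfer' it."""
    work = queue.Queue()
    done = []
    lock = threading.Lock()

    def producer():
        for name in files:          # stands in for scanning the filesystem
            work.put(name)
        for _ in range(consumers):  # one stop marker per consumer
            work.put(None)

    def consumer():
        while True:
            name = work.get()
            if name is None:
                break
            with lock:              # stands in for sending the file
                done.append(name)

    threads = [threading.Thread(target=producer)]
    threads += [threading.Thread(target=consumer) for _ in range(consumers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done
```

With only a handful of files, extra consumers have nothing to do, which is the same reason a high RESOURCEUTILIZATION value does not always produce more sessions.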
Few sites can afford the luxury of keeping all their TSM backup data on disk (anyone out there?). However, if you're recovering a Netware server, or even a large directory structure, then the restore goes a lot faster if the directories are held in a separate disk storage pool.
Set up a disk storage pool for directories, and allocate a management class which sends the directories to it. This disk storage pool should not require a lot of space, since directories are typically very small.
Then specify option DIRMc directorymgmtclassname.
EXPinterval This parameter specifies the interval between automatic expiration runs for backup and archive files. This process is very CPU intensive, and needs to run at a quiet time. It's best to set EXPinterval to 0, and run expiration from an admin schedule.
LOGPOOLSIZE
This parameter determines the size of the recovery log buffer pool. If the buffer pool is not big enough, transactions will wait while recovery records are written to the log. You can see if this is a problem by using the command q log format=detail. The command will show the wait percentage, which ideally should be 0. If it's not 0, try increasing the LOGPOOLSIZE parameter, but make sure it does not affect overall system memory usage. A LOGPOOLSIZE of about 4096 is about standard.
TSM Server caching is designed to optimize restore times but sites have experienced slow migration times with caching active. If you are having problems with migration, consider turning caching off, but be aware that this could affect restore speeds.
The following two SQL queries are based on an IBM white paper and are intended to help you decide if your TSM server disks need tuning. The basic idea is to look at how fast your database backups and expire inventory are going, and if they are below 'normal' figures then you might have disk issues.
Database Backups
Run the following SQL query on your server. The query is just shown as one long line so you can cut and paste it without having to remove end-of-line markers.
select activity, ((bytes/1048576)/cast ((end_time-start_time) seconds as decimal(18,13))*3600) "MB/Hr" from summary where activity='FULL_DBBACKUP' and days(end_time) - days(start_time)=0
output looks something like
IBM states that if the backup is processing less than about 28 GB per hour, then this might indicate a disk problem and further investigation is advised.
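The arithmetic in the query is simple enough to check by hand. The sketch below repeats it in Python and compares the result with the 28 GB per hour figure; it assumes 1 GB is 1024 MB, and the function names are made up for illustration.

```python
def mb_per_hour(num_bytes, elapsed_seconds):
    """Same arithmetic as the SQL: bytes to MB, then scale to one hour."""
    return (num_bytes / 1048576) / elapsed_seconds * 3600


def db_backup_looks_slow(num_bytes, elapsed_seconds, threshold_gb_per_hour=28):
    """True if the rate is under the roughly 28 GB/hour figure quoted above."""
    threshold_mb = threshold_gb_per_hour * 1024  # assumes 1 GB = 1024 MB
    return mb_per_hour(num_bytes, elapsed_seconds) < threshold_mb
```

A 10 GB database backup that takes a full hour works out at 10240 MB/Hr, well under the 28672 MB/Hr threshold, so it would be flagged for a closer look.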
Another possible indication is expire inventory processing. Try the following SQL query
select activity, cast((end_time) as date) as "Date", (examined/cast ((end_time-start_time) seconds as decimal(24,2))*3600) "Objects Examined Up/Hr" from summary where activity='EXPIRATION'
output looks something like
It is difficult to say what is acceptable with this query, as so many factors can affect the throughput. If the throughput drops suddenly, then this may indicate disk problems. The output above clearly indicates a potential problem after Feb 02.
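Spotting a sudden drop by eye gets tedious over a long history, so here is a minimal sketch that flags any run whose rate falls below half of the previous run's rate. The 50% cut-off is an arbitrary assumption, not an IBM figure, and the function name is made up for illustration.

```python
def sudden_drops(rates, drop_fraction=0.5):
    """Return (label, rate) pairs where throughput fell below
    drop_fraction of the previous run's rate.

    rates is a list of (label, objects_per_hour) in date order.
    """
    flagged = []
    for (_, previous), (label, current) in zip(rates, rates[1:]):
        if previous > 0 and current < previous * drop_fraction:
            flagged.append((label, current))
    return flagged
```

Feed it the date and "Objects Examined Up/Hr" columns from the query, oldest first. Only the first bad run is flagged; later runs that stay slow are not reported again.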
Try to spread your database and log volumes across SCSI controllers.
Use several small volumes for disk pools rather than a small number of large volumes. Sessions lock volumes so more volumes means more simultaneous sessions.
Consider defining more TSM servers, to split your database. The hardware configuration dictates the size of the database that you can support. If your database backup is taking more than a couple of hours, then either you need a bigger TSM server, or two servers. Typically, an H70 RS/6000 will support a 70GB database, but that is too big for an R40.
The options which affect tape-to-tape copy most are movebatchsize, movesizethresh and bufpoolsize. Bufpoolsize is explained above.
Movebatchsize and Movesizethresh determine how many files are grouped together and moved in a single transaction. Movebatchsize is the number of files which will be grouped, movesizethresh is the cumulative size of all the files that will be moved. Files are batched up until one of these thresholds is reached, then the files are sent. The default for movebatchsize is 40, but consider setting this to 1000 (the maximum), and set movesizethresh to 500. However, if the numbers are set high, then you will need more space in the recovery log. If you change the settings, keep an eye on the log for a while, and make sure it is not getting too full.
These parameters, and TXNGROUPMAX, can be dynamically changed by TSM, if SELFTUNETXNsize is set to YES.
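The batching rule described above can be sketched in a few lines. This is only a model of the "whichever threshold is reached first" behaviour, not TSM's own code; it assumes movesizethresh is in megabytes, and the function name is made up for illustration.

```python
def batch_files(file_sizes_mb, movebatchsize=1000, movesizethresh=500):
    """Group files until either the file count or the cumulative size
    threshold is reached, then start a new batch."""
    batches, current, current_mb = [], [], 0
    for size in file_sizes_mb:
        current.append(size)
        current_mb += size
        if len(current) >= movebatchsize or current_mb >= movesizethresh:
            batches.append(current)
            current, current_mb = [], 0
    if current:  # whatever is left over forms the last batch
        batches.append(current)
    return batches
```

With lots of tiny files the count limit is what closes each batch, and with large files it is the size limit, which is why both values are worth raising together.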
There is a new (TSM 4.2) parameter, TAPEIOBUFS, which can speed up access to 3590 tapes on AIX servers. The default value is 1, and it can be set up to 9. I have no experience to make a recommendation on this one.