TSM Performance Tuning

General performance considerations

If your TSM performance is not too hot, there could be lots of reasons why. Here's a list of some of them.

  • size of files being backed up
  • number of files being backed up
  • rate that files can be read from disk
  • concurrent read datastreams from the same disk
  • rate that client can send data
  • network topology between client/server
  • rate that server can receive data
  • concurrent data streams into the server
  • tape drive speed (streaming, start/stop)
  • bus speed to tape drive
  • concurrent data streams for multiple tape drives sharing a single bus
  • compressibility of data
  • I/O capability of tsm server
  • cpu speed of tsm server
  • anti-virus software can slow backups down

There is no easy way to work out exactly what is causing your problem. A good starting point is to find out whether the problem is with TSM, the hardware or the network. FTP a big file from the affected client to the TSM server disk and see how long it takes. FTP will always be a bit faster than TSM, as it has no database overhead. However, if the FTP times are slow, the problem is probably outside TSM. A common network problem is a mixed full-duplex/half-duplex environment.
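To turn the FTP test into a number you can compare between clients, divide the file size by the elapsed time. The sketch below is only an illustration: the function name and the file size and timings are made-up values, not measurements from a real system.

```python
def throughput_mb_per_sec(size_bytes, elapsed_sec):
    """Return the transfer rate in MB/s, taking 1 MB as 1048576 bytes."""
    if elapsed_sec <= 0:
        raise ValueError("elapsed time must be positive")
    return size_bytes / 1048576 / elapsed_sec


# Hypothetical figures: the same 2 GB file sent by FTP and backed up by TSM.
TEST_FILE_BYTES = 2 * 1024**3
ftp_rate = throughput_mb_per_sec(TEST_FILE_BYTES, 200.0)
tsm_rate = throughput_mb_per_sec(TEST_FILE_BYTES, 260.0)

print(f"FTP: {ftp_rate:.2f} MB/s, TSM: {tsm_rate:.2f} MB/s")
```

If the FTP figure is itself well below what the network should deliver, that points outside TSM. If FTP is fine and TSM is much slower, look at the TSM settings first.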



Windows Network Duplex Settings

If the backup throughput from a Windows server suddenly dies, or if you start to back up a new server and it runs very slowly, a very common cause is that the network speed and duplex settings are wrong. They should be set to 100Mb full duplex, not AUTO. On a Windows 2000 server, go into
Start
Settings
Network and dialup connections
Click on the LAN symbol
click on properties
Click on Configure
Open the Advanced window
Page down to the 'Speed and Duplex' panel
Check the value

To check this on an NT server, go into
Start
Settings
Network and dialup connections
Hit the 'Configure' button
Take the 'Advanced' window
Select the 'Speed and Duplex' option
Check that the setting in the drop-down window is set to 100Mb full duplex



Trace Parameters

You can focus in on a problem by adding a trace parameter to your dsm.opt file. The parameter is


 traceflags instr_client_detail


 In addition to the summary information you usually get after backup
 you'll find something like this:

 ------------------------------------------------------------------
 Final Detailed Instrumentation statistics
 Elapsed time:  1502.420 sec
 Section      Total Time(sec)  Average Time(msec)  Frequency used
 ------------------------------------------------------------------
 Client Setup       15.081        15081.0              1
 Process Dirs      331.908          191.4           1734
 Solve Tree          0.000            0.0              0
 Compute             1.021            0.0          76446
 Transaction        43.535            0.2         244347
 BeginTxn Verb       0.000            0.0            269
 File I/O          733.274            8.2          89169
 Compression         0.000            0.0              0
 Encryption          0.000            0.0              0
 Delta               0.000            0.0              0
 Data Verb         474.198            6.2          76446
 Confirm Verb        0.251           16.7             15
 EndTxn Verb       195.729          727.6            269
 Client Cleanup      1.612         1612.0              1
 ------------------------------------------------------------------

You need to restart DSMCAD to start, stop or change trace parameters.

You can use this information to get an idea of where most of the time is going. If only some of your clients are performing badly, compare traces between good and bad clients.
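If you have several traces to compare, it can help to pull the detail rows out of the report and rank the sections by total time. The following is a minimal sketch that assumes the report layout shown above (section name, total seconds, average milliseconds, frequency); the function names are my own and are not part of TSM.

```python
import re

# Detail rows copied from the instrumentation report above.
SAMPLE = """\
Client Setup       15.081        15081.0              1
Process Dirs      331.908          191.4           1734
Solve Tree          0.000            0.0              0
Compute             1.021            0.0          76446
Transaction        43.535            0.2         244347
BeginTxn Verb       0.000            0.0            269
File I/O          733.274            8.2          89169
Compression         0.000            0.0              0
Encryption          0.000            0.0              0
Delta               0.000            0.0              0
Data Verb         474.198            6.2          76446
Confirm Verb        0.251           16.7             15
EndTxn Verb       195.729          727.6            269
Client Cleanup      1.612         1612.0              1
"""

# A data row is: section name, total seconds, average msec, frequency.
ROW = re.compile(r"^\s*(.+?)\s+(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+)\s*$")


def parse_sections(text):
    """Return (name, total_sec, avg_msec, frequency) tuples, skipping other lines."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            name, total, avg, freq = match.groups()
            rows.append((name, float(total), float(avg), int(freq)))
    return rows


def rank_by_total(rows):
    """Sort sections so the biggest consumers of time come first."""
    return sorted(rows, key=lambda row: row[1], reverse=True)


rows = parse_sections(SAMPLE)
ranked = rank_by_total(rows)
for name, total, _avg, _freq in ranked[:3]:
    print(f"{name:<15} {total:10.3f} sec")
```

For the sample report, File I/O, Data Verb and Process Dirs come out on top. Note that the section totals add up to more than the elapsed time (about 1797 seconds against 1502), so treat them as a rough guide to where to look rather than an exact breakdown.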

Other trace options are

      tracefile    output_file_name.txt
      traceflags   perform



Buffering

Efficient use of buffers can speed up TSM backups and restores by up to 30%. In this context, buffering means both preloading data into central storage and grouping data together into larger chunks for efficient transfer. Run the 'q db f=d' command to view the cache hit percentage on your server during high activity. If it is below about 98 percent, you need to investigate improving your buffering.

The TSM parameters which control buffering are -

SELFTUNEBUFPSIZE
BUFPOOLSIZE
These two parameters are complementary. You can set the size of your buffer pool yourself, using BUFPOOLSIZE, or you can let the system decide what to use by setting SELFTUNEBUFPSIZE to 'YES'. These parameters determine how much memory the TSM server uses for its database buffer pool. The recommendation is to use SELFTUNEBUFPSIZE. If you size the pool yourself, use about 10% of physical memory, with a target of keeping the cache hit ratio at 98% or higher.

Beware of setting BUFPOOLSIZE too high, as that can cause TSM to hold so much memory for its own use that the Operating System doesn't have enough. This will result in high paging, and poor TSM performance.

Use the command q db format=detail to see what your database cache hit rate is. If the system is paging, then the results of this command will be misleading. If the cache hit rate is too low, try raising the value of BUFPOOLSIZE, 1MB at a time. You have to restart the server for the values to take effect. Check that the increase has not caused paging, then repeat the q db command to see if the cache hit rate has risen.
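The tuning loop described above can be written down as a small decision helper. This is only a sketch of the rule of thumb, not anything TSM provides: the function names are invented, and it assumes BUFPOOLSIZE is expressed in KB, so that a 1MB step is 1024.

```python
def suggest_bufpoolsize_kb(physical_memory_mb):
    """Starting point: about 10% of physical memory, expressed in KB (assumed unit)."""
    return physical_memory_mb * 1024 // 10


def next_bufpoolsize_kb(current_kb, cache_hit_pct, paging,
                        step_kb=1024, target_pct=98.0):
    """Return the BUFPOOLSIZE value to try next.

    The value is left unchanged while the system is paging, because the
    cache hit figure is misleading then, and once the target is reached.
    """
    if paging:
        return current_kb
    if cache_hit_pct < target_pct:
        return current_kb + step_kb
    return current_kb


# Hypothetical server with 4 GB of memory and a 95% cache hit rate.
start_kb = suggest_bufpoolsize_kb(4096)
print(start_kb, next_bufpoolsize_kb(start_kb, 95.0, paging=False))
```

Remember that each step still needs a server restart and a fresh look at both paging and the cache hit rate before you go any further.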

TCPNOdelay
Set this to YES

USELARGEBUFFERS
The default setting is USELARGEBUFFERS YES. Make sure it is set on both the server and the clients.

DISKBUFFSIZE, LARGECOMMBUFFERS
LARGECOMMBUFFERS is a client parameter which seems to be interchangeable with USELARGEBUFFERS. It should be set to YES. If only life were so simple. For the large buffers to take effect, every single link in your network must also be configured for large buffers. If you have fast ethernet, make sure you explicitly configure the speeds on the switch ports rather than setting them to autodetect, to prevent transfer size mismatches.
LARGECOMMBUFFERS=YES was replaced by DISKBUFFSIZE=nn in TSM 5.3. DISKBUFFSIZE=32 seems to work well for Windows clients at least.

TXNGROUPMAX
TXNBYTELIMIT

This pair of parameters is used to batch up small file transfers, so that the transfer overhead on an individual file is shared out. The default sizes are quite small. You should try experimenting with larger sizes to find the optimum for your configuration. IBM recommends setting TXNGroupmax to 256 and TXNBytelimit to 2048 when the primary storage pool is on disk. For tape, txnbytelimit=2097152 seems to work well for LTO, DLT and 9940 drives, while 25600 seems best for 9840 and 3590 devices.
If you increase TXNGroupmax and TXNBytelimit, keep an eye on your recovery logs, as they will need more space. If you find that performance actually gets worse, it is possible that this is due to faults on your network which are causing a lot of retries. Retries take longer with bigger data chunks, which can totally offset the benefits of lower transport overheads.
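To see why larger limits cut the overhead, here is a simplified model of the batching. It is an illustration only and not how TSM is implemented: files are added to a transaction until either the file count or the byte limit would be exceeded. The function name is invented, sizes are assumed to be in KB, and the file sizes and the smaller limit of 40 are made-up values.

```python
def batch_files(sizes_kb, group_max, byte_limit_kb):
    """Group file sizes into transactions bounded by a count and a size limit."""
    batches, current, current_kb = [], [], 0
    for size in sizes_kb:
        # Close the current batch if it is full or this file would overflow it.
        if current and (len(current) >= group_max
                        or current_kb + size > byte_limit_kb):
            batches.append(current)
            current, current_kb = [], 0
        current.append(size)
        current_kb += size
    if current:
        batches.append(current)
    return batches


# Hypothetical workload: 1000 files of 4 KB each.
small_files = [4] * 1000
few_per_txn = batch_files(small_files, group_max=40, byte_limit_kb=2048)
more_per_txn = batch_files(small_files, group_max=256, byte_limit_kb=2048)

print(len(few_per_txn), "transactions versus", len(more_per_txn))
```

Fewer transactions means fewer commits and less per-transaction overhead, but each transaction is bigger, which is why the recovery log needs more space and why a retry costs more.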

TCPWindowsize
TCPBUFFSIZE

The setting depends on your TSM server platform. A TCPWindowsize of 63 is best for Windows servers, and 64 for UNIX servers. If a Windows 2000 server is communicating with Windows 2000 clients only, then the TCPWindowsize parameter can be larger, as Win2k supports TCP window scaling. Try a value of 512 for TCPBUFFSIZE, as this seems to work well for Win2k clients.



Multi-streaming

RESOURCEUTILIZATION is an option which you set in the client options file to enable multiple backup streams. The resources are the number of control sessions (sessions that figure out what to back up) and the number of transfer sessions (sessions that actually back up or archive the data). If you set RESOURCEUTILIZATION to 8 on a client, it will not necessarily use 4 concurrent data transfer sessions and 4 control sessions. RESOURCEUTILIZATION just provides a guideline for the number of resources the client should use. The number of concurrent sessions you get will be based on the real-time performance characteristics of the client and the value of RESOURCEUTILIZATION. The higher the RESOURCEUTILIZATION value, the more producer/consumer sessions the client may use, but if the system is starved of other resources, or the number of files to process does not warrant it, then a larger number of sessions may not be used, even with a large RESOURCEUTILIZATION value.



Directory structure restores

Few sites can afford the luxury of keeping all their TSM backup data on disk (anyone out there?). However, if you're recovering a Netware server, or even a large directory structure, the restore goes a lot faster if the directories are held in a separate disk storage pool.
Set up a disk storage pool for directories, and allocate a management class which sends the directories to it. This disk storage pool should not require a lot of space, since directories are typically very small.
Then specify the option DIRMc directorymgmtclassname.



Server tasks

EXPinterval
This parameter specifies the interval between automatic expiration runs for backup and archive files. This process is very CPU intensive and needs to run at a quiet time. It's best to set EXPinterval to 0 and run expiration from an admin schedule.

Logpool size

This parameter determines the size of the recovery log buffer pool. If the buffer pool is not big enough, transactions will wait while recovery records are written to the log. You can see if this is a problem by using the command q log format=detail. The command will show the wait percentage, which ideally should be 0. If it's not 0, try increasing the Logpoolsize parameter, but make sure it does not affect overall system memory usage. A Logpoolsize of about 4096 is fairly standard.

TSM Server caching is designed to optimize restore times but sites have experienced slow migration times with caching active. If you are having problems with migration, consider turning caching off, but be aware that this could affect restore speeds.



Indications of disk problems

The following two SQL queries are based on an IBM white paper and are intended to help you decide if your TSM server disks need tuning. The basic idea is to look at how fast your database backups and expire inventory are going, and if they are below 'normal' figures then you might have disk issues.

Database Backups

Run the following SQL query on your server. The query is just shown as one long line so you can cut and paste it without having to remove end-of-line markers.
select activity, cast((end_time) as date) as "Date", ((bytes/1048576)/cast ((end_time-start_time) seconds as decimal(18,13))*3600) "MB/Hr" from summary where activity='FULL_DBBACKUP' and days(end_time) - days(start_time)=0
The output looks something like this:

ACTIVITY                     Date                 MB/Hr
------------------     ----------     -----------------
FULL_DBBACKUP          2005-01-30                 31026
FULL_DBBACKUP          2005-02-06                 33976

IBM states that if the backup processes less than about 28 GB per hour, this might indicate a disk problem and further investigation is advised.
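The query reports MB per hour while the guideline is quoted in GB per hour, so it is worth spelling out the arithmetic. This sketch repeats the calculation the query does and compares it with the 28 GB figure. The function names and the backup size and timings are invented for illustration, and it assumes 1 GB = 1024 MB.

```python
from datetime import datetime

# About 28 GB per hour, expressed in MB per hour (assuming 1 GB = 1024 MB).
THRESHOLD_MB_PER_HR = 28 * 1024


def mb_per_hour(bytes_moved, start, end):
    """Same arithmetic as the query: bytes to MB, divided by seconds, times 3600."""
    seconds = (end - start).total_seconds()
    if seconds <= 0:
        raise ValueError("end time must be after start time")
    return bytes_moved / 1048576 / seconds * 3600


def below_guideline(rate_mb_per_hr):
    """True if a database backup ran slower than the guideline figure."""
    return rate_mb_per_hr < THRESHOLD_MB_PER_HR


# Hypothetical 20 GB database backup that took 40 minutes.
rate = mb_per_hour(20 * 1024**3,
                   datetime(2005, 1, 30, 2, 0, 0),
                   datetime(2005, 1, 30, 2, 40, 0))
print(f"{rate:.0f} MB/hr, below guideline: {below_guideline(rate)}")
```

Both sample rows in the output above (31026 and 33976 MB/hr) are over 28672 MB/hr, so neither would be flagged.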
Another possible indication is expire inventory processing. Try the following SQL query:
select activity, cast((end_time) as date) as "Date", (examined/cast ((end_time-start_time) seconds as decimal(24,2))*3600) "Objects Examined Up/Hr" from summary where activity='EXPIRATION'
The output looks something like this:

ACTIVITY                     Date                Objects Examined Up/Hr
------------------     ----------     ---------------------------------
EXPIRATION             2005-01-24                   2086078.85587918800
EXPIRATION             2005-01-26                   1430425.08811519200
EXPIRATION             2005-01-26                   2309000.17643557200
EXPIRATION             2005-01-27                   2343761.01158234400
EXPIRATION             2005-02-04                    579753.49273113600
EXPIRATION             2005-02-04                     64950.70455612000
EXPIRATION             2005-02-06                    131093.51240872800

It is difficult to say what is acceptable with this query, as so many factors can affect the throughput. If the throughput drops suddenly, this may indicate disk problems. The output above clearly indicates a potential problem from 2005-02-04 onwards.
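Since there is no absolute figure to aim for, one way to spot a sudden drop is to compare each run against a baseline taken from earlier runs. The sketch below uses the sample output above. The choice of the first four runs as the baseline and a 50% cut-off are arbitrary assumptions of mine, not IBM guidance, and the function name is invented.

```python
from statistics import median

# (date, objects examined per hour) taken from the sample output above.
RUNS = [
    ("2005-01-24", 2086078.86),
    ("2005-01-26", 1430425.09),
    ("2005-01-26", 2309000.18),
    ("2005-01-27", 2343761.01),
    ("2005-02-04", 579753.49),
    ("2005-02-04", 64950.70),
    ("2005-02-06", 131093.51),
]


def flag_drops(runs, baseline_runs=4, ratio=0.5):
    """Return dates of later runs slower than ratio * median of the baseline runs."""
    baseline = median(rate for _, rate in runs[:baseline_runs])
    return [date for date, rate in runs[baseline_runs:] if rate < ratio * baseline]


print(flag_drops(RUNS))
```

With these assumptions, all three runs from 2005-02-04 onwards are flagged. On a real server you would want to tune the baseline period and the cut-off to suit your own expiration pattern.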


Hardware Configuration

Try to spread your database and log volumes across SCSI controllers.

Use several small volumes for disk pools rather than a small number of large volumes. Sessions lock volumes so more volumes means more simultaneous sessions.

Consider defining more TSM servers, to split your database. The hardware configuration dictates the size of the database that you can support. If your database backup is taking more than a couple of hours, then either you need a bigger TSM server, or two servers. Typically, an H70 RS/6000 will support a 70GB database, but that is too big for an R40.



Tape to Tape copy performance

The options which affect tape-to-tape copy most are movebatchsize, movesizethresh and bufpoolsize. Bufpoolsize is explained above.

Movebatchsize and Movesizethresh determine how many files are grouped together and moved in a single transaction. Movebatchsize is the number of files which will be grouped, movesizethresh is the cumulative size of all the files that will be moved. Files are batched up until one of these thresholds is reached, then the files are sent. The default for movebatchsize is 40, but consider setting this to 1000 (the maximum), and set movesizethresh to 500. However, if the numbers are set high, then you will need more space in the recovery log. If you change the settings, keep an eye on the log for a while, and make sure it is not getting too full.


These parameters, and TXNGROUPMAX, can be dynamically changed by TSM, if SELFTUNETXNsize is set to YES.

There is a new (TSM 4.2) parameter, TAPEIOBUFS, which can speed up access to 3590 tapes on AIX servers. The default value is 1, and it can be set up to 9. I have no experience to make a recommendation on this one.



Copyright © Lascon Storage Ltd. 2000 to present date. By entering and using this site, you accept the conditions and limitations of use