IBM Spectrum Protect Performance Tuning
General performance considerations
It is worth noting that IBM does not provide extensive analysis of performance problems as part of its normal support package, but will assist as a billable service. So it is a good idea to see what you can do yourself before paying for help.
Do not treat Spectrum Protect performance as a one-time fix; you need to monitor your system regularly to make sure performance is not degrading. One way to do this is to record the backup times for a problem client; if the backup time starts extending, you may have an overall systems issue. Another way is to trap the ANR0987I end of process message for Expire Inventory and record the time. As this is a resource intensive server process, it can be a way of indicating server issues.
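As a rough illustration of that kind of monitoring, the sketch below compares the average of the most recent recorded run times against an earlier baseline. The function name, the window sizes and the 25% tolerance are all assumptions chosen for the example, not IBM recommendations, and the durations are made-up figures.

```python
from statistics import mean


def is_degrading(durations_min, baseline_runs=7, recent_runs=3, tolerance=1.25):
    """Return True if the recent average run time exceeds the baseline
    average by more than the given tolerance factor.

    durations_min: run times in minutes, oldest first (for example nightly
    backup times for one client, or Expire Inventory elapsed times).
    """
    if len(durations_min) < baseline_runs + recent_runs:
        return False  # not enough history to make a judgement
    baseline = mean(durations_min[:baseline_runs])
    recent = mean(durations_min[-recent_runs:])
    return recent > baseline * tolerance


# Hypothetical history: steady at about an hour, then creeping upwards
history = [62, 60, 65, 61, 63, 64, 60, 66, 80, 85, 90]
print(is_degrading(history))
```

A check like this only tells you that something has changed; you still need to work out what.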
If you see a problem developing, consult the IBM documentation, and maybe some of the stuff below, and see if you can identify the problem. If you find a problem, and a solution, make ONE change, and track the result for a few days before trying anything else. Keep a record of what you changed, and what the original parameter was so you can back out if necessary.
I used to work for a manager who liked to get the last drop of use out of any piece of equipment and would defer any upgrade to save a bit of money. The consequence was expensive performance degradation and space errors. Make sure your Spectrum Protect servers have enough system memory, and enough disk capacity to handle the overnight backups. Make sure your backup-archive clients have enough memory and that you have enough network bandwidth. Don't let anything get close to its limit, as performance will suffer.
I had another manager who always wanted to go a bit further and pare a few more subseconds off response times, expending effort and money in the process. Your performance will never be perfect, so work out what is good enough and when you get there, stop trying to improve, just go back to regular monitoring.
If backup performance degrades quite quickly, look for what has changed. Have you just added lots of new clients, or even a few very large clients? Have you installed a new O/S release, has your hardware been upgraded, or has your network configuration changed? Have any of these environments had problems lately? Change is always a good place to start.
Here are some other items that can cause performance issues.
- size of files being backed up
- number of files being backed up
- rate that files can be read from disk
- concurrent read datastreams from the same disk
- rate that client can send data
- network topology between client/server
- rate that server can receive data
- concurrent data streams into the server
- tape drive speed (streaming, start/stop)
- bus speed to tape drive
- concurrent data streams for multiple tape drives sharing a single bus
- compressibility of data
- I/O capability of TSM server
- CPU speed of TSM server
- anti-virus software can slow backups down
There is no easy way to work out exactly what is causing your problem. A good starting point is to find out if the problem is with TSM, or with the hardware, or the network. FTP a big file from the affected client to the TSM server disk, and see how long it takes. FTP will always be a bit faster than TSM, as it has no database overhead.
However if the FTP times are slow, the problem is probably outside TSM. A common network problem is mixed full duplex/half duplex environments.
IBM has some performance related documentation for each TSM release. It can be found here:
For V7.1 http://pic.dhe.ibm.com/infocenter/tsminfo/v7r1/topic/com.ibm.itsm.perf.doc/c_performance.html
For V8.1 https://www.ibm.com/support/knowledgecenter/SSEQVQ_8.1.0/perf/b_perf_tuning_guide2.pdf
For V8.1.7 https://www.ibm.com/support/knowledgecenter/en/SSEQVQ_8.1.7/perf/t_tuning_parts.html
Spectrum Protect Server performance
It is best to avoid resource contention between TSM backups and TSM maintenance tasks. Schedule server maintenance operations to run outside of the client backup window, with as little overlap as possible. Rather than let these processes start automatically, schedule them to run at specific times. The exact times for each step will depend on your site and how much work is happening, so you might need a little trial and error to get the best times for you. Some tasks need to run to completion, and some can be stopped before the next task starts. However, be aware that if you consistently stop tasks like expiration and pool migration, you are likely to run out of storage space. Here is a suggested schedule.
08:00 - storage pool backup - run to completion
11:00 - expiration, halt at 14:00
14:00 - storage pool migration, halt at 16:00
16:00 - reclamation, halt at 18:00
18:00 - database, volhist and device config backup - run to completion
20:00 - client backup - run to completion
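One way to sanity-check a schedule like this before you commit it to admin schedules is to write it down as data and test that no task with a halt time runs past the start of the next one. The sketch below does that for the suggested times; the `find_overlaps` helper is a hypothetical name for this example, and tasks that run to completion cannot be checked this way because their end time is not known in advance.

```python
from datetime import time

# (task, start, halt) from the suggested schedule; halt=None means "run to completion"
SCHEDULE = [
    ("storage pool backup", time(8, 0), None),
    ("expiration", time(11, 0), time(14, 0)),
    ("storage pool migration", time(14, 0), time(16, 0)),
    ("reclamation", time(16, 0), time(18, 0)),
    ("database, volhist and device config backup", time(18, 0), None),
    ("client backup", time(20, 0), None),
]


def find_overlaps(schedule):
    """Return (task, next_task) pairs where a task's halt time is later
    than the start time of the task that follows it."""
    overlaps = []
    for (name, _start, halt), (next_name, next_start, _) in zip(schedule, schedule[1:]):
        if halt is not None and halt > next_start:
            overlaps.append((name, next_name))
    return overlaps


print(find_overlaps(SCHEDULE))  # an empty list means no planned overlap
```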
Ideally, an IBM Spectrum Protect server should run as a single instance on its own physical server as it tends to be resource hungry. There is a DBMEMPERCENT parameter that determines how much system memory Spectrum Protect can use, and the default value is 'AUTO', which means the database manager sets the percentage automatically to a value of approximately 70 to 80 percent of system RAM.
If you do run more than one server instance on a machine, then you may need to configure DBMEMPERCENT for best performance. You should set the DBMEMPERCENT option for each instance to dedicate a portion of memory. If you run other applications besides Spectrum Protect on a machine, you will need to lower DBMEMPERCENT to allow those applications to get adequate memory.
Volume History File too large
You need to prune your volume history file regularly, because if it gets too large it can degrade the performance of backups and of sequential media operations. If you use DRM, then this pruning is managed by the SET DRMDBBACKUPEXPIREDAYS command and will happen automatically. If you don't use DRM you will need to do it manually, most conveniently by including a DELETE VOLHISTORY command in your housekeeping script.
Spectrum Protect needs volume history information to restore the database, so at a minimum, it will not let you remove the most current database snapshot entry by deleting volume history. However, you might want to restore the database from older backups, so make sure that you don't delete any volume history that you need to restore your oldest database backup, or any other data that you will need.
The EXPinterval server option specifies the interval between automatic expiration runs for backup and archive files. Expiration is very CPU intensive, and needs to run at a quiet time. It's best to set EXPinterval to 0, and run Expire Inventory from an Admin schedule as described above. That way, you can make sure you interleave the process between other important processes.
The IBM Spectrum Protect server reads and writes storage pool data with non-buffered I/O, or in other words, it does not use file system cache. You can change this, but it is not generally recommended, as it can both slow things down and increase CPU usage. However, file system cache might help performance if your disk storage system has a small cache, or does not provide read-ahead capability. If you decide to change it, measure the result carefully to make sure you get a significant performance improvement.
You change this with an entry in the dsmserv.opt file, and you can manage container storage pools and other pools independently. The parameters are:
DIOENABLED NO for container storage pools
DIRECTIO NO for all other storage pools
Once you make the change, restart the Spectrum Protect server to bring the changes in. If performance does not improve, back the changes out again.
Indications of disk problems
The following two SQL queries are based on an IBM white paper and are intended to help you decide if your TSM server disks need tuning. The basic idea is to look at how fast your database backups and expire inventory are going, and if they are below 'normal' figures then you might have disk issues.
Run the following SQL query on your server. The query is just shown as one long line so you can cut and paste it without having to remove end-of-line markers.
select activity, ((bytes/1048576)/cast ((end_time-start_time) seconds as decimal(18,13))*3600) "MB/Hr" from summary where activity='FULL_DBBACKUP' and days(end_time) - days(start_time)=0
output looks something like
IBM states that if the database backup processes less than about 28 GB per hour, then this might indicate a disk problem and further investigation is advised.
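To make the arithmetic behind the "MB/Hr" column explicit, here is a small sketch that performs the same calculation as the query: bytes divided by 1048576, divided by elapsed seconds, multiplied by 3600. The function names and the sample figures are made up for illustration, and treating 28 GB as 28 x 1024 MB is an assumption about how the guideline is meant to be read.

```python
from datetime import datetime

MIB = 1048576  # same divisor the SQL query uses to convert bytes to MB


def mb_per_hour(bytes_moved, start_time, end_time):
    """Throughput in MB per hour for one database backup."""
    seconds = (end_time - start_time).total_seconds()
    if seconds <= 0:
        raise ValueError("end_time must be later than start_time")
    return (bytes_moved / MIB) / seconds * 3600


def below_guideline(mb_hr, guideline_gb_hr=28):
    """True if throughput is under the roughly 28 GB per hour guideline."""
    return mb_hr < guideline_gb_hr * 1024


# Hypothetical backup: 50 GB moved between 18:00 and 20:30
rate = mb_per_hour(50 * 1024**3,
                   datetime(2024, 1, 15, 18, 0),
                   datetime(2024, 1, 15, 20, 30))
print(round(rate), below_guideline(rate))
```

In this made-up case the backup runs at about 20 GB per hour, so it would be flagged for a closer look at the disks.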
Another possible indication is expire inventory processing. Try the following SQL query
select activity, cast((end_time) as date) as "Date", (examined/cast ((end_time-start_time) seconds as decimal(24,2))*3600) "Objects Examined Up/Hr" from summary where activity='EXPIRATION'
output looks something like
It is difficult to say what is acceptable with this query as so many factors can affect the throughput. If the throughput drops suddenly then this may indicate possible disk problems. The query above is clearly indicating a potential problem after Feb 02.
Your replication is working properly if you see the same number of replicated files on both source and target servers. You can check this with the QUERY REPLNODE command.
If your file counts do not match, you can use two server options to try to speed up replication.
REPLBATCHSIZE - how many files to include in a batch transaction.
REPLSIZETHRESH - the size of the batch, in megabytes.
Both parameters default to 4096. You can increase these parameters to see if performance improves, but if you do, you might need to double the size of your active log. It would be best to increase one parameter at a time, by a small amount, say 10%, then monitor the result for a few replications, watching the active log size and, hopefully, the improvement in replication performance. If you make the transaction size too large, this can make other server processes run slowly, so monitor the other processes too, and if they are affected, reduce the REPLSIZETHRESH parameter again.
Server-side Data Deduplication
You can tune the way data deduplication works on normal storage pools (i.e. not container storage pools) with 2 parameters on the IDENTIFY DUPLICATES command.
IDENTIFY DUPLICATES NUMPRocess=n DURation=min
The NUMPR parameter is the number of duplicate identification processes that will run and can take a value between 1 and 50. The usual value is between 25 and 40, but don't allocate more processes than the number of processor cores available on your server.
DUR is the time limit in minutes, after which the processes will be cancelled. The value can be between 1 and 9999. You do not want to run deduplication identification at the same time as storage pool backups as they will contend. You should schedule the backup to run and complete first.
You can also alter the threshold for reclamation of a deduplicated storage pool. The default is that the pool will be reclaimed to 60%, but you can alter this up or down, to reclaim more data, or to complete the task within the allotted time, but process less data.
If you deduplicate large amounts of data you can get deadlocks on your server. If so, you would need to increase the DB2 LOCKLIST parameter. DB2 normally handles the locklist automatically and increases or decreases it as needed, redriving any affected transactions. However if it gets a contention issue and is hit with a large amount of data, then the recovery time is excessive, and transactions fail. Contact IBM for advice before altering DB2, but the command you would use is
db2 "connect to db name"
db2 "update db cfg for db name using LOCKLIST nnnnnnn immediate"
Where nnnnnnn depends on the amount of data that is moved concurrently: 122000 for 500 GB, 244000 for 1 TB and 1220000 for 5 TB.
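The three figures quoted all work out to 244 per GB if you treat 1 TB as 1000 GB, so for an in-between data volume you could interpolate along the same line. The sketch below does only that; the linear rule and the `estimate_locklist` name are assumptions drawn from those three data points rather than a published formula, so confirm any value with IBM before applying it.

```python
LOCKLIST_PER_GB = 244  # 122000 / 500 GB; the 1 TB and 5 TB figures give the same ratio


def estimate_locklist(concurrent_gb):
    """Estimate a LOCKLIST value for a given amount of concurrently
    moved data in GB, by scaling the quoted figures linearly."""
    if concurrent_gb <= 0:
        raise ValueError("concurrent_gb must be positive")
    return round(concurrent_gb * LOCKLIST_PER_GB)


for gb in (500, 1000, 2000, 5000):
    print(gb, estimate_locklist(gb))
```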
Spectrum Protect Database performance
Getting your hardware configuration correct is probably the most important factor in good Spectrum Protect database performance. This is explored in detail in the database page, but very briefly, your database must be spread over at least 4 LUNs or volumes, and be on different volumes than the active log.
If the DB2 INTRA_PARALLEL option is mistakenly set to YES, this can degrade your database transaction performance.
If it is set to YES, then run the following commands from the DB2 command line to fix it.
db2 attach to TSM server instance
db2 update dbm cfg using INTRA_PARALLEL NO
DB2 and LDAP
DB2 can be configured to use LDAP user authentication, and this can slow the authentication between the Server and DB2, especially if LDAP is broken. This will really slow down processes like inventory file expiration that are heavy database users. The recommendation is that you should consider disabling LDAP user authentication if issues cannot be fixed.
Table Reorgs and RUNSTATS
The DB2 system that underlies IBM Spectrum Protect should automatically reorganise the database tables and indexes, and also run RUNSTATS to optimise the paths through the database. If this stops working it will slow your server down. Potential issues and fixes for reorgs are discussed on the IBM SP Database and Log page.
'RUNSTATS' is used to optimise the access paths to a DB2 database. TSM should run RUNSTATS regularly, but how can you check when it last ran? This could be important if you see database performance starting to suffer. To find out, run the following.
Start up a DB2 command line,
On Windows, go to Start-Programs-IBM DB2-Command Line tools-Command Window
On UNIX, su - db2inst1 (db2inst1 is the default instance; if you changed the instance name or have multiple instances, you need to su to the correct userid for your instance). You then type 'db2' to open the DB2 command line.
From the db2 command line type
db2 => select stats_time, SUBSTR(TABNAME,1,40) from syscat.tables where tabschema='TSMDB1' and stats_time is not null order by stats_time
The output should look something like below, where the first column contains the date when runstats last ran against the table in column 2.
Spectrum Protect Client performance
If you think that your problems may be down to a subset of clients, then you can investigate the performance of all the clients on your server by checking the server accounting records.
Accounting records are held on the IBM Spectrum Protect server host, and a record is written for every client when a session ends. Accounting is switched off by default, and can be switched on with the SET ACCOUNTING command. Accounting logs are called dsmacct.log and are stored in the server directory by default, but this can be changed by setting the DSMSERV_ACCOUNTING_DIR variable.
Each accounting record consists of 31 fields separated by commas, which makes the log really useful for loading into a spreadsheet. Some of the useful fields are
4 - date
5 - time
6 - client name
17 - data backed up in KB
20 - data sent from client to server in KB
21 - session duration in seconds
22 - idle wait time in seconds
23 - comms wait time in seconds
24 - media wait time in seconds
The IBM manuals tell you what the rest of the fields are.
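If you would rather script the analysis than use a spreadsheet, the field positions listed above are enough to pull out the interesting values. The sketch below is a minimal example under that assumption: the dictionary keys are names invented for this example, and the sample record is synthetic (zeros everywhere except the fields of interest), not a real dsmacct.log line.

```python
import csv

# 1-based field positions, as listed above
FIELDS = {
    "date": 4, "time": 5, "client": 6, "backup_kb": 17, "sent_kb": 20,
    "duration_s": 21, "idle_wait_s": 22, "comm_wait_s": 23, "media_wait_s": 24,
}
NUMERIC = ("backup_kb", "sent_kb", "duration_s",
           "idle_wait_s", "comm_wait_s", "media_wait_s")


def parse_accounting(lines):
    """Turn comma-separated accounting records into a list of dicts."""
    rows = []
    for record in csv.reader(lines):
        if len(record) < max(FIELDS.values()):
            continue  # skip blank or truncated lines
        row = {name: record[pos - 1].strip() for name, pos in FIELDS.items()}
        for name in NUMERIC:
            row[name] = int(row[name] or 0)
        rows.append(row)
    return rows


def make_sample_record():
    """Build a synthetic 31-field record purely for demonstration."""
    fields = ["0"] * 31
    fields[3], fields[4], fields[5] = "01/15/2024", "22:10:05", "NODE_A"
    fields[16], fields[19] = "500000", "520000"
    fields[20], fields[21], fields[22], fields[23] = "3600", "120", "900", "0"
    return ",".join(fields)


rows = parse_accounting([make_sample_record()])
for row in rows:
    share = 100 * row["comm_wait_s"] / row["duration_s"]
    print(row["client"], f"comms wait is {share:.0f}% of the session")
```

Sorting the parsed rows by the wait-time columns is a quick way to see which clients spend most of their session waiting rather than moving data.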
'idlewaittime' is the time a session waits to receive a work request from the client after a previous work request completes. A work request is a backup, archive, query, or any other client command. For example, a client connects to the server and submits a request to do a selective backup. The backup completes and the server sends completion status to the client. The server then waits for the next request to be submitted. If the client responds 6 minutes later with a request, then idlewaittime for this segment of the session will be 6 minutes.
'commwaittime' is the time the server waited to receive data from or send data to a client. This occurs within a work request. For example, a client submits a request for a backup of a 1M file. The file data has to be sent in small chunks to the server requiring a number of receives to get the data from the client and a number of sends to acknowledge its receipt. Commwait begins when a receive or send data request is made to the communications layer and stops when the receive or send completes. For example, after the client sends some data, it will wait for an acknowledgement; if it takes 5 seconds until this acknowledgement is received, the commwaittime will be 5 seconds.
'mediawaittime' is the time the session waited for tapes to be mounted and made ready for input or output. Mediawaittime is independent of idlewaittime, commwaittime, and process time. It can be larger than duration, idlewaittime, commwaittime and process time because of the overlap in processing that takes place within the server for the session.
IBM provides a Perl script to collect IBM Spectrum Protect V8 server monitoring data. This can be useful for collecting ongoing server performance data to detect when problems appear, and also for providing support data to IBM. Details of the script can be found here
In older versions of the TSM client, tracing was disabled by default and you enabled it by putting the trace flags instr-client-detail parameter in the dsm.opt file. IBM Storage Protect clients use the enableinstrumentation option, and its default setting is yes. Collecting trace data has no performance impact on the client, so it is always enabled and gathered, unless you decide to switch it off.
In addition to the summary information you usually get after backup you'll find something like this:
You need to bounce DSMCAD to start, stop or change trace parameters.
You can use this information to get an idea of where most time gets wasted. If just some of your clients are performing badly, then compare traces between good and bad clients.
Other trace options are
Client Side buffer parameters
Set this to YES
DISKBUFFSIZE should be set to YES. For the large buffers to take effect, every single link in your network must also be configured for large buffers. If you have Fast Ethernet then make sure you explicitly configure the speeds on the switch ports rather than setting them to autodetect, to prevent transfer size mismatches.
This parameter is used to batch up small file transfers, so the transfer overhead on an individual file is shared out. The default setting for TXNBytelimit is 25600 and refers to the number of kilobytes transferred in one batch.
If you increase TXNBytelimit, keep an eye on your recovery logs, as they will need more space. If you find that performance actually gets worse, it is possible that this is due to faults on your network, which are causing a lot of retries. Retries will take longer with bigger data chunks, which can totally offset the benefits of lower transport overheads.
See the Network tuning section below.
RESOURCEUTILIZATION is an option which you set in the client options file, and which enables multiple backup streams. The resources are the number of control sessions (sessions that figure out what to back up) and the number of transfer sessions (sessions that actually back up or archive the data). If you set RESOURCEUTILIZATION to 8 on a client, then it will not necessarily use 4 concurrent data transfer sessions and 4 control sessions.
RESOURCEUTILIZATION just provides a guideline for the number of resources the client should use. The number of concurrent sessions you get will be based on the real-time performance characteristics of the client, and the value of RESOURCEUTILIZATION. The higher the RESOURCEUTILIZATION value, the more producer/consumer sessions the client may use, but if the system is starved for other resources, or the number of files to process does not warrant it, then a larger number of sessions may not be used, even with a large RESOURCEUTILIZATION value.
Client side Deduplication
If you are short of network capacity, then one option to consider is client side deduplication. This reduces the amount of data to transfer by eliminating redundant data. It can be very effective when combined with client compression, as the data is first deduplicated, then compressed. However, don't use client compression with server side deduplication as that could be both slower and could use more back end capacity than deduplication on its own.
If you want to use it, then you must both have client-side data deduplication enabled in the node definition on the Spectrum Protect server, and the client data must be directed to a 'file' type storage pool that is enabled for data deduplication. You then have three parameters that you can use to control deduplication, in the dsm.opt file, or in a client option set at the server.
The dedupcachepath option points to the client-side data deduplication cache.
The dedupcachesize option defines the maximum size of the data deduplication cache file. When that maximum is reached, the contents of the cache are deleted and new entries are added. The size is given in megabytes, with a range of 1 - 2048 and a default of 256.
Finally, you need deduplication yes to activate it.
Directory structure restores
If you're recovering a Windows server, or even a large directory structure, then the restore goes a lot faster if the directories are held in a separate disk storage pool.
Set up a disk storage pool for directories, and allocate a management class which sends the directories to it. This disk storage pool should not require a lot of space, since directories are typically very small.
Then specify option DIRMC directory_mgmtclass_name.
If domain name resolution (DNS) is not correctly configured and responding quickly, that can cause slow server connect times from clients. If this happens, speak to your system and network administrators and get the issue fixed.
If possible, use a dedicated local area network or a SAN for your backups.
Tuning TCP/IP settings
IBM provides default values for TCP/IP settings and in general they work well. However you can tune them, but if you do, change things consistently and incrementally, and monitor performance carefully afterwards to make sure performance did not get worse.
If you are running over a high latency, fast network then bigger TCP/IP windows might help. If you change the default TCPWINDOWSIZE client and server options, then you change the way that flow control works in TCP and so you must enable TCP window scaling (Transmission Control Protocol is the high level layer of TCP/IP, the one that applications interact with).
TCP flow control and the sliding window
TCP works with a pair of buffers, a SEND buffer at the Client side and a RECEIVE buffer at the server side (for client - server transmission). TCP needs to control the flow of data between the buffers, and for that it uses a 'sliding window'.
The Receiver checks the receive buffer and advertises the amount free back to the Sender system. Let's say the receive buffer is empty, so 63 bytes are free.
The Sender has lined the data up ready for transmission and has placed a 'window' over the data that starts with the first byte, and ends at byte 63, as the send buffer has 63 bytes. The Sender therefore sends 63 bytes off to the Receiver, bundled into 3 packets.
The Receiver accepts the packets, processes the first 2 and advertises back to the Sender that it has 40 bytes free. Packet 3 is using the other 23 bytes in the buffer.
The Sender now 'slides' the window along the data to byte 41, as it knows that the first 40 bytes have been sent successfully. The next 23 bytes are unacknowledged data, so the window end moves to byte 103, and bytes 64 to 103 can be transmitted.
This process then continues, with Sender and Receiver staging data in their buffers as required, until the end of data is reached. However, if the Receive buffer ever fills up, then the Receiving system advertises a receive window size of zero back to the Sender and no more data can be sent until the Receiver clears out buffer space. This is obviously not optimal as it slows the process down. So the point behind TCP/IP tuning is to make those pairs of buffers the best size so that the receiving application reads data as fast as the sending system can send it, and the receive window stays at or near the size of the receive buffer. The problem is that there is no one best value for these buffer sizes, even within IBM Spectrum Protect applications. For example, the optimum buffers might be different for backup-archive client operations and IBM Spectrum Protect for Virtual Environments operations, so you might have to use a value that is a compromise between them.
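The byte arithmetic in the walk-through above can be reproduced with a few lines of code. The sketch below is a deliberately simplified model of one window update, not a TCP implementation: the `slide_window` function and its field names are invented for this example, and it assumes byte positions are numbered from 1 as in the text.

```python
from collections import namedtuple

WindowState = namedtuple(
    "WindowState", "advertised_free window_start window_end sendable")


def slide_window(last_sent, processed, buffer_size):
    """Model one step of the sliding window.

    last_sent:   highest byte number the Sender has transmitted so far
    processed:   highest byte number the Receiver has read out of its buffer
    buffer_size: size of the Receiver's buffer in bytes
    """
    held = last_sent - processed           # bytes still sitting in the receive buffer
    if held < 0 or held > buffer_size:
        raise ValueError("inconsistent byte counts")
    free = buffer_size - held              # what the Receiver advertises back
    window_start = processed + 1           # window slides past the processed bytes
    window_end = last_sent + free          # furthest byte the Sender may now send
    sendable = (last_sent + 1, window_end) if free > 0 else None
    return WindowState(free, window_start, window_end, sendable)


# The example from the text: 63 bytes sent, the first 40 processed
print(slide_window(last_sent=63, processed=40, buffer_size=63))

# A zero window: nothing processed yet, so nothing more can be sent
print(slide_window(last_sent=63, processed=0, buffer_size=63))
```

The first call gives 40 bytes free, a window from byte 41 to byte 103, and bytes 64 to 103 ready to transmit, which matches the walk-through. The second call shows the stalled case where the Receiver advertises a window of zero.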
So, you want to tune your TCP buffers using TCPWINDOWSIZE. First of all, check how big the buffer space is on the Network Adapter. You don't want to make your TCPWINDOWSIZE bigger than this. Then try increasing your TCPWINDOWSIZE, maybe by doubling it, and check to see what the performance impact is. It might be worse. With a bit of trial and error, you should find the best value. Remember, when you are running an incremental backup, both client and server act as receivers of data. The server sends metadata back to the client so it can work out what to back up, and the client sends backup data up to the server. You can specify TCPWINDOWSIZE option at both the Spectrum Protect server and the client, but you cannot differentiate between buffers. You specify one value, which is used as the size for both the send and receive windows. While the value of TCPWINDOWSIZE can be different on both server and client, the effective buffer size will be the smaller number.
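Two small calculations can help when picking a starting value for the trial and error. The first is the rule stated above, that the effective size is the smaller of the client and server settings. The second is the bandwidth-delay product, a general networking rule of thumb (not something taken from the IBM guidance here) for how much data must be in flight to keep a link busy. The function names are hypothetical, the figures are examples, and the sketch assumes both settings are explicit positive values in kilobytes rather than 0.

```python
def effective_window_kb(client_kb, server_kb):
    """The smaller of the two configured TCPWINDOWSIZE values wins."""
    if client_kb <= 0 or server_kb <= 0:
        raise ValueError("this sketch only handles explicit, positive sizes")
    return min(client_kb, server_kb)


def bandwidth_delay_product_kb(bandwidth_mbit_s, rtt_ms):
    """Data in flight, in KB, needed to fill a link of the given speed
    (megabits per second) and round-trip time (milliseconds)."""
    bits_in_flight = bandwidth_mbit_s * 1_000_000 * (rtt_ms / 1000)
    return bits_in_flight / 8 / 1024


# Example: client set to 512, server set to 256, on a 1 Gbit/s link with 10 ms RTT
window = effective_window_kb(512, 256)
needed = bandwidth_delay_product_kb(1000, 10)
print(window, round(needed))
```

If the effective window is well below the bandwidth-delay product, a larger window may be worth testing; if it is already above it, the bottleneck is probably elsewhere.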
Many Windows systems can automatically tune the TCP receive window size, adjusting the receive window as needed for optimal performance. If your system supports this, and it is enabled, then consider setting the IBM Spectrum Protect server TCPWINDOWSIZE option to 0. Setting the option to 0 means that server sessions use the TCP window size for the operating system.
For restores, the data flow is Server to Client and the Receive buffers are at the Client side. If the Client is lacking in power then it might not be able to process larger buffers fast enough, and so might return an excessive number of zero size windows. In this case, you might need to reduce the size of the window. This can also be a problem for VMware restores, because IBM Spectrum Protect writes the data to the vStorage API where VMware does some processing, and so the client has to do even more work to empty those Receive buffers.
Tape to Tape copy performance
The options which affect tape-to-tape copy most, are movebatchsize, movesizethresh and bufpoolsize. Bufpoolsize is explained above.
Movebatchsize and Movesizethresh determine how many files are grouped together and moved in a single transaction. Movebatchsize is the number of files which will be grouped, movesizethresh is the cumulative size of all the files that will be moved. Files are batched up until one of these thresholds is reached, then the files are sent. The default for movebatchsize is 1000, which is the maximum, and the default for movesizethresh is 4096. It is possible to increase movesizethresh up to as far as 32768. However, if the numbers are set high, then you will need more space in the recovery log. If you change the settings, keep an eye on the log for a while, and make sure it is not getting too full.
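The "whichever threshold is reached first" behaviour can be pictured with a short sketch. This is only an illustration of the grouping rule as described above, not the server's actual algorithm: the `batch_files` function is hypothetical, sizes are taken to be in the same units as movesizethresh, and how the real server treats the file that crosses the threshold is an assumption.

```python
def batch_files(file_sizes, move_batch_size=1000, move_size_thresh=4096):
    """Group files into transactions, closing a batch as soon as either
    the file count or the cumulative size reaches its threshold."""
    batches, current, current_size = [], [], 0
    for size in file_sizes:
        current.append(size)
        current_size += size
        if len(current) >= move_batch_size or current_size >= move_size_thresh:
            batches.append(current)
            current, current_size = [], 0
    if current:
        batches.append(current)  # whatever is left forms the final batch
    return batches


# Many small files: the count threshold closes each batch
print([len(b) for b in batch_files([1] * 2500)])

# A few large files: the size threshold closes the batch first
print([len(b) for b in batch_files([2048, 2048, 10])])
```

With lots of small files the batches fill by count, and with large files they fill by size, which is why raising only one of the two settings may make no difference for a given workload.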
These parameters, and TXNGROUPMAX, can be dynamically changed by TSM, if SELFTUNETXNsize is set to YES.
Number of Tape Drives
You need to configure enough tape drives to cater for your workload, which includes enough for all clients that backup direct to tape simultaneously, and for housekeeping work that runs in the backup window.
Using high performance tape drives
IBM recommends that if you are using high performance tape drives with IBM Spectrum Protect, then these parameters should give best performance.
At the Server
At the Client