SSIS – SQL Server Integration Services is an improved version of DTS (Data Transformation Services) and is widely being used for ETL (Extract, Transform and Load) operations. It provides an excellent set of tasks and features that can be used in all data related activities such as extraction and applying any transformations with great performance. And above all! This package, widely being used by Database Developers and Database Administrators, comes free with SQL Server installation so if you are using SQL Server you do not need to apply/buy any other licenses for SSIS.
This article describes best practices to enhance and boost performance of a SSIS package. You may find several articles on the Internet to boost the performance, but this gives you a holistic picture and represents my views
Handling Indexes while data transformation
When working with any data in RDBMS, indexing plays a vital role in boosting search operations. Clustered and Non-Clustered Indexes both may exist in source and/or destination data. When amount of data is small, it doesn’t matter if indexes are managed run-time or indexes are managed separately. However, with a large amount of data involved while transformation, unmanaged indexes hit the performance greatly. You can see a drastic difference in the performance when transforming large amount of un-indexed data.
To handle this, first make the target table a heap by dropping all indexes created on it. Then, transfer the data to heap. At the end of the transformation, create an index on the target table. Here is a flowchart to do this activity.
Legend: NCI = Non Clustered Indexes; CI = Clustered Indexes
Avoid using select * in data flow task
Data Flow task is one of the most commonly used task of Integration Service. In Data Flow task, SSIS uses a buffer to do Transfer and Transformation task. The size of the buffer is dependent on several factors. One of them is the Estimated Row Size. The Estimated Row Size is determined by summing up the maximum size of all columns in a row.
Estimated Row Size = Sum of Maximum size of ALL columns in a Row
So more the number of columns at source, less number of rows in the buffer at destination.
Hence, it is recommended to select a minimum number of columns at source which are required at destination. So if we use the query “select * ” on source connection it will reduce the buffer size. Even if we need all the columns from source, it is better to write down the name of the columns rather than using asterisk (*). When we use asterisk (*), it takes another round for the source to gather the metadata about the columns.
Avoid using Table or View name in Variables
SSIS has provided a great feature of using variables and expressions within Script Task, Connection Managers and alike. It is recommended to avoid using Table name or View name in a variable for a source/destination connection manager. This is because, SSIS internally queries using “select *” clause on the variable used. And as described earlier, this can reduce your performance
Destination Connection settings
While setting up OLE DB Destination connection, consider following settings.
Data Access Mode : This setting provides the ‘fast load’ option which internally uses a BULK INSERT statement for uploading data into the destination table instead of a simple INSERT statements (for each single row) as in the case for other options. So unless you have a reason for changing it, don’t change this default value of fast load. If you select the ‘fast load’ option, there are also a couple of other settings which you can use as discussed below.
Keep Identity : By default this setting is unchecked which means the destination table (if it has an identity column) will create identity values on its own. If you check this setting, the dataflow engine will ensure that the source identity values are preserved and same value is inserted into the destination table.
Keep Nulls : Again by default this setting is unchecked which means default value will be inserted (if the default constraint is defined on the target column) during insert into the destination table if NULL value is coming from the source for that particular column. If you check this option then default constraint on the destination table’s column will be ignored and preserved NULL of the source column will be inserted into the destination.
Table Lock : By default this setting is checked and the recommendation is to let it be checked unless the same table is being used by some other process at same time. It specifies a table lock will be acquired on the destination table instead of acquiring multiple row level locks, which could turn into lock escalation problems.
Check Constraints : Again by default this setting is checked and recommendation is to un-check it if you are sure that the incoming data is not going to violate constraints of the destination table. This setting specifies that the dataflow pipeline engine will validate the incoming data against the constraints of target table. If you un-check this option it will improve the performance of the data load.
Setting up Rows Per Batch and Maximum Insert Commit Size Settings
Rows per batch : The default value for this setting is -1 which specifies all incoming rows will be treated as a single batch. You can change this default behavior and break all incoming rows into multiple batches. The allowed value is only positive integer which specifies the maximum number of rows in a batch.
Maximum insert commit size : The default value for this setting is ‘2147483647’ (largest value for 4 byte integer type) which specifies all incoming rows will be committed once on successful completion. You can specify a positive value for this setting to indicate that commit will be done for those number of records. You might be wondering, changing the default value for this setting will put overhead on the dataflow engine to commit several times. Yes that is true, but at the same time it will release the pressure on the transaction log and tempdb to grow tremendously specifically during high volume data transfers.
The above two settings are very important to understand to improve the performance of tempdb and the transaction log. For example if you leave ‘Max insert commit size’ to its default, the transaction log and tempdb will keep on growing during the extraction process and if you are transferring a high volume of data the tempdb will soon run out of memory as a result of this your extraction will fail. So it is recommended to set these values to an optimum value based on your environment.
Avoid Asynchronous Transformation wherever possible
SSIS runtime executes every task other than data flow task in the defined sequence. Whenever the SSIS runtime engine encounters a data flow task, it hands over the execution of the data flow task to data flow pipeline engine.
The data flow pipeline engine breaks the execution of a data flow task into one more execution tree(s) and may execute two or more execution trees in parallel to achieve high performance.
Synchronous transformations get a record, process it and pass it to the other transformation or destination in the sequence. The processing of a record is not dependent on the other incoming rows.
Whereas the asynchronous transformation requires addition buffers for its output and does not utilize the incoming input buffers. It also waits for all incoming rows to arrive for processing, that’s the reason the asynchronous transformation performs slower and must be avoided wherever possible. For example, instead of using Sort Transformation you can get sorted results from the source itself by using ORDER BY clause.
Use SSIS performance Counter where required
SSIS has great event logging mechanism that helps analyse the performance of the package and its components. SSIS has also introduced System Performance counter counters to monitor the performance of your SSIS runtime and data flow pipeline engines.
- SSIS Package Instance counter indicates the number of SSIS packages running on the system;
- Rows read and Rows written counters indicate the total number of rows coming from the source and total number of rows provided to destination;
- Buffers in use and Buffer memory counters indicate the total number buffers created and amount of memory used by them;
- Buffer spooling is a very important counter and tells about number of buffers (which are not currently in use) written to the disk when physical memory runs low;
- BLOB bytes read, BLOB bytes written and BLOB files in use counters give detail about the BLOB data transfer and tells about number of BLOB bytes read, written and total number of files that the data flow engine currently is using for spooling BLOB data etc.
I hope this article is useful.