Recently I had an interview where I was asked about the ADO.NET Bulk Copy class. Frustratingly, I had used it about 12 months prior and couldn’t remember much about it. So this post serves the dual purpose of providing a tutorial for the class and reinforcing my general knowledge.
Overview
The SqlBulkCopy class can be found in the System.Data.SqlClient namespace, and it was introduced in .NET 2.0. As the name suggests, it allows for the bulk insert of data into a SQL Server table. The benefit over a traditional insert is that the data is streamed to the server using SQL Server's bulk load mechanism, and under the right conditions the operation can be minimally logged, making it much faster at inserting a large number of rows.
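For comparison, here is a rough sketch of the row-by-row approach that SqlBulkCopy replaces. It assumes the Customer test table created later in this post, plus a connectionString and a DataTable of source rows already in scope:

//A traditional row-by-row insert for comparison - one command and one round trip per row
using (SqlConnection connection = new SqlConnection(connectionString))
{
    connection.Open();
    foreach (DataRow row in rowsToInsert.Rows)
    {
        using (SqlCommand command = new SqlCommand(
            "INSERT INTO Customer (FirstName, LastName, DOB, Phone, City) " +
            "VALUES (@FirstName, @LastName, @DOB, @Phone, @City)", connection))
        {
            command.Parameters.AddWithValue("@FirstName", row["FirstName"]);
            command.Parameters.AddWithValue("@LastName", row["LastName"]);
            command.Parameters.AddWithValue("@DOB", row["DOB"]);
            command.Parameters.AddWithValue("@Phone", row["Phone"]);
            command.Parameters.AddWithValue("@City", row["City"]);
            command.ExecuteNonQuery();
        }
    }
}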
Transactions
Transaction management applies to the SqlBulkCopy class: a SqlTransaction can be passed to the constructor if required. Rolling back reverses out all rows that were inserted by the SqlBulkCopy object, so the operation is atomic and reliable.
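As a minimal sketch (not taken from my test code), the transaction-aware constructor overload might be used like this; connectionString and rowsToInsert are assumed to be in scope:

using (SqlConnection connection = new SqlConnection(connectionString))
{
    connection.Open();
    using (SqlTransaction transaction = connection.BeginTransaction())
    using (SqlBulkCopy bulkCopy = new SqlBulkCopy(connection, SqlBulkCopyOptions.Default, transaction))
    {
        bulkCopy.DestinationTableName = "Customer";
        //ColumnMappings omitted for brevity - see the Column Mappings section below

        try
        {
            bulkCopy.WriteToServer(rowsToInsert);
            transaction.Commit();   //All copied rows become permanent
        }
        catch
        {
            transaction.Rollback(); //All rows written by bulkCopy are reversed out
            throw;
        }
    }
}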
Testing SqlBulkCopy
I set up the following table to test the SqlBulkCopy class. My target platform for this series of tests was SQL Server 2012 Developer Edition 11.0.2100.60.
CREATE TABLE [Customer]
(
    ID INT IDENTITY(1,1) NOT NULL,
    FirstName VARCHAR(50) NOT NULL,
    LastName VARCHAR(100) NOT NULL,
    DOB DATETIME NOT NULL,
    Phone VARCHAR(20) NULL,
    City VARCHAR(50) NULL,
    CONSTRAINT PK_Customer PRIMARY KEY (ID)
)
This should be sufficient to test a few things that I am curious about – non-nullable columns, an identity column and a basic primary key.
Here is the C# code I am using to run the test:
public static void BulkCopyExample(string connectionString, DataTable rowsToInsert, int batchSize)
{
    try
    {
        using (SqlBulkCopy bulkCopy = new SqlBulkCopy(connectionString))
        {
            //Batch Size
            bulkCopy.BatchSize = batchSize;

            //Destination Table Name
            bulkCopy.DestinationTableName = "Customer";

            //Write the rows to the Table
            bulkCopy.WriteToServer(rowsToInsert);

            //Close
            bulkCopy.Close();
        }
    }
    catch (Exception ex)
    {
        Console.WriteLine(ex.Message);
        Console.ReadLine();
        throw;
    }
}
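The post doesn't show how rowsToInsert is built, but here is a minimal sketch under the assumption that the column types mirror the Customer table (the connection string and sample values are placeholders):

//Build a DataTable whose columns mirror the Customer table (ID is omitted - it is an identity column)
DataTable rowsToInsert = new DataTable("Customer");
rowsToInsert.Columns.Add("FirstName", typeof(string));
rowsToInsert.Columns.Add("LastName", typeof(string));
rowsToInsert.Columns.Add("DOB", typeof(DateTime));
rowsToInsert.Columns.Add("Phone", typeof(string));
rowsToInsert.Columns.Add("City", typeof(string));

//Add a sample row - providing a real DateTime for DOB avoids conversion errors later
rowsToInsert.Rows.Add("John", "Smith", new DateTime(1980, 1, 1), "555-0100", null);

//As written, BulkCopyExample has no column mappings yet - see the Column Mappings section below
BulkCopyExample("Server=.;Database=Test;Trusted_Connection=True;", rowsToInsert, 10000);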
Column Mappings
The code above is pretty simple, but there's one key thing missing. There are no column mappings, so the SqlBulkCopy component doesn't know how to map the provided DataTable to the underlying SQL table.
Non-Nullable Columns
When the mappings aren’t done properly (or at all), you may receive an exception similar to the following:
The given value of type String from the data source cannot be converted to type datetime of the specified target column.
This particular exception is related to the DOB column, which needs to have a valid DateTime provided.
Identity Column
Simply omit the identity column from the ColumnMappings collection on the SqlBulkCopy object, and SQL Server will allocate the next available identity value for that column.
Here is the revised code, showing all of the columns neatly mapped. Note that nullable columns in the underlying table do not need to be mapped. I’ve omitted the City mapping to prove this.
//Column Mappings
bulkCopy.ColumnMappings.Add("FirstName", "FirstName");
bulkCopy.ColumnMappings.Add("LastName", "LastName");
bulkCopy.ColumnMappings.Add("DOB", "DOB");
bulkCopy.ColumnMappings.Add("Phone", "Phone");
Results
The sample row goes into the table. Note the City column is null because the mapping was omitted.
Keeping Track of Progress
The NotifyAfter property of the SqlBulkCopy object determines how frequently the SqlRowsCopied event is fired. For example:
//Notify After
bulkCopy.NotifyAfter = 1000;

//Listen for the Event
bulkCopy.SqlRowsCopied += new SqlRowsCopiedEventHandler(bulkCopy_SqlRowsCopied);
This will fire the SqlRowsCopied event after every 1,000 rows. Note that there is a performance overhead associated with the notification. In the event handler, a SqlRowsCopiedEventArgs is received, and this can be used to get the current row count.
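A minimal handler might look like the sketch below; SqlRowsCopiedEventArgs exposes the running total via its RowsCopied property (and an Abort flag if you need to cancel the operation):

static void bulkCopy_SqlRowsCopied(object sender, SqlRowsCopiedEventArgs e)
{
    //RowsCopied is the running total of rows written so far
    Console.WriteLine("{0} rows copied so far...", e.RowsCopied);

    //Setting e.Abort = true here would cancel the remainder of the bulk copy
}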
Experimenting with Batch Size
The BatchSize property determines how many rows will be sent to SQL Server in each round trip. Smaller batch sizes result in more round trips to SQL Server, but larger batch sizes take longer to commit. The optimal batch size depends on a number of factors, including the length of each row and the total number of rows being inserted.
To test the effect of different batch sizes on performance, I inserted 1,000,000 identical rows using varying batch sizes and measured the total time taken to complete the task. The table was truncated after each test to guarantee a level playing field.
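The test harness itself isn't shown in this post, but a sketch of it might look like the following; BuildSampleRows and TruncateCustomerTable are hypothetical helpers, and Stopwatch comes from System.Diagnostics:

//Hypothetical harness for timing different batch sizes
int[] batchSizes = { 1000, 10000, 50000, 70000, 100000 };
DataTable rowsToInsert = BuildSampleRows(1000000);      //assumed helper: builds 1,000,000 identical rows

foreach (int batchSize in batchSizes)
{
    TruncateCustomerTable(connectionString);            //assumed helper: TRUNCATE TABLE Customer between runs

    Stopwatch stopwatch = Stopwatch.StartNew();
    BulkCopyExample(connectionString, rowsToInsert, batchSize);
    stopwatch.Stop();

    Console.WriteLine("Batch size {0}: {1} ms", batchSize, stopwatch.ElapsedMilliseconds);
}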
The results are pretty interesting: a small batch size results in poor performance, whereas a batch size of around 70,000 seems to be optimal for my table and the sample row I was working with. It would be interesting to re-run this test for rows of differing lengths.