SQL Server Temp Tables: When Do I Create My Indexes?
Posted by Brad Wood
May 13, 2009 07:31:00 UTC
My last blog post was originally the start of this one, but I got so carried away talking about the different types of temp tables that I split it off. Today's topic is when to create indexes on SQL temp tables-- before or after you add the data. Many people out there seem to hold the same opinion on the subject, but I don't like taking other people's word for it, and I also like doing things the hard way. Therefore I created a series of tests to see which way really was fastest.

Most web sites out there seem to think that if you are using a temp table or global temp table (or a regular table, for that matter) with indexes on one or more columns, and you are going to load a large amount of data into the table, it is faster to add the data first and THEN create your indexes. If you create the indexes first, SQL Server has to rearrange the data over and over again as the data gets added. This causes lots of page splits and fragmentation. Asking SQL Server to create the indexes after the data has been loaded allows it to sort the data into the index all at once. Sounds plausible enough, but is it true?

Below are the tests I devised to find out. I ran them on SQL Server 2005 Express on my home PC (Windows XP, 1.8 GHz, 1 GB RAM, single IDE drive). Yeah, I know-- not really server specs. You'll get over it-- I did.

I needed lots of data for my test-- and for good cardinality it needed to be random. I played with several methods of generating large amounts of random strings or numbers, but finally gave up in favor of the newid() SQL function. It returns a 36-character string which I stored in a varchar(200) column in a temp table. (TODO: Try other data types) I figured 3 million records should be a healthy data set.

To be fair, I figured I should give it a go with both a clustered and a nonclustered index. For the fun of it, I devised two different tests: one that dumped all 3 million records into my temp table at once, and another that pushed them in in spurts of 10 at a time (300,000 spurts, to be exact). I set NOCOUNT on like so:
[code]
SET NOCOUNT ON
[/code]
I dropped and re-created my temp table before each test like so:
[code]
DROP TABLE #temp

CREATE TABLE #temp (test varchar(200))
[/code]
My index creations looked like this:
[code]
CREATE CLUSTERED INDEX foo ON #temp (test)

CREATE NONCLUSTERED INDEX foo ON #temp (test)
[/code]
I created 3 million records in my temp table like so:
[code]
DECLARE @counter AS int
SET @counter = 1

WHILE @counter <= 300000
BEGIN
	INSERT INTO #temp (test)
	SELECT newid()
	UNION SELECT newid()
	UNION SELECT newid()
	UNION SELECT newid()
	UNION SELECT newid()
	UNION SELECT newid()
	UNION SELECT newid()
	UNION SELECT newid()
	UNION SELECT newid()
	UNION SELECT newid()

	SET @counter = @counter + 1
END
[/code]
I used the code above directly to populate my temp table a few records at a time. For the tests where I inserted all 3 million in a single statement, I created a second temp table which I populated beforehand, and then transferred all 3 million at once like so:
[code]
INSERT #temp
SELECT test FROM #temp2
[/code]
I know my naming conventions leave something to be desired, but with counseling you should be alright. I ran each test about 3 times and took the average to help weed out anomalies. The numbers in the "before" columns are the time it took to insert the records into a temp table with the indexes already created. The numbers in the "after" columns are the time to insert the records into a non-indexed table plus the time to create the index on the populated table. The "difference" column shows how many seconds the "after" beat the "before". For those of you bored enough to still be with me, here is what I found:
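In case you want to reproduce the timings, here is a minimal sketch of one way to capture them-- the exact harness isn't shown in this post, so take the variable names as illustrative-- by diffing GETDATE() around each step:
[code]
DECLARE @start datetime
SET @start = GETDATE()

-- the step being timed, e.g. the mass insert plus index creation
INSERT #temp SELECT test FROM #temp2
CREATE CLUSTERED INDEX foo ON #temp (test)

-- elapsed time in milliseconds (datetime is only accurate to about 3 ms)
SELECT DATEDIFF(ms, @start, GETDATE()) AS elapsed_ms
[/code]
SET STATISTICS TIME ON is another option if you want SQL Server to report CPU and elapsed time per statement for you.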
- On average it is faster to wait and create the indexes after populating the data, but results did vary.
- Clustered indexes are faster than nonclustered indexes if created prior to populating the data.
- Clustered and Nonclustered indexes take about the same amount of time when created on a pre-populated table.
- The biggest difference was seen when inserting data in spurts with a Nonclustered index.
- There was virtually no difference when doing a mass insert with a Clustered index.
- Inserting data in spurts was across-the-board slower than a single insert. While I believe that to be true, keep in mind the spurts columns include the time spent on the WHILE loop and newid() functions, while the All-In-One columns do not since I pre-loaded the data into a second temp table.
- There are a number of variables I didn't cover that could affect the outcome like:
- Speed of the server
- Size of the data
- Number of indexes
- Cardinality of the data
- Sorting the data prior to inserting
- Index fill factor (to prevent page splits)
- Other data types
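To illustrate two of those variables, a pre-sorted insert and a custom fill factor are each one-line changes. This is a hypothetical sketch I did not test-- the 90 is just an illustrative value:
[code]
-- leave 10% free space per index page to reduce page splits on later inserts
CREATE CLUSTERED INDEX foo ON #temp (test) WITH (FILLFACTOR = 90)

-- feed the rows in index order so they land in the clustered index pre-sorted
INSERT INTO #temp (test)
SELECT test FROM #temp2 ORDER BY test
[/code]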
Tags: Performance, SQL
I was searching google on this topic and I stumbled upon your blog post. I just wanted to point out that you populated your temp table with 300,000 records and not 3 million as you stated.
@Mark: The loop does run only 300,000 times, but it inserts 10 records per iteration. 300,000 x 10 = (wait for it...) 3,000,000. :)
100% right - my bad!
Nicely researched. I'd have had to do this myself, and you saved me some time. Thanks!
I've been told that applying indexes to a temp table after creation will cause it not to be available for reuse across sessions (i.e. the structure will have changed, so SQL Server will rebuild the entire temp table on each use rather than reusing the structure across sessions).
You should expand your testing to cover that scenario. It may not matter for one-offs, but for temp tables used in a highly trafficked transactional db, every bit counts (so to speak :)
@Patrick: Thanks for the input.
I'm not sure if I follow you though. SQL Server's #temp tables (denoted with a single pound sign) are by nature specific to a given session. Therefore they would have to be re-created for each new database session by design.
Perhaps you are thinking of global ##temp tables (denoted with a double pound sign), which are shared by all sessions on the database. But even that doesn't make sense, because the data in the table would not be globally available if the table was re-created.
I can see how DDL statements after table population might cause your execution plan to be recompiled, but that doesn't appear to be what you are saying.
Can you provide a reference to support your theory?
@Patrick is referring to the temporary object reuse feature that was introduced in SQL Server 2005. Basically, at the end of a session, local temp tables are no longer dropped; they actually get truncated and can be reused by other sessions, assuming several conditions are met. See this blog from Christian Bolton which explains it properly:
I wonder if ordering the inserted data the same way as the clustered index (cx), prior to creating the cx, would matter. I guess it depends on how SQL sets up the index. Seems like it could.
I have created a series of temp tables, the first being #table_1. I then create #table_2 by selecting INTO #table_2 from #table_1 and adding some additional columns. I repeat this procedure several times. After I have created #table_1, I index several key columns which are used to join onto other tables that will form part of #table_2, #table_3, etc. Will this index carry on through to all the temp tables? i.e. #table_2 is created from #table_1, so does that imply #table_2 is now also indexed?