SQL Server Gotcha: Implicit Unicode Conversion

A recent thread on CF-Talk brought up this very good topic. It deals with the performance hit you can get when SQL Server implicitly converts your data to nchar and nvarchar for you when you have enabled Unicode support in your data sources. Unicode text cannot be stored in normal char or varchar fields. It must use nchar or nvarchar. These data types use two bytes per character, which means you can only store half as much text in them (limit 4,400 instead of 8,800). The problem is two fold:

SQL Server cannot directly compare a varchar and nvarchar value so it must convert one.
String manipulation or conversion on an indexed column will render the index useless

Prepare to be converted

When it comes to comparing a varchar and nvarchar, the lowest common denominator must be found. An nvarchar cannot be converted to a varchar because data would potentially be lost. Therefore, the varchar must be converted to an nvarchar. If you have a table with a char or varchar column and you wish to compare it to an nvarchar, every row in the table will need to be converted first before it can be compared to your value.

What index?

Indexes on a table effectively store a copy of the data in that column (or those columns) in a pre-sorted fashion that makes it easy to find. Imagine if you kept a separate address book sorted by middle name in case you wanted to quickly find any friend with "Fred" as a middle name.

[code]SELECT *
FROM addresses
WHERE middle_name = 'Fred'
[/code]

Your pre-sorted index on middle name would make this easy. You can skip straight to the F's, then back to the Fr's etc. Now, imagine you wanted to find all friends who had the letter "u" as the third letter in their middle name:

[code]SELECT *
FROM addresses
WHERE substring(middle_name,3,1) = 'u'
[/code]

Now, your address book sorted by middle name isn't really useful anymore. You pretty much have to go through every record in pull out the third character of every middle name. This is basically what can happen if you are selecting against a char or varchar column in your database with a Unicode value in an nchar or nvarchar variable since every value in the table must be converted before it can be compared. Let's look at this in action.

Example

If you have SQL Server 2005, Open up a New Query window Management Studio and paste in the following code:

[code]DECLARE @tmp AS TABLE
(myvarchar  varchar(50) PRIMARY KEY)

DECLARE @myparam AS varchar(50)

INSERT INTO @tmp VALUES ('test')
SET @myparam = 'test'

SELECT *
FROM @tmp
WHERE myvarchar = @myparam
[/code]

Click on the "Query" menu and choose "Include Actual Execution Plan" and click "Execute". Look at your Execution plan. Since the data types between the myvarchar column and the @myparam value matched, you will see a Clustered Index Seek was used and the predicate was
myvarchar = [@myparam]

Now, change @myparam to an nvarchar like so:

[code]DECLARE @myparam AS nvarchar(50)[/code]

Run again and look at the execution plan. First you will see we have changed to a "Clustered Index Scan" and the predicate for our select is:
CONVERT_IMPLICIT(nvarchar(50),[myvarchar],0)=[@myparam]

SQL Server implicitly converted every value in the myvarchar column to an nvarchar for us. In our quaint little example, the performance difference between both is negligible, but imagine a table with a few million records. The difference between an Index Seek and an Index Scan can go from milliseconds to minutes. As far as the fix goes, I'm not positive on this, but I think you may need to create two data sources-- one that uses Unicode format strings, and another that doesn't and use the appropriate data source based on the table you are hitting. Of course, if your app really uses Unicode strings, your database tables should probably already be nvarchars anyway. Also permissible, is to cast your nvarchar into a new variable of type varchar prior to running the select. Just make sure you aren't going to lose any data though.

SQL Server Gotcha: Implicit Unicode Conversion