Monday, August 31, 2015

Why do you need a Unicode?

Databases do not store only transnational data, sometimes that happens that the very big chunk of the data is a text.
You might try to normalize it, but it is not always possible and you dump text into VARCHAR columns.

Sometimes it happen that majority of the data in a huge table are just those text data.
Just imagine, every time you are doing table scan request your server reads all these data from disk into memory to process it.
It is not only the disk issue, these heavy read operations might push out of memory other database components, which may cause even more future reads.

On the top of that problem your database architect decided to use Unicode for all CHARs and VARCHARs...

What are database architect's reasons for using Unicode?
- Customers say they will always put it in English now, but they have worldwide client. What if you want to store a customer name which uses non-English characters? Or the name of a place in another country?
- Unicode supports Scientific, Technical and Mathematical character sets;

- To have ALL text columns in Unicode is much easier to be consistent. Unicode behaves the same no matter whose computer you run the app on;
- It's easier to support Unicode from the beginning, than try to retrofit it later;
- I do not want to put time bomb in the application. It is sound always better to be safe than sorry.

All these arguments are from the category "WHAT IF".
There are sometimes reasonable "WHAT IFs", then that has to be addressed in business requirements, but if business is not require something explicitly, that should be avoided.

I know a lot of people will disagree with me and claim that I'm not experienced enough in international database development.
Just want to answer them that I actually do a lot of international database development experience.
I've started my IT career as non-English speaker in not localized DOS 3.0 environment. I went through a character coding nightmare, which native English speakers can't even imagine and I know what I'm talking about.

Let's start.

You've asked to use Unicode because the company might have clients with foreign names, addresses and product names.

You have to ask the question: Will you have in you have in your database any international charsets from the list below?:
- Indian: "কখগঘঙচছজঝঞট";
- Arabic: "ݐݑݒݓݔݕݖݗݘݙ";
- Georgian: "ᦍᦎᦏᦐᦑᦒᦓᦔᦕᦖᦗᦘᦙ";
- Cyrillic: "БГДЖЗЛФХЦЧШЩЪЫЭЮЯ";
- Korean: "ᄒᄓᄔᄕᄖᄗᄘᄙᄚᄛᄜᄝᄞᄟ"
- Japanese: "༡༢༣༤༥༦༧༨༩༪༫༬༭༮༯";
- Chinese: "樂洛烙珞落酪駱亂卵欄爛蘭";
- Braille: "⢌⢍⢎⢏⢐⢑⢒⢓⢔⢕⢖⢗⢘⢙"
- Mathematical: "∭∛∜∮∯∰∱∲≉≊≋≌≍";

If his/her answer will be: "- No, we must not allow any of those to be allowed as Names, Addresses or Products. Otherwise nobody will be able to read it!"

Hold on, but you can have a business in Canada, Mexico, Germany or any other country, which uses Latin alphabet with extra non-English characters. Do you still need Unicode?

Here is a demo:
USE TestDB
GO
CREATE TABLE
tbl_Test_Unicode(
    [Language_ID] TINYINT IDENTITY (1,1),
    [Language] VARCHAR(20) NOT NULL,
    [Unicode] NVARCHAR(100) NOT NULL,
    [NonUnicode] VARCHAR((100) NULL
);
GO
INSERT INTO
tbl_Test_Unicode([Language], [Unicode])
VALUES ('Spanish',N'ÁÉÍÑÓÚÜ¿¡áéíñóúü'),
('French',N'ÙÛÜŸÀÂÆÇÉÈÊËÏÎÔŒùûüÿàâæçéèêëïîôœ'),
('German',N'ÉËÏÓÖÜéëïóöü'),
('Norwegian',N'ÅÆÂÉÈÊØÓÒÔåæâéèêøóòô'),('Italian',N'ÀÈÉÌÒÓÙàèéìòóù'),('Chech',N'ÚŮÝŽÁČĎÉĚÍŇÓŘŠŤúůýžáčďéěíňóřšť'),('Polish',N'ĄĆĘŁŃÓŚŹŻąćęłńóśźż'),('Belarusian',N'АаБбВвГ㥴ДдЖжЕеЁёЖжЗзЬьІіЙйКкЛлМмНнОоПпРрСсТтУуЎўФфХхЦцЧчШшЫыЭэЮюЯя');
GO
UPDATE tbl_Test_Unicode
SET [NonUnicode] = [Unicode];
GO
SELECT *,
    CASE [NonUnicode] WHEN [Unicode] THEN 'Y' ELSE 'N' END "Are they equal?"
FROM tbl_Test_Unicode;
GO
DROP TABLE tbl_Test_Unicode;
GO

Here is a result:

The secret is that default SQL Server Collation and code page already contain all necessary characters. You can use following script to prove it:
DECLARE @b TINYINT = 0, @String VARCHAR(300) = '';
WHILE @b < 255 SELECT @b += 1, @String += CASE @b % 32 WHEN 0 THEN CHAR(10) ELSE '' END + CHAR(@b);
PRINT @String;

As you can see Spanish, French, German, Norwegian, Italian are NOT REQUIRE Unicode!
Discrimination starts with Eastern Europe languages, but they are translated pretty well and still readable.
Problem only starts with Cyrillic, Arabic, Chinese, etc.
Their symbols are replaced by question marks, but 99% of people in your company can't read it anyway, right?

Is that acceptable? Can you live with that?

If answer is "Yes" - you are lucky: can use Non-Unicode and save a lot of disk space!


No comments:

Post a Comment