Saturday, January 30, 2016

Index Rebuild vs Reorganize

Have you ever asked yourself which ALTER INDEX operation to use: Rebuild or Reorganize?

And you probably know the answer from MSDN: https://msdn.microsoft.com/en-us/library/ms189858.aspx

It states, roughly: reorganize when average fragmentation is between 5% and 30%, and rebuild when it exceeds 30%.

How many times have you followed that rule?
How many times have you answered that question in a job interview?

And now ask yourself: Were you right then? Have you ever tested that approach?

My answers to these two questions are: "No" and "Yes".

It so happened that two years ago I serviced a client with a very busy database, which suffered from high index fragmentation.
I tried to use the approach from MSDN to address fragmentation by Rebuild or Reorganize.
Very quickly I noticed that "Reorganize" ran too long in their environment, while Index Rebuild worked much faster even for tables with fragmentation as low as 10%-15%.

At that time I had neither the time nor the desire to research the case deeply, but a heavy doubt had shaken my belief in MSDN's re-indexing rule.

Recently I hit the same issue again and finally decided to do the research.

First, I created a test table with a clustered index and a 70% fill factor in my test database. The CHAR(68) column makes each row roughly 80 bytes including row overhead, so about 100 rows fit on an 8 KB page at 100% fill, or about 70 rows at 70%:
USE TestDB;
GO
IF EXISTS ( SELECT * FROM sys.tables WHERE name = 'tbl_Test_Index_Rebuild')
    DROP TABLE tbl_Test_Index_Rebuild;
GO
CREATE TABLE tbl_Test_Index_Rebuild(
    ID INT,
    A CHAR(68)
);
GO
CREATE CLUSTERED INDEX CIX_tbl_Test_Index_Rebuild ON tbl_Test_Index_Rebuild(ID) WITH (FILLFACTOR=70);
GO

Then I started building the test cases:

Case #1.

USE TestDB;
GO
TRUNCATE TABLE tbl_Test_Index_Rebuild
GO
;WITH
Pass0 AS (SELECT 1 AS C UNION all SELECT 1), --2 rows
Pass1 AS (SELECT 1 AS C FROM Pass0 AS A, Pass0 AS B),--4 rows
Pass2 AS (SELECT 1 AS C FROM Pass1 AS A, Pass1 AS B),--16 rows
Pass3 AS (SELECT 1 AS C FROM Pass2 AS A, Pass2 AS B),--256 rows
Pass4 AS (SELECT 1 AS C FROM Pass3 AS A, Pass3 AS B),--65536 rows
Pass5 AS (SELECT 1 AS C FROM Pass4 AS A, Pass4 AS B) --4294967296 rows
INSERT INTO tbl_Test_Index_Rebuild(ID, A)
SELECT TOP 700000 ROW_NUMBER() OVER( ORDER BY C), 'A' FROM Pass5;
GO
ALTER INDEX CIX_tbl_Test_Index_Rebuild ON tbl_Test_Index_Rebuild REBUILD WITH (MAXDOP=1);
GO
-- Insert 30 duplicate keys at the head of each of the first 300 pages (IDs 1, 71, 141, ...).
-- Each page holds 70 rows, so 30 extra rows overflow the page and force a split.
INSERT INTO tbl_Test_Index_Rebuild(ID, A) SELECT Fragment+1, 'A' FROM (SELECT TOP 30 0 AS MsgId FROM sys.messages) AS Msg,
(SELECT TOP 300 (ROW_NUMBER() OVER( ORDER BY message_id)-1)*70 AS Fragment FROM sys.messages) AS Fragments
OPTION (MAXDOP 1);
GO
SELECT CASE WHEN i.fill_factor = 0 OR (ps.index_level > 0 and i.is_padded = 0)
        THEN 100 ELSE i.fill_factor END AS fill_factor,
        ROUND(ps.avg_fragmentation_in_percent,3) AS [AVG Frgmnt %],
        ROUND(ps.avg_page_space_used_in_percent,3) AS [AVG Space USE %],
        ps.fragment_count,
        ROUND(ps.avg_fragment_size_in_pages,3) AS [AVG Frgmnt Size],
        ps.page_count, ps.record_count,
        (ps.record_count / ps.page_count) AS AVG_Records_per_Page -- integer division, rounds down
    FROM sys.indexes i
    CROSS APPLY sys.dm_db_index_physical_stats(DB_ID(), i.object_id, i.index_id, NULL, 'DETAILED') ps
    WHERE i.name = 'CIX_tbl_Test_Index_Rebuild' and ps.index_level = 0;
GO

Here is the returned result:
You can see that the table has 709000 rows on 10300 pages with only 5.835% fragmentation.

Let's look deeper into what I just did:
First, I inserted 700000 records into the table. That created 7K pages, filled to 100% of their capacity, and it looked like this:

The next step was an Index Rebuild to re-apply the 70% fill factor; the number of pages increased to 10K, each filled to 70% of capacity. It looked like this:

In the next step I inserted 30 records into each of the first 300 pages. That caused page splits on those 300 pages, which produced 600 fragments. Now, if you divide 600 by the total number of pages, 10300, you get almost exactly the average fragmentation I measured: 5.825% vs. the reported 5.835% (the actual number is slightly different because more variables go into the calculation). As a result, the pages looked like this:

(the bluish ones are the 600 fragmented pages, each filled to only 50%)

The last query just produced the fragmentation report. Note the 'DETAILED' scan mode: with the default 'LIMITED' mode, avg_page_space_used_in_percent comes back as NULL.

Now we are ready to start our defragmentation procedure.
Before altering the index, I started SQL Server Profiler to capture the results (the SQL:BatchCompleted event with the CPU, Reads, Writes and Duration columns is enough for this comparison):


ALTER INDEX CIX_tbl_Test_Index_Rebuild ON tbl_Test_Index_Rebuild REORGANIZE;

After the index was reorganized, I ran the original script again to reproduce the 5.835% fragmentation and then ran an Index Rebuild:
ALTER INDEX CIX_tbl_Test_Index_Rebuild ON tbl_Test_Index_Rebuild REBUILD;

Here are my results from Profiler:

Amazingly, Index Rebuild used 8 times fewer reads than Reorganize, 3 times fewer writes, and finished 4 times faster!

Did you expect that? I did not.

I could not believe my eyes. I started hunting for fragmentation values at which Reorganize would outperform Index Rebuild.

Case #2

I started this case by filling 10K pages to 70% with data. Then I topped up all pages to almost 100% of their capacity, so they looked like this:

Then I added a couple of records to each of the first 300 pages to split them and produce fragmentation like this:
As a result I got the following:

960600 records (700000 + 26×10000 + 2×300) on 10300 pages with the same fragmentation of 5.835%.

I used the following script to produce that picture (on top of the initial 700000-row load and rebuild from Case #1):
-- Top up each of the 10000 pages with 26 extra rows (from 70 to 96 rows per page).
INSERT INTO tbl_Test_Index_REBUILD(ID, A)
SELECT Fragment+1, 'A' FROM (SELECT TOP 26 0 AS MsgId FROM sys.messages) AS Msg,
(SELECT TOP 10000 (ROW_NUMBER() OVER( ORDER BY message_id)-1)*70 AS Fragment FROM sys.messages) AS Fragments
OPTION (MAXDOP 1);
GO
-- Add 2 more rows to each of the first 300 now-almost-full pages to split them.
INSERT INTO tbl_Test_Index_REBUILD(ID, A)
SELECT Fragment+1, 'A' FROM (SELECT TOP 2 0 AS MsgId FROM sys.messages) AS Msg,
(SELECT TOP 300 (ROW_NUMBER() OVER( ORDER BY message_id)-1)*70 AS Fragment FROM sys.messages) AS Fragments
OPTION (MAXDOP 1);
GO

Please note: the fragmentation is the same as in the first case, but the average used space on each page is completely different. That, not the fragmentation percentage, is what makes the real difference.

When I ran Reorganize and Rebuild again, I got the following results in Profiler:


Index Rebuild used 6.5 times fewer reads, ~30% fewer writes, and finished about 30% faster.
Not as impressive as in the first case, but "Rebuild" still outperformed "Reorganize".

Case #3

In this case I just increased fragmentation by splitting 1000 pages (a sketch of the script is below):
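The post does not show the script for this step. A minimal sketch, assuming the same pattern as in Case #1 (30 extra rows into each of the first 1000 pages of the freshly rebuilt, 70%-full table), would be:

-- Split the first 1000 pages instead of 300.
INSERT INTO tbl_Test_Index_Rebuild(ID, A)
SELECT Fragment+1, 'A' FROM (SELECT TOP 30 0 AS MsgId FROM sys.messages) AS Msg,
(SELECT TOP 1000 (ROW_NUMBER() OVER( ORDER BY message_id)-1)*70 AS Fragment FROM sys.messages) AS Fragments
OPTION (MAXDOP 1);
GO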
As a result: 730K records on 11K pages with 18.127% fragmentation (1000 splits produce 2000 fragments, and 2000/11000 ≈ 18.2%).

After Reorganize and Rebuild I got this:


I am not sure the timing was captured correctly, but you can still see the significant difference in the number of reads. "Rebuild" is better again.

Case #4 (extreme)

In this case I increased the capacity of all pages up to 98% as in Case #2 and then split 1700 pages (a sketch is below):
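Again, the script is not shown in the post. Under the same assumptions, it would look like Case #2 with a larger split range:

-- Add 2 more rows to each of the first 1700 almost-full pages to split them.
INSERT INTO tbl_Test_Index_REBUILD(ID, A)
SELECT Fragment+1, 'A' FROM (SELECT TOP 2 0 AS MsgId FROM sys.messages) AS Msg,
(SELECT TOP 1700 (ROW_NUMBER() OVER( ORDER BY message_id)-1)*70 AS Fragment FROM sys.messages) AS Fragments
OPTION (MAXDOP 1);
GO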
As a result I got 963400 records on 11.7K pages with a fragmentation of 29% (almost at the 30% limit from MSDN's guideline).

And here is what I've got:

Rebuild is faster and uses fewer reads and writes. No surprise.

Do I have to test anymore?
Why not?

Case #5

In this case I wanted to see how "Rebuild" and "Reorganize" impact other transactions on a busy server.

For this test I inserted only 70K records into the table and split only 30 pages (10 times less than in the first case; a sketch of the setup is below).
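The exact setup script is not shown; presumably it is the Case #1 scripts scaled down by a factor of ten, something like this:

USE TestDB;
GO
TRUNCATE TABLE tbl_Test_Index_Rebuild;
GO
;WITH
Pass0 AS (SELECT 1 AS C UNION ALL SELECT 1),
Pass1 AS (SELECT 1 AS C FROM Pass0 AS A, Pass0 AS B),
Pass2 AS (SELECT 1 AS C FROM Pass1 AS A, Pass1 AS B),
Pass3 AS (SELECT 1 AS C FROM Pass2 AS A, Pass2 AS B),
Pass4 AS (SELECT 1 AS C FROM Pass3 AS A, Pass3 AS B),
Pass5 AS (SELECT 1 AS C FROM Pass4 AS A, Pass4 AS B)
INSERT INTO tbl_Test_Index_Rebuild(ID, A)
SELECT TOP 70000 ROW_NUMBER() OVER( ORDER BY C), 'A' FROM Pass5;
GO
ALTER INDEX CIX_tbl_Test_Index_Rebuild ON tbl_Test_Index_Rebuild REBUILD WITH (MAXDOP=1);
GO
-- Split the first 30 pages, 10 times fewer than in Case #1.
INSERT INTO tbl_Test_Index_Rebuild(ID, A)
SELECT Fragment+1, 'A' FROM (SELECT TOP 30 0 AS MsgId FROM sys.messages) AS Msg,
(SELECT TOP 30 (ROW_NUMBER() OVER( ORDER BY message_id)-1)*70 AS Fragment FROM sys.messages) AS Fragments
OPTION (MAXDOP 1);
GO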
Before running the ALTER INDEX command, I started the following script in another window to emulate heavy activity:
SET NOCOUNT ON

WHILE 1 > 0
BEGIN
    ;WITH a AS ( SELECT top 1 * FROM tbl_Test_Index_REBUILD
        ORDER BY NEWID()
    )
    UPDATE a SET ID = ABS(CAST(NEWID() AS VARBINARY) % 1000000)
    OPTION (MAXDOP 1)
END
Surprisingly, when I ran "INDEX REORGANIZE", I almost immediately got an error:

Msg 1205, Level 13, State 52, Line 68
Transaction (Process ID 52) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
Here is the Deadlock Graph captured in Profiler:

I tried some other scenarios and discovered that even though "Rebuild" causes blocking, it does not produce deadlocks, while "Reorganize" produced deadlocks only on a very small number of pages (about 1K).
When I tried bigger tables, I could not reproduce the deadlock scenario, but I do not think it is impossible; I just have not tried hard enough.

In conclusion:
"INDEX REORGANIZE" is generally slower than "INDEX REBUILD". It uses many more I/O operations and can produce deadlocks.

Do you still want to use "REORGANIZE" option in your code?

That is a pretty tough question.

If you take a look at the internal difference in how "REBUILD" and "REORGANIZE" work, you'll see the following:
- "INDEX REBUILD" creates a brand-new copy of the index in the database. Say your table is 2 TB in size: that means you need another 2 TB in your database to finish "INDEX REBUILD" successfully. In addition to that you need 2 TB in your log file, which means 4 TB of extra space. Do you have it?
- "INDEX REORGANIZE" works much more slowly, but in small transactions. You can run it periodically during your lowest DB-activity window and simply kill the process automatically after an hour or two. It is not a problem if the process does not finish completely; partial results are also beneficial. If the process becomes a deadlock victim, that is also not a problem, other than getting a deadlock alert in the middle of the night.

As a result of my research, from now on I will always use the "INDEX REBUILD" option, unless I am in a situation where "REBUILD" does not work; then "REORGANIZE" will take its place.
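For illustration only, a minimal sketch of that decision, assuming an edition that supports ONLINE rebuilds (the TRY/CATCH fallback is my addition, not from the post):

BEGIN TRY
    -- Prefer REBUILD; ONLINE = ON (Enterprise Edition) reduces blocking of concurrent transactions.
    ALTER INDEX CIX_tbl_Test_Index_Rebuild ON tbl_Test_Index_Rebuild
        REBUILD WITH (ONLINE = ON, MAXDOP = 1);
END TRY
BEGIN CATCH
    -- Fall back to REORGANIZE when REBUILD cannot run,
    -- e.g. not enough free space, or ONLINE is not supported.
    ALTER INDEX CIX_tbl_Test_Index_Rebuild ON tbl_Test_Index_Rebuild REORGANIZE;
END CATCH;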



Thursday, January 28, 2016

Aggregation without "GROUP BY" using OVER clause.

Do you think it is not possible to aggregate values without using the "GROUP BY" clause?

I'll show you how to do it in SQL Server 2012/2014.

First, we will create a table for ten years of sales and fill it with values (the modulo of the NEWID() cast can be negative, so the dates scatter both before and after '2010-01-01'):

USE TestDB;
GO
CREATE TABLE tbl_No_Croup_By (
    Period_Dt DATE,
    Sales_Amount Money
);
GO
CREATE CLUSTERED INDEX CLIX_No_Croup_By ON tbl_No_Croup_By(Period_Dt);
GO
INSERT INTO tbl_No_Croup_By(Period_Dt, Sales_Amount)
SELECT DATEADD(day,CAST(NewID() AS VARBINARY(64)) % 1827,'2010-01-01')
    , CAST(ABS(CAST(NewID() AS VARBINARY(64)) % 100000)/100. AS MONEY)
FROM sys.messages;
GO
SELECT * FROM tbl_No_Croup_By;
GO

Now we have the following aggregation request:
Report all sales aggregated by year, with each year's percentage of the whole data set.

Not a big deal. Here is how it would look in the classic scenario:
;WITH AnnualSales AS (
    SELECT YEAR(Period_Dt) AS YearPeriod,
        SUM(Sales_Amount) AS AnnualSale
    FROM tbl_No_Croup_By
    GROUP BY YEAR(Period_Dt)
), TotalSales AS ( SELECT SUM(AnnualSale) AS TotalSale FROM AnnualSales)
SELECT a.YearPeriod, a.AnnualSale, t.TotalSale,
    CAST(ROUND(CAST(a.AnnualSale * 100. / t.TotalSale AS SMALLMONEY),2) AS VARCHAR) + '%' AS Year_Total
FROM AnnualSales AS a
INNER JOIN TotalSales AS t ON 1 = 1
ORDER BY a.YearPeriod;
Here is an approximate result:




And now we will do the magic:
SELECT DISTINCT YEAR(Period_Dt) AS YearPeriod
    , SUM(Sales_Amount) OVER(PARTITION BY YEAR(Period_Dt)) AS AnnualSale
    , SUM(Sales_Amount) OVER(PARTITION BY 1) AS TotalSale
    , CAST(ROUND(CAST(SUM(Sales_Amount) 
        OVER(PARTITION BY YEAR(Period_Dt)) * 100./ SUM(Sales_Amount) 
        OVER(PARTITION BY 1) AS SMALLMONEY),2) AS VARCHAR) + '%' AS Year_Total
FROM tbl_No_Croup_By
ORDER BY YEAR(Period_Dt);
GO

The result is exactly the same, but the code does not contain a single "GROUP BY" and is not so obvious.

Is it good?
Turn on statistics in your session by running the following two commands:
SET STATISTICS IO ON
SET STATISTICS TIME ON

Then run both scripts together.
If you look at the Messages tab after execution, you will see the following picture:


Wow!!!
"GROUP BY" uses 5 times less CPU power and almost 1000 times less IO system!!!

As a conclusion: you CAN use the "OVER" clause to do aggregation in the following three cases:
1. When the data set is extremely small and fits in just one 8 KB page;
2. When you want to hide your logic from any future developer, or even from yourself, to make debugging and troubleshooting a nightmare;
3. When you really want to kill your SQL Server and its underlying disk system.


Wednesday, January 27, 2016

Count all records in a table

This is a very easy and well-known topic, but I still have to cover it because there are a lot of people who do not really know how counting works.

Just recently I met a person who argued with me about the way to count records in a table.
He was irritated by my using "COUNT(*)" and suggested using "COUNT(1)" instead.

I tried to explain that the result would be the same either way, at the same cost.
Unfortunately, there was no time for an explanation, so I want to put it in this post for everybody.

First, we can run the following script against the AdventureWorks database (which you can download here):
SET STATISTICS IO ON
SET STATISTICS TIME ON
SELECT COUNT(*) FROM Person.Person;
SELECT COUNT(1) FROM Person.Person;
SET STATISTICS TIME OFF
SET STATISTICS IO OFF
If you run that script multiple times and look at the "Messages" tab, you'll notice that both statements execute equally fast and require absolutely the same amount of IO to count the records.


The idea that "COUNT(*)" is bad stems from the fact that in most cases "SELECT *" is bad, and you should avoid the asterisk when you can.
But then, what is the difference between "COUNT(*)" and "COUNT(1)"?

"*" Means that we "want to count all records in the table"
"1" Means that we "want to count all records in the table substituted by number 1"

You might be surprised, but all of the following statements produce equal results and spend the same amount of resources:
SELECT COUNT(*) FROM Person.Person;
SELECT COUNT(1) FROM Person.Person;
SELECT COUNT(1000000) FROM Person.Person;
SELECT COUNT('A') FROM Person.Person;
SELECT COUNT('Any Substitution String') FROM Person.Person;
If, before execution, you press "Ctrl-M" in SQL Server Management Studio to include the Actual Execution Plan, you can see the plans for all these queries:

All the query plans are identical, and each costs exactly the same: 20% of the batch.

Look at the most expensive part of every query: the Index Scan, which consumes 86% of the query cost.
It produces 19972 output rows, and all those rows have the same size of 9 bytes, even in the case of the crazy long string:



So, what difference does it make what we put inside the COUNT() function?

The difference starts when we count by a specific field, for example by First and Middle name:
SET STATISTICS IO ON
SELECT COUNT(FirstName) FROM Person.Person;
SELECT COUNT(MiddleName) FROM Person.Person;
SET STATISTICS IO OFF

The result will be a little surprising:





Why is the count different on the same table?
Because MiddleName has NULL values, which are ignored by the COUNT() function.




If we rewrite the second query like this:
SELECT COUNT(IsNull(MiddleName,'')) FROM Person.Person;
We will get the perfect result of 19972 records.

That means that counting by a column name should be used only when you want to count non-NULL values in that specific field and do not care about the overall number of records.

You can also use "DISTINCT" to count only unique values:
SELECT COUNT(DISTINCT FirstName) FROM Person.Person;
That query returns a count of only 1018 distinct first names.

"DISTINCT" is pretty expensive operation. Use it only when it is absolutely necessary.

At this point we have established that "COUNT(*)" and "COUNT(1)" are equivalent and produce the same result, but is there a faster way to count rows?
Of course there is!

We can use "sys.dm_db_partition_stats":
SET STATISTICS IO ON
SET STATISTICS TIME ON

SELECT COUNT(*) FROM Person.Person;

SELECT row_count FROM sys.dm_db_partition_stats
WHERE index_id < 2 and object_id = OBJECT_ID('Person.Person');


SET STATISTICS TIME OFF
SET STATISTICS IO OFF

You can see that the second query takes only 12 read operations to get the row count of the table.
For a small table the difference is not significant, but for tables with billions of rows you may prefer that method over anything else.


WARNING!
Use "sys.dm_db_partition_stats" only when you do not need the exact record count.
MSDN says it provides "The approximate number of rows in the partition."
That means for small tables you can safely use "COUNT(*)", while for large ones you can get an approximation very quickly.
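One caveat worth adding (my note, not from the post): the view returns one row per partition, so for a partitioned table you should sum the counts:

SELECT SUM(row_count) AS approx_row_count
FROM sys.dm_db_partition_stats
WHERE object_id = OBJECT_ID('Person.Person')
    AND index_id < 2; -- heap (0) or clustered index (1) only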