Friday, September 11, 2015

How many Clustered Indexes you may have?


Have you ever got that question on a job interview?

You've been smart and answered: "- Only One!"
Have you?

Is that the right answer?

I like tricky questions and that one is one of them.
At first, you might have no clustered indexes at all and at second, question did not specify "a single table". That means you can have as many clustered indexes in your database as many tables you already have.
Even more! Do not forget about views, you can have clustered indexes on views as well.

So, the really true answer is "- As many as you want, but no more than one per table or view."


Why is that?
Let's look at the nature of Clustered Index. Just forget everything what you know about indexes for a second and just imagine Clustered Index as your table data set sorted in a certain order. Would say you sorted your data by Sale's Person ID and Sale Date and data lay in that order - that will be your Clustered Index. You physical can't have the same data set sorted in more than one way, otherwise it will be different data set.
That the reason why we can't have more than one Clustered index per Table or view.

Can we live without clustered index at all?
Sure we can. Table without clustered index is called Heap.

Do we always have to have Clustered Index?
Not at all. For huge tables with multiple indexes and a lot of lookup operations it might be beneficial is not having Clustered Index!

Look at two same size tables in the sample AdventureWorks2014 database:
- Production.ProductProductPhoto - 504 records without Clustered Index
- Production.Product - 504 records with Clustered Index
One of them has Clustered Index and another do not.

Run simple selects against these tables:

SET STATISTICS IO ON  
SELECT * FROM Production.ProductProductPhoto
WHERE ProductID = 1 AND ProductPhotoID = 1;
GO
SELECT * FROM Production.Product  
WHERE ProductNumber = 'AR-5381';  
SET STATISTICS IO OFF


They have almost identical execution plans with only difference that Heap using RID (Row Identifier) Lookup and Clustered table uses Key Lookup.
When we look in "Messages" tab we can see how many IO operations were used for each select:

(1 row(s) affected)
Table 'ProductProductPhoto'. Scan count 0, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
(1 row(s) affected)
Table 'Product'. Scan count 0, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Select against table with clustered index took 4 IO reads (2 reads for index + 2 reads for Clustered Index) and 3 IO reads for Heap (2 reads for index + ONLY 1 read for Heap!)
Those are extremely small tables and if you have 6-7 Clustered Index levels you will have all these extra reads vs only one read for Heap.

Did I just say Heap is better than Clustered Index? - Absolutely NOT!
I've said that you should not fallow "Must Have Clustered Index" strategy and be more flexible.

Heaps also have their own caveats. The biggest one is that if you need and have only one index on a table better to have it Clustered.

In this case you will avoid extra lookup operations and won't fail in a nightmare of RID Lookups.

In addition to this there is not very well known Heap issue as Extra Spacing.

Run a script, which creates a simple table with only one varchar column and inserts there 700 records:

USE TestDB;
GO
CREATE TABLE tbl_Test_Heap( TextField VARCHAR(1000) );
GO
;WITH a1 AS (SELECT 1 AS a UNION ALL SELECT 1),
        a2 AS (SELECT a FROM a1 UNION ALL SELECT a FROM a1),
        a3 AS (SELECT a FROM a2 UNION ALL SELECT a FROM a2)
INSERT INTO dbo.tbl_Test_Heap ( TextField )
SELECT TOP 7 REPLICATE('A',950) FROM a3;
GO 100
GO
SELECT used_page_count, row_count
FROM sys.dm_db_partition_stats st WHERE OBJECT_NAME(OBJECT_ID) = 'tbl_Test_Heap'; GO
As a result we have 101 pages of used space.

Now will add new Integer ID column, which is supposed to add some space to the table and build Clustered Index on it:

ALTER TABLE tbl_Test_Heap ADD ID INT IDENTITY(1,1);
GO
CREATE CLUSTERED INDEX CLIX_Test_Heap ON tbl_Test_Heap(ID);
GO
SELECT used_page_count, row_count
FROM sys.dm_db_partition_stats st WHERE OBJECT_NAME(OBJECT_ID) = 'tbl_Test_Heap' GO
DROP TABLE tbl_Test_Heap;
GO
As a result, size of our table reduced by 10% !
Tables with Clustered Indexes could be little bit smaller than Heaps.

Clustered Indexes have their own problems. Large, rapidly growing tables can fail into a fragmentation and page split issues unless the primary field of the index is growing DATETIME, IDENTITY or any other growing over the time value. At the same time that exact solution can generate page contention problem, which could lead to locks and even deadlocks.

Do you see how many different problems and questions bring Clustered Indexes?
Do you still want to have them for every table even if they will save space?

4 comments:

  1. This is just brilliant!
    Just wanted to add that, in some special cases the RID can be a lot cheaper than a clustered index; I would use heaps only when the row count is below 100, otherwise would just do as you showed in this clear example.

    I came across your blog redirected from a LinkedIn group and now i'm bookmarking it. Thanks

    ReplyDelete
  2. It is not a question of 100 records, it is how many fits on one or two pages. If you have less than three pages of data you do not need clustered index unless you are in Azure, which requires it.

    ReplyDelete
  3. Hi Slava - your example is "cheating" a benefit for a clustered index because you are not exceeding the full available space on a data page (8.060 Bytes).
    If I have 8,060 Bytes for each page available why do you add 7 records only to it per transaction?

    Replay your example as follows and compare it with the "benefits" of your clustered index result!

    -- Create your demo table
    CREATE TABLE tbl_Test_Heap( TextField VARCHAR(1000) );
    GO

    -- now i insert 8 records in ONE page
    ;WITH a1 AS (SELECT 1 AS a UNION ALL SELECT 1),
    a2 AS (SELECT a FROM a1 UNION ALL SELECT a FROM a1),
    a3 AS (SELECT a FROM a2 UNION ALL SELECT a FROM a2)
    INSERT INTO dbo.tbl_Test_Heap ( TextField )
    SELECT TOP 8 REPLICATE('A',950) FROM a3;
    GO 87
    GO

    -- and have to add 4 more records to get to 700 records
    INSERT INTO dbo.tbl_Test_Heap (TextField)
    VALUES
    ( REPLICATE('A', 950)),
    ( REPLICATE('A', 950)),
    ( REPLICATE('A', 950)),
    ( REPLICATE('A', 950))
    ;
    GO

    -- now compare the result (89 pages!)
    SELECT used_page_count, row_count
    FROM sys.dm_db_partition_stats st WHERE OBJECT_NAME(OBJECT_ID) = 'tbl_Test_Heap';
    GO

    A clustered index add new records by scanning the free space on the data page while a heap is only scanning the PFS which have only 4 degrees of "filling": 50%, 80%, 95% and 100%!

    Best from Germany, Uwe

    ReplyDelete
  4. Uwe, that was not cheating.
    try to insert records one by one and results will be even worse:
    INSERT INTO dbo.tbl_Test_Heap ( TextField )
    SELECT REPLICATE('A',950)
    GO 700
    Because of those 4 PFS degrees of "filling" heaps have that problem.
    You can try to avoid that inserting data by batches, but if you insert on one-by-one basis you hit that issue.

    ReplyDelete