Wednesday, June 22, 2016

Basics of Clustered Columnstore Index in SQL Server 2016


In this post I want to go through the Basic operations with Clustered Columnstore Index (CCSI) and show their internal views:

Creation

Will create a test table with CCSI:

Use TestDB;
GO
DROP TABLE IF EXISTS tbl_Test_Columnstore;
GO
CREATE TABLE tbl_Test_Columnstore(  
     Row_Value INT NOT NULL,  
     Row_Description NVARCHAR (50),  
     Row_Type NVARCHAR(50),  
     Row_KEY int,  
     INDEX idx_Test_Columnstore CLUSTERED COLUMNSTORE  
);
GO

After table creation we immediately can see it's internals:
Script #1:
SELECT ix.name as Index_Name
 , ix.type_desc 
 , c.Name as Column_Name
 , ic.is_included_column
FROM sys.indexes as ix
INNER JOIN sys.index_columns as ic
ON ix.object_id = ic.object_id and ic.index_id = ix.index_id
INNER JOIN sys.columns as c
ON ix.object_id = c.object_id and ic.index_column_id = c.column_id
WHERE ix.object_id = OBJECT_ID( 'tbl_Test_Columnstore');
GO

Here we can see that ALL four columns of our table are "columnstored" and even though Columnstore Indexes (CSI) do not support "Included" columns, all of them marked as "included".

Along with CCSI itself, SQL Server creates "COLUMN_STORE_DELETE_BITMAP" for the new index:
Script #2:
SELECT ix.name as Index_Name
 , ip.internal_object_type_desc
 , ip.row_group_id
 , ip.rows
 , ip.data_compression_desc as Compression_Type
FROM sys.indexes as ix
INNER JOIN sys.internal_partitions as ip
ON ix.object_id = ip.object_id
WHERE ix.object_id = OBJECT_ID( 'tbl_Test_Columnstore');
GO

"COLUMN_STORE_DELETE_BITMAP" - tracks deleted rows, obviously, number it is still zero.

Inserting rows into CSI Delta Store

At first will try to insert only one row into the table
Script #3:
DECLARE @i INT = 10
WHILE @i > 0
BEGIN 
 INSERT INTO tbl_Test_Columnstore WITH (TABLOCK)
 SELECT @i, 'Row Description: ' + CAST(@i as varchar)
  , 'Row Type: ' + CAST(@i as varchar)
  , Row_KEY = @i % 10000000 ;
 SET @i -= 1;
END
GO
Then immediately run Script #2 again and get following result:

SQL Server created "Delta store" for our CCSI and inserted 10 new uncompressed rows.

Lets change variable @i to "100000" in Script #3 and run it again. Script #2 will show following:
As you can see all 100K rows also get into the Delta Store

If we try different method of inserting records we can get pretty interesting results:
Script #4:

DELETE FROM tbl_Test_Columnstore; 
GO
;WITH
  Pass0 as (select 1 as C union all select 1),
  Pass1 as (select 1 as C from Pass0 as A, Pass0 as B),
  Pass2 as (select 1 as C from Pass1 as A, Pass1 as B),
  Pass3 as (select 1 as C from Pass2 as A, Pass2 as B),
  Pass4 as (select 1 as C from Pass3 as A, Pass3 as B),
  Pass5 as (select 1 as C from Pass4 as A, Pass4 as B),
  DataSet as ( 
 SELECT top 100000
  F = CAST(FLOOR(RAND(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)))*2147483647) as INT)
 FROM Pass5)
INSERT INTO tbl_Test_Columnstore WITH (TABLOCK)
SELECT F, 'Row Description: ' + CAST(F as varchar), 'Row Type: ' + CAST(F as varchar), F % 10000000 
FROM DataSet;
GO

(Run script #2 right after)


SQL Server calculated an amount of data to insert and decided to use Parallelism for CSI insertion.
Parallel CSI insert used all four CPUs and loaded data equally into 4 Delta Store groups.

Inserting rows into Column Store using Bulk Insert


At first, clean CSI:

DELETE FROM tbl_Test_Columnstore; 
GO

Accordingly to MSDN "Columnstore Indexes Data Loading" document, in order to avoid placing your data into the Delta Store and move them directly to Compressed storage you have to keep your batch size equal or bigger than 102,400 rows.
Lets test that. I've modified script #4 to insert 102400 records into CSI:

DELETE FROM tbl_Test_Columnstore 
GO
;WITH
  Pass0 as (select 1 as C union all select 1),
  Pass1 as (select 1 as C from Pass0 as A, Pass0 as B),
  Pass2 as (select 1 as C from Pass1 as A, Pass1 as B),
  Pass3 as (select 1 as C from Pass2 as A, Pass2 as B),
  Pass4 as (select 1 as C from Pass3 as A, Pass3 as B),
  Pass5 as (select 1 as C from Pass4 as A, Pass4 as B),
  DataSet as ( 
 SELECT top 102400
  F = CAST(FLOOR(RAND(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)))*2147483647) as INT)
 FROM Pass5)
INSERT INTO tbl_Test_Columnstore WITH (TABLOCK)
SELECT F, 'Row Description: ' + CAST(F as varchar), 'Row Type: ' + CAST(F as varchar), F % 10000000 
FROM DataSet;
GO
Run Script #2 after that:

As you can see, SQL Server split all records across your CPUs and as a result you still have your records in the Delta Store.
There are 2 ways to overcome that situation:
1. Make batch size bigger than number of CPUs used by your SQL Server multiplied by 102,400. In my case that number is 4 x 102,400 = 409,600;
2. Lover degree of parallelism. Yes if you want to load data into smaller chunks the process will be slower.
So, lets test the second scenario. Delete all records again from "tbl_Test_Columnstore" table and run following modified script:
Script #5:
;WITH
  Pass0 as (select 1 as C union all select 1),
  Pass1 as (select 1 as C from Pass0 as A, Pass0 as B),
  Pass2 as (select 1 as C from Pass1 as A, Pass1 as B),
  Pass3 as (select 1 as C from Pass2 as A, Pass2 as B),
  Pass4 as (select 1 as C from Pass3 as A, Pass3 as B),
  Pass5 as (select 1 as C from Pass4 as A, Pass4 as B),
  DataSet as ( 
 SELECT top 102400
  F = CAST(FLOOR(RAND(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)))*2147483647) as INT)
 FROM Pass5)
INSERT INTO tbl_Test_Columnstore WITH (TABLOCK)
SELECT F, 'Row Description: ' + CAST(F as varchar), 'Row Type: ' + CAST(F as varchar), F % 10000000 
FROM DataSet
OPTION (MAXDOP 1);
GO

Now we expect to have some data in our Columnstore. So, we need a new script to report it:

Script #6:
SELECT ix.name as Index_Name
 , CASE WHEN gps.delta_store_hobt_id  Is Null 
  THEN 'Column Store' ELSE 'Delta Store' END Group_Type
 , gps.state_desc
 , gps.total_rows
 , gps.deleted_rows
 , gps.row_group_id
 , gps.size_in_bytes
FROM sys.indexes as ix
INNER JOIN sys.dm_db_column_store_row_group_physical_stats as gps
ON ix.object_id = gps.object_id and ix.index_id = gps.index_id
WHERE ix.object_id = OBJECT_ID( 'tbl_Test_Columnstore');
GO
Here is the result:

As you can see, this time all 102,400 rows went directly to the Columnstore. Also, you can notice that actual columnstore uses almost 10 times less space than the same data in Delta Store.

Deleting rows from CSI

Will use very simple script to delete rows:
DELETE TOP (1000) FROM tbl_Test_Columnstore;
GO
After running Script #6 we notice following:
All 1000 rows were deleted from the Columnstore!


Lets delete more and run script #6 again:
DELETE TOP (101400) FROM tbl_Test_Columnstore;
GO
We deleted 102,400 rows and all were deleted from the Columnstore repository, without touching the Delta Store. Notice that the size of CSI has not changed.

Continue deleting:
DELETE TOP (102400) FROM tbl_Test_Columnstore;
GO
Now you can see that Delta Store is significantly decreased in size.

Try to re-insert records using script #5 and see what will happen:

SQL Server Created New Columnstore row group to keep newly inserted records.

Updating data in CSI

Will do very simple update and see what happen:
UPDATE tbl_Test_Columnstore 
SET Row_Description = 'New ' + Row_Description;
GO
All rows were deleted and re-inserted into the Delta Store. I can guess it was single threaded operation.

Moving Data from Delta Store to Column Store

That is pretty simple "ALTER INDEX" operation:
ALTER INDEX idx_Test_Columnstore ON tbl_Test_Columnstore   
REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON);
GO
After data are moved from Delta Store it will look like this:

As you can see, data from Delta Store row group #3 were moved to Column Store row group #6 and row groups 3,4, & 5 were marked for system deletion by "TOMBSTONE".

Some time later, marked row groups disappear:

Merging two Column Stores

In order to generate new Column Store will run script #5 again:

In order to merge two small row groups into a larger one we can just use reorganize without any parameters:
ALTER INDEX idx_Test_Columnstore ON tbl_Test_Columnstore REORGANIZE;
GO
It merged two row groups and marked them for deletion.

Reporting CSI columns' space allocation

Script #7:
SELECT ix.name as Index_NAme
 , c.name as Column_Name
 , css.segment_id
 , css.row_count
 , css.on_disk_size
FROM sys.indexes as ix
INNER JOIN sys.index_columns as ic
ON ix.object_id = ic.object_id and ic.index_id = ix.index_id
INNER JOIN sys.columns as c
ON c.object_id = ix.object_id and ic.column_id = c.column_id
INNER JOIN sys.partitions as p 
ON p.object_id = ix.object_id and p.index_id = ix.index_id
INNER JOIN sys.column_store_segments as css
ON css.hobt_id = p.hobt_id and ic.column_id = css.column_id
WHERE ix.object_id = OBJECT_ID( 'tbl_Test_Columnstore');
GO
That script reported an allocation within the CSI by individual columns within individual row groups.
If you sum the size of all columns you'll get the same number as the one, given by script #6.


No comments:

Post a Comment