Estimating the storage savings by removing columns with NULL values across the table or marking them as SPARSE

In the previous article, Find columns with NULL values across the table, we discussed that storage space can be saved by removing columns that contain only NULL values across the table, or by marking them as SPARSE. We also learnt about the query to find all such columns across the tables of a database.

In this article we'll learn to estimate the storage savings from taking the necessary action on such columns, either by removing them or by marking them as SPARSE.

It is extremely important to be ready with the relevant data and stats whenever we propose anything. Similarly, when we have to approach senior leadership for approval to take any such action on a production database, we need data supporting our claim of storage savings.

I found the query below very useful. It gave me the table-wise data, which we then aggregated for the total storage savings. The query returns the following columns:

  • TableName: the name of the table.
  • TotalColumns: the count of columns in the table that contain only NULL values.
  • TotalRows: the row count of the table.
  • Estimated_Savings_Bytes: the estimated storage savings in bytes.

Note: You may find a table named tables_with_null_values_across referred to in the query. This is the same table that was created in the previous article; this article is a continuation of Find columns with NULL values across the table.
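For readers who have not gone through the previous article, the query below assumes tables_with_null_values_across has roughly the following shape. This is a sketch inferred from how the columns are used in the query, not the exact DDL from the earlier post:

CREATE TABLE dbo.tables_with_null_values_across
(
	TableName SYSNAME,	-- table name, possibly schema-qualified (hence PARSENAME in the query)
	ColumnName SYSNAME,	-- a column found to be NULL in every row of that table
	TotalRows BIGINT	-- the table's row count, captured when the scan was run
);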

SELECT DV.TableName
	, COUNT(DISTINCT DV.ColumnName) AS TotalColumns
	, DV.TotalRows
	-- Per-row storage cost of each all-NULL column, multiplied by the row count
	, SUM(DV.TotalRows *
		CASE
			WHEN COL.DATA_TYPE IN ('CHAR', 'NCHAR')
				THEN COL.CHARACTER_OCTET_LENGTH	-- declared length in bytes
			WHEN COL.DATA_TYPE = 'TINYINT'
				THEN 1
			WHEN COL.DATA_TYPE = 'SMALLINT'
				THEN 2
			WHEN COL.DATA_TYPE = 'INT'
				THEN 4
			WHEN COL.DATA_TYPE = 'BIGINT'
				THEN 8
			WHEN COL.DATA_TYPE IN ('NUMERIC', 'DECIMAL')
				THEN 9	-- approximation; actual size is 5 to 17 bytes depending on precision
			WHEN COL.DATA_TYPE = 'FLOAT'
				THEN 8	-- default FLOAT(53); FLOAT(1-24) uses 4 bytes
			WHEN COL.DATA_TYPE = 'DATE'
				THEN 3
			WHEN COL.DATA_TYPE = 'TIME'
				THEN 5	-- default TIME(7); lower precisions use 3 or 4 bytes
			WHEN COL.DATA_TYPE = 'SMALLDATETIME'
				THEN 4
			WHEN COL.DATA_TYPE = 'DATETIME'
				THEN 8
			WHEN COL.DATA_TYPE = 'BIT'
				THEN 1	-- approximation; up to 8 BIT columns actually share a single byte
			ELSE 2	-- variable-length types: a NULL still costs roughly 2 bytes in the row's offset array
		END) AS Estimated_Savings_Bytes
FROM tables_with_null_values_across DV WITH (NOLOCK)
	INNER JOIN INFORMATION_SCHEMA.COLUMNS COL WITH (NOLOCK)
		ON COL.TABLE_NAME = PARSENAME(DV.TableName, 1)
		AND COL.COLUMN_NAME = PARSENAME(DV.ColumnName, 1)
GROUP BY DV.TableName
	, DV.TotalRows
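To arrive at the single headline number for leadership, the per-table output above can be rolled up. Assuming the results of the query above have been captured into a temp table, here called #EstimatedSavings (a hypothetical name for this sketch):

SELECT SUM(Estimated_Savings_Bytes) AS Total_Savings_Bytes
	, SUM(Estimated_Savings_Bytes) / 1048576.0 AS Total_Savings_MB	-- 1 MB = 1,048,576 bytes
FROM #EstimatedSavings;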

7 thoughts on “Estimating the storage savings by removing columns with NULL values across the table or marking them as SPARSE”

  1. Why do you not simply add ROW or PAGE compression to the clustered index? This would “remove” the blocked space too. And of course you could use columnstore indexes as well (where NULL values don’t take much space), unless the table is updated very often.

    And with

    EXEC sys.sp_estimate_data_compression_savings @schema_name = 'dbo'
    , @object_name = 'commandlog'
    , @index_id = NULL
    , @partition_number = NULL
    , @data_compression = 'ROW'

    you can find out much faster (particularly for bigger tables, because it samples the data) how much you could save.

    The built-in compressions (PAGE compression includes ROW compression) have other benefits too, e.g. a 1 in a BIGINT column needs only 1 byte instead of 8 (since it is “compressed” down to TINYINT size), and ‘abcd’ in an NVARCHAR(50) column will use only 4 bytes instead of 8, because it does not contain any Unicode characters.
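    [Editor's note: for completeness, enabling ROW compression on the table from the example above is a single rebuild:

    ALTER TABLE dbo.commandlog REBUILD WITH (DATA_COMPRESSION = ROW);

    PAGE compression can be applied the same way, or per index via ALTER INDEX ... REBUILD WITH (DATA_COMPRESSION = PAGE).]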


    1. Valid suggestions, Thomas!

      We are also going with PAGE compression and have observed space reductions of up to 70%.

      However, this activity was important for gaining insight into the benefits of code review and data modelling. If code reviews are taken seriously and data modelling is done before creating physical tables, such scenarios can obviously be avoided.

      If a production database has thousands of unused columns containing only NULL values, then it is obviously a waste of resources, including development time and unnecessary overhead.


  2. PS: if you joined sys.columns instead of INFORMATION_SCHEMA.COLUMNS, you could replace your whole CASE with the max_length column.
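    [Editor's note: that rewrite could look something like this sketch. Note that sys.columns.max_length reports the storage size in bytes for fixed-length types, but the declared maximum for variable-length types (and -1 for MAX types), so it would overstate the saving for those columns:

    SELECT DV.TableName
    	, COUNT(DISTINCT DV.ColumnName) AS TotalColumns
    	, DV.TotalRows
    	, SUM(DV.TotalRows * C.max_length) AS Estimated_Savings_Bytes
    FROM tables_with_null_values_across DV
    	INNER JOIN sys.columns C
    		ON C.object_id = OBJECT_ID(DV.TableName)
    		AND C.name = PARSENAME(DV.ColumnName, 1)
    GROUP BY DV.TableName
    	, DV.TotalRows;]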

