This article is part of the "Does it matter?" series.
Data archival is often considered an infrastructure concern, associated only with cost and storage savings. It is frequently ignored at the time of data modelling, on the assumption that it has nothing to do with design and development.
An eye opener: the reality check
One part of the perception is correct: archival has a huge impact on storage and cost. However, the other part, that it has nothing to do with database design and development, needs introspection.
This article will talk about the various ways to benefit from a data archival strategy. We'll also discuss how, in particular, data archival helps with cost savings.
The storage and cost impact
Suppose we have a database which stores the past 10 years of data, but we hardly access anything older than the most recent 2 years. With 10 years of historical data the database size is around 1 TB, whereas with 2 years it's just 200 GB.
Now let's discuss the further impacts.
Almost all production databases have secondaries for high availability and disaster recovery. Additionally, every production database is part of a backup policy, for point-in-time recovery or some other disaster recovery strategy.
It’s just 1 TB. What’s the big deal?
Suppose there are 2 secondaries along with the primary; then 1 TB becomes 3 TB. On top of that, full, differential and log backups consume separate storage, and this is recurring.
If the backup strategy says a weekly full, daily differentials and log backups every 5 minutes:
- Full backups: 1 TB * 4 weeks = 4 TB in a month
- Daily differentials: hypothetically 5% of the total size, i.e. 50 GB daily * 30 days = approx. 1.5 TB
- Log backups every 5 minutes: hypothetically 0.005% of the total size, i.e. 50 MB per backup * 288 backups a day * 30 days = approx. 0.4 TB
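The arithmetic above can be sketched as a quick back-of-the-envelope T-SQL calculation (the 5% and 0.005% ratios are the hypothetical figures used above, not measurements):

```sql
-- Rough monthly backup footprint for a 1 TB database.
-- Differential (5%) and log (0.005%) ratios are hypothetical.
DECLARE @DbSizeGB decimal(12,2) = 1024;          -- 1 TB primary database

SELECT
    @DbSizeGB * 4                    AS FullBackupsGB,    -- 4 weekly fulls
    @DbSizeGB * 0.05 * 30            AS DifferentialsGB,  -- 5% daily x 30 days
    @DbSizeGB * 0.00005 * 288 * 30   AS LogBackupsGB;     -- 0.005% x 288/day x 30 days
-- ~4096 GB + ~1536 GB + ~442 GB: roughly 6 TB of backup storage per month
```

And remember, this is on top of the 3 TB already consumed by the primary and its two secondaries.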
We've discussed the storage aspect. Now you must have an idea of the amount of data that needs to be transferred over the wire as well.
We all know nothing in this world is free, and the network is no exception. It has an impact on both cost and performance: this traffic can cause network congestion, and other applications in your organization placed on the shared network will suffer performance issues.
Suppose one of your SQL Server tables has over a billion rows, pertaining to the past 10 years. These rows are spread across millions of data and index pages. If you archive the 8 years of historical data (which is never accessed), the number of pages shall be reduced by approx. 80%.

Read more here about pages and the SQL Server storage architecture.

Now what do you think: will your SELECT perform better scanning the full set of pages, or only the remaining 20%? It will clearly perform better with fewer pages, and will have lower physical and logical reads.

Having fewer pages will also surely help optimize operations such as index maintenance, UPDATE STATISTICS etc.
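As a rough way to see this on your own system, you can measure a table's page footprint with the `sys.dm_db_partition_stats` DMV (the table name `dbo.Trades` is just a placeholder):

```sql
-- Pages and rows currently consumed by a table, per index.
-- dbo.Trades is a placeholder; substitute your own table name.
SELECT  index_id,
        SUM(used_page_count) AS used_pages,
        SUM(row_count)       AS total_rows
FROM    sys.dm_db_partition_stats
WHERE   object_id = OBJECT_ID(N'dbo.Trades')
GROUP BY index_id;
```

Run it before and after archival and compare the `used_pages` figures to quantify the reduction.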
We already discussed the cost implications of archival. A smaller primary database means lower storage and replication cost.
But cost is not only about storage and network. CPU and RAM also come into the picture: more data means more RAM and more cores are required for computing.
Archiving the historical data can surely help minimize the memory and compute requirements. This in turn helps reduce the cost of physical memory and CPU.
Data archival is not a post-production job. Storage and access of archived data need to be well planned at the time of design and modelling.
We'll not have the same liberty in production that we have during development. At design time we can properly plan the criteria for data archival, such as whether it will be based on a date, a year or some other attribute.
For example, there could be a criterion of placing closed accounts in archive tables. In this case, the account status becomes the archival criterion.
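A minimal sketch of such a criterion-driven archival, assuming illustrative table and column names (`dbo.Accounts`, `dbo.Accounts_Archive`, `AccountStatus`) and that the archive table has the same columns as the source:

```sql
-- Move closed accounts into an archive table in one atomic statement.
-- DELETE ... OUTPUT deleted.* INTO removes the rows and inserts the
-- deleted copies in the same operation, so no row is lost in between.
DELETE FROM dbo.Accounts
OUTPUT  deleted.*
INTO    dbo.Accounts_Archive
WHERE   AccountStatus = 'CLOSED';   -- the archival criterion
```

Because the criterion was decided at design time, a job like this can run on a schedule without any application changes.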
Data archival doesn't mean forgetting the data
Data archival doesn't mean keeping the data separate and forgetting it. There needs to be a mechanism and strategy to access the archived data as and when needed.
You can do it manually, or plan an automated mechanism to access it. Web APIs, microservices and design patterns can definitely help build a solid solution.
But don't worry if you are not an application developer. You can leverage stored procedures, synonyms and linked server features to build data adapters that automatically read from both production and archived data, depending upon the input parameters.
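A minimal sketch of such a data adapter, assuming an archive database exposed through a linked server and illustrative names (`ArchiveServer`, `dbo.Trades`, a 2-year cutoff):

```sql
-- A synonym hides where the archived data physically lives
-- (here: behind a linked server named ArchiveServer).
CREATE SYNONYM dbo.Trades_Archive
    FOR ArchiveServer.ArchiveDB.dbo.Trades;
GO
CREATE PROCEDURE dbo.GetTrades
    @FromDate date,
    @ToDate   date
AS
BEGIN
    SET NOCOUNT ON;
    -- Rows newer than the archival cutoff are served from production,
    -- older rows through the synonym; callers never need to know.
    DECLARE @Cutoff date = DATEADD(YEAR, -2, GETDATE());

    SELECT TradeId, TradeDate, Amount
    FROM   dbo.Trades                         -- production (recent data)
    WHERE  TradeDate >= @Cutoff
      AND  TradeDate BETWEEN @FromDate AND @ToDate

    UNION ALL

    SELECT TradeId, TradeDate, Amount
    FROM   dbo.Trades_Archive                 -- archived history
    WHERE  TradeDate < @Cutoff
      AND  TradeDate BETWEEN @FromDate AND @ToDate;
END;
```

Callers simply pass a date range; the procedure decides which store (or both) to read, so archival stays invisible to the application.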
Defending a bad perception with good SQL Server features
A few of us will argue that SQL Server has features such as partitioning, indexing, compression etc., so why worry?
I have only one thing to say here: "you can't justify a bad perception with good features". Features are there to leverage, not to exploit.
Do you keep unused things in your cupboard or in your store room? A cupboard and a store room have different purposes, and so do OLTP and OLAP.
You may need the historical data for analytics, audits or other purposes. Plan it out and have a specific strategy and mechanism for it.
Out of scope
I've also come across requirements that don't allow data archival.
Having been in the asset management and investment banking domain for over a decade, I can recall the example of fund performance. In order to show returns, historical data such as trades, inflows/outflows, net asset values, stock prices, profit & loss etc. may be required.
But even in this case, decoupling can be implemented: the rarely-touched history is moved out of the OLTP database into a separate store that the application reads from when it needs it. This architecture helps achieve both the user requirement and archival. It may not help with cost savings, but it will surely deliver better performance and control over the data.
We discussed various ways to save cost and optimize database performance by archiving data. Archival should not be thought of as an infra business; storage and access of archived data should be well planned at the design and modelling phase.
We also discussed that decoupling can be leveraged even when archival is not allowed. It may not help save cost, but it will definitely improve database performance.