The purpose of this blog is to raise awareness of how database-centric software development is essential to the success of enterprise-level system projects, especially when it comes to big data. I will focus mainly on on-premises solutions, as Microsoft Azure still imposes certain limitations and constraints on massive SQL Server-based database solutions at the time of writing.
The role of the database is often downplayed in software development, even though it is almost invariably the single most important aspect of information system performance and scalability. SQL Server databases are complex entities and involve a wide range of scalability, performance and availability skills. It is very rare for a single database specialist to have all these skills. Instead, in large-scale database-centric software development projects, it is quite common for there to be several database experts, each specialising in a particular area:
- Data architect (server infrastructure, virtualization, data architecture, high availability, DR design, users and security)
- Data Modeller (high-level conceptual models and database modelling, indexing)
- Database programmer (database queries and optimisation, such as indexes and configurations)
- Data engineer (database integrations, ETL and ELT)
- DBA (database routines and maintenance)
As a SQL Server database grows to tens of millions of rows in its largest user tables, and often to tens of billions of rows or more, it no longer scales using traditional methods. At this point we can already talk about big data in SQL Server: traditional methods are no longer sufficient. The symptoms are tangible business harm and risk: repeated sluggishness and unnecessary downtime in real-time operational OLTP instances, sometimes even total system crashes, and excessive load times and report slowdowns in data warehouses. It is therefore of paramount importance to identify the areas where a database solution can be systematically scaled hundreds or even thousands of times further than a basic database implementation. The scaling of relational databases divides into a number of areas. The most important of these, in addition to best-practice data architecture, are:
- Good database design
- Indexing and statistics
- Partitioning
- Materialised views and stored procedures
- Optimisation of database queries
- Database configurations
- Replication of databases
- Vertical scaling (“iron to the border”)
- In-Memory OLTP
- Caching
- Sharding (cumbersome, not native in SQL Server)
- Capacity-based planning and forecasting
Each of these areas would warrant its own blog, but let’s start with a conceptual description of these areas and a few key practical points.
Good database design
A basic principle of good database design is to normalise the database according to its intended use. For example, an OLTP database should always be designed to at least the third normal form (3NF), so that it handles a large number of point read and write operations as cleanly as possible. In data warehouse databases, by contrast, it is typical to optimise the schema to absorb large batch loads and to report over massive amounts of data efficiently, which in turn calls for denormalising the schema (towards 2NF). Read more on this topic in my quite comprehensive blog series on Data Modeling – Part 1: Intro – DB Pro Services
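To make the contrast concrete, here is a minimal sketch. All table and column names are hypothetical, invented for illustration only:

```sql
-- Hypothetical OLTP schema in 3NF: customer attributes live in one place only.
CREATE TABLE dbo.Customer (
    CustomerId   int IDENTITY PRIMARY KEY,
    CustomerName nvarchar(200) NOT NULL,
    CountryCode  char(2)       NOT NULL
);

CREATE TABLE dbo.SalesOrder (
    OrderId     int IDENTITY PRIMARY KEY,
    CustomerId  int  NOT NULL REFERENCES dbo.Customer (CustomerId),
    OrderDate   date NOT NULL,
    TotalAmount decimal(18,2) NOT NULL
);

-- A denormalized DW variant repeats customer attributes on the fact row,
-- trading storage and update cleanliness for fewer joins in big reporting queries.
CREATE TABLE dw.FactSales (
    OrderId      int   NOT NULL,
    OrderDate    date  NOT NULL,
    CustomerName nvarchar(200) NOT NULL,
    CountryCode  char(2)       NOT NULL,
    TotalAmount  decimal(18,2) NOT NULL
);
```

The OLTP pair keeps each fact in exactly one place, which is what makes high-volume point writes cheap; the DW table accepts redundancy to avoid join cost at reporting time.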
Indexing
A powerful database lives and breathes on its indexes, and indexing is an art in itself. Too many indexes and updates start to slow down; too few, or sub-optimal ones, and database queries start to slow down. Incorrect fill factors fragment indexes, eating into their efficiency, and outdated index statistics mislead the SQL Server query optimizer into producing suboptimal execution plans. A well-designed index, on the other hand, can easily speed up a slow database query by a factor of up to 1000, and a well-balanced index design keeps operational database tables performing well in all situations. Advanced index and statistics maintenance routines, such as the ISM component developed by DB Pro Services, ensure index and statistics performance even when there are more indexes than can be routinely updated and data volumes grow to terabytes or more. Efficient use of indexes also sometimes involves the Query Store and explicitly forcing an optimal query plan or index when the optimizer cannot find one. Sometimes it is also necessary to resort to the Legacy Cardinality Estimator. Naturally, indexing also increases disk space requirements, usually by several tens of percent.
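As a sketch of the ideas above, assuming the hypothetical dbo.SalesOrder table used throughout these examples, a covering nonclustered index can turn a table scan into a narrow seek:

```sql
-- Hypothetical slow query: scans dbo.SalesOrder for one customer's recent orders.
-- SELECT OrderDate, TotalAmount FROM dbo.SalesOrder
-- WHERE CustomerId = @CustomerId AND OrderDate >= @From;

-- A covering nonclustered index lets the query seek and read nothing else:
CREATE NONCLUSTERED INDEX IX_SalesOrder_Customer_Date
    ON dbo.SalesOrder (CustomerId, OrderDate)
    INCLUDE (TotalAmount)
    WITH (FILLFACTOR = 90);   -- leave free space on pages to limit page splits

-- Keep the optimizer honest with fresh statistics on large tables:
UPDATE STATISTICS dbo.SalesOrder IX_SalesOrder_Customer_Date WITH FULLSCAN;
```

The fill factor and FULLSCAN here are illustrative; the right values depend on write patterns and table size.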
Partitioning
The idea of partitioning is to split a SQL Server database table horizontally into several “sub-tables”, or partitions. The partitions are physical entities in their own right, but logically form a coherent whole that can be referenced just like any other database table. Partitioning is driven by a logical partition key, on the basis of which each partition, or “sub-table”, is formed. The partition key can refer to a date, for example, so that there are as many partitions as there are dates in the table. As a general rule of thumb, a database table of around 30 million rows or more should already be partitioned, because at that point its performance clearly starts to suffer.

The main benefit of partitioning is that table performance is maintained regardless of the number of rows, even with hundreds of billions of rows of data. Of course, the database query must then always filter on the partition key, so that it retrieves data directly from the relevant partitions instead of scanning the entire table. Another key benefit is the ability to delete an entire partition of, say, 100 million rows in milliseconds, rather than deleting one row at a time, which overloads the log file, can take hours or even days, and radically slows down the entire database. A third benefit is the so-called “partition switch”, which can accelerate, for example, recurring big data loads such as monthly ETL loads of the most recent three months into a DW database: the changed data is first loaded into a staging table with identical partitioning, and a quick partition switch then moves the staging partitions into the large history table. This operation usually completes in milliseconds, since each entire partition is “implanted” into the target table in a single metadata operation.
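A minimal sketch of these mechanics, with illustrative object names and boundary dates (a real design would also align indexes with the scheme and satisfy the constraints that SWITCH requires):

```sql
-- Monthly RANGE RIGHT partitioning on a date key.
CREATE PARTITION FUNCTION pfMonthly (date)
    AS RANGE RIGHT FOR VALUES ('2024-01-01', '2024-02-01', '2024-03-01');

CREATE PARTITION SCHEME psMonthly
    AS PARTITION pfMonthly ALL TO ([PRIMARY]);

CREATE TABLE dbo.FactSales (
    OrderDate   date NOT NULL,
    OrderId     int  NOT NULL,
    TotalAmount decimal(18,2) NOT NULL
) ON psMonthly (OrderDate);

-- Delete a whole partition as a metadata operation (SQL Server 2016+):
TRUNCATE TABLE dbo.FactSales WITH (PARTITIONS (2));

-- Or switch a freshly loaded, identically partitioned staging table in:
ALTER TABLE dbo.FactSales_Staging
    SWITCH PARTITION 3 TO dbo.FactSales PARTITION 3;
```

Queries that filter on OrderDate then touch only the matching partitions (partition elimination), which is what keeps performance flat as the table grows.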
Materialised views and stored procedures
Materialized views, called indexed views in SQL Server, offer several advantages that can significantly improve database performance and usability. Query execution is faster because an indexed view stores the precomputed result of its query, which cuts the time required to run complex or resource-intensive queries. Because the results are pre-calculated and stored, the system does not need to repeatedly fetch and combine data from the underlying tables, reducing the workload. Additional indexes can also be created on the view, further enhancing query efficiency. By reducing the computational load on the system, indexed views help the database handle more concurrent users and queries, improving scalability. Queries also become simpler, because an indexed view can encapsulate table joins and aggregation, and SQL Server’s query optimizer can make effective use of indexed views. The trade-offs are extra disk space and increased write latency, since data changes in the user tables must be propagated to the views.
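A sketch of an indexed view, again over the hypothetical dbo.SalesOrder table used in the earlier examples:

```sql
-- SCHEMABINDING and COUNT_BIG(*) are required for an indexed view.
CREATE VIEW dbo.vSalesByCustomer
WITH SCHEMABINDING
AS
SELECT CustomerId,
       SUM(TotalAmount) AS TotalSales,
       COUNT_BIG(*)     AS RowCnt
FROM dbo.SalesOrder
GROUP BY CustomerId;
GO

-- The unique clustered index is what materializes the view's result set.
CREATE UNIQUE CLUSTERED INDEX IXC_vSalesByCustomer
    ON dbo.vSalesByCustomer (CustomerId);
```

After the clustered index exists, aggregate queries against sales-per-customer can be answered from the stored result instead of re-aggregating the base table.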
Stored procedures are a great way to scale the performance of your SQL Server database. By encapsulating multiple T-SQL statements in the same batch within a procedure and sending simple parameters as part of a procedure call instead of long queries, there is much less network traffic between clients and the server. Further, SQL Server compiles and stores the stored procedure execution plan in memory on its first execution. Subsequent procedure calls reuse the plan, avoiding the overhead of query parsing and optimization. Stored procedures are also less prone to repeated recompilation than ad hoc queries, especially if parameters are used. By centralizing the logic in the database, stored procedures reduce the dependency on client application processing, allowing for better scalability in high concurrency OLTP systems. Stored procedures also reduce the computational load on application servers by performing heavy data processing in the database. Transactionality can also be conveniently encapsulated into the stored procedure logic, allowing the database to run as efficiently as possible.
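As a sketch of this encapsulation, assuming the same hypothetical dbo.SalesOrder table, a procedure can wrap the write and its transaction handling in a single parameterized call:

```sql
CREATE PROCEDURE dbo.usp_AddSalesOrder
    @CustomerId  int,
    @OrderDate   date,
    @TotalAmount decimal(18,2)
AS
BEGIN
    SET NOCOUNT ON;            -- suppress extra "rows affected" round trips
    BEGIN TRY
        BEGIN TRANSACTION;     -- keep the transaction as short as possible
        INSERT INTO dbo.SalesOrder (CustomerId, OrderDate, TotalAmount)
        VALUES (@CustomerId, @OrderDate, @TotalAmount);
        COMMIT;
    END TRY
    BEGIN CATCH
        IF XACT_STATE() <> 0 ROLLBACK;
        THROW;                 -- re-raise the error to the caller
    END CATCH;
END;
```

The client sends only the procedure name and three parameters over the wire, and the plan compiled on first execution is reused on subsequent calls.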
Optimisation of database queries
Optimizing database queries can achieve performance gains almost as significant as indexing itself. For example, a poorly written query may fail to use the indexes created on a table, causing the execution plan to collapse. When optimizing, you must understand well the interplay of indexes and statistics, and the performance impact of various in-memory and persisted table structures, cursors, operators, joins, functions, and so on. The SQL Server version also plays a big role here. You can pick up the basics of query optimisation in a few days, but this is an area that takes years to master. For a good example of database optimization, see our product page on using composite indexes for complex queries: SQL Server performance insights – OR vs UNION ALL is not always what it seems
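One classic rewrite of this kind, sketched against the hypothetical dbo.SalesOrder table from the earlier examples (the linked page above covers the nuances in depth):

```sql
-- An OR across two different columns often defeats index seeks:
SELECT OrderId
FROM dbo.SalesOrder
WHERE CustomerId = @CustomerId OR OrderDate = @OrderDate;

-- Rewritten as UNION ALL, each branch can seek its own index;
-- the second branch excludes rows already returned by the first:
SELECT OrderId
FROM dbo.SalesOrder
WHERE CustomerId = @CustomerId
UNION ALL
SELECT OrderId
FROM dbo.SalesOrder
WHERE OrderDate = @OrderDate AND CustomerId <> @CustomerId;
```

This assumes suitable indexes on CustomerId and OrderDate and non-nullable columns; as the linked article's title warns, the rewrite is not always a win, so verify against the actual execution plan.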
Database configurations
Operating system-level settings such as the power plan, instance-level configurations such as max degree of parallelism and cost threshold for parallelism, and database-level settings such as the tempdb data file configuration, user database filegroup and data file layout, and options such as statistics update rules can, correctly configured together, easily improve the performance of an OLTP database server by several tens of percent, sometimes even more.
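For instance-level settings, the mechanics look like the following sketch. The values shown are purely illustrative; the right numbers depend on the hardware and workload:

```sql
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Example only: MAXDOP is often bounded by cores per NUMA node,
-- and the cost threshold is commonly raised well above its default of 5.
EXEC sp_configure 'max degree of parallelism', 8;
EXEC sp_configure 'cost threshold for parallelism', 50;
RECONFIGURE;
```

Changes like these should be made one at a time and validated against the workload's baseline, not applied as a blanket recipe.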
Replication of databases
Database replication can be implemented in SQL Server, within certain limits, with a log shipping mechanism, and in a more flexible and real-time manner with Always On Availability Groups, a database-level HA solution that goes beyond traditional SQL Server clustering. The idea of replication is to scale out the operational OLTP workload by reading data from a readable replica placed on a different server. In this way, for example, ad hoc reporting loads do not burden the operational server at all. Further, data engineers can load new data generations from the operational databases in batches to a separate data warehouse server, a so-called reporting database or data mart, thus optimising reporting for both the dataset and the load.
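With an Always On Availability Group, the scale-out is driven by read-only routing. The following is a partial sketch with invented AG and replica names; a complete setup also requires READ_ONLY_ROUTING_URLs on each replica and a listener:

```sql
-- Allow read-intent connections on the secondary:
ALTER AVAILABILITY GROUP [AG1]
MODIFY REPLICA ON N'SQLNODE2' WITH
    (SECONDARY_ROLE (ALLOW_CONNECTIONS = READ_ONLY));

-- Route read-intent connections away from the primary:
ALTER AVAILABILITY GROUP [AG1]
MODIFY REPLICA ON N'SQLNODE1' WITH
    (PRIMARY_ROLE (READ_ONLY_ROUTING_LIST = (N'SQLNODE2', N'SQLNODE1')));

-- Clients then add ApplicationIntent=ReadOnly to their connection string, e.g.:
--   Server=AG1-Listener;Database=Sales;ApplicationIntent=ReadOnly
```

Reporting connections marked ReadOnly land on the secondary, leaving the primary free for OLTP traffic.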
Read more about log shipping here: What to do with SQL Server Log Shipping – DB Pro Services.
Vertical scaling (“iron to the border”)
The idea behind vertical scaling is simple: if the database server runs out of memory, add more memory. If you run out of CPU power, add more CPU. If the storage system struggles, deploy a faster tier or dedicated storage. There is a problem with this, though: you are treating the symptom, not the disease, which often leads to bloated, out-of-control TCO and much steeper SQL Server licensing costs. SQL Server uses a lot of resources for a reason, and if telemetry deviates from the established baseline, it is reasonable to assume that something inside SQL Server is not working efficiently. The culprit is often a setting, an index, a database query, or something similar. It is much more effective to treat the root cause than the symptom, and often an experienced DBA is sufficient for this cure. Sometimes, however, it is appropriate to increase physical capacity. This is perfectly normal, but it should not be done blindly, only after a thorough capacity and performance diagnostic.
In-Memory OLTP
Database content can also be pushed into memory: In-Memory OLTP is a SQL Server feature designed to significantly improve the performance of transactional workloads by leveraging memory-optimized data structures and optimized processing techniques. Introduced in SQL Server 2014 and enhanced in later versions, this technology allows selected tables and stored procedures to reside and execute entirely in memory.
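A minimal sketch of the feature, with invented object names (the database must first have a MEMORY_OPTIMIZED_DATA filegroup, not shown here):

```sql
-- A memory-optimized table with a hash index sized for the expected row count:
CREATE TABLE dbo.SessionState (
    SessionId uniqueidentifier NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1000000),
    Payload   varbinary(8000)  NULL,
    UpdatedAt datetime2        NOT NULL
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);
GO

-- A natively compiled procedure executes as machine code against the table:
CREATE PROCEDURE dbo.usp_TouchSession @SessionId uniqueidentifier
WITH NATIVE_COMPILATION, SCHEMABINDING
AS
BEGIN ATOMIC WITH (TRANSACTION ISOLATION LEVEL = SNAPSHOT, LANGUAGE = N'us_english')
    UPDATE dbo.SessionState
    SET UpdatedAt = SYSUTCDATETIME()
    WHERE SessionId = @SessionId;
END;
```

SCHEMA_AND_DATA keeps the data durable across restarts; latch- and lock-free access plus native compilation is where the transactional speed-up comes from.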
Caching
Storage system disk space can be divided into hot and cold data, so that disk system performance is scaled exactly where it is essential. Storage caching in relational databases therefore refers to keeping frequently used data in faster storage layers, such as memory, to reduce latency and improve database performance. By using storage caching, databases can minimise the need to repeatedly retrieve data from slower disk-based storage, significantly improving read and write performance.
Sharding
Relational database sharding is a database architecture technique that partitions a database horizontally, dividing the data across multiple databases or servers called shards. Each shard contains a portion of the total data, identified by a shard key, which allows the system to scale by spreading the load across multiple servers, improving performance, scalability and fault tolerance.
Note! SQL Server does not support sharding as a native, built-in feature, but it can still be implemented using a combination of application-level logic, database design techniques and third-party tools or frameworks.
SQL Server capacity planning and forecasting
If you want your SQL Server database platform to be TCO-optimal, and to avoid capacity bottlenecks and SLA-crushing unavailability events, it is important to continuously measure and predict the performance baseline of your production environment and its changes. An excellent tool for this is SQL Governor software, which allows you to take control of performance optimization and capacity optimization across your entire data platform, regardless of the size of your SQL Server data platform.
Interested? Please contact us!

Jani K. Savolainen
jani.savolainen@dbproservices.fi
Tel. 0440353637
VP & Chairman
DB Pro Services Oy