Published on May 26th, 2015 | by Warner Chaves1
Top 10 SQL Server Performance Killers
Discussing with my colleagues a couple of days ago we were going into some of the worst practices we had seen out in the field with our different consulting clients. What really struck me as a little scary is that many of our stories were terribly similar, even on clients with completely different hardware, different industry verticals, different levels of staff seniority, etc. Maybe it’s due to how easy Microsoft makes working with SQL Server, the still recent explosion of SQL Server as a fully mature Enterprise-level product, misguided advice on the internet or a combination of all of those.
The truth is that many of these performance killers are not hard to avoid. With the proper knowledge, most DBA’s can at least act proactively and get rid of potential slowness by looking out for the most common issues that have plagued other people working with the product. In the interest of giving our readers succinct, practical advice I filtered out the most common ones into this top 10 you have right here. Let’s get on with it, in no particular order:
1. You run an application that is mostly read activity (as most OLTP are) but don’t have enough indexes.
I had a case where a client had a large table with several hundred million records and every time a specific query would run it would paralize the entire system. This particular query had to run several times per day so this became a constant headache. We got the execution plan for this query and we immediately saw that it was being scanned because there wasn’t a proper index for it. When we suggested it, the client was very hesitant to add an index because the table was so large, worried about the performance degradation for the Insert, Delete, Update activity and the increase in disk space.
Truth is, if your system is already paralyzed by this situation, it’s very unlikely that adding a few milliseconds to Insert, Update, Delete operations is worse than letting a query run without a proper index against a large table. We showed them how their system was read-heavy and convinced them to add the missing index. Problem solved.
2. You’re still running a lot of reporting directly on the operational copy of your database.
So you want to run OLTP queries and also reporting queries on one system to save some $$ or some hardware but you end up with neither the OLTP nor the reporting working properly. By definition, reporting queries will usually touch many records and doing them in the same machine you end up backing yourself against these corners:
a) You lock records and this interrupts and slows down your OLTP style queries.
b) You end up putting NOLOCK hints everywhere and compromising the quality of your reporting.
c) You activate snapshot isolation and then your IO system wasn’t ready for the extra tempdb load.
OK, so your options: Availability Groups, Mirroring with Snapshots, Log Shipping or good old replication. Which one of these is the optimal solution depends on your budget and your reporting goals so make sure you look into what each one can do and what you might need to compromise for each.
3. Your VM hosts are way over committed.
If your VM cluster has become the SQL Server bucket where all workloads go then don’t be surprised when it doesn’t perform well. I had someone tell me that “SQL Server doesn’t virtualize well”, of course it doesn’t, neither does any resource intensive application when it’s running on over committed virtual resources. Here are the main guidelines I use:
a) I don’t like over committing memory on the VM host. If the host has 128GB of RAM then I want all the VMs memory to sum up to less than that amount, not more.
b) CPU is a more flexible resource, I still don’t recommend to overcommit it by over 20%.
c) A lot of places have a DBA, a VM admin and a storage admin. The VM admin checks memory and CPU pools but has a harder time figuring out if the IO limits are hit. So, make sure your storage admin is in the loop regarding your virtualization goals as well.
4. Your IO sucks.
This one is pretty self-explanatory. I’ve seen environments where adding an all-flash array or a PCI Fusion IO card was like pressing the TURBO button. Suddenly the issues were about contention on spinlocks to squeeze the most out of the CPUs instead of the same old PAGEIOLATCH from the slower storage. Create a Performance Monitor collection and get your Average Secs per Read and Average Secs per Write baselined. If your latency on Read or Write is consistently over 15-20ms then your IO is slow. No ifs, no buts, doesn’t matter if the storage admin can’t see anything on the SAN. This will be your biggest bottleneck until you take care of it.
5. You have massive queries that are hard to optimize.
Writing query optimizers is really hard. One of the reasons why all these NoSQL platforms have very limited query surfaces (no joins, no aggregations, etc), is because writing an optimizer is HARD WORK and it has to perform well most of the time in order to be taken seriously. Now as good as the SQL Server optimizer is, you as a DBA or developer can still help it out a lot of times. This is the kind of stuff that I still see a lot:
a) Endless nested views (a view calls a view that calls a view that calls a view…).
b) A lot of business logic and manipulation coded as once massive SQL statement. It’s easier to the optimizer if we feed it questions in more manageable chunks, don’t be afraid to break down a query and throw some intermediate results on a temp table if necessary.
c) Queries that want to do everything and end up under performing for all cases. Again, it’s OK to throw this inside a stored procedure and write different SQL statements for different branches of the procedure. With statement level compilation this is not a bad thing anymore.
6. Tempdb is not properly configured.
This one should be part of any DBA’s installation checklist.
a) Add more tempdb files than the default 1. I like to use a post-install rule of 0.25 * logical processors up to a max of 8. Adjust up as needed if you see contention once you have an actual workload running.
b) Pre-size them all and set them all to the same size and autogrowth.
c) Use these trace flags: 1117, 1118 for best performance. Details on my recommended trace flags post.
7. Instant file initialization is not set and your autogrowth settings are still at the default.
Another one that should be part of the installation checklist (or part of the installer, period). Instant file initialization allows SQL Server to grow the data files without having to zero them out, making file growth an *instant* metadata operation on Windows. To enable Instant File Initialization you need to grant the SQL Server service account the “Perform Volume Maintenance tasks” privilege. Related to this recommendation, you should also:
a) Presize your files to the best of your knowledge and monitor them so you can proactively grow them if necessary.
b) If autogrowth does end up triggering, you got instant file initialization enabled so the autogrowth doesn’t suspend your session.
c) Replace the awful 1MB – 10% defaults with an uniform increment that makes sense for your environment. For smaller databases I often use 100MB increments, for larger ones up to 4GB increments.
8. Your application or your queries are pulling way more data than necessary.
I was reviewing an execution plan with a colleague a few days ago. The client was setting up an SSIS process to consume some data, transform and move it to another system, however the package was simply never completing. Going over the steps of the plan we found a join where the estimated data output was more than 1TB (yes, TERABYTE) of data through the execution plan. Going over the code we found a cartesian product in the logic, brought it up, refactored it with the developers and got much better performance once the data flowing inside the execution plan was orders of magnitude less.
We’ve faced this issue before as well with the way some applications are built. For example, bringing thousands of records over to the application when it’s well known that users almost never browse past the first 5 pages of any search results. Or not implementing caching at the server level or on a dedicated caching server (like REDIS for example) to avoid moving data back and forth the database constantly. Or queries that do the top N by some type of category or criteria but they still go through thousands or even millions of records to pick the top N because of the lack of proper filtering criteria.
9. You’re using distributed linked server queries without looking into how much data is getting pumped through the network.
Linked servers are a cool and useful feature and certainly have it’s place inside the SQL Server toolbox. However, some applications take the linked server concept and insert it into operational or critical database code without taking into account the complexity or cost of distributed queries. Basically, just because something works, doesn’t mean it’s the right thing to do. Again, this is one of those features where it’s so easy and transparent to use that we forget about the implications.
For example, I’ve found code where T-SQL code from Server B is doing a join from two tables on remote Server A and one small table on B. This all works, but under the hood the optimizer for linked server queries is just not that good. In this case, it was pulling the data from both remote tables locally to then do the join to the small local table resulting in a lot of waste of CPU and network bandwidth. Refactoring this process, even though it was *working*, was worth it because of the big performance improvement of moving the small table to server A and doing all the computation there.
10. Your maintenance routines are not optimal.
I’m a big fan (along with many other DBAs) of Ola Hallengren’s great maintenance solution. Using it will already cover the different fragmentation levels, updating stats that have been modified, etc. I’ve seen environments where no index maintenance is done, others where TOO MUCH was done instead. Same thing with statistics, a very large percentage of the cases where an execution plan suddenly performs worse is down to statistics not being up to date.
The out-of-the-box maintenance plans that ship with SQL Server definitely don’t help in this regard. Do yourself a favor, don’t try to reinvent the wheel and take advantage of Ola’s amazing community contribution.
So there you have it, please go check your servers today and see if you fall into any of the above items. If you do, I hope we gave you some good information and you’re on your way to remediation and saving yourself future headaches! Did we miss something you consider a massive performance killer? Let us know your opinion in the comments!