Monitoring SQL Server on Linux Part 2: Getting sar Data Into SQL Server (Quick and Dirty)

Introduction

In part 1 of this series I walked through the process of installing sysstat on a Linux-based SQL Server as well as some simple commands for getting data from it. The next question is … how do I get this data into our database server so that we can analyze it? Setting up an SSIS package to read and transform the data would be a great way to go, but I will not be discussing that today because it's a lot to ask of a reader who is just starting to evaluate the use of Linux in their environment. Instead I will use this post to discuss using the bcp (Bulk Copy) command to load the data into a database without the need for development work.

Sample Database

This particular post is all about minimizing effort, so I will not be using format files for the bulk copy commands. In the name of simplicity I will also avoid using any tricks to work around bulk copy limitations (such as inserting into views instead of tables). That means that each type of data we get from sar/sadf will need to go into its own table, that the columns in the table need to line up exactly with the columns in the data file (in order … bcp pays attention to column order), and that data types need to be close enough that everything can import with only implicit conversions.

I hope that most readers will be disturbed by the composite keys found in this schema. This is a consequence of the fact we are not using format files for the import – since the table columns need to match what’s in the file, there’s no way to add an identity column without doing something tricky. Stay tuned – part three of this series will add format files to the mix which will allow us to clean up the schema a bit.

CREATE DATABASE perf   -- Adjust file paths, sizing, etc as needed for your environment
ON PRIMARY( NAME = N'pri', FILENAME = N'/sqldata/perf.mdf' , SIZE = 8MB, FILEGROWTH = 1MB ),
FILEGROUP PERF DEFAULT ( NAME = N'perf', FILENAME = N'/sqldata/perf.ndf', SIZE = 1GB, FILEGROWTH = 1GB )
LOG ON ( NAME = N'perf_log', FILENAME = N'/sqllog/perf_log.ldf' , SIZE = 1GB , FILEGROWTH = 128MB )
GO

ALTER DATABASE perf SET RECOVERY SIMPLE;  -- Not required, but simple mode may be easier for testing
GO
USE perf
GO

CREATE SCHEMA sar AUTHORIZATION dbo;
GO

CREATE TABLE sar.cpu_data(
	hostname   VARCHAR(128) NOT NULL,
	interval   SMALLINT     NOT NULL,
	tstamp     DATETIME2(0) NOT NULL,
	cpu        TINYINT      NOT NULL,
	pct_user   DECIMAL(5,2) NOT NULL,
	pct_nice   DECIMAL(5,2) NOT NULL,
	pct_sys    DECIMAL(5,2) NOT NULL,
	pct_iowait DECIMAL(5,2) NOT NULL,
	pct_steal  DECIMAL(5,2) NOT NULL,
	pct_idle   DECIMAL(5,2) NOT NULL,
	CONSTRAINT PK_cpu_data PRIMARY KEY CLUSTERED( hostname, tstamp, cpu )
);

CREATE TABLE sar.mem_data(
	hostname   VARCHAR(128) NOT NULL,
	interval   SMALLINT     NOT NULL,
	tstamp     DATETIME2(0) NOT NULL,
	kbmemfree  INT          NOT NULL,
	kbavail    INT          NOT NULL,
	kbmemused  INT          NOT NULL,
	pct_used   DECIMAL(5,2) NOT NULL,
	kbbuffers  INT          NOT NULL,
	kbcached   INT          NOT NULL,
	kbcommit   INT          NOT NULL,
	pct_commit DECIMAL(5,2) NOT NULL,
	kbactive   INT          NOT NULL,
	kbinact    INT          NOT NULL,
	kbdirty    INT          NOT NULL,
	CONSTRAINT PK_mem_data PRIMARY KEY CLUSTERED( hostname, tstamp )
);

CREATE TABLE sar.disk_data(
	hostname VARCHAR(128) NOT NULL,
	interval SMALLINT     NOT NULL,
	tstamp   DATETIME2(0) NOT NULL,
	dev      VARCHAR(128) NOT NULL,
	tps      DECIMAL(9,2) NOT NULL,
	rkbs     DECIMAL(9,2) NOT NULL,
	wkbs     DECIMAL(9,2) NOT NULL,
	areq_sz  DECIMAL(9,2) NOT NULL,
	aqu_sz   DECIMAL(9,2) NOT NULL,
	await    DECIMAL(9,2) NOT NULL,
	svctm    DECIMAL(9,2) NOT NULL,
	pct_ut   DECIMAL(5,2) NOT NULL,
	CONSTRAINT PK_disk_data PRIMARY KEY CLUSTERED( hostname, tstamp, dev )
);

CREATE TABLE sar.net_data(
	hostname VARCHAR(128) NOT NULL,
	interval SMALLINT     NOT NULL,
	tstamp   DATETIME2(0) NOT NULL,
	iface    VARCHAR(128) NOT NULL,
	rxpcks   DECIMAL(9,2) NOT NULL,
	txpcks   DECIMAL(9,2) NOT NULL,
	rxkBs    DECIMAL(9,2) NOT NULL,
	txkBs    DECIMAL(9,2) NOT NULL,
	rxcmps   DECIMAL(9,2) NOT NULL,
	txcmps   DECIMAL(9,2) NOT NULL,
	rxxcsts  DECIMAL(9,2) NOT NULL,
	pct_ut   DECIMAL(5,2) NOT NULL,
	CONSTRAINT PK_net_data PRIMARY KEY CLUSTERED( hostname, tstamp, iface )
);

CREATE TABLE sar.page_data(
	hostname  VARCHAR(128) NOT NULL,
	interval  SMALLINT     NOT NULL,
	tstamp    DATETIME2(0) NOT NULL,
	pgins     DECIMAL(9,2) NOT NULL,
	pgouts    DECIMAL(9,2) NOT NULL,
	faults    DECIMAL(9,2) NOT NULL,
	mfaults   DECIMAL(9,2) NOT NULL,
	pgfrees   DECIMAL(9,2) NOT NULL,
	pgscanks  DECIMAL(9,2) NOT NULL,
	pgscands  DECIMAL(9,2) NOT NULL,
	pgsteals  DECIMAL(9,2) NOT NULL,
	pct_vmeff DECIMAL(5,2) NOT NULL,
	CONSTRAINT PK_page_data PRIMARY KEY CLUSTERED( hostname, tstamp)
);

Sample script for daily data import

Below is a short shell script that could be scheduled on the Linux server to import a day's worth of data. The script as written processes the previous day's data and in most environments would probably be run shortly after midnight. Lines 6 and 7 allow us to supply the SQL username and password (this is convenient for testing, but we shouldn't use clear-text passwords if this goes to prod). The command at line 9 calls out (back quotes) to the Linux date command to get the day number (+"%d") for the previous day (--date="yesterday"). Line 10 uses this to construct the name of the sysstat data file (using the Ubuntu naming convention – on Red Hat the location will be different).

#!/bin/sh

# NOTE - password is for testing. Please don't store password in clear text in prod ... consider
# joining server to an AD domain and using integrated security or using
# a password safe.
sqluser="username_goes_here"
sqlpass="password_goes_here"

day=`date --date="yesterday" +"%d"`
fname="/var/log/sysstat/sa$day"
# On Red Hat use fname="/var/log/sa/sa$day"

sadf -P ALL $fname -e 23:59:00 -td > cpu.txt
sadf $fname -e 23:59:00 -td -- -r > mem.txt
sadf $fname -e 23:59:00 -td -- -dp > disk.txt
sadf $fname -e 23:59:00 -td -- -n DEV > net.txt
sadf $fname -e 23:59:00 -td -- -B > page.txt

/opt/mssql-tools/bin/bcp sar.cpu_data in cpu.txt -U $sqluser -P $sqlpass -d perf -S "(local)" -F 2 -c -t ";" -r "\n"
/opt/mssql-tools/bin/bcp sar.mem_data in mem.txt -U $sqluser -P $sqlpass -d perf -S "(local)" -F 2 -c -t ";" -r "\n"
/opt/mssql-tools/bin/bcp sar.disk_data in disk.txt -U $sqluser -P $sqlpass -d perf -S "(local)" -F 2 -c -t ";" -r "\n"
/opt/mssql-tools/bin/bcp sar.net_data in net.txt -U $sqluser -P $sqlpass -d perf -S "(local)" -F 2 -c -t ";" -r "\n"
/opt/mssql-tools/bin/bcp sar.page_data in page.txt -U $sqluser -P $sqlpass -d perf -S "(local)" -F 2 -c -t ";" -r "\n"

The calls to sadf were discussed in part one of this series but I should spend some time drilling into the bcp commands, whose syntax is timeless. In this case, by "timeless" I really just mean that the syntax has not changed much in the past three decades and hearkens back to the heyday of UNIX-y command-line utilities (I actually used to use this tool all the time around the turn of the millennium when I was working with Sybase SQL Server on Tru64 UNIX). It is absolutely worthwhile for us to work through the syntax, however, because Bulk Copy is usually the fastest way to get data from a text file into SQL Server.

Let’s decompose the first bcp call to see what all of the options do.

  • /opt/mssql-tools/bin/bcp – path to bcp command, in case we haven’t added it to our path yet
  • sar.cpu_data – name of the table we are working on (in this case inserting into)
  • in – bcp supports both input and output. “in” specifies that we are inserting data
  • cpu.txt – name of the file we are working on (inserting data from)
  • -U $sqluser – SQL username (supplied at top of script). Use of the -U and -P options indicates we are using SQL authentication; if we wanted integrated authentication we would use -T instead. Joining a Linux server to Active Directory is out of scope for this post.
  • -P $sqlpass – SQL password (supplied at top of script)
  • -d perf – use the perf database
  • -S “(local)” – Server name. In this case we are working locally so we can simply say “(local)”.
  • -F 2 – Start at line 2 of the file (because the first line is a header)
  • -c – Input file is a text file (character format)
  • -t “;” – Field terminator is semi-colon (i.e. this is a semi-colon delimited file). This is a quirk of sadf.
  • -r “\n” – Row terminator is a newline character (rather than a semi-colon)
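
Once the loads complete, a quick query is a handy way to confirm that rows actually landed and to take a first look at the data. This is only a minimal sketch against the sar.cpu_data table defined above; adjust names and filters to taste.

-- Row count, plus the ten busiest samples by average non-idle CPU across the cpu rows in each sample
SELECT COUNT(*) AS rows_loaded FROM sar.cpu_data;

SELECT TOP (10)
	hostname,
	tstamp,
	AVG(100.0 - pct_idle) AS avg_busy_pct
FROM sar.cpu_data
GROUP BY hostname, tstamp
ORDER BY avg_busy_pct DESC;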

At this point we have seen how to get performance data with the sadf command and then get it into a database where we can start viewing it. But we probably do not care about all of the data that we are importing, and the fact that the tables need to exactly match the layout of the text files limits our schema design choices. Wouldn’t it be nice if we had more flexibility when mapping fields from the file onto table columns? We actually can do that if we do a little extra work before importing to set up a format file – which we will do in the next part of this series.

Monitoring SQL Server on Linux Part 1: Installing sysstat (Surviving Without Perfmon)

In my opinion, one of the greatest barriers to wider adoption of Linux for SQL Server platforms is not concern about how well the engine works under Linux. I believe it is actually the absence of tools that SQL Server DBAs have become accustomed to. No, I am not really talking about SSMS (at least not today). Today I’m actually talking about how to survive without the Windows Performance Monitor (Perfmon) that we’ve all grown to love over the years.

First the easier part. When we talk about configuring the Windows performance monitor on a Windows-based SQL Server we are usually worried about both general Windows performance counters and SQL Server-specific performance counters. The good news here is that the SQL Server performance counters (page life expectancy, connections per second, etc.) are also available from the sys.dm_os_performance_counters dynamic management view. This DMV is still available when we are on a Linux platform, so this data can be obtained from there instead. I may write about this in later parts of this series, but until/unless I do, Dave Bland and Louis Davidson (among others) have written on the topic. We still need a way to get operating-system level information like CPU usage, I/O performance, etc., … which will be the focus of the rest of this post.
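
For example, a query along these lines (a minimal sketch, not a full collection process) pulls one familiar counter out of the DMV; the exact object_name prefix varies with the instance name.

SELECT [object_name], counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Page life expectancy';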

Second, any time we are discussing ongoing performance monitoring it is wise to ask ourselves if we really want to set up our own monitoring framework or not. There are some great third party tools out there which are worth a look, and that is even more true if we are branching out into a less familiar (to us) operating system. Even if none of the traditional third party SQL Server monitoring packages are in budget, there are some great frameworks for monitoring Linux that may fit the bill.

If you are still reading, that must mean that you, like me, are used to leaning heavily on perfmon on Windows, have some infrastructure around saving the data, and are looking for something close to a drop-in replacement … without installing an agent or really any more software than is necessary. The rest of this post will walk through the installation of sar (sysstat) as one option that, while not actually pre-installed, is lightweight and easy to get up and running. Examples will be in Ubuntu; I may provide walkthroughs for other distros in the future. Bear in mind that things are still early – it is possible that a year or so from now I would have a different suggestion.

Step 1 : Install the sysstat package

As of the time of this writing sysstat is not installed by default. Make sure repositories are up to date and install. In Ubuntu packages are installed with the apt command; in other distributions the package manager will be different (yum or zypper, for example).

sudo apt-get update
sudo apt-get install sysstat

Step 2 : Set retention period in config file

Edit the configuration file ( /etc/sysstat/sysstat in Ubuntu; it may be located at /etc/sysconfig/sysstat if you are using a different distribution ). Decide how many days' worth of data you would like sysstat to retain on the filesystem. Be aware that the data file naming convention is different if a value greater than 28 is specified for history, so in this example I am changing it from its default value of 7 to 28.

sudo vi /etc/sysstat/sysstat
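
The value to change is HISTORY. After the edit, the relevant line in my copy of the file looked like this:

HISTORY=28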

Step 3 : Consider updating crontab to gather data every 5 minutes

Data collection does work a bit differently. On Windows, we typically configure perfmon to start every day at midnight and run for 24 hours regardless of sample frequency. With sysstat, we instead schedule each individual sample collection. I personally prefer this because it is more tolerant of events like mid-day reboots.

The scheduler in UNIX is called cron, and a default crontab file is usually installed with sysstat. The default will save data every 10 minutes. I typically adjust this to save a sample every 5 minutes. This can be done by updating the leftmost column (the minutes field) of the line calling debian-sa1 to */5 (* in this column would mean run every minute, */5 means every 5 minutes). This file is located at /etc/cron.d/sysstat and can be edited with any text editor (I'm using vi in this example).

sudo vi /etc/cron.d/sysstat
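
For reference, after the change the debian-sa1 line on my Ubuntu machine looked roughly like the following (the default line shipped by the package may differ slightly between versions):

# Collect a sample every 5 minutes
*/5 * * * * root command -v debian-sa1 > /dev/null && debian-sa1 1 1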

Step 4 : Enable automated data collection

It’s easier to manage data collection if it’s enabled as an Ubuntu service. This can be done in the /etc/default/sysstat file

sudo vi /etc/default/sysstat
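
On Ubuntu this amounts to flipping a single flag in that file from "false" to "true":

ENABLED="true"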

And finally, restart the service using this command

sudo service sysstat restart

Viewing some data

The next couple of posts in this series will work through some examples of how we can retrieve historical data from sysstat and insert it into SQL Server. But until then here are some examples of how we can view historical data. A few notes on filenames

  • Assuming we configured a history length of 28 or less, the last two digits of the filenames refer to a day of the month. All of my examples use data from the 27th, so I use /var/log/sysstat/sa27 – to see data from a different day change the last two digits.
  • Some distributions may place this data in /var/log/sa/ instead of /var/log/sysstat/
  • If the “-f <filename>” parameter is omitted, data for the current day will be shown
  • Be aware when using tab completion that the /var/log/sysstat/ directory contains files other than data files. If sar complains about an invalid file it’s possible that a report file was passed in instead of a data file.

A few notes on parameters

  • The first parameter in the below examples specifies the type of data we want
  • The -f parameter specifies which data file (which day) we’re interested in
  • The -e parameter specifies an ending time. I personally like to specify 11:59 PM as an ending time to exclude data timestamped at midnight of the following day.
  • The -t parameter specifies that times should be shown in the local time zone rather than UTC.

# Show cpu data for all cores for the 27th day of the month
sar -P ALL -f /var/log/sysstat/sa27 -e 23:59:00 -t
# Show memory utilization for the 27th day of the month
sar -r -f /var/log/sysstat/sa27 -e 23:59:00 -t
# Show disk (block device) info for the 27th day of the month
sar -dp -f /var/log/sysstat/sa27 -e 23:59:00 -t
# Show network device utilization
sar -n DEV -f /var/log/sysstat/sa27 -e 23:59:00 -t
# Show paging activity
sar -B -f /var/log/sysstat/sa27 -e 23:59:00 -t

Viewing some easier-to-process data

The sar commands discussed in the last section display data formatted for human readability. This is Linux and it is, of course, entirely possible for us to use shell scripting tools such as sed and awk to reformat this data into a delimited list which would be easier to insert into SQL Server. Unfortunately those are not tools in the typical SQL Server professional’s tool belt. There is good news – we can get computer-friendly output from the sadf command. The following examples will give semi-colon delimited versions of the same data.

# Show cpu data for all cores for the 27th day of the month. The -d option specifies ';' separator
sadf -P ALL /var/log/sysstat/sa27 -e 23:59:00 -td
# Show memory utilization for the 27th day of the month. The -- indicates sar options follow
sadf /var/log/sysstat/sa27 -e 23:59:00 -td -- -r
# Show disk (block device) info for the 27th day of the month
sadf /var/log/sysstat/sa27 -e 23:59:00 -td -- -dp
# Show network device utilization
sadf /var/log/sysstat/sa27 -e 23:59:00 -td -- -n DEV
# Show paging activity
sadf /var/log/sysstat/sa27 -e 23:59:00 -td -- -B

Weird, Unpredictable Connection Issues in Multi-Subnet Availability Groups

Executive Summary

There is a setting, enabled by default for most SQL Server Availability Group listeners, that causes all possible listener IP addresses (as opposed to only the active one) to be registered with the name server. This can cause seemingly random connection issues. The most common example is that some machines are able to connect without issue and others time out. If this is the case, consider either using the MultiSubnetFailover connection string property or checking the value of the RegisterAllProvidersIP parameter on the listener's cluster resource and setting it to 0 if it is currently 1.

Short Version

If a multi-subnet availability group is acting strangely, see the SQL Server documentation page "Configure a listener for an Always On availability group". There is a discussion of RegisterAllProvidersIP towards the bottom, with instructions for adjusting it.

Long Version

Consider the following scenario. You've got an availability group up and running for high availability purposes, and it's been working without a hitch. You then decide that the easiest way to keep the DB at your disaster recovery site up to date is to add it as an asynchronous replica to this same availability group. The DR site is of course on a separate subnet, so a second IP address needs to be added to the listener that will be needed in the event of a failover to DR. At some point after this, about half of the servers that connect to the database will start either timing out or connecting really slowly. Sometimes different applications behave differently. To make things worse, often the problems do not start right after the second IP address is added to the listener – it is not unusual for everything to be running fine until the first time the machines are rebooted.

Here is an example. For this demo I set up two SQL Server Developer Edition instances on servers that are on different subnets (calling them db1 and db2). Both servers were added to a cluster and an availability group was set up with two possible IP addresses (one for each subnet). I then try to connect from a third server (called clnt for client in this example) and my connection times out. No special reason that I'm using PowerShell instead of a command prompt, I'm just kind of lazy and stick with what right-clicking on the Windows menu gives me.

Key observation – connection timing out, two IPs for listener

In the above image, notice that after I had issues connecting I did an nslookup. The key observation here is that I got back multiple IP addresses, which usually isn't what we expect. In the interest of space I'll quickly dismiss a couple of the first questions that may come to mind at this point. Are the IP addresses just the IPs of the servers? No, the listener has its own IP(s) separate from the instances, and the listener IP is what gets registered with the name server. Is this a load balancing feature? Since the listener now has two IP addresses, is it accepting connections on both? Unfortunately no, only one of the listener's IP addresses will be accepting connections at a time – after all, the primary node is in a better position than the secondaries to know which replicas are up at the moment.

Let's dig deeper. Why do we need two IP addresses? Because we have two subnets. Why does that matter? Because the subnets have different IP ranges. For example, if the availability group has been using the 172.31.30.194 IP address on the 172.31.16.0/20 subnet and fails over to the 172.31.0.0/20 subnet, it will not be able to take that IP address with it because it is not a legal address on the latter subnet. Instead, after failover the listener will stop listening on 172.31.30.194 and will start listening on 172.31.10.216 in this example.

Because only one listener IP address will be active at a time, most of us would expect that only one IP address would be returned by our name service at a time. There is a possible issue here having to do with TTL (Time To Live) settings. It is usually assumed that changes to DNS are relatively rare, so it is safe to cache the results of lookups. The name server has a TTL setting that notifies clients how long that lookup should be good for. Suppose we were only registering the active IP address with DNS. When a failover to a different subnet occurs, that means that in the worst case connections to the listener may fail for up to TTL seconds before they pick up the new IP. That might be OK if TTL is short. It may also be OK if TTL is long but we only have multiple subnets involved to support a DR site … if for no other reason than the fact it won't be just the DB server in this boat. But it could be a problem if multi-subnet failovers are common.

What's the alternative to only listing the active IP address with DNS? It would be to list all possible IP addresses. How multiple entries from a DNS server are interpreted by a client is not well defined. If we speak of DNS in general (as opposed to just thinking about AGs), though, the most common assumption is that we are trying to achieve a rudimentary form of load balancing, so some flavor of round robin is typically used. In other words the code in the network stack will assume that all of the IPs should be usable rather than just one of them. Which usually means that the network stack will pick one, try it, and give up if the connection fails.

There may be an easy fix for this. If you have the ability to edit the full connection string, and all applications are only using standard and reasonably recent drivers, and there is enough headroom that the idea of attempting multiple connections in parallel doesn’t cause heartburn, adding MultiSubnetFailover=true to the connection string may fix the issue. I will not discuss this option further in this post because I do not have a lot of experience with it (do not take that as a reason to avoid it). This connection option could be really exciting for environments like AWS EC2 where multi subnet AGs are more common in HA scenarios.
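
For illustration only (the listener name matches the demo later in this post, and the database name here is made up), the property simply rides along in an otherwise ordinary connection string:

Server=tcp:Blog_AG_Demo_Listener,1433;Database=AppDb;Integrated Security=SSPI;MultiSubnetFailover=True;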

What if we don’t have full control of the connection string? Or what if we have a third party app that is using a completely out of date database driver? Or if we just don’t want to mess with doing a round of regression testing with MultiSubnetFailover=true right now? If we promise to set our TTL to something small can’t we just get the listener to only register the active IP address? Yes we can!

There is a cluster parameter called RegisterAllProvidersIP that controls this. If a listener is created by SQL Server it will default to 1 (and of course most listeners are created this way). Unfortunately this is a Windows clustering setting, so we cannot adjust it in T-SQL. More unfortunately, at the time of this writing and as far as I know, the parameter cannot be adjusted in the failover clustering admin tool either. It can be updated in PowerShell relatively easily though, as seen here.

Powershell commands to adjust RegisterAllProvidersIP

Let's discuss what all that code is doing. I opened an admin PowerShell prompt on one of the DB servers (mostly so I didn't have to type the cluster name). The first call to Get-ClusterResource lists out all of the cluster resources; I do this to make sure I know the full internal name of the listener cluster resource (not the same as the listener name in SQL). The next command (Get-ClusterResource Blog_AG_Demo_Listener | Get-ClusterParameter) just lists the parameters so we can verify that the current value is indeed 1. Finally we call Set-ClusterParameter to adjust the value to 0.
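
For reference, those commands look roughly like this when typed out (Blog_AG_Demo_Listener is the resource name from my demo cluster; substitute the network name resource of your own listener):

# List every cluster resource so we can find the listener's network name resource
Get-ClusterResource

# List the resource's parameters and verify RegisterAllProvidersIP is currently 1
Get-ClusterResource "Blog_AG_Demo_Listener" | Get-ClusterParameter

# Change it to 0 so that only the active IP address is registered in DNS
Get-ClusterResource "Blog_AG_Demo_Listener" | Set-ClusterParameter -Name RegisterAllProvidersIP -Value 0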

At this point we do unfortunately need to restart the availability group's resource, which will take the AG offline for a few seconds. PowerShell did warn us that we needed to take the listener offline – taking the listener offline will actually take the entire AG offline, so we may as well restart the entire resource. I typically do this in the failover cluster admin tool because I have a poor memory for commands I don't use often, but it can of course be done in PowerShell.

And that's all there is to it! After making this adjustment and waiting for TTL to expire (I could probably have cleared the DNS cache rather than waiting), I can see that I only get one entry from nslookup and connections to the listener are working again.

Searching for Cascadia: Nine Months on the Road

Introduction, Motivation, and a False Start

“Nothing behind me, everything ahead of me, as is ever so on the road.”
― Jack Kerouac, On the Road

For the past couple of years I have been one of those work-at-home SQL Server professionals who find themselves trying to decide where to live when there is no traditional office tying them to a specific location. When enough of us #sqldrifter types get together after a SQL Saturday event, talk almost inevitably turns to the idea of pulling up anchor and traveling the country in search of a new home. COVID has kept us from gathering to have these chats since I settled in to my new home, so I thought I'd blog about my experiences on the road instead.

First a few words about why my wife and I felt the need to spend a year aimlessly drifting around the country at the time we did. In our situation, it is not that we were unhappy in Denver. On the contrary, it's actually a great place. Sure we could have done without the scorching hot summers and high-desert landscape … but the actual city is great. It's just that I never really managed to crack the Denver consulting market. Essentially 100% of my work continued to be remote and the cost of housing along most of the Colorado Front Range is rather steep. We were happy in Denver but felt we could be just as happy somewhere else with a lower cost of living. Prior to this we had already done the "front range crawl" by relocating from northern New Mexico to Colorado Springs and from Colorado Springs to Denver, so we really were out of short, easy moves to consider.

In 2018 we decided to let our lease on the apartment lapse, pack for a long trip, and hop in the old reliable Subaru Tribeca with our two small dogs for an adventure. One of my clients is a Canadian startup, so at the time we were considering locations in Canada as well as the US. Our plan was to slowly work our way north until we hit Edmonton, AB then back to Denver for the first leg. After that we would slowly work our way north-east towards Maine (more to see along this route) and possibly all the way out to the Halifax, NS area.

This is when we started to hit obstacles. Shortly before our lease ran out (literally as we were packing up our apartment) our 17 year old Peekapoo Chloe passed away. It was a sad time, but we soldiered on with our plan for the road trip of a lifetime with our remaining dog Maggie, a 6 year old Lhasa Apso. Because we intended to spend some time in Canada, one of the final tasks before leaving was to take her to the vet for a health certificate. During this visit, the vet detected a suspicious lump in her abdomen that was eventually diagnosed as gastrointestinal lymphoma. We moved into a local hotel (too late to get the apartment back) for a few weeks to consult with a veterinary oncologist and decided that 6 years was entirely too short of a life for Maggie. Weekly vet visits for blood work and chemotherapy ruled out heavy travel, so we delayed our adventure and signed a lease for a new apartment in Lone Tree (south side of Denver).

In the spring of 2019 Maggie lost her battle with cancer. The question now was, did the road trip still feel right, or after a year-long delay were we ready to just stay put and start looking for homes in Colorado again? We took a long overdue vacation and returned to discover that our neighbor, unbeknownst to everyone else, had passed away shortly after we left. That was the last nudge that we needed. We gave up the apartment, packed up our recently leased Subaru Outback (the Tribeca had also passed during this time), and hit the road in June of 2019 looking for a fresh start and better memories.

What Kind of Home We Sought

To seek the sacred river Alph
To walk the caves of ice
To break my fast on honey dew
And drink the milk of paradise
-Neil Peart, Xanadu

This was our initial wish list for a new home:

  • Someplace that isn’t in a desert (not even “high desert”). We might have actually stayed in the Denver area if it was a little less dry. Specifically, someplace I can have a nice green lawn
  • Someplace we can find a nice home for less than $X00,000 (fill in your own value for X, whatever seems excessive for a median price home … that number is probably regional).
  • Within two hours of a reasonably large city with a decent airport (I don’t travel much for work-work but I do travel a lot for speaking engagements)
  • Nice big lot (an acre?) with good potential for a guest house /  ADU in case one of our parents ever needs to move in for a while
  • Someplace we can keep chickens for fresh eggs

By the time we started the trip, we had de-emphasized the idea of moving to Canada. After losing a year it didn’t make sense to complicate an already stressful decision with concerns about getting residency. In hindsight I wish we would’ve pursued the Canada option more aggressively but this blog isn’t the place to get into that.

We were looking for something new and both of us had spent most of our lives out West, so our expectation was that we’d end up someplace on the east coast. That said, we wished to be open to other possibilities as we made our journey.

The Journey

And you may ask yourself, well…
How did I get here?
– David Byrne, Once in a Lifetime

I worry about over-sharing when I relay some of the events of the past year, and I’m aware that I tend to make my posts much too long. So for this section I will try to keep things pretty high level.

  • June 2019: Finish putting our possessions back into storage. Decide to head towards Portland, ME (a top contender for new home). Stop in Iowa City for SQL Saturday where I presented a pre-con as well as a regular session. I learned I would be needed for an on-site visit in early July. Fly out of Chicago for a quick trip to Vegas.
  • July 2019: Since we were already in Chicago, spent the 4th on one of the lakes in Southern WI. Quick side-trip to Lake Geneva to see Gary Gygax memorial. Back to Chicago for one of my almost unheard of (at the time) on-site visits. Minor medical issue required a visit to Denver, so we drove back. Got back on the road at the end of the month and drove to Hartford, CT.
  • Aug 2019: One of our more productive months for exploration. After Hartford we explored northern MA and southern NH for a week. Quick stopover in Portland, ME (Portland is always AWESOME) before heading up to Bangor. Fell in love with the views around Bar Harbor. Flew to Minneapolis, drove to Sioux Falls for their SQL Saturday and then spent a week checking out Rochester, MN. Flew back to Bangor. Needed to go back to Denver for another doctor visit (again, nothing serious), so we drove back to Denver.
  • Sep 2019: Decided to leave car in Denver (central location) and fly instead for a while. Flew to the Seattle area to look at some more homes on Whidbey Island, then to Boston for SQL Saturday, then back to Denver. After a few days I had to fly back out to Chicago for a quick site visit which pretty much closed out the month. During September our approach changed
    • Our intention was to look for something around Saco, ME but real estate listings had started to dry up for the winter. We put the east coast on the back burner (with intention of going back when housing inventory picked up).
    • We had hit the out of pocket maximum for our health insurance which meant it behooved us to take care of any medical stuff we may need in the next year. Our provider network only covered Colorado, so we made Denver our base of operations for the remainder of the year.
    • To the extent that we could still squeeze in travel, we decided to keep looking in Washington state through the rest of the winter because the real estate market there is less seasonal. WA tended to stay on our list of possibilities, despite a high housing cost, because it puts us close to three of our four parents. They are not getting younger, and my father especially had been experiencing more health issues.
  • Oct 2019: Not much house hunting. I presented at SQL Trail 2019 and SQL Saturday Denver. Later in the month I had another site visit in Chicago. Immediately after I flew out to Portland, OR. Drove out to visit with my dad in Klamath Falls, OR for a few days before SQL Saturday Oregon.
  • Nov 2019: SQL Saturday Oregon. Rode the #sqltrain up to Seattle for summit. My wife flew in while I was there, and we spent a few days in Port Townsend, WA and Whidbey Island before flying back to Denver.
  • Dec 2019: We went away for an anniversary trip (nine years). Late at night, on our anniversary, I was awakened by a phone call informing me that my father had passed away. Was able to make most short term arrangements by phone (eventually … but that’s another story). Flew back out to Oregon between Christmas and New Years to clean out his home before another month of rent came due.
  • Jan 2020: Check out of hotel, pack car, and drive away from Denver yet again. Work our way out to an Airbnb in Poulsbo, WA which served as our new base of operations. Two back-to-back on-site visits in Chicago this month … but one of their travel policies actually required me to fly back to Seattle and turn around. That's a long flight … one mark against Washington. We discovered some real estate listings in an out of the way town called Camano Island, WA and added it to our search list.
  • Feb 2020: During our second house hunting trip to Camano, I walked through the door of the last house of the day and fell in love with the view. We decided to put in an offer which was accepted. We needed some old tax returns out of storage in Colorado – fortunately I had two more site visits in Chicago coming up … I was able to add an extra stopover in Denver to one of those trips. Filled out closing paperwork on Feb 27 and left car at new house. Flew out to SQL Saturday Rochester Feb 28, then immediately back to Colorado to get a moving truck. News is filled with coverage of the rapid spread of a novel coronavirus.
  • March 2020: Packed moving truck first few days of March. As we started driving the moving truck from Colorado to Washington we started getting calls warning us about toilet paper and bottled water shortages in WA. On March 6 we pulled the moving truck into the driveway of our new home.

What We Bought

“Nailed it!” – anon

Our wish list can be found above; here is what we actually ended up with. I do not see this as a failure in our requirements elicitation process, I prefer to think of it as an agile process.

  • We are at the edge of a rain shadow. It’s not a desert … deserts are dry and rain shadows are … well … not damp. It’s totally different because … well … never mind. The lawn was green when we moved in, but water is limited due to concerns about salt water intrusion. As I write this the lawn has gone yellow and crispy for the summer.
  • We paid exactly $X01,000. But for the money we got a water view.
  • We are within two hours of the SEATAC airport if there isn’t traffic on I-5. Of course, this is I-5 we’re talking about so there is always traffic. But for flights I am also close to a couple of smaller airports (Everett and Bellingham) so there are options.
  • Single story house on a cozy, quarter acre lot with no basement and zero ADU potential.
  • Chickens are specifically prohibited by neighborhood covenants.

What Went Well

“I Love It When a Plan Comes Together” – Hannibal (George Peppard), The A Team

There are a number of aspects of our search that I think went well

  • We did, in the end, successfully move. And just in the nick of time, had we waited too much longer our experiences on the road would have changed drastically as COVID-19 blew up.
  • Not only did we choose, IMO we chose fairly well. Our new home has been a remarkably quiet and peaceful place to wait out quarantine. I can walk out to the bay every day and amuse myself and our neighbors by being the neighborhood's least adept fisherman and crabber.
  • I was able to work continuously (but admittedly at a reduced workload) on the road.
  • We saw parts of the country we might never have seen otherwise because we mostly drove.
  • We did not really miss having a permanent address while we were on the road. Actually moving in and out of hotels is, of course, a hassle. But once we started doing longer stays at hotels we actually settled in pretty quickly and made some new friends with other long-term guests.
  • Getting mail was not as big of a problem as expected. In practice we found that we went back to Colorado often enough that our PO box worked fine. I did run into an issue where a client needed to overnight something to me, but renting a mailbox at the local UPS store worked out well (pro tip – we went that way in this case because the USPS requires proof of residency in the state for a PO box).
  • Due to some weird luck, I had a lot more on-site visits during this time than I typically do. This might sound like it would be inconvenient, but the extra travel actually caused surprisingly little disruption.
  • Our cable includes CBC as a local channel so we get to enjoy the Canadian version of Family Feud as well as Kim’s Convenience.

What I Wish We’d Done Differently

“The way that is THE way is not the WAY” – Lao Tzu, Tao Te Ching

Of course there are some lessons learned

  • I wish we’d more seriously considered buying an RV. I dismissed the idea of RV life relatively quickly because I wasn’t sure what we would do with the thing after buying a house, and also because I didn’t want our only mode of local transportation to be a giant dually pickup. Those concerns are probably insurmountable for me … but I still wish I’d thought harder before deciding to depend on hotels.
  • I wish we had meandered less. Part of the reason I supplied our brief itinerary above was to make the point that we actually spent remarkably little time house hunting because we were always moving. We relocated about once a week in the beginning. If I had it to do over, I would probably do longer stays at fewer locations. Besides making it easier to look around, it would have saved money – some extended stay hotels have amazing rates on month long stays.
  • I wish I had thought about this trip during open enrollment the year before and looked for a health plan with a national network (although since I’m self employed insurance options are usually grim and get worse every year).
  • Timing was not perfect. We did not have the luxury of picking an optimal time for this trip, but it did seem like we would have been better off if we had been looking for a home in 2018 instead of 2019 – a lot of the areas we looked at had recently experienced booms that we just missed out on. Additionally, we did not get serious about looking at stuff out east until too late in the year – I love our home in WA but it's entirely possible we missed out on an even more awesome home in ME because we didn't get up there when more homes were available.
  • When I start traveling again I may love this location less. If I need to fly in and out of SEATAC my best option is probably a train into Seattle followed by light rail to the airport. That may get old … wonder if it’s faster to get into Seattle from here by boat :).
  • I wouldn’t completely eliminate the road trip … but I do wish we’d driven less and flown more. Driving is a time suck.
  • Moving during a global pandemic sucks. Nothing can be done about this but there are a lot of challenges we faced due to timing
    • I had cut back my work schedule while we were on the road. After dropping a bunch of cash on a house I’d love to have a full schedule or even overload myself a bit. That’s difficult to do when the economy is slow.
    • It took months to both get cable installed and also to get a washer and dryer delivered. Ironically when the rest of the world was starting to work from home, I was forced to go back to an office (joined a coworking center) to get internet.
    • We miss our dogs. We are now ready to add some companions to complete our home. Unfortunately this is the worst time imaginable to find a puppy. Or for that matter even adopt an adult dog. Or a Guinea Pig. Or a … well you get the picture. It may be Christmas or next year before we get to the top of any of the breeders' lists or find a compatible pup available at the shelter. So far the closest we've gotten is our new friend Bubba, a Betta fish.

Why NOLOCK Is a Thing For MSSQL

In my experience, few disagreements between groups of MSSQL developers and DBAs have been as long running, contentious, and unproductive as the debate about whether or not the use of the NOLOCK hint is acceptable. It seems clear to me, at least, that it is not really ever acceptable for use in production code (assuming MSSQL 2008 or better, which isn't a big ask) – but that's not the interesting part of this debate. The really interesting thing, IMO, is that we're still talking about NOLOCK today in much the same way that we did a decade ago.

For those lucky readers who aren't already familiar, the NOLOCK query hint in SQL Server prevents a SELECT statement from taking shared locks to enforce data consistency. When MSSQL is running at the default transaction isolation level with default settings, transactional consistency is guaranteed with a locking mechanism that prevents writers from updating data while another process is reading it, and prevents readers from accessing data while another process is writing. If the NOLOCK query hint is added to a SELECT statement, the reader portion of this locking mechanism is skipped. This means two things: we no longer have to worry about readers getting blocked by writers, but because transactional consistency has been intentionally disabled we may get results that are in the middle of a change.
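
For the sake of concreteness, here is roughly what that looks like in code (dbo.AccountBalance is a made-up table; the table-level hint and the session-level setting below are effectively the same request for dirty reads):

-- Table hint form: this one SELECT skips shared locks and may see uncommitted data
SELECT AccountID, Balance
FROM dbo.AccountBalance WITH (NOLOCK)
WHERE AccountID = 42;

-- Session form: equivalent behavior for every statement that follows
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT AccountID, Balance
FROM dbo.AccountBalance
WHERE AccountID = 42;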

For the DBAs … Why Developers Do This

Before speaking to the developers about this, I think it's also helpful to try to help DBAs understand why this is a thing.

  • We don't do a great job of explaining the issue. My experience has been that when most DBAs see code that is littered with NOLOCK hints, they push back by saying something like "NOLOCK is bad, you could get inconsistent data, may we please take these hints out?". When that fails, it may escalate to "Is there anything I can say to persuade you to stop?" or possibly a more sarcastic "If you don't care about the consistency of your data why are we using a database?". Frequently that's as far as the debate goes – it's hard for us to change developer behavior if we can't engage them in their own language, and those of us on the ops side aren't always in a great position to do that. What's missing is often guidance on what developers should do instead.
  • Developers get dinged when their code causes blocking. A really serious blocking issue can take down a system. Ideally this doesn't happen because hopefully some kind of monitor has been set up that will alert if an unusual amount of blocking is happening. Regardless of whether production went down or the on-call DBA was awakened at three in the morning by an automated check, a developer who writes code that causes blocking may get yelled at. This tends not to happen if NOLOCK is liberally sprinkled through the code. No, NOLOCK does not always eliminate blocking, and yes, other bad things may happen, but NOLOCK does tend to reduce the yelling.
  • It worked in test. Problems that we haven't actually personally experienced don't always seem real. This matters to the NOLOCK debate because serious consistency issues arising from dirty reads tend to be improbable. If the production system is 100 or even 10 times busier than a test system, there is plenty of room for these improbable things to happen in production even if they never occur in test. As more shops embrace test driven development, this disconnect between test systems and production becomes more problematic.
  • We don’t often have to think about transactions in MSSQL. It’s probably a stretch to say that a developer can go their entire career without ever having to commit a transaction, but I actually don’t think it’s that big of a stretch. Unlike some other database products, if we mostly work in single statements we don’t have to explicitly commit or rollback our work. I like this behavior personally but the fact that a junior developer doesn’t need to think about transactions very often does make it easy to forget that transactions are important.
  • Dirty reads are frequently OK. This is of course dangerous thinking as I'll discuss later, but think about the simplified case of selecting data from a single row of a single table. If an update to the data is in process and NOLOCK is not used, the code waits until the update is complete and then gets the new value. If NOLOCK is used, the code does not wait and may or may not get the new value, depending on how far along the update statement is. If we assume data is never rolled back, for a lot of applications this behavior may actually be acceptable – if we're checking at the instant a value is changing we may not care if we get the old or new value. Don't get me wrong, I'm not arguing that NOLOCK is OK here … but I also don't think that it's completely crazy that developers can get into the mindset of "we've thought about it, and we're OK with dirty reads".

For the Developers – Why NOLOCK is a problem

As tempting as it may be to use the NOLOCK hint, there is a healthy list of reasons to avoid its use

  • Optimistic concurrency is a better way to systematically avoid blocking. Ever since SQL Server 2005, we've had two flavors of optimistic concurrency (read committed snapshot isolation and also snapshot isolation). I will not dive into a detailed discussion of these here because others have done a great job (Kendra Little to name one of many), but my biggest issue with the NOLOCK hint is that we have better ways to prevent contention between readers and writers. If you are a developer who is reading this and is greatly concerned about blocking, instead of resorting to NOLOCK I'd encourage you to have a discussion with your DBA about enabling snapshot isolation (see the short sketch after this list) – it can lead to a similar reduction in blocking but does so in a way that respects other transactions.
  • Knee-jerk NOLOCK is the real problem. Regardless of the above point, I personally wouldn't bother arguing with a developer who uses NOLOCK once or twice a year. The real issue lies in adding the hint to every query – it's maybe reasonable to say "I've thought about it, and a dirty read is OK here" once in a while. It's hard to accept that argument when it's applied to every query without forethought.
  • NOLOCK ignores transactions (other people's transactions). Think about somebody checking their balance at a bank. If we don't really think things through I could understand a developer wanting to believe that NOLOCK would be OK for this – if a change is in flight, the difference between seeing the old value versus the new value could be the same as it would have been if the customer spent an extra second standing in line. The real problem with this thinking is that the update may be rolled back – for example if the other account involved in the transfer has insufficient funds. Even in cases where we might think at first blush that a dirty read could be acceptable, it's possible that we're mistaken in that belief when we consider other transactions happening on the system … and we could give folks hardcopy receipts or reports showing data that was technically never part of the database.
  • NOLOCK doesn't actually mean no lock. This distinction mostly affects code that is running during something like a big nightly ETL job, but the NOLOCK hint actually only suppresses shared locks. It leaves other locks in place, specifically schema stability locks … so at the end of the day, even if dirty reads were not dangerous (they almost always are), it cannot even really be said that sprinkling this hint through code is always effective at preventing locking. It can actually make some locking issues worse.
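
As promised above, here is a minimal sketch of enabling the optimistic flavors (the database name MyAppDb is hypothetical; turning on READ_COMMITTED_SNAPSHOT needs exclusive access to the database, hence the ROLLBACK IMMEDIATE, so plan for a quiet window, and keep in mind that both options push row versions into tempdb):

-- Existing READ COMMITTED queries start reading row versions instead of taking shared locks
ALTER DATABASE MyAppDb SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE;

-- Optional: allow sessions to opt in with SET TRANSACTION ISOLATION LEVEL SNAPSHOT
ALTER DATABASE MyAppDb SET ALLOW_SNAPSHOT_ISOLATION ON;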

Index Fragmentation Still Matters

Executive Summary

Until recently a huge driver for index maintenance in SQL Server was that reading data from contiguous blocks on disk was more efficient than randomly scattered reads on magnetic media. Now that solid state storage is taking over, some see index defragmentation as wasted effort. While fragmentation may be less of a concern than it once was, regular index maintenance still improves both disk space usage and memory efficiency by reclaiming unused space inside pages.

The Short Version

Yes, it is true that solid state storage is less sensitive than magnetic media to whether data is accessed sequentially or in a random order. Note that it is not actually true that this doesn't matter at all – it still matters when we get close to the throughput limits of our storage (see below) – but that is not the point of this post.

Remember that there are actually two different kinds of fragmentation that we discuss. The one we typically think of, where database pages are stored out of order on disk, is external fragmentation. As we may guess from the name, there is also another phenomenon frequently referred to as internal fragmentation. Internal fragmentation refers to inefficiency in how data is organized inside pages – specifically, issues that leave chunks of unused space inside pages. Obviously this has an impact on how much unused space is sitting in our data files, but as pages are read into the buffer pool, high levels of internal fragmentation also reduce how much usable data will fit in the buffer pool and eventually drag down page life expectancy.

How do we control internal fragmentation? The same way that we control external fragmentation – by rebuilding or reorganizing indexes. For some workloads it can be important to check for internal fragmentation (by way of avg_page_space_used_in_percent in dm_db_index_physical_stats) directly, but for most situations the main lesson is to continue defragmenting as we have for years.

The Long Version

Old School View of Index Maintenance

Think back to … say … 2010. Many of us were terribly excited about seeing “The Last Airbender” or “Hot Tub Time Machine”. Ugg boots were at the peak of their inexplicable popularity. Cee Lo was singing “F*ck You”. And many database folks in smaller shops were running SQL Server 2008 with something like 8GB of RAM on physical hardware backed up by spinning rust. The word “spinning” is what matters here. Our data was stored on physical platters. Before a page could be read into memory, we had to wait for two things to happen – the heads inside our hard drive had to be moved to the correct track on the disk, and then we had to wait for the correct portion of the disk to physically spin around underneath the head. Throughput is maximized if we read as much data as we possibly can while neighboring blocks are spinning past. In other words, doing a few large sequential reads is fundamentally more efficient than large numbers of small, random access reads.

Why does this matter? Mostly because we have historically expected the bottleneck in relational database servers to be I/O. If we know that sequential I/O is fundamentally more efficient than random access I/O, then the easiest way to get more bang for our storage buck is to arrange for more of our reads to be sequential. How do we do this? That’s the easy part. We identify those indexes that are not stored in contiguous blocks on the disk, and then we either rebuild them or reorganize them.

Do we rebuild or reorganize? That's not something I'll dig into in this post. In a nutshell, reorganizing an index simply redistributes data among the pages already allocated to the index, while rebuilding is a more drastic operation that recreates the index from scratch. Which makes more sense depends on many factors including SQL Server edition, whether or not the index can be offline, whether or not we need to be able to interrupt the command, and so on. A great post by Kendra Little on some of these issues can be found here.
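
Whichever route we choose, the statements themselves are short. A quick sketch, using a hypothetical index IX_Orders_CustomerID on a hypothetical dbo.Orders table:

-- Reorganize: shuffles leaf-level pages in place; always online and safe to interrupt
ALTER INDEX IX_Orders_CustomerID ON dbo.Orders REORGANIZE;

-- Rebuild: recreates the index; ONLINE = ON keeps it available during the rebuild where the edition supports it
ALTER INDEX IX_Orders_CustomerID ON dbo.Orders REBUILD WITH (ONLINE = ON);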

Why Index Maintenance Isn't Getting As Much Love Now

Technology moves quickly, and even since 2010 quite a bit has changed.

  • SQL Server is a data hoarder, and its house has gotten bigger. Think of SQL Server as an eccentric person whose house is filled with … say … stacks and stacks of old newspapers. The "house" in this example is memory on the server that SQL Server occupies, and the pages of newspapers are data pages from our database. If we ever ask for a piece of data from disk, it will get hoarded (cached) in memory for absolutely as long as the space is not needed for something more important. That's actually a really good thing – if we give our database server large amounts of memory, there's a pretty good chance that once the cache is warmed up any data that we're actively using will be in memory. If most of the data we need is in memory already, that means we are not working the I/O system as hard, which makes us a bit less sensitive to disk performance. If we only have room to allocate 4GB to the buffer pool that's not a lot of room for data. But as more folks are able to afford hundreds of GB of RAM it's reasonable to assume there will be a higher cache hit ratio for a given amount of data.
  • More folks are aware of the importance of query tuning. Monitoring tools have gotten better. Thanks to events like SQL Saturday, education amongst the SQL Server community has improved. Odds are better today than they used to be that if a query is doing a huge index scan then somebody will notice and do some tuning. Scans really do make sense for some workloads, but if  we do have a workload where some large scans can be eliminated by tuning efforts then doing so actually accomplishes two things. First, the query tends to run faster because it tends to do less I/O. The second impact, however, can be less obvious. As large scans are replaced with smaller seek operations (or even smaller scans), that tends to make disk access more random. More random access means we notice the impact of external fragmentation less.
  • Finally the elephant in the room. More and more of us have transitioned from spinning magnetic media to solid state storage. Because retrieval is no longer dependent on moving parts, solid state devices tend to perform very well for random access workloads. Since the argument that was used on most of us to get us serious about index maintenance was based on the way spinning discs operate, the rise of SSDs has led to a widespread belief that index maintenance doesn’t matter anymore. It’s worth noting that even if sequential reads are no faster than random reads it is still possible that more IOPs are required in the presence of fragmentation. That means external fragmentation can still matter in high performance workloads – I will not discuss further here but I would refer the interested reader to Jonathan Kehayias’ post here.

These are all excellent points, and I tend to agree that external fragmentation (data out of order on disk) is not as much of an issue as it has historically been. That does not mean that index maintenance does not matter at all, because reducing external fragmentation was not the only reason we rebuilt indexes. The other type of fragmentation that we worry about, and which is also reduced by index maintenance, is internal fragmentation. Internal fragmentation refers to unused space inside the pages of our data files. In other words, in addition to reordering data physically on disk, index maintenance is also the mechanism that we use to reclaim space from deleted or updated records in our databases.

An Example

If anybody wishes to follow along with this example, here is the script that I used to create my sample database. In a nutshell, the DB contains two tables. Into one table we insert a large amount of data in reverse order (so it will be perfectly fragmented, in terms of external fragmentation). Into the other table, we insert a large amount of data in order, and then delete the majority of it so that we have large holes of unused space on every page.

USE [master]
GO

CREATE DATABASE FragmentTest ON
PRIMARY ( NAME = N'Frag_P', FILENAME = N'[some_filepath_here]\Frag_P.mdf' , SIZE = 1GB , MAXSIZE = UNLIMITED, FILEGROWTH = 1GB ),
FILEGROUP main DEFAULT( NAME='Frag_D', FILENAME = N'[some_filepath_here]\Frag_D.ndf', SIZE=21GB, MAXSIZE = UNLIMITED, FILEGROWTH = 1GB )
LOG ON( NAME = N'Frag_L', FILENAME = N'[some_filepath_here]\FRAG_L.ldf' , SIZE = 8000MB , FILEGROWTH = 8000MB )
GO

ALTER DATABASE FragmentTest SET RECOVERY SIMPLE;
GO

USE FragmentTest
GO

CREATE SCHEMA frag;
GO

CREATE TABLE frag.InternalFrag(
ID INT IDENTITY NOT NULL,
Payload CHAR(996) NOT NULL CONSTRAINT DF_InternalFrag_Payload DEFAULT 'Time to look big',
CONSTRAINT PK_InternalFrag PRIMARY KEY CLUSTERED( ID )
);

CREATE TABLE frag.ExternalFrag(
ID INT NOT NULL,
Payload CHAR(996) NOT NULL CONSTRAINT DF_ExternalFrag_Payload DEFAULT 'Time to look big',
CONSTRAINT PK_ExternalFrag PRIMARY KEY CLUSTERED( ID )
);
GO

-- Create internal fragmentation by populating a table with 10GB of data then deleting 7/8 of the data.
INSERT INTO frag.InternalFrag(Payload) VALUES('Time to look big');
GO 10485760

DECLARE @first_match INT;
DECLARE @last_match INT;

WITH x AS(SELECT TOP 100000 ID FROM frag.InternalFrag WHERE ID % 8 > 0)
SELECT @first_match = MIN(ID), @last_match = MAX(ID) FROM x;

WHILE @first_match IS NOT NULL AND @last_match IS NOT NULL
BEGIN
DELETE FROM frag.InternalFrag WHERE ID >= @first_match AND ID <= @last_match AND ID % 8 > 0;

WITH x AS(SELECT TOP 100000 ID FROM frag.InternalFrag WHERE ID % 8 > 0)
SELECT @first_match = MIN(ID), @last_match = MAX(ID) FROM x;
END
GO

-- Insert about 10GB into the external frag table, in reverse order, to generate external fragmentation
DECLARE @count INT = 10485760;
WHILE @count > 0
BEGIN
INSERT INTO frag.ExternalFrag(ID) VALUES(@count);
SET @count -= 1;
END
GO

Let’s go ahead and see what fragmentation looks like against these two tables like so

SELECT OBJECT_NAME( ps.object_id ) AS Tab,
	i.name AS Ind,
	ps.avg_fragmentation_in_percent,
	ps.avg_page_space_used_in_percent
FROM sys.dm_db_index_physical_stats( DB_ID(), DEFAULT,
		DEFAULT, DEFAULT, DEFAULT) ps
	INNER JOIN sys.indexes i ON ps.object_id = i.object_id
		AND ps.index_id = i.index_id;

Image 1: This is not the fragmentation you are looking for

As expected, the ExternalFrag table, where we intentionally inserted data out of order, is extraordinarily fragmented. But what about the InternalFrag table? With fragmentation of less than 1%, does that mean we do not need to worry about it? What this really means is that the default “LIMITED” mode of the dm_db_index_physical_stats function doesn’t read the leaf-level pages, so we are not getting any information about what things look like at the leaf level. The fix is straightforward: we can instead run in “SAMPLED” mode (or even “DETAILED” if we want to scan the entire leaf level … and have a ton of time to kill).

SELECT OBJECT_NAME( ps.object_id ) AS Tab,
	i.name AS Ind,
	ps.avg_fragmentation_in_percent,
	ps.avg_page_space_used_in_percent
FROM sys.dm_db_index_physical_stats( DB_ID(), DEFAULT,
		DEFAULT, DEFAULT, 'SAMPLED') ps
	INNER JOIN sys.indexes i ON ps.object_id = i.object_id
		AND ps.index_id = i.index_id;

Image 2: There’s the wasted space

This is more like it. We have to wait longer for results, but the result set now tells us how much unused space is sitting in data pages waiting to be reclaimed. Those who are accustomed to looking at avg_fragmentation_in_percent are used to thinking of low numbers as good and high as bad. This is reversed for avg_page_space_used_in_percent – a low percentage of used space could indicate we need to rebuild (unless we intentionally specified a low fill factor). So in this case, almost 88% of the space in the InternalFrag table is not actually being used. Why does this matter? The obvious reason is that if we aren’t intentionally padding the InternalFrag table then it is taking up about 8 times as much space on disk as it needs to. But also consider what happens when we run the following queries

SET STATISTICS IO ON
GO

SELECT COUNT(*) FROM frag.InternalFrag;
SELECT COUNT(*) FROM frag.ExternalFrag;

On my workstation, here are the results

(1 row affected)

Table 'InternalFrag'. Scan count 5, logical reads 1321485, physical reads 0, read-ahead reads 1309772, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row affected)

Table 'ExternalFrag'. Scan count 5, logical reads 1319382, physical reads 2, read-ahead reads 1314985, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Now what happens if we defragment the indexes like so?

ALTER INDEX PK_InternalFrag
	ON frag.InternalFrag REORGANIZE;
ALTER INDEX PK_ExternalFrag
	ON frag.ExternalFrag REORGANIZE;

After this, when we re-run the COUNT(*) queries above we get the following message

(1 row affected)

Table 'InternalFrag'. Scan count 5, logical reads 174607, physical reads 2, read-ahead reads 170547, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row affected)

Table 'ExternalFrag'. Scan count 5, logical reads 1319552, physical reads 9, read-ahead reads 1325530, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

That is an 87% reduction in logical reads for the query against InternalFrag. If this were a real world scenario, that could mean 87% fewer pages that could have helped satisfy future queries getting unceremoniously dumped from memory whenever we scan this table.

For the sake of completeness, before we leave let’s take another look at the results of the dm_db_index_physical_stats query to be absolutely certain that both types of fragmentation have been cleared up.
[Image 3: dm_db_index_physical_stats results after the REORGANIZE]

Internal Fragmentation Is Important

One of my pet peeves on most projects I’ve joined is that the index maintenance jobs typically check for high levels of avg_fragmentation_in_percent and ignore avg_page_space_used_in_percent. Am I saying that we’ve been doing things wrong? It depends. But for smaller DBs … yeah, kind of. Checking directly for excessive free space inside data pages is probably better, because on many systems these days unused space is more of an issue than pages that are merely out of order. That does not mean we should stop checking for external fragmentation entirely. Even on solid state storage, external fragmentation still matters a little as we approach the performance limits of our hardware … and if we say that our servers are I/O bound then, by definition, we probably do push that limit at times. Assuming there is enough slow time on the system for a call to dm_db_index_physical_stats in either sampled or detailed mode, plus any desired index rebuild/reorganize operations, I like to keep an eye on both measures.
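To make that concrete, here is a minimal sketch of the kind of check I have in mind. It runs in SAMPLED mode and flags indexes that look bad on either measure. The thresholds and the page_count filter are illustrative assumptions rather than recommendations, so adjust them for your environment.

SELECT OBJECT_SCHEMA_NAME( ps.object_id ) AS SchemaName,
	OBJECT_NAME( ps.object_id ) AS TableName,
	i.name AS IndexName,
	ps.avg_page_space_used_in_percent,   -- internal fragmentation (low = lots of wasted space)
	ps.avg_fragmentation_in_percent,     -- external fragmentation (high = pages out of order)
	ps.page_count
FROM sys.dm_db_index_physical_stats( DB_ID(), DEFAULT,
		DEFAULT, DEFAULT, 'SAMPLED') ps
	INNER JOIN sys.indexes i ON ps.object_id = i.object_id
		AND ps.index_id = i.index_id
WHERE ps.page_count > 1000                            -- ignore tiny objects
	AND ( ps.avg_page_space_used_in_percent < 70      -- example threshold for internal fragmentation
		OR ps.avg_fragmentation_in_percent > 30 );    -- example threshold for external fragmentation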

What If There Isn’t Time?

The above advice should be fine for the vast majority of folks reading this. But there is an issue with even the sampled mode of dm_db_index_physical_stats – it takes longer. We may not care how long this takes if we are looking at something like AdventureWorks, where the difference between limited and sampled modes is measured in seconds or minutes. But for larger datasets, it’s possible that limited mode will take an hour or two (or even longer) while sampled mode could take several hours to complete. On such systems, there is a chance that there isn’t time to both check for unused space and also do any useful amount of defrag work. One possible fix is simply to check fragmentation levels less frequently. For example, rather than checking fragmentation every day, maybe it is possible to check fragmentation levels once per week and scatter rebuild operations over slow periods on the other days. Another possibility is to not spend time on dm_db_index_physical_stats on any recurring basis and instead develop a fixed schedule for rebuilding indexes that are known to be problematic.

There is another possibility though. My examples, which show external fragmentation without internal and (more importantly) internal fragmentation without external, are a bit contrived. Workloads vary, so do your own testing … but in most real-world situations the two kinds of fragmentation are closely correlated: tables with high levels of internal fragmentation tend to also have significant external fragmentation. I would like to reiterate that whenever practicable I am a huge advocate for tracking internal fragmentation directly. That said, if the amount of time needed for sampled mode is an issue, odds are that rebuilding externally fragmented tables will keep internal fragmentation in check. Just be sure to consider running an extra rebuild or reorg after doing massive deletes or updates.

Updating Heaps. Not Awesome.

I’ve found myself speaking a lot lately about things that actually go wrong when heaps are used inappropriately. Most of us have heard that heaps aren’t recommended for update-heavy workloads, but it is still disturbingly common for me to start on a new project and find heaps of heaps. I’m pretty sure the issue is not that anybody missed the guidance – it’s that there isn’t a lot of awareness of what kinds of things can actually break when it isn’t followed.

Setup

Basically all that’s needed to demonstrate this particular issue is a database with a table that doesn’t have a clustered index. It’s helpful if the record size for that heap is large enough to be interesting. Not being terribly creative, here’s what I used

USE master
GO

CREATE DATABASE HeapDB
	ON( NAME=HeapDB, FILENAME='C:\your_path_here\HeapDB.mdf', SIZE=10GB, FILEGROWTH=1GB )
	LOG ON( NAME=Heap_log, FILENAME='C:\your_path_here\HeapDB.ldf', SIZE=1GB, FILEGROWTH=1GB)
GO

ALTER DATABASE HeapDB SET RECOVERY SIMPLE;
GO

USE HeapDB
GO

CREATE TABLE dbo.BigHeap(
	ID            INT          NOT NULL,
	BigBlobbyData VARCHAR(MAX) NOT NULL
);
GO

-- Add some data

DECLARE @x   INT = 1;
DECLARE @xvc VARCHAR(10);
WHILE @x <= 1000
BEGIN
	SET @xvc = CAST(@x AS VARCHAR(10));
	INSERT INTO dbo.BigHeap( ID, BigBlobbyData ) VALUES( @x, REPLICATE( '0', 4000 ) + @xvc);

	SET @x += 1;
END
GO

Breaking Stuff

Perhaps the easiest issue to demonstrate on this setup happens if we simply do enough simultaneous updates on a table. For my test, I opened up two new query windows in SSMS and in both windows ran this code (at the same time)

USE HeapDB
GO

UPDATE dbo.BigHeap
	SET BigBlobbyData = REPLICATE( '0', 10000) + CAST(RAND() AS VARCHAR(100))
	WHERE ID = FLOOR(RAND() * 10) + 1;
GO 100000

After a while, it should be possible to scroll through the contents of the messages tab and see something like this


[Screenshot: deadlock error message in the Messages tab]

Fortunately with current versions of SQL Server we have an easy way to dig into deadlocks. Opening up the system_health extended event session and finding the most recent deadlock gives me this graph


[Screenshot: deadlock graph from the system_health extended event session]

What does this mean? It means that at some point each session had an exclusive lock on different RIDs (a RID, or row identifier, is essentially what we call a record in a heap). And each session was attempting to get an update lock on the record which the opposing session had exclusively locked. Update locks cannot be granted as long as another transaction holds an exclusive lock on that resource, so no amount of waiting will allow either session to make progress, and a deadlock results.
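If you want to see those RID locks for yourself while the two update loops are running, a query like this (a quick sketch against sys.dm_tran_locks, run from a third session) will show the exclusive and update requests piling up:

SELECT tl.request_session_id,
	tl.resource_type,          -- RID rows are the individual heap records
	tl.resource_description,   -- file:page:slot for a RID resource
	tl.request_mode,           -- X, U, etc.
	tl.request_status          -- GRANT vs WAIT
FROM sys.dm_tran_locks tl
WHERE tl.resource_database_id = DB_ID('HeapDB')
	AND tl.resource_type = 'RID';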

But why? In a nutshell this particular deadlock happens because there is no structure in a heap. To update a record SQL Server must scan until it is found … but even once the record has been found the search must continue. There is no primary key, so there is no way to know that more records matching the WHERE clause of the update do not exist. If two separate sessions make changes to a record that has not yet been scanned by their counterpart, this kind of deadlock will result.

Will it help if we add a unique nonclustered index? Maybe a little, in that a deadlock may be less likely. But we can still see deadlocks because both the heap and the index may need to be updated, and the fact that pages in both need to be locked opens the door to a different kind of deadlock. But more to the point, why would we do this? If we have a candidate key that could be used as a unique index, it may also be a great candidate for a clustering key (see the sketch below).
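For this demo table, where ID happens to be unique and NOT NULL already, converting the heap into a clustered table is a one-liner. This is only a sketch; on a real table you would want to think through the choice of clustering key and the cost of building the index first.

-- Builds a clustered index on ID and rewrites the heap as a B-tree in the process
ALTER TABLE dbo.BigHeap
	ADD CONSTRAINT PK_BigHeap PRIMARY KEY CLUSTERED( ID );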

Why Relying on SSMS for Execution Plans Isn’t Enough

One thing I’ve noticed about myself – I don’t tend to want to blog about the same subjects that I want to talk about at SQL Saturday. Different media, different audience, so different topics probably aren’t surprising. But in an effort to be more consistent … here is a taste of the kind of material that I cover in my “Why Should I Care About … The Plan Cache” presentation.

If you’re still reading this, there’s a great chance that you’ve looked at an execution plan by putting a query into SSMS, hitting the button to ask for the actual plan, and then executing the query. Pretty easy, pretty awesome, that’s how performance tuning is done, right? Well, even if we ignore the possibility of different session settings (won’t get into that here), there are a few issues.

Why It’s Not That Easy

  • If the end user doesn’t report an issue right away, they might not remember exactly what they were doing at the time they noticed a performance issue.
  • Even if the user does remember, they usually can’t tell you what query was running (they were probably using some kind of software).
  • Documentation for that software likely won’t drill down to the level of what queries are run against the database.
  • If software documentation does exist and actually does include queries, that’s a very detailed software document. Has it been kept up to date?
  • If the software was developed in house, why not ask the developer? If the developer that wrote it is still around, they probably won’t know off the top of their head exactly what the query looked like. Odds are they would have to investigate.
  • Even if it’s possible to quickly find exactly where in the code the problematic query is executed, that might not give us the query. It’s becoming more common for some kind of dynamic SQL to be involved, which means the actual query could depend on parameter values.

It may sound like I’m trying to say it’s not reasonable to ask the development team what TSQL is actually getting run. Of course that’s a reasonable thing to ask. The point I’m trying to make is just that it is likely that you will not get an answer immediately. When troubleshooting performance issues time usually matters, so it is usually better if you can quickly find out for yourself what is causing the problem.
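Fortunately, the plan cache can usually answer that question directly. Here is a minimal sketch of the kind of query I start with; ordering by worker time is just one reasonable choice (total_logical_reads or total_elapsed_time work just as well).

SELECT TOP (20)
	qs.total_worker_time,
	qs.execution_count,
	SUBSTRING( st.text, ( qs.statement_start_offset / 2 ) + 1,
		( ( CASE qs.statement_end_offset WHEN -1 THEN DATALENGTH(st.text)
			ELSE qs.statement_end_offset END - qs.statement_start_offset ) / 2 ) + 1 ) AS statement_text,
	qp.query_plan
FROM sys.dm_exec_query_stats qs
	CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
	CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) qp
ORDER BY qs.total_worker_time DESC;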

The Disappointment of Querying with LINQ. Why I Don’t Love LINQ to Entities.

Generalized search (or catch all, or Swiss army knife) stored procedures are one of the cases where I personally feel that T-SQL does not offer a fantastic solution. Unfortunately, neither does LINQ to SQL nor LINQ to Entities. That said, no tool does a fantastic job with that use case so this is not an indictment of LINQ.

It is true that using LINQ may allow a development team to be more productive, but it is also true that LINQ complicates monitoring and performance tuning. At the end of the day that means that a more skilled (pronounced : expensive) database administration staff may be required.

The Short Version

One big issue with supporting the queries that LINQ generates is that the queries are strange – they almost look like they’ve been through an obfuscator. Another is that there’s not a clean, easy way to add a RECOMPILE or an OPTIMIZE FOR UNKNOWN hint. Sure, it’s nice that LINQ doesn’t allow developers to add any crazy, dangerous hint that they want to (*cough* NOLOCK *cough*), but there is a reason that query hints exist. Use of hints should indeed be rare, but at times they really are necessary.

The current state of practice in shops that use LINQ (in my experience at least) is for the development team to start out using LINQ, and to re-factor logic into a TSQL-based stored procedure for those cases that LINQ does not handle well. My issue is that those tend to be the really complex cases, and frequently are the cases where we may end up with dynamic SQL inside the stored procedure once we are done performance tuning. As a performance tuning person, that’s exactly the opposite of what I’d like to see. I have no doubt that the average SQL developer can get the easy stuff right. What I would like to see something like LINQ help with is the weird, gnarly, dynamic SQL that goes along with the hardest cases.

Most of us have seen examples of completely insane TSQL that was generated by LINQ. Epic fails are real, I have seen them. I wouldn’t say they’re common though. In my experience what really keeps LINQ from shining is normally that it runs afoul of much more mundane issues, such as bad execution plans which arise from the wrong value being sniffed for the wrong parameter at the wrong time. In TSQL, when there is a parameter sensitivity issue more often than not a query hint is added, but LINQ makes it difficult to add query hints without really offering an acceptable alternative solution.

The Long Version Part 1: What part of “Project1 Extent1 LEFT JOIN” Don’t You Understand?

I’ll be writing some LINQ code for this post. Don’t worry, I won’t make a habit of it, and I’ll keep things fairly simple for this. To create the database that I use in the examples below, I started with a standard copy of the Adventure Works database. Then I enlarged it using Jonathan Kehayias’ script (read his blog post here). Yes, I am still using Adventure Works for examples. Yes, I am old. After creating the database I started a new C# console application in Visual Studio and generated an ADO.Net data model from it.

Now consider the following code

static void Main(string[] args)
{
	AdventureWorks2012_bigEntities ctx = new AdventureWorks2012_bigEntities();

	var q =
		from p in ctx.People
		select p.EmailAddresses;
	Console.Out.WriteLine(q.ToArray().Length);
}

If we were to write a TSQL query for the same data by hand, it might look like this

SELECT EmailAddress.*
FROM Person.Person
	INNER JOIN Person.EmailAddress ON Person.BusinessEntityID = EmailAddress.BusinessEntityID;

No, it’s not great that we’re using “SELECT *” but that’s not my point, and contrary to popular belief it actually is possible to get LINQ to only bring back a few columns … but that would be another blog post. For this post, I want to compare the preceding query with the following one, which is what LINQ generates in my lab setup.

SELECT       [Project1].[BusinessEntityID] AS [BusinessEntityID],       [Project1].[C1] AS [C1],       [Project1].[BusinessEntityID1] AS [BusinessEntityID1],       [Project1].[EmailAddressID] AS [EmailAddressID],       [Project1].[EmailAddress] AS [EmailAddress],       [Project1].[rowguid] AS [rowguid],       [Project1].[ModifiedDate] AS [ModifiedDate]      FROM ( SELECT           [Extent1].[BusinessEntityID] AS [BusinessEntityID],           [Extent2].[BusinessEntityID] AS [BusinessEntityID1],           [Extent2].[EmailAddressID] AS [EmailAddressID],           [Extent2].[EmailAddress] AS [EmailAddress],           [Extent2].[rowguid] AS [rowguid],           [Extent2].[ModifiedDate] AS [ModifiedDate],           CASE WHEN ([Extent2].[BusinessEntityID] IS NULL) THEN CAST(NULL AS int) ELSE 1 END AS [C1]          FROM  [Person].[Person] AS [Extent1]          LEFT OUTER JOIN [Person].[EmailAddress] AS [Extent2] ON [Extent1].[BusinessEntityID] = [Extent2].[BusinessEntityID]      )  AS [Project1]      ORDER BY [Project1].[BusinessEntityID] ASC, [Project1].[C1] ASC

No, that’s not super readable as is … but it gets a lot better if we take some time to format the query. Here’s what I came up with after a few minutes of formatting.

SELECT       [Project1].[BusinessEntityID] AS [BusinessEntityID],
       [Project1].[C1] AS [C1],
       [Project1].[BusinessEntityID1] AS [BusinessEntityID1],
       [Project1].[EmailAddressID] AS [EmailAddressID],
       [Project1].[EmailAddress] AS [EmailAddress],
       [Project1].[rowguid] AS [rowguid],
       [Project1].[ModifiedDate] AS [ModifiedDate]
FROM (
    SELECT           [Extent1].[BusinessEntityID] AS [BusinessEntityID],
           [Extent2].[BusinessEntityID] AS [BusinessEntityID1],
           [Extent2].[EmailAddressID] AS [EmailAddressID],
           [Extent2].[EmailAddress] AS [EmailAddress],
           [Extent2].[rowguid] AS [rowguid],
           [Extent2].[ModifiedDate] AS [ModifiedDate],
           CASE WHEN ([Extent2].[BusinessEntityID] IS NULL) THEN CAST(NULL AS int) ELSE 1 END AS [C1]
    FROM  [Person].[Person] AS [Extent1]
          LEFT OUTER JOIN [Person].[EmailAddress] AS [Extent2] ON [Extent1].[BusinessEntityID] = [Extent2].[BusinessEntityID]
)  AS [Project1]
ORDER BY [Project1].[BusinessEntityID] ASC, [Project1].[C1] ASC

One minor point that I won’t dwell on, because it’s kind of nit-picky and probably depends on which version of Entity Framework (EF) is used, is that there is an ORDER BY clause on the query even though I didn’t specify any ordering in my LINQ query. SQL Server licenses are expensive – personally I would prefer that those CPU cycles not be wasted when I could sort the data in my C#. Odds are the application is running on a processor that does not cost as much to license. But again, small point and not my main point.

My main point for this section is this – if you’re not a DBA try to put yourself in your DBA’s shoes. Which query makes more sense to you, the one I wrote in TSQL or the one that was generated by LINQ? If you are looking for a performance issue, which one would you rather try to wrap your head around to make a performance recommendation? And this is a relatively straightforward example. Realistic examples of entity framework code produce much longer and much more confusing queries.

Most EF gurus that I’ve talked to say the answer is simple – don’t look at the query, focus on the execution plan. If it’s a system that I’m supporting then fair enough, I personally am pretty comfortable with that. But I have been working with SQL Server for a very long time, and I am very comfortable reading execution plans. Not everyone is as comfortable with them. That means that workloads which involve LINQ are going to tend to skew towards the more experienced end of the performance DBA spectrum for support. In a smaller shop where one of the developers has been “nominated” as the DBA, or which is getting by with a junior DBA, it’s possible that there just simply won’t be the expertise on staff to work through complicated plans. Even at places which do have an experienced performance person on staff, they’re usually pretty busy already and not looking for more work to do.

One important clarification before I move on – I’m not actually saying that one should not use LINQ because it generates ugly SQL. My point is more that there is a trade-off to keep in mind. I think it’s pretty much inevitable that LINQ to Entities will turn up in most organizations because it is faster to code against than, say, raw ADO.Net. But there is a cost – both the amount and the complexity of DBA work increase. In all but the smallest organizations I’d suggest planning to bring on a qualified performance DBA before adopting LINQ, which will consume some of the cost savings from the development efficiencies.

Come On, EF, Take a Hint Already

So ugly queries are unfortunate, and yes they generate some extra work when they need to be tuned. But we’re all professionals here, that’s not going to be a deal breaker for me. Personally my issue is actually that it is very difficult to solve certain performance issues in LINQ. Consider the following code

class Program
{
	static IQueryable<SalesOrderDetailEnlarged> DemoQuery( int prodID )
	{
		AdventureWorks2012_bigEntities ctx = new AdventureWorks2012_bigEntities();

		var q1 =
			from d in ctx.SalesOrderDetailEnlargeds
			where d.ProductID == prodID
			select d;

		return q1;
	}

	static void Main(string[] args)
	{
		int n1 = DemoQuery(42).ToArray().Length;  // ToArray to make really really extra sure query runs since we're throwing away results
		Console.Out.WriteLine(n1.ToString() + " rows");
	}
}

Which gives me this output

0 rows

Obviously somebody is not a Douglas Adams fan. But we are not here to talk about literature, we’re here to talk performance. This is a fairly small DB that fits in memory so there isn’t much point in studying runtimes. Let’s look at I/O. Running this SQL query on the database server

USE master
GO

SELECT qs.last_logical_reads, st.text, qp.query_plan
FROM sys.dm_exec_query_stats qs
	CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
	CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) qp
WHERE st.text LIKE '%extent1%'  -- Quick and dirty check for a LINQ query
	AND qp.dbid = DB_ID('AdventureWorks2012_big')  -- Check DB ID to prevent past queries against plan cache from matching;

I see this

Prod 42 did 3 IO ops

which by itself isn’t very exciting. But now let’s run that same C# code again with a different product. Below I am only showing the lines that changed (updated 42 to 870).

		int n1 = DemoQuery(870).ToArray().Length;  // ToArray to make really really extra sure query runs since we're throwing away results
		Console.Out.WriteLine(n1.ToString() + " rows");

We see a lot more rows and a lot more IO

192208 rows

Prod 870 did 192208 IO ops

Assuming each read looks at a page and that we are using 8KB pages, that means that this query is sifting through 4,619,304 KB of data. But if we check the size of the table (e.g. using sp_spaceused), the table only contains 601,472 KB worth of data. To process this query we just read more than 7.5 times as much data as we would have by just reading the entire table in one shot.

What’s happening here? Let’s look at those two columns that I’ve been ignoring from the SQL query above (text and query_plan). They contain the query and the execution plan (I did a little formatting for clarity).

SELECT
       [Extent1].[SalesOrderID] AS [SalesOrderID],
       [Extent1].[SalesOrderDetailID] AS [SalesOrderDetailID],
       [Extent1].[CarrierTrackingNumber] AS [CarrierTrackingNumber],
       [Extent1].[OrderQty] AS [OrderQty],
       [Extent1].[ProductID] AS [ProductID],
       [Extent1].[SpecialOfferID] AS [SpecialOfferID],
       [Extent1].[UnitPrice] AS [UnitPrice],
       [Extent1].[UnitPriceDiscount] AS [UnitPriceDiscount],
       [Extent1].[LineTotal] AS [LineTotal],
       [Extent1].[rowguid] AS [rowguid],
       [Extent1].[ModifiedDate] AS [ModifiedDate]
      FROM [Sales].[SalesOrderDetailEnlarged] AS [Extent1]
      WHERE [Extent1].[ProductID] = @p__linq__0


Plan with key lookup

Here is what I noticed when I looked at these. First, this is a parameterized query. On the whole that’s probably a good thing. When parameterized SQL is used, by default SQL Server will reuse the execution plan from the first call to the query, which on the whole is also a good thing. But in this specific case, the first call to this query didn’t return any data, so it generated an execution plan that doesn’t work well when a large amount of data comes back. The plan is using an index seek to locate records first and then, as a second step, retrieving data from the table with a key lookup operator. That’s a fantastic plan for cases where we only expect a few rows, but when a large amount of data is scattered through the table, this plan tends to hit each page multiple times.

For comparison, I ran DBCC FREEPROCCACHE to clear the plan cache (don’t do this in prod) and then re-ran my C# code which is querying for product ID 870. We now see a much more reasonable amount of IO, and an execution plan which simply does a single scan. This is a fantastic plan for product ID 870 even if it isn’t ideal for product ID 42, and it is not as bad for product 42 as the first plan was for 870.

IO for product 870 now 75,763


Plan now scans

This is of course a pretty straightforward case of parameter sensitivity. In this particular case, parameter sniffing is hurting us rather than helping. Parameter sniffing is a SQL Server behavior, not a LINQ behavior, so why am I discussing it here? I have devoted this much time to it because this kind of issue is typically fixed by adding either a RECOMPILE or an OPTIMIZE FOR hint, and EF (as well as LINQ to SQL before it) doesn’t have a great way to add query hints. This is already a long post so I won’t dive into these hints in depth, but briefly: the RECOMPILE hint prevents SQL Server from caching the plan (no plan reuse, fresh compile every time) while the OPTIMIZE FOR hint allows for plan reuse but changes which parameter value is used for optimization during compilation.
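For reference, here is what those two hints look like when you do control the T-SQL. This is just a sketch against the enlarged table from earlier; the local variable stands in for the parameter that EF would pass.

DECLARE @ProductID INT = 870;

-- Plan reuse is still allowed, but the plan is built from average density statistics
-- rather than for one specific value
SELECT sod.SalesOrderID, sod.OrderQty, sod.UnitPrice
FROM Sales.SalesOrderDetailEnlarged sod
WHERE sod.ProductID = @ProductID
OPTION (OPTIMIZE FOR UNKNOWN);

-- No plan reuse at all: a fresh compile for the actual values on every execution
SELECT sod.SalesOrderID, sod.OrderQty, sod.UnitPrice
FROM Sales.SalesOrderDetailEnlarged sod
WHERE sod.ProductID = @ProductID
OPTION (RECOMPILE);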

Surviving LINQ

So the problem stands – LINQ is part of our landscape, the issue discussed above probably isn’t going away anytime soon, and we still have to administer our servers. There are a few ways to do this.

Plan Guides

I personally get the most mileage out of plan guides. Plan guides are horrible. And they suck. And they are bad. Did I mention that they suck? At the end of the day, however much I may hate using plan guides in general, they do work for this problem. Plan guides change the way that a query is optimized. One of the things that we can do with them is to instruct the optimizer to inject a query hint every time it sees a given query. Here is what I typically do when I’m trying to fix a problematic LINQ workload.

  1. Watch the plan cache for queries that have bad plans (and also appear to be from LINQ. It’s obvious which ones are from LINQ, trust me). Discussion on how to do this is out of scope for this post.
  2. Find the handle of the bad plan, e.g. from dm_exec_query_stats
  3. Call sp_create_plan_guide_from_handle to create a plan guide. At this point we have pinned the bad plan into the cache. Do not stop here.
  4. In Management Studio, locate the new plan guide, right click on it, and generate a DROP and CREATE script for it.
  5. In the create part of the resulting script, delete the hard-coded plan (it’s a big ugly blob of XML)
  6. Add the hint you’d like to the @hints parameter (probably OPTIMIZE FOR if there is a parameter value that yields a “good enough” plan for all values, and probably RECOMPILE if this isn’t the case)
  7. Run the script to re-create the guide (a sketch of what the whole sequence can look like follows this list).
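Here is a rough sketch of what steps 2 through 7 can end up looking like. The guide name, the LIKE filter, and the statement text are placeholders; the part that matters is the @hints parameter on the final sp_create_plan_guide call, and @stmt/@params must match the cached query text and parameter list exactly.

-- Steps 2 and 3: grab the handle of the cached plan and pin it with a plan guide
DECLARE @plan_handle VARBINARY(64);

SELECT TOP (1) @plan_handle = qs.plan_handle
FROM sys.dm_exec_query_stats qs
	CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
WHERE st.text LIKE '%Extent1%';   -- placeholder filter; narrow this down to the problem query

EXEC sp_create_plan_guide_from_handle
	@name = N'LinqGuide_Demo',    -- placeholder name
	@plan_handle = @plan_handle;

-- Steps 4 through 7: after scripting the guide out (DROP and CREATE) and deleting the XML plan,
-- the edited CREATE ends up looking roughly like this
EXEC sp_create_plan_guide
	@name = N'LinqGuide_Demo',
	@stmt = N'<LINQ-generated statement text, copied verbatim from the scripted guide>',
	@type = N'SQL',
	@module_or_batch = NULL,
	@params = N'@p__linq__0 int',   -- parameter list exactly as it appears in the cache
	@hints = N'OPTION (OPTIMIZE FOR UNKNOWN)';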

Query Store

Query Store is a new feature in SQL Server 2016. It should be a much easier way of getting an effect similar to what I describe for plan guides above, assuming the query is recompiled often enough that a “good” plan can be found in the store. As long as there is a single plan that always works well enough, simply pin that execution plan using the GUI. The main reasons I do not list this first are that I haven’t personally done this yet (I haven’t done a SQL Server 2016 project that also uses LINQ yet), and that there are, at the time of this writing, still relatively few places that have 2016 running in production.
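I have not put this into production myself yet (as I said), but based on the documentation the T-SQL equivalent of pinning a plan in the GUI looks roughly like this sketch. The database name, the LIKE filter, and the query/plan IDs are placeholders you would read out of the Query Store views.

-- Query Store must be enabled on the database first
ALTER DATABASE AdventureWorks2012_big SET QUERY_STORE = ON;

-- Find the query and the plans Query Store has captured for it
SELECT q.query_id, p.plan_id, qt.query_sql_text
FROM sys.query_store_query q
	INNER JOIN sys.query_store_query_text qt ON q.query_text_id = qt.query_text_id
	INNER JOIN sys.query_store_plan p ON p.query_id = q.query_id
WHERE qt.query_sql_text LIKE '%Extent1%';   -- placeholder filter

-- Force the plan you like (IDs are placeholders taken from the query above)
EXEC sys.sp_query_store_force_plan @query_id = 42, @plan_id = 7;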

Stored Procedures

A more traditional way of coping with LINQ strangeness in general is to recode the problematic query in TSQL as a stored procedure. This is an excellent approach for some cases, but frequently pushing SELECT statements down into a stored procedure is a problem if there is any amount of polymorphism. Is it possible to list all of the queries that the code may issue? Do most of the queries always join to the same set of tables regardless of parameter values? Is the column list that gets returned always the same? If the answer to all of these questions is yes, then it won’t be too painful to wrap a lot of the SELECT logic into stored procedures. But at the same time, if there is that little polymorphism, is there really much benefit to using LINQ?

Entity Framework Interceptors

In the Entity Framework, at least in recent versions, it’s possible for the developers to modify how TSQL code gets generated by using an interceptor. In my mind at least, interceptors have a lot in common with plan guides. They allow us to inject query hints, and they are kind of icky. Personally, if I’m the performance DBA for a project I’d prefer using a plan guide because it usually requires less coordination. I am likely to be the one that decides to fix a problem query using a hint, and if my only two options are plan guides and interceptors then at least the plan guide can be added without coordinating a code change with the developers. Besides, I don’t want anybody getting the idea that they could write an interceptor that litters the code with NOLOCK hints, and it’s possible this could happen if I call attention to that particular EF feature.

Priming the Cache

I mention this mostly to try to convince the reader not to ever try it. This will not work in the long run. As we saw earlier, it may be possible to avoid parameter sensitivity issues by just ensuring that the first execution of each query has reasonable parameter values. Why not just run a controlled, hand-tuned workload to load good plans into the cache?

The most immediate problem is that plans are added to and ejected from the cache all the time. Does your data change? If so then so do your stats, and after a relevant statistic changes the query will be recompiled. Is your workload really predictable enough to do this? If not, it’s possible one of the plans that you “seeded” will fall out of the cache if it isn’t used at a certain time of day and … again … cause a new compilation next time it is executed. Most importantly, does anybody on the team have the cycles to maintain a priming script as the development team changes the behavior of the product over time?

And of course, there is the assumption that there is one single execution plan that is reasonable for all parameter values. That’s not always the case.

What Would Make Me Happy

In conclusion, I should probably try to be constructive and discuss possible solutions. Not that I have the answers. If I knew how to make something better than LINQ, I would not be a nearly anonymous performance tuning professional that writes overly long blog posts. I would be a famous language architect too busy living the good life to be bothered with writing.

One thing that I would not change – I’m not sure I’d add an easy way for developers to add any hint they’d like. Yes, all query hints exist for a reason and we do need a way to use them, but most hints really are not needed that often. For the one occasion every year that I decide I need to use a FORCESEEK hint, I don’t think I’d necessarily mind using a plan guide. Much. It’s a pain, but I am happy to live with a little discomfort if it means that I don’t really have to worry about NOLOCK hints getting overused anymore.

I guess what I’d personally like to see would be extension methods for the hints that are actually needed relatively often in practice. For example wouldn’t it be cool if we could do this

            var q1 =
                (from d in ctx.SalesOrderDetailEnlargeds
                where d.ProductID == prodID
                select d).OptimizeForUnknown();

or if OptimizeForUnknown is too MSSQL-centric maybe

            var q1 =
                (from d in ctx.SalesOrderDetailEnlargeds
                where d.ProductID == prodID
                select d).SuppressParameterSniffing();

The Potential of LINQ. Why I Really Wish I Could Love LINQ to Entities.

Executive Summary

Generalized search (or catch all, or Swiss army knife) stored procedures are one of the cases where I personally feel that T-SQL does not offer a fantastic solution – my feeling is that we usually need to choose between maintainability (LINQ is likely to seem more maintainable than the “sea of red” that the stored procedure approach becomes) and performance (when parameter sniffing becomes an issue, LINQ will suffer). The careful use of LINQ to Entities would be able to help us if it were not for the fact that there is no way to supply a query hint for those cases where parameter sniffing becomes an issue. The only way that I’ve discovered to work around this shortcoming is to use plan guides, which are not very maintainable IMO. That’s a long-winded way of saying that dynamic SQL or dynamic string execution in T-SQL may currently be the least bad solution.

The Problem

Whether a database stores purchase order information, technical support tickets, employee data, descriptions of Doctor Who episodes, or something else entirely, these days it is likely that a software frontend will eventually sit between the database and the data consumer. Frequently the software will include some kind of search screen with about a gazillion optional fields where the user can potentially enter data. For example, if we are working on a system for creating and tracking tech support tickets then the user might expect to see a screen someplace in our product that lets them filter by things like creation date, assigned technician, customer, system location, completion date, etc.

Typically the expectation is that this search functionality will be implemented by calling a single stored procedure regardless of what combination of parameters is supplied. If the development team has not been down this road before or been warned about the pitfalls, typically the optional parameters will be handled by using the “@parm IS NULL OR @parm=val” approach. For example, a greatly simplified procedure for our ticketing system may look like this

CREATE PROCEDURE dbo.SearchAllTheThings
  @StartDateCreation DATE = NULL,
  @EndDateCreation DATE = NULL,
  @TechnicianID TINYINT = NULL
AS
	SELECT TicketID, CustomerID, CompletionDate, TicketText
	FROM dbo.Ticket
	WHERE (@StartDateCreation IS NULL OR Ticket.CreationDate >= @StartDateCreation)
	AND (@EndDateCreation IS NULL OR Ticket.CreationDate <= @EndDateCreation)
	AND (@TechnicianID IS NULL OR EXISTS(SELECT * FROM dbo.TicketTechnician
		WHERE TicketTechnician.TicketID = Ticket.TicketID AND TicketTechnician.TechID = @TechnicianID))
GO

and if we ask for the actual plan when calling it like this (requesting all tickets created on October 1, i.e. all tickets created between October 1 and October 1).

exec dbo.SearchAllTheThings '20131001', '20131001'

we may see a plan like this

What’s up with that nested loop and the concatenation operator? Since we did not specify a parameter for the technician there is no reason for us to look at anything but the ticket table. The clustered index scan should have us covered – well, in the absence of a better index.

The reason, of course, is that SQL Server will reuse the plan for future calls. Even though the plan is optimized for the current set of parameters, it needs to produce correct results for all possible combinations of parameters. The only way this is possible is for the plan to include these extra operators.
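As an aside (and not the route I take below), the other commonly used fix for this pattern is to leave the predicates alone and add a statement-level OPTION (RECOMPILE), which lets the optimizer throw away the branches that do not apply on each call, at the price of a compile on every execution. A sketch, with a made-up procedure name:

CREATE PROCEDURE dbo.SearchAllTheThingsRecompile   -- hypothetical variant of the procedure above
  @StartDateCreation DATE = NULL,
  @EndDateCreation DATE = NULL,
  @TechnicianID TINYINT = NULL
AS
	SELECT TicketID, CustomerID, CompletionDate, TicketText
	FROM dbo.Ticket
	WHERE (@StartDateCreation IS NULL OR Ticket.CreationDate >= @StartDateCreation)
	AND (@EndDateCreation IS NULL OR Ticket.CreationDate <= @EndDateCreation)
	AND (@TechnicianID IS NULL OR EXISTS(SELECT * FROM dbo.TicketTechnician
		WHERE TicketTechnician.TicketID = Ticket.TicketID AND TicketTechnician.TechID = @TechnicianID))
	OPTION (RECOMPILE);   -- plan is built for the parameters actually supplied, every time
GO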

Fixing With Dynamic SQL

Typically a good solution for this issue is to rewrite the stored procedure so that it builds an appropriate query on the fly, depending on which parameters are actually used, and then executes it using either exec or sp_executesql, depending on whether compilation overhead or parameter sniffing is of greater concern. So in this case we could rewrite the stored procedure as

CREATE PROCEDURE dbo.ImprovedSearchAllTheThings
  @StartDateCreation DATE = NULL,
  @EndDateCreation DATE = NULL,
  @TechnicianID TINYINT = NULL
AS
	DECLARE @query NVARCHAR(MAX);   -- MAX so the assembled query cannot be truncated
	DECLARE @parms NVARCHAR(300);

	SET @parms = N'@StartDateCreation DATE, @EndDateCreation DATE, @TechnicianID TINYINT';

	SET @query = N'SELECT TicketID, CustomerID, CompletionDate, TicketText FROM dbo.Ticket WHERE 1=1';
	IF @StartDateCreation IS NOT NULL
	BEGIN
		SET @query = @query + N' AND Ticket.CreationDate >= @StartDateCreation';
	END
	IF @EndDateCreation IS NOT NULL
	BEGIN
		SET @query = @query + N' AND Ticket.CreationDate <= @EndDateCreation';
	END
	IF @TechnicianID IS NOT NULL
	BEGIN
		SET @query = @query + N' AND EXISTS(SELECT * FROM dbo.TicketTechnician WHERE TicketTechnician.TicketID = Ticket.TicketID AND TicketTechnician.TechID = @TechnicianID)';
	END

	EXEC dbo.sp_executesql @query, @parms, @StartDateCreation, @EndDateCreation, @TechnicianID
GO

When we request the actual execution plan while calling

exec dbo.ImprovedSearchAllTheThings '20131001', '20131001'

And now the plan looks a little cleaner

[Image: execution plan for the improved procedure]

A Face Only a DBA Could Love

As a rule, my experience has been that developers hate any form of dynamic SQL. At a visceral level, I believe it feels as if dynamic SQL flies in the face of at least 15 years of database programming best practice thinking (to be clear, it does not, at least IMO). There was a time when it was common to slam tons of strings, some of which may have come from an end user, together in code to assemble a query. This practice is what made SQL injection attacks so effective at the time. Far and away the most popular mitigation against SQL injection has been various kinds of parameterized SQL, such as the sp_executesql procedure used above or parameters on the humble SqlCommand object. Over time, as the community continued to hammer into developers’ heads that parameterized SQL is a very, very good thing, many lost sight of the fact that the real issue is including user-supplied input in dynamically assembled SQL … many today are of the opinion that ALL dynamic SQL is dangerous (or at least icky).

And I actually agree with them, dynamic SQL is icky. Don’t get me wrong, I’m not saying that dynamic SQL should not be used. I actually think it is often demonstrably the best approach in cases like the above. What I am saying is that it is the best of a handful of bad options. We keep hoping for a good one and that’s one of the reasons LINQ has become so popular over the objections of many many DBAs.

I’m sure different folks have different reasons for disliking dynamic SQL. My issues with it are testing and maintainability. Testing, because if there are a large number of parameters and complicated logic, the number of potential query strings that could be produced grows exponentially, which makes it impossible to be absolutely sure that there is no combination that is missing a space somewhere important. Maintainability, because there is no syntax help from the IDE when working inside text strings; everything is in a “sea of red”, which means anyone who isn’t familiar with how the procedure is put together will need to spend some time orienting themselves before they touch anything.

Not such big problems if you hire some fancy-pants consultant with a blog and never have to modify the code after they leave, but these are problems for mere mortals.

So How Was LINQ Supposed to Help?

I don’t want to get too bogged down in discussing why LINQ is so exciting from a development perspective, but briefly it is an acronym for “Language-Integrated Query” and is pretty awesome because that’s exactly what it does. It allows a developer to drop something that looks an awful lot like a query into the middle of their code and work with the results of the query as objects instead of flat data. All this without tens or hundreds of lines mapping parameters, looping through readers, and stuffing data into objects by hand. As somebody who uses LINQ to Entities in secret, late at night, when nobody else is looking and I don’t care about parameter sniffing issues I can say it really is pretty awesome.

The reason I wish I could love LINQ, as a DBA, is that all of this happens through SQL rather than through some proprietary library. In other words, LINQ is a SQL generator that is fully supported by IntelliSense and syntax checkers in Visual Studio. It does not generate concise SQL. It does not always generate comprehensible SQL. But it does generate correct SQL, without leaving the developer an opportunity to forget an important space or misspell “SELECT”.

Here is a sample of what some LINQ code might look like for this problem. I kept things kind of simple for this example; this isn’t necessarily code that I would put into production. Notice how, even though the syntax and the order of the clauses are a bit different, the database access code almost looks like an inline SQL query.

		Ticket[] LongVersion(int? TechnicianID, DateTime StartDate, DateTime EndDate)
		{
			ThrowAway.Blog_2013_11_LameTicketDBEntities ctx = new Blog_2013_11_LameTicketDBEntities();

			IEnumerable<Ticket> pancakes = null;   // "var" is for squids.

			if (TechnicianID == null)
			{
				pancakes = from tkt in ctx.Tickets
						   orderby tkt.TicketID
						   select tkt;
			}
			else
			{
				pancakes = (from tech in ctx.Technicians
							where tech.TechID == (int)TechnicianID
							select tech.Tickets).SingleOrDefault();
			}

			//Narrow down by StartDate
			if (StartDate != null)
			{
				pancakes = from tkt in pancakes
						   where tkt.CreationDate >= StartDate
						   select tkt;
			}

			//Narrow down by EndDate
			if (EndDate != null){
				pancakes = from tkt in pancakes
						   where tkt.CreationDate <= EndDate
						   select tkt;
			}

			return pancakes.ToArray();
		}

Or, more succinctly, if lambda expressions don’t make your head explode and if you aren’t that excited about the SQL-ish syntax

		Ticket[] CompactVersion(int? TechnicianID, DateTime StartDate, DateTime EndDate)
		{
			ThrowAway.Blog_2013_11_LameTicketDBEntities ctx = new Blog_2013_11_LameTicketDBEntities();

			IEnumerable<Ticket> pancakes = null;   // "var" is for squids.

			if (TechnicianID == null)
			{
				pancakes = ctx.Tickets;
			}
			else
			{
				pancakes = ctx.Technicians.Where( x => x.TechID == (int)TechnicianID ).SingleOrDefault().Tickets;
			}

			//Narrow down by StartDate
			if (StartDate != null)
			{
				pancakes = pancakes.Where( x => x.CreationDate >= StartDate );
			}

			//Narrow down by EndDate
			if (EndDate != null){
				pancakes = pancakes.Where(x => x.CreationDate <= EndDate);
			}

			return pancakes.OrderBy(x => x.TicketID).ToArray();
		}

One key thing to understand about LINQ to Entities is that the query is not executed against the database until the results of the query are accessed. So in the above examples, the only thing that happens between the start of the function and just before the return statement is that the text of a SQL query is assembled. Since the optional parts of the query are enclosed in if/then statements, the query will not contain any reference to conditions which do not apply, and since the query is not sent to the server until the results are actually needed, no time is wasted generating results that may never be looked at. In both versions of the function, the query is actually executed at the time the ToArray() method is called.

But The Reader Gets the Impression Rick Dislikes LINQ

That would be correct, I do have issues with LINQ, at least as it exists today. But this post is probably already more than long enough … stay tuned for my next post on why I feel LINQ fails to live up to its potential.