Category Archives: Performance

Updating Heaps. Not Awesome.

I’ve found myself speaking a lot lately about things that actually go wrong when heaps are used inappropriately. Most of us have heard that heaps aren’t recommended for update-heavy workloads, but it is still disturbingly common for me to start on a new project and find heaps of heaps. I’m pretty sure that the issue is more that there isn’t a lot of awareness about what kinds of things can break when best practices aren’t followed than it is about anybody not hearing the guidance.

Setup

Basically all that’s needed to demonstrate this particular issue is a database with a table that doesn’t have a clustered index. It’s helpful if the record size for that heap is large enough to be interesting. Not being terribly creative, here’s what I used

USE master
GO

CREATE DATABASE HeapDB
	ON( NAME=HeapDB, FILENAME='C:\your_path_here\HeapDB.mdf', SIZE=10GB, FILEGROWTH=1GB )
	LOG ON( NAME=Heap_log, FILENAME='C:\your_path_here\HeapDB.ldf', SIZE=1GB, FILEGROWTH=1GB)
GO

ALTER DATABASE HeapDB SET RECOVERY SIMPLE;
GO

USE HeapDB
GO

CREATE TABLE dbo.BigHeap(
	ID            INT          NOT NULL,
	BigBlobbyData VARCHAR(MAX) NOT NULL
);
GO

-- Add some data

DECLARE @x   INT = 1;
DECLARE @xvc VARCHAR(10);
WHILE @x <= 1000
BEGIN
	SET @xvc = CAST(@x AS VARCHAR(10));
	INSERT INTO dbo.BigHeap( ID, BigBlobbyData ) VALUES( @x, REPLICATE( '0', 4000 ) + @xvc);

	SET @x += 1;
END
GO

Breaking Stuff

Perhaps the easiest issue to demonstrate on this setup happens if we simply do enough simultaneous updates on a table. For my test, I opened up two new query windows in SSMS and in both windows ran this code (at the same time)

USE HeapDB
GO

UPDATE dbo.BigHeap
	SET BigBlobbyData = REPLICATE( '0', 10000) + CAST(RAND() AS VARCHAR(100))
	WHERE ID = FLOOR(RAND() * 10) + 1;
GO 100000

After a while, it should be possible to scroll through the contents of the messages tab and see something like this


[Screenshot: shot_20170613_1 – deadlock error messages in the SSMS Messages tab]

Fortunately with current versions of SQL Server we have an easy way to dig into deadlocks. Opening up the system_health extended event session and finding the most recent deadlock gives me this graph


[Screenshot: shot_20170613_3 – deadlock graph from the system_health session]
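If you prefer a query to clicking through the GUI, something along these lines pulls the same deadlock reports out of the system_health session files. This is a hedged sketch – the file name pattern and retention may differ on your instance.

SELECT x.event_xml.query('(event/data/value/deadlock)[1]') AS DeadlockGraph
FROM sys.fn_xe_file_target_read_file('system_health*.xel', NULL, NULL, NULL) AS f
	CROSS APPLY (SELECT CAST(f.event_data AS XML) AS event_xml) AS x
WHERE f.object_name = 'xml_deadlock_report';
-- Each row is one deadlock report; the XML is the same graph SSMS renders.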

What does this mean? It means that at some point each session had an exclusive lock on different RIDs (a RID, or row identifier, is essentially just what we call a record in a heap). And each session was attempting to get an update lock on the record which the opposing session exclusively locked. Update locks can not be granted as long as another transaction holds an exclusive lock on that resource, so no amount of waiting will allow us to make progress and a deadlock results.

But why? In a nutshell this particular deadlock happens because there is no structure in a heap. To update a record SQL Server must scan until it is found … but even once the record has been found the search must continue. There is no primary key, so there is no way to know that more records matching the WHERE clause of the update do not exist. If two separate sessions make changes to a record that has not yet been scanned by their counterpart, this kind of deadlock will result.

Will it help if we add a unique nonclustered index? Maybe a little, in that a deadlock may be less likely. But we can still see deadlocks because both the heap and index may need to be updated, and the fact that pages in both need to be locked opens the door to a different kind of deadlock. But more to the point, why would we do this? If we have a candidate key that could be used as a unique index, it may also be a great candidate for a clustering key.
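For completeness, here is a minimal sketch of that fix for the demo table above: promote ID to a clustered primary key (assuming ID really is unique, as it is in this setup) so that updates navigate a B-tree and can stop searching once the key is found, instead of rescanning a heap.

-- Give dbo.BigHeap a clustering key.
ALTER TABLE dbo.BigHeap
	ADD CONSTRAINT PK_BigHeap PRIMARY KEY CLUSTERED (ID);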

The Disappointment of Querying with LINQ. Why I Don’t Love LINQ to Entities.

Generalized search (or catch all, or Swiss army knife) stored procedures are one of the cases where I personally feel that T-SQL does not offer a fantastic solution. Unfortunately, neither does LINQ to SQL nor LINQ to Entities. That said, no tool does a fantastic job with that use case so this is not an indictment of LINQ.

It is true that using LINQ may allow a development team to be more productive, but it is also true that LINQ complicates monitoring and performance tuning. At the end of the day that means that a more skilled (pronounced: expensive) database administration staff may be required.

The Short Version

One big issue with supporting the queries that LINQ generates is that the queries are strange – they almost look like they’ve been through an obfuscator. Another is that there’s not a clean, easy way to add a RECOMPILE or an OPTIMIZE FOR UNKNOWN hint. Sure it’s nice that LINQ doesn’t allow developers to add any crazy, dangerous hint that they want to (*cough* NOLOCK *cough*) but there is a reason that query hints exist. Use of hints should indeed be rare but at times they really are necessary.

The current state of practice in shops that use LINQ (in my experience at least) is for the development team to start out using LINQ, and to re-factor logic into a TSQL-based stored procedure for those cases that LINQ does not handle well. My issue is that those tend to be the really complex cases, and frequently are the cases where we may end up with dynamic SQL inside the stored procedure once we are done performance tuning. As a performance tuning person, that’s exactly the opposite of what I’d like to see. I have no doubt that the average SQL developer can get the easy stuff right. What I would like to see something like LINQ help with is the weird, gnarly, dynamic SQL that goes along with the hardest cases.

Most of us have seen examples of completely insane TSQL that was generated by LINQ. Epic fails are real, I have seen them. I wouldn’t say they’re common though. In my experience what really keeps LINQ from shining is normally that it runs afoul of much more mundane issues such as bad execution plans which arise from the wrong value being sniffed for the wrong parameter at the wrong time. In TSQL when there is a parameter sensitivity issue more often than not a query hint is added, but LINQ makes it difficult to add query hints without really offering an acceptable alternative solution.

The Long Version Part 1: What part of “Project1 Extent1 LEFT JOIN” Don’t You Understand?

I’ll be writing some LINQ code for this post. Don’t worry, I won’t make a habit of it, and I’ll keep things fairly simple for this. To create the database that I use in the examples below, I started with a standard copy of the Adventure Works database. Then I enlarged it using Jonathan Kehayias’ script (read his blog post here). Yes I am still using Adventure Works for examples. Yes I am old. After creating the database I started a new C# console application in Visual Studio and generated an ADO.Net data model from it.

Now consider the following code

static void Main(string[] args)
{
	AdventureWorks2012_bigEntities ctx = new AdventureWorks2012_bigEntities();

	var q =
		from p in ctx.People
		select p.EmailAddresses;

	Console.Out.WriteLine(q.ToArray().Length);
}

If we were to write a TSQL query for the same data by hand, it might look like this

SELECT EmailAddress.*
FROM Person.Person
	INNER JOIN Person.EmailAddress ON Person.BusinessEntityID = EmailAddress.BusinessEntityID;

No, it’s not great that we’re using “SELECT *” but that’s not my point, and contrary to popular belief it actually is possible to get LINQ to only bring back a few columns … but that would be another blog post. For this post, I want to compare the preceding query with the following one, which is what LINQ generates in my lab setup.

SELECT       [Project1].[BusinessEntityID] AS [BusinessEntityID],       [Project1].[C1] AS [C1],       [Project1].[BusinessEntityID1] AS [BusinessEntityID1],       [Project1].[EmailAddressID] AS [EmailAddressID],       [Project1].[EmailAddress] AS [EmailAddress],       [Project1].[rowguid] AS [rowguid],       [Project1].[ModifiedDate] AS [ModifiedDate]      FROM ( SELECT           [Extent1].[BusinessEntityID] AS [BusinessEntityID],           [Extent2].[BusinessEntityID] AS [BusinessEntityID1],           [Extent2].[EmailAddressID] AS [EmailAddressID],           [Extent2].[EmailAddress] AS [EmailAddress],           [Extent2].[rowguid] AS [rowguid],           [Extent2].[ModifiedDate] AS [ModifiedDate],           CASE WHEN ([Extent2].[BusinessEntityID] IS NULL) THEN CAST(NULL AS int) ELSE 1 END AS [C1]          FROM  [Person].[Person] AS [Extent1]          LEFT OUTER JOIN [Person].[EmailAddress] AS [Extent2] ON [Extent1].[BusinessEntityID] = [Extent2].[BusinessEntityID]      )  AS [Project1]      ORDER BY [Project1].[BusinessEntityID] ASC, [Project1].[C1] ASC

No, that’s not super readable as is … but it gets a lot better if we take some time to format the query. Here’s what I came up with after a few minutes of formatting.

SELECT       [Project1].[BusinessEntityID] AS [BusinessEntityID],
       [Project1].[C1] AS [C1],
       [Project1].[BusinessEntityID1] AS [BusinessEntityID1],
       [Project1].[EmailAddressID] AS [EmailAddressID],
       [Project1].[EmailAddress] AS [EmailAddress],
       [Project1].[rowguid] AS [rowguid],
       [Project1].[ModifiedDate] AS [ModifiedDate]
FROM (
    SELECT           [Extent1].[BusinessEntityID] AS [BusinessEntityID],
           [Extent2].[BusinessEntityID] AS [BusinessEntityID1],
           [Extent2].[EmailAddressID] AS [EmailAddressID],
           [Extent2].[EmailAddress] AS [EmailAddress],
           [Extent2].[rowguid] AS [rowguid],
           [Extent2].[ModifiedDate] AS [ModifiedDate],
           CASE WHEN ([Extent2].[BusinessEntityID] IS NULL) THEN CAST(NULL AS int) ELSE 1 END AS [C1]
    FROM  [Person].[Person] AS [Extent1]
          LEFT OUTER JOIN [Person].[EmailAddress] AS [Extent2] ON [Extent1].[BusinessEntityID] = [Extent2].[BusinessEntityID]
)  AS [Project1]
ORDER BY [Project1].[BusinessEntityID] ASC, [Project1].[C1] ASC

One minor point that I won’t dwell on, because it’s kind of nit-picky and probably depends on which version of Entity Framework (EF) is used, is that there is an ORDER BY clause on the query even though I didn’t specify any ordering in my LINQ query. SQL Server licenses are expensive – personally I would prefer that those CPU cycles not be wasted when I could sort the data in my C#. Odds are the application is running on a processor that does not cost as much to license. But again, small point and not my main point.

My main point for this section is this – if you’re not a DBA try to put yourself in your DBA’s shoes. Which query makes more sense to you, the one I wrote in TSQL or the one that was generated by LINQ? If you are looking for a performance issue, which one would you rather try to wrap your head around to make a performance recommendation? And this is a relatively straightforward example. Realistic examples of entity framework code produce much longer and much more confusing queries.

Most EF gurus that I’ve talked to say the answer is simple – don’t look at the query, focus on the execution plan. If it’s a system that I’m supporting then fair enough, I personally am pretty comfortable with that. But I have been working with SQL Server for a very long time, and I am very comfortable reading execution plans. Not everyone is as comfortable with them. That means that workloads which involve LINQ are going to tend to skew towards the more experienced end of the performance DBA spectrum for support. In a smaller shop where one of the developers has been “nominated” as the DBA, or which is getting by with a junior DBA, it’s possible that there just simply won’t be the expertise on staff to work through complicated plans. Even at places which do have an experienced performance person on staff, they’re usually pretty busy already and not looking for more work to do.

One important clarification before I move on – I’m not actually saying that one should not use LINQ because it generates ugly SQL. My point is more that there is a trade-off to keep in mind. I think it’s pretty much inevitable that LINQ to Entities will turn up in most organizations because it lets teams code faster than, say, raw ADO.Net. My point is that there is a cost – both the amount and the complexity of DBA work increase. In all but the smallest organizations I’d suggest planning to bring on a qualified performance DBA before adopting LINQ, which is going to consume some of the cost savings from the development efficiencies.

Come On, EF, Take a Hint Already

So ugly queries are unfortunate, and yes they generate some extra work when they need to be tuned. But we’re all professionals here, that’s not going to be a deal breaker for me. Personally my issue is actually that it is very difficult to solve certain performance issues in LINQ. Consider the following code

class Program
{
	static IQueryable<SalesOrderDetailEnlarged> DemoQuery( int prodID )
	{
		AdventureWorks2012_bigEntities ctx = new AdventureWorks2012_bigEntities();

		var q1 =
			from d in ctx.SalesOrderDetailEnlargeds
			where d.ProductID == prodID
			select d;

		return q1;
	}

	static void Main(string[] args)
	{
		int n1 = DemoQuery(42).ToArray().Length;  // ToArray to make really really extra sure query runs since we're throwing away results
		Console.Out.WriteLine(n1.ToString() + " rows");
	}
}

Which gives me this output

0 rows

Obviously somebody is not a Douglas Adams fan. But we are not here to talk about literature, we’re here to talk performance. This is a fairly small DB that fits in memory so there isn’t much point in studying runtimes. Let’s look at I/O. Running this SQL query on the database server

USE master
GO

SELECT qs.last_logical_reads, st.text, qp.query_plan
FROM sys.dm_exec_query_stats qs
	CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
	CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) qp
WHERE st.text LIKE '%extent1%'  -- Quick and dirty check for a LINQ query
	AND qp.dbid = DB_ID('AdventureWorks2012_big');  -- Check DB ID to prevent past queries against plan cache from matching

I see this

Prod 42 did 3 IO ops

which by itself isn’t very exciting. But now let’s run that same C# code again with a different product. Below I am only showing the lines that changed (updated 42 to 870).

		int n1 = DemoQuery(870).ToArray().Length;  // ToArray to make really really extra sure query runs since we're throwing away results
		Console.Out.WriteLine(n1.ToString() + " rows");

We see a lot more rows and a lot more IO

192208 rows

Prod 870 did 192208 IO ops

Assuming each read looks at a page and that we are using 8KB pages, that means that this query is sifting through 4,619,304 KB of data. But if we check the size of the table (e.g. using sp_spaceused), the table only contains 601,472 KB worth of data. To process this query we just read more than 7.5 times as much data as we would have by just reading the entire table in one shot.
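For reference, the size check mentioned above is just a one-liner:

EXEC sp_spaceused N'Sales.SalesOrderDetailEnlarged';  -- reports reserved, data, and index space for the table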

What’s happening here? Let’s look at those two columns that I’ve been ignoring from the SQL query above (text and query_plan). They contain the query and the execution plan (I did a little formatting for clarity).

SELECT
       [Extent1].[SalesOrderID] AS [SalesOrderID],
       [Extent1].[SalesOrderDetailID] AS [SalesOrderDetailID],
       [Extent1].[CarrierTrackingNumber] AS [CarrierTrackingNumber],
       [Extent1].[OrderQty] AS [OrderQty],
       [Extent1].[ProductID] AS [ProductID],
       [Extent1].[SpecialOfferID] AS [SpecialOfferID],
       [Extent1].[UnitPrice] AS [UnitPrice],
       [Extent1].[UnitPriceDiscount] AS [UnitPriceDiscount],
       [Extent1].[LineTotal] AS [LineTotal],
       [Extent1].[rowguid] AS [rowguid],
       [Extent1].[ModifiedDate] AS [ModifiedDate]
      FROM [Sales].[SalesOrderDetailEnlarged] AS [Extent1]
      WHERE [Extent1].[ProductID] = @p__linq__0


Plan with key lookup

Here is what I noticed when I looked at these. First, this is a parameterized query. On the whole that’s probably a good thing. When parameterized SQL is used, by default SQL Server will reuse the execution plan from the first call to the query, which on the whole is also a good thing. But in this specific case, the first call to this query didn’t return any data, so it generated an execution plan that doesn’t work well when a large amount of data comes back. The plan is using an index seek to locate records first and then as a second step retrieving data from the table with a key lookup operator. That’s a fantastic plan for cases where we only expect a few rows, but when a large amount of data is scattered through the table, this plan tends to hit each page multiple times.

For comparison, I ran DBCC FREEPROCCACHE to clear the plan cache (don’t do this in prod) and then re-ran my C# code which is querying for product ID 870. We now see a much more reasonable amount of IO, and an execution plan which simply does a single scan. This is a fantastic plan for product ID 870 even if it isn’t ideal for product ID 42, and it is not as bad for product 42 as the first plan was for 870.

IO for product 870 now 75,763


Plan now scans
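As an aside, flushing the entire plan cache was a blunt instrument for this demo. A single cached plan can be evicted by its handle instead – a hedged sketch, using the same quick-and-dirty text filter as earlier:

DECLARE @plan_handle VARBINARY(64);
SELECT TOP (1) @plan_handle = qs.plan_handle
FROM sys.dm_exec_query_stats AS qs
	CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
WHERE st.text LIKE '%Extent1%'
	AND st.text LIKE '%SalesOrderDetailEnlarged%';

DBCC FREEPROCCACHE (@plan_handle);  -- evicts only this plan; still use care in production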

This is of course a pretty straightforward case of parameter sensitivity. In this particular case, parameter sniffing is hurting us rather than helping. Parameter sniffing is a SQL Server behavior, not a LINQ behavior, so why am I discussing it here? I have devoted this much time to it because this kind of issue is typically fixed by adding either a RECOMPILE or an OPTIMIZE FOR hint, and EF (as well as LINQ to SQL before it) doesn’t have a great way to add query hints. This is already a long post so I won’t dive into these hints in depth, but briefly the RECOMPILE hint prevents SQL Server from caching the plan (no plan reuse, fresh compile every time) while the OPTIMIZE FOR hint allows for plan reuse but changes which parameter value is used for optimization during compilation.
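For reference, this is roughly the shape of the fix we wish EF could emit for us: the parameterized statement from above with a hint appended by hand. The column list is trimmed for brevity, and which hint to use depends on the situation, as described above.

EXEC sys.sp_executesql
	N'SELECT [Extent1].[SalesOrderID], [Extent1].[SalesOrderDetailID], [Extent1].[ProductID], [Extent1].[OrderQty]
	  FROM [Sales].[SalesOrderDetailEnlarged] AS [Extent1]
	  WHERE [Extent1].[ProductID] = @p__linq__0
	  OPTION (OPTIMIZE FOR UNKNOWN)',   -- or OPTION (RECOMPILE) for a fresh plan on every call
	N'@p__linq__0 INT',
	@p__linq__0 = 870;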

Surviving LINQ

So the problem stands – LINQ is part of our landscape, the issue discussed above probably isn’t going away anytime soon, and we still have to administer our servers. There are a few ways to do this.

Plan Guides

I personally get the most mileage out of plan guides. Plan guides are horrible. And they suck. And they are bad. Did I mention that they suck? At the end of the day, however much I may hate using plan guides in general, they do work for this problem. Plan guides change the way that a query is optimized. One of the things that we can do with them is to instruct the optimizer to inject a query hint every time it sees a given query. Here is what I typically do personally when I’m trying to fix a problematic LINQ workload (a T-SQL sketch of the key steps follows the list).

  1. Watch the plan cache for queries that have bad plans (and also appear to be from LINQ. It’s obvious which ones are from LINQ, trust me). Discussion on how to do this is out of scope for this post.
  2. Find the handle of the bad plan, e.g. from dm_exec_query_stats
  3. Call sp_create_plan_guide_from_handle to create a plan guide. At this point we have pinned the bad plan into the cache. Do not stop here.
  4. In Management Studio, locate the new plan guide, right click on it, and generate a DROP and CREATE script for it.
  5. In the create part of the resulting script, delete the hard-coded plan (it’s a big ugly blob of XML)
  6. Add the hint you’d like to the @hints parameter (probably OPTIMIZE FOR if there is a parameter value that yields a “good enough” plan for all values, and probably RECOMPILE if this isn’t the case)
  7. Run the script to re-create the guide.
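Here is a hedged T-SQL sketch of those steps. The guide name is made up, the text filter is the same quick-and-dirty trick from earlier, and the @stmt text below is abbreviated – in real life it must match the cached statement exactly, which is why I script it out of the auto-generated guide rather than typing it.

-- Steps 2 and 3: pin the currently cached plan into a guide straight from its handle.
DECLARE @handle VARBINARY(64);
SELECT TOP (1) @handle = qs.plan_handle
FROM sys.dm_exec_query_stats AS qs
	CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
WHERE st.text LIKE '%Extent1%';

EXEC sp_create_plan_guide_from_handle
	@name = N'LinqGuide_Example',
	@plan_handle = @handle;

-- Steps 4 through 7: script the guide out, drop it, and re-create it with a hint
-- in @hints instead of the hard-coded XML plan.
EXEC sp_control_plan_guide N'DROP', N'LinqGuide_Example';

EXEC sp_create_plan_guide
	@name   = N'LinqGuide_Example',
	@stmt   = N'SELECT ... FROM [Sales].[SalesOrderDetailEnlarged] AS [Extent1] WHERE [Extent1].[ProductID] = @p__linq__0',  -- paste the exact statement text here
	@type   = N'SQL',
	@module_or_batch = NULL,
	@params = N'@p__linq__0 int',
	@hints  = N'OPTION (OPTIMIZE FOR UNKNOWN)';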

Query Store

Query store is a new feature in SQL Server 2016. This should be a much easier way of getting an effect similar to what I describe for plan guides above, assuming that the query is recompiled often enough that a “good” plan can be found in the store. As long as there is a single plan that always works well enough, simply pin that execution plan using the GUI. The main reasons that I do not list this first are that I haven’t personally done this yet (haven’t done a MSSQL 2016 project that also uses LINQ yet), and that there are, at the time of this writing, still relatively few places that have 2016 running in production.
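For what it’s worth, the same thing can be done in T-SQL rather than the GUI. A hedged sketch – the query_id and plan_id values below are placeholders that you would look up first:

-- Find the LINQ query and its plans in the store.
SELECT qsq.query_id, qsp.plan_id, qst.query_sql_text
FROM sys.query_store_query AS qsq
	JOIN sys.query_store_query_text AS qst ON qst.query_text_id = qsq.query_text_id
	JOIN sys.query_store_plan AS qsp ON qsp.query_id = qsq.query_id
WHERE qst.query_sql_text LIKE '%Extent1%';

-- Force the plan that is good enough for every parameter value (IDs are hypothetical).
EXEC sys.sp_query_store_force_plan @query_id = 42, @plan_id = 7;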

Stored Procedures

A more traditional way of coping with LINQ strangeness in general is to recode the problematic query in TSQL as a stored procedure. This is an excellent approach for some cases, but frequently pushing SELECT statements down into a stored procedure is a problem if there is any amount of polymorphism. Is it possible to list all of the queries that the code may issue? Do most of the queries always join to the same set of tables regardless of parameter values? Is the column list that gets returned always the same? If the answer to all of these questions is yes, then it won’t be too painful to wrap a lot of the SELECT logic into stored procedures. But at the same time, if there is that little polymorphism, is there really much benefit to using LINQ?

Entity Framework Interceptors

In the Entity Framework, at least in recent versions, it’s possible for the developers to modify how TSQL code gets generated by using an interceptor. In my mind at least, interceptors have a lot in common with plan guides. They allow us to inject query hints, and they are kind of icky. Personally if I’m the performance DBA for a project I’d prefer using a plan guide because it usually requires less coordination. I am likely to be the one that decides to fix a problem query using a hint, and if my only two options are plan guides and interceptors then at least the plan guide can be added without coordinating a code change with the developers. Besides, I don’t want anybody getting the idea that they could write an interceptor that litters the code with NOLOCK hints, and it’s possible this could happen if I call attention to that particular EF feature.

Priming the Cache

I mention this mostly to try to convince the reader not to ever try it. This will not work in the long run. As we saw earlier, it may be possible to avoid parameter sensitivity issues by just ensuring that the first execution of each query has reasonable parameter values. Why not just run a controlled, hand-tuned workload to load good plans into the cache?

The most immediate problem is that plans are added to and ejected from the cache all the time. Does your data change? If so then so do your stats, and after a relevant statistic changes the query will be recompiled. Is your workload really predictable enough to do this? If not, it’s possible one of the plans that you “seeded” can fall out of the cache if it isn’t used at a certain time of day and … again … cause a new compilation next time it is executed. Most importantly, does anybody on the team have the cycles to maintain a priming script as the development team changes the behavior of the product over time?

And of course, there is the assumption that there is one single execution plan that is reasonable for all parameter values. That’s not always the case.

What Would Make Me Happy

In conclusion, I should probably try to be constructive and discuss possible solutions. Not that I have the answers. If I knew how to make something better than LINQ I would not be a nearly anonymous performance tuning professional that writes overly long blog posts. I would be a famous language architect who is too busy living the good life to be bothered writing.

One thing that I would not change – I’m not sure I’d add an easy way for developers to add any hint that they’d like to. Yes all query hints do exist for a reason and we do need a way to use them, but most hints really are not needed that often. For the one occasion every year that I decide I need to use a FORCESEEK hint I don’t think I’d necessarily mind using a plan guide. Much. It’s a pain, but I am happy to live with a little discomfort if it means that I don’t really have to worry about NOLOCK hints getting overused anymore.

I guess what I’d personally like to see would be extension methods for the hints that are actually needed relatively often in practice. For example wouldn’t it be cool if we could do this

            var q1 =
                (from d in ctx.SalesOrderDetailEnlargeds
                where d.ProductID == prodID
                select d).OptimizeForUnknown();

or if OptimizeForUnknown is too MSSQL-centric maybe

            var q1 =
                (from d in ctx.SalesOrderDetailEnlargeds
                where d.ProductID == prodID
                select d).SuppressParameterSniffing();

The Potential of LINQ. Why I Really Wish I Could Love LINQ to Entities.

Executive Summary

Generalized search (or catch all, or Swiss army knife) stored procedures are one of the cases where I personally feel that T-SQL does not offer a fantastic solution – my feeling is that we usually need to choose between maintainability (LINQ is likely to seem more maintainable than the “sea of red” that the stored procedure approach becomes) and performance (when parameter sniffing becomes an issue, LINQ will suffer). The careful use of LINQ to Entities would be able to help us if it were not for the fact that there is no way to supply a query hint for those cases where parameter sniffing becomes an issue. The only way that I’ve discovered to work around this shortcoming is to use plan guides, which are not very maintainable IMO. That’s a long-winded way of saying that dynamic SQL or dynamic string execution in T-SQL may currently be the least bad solution.

The Problem

Whether a database stores purchase order information, technical support tickets, employee data, descriptions of Doctor Who episodes, or something else entirely, these days it is likely that a software frontend will eventually sit between the database and the data consumer. Frequently the software will include some kind of search screen with about a gazillion optional fields where the user can potentially enter data. For example, if we are working on a system for creating and tracking tech support tickets then the user might expect to see a screen someplace in our product that lets them filter by things like creation date, assigned technician, customer, system location, completion date, etc.

Typically the expectation is that this search functionality will be implemented by calling a single stored procedure to perform this search regardless of what combination of parameters is supplied. If the development team has not been down this road before or warned about the pitfalls, typically the optional parameters will be handled by using the “@parm IS NULL OR @parm=val” approach. For example a greatly simplified procedure for our ticketing system may look like this

CREATE PROCEDURE dbo.SearchAllTheThings
  @StartDateCreation DATE = NULL,
  @EndDateCreation DATE = NULL,
  @TechnicianID TINYINT = NULL
AS
	SELECT TicketID, CustomerID, CompletionDate, TicketText
	FROM dbo.Ticket
	WHERE (@StartDateCreation IS NULL OR Ticket.CreationDate >= @StartDateCreation)
	AND (@EndDateCreation IS NULL OR Ticket.CreationDate <= @EndDateCreation)
	AND (@TechnicianID IS NULL OR EXISTS(SELECT * FROM dbo.TicketTechnician
		WHERE TicketTechnician.TicketID = Ticket.TicketID AND TicketTechnician.TechID = @TechnicianID))
GO

and if we ask for the actual plan when calling it like this (requesting all tickets created on October 1, i.e. all tickets created between October 1 and October 1).

exec dbo.SearchAllTheThings '20131001', '20131001'

we may see a plan like this

What’s up with that nested loop and the concatenation operator? Since we did not specify a parameter for the technician there is no reason for us to look at anything but the ticket table. The clustered index scan should have us covered – well, in the absence of a better index.

The reason, of course, is that SQL Server will reuse the plan for future calls. Even though the plan is optimized for the current set of parameters, it needs to produce correct results for all possible combinations of parameters. The only way this is possible is for the plan to include these extra operators.
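One way to see that those extra operators exist only to keep a single cached plan valid for every parameter combination is to give up plan reuse and compile per call. With OPTION (RECOMPILE) the optimizer can treat the NULL parameters as constants and prune the branches that cannot apply. It costs compile CPU on every execution, so treat this as an alternative to weigh against the rewrite in the next section rather than part of it – a minimal sketch against the procedure above:

ALTER PROCEDURE dbo.SearchAllTheThings
  @StartDateCreation DATE = NULL,
  @EndDateCreation DATE = NULL,
  @TechnicianID TINYINT = NULL
AS
	SELECT TicketID, CustomerID, CompletionDate, TicketText
	FROM dbo.Ticket
	WHERE (@StartDateCreation IS NULL OR Ticket.CreationDate >= @StartDateCreation)
	AND (@EndDateCreation IS NULL OR Ticket.CreationDate <= @EndDateCreation)
	AND (@TechnicianID IS NULL OR EXISTS(SELECT * FROM dbo.TicketTechnician
		WHERE TicketTechnician.TicketID = Ticket.TicketID AND TicketTechnician.TechID = @TechnicianID))
	OPTION (RECOMPILE);  -- fresh plan per call; branches for unused parameters are optimized away
GO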

Fixing With Dynamic SQL

Typically a good solution for this issue is to rewrite the stored procedure so that it builds an appropriate query on the fly, depending on which parameters are actually used, and then to execute using either exec or sp_executesql depending on whether compilation overhead or parameter sniffing are of greater concern. So in this case we could rewrite the stored procedure as

CREATE PROCEDURE dbo.ImprovedSearchAllTheThings
  @StartDateCreation DATE = NULL,
  @EndDateCreation DATE = NULL,
  @TechnicianID TINYINT = NULL
AS
	DECLARE @query NVARCHAR(MAX);
	DECLARE @parms NVARCHAR(300);

	SET @parms = N'@StartDateCreation DATE, @EndDateCreation DATE, @TechnicianID TINYINT';

	SET @query = N'SELECT TicketID, CustomerID, CompletionDate, TicketText FROM dbo.Ticket WHERE 1=1';
	IF @StartDateCreation IS NOT NULL
	BEGIN
		SET @query = @query + N' AND Ticket.CreationDate >= @StartDateCreation';
	END
	IF @EndDateCreation IS NOT NULL
	BEGIN
		SET @query = @query + N' AND Ticket.CreationDate <= @EndDateCreation';
	END
	IF @TechnicianID IS NOT NULL
	BEGIN
		SET @query = @query + N' AND EXISTS(SELECT * FROM dbo.TicketTechnician WHERE TicketTechnician.TicketID = Ticket.TicketID AND TicketTechnician.TechID = @TechnicianID)';
	END

	EXEC dbo.sp_executesql @query, @parms, @StartDateCreation, @EndDateCreation, @TechnicianID
GO

When we request the actual execution plan while calling

exec dbo.ImprovedSearchAllTheThings '20131001', '20131001'

And now the plan looks a little cleaner

[Screenshot: 2013_12_isnull_improved_plan – execution plan without the extra operators]

A Face Only a DBA Could Love

As a rule, my experience has been that developers hate any form of dynamic SQL. At a visceral level, I believe it feels as if dynamic SQL flies in the face of at least 15 years of database programming best practice thinking (to be clear it does not, at least IMO). There was a time when it was common to slam tons of strings, some of which may have come from an end user, together in code to assemble a query. This practice is what made SQL injection attacks so effective at the time. Far and away the most popular mitigation against SQL injection has been various kinds of parameterized SQL such as the sp_executesql procedure used above or using parameters with the humble SqlCommand object. Over time as the community continues to hammer into developers’ heads that parameterized SQL is a very very good thing, many have lost sight of the fact that the real issue is including user-supplied input into dynamically assembled SQL … many today are of the opinion that ALL dynamic SQL is dangerous (or at least icky).

And I actually agree with them, dynamic SQL is icky. Don’t get me wrong, I’m not saying that dynamic SQL should not be used. I actually think it is often demonstrably the best approach in cases like the above. What I am saying is that it is the best of a handful of bad options. We keep hoping for a good one and that’s one of the reasons LINQ has become so popular over the objections of many many DBAs.

I’m sure different folks have different reasons for disliking dynamic SQL. My issues with it are testing and maintainability. Testing because if there are a large number of parameters and complicated logic, the number of potential query strings that could be produced can increase exponentially, which makes it impossible to be absolutely sure that there is no possible combination that is missing a space somewhere important. Maintainability because there is no syntax help from the IDE when working inside text strings – everything is in a “sea of red” – which means anyone who isn’t familiar with how the procedure is put together will need to spend some time orienting themselves before they touch anything.

Not such big problems if you hire some fancy-pants consultant with a blog and never have to modify the code after they leave, but these are problems for mere mortals.

So How Was LINQ Supposed to Help?

I don’t want to get too bogged down in discussing why LINQ is so exciting from a development perspective, but briefly it is an acronym for “Language-Integrated Query” and is pretty awesome because that’s exactly what it does. It allows a developer to drop something that looks an awful lot like a query into the middle of their code and work with the results of the query as objects instead of flat data. All this without tens or hundreds of lines mapping parameters, looping through readers, and stuffing data into objects by hand. As somebody who uses LINQ to Entities in secret, late at night, when nobody else is looking and I don’t care about parameter sniffing issues I can say it really is pretty awesome.

The reason I wish I could love LINQ, as a DBA, is that all of this happens through SQL rather than through some proprietary library. In other words, LINQ is a SQL generator that is fully supported by IntelliSense and syntax checkers in Visual Studio. It does not generate concise SQL. It does not always generate comprehensible SQL. But it does generate correct SQL without leaving the developer an opportunity to forget an important space or misspell “SELECT”.

Here is a sample of what some LINQ code may look like for this problem. I kept things kind of simple for this example; this isn’t necessarily code that I would put into production. Notice how, even though the syntax and order of clauses is a bit different, the database access code almost looks a little bit like an inline SQL query.

		Ticket[] LongVersion(int? TechnicianID, DateTime? StartDate, DateTime? EndDate)
		{
			ThrowAway.Blog_2013_11_LameTicketDBEntities ctx = new Blog_2013_11_LameTicketDBEntities();

			IEnumerable<Ticket> pancakes = null;   // "var" is for squids.

			if (TechnicianID == null)
			{
				pancakes = from tkt in ctx.Tickets
						   orderby tkt.TicketID
						   select tkt;
			}
			else
			{
				pancakes = (from tech in ctx.Technicians
							where tech.TechID == (int)TechnicianID
							select tech.Tickets).SingleOrDefault();
			}

			//Narrow down by StartDate
			if (StartDate != null)
			{
				pancakes = from tkt in pancakes
						   where tkt.CreationDate >= StartDate
						   select tkt;
			}

			//Narrow down by EndDate
			if (EndDate != null){
				pancakes = from tkt in pancakes
						   where tkt.CreationDate <= EndDate
						   select tkt;
			}

			return pancakes.ToArray();
		}

Or, more succinctly, if lambda expressions don’t make your head explode and if you aren’t that excited about the SQL-ish syntax

		Ticket[] CompactVersion(int? TechnicianID, DateTime? StartDate, DateTime? EndDate)
		{
			ThrowAway.Blog_2013_11_LameTicketDBEntities ctx = new Blog_2013_11_LameTicketDBEntities();

			IEnumerable<Ticket> pancakes = null;   // "var" is for squids.

			if (TechnicianID == null)
			{
				pancakes = ctx.Tickets;
			}
			else
			{
				pancakes = ctx.Technicians.Where( x => x.TechID == (int)TechnicianID ).SingleOrDefault().Tickets;
			}

			//Narrow down by StartDate
			if (StartDate != null)
			{
				pancakes = pancakes.Where( x => x.CreationDate >= StartDate );
			}

			//Narrow down by EndDate
			if (EndDate != null){
				pancakes = pancakes.Where(x => x.CreationDate <= EndDate);
			}

			return pancakes.OrderBy(x => x.TicketID).ToArray();
		}

One key thing to understand about LINQ to Entities is that the query is not executed against the database until the results of the query are accessed. So in the above examples, the only thing that happens from the start of the function until just before the return statement is that the text of a SQL query is assembled. Since the optional parts of the query are enclosed in if/then statements the query will not contain any reference to conditions which do not apply, and since the query is not sent to the server until the results are actually needed no time is wasted generating results that may not be looked at. In both versions of the function, the query is actually executed at the time the ToArray() method is called.

But The Reader Gets the Impression Rick Dislikes LINQ

That would be correct, I do have issues with LINQ, at least as it exists today. But this post is probably already more than long enough … stay tuned for my next post on why I feel LINQ fails to live up to its potential.

 

Why did execution times get unpredictable AFTER going production? Could be the ascending key.

Executive Summary

It is common for development to be performed against representative data which is much smaller than what may be expected in real life and, more significantly for this discussion, that does not update during the development process. If this data contains an ascending key (or a non-key date column that is frequently used to filter data) and statistics are not managed, there is a good chance that the statistics will auto update frequently and predictably immediately after deployment but infrequently and unpredictably later on as the size of the data set grows. This can lead to drastically different execution plans and wild swings in execution time for some operations from day to day. The most direct way to fix this issue is to manage any relevant statistics more closely. For example, if a large ETL operation is involved then updating statistics immediately after the data is loaded can work wonders for consistent performance. For more information on this issue, other than the below see, for example, sqlinthewild.

The Meat

Can’t quite recall why exactly I thought I wouldn’t be very busy this fall, but I do need to be better about making time to blog. As will probably be usual for me, the issue discussed in this post is actually well understood by the community but is not understood by many normal database developers that I encounter in my work. In other words, I’m not covering any new ground here, but knowledge does not seem to have been adequately transferred from the SQL gurus to the rank-and-file professionals (especially those in smaller shops). Another write-up certainly can’t hurt. And of course there are many many many things that can cause performance to go wonky after going production; this is only one of them.

This is a toy example loosely based on a recent case. At the point I was called in to help, a group of developers which did not include a dedicated DBA had been working on the project for a few years. The biggest concern that the development team had at this point was unpredictable performance in one of their ETL jobs. Each morning data was aggregated from a handful of different database servers and then some computation was done on the aggregated values. On some days the computation portion of this job was blazing fast, and on other days it was extremely slow. This behavior did not manifest itself during their testing. So let’s say that this is what we see when we graph out the computation time.

Chart of crazy runtimes

Except obviously in the real world the times would probably be measured in something other than milliseconds.

Without knowing more about the problem this kind of extreme difference in execution times kind of screams that an inappropriate execution plan may have been in play due to a statistics issue. As a starting point, let’s take a gander at the plan for the most recent day executed.

EXEC dbo.SimulateCalculation '20130930'

Wrong plan for this data

Sure enough, I see plenty to hate here just picking on the first statement in the plan. Without digging into any of the numbers, two things jump out immediately. First, we are using nested loop joins, which I would normally expect to see with smaller data sets. Second, all of the operations are seeks when I would have expected to see scans on two of the tables since thousands of records are coming back from each. Looking at the properties for the seek on LaborDetail would seem to clinch it. The plan only expected a single row to come back when there were actually almost 9000 rows.

At this point, it is tempting to say “I know what this is. This is a parameter sniffing issue! Let’s throw a recompile or “optimize for” hint on this puppy and then get on with our day.” Unfortunately, it is not that simple. The compiled and runtime parameter values are identical (and if we wanted to cheat, we could look at the stored procedure definition, which actually has “WITH RECOMPILE” to simplify my problem recreation. Kids, don’t ever use “WITH RECOMPILE” in real life).

Properties window showing matching values

So where do we go from here? When it seems that the optimizer is making very poor decisions that does usually indicate some kind of statistics issue so this would probably be a good time to look directly at the relevant statistics.
Histogram showing no recent dates

Note that the window is actually scrolled all the way to the end. The last step in the histogram really is for September 12.

What’s going on here? The immediate problem is that the statistics are out of date. The last step in the histogram is for September 12, which means that for any date after September 12 the statistics indicate there “should” be no rows. Put simply, SQL Server “thinks” that it “knows” for certain that there is no data in the table more recent than September 12. SQL Server generally will not (never will?) actually use zero for an estimated cost or number of rows. It uses 1 instead. I assume the reason for this is that estimating zero rows would make it impossible for the optimizer to distinguish between good and bad plans on the affected branch of the query plan (I do need to remember to get verification on that someday). Anyway, when the stored procedure is compiled the optimizer operates on the assumption that only 1 row will come back from each of the three tables which is the reason seeks appear instead of scans, and also the reason we see a nested loop join.

The reason the statistics are out of date is that this example is relying on auto updating of statistics. Recall that for larger tables, statistics are not invalidated until the table has accumulated at least (500 + 20% of the table size) changes, and statistics are not updated until the first use after invalidation. This means that if we load the same amount of data into the table each morning eventually we will reach a point where statistics are not updated each morning, or even every week, because each day’s load is a progressively smaller portion of the total data (a quick way to watch the modification counter behind this threshold is sketched after the list below). There are at least a couple of reasons this is an issue for projects like this

  • Typically the unit tests that exercise the relevant code work against a static data set (because the load functionality is exercised in a separate set of unit tests – that is the nature of unit testing). No matter how robust and interesting this data set is, the fact that it doesn’t change means we have no chance of detecting this behavior in unit testing. Thorough integration testing could detect this but most smaller development shops that I’ve encountered are much stronger on unit than integration testing (if they are strong on testing at all).
  • When the system first goes into production, the tables are probably small at first so this issue will not appear for weeks or even months. It is entirely possible that by the time this happens the development team will have started to work on another project which means it may take them a while to come back up to speed on the code if their assistance is needed.
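As promised above, a quick way to keep an eye on this once the system is live is sys.dm_db_stats_properties (available from SQL Server 2008 R2 SP2 / 2012 SP1 onward), which exposes the row-modification counter behind the auto-update threshold. A hedged sketch against the table used later in this post:

SELECT s.name AS stats_name,
       sp.last_updated,
       sp.rows,
       sp.modification_counter   -- changes accumulated since the last statistics update
FROM sys.stats AS s
	CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID('dbo.LaborDetail');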

Since this post has been focused on large daily data loads, my preferred fix for this is to simply update the statistics at the end of each day’s data load using the UPDATE STATISTICS statement. Since each day extends the range of the histogram, each day is going to make a statistically significant change in a very focused period which should make a statistics update a no-brainer. That is not to say AUTO_UPDATE_STATISTICS should be turned off – it should definitely be left on. We just should not be depending on it for this particular index.
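A minimal sketch of that fix, using the table and clustered primary key defined later in this post (FULLSCAN is my own preference here; a sampled update may be perfectly adequate):

-- Run at the end of each day's load, before the calculation step.
UPDATE STATISTICS dbo.LaborDetail PK_LaborDetail WITH FULLSCAN;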

I would not be tempted to use trace flag 2389 in this case because of the fact that the entire data load happens at once so there is an obvious point in time at which statistics could be manually updated. There is no reason for us to consider mucking with the way statistics update works based on this information alone. Along the same lines, I also would not personally use trace flag 2371 for the same reason – increasing the frequency at which automatic updates happens does not change the fact that we ought to be managing this particular statistic. Further, flag 2371 doesn’t really kick in until the table is starting to get pretty large. That not only means that the issue would have kicked in long before 2371 starts to operate but in reality by the time 2371 starts to make a difference we probably would have already started breaking the table into smaller pieces for performance reasons (see for example Kimberly Tripp).

The Gristle

The observant reader will notice an issue with my chart. My example of September 30, one of the dates which had the statistics issue, actually corresponds to a fast runtime. In fact, towards the end of the chart where several days pass between updates, there are only a handful of spikes at a time when almost all of the days should have inappropriate execution plans. The really observant reader will notice that the last spike in the chart actually corresponds to September 12, which was the last time the statistics auto updated. So this is a day that should have had a great execution plan.

What???!!!??? What is happening in this example is that, on the days where the statistics are auto updated, the statistics are invalidated when the data is loaded but are not actually recomputed until the tables are used by the daily computation. Since the example marks time in terms of milliseconds the statistics update actually dominates the runtime on the days where it occurs – that will not be the case in the real world because the computations will be a lot more expensive. Last time I encountered a real world example of this issue the job would run in tens of minutes on days where the statistics were current but would take several hours on days where they were not.

More to the point, even in the extremely unlikely event we saw this exact same pattern of the statistics update being more expensive than the computation that triggered it, in the real world it probably would not be a good idea to avoid updating statistics for the sake of speeding up the nightly jobs. The entire point of doing this kind of work overnight is to make the daily workload run as quickly as possible and the daily workload will run better if stats are current. If the stats need to be updated anyway, it may as well get done before any computation or aggregation is done on the data. I wouldn’t even really advise using AUTO_UPDATE_STATISTICS_ASYNC in this case. I could see an argument for asynchronous update when data is coming in gradually, but when data changes suddenly I really do think it’s best to wait for the statistics before proceeding.

I thought about leaving the chart out entirely, but thought that it still had some value in giving the reader the flavor of the swings in execution time.

The Guts

The tables used for this example are defined below. I decided to use a time and effort system for this example. The employee table is populated by copying approximately 10,000 records from the AdventureWorks2012 Person table and then deleting those rows that have duplicated employee names. There are 8 Project_Hours tables. Imagine that the data from these 8 tables each came from a separate data source so computations could not be run across projects until all 8 tables were gathered onto the same server. No, I have never seen an actual time and effort system distributed in this way but my real world model for this was not a T&E system. Also note that the join between employee and the project tables is done by name instead of employee ID. Again, the real world model for this example was not a T&E system and the real world equivalent of this join was not as bizarre – but I needed to use something besides employee ID to ensure that table scans as opposed to seeks would be optimal for large amounts of data. Pretend each project manager manually tracked time themselves and did not know the IDs of the employees on the project. Data from all 8 project tables is accumulated into a labor detail table (by date and employee ID). The TimeHistogram table keeps track of execution times for reporting purposes.

USE AscendingDate

CREATE TABLE dbo.Employee(
	EmployeeID  INT NOT NULL IDENTITY,
	FirstName   NVARCHAR(50) NOT NULL,
	MiddleName  NVARCHAR(50)     NULL,
	LastName    NVARCHAR(50) NOT NULL,
	DeptID      INT NOT NULL CONSTRAINT DF_DeptID DEFAULT 0,
	Wage        NUMERIC(4,2) NOT NULL,
	AGazillionMoreFields NCHAR(2000) NOT NULL CONSTRAINT DF_AGazillionMoreFieds DEFAULT N'',
	CONSTRAINT PK_Employee PRIMARY KEY CLUSTERED(EmployeeID)
);

INSERT INTO	dbo.Employee(FirstName, MiddleName, LastName, Wage)
	SELECT FirstName, MiddleName, LastName, CAST(CAST(CRYPT_GEN_RANDOM(1) AS TINYINT) as DECIMAL(5,2)) / 50 + 10
	FROM Adventureworks2012.Person.Person WHERE BusinessEntityID <= 10000;

DELETE e1
FROM dbo.Employee e1
INNER JOIN(
	SELECT FirstName, MiddleName, LastName
	FROM dbo.Employee
	GROUP BY FirstName, MiddleName, LastName
	HAVING COUNT(*) > 1
) e2 ON e1.FirstName = e2.FirstName AND ISNULL(e1.MiddleName, '') = ISNULL(e2.MiddleName, '') AND e1.LastName = e2.LastName;

CREATE TABLE dbo.ProjectHours_A(
	FirstName   NVARCHAR(50) NOT NULL,
	MiddleName  NVARCHAR(50)     NULL,
	LastName    NVARCHAR(50) NOT NULL,
	HoursWorked DECIMAL(3,2) NOT NULL CONSTRAINT DF_ProjectHoursA_HoursWorked DEFAULT CAST(CAST(CRYPT_GEN_RANDOM(1) AS TINYINT) as DECIMAL(5,2)) / 128
);

CREATE UNIQUE CLUSTERED INDEX UQ_ProjectHoursA_LastName_FirstName_MiddleName ON dbo.ProjectHours_A(LastName, FirstName, MiddleName);

CREATE TABLE dbo.ProjectHours_B(
	FirstName   NVARCHAR(50) NOT NULL,
	MiddleName  NVARCHAR(50)     NULL,
	LastName    NVARCHAR(50) NOT NULL,
	HoursWorked DECIMAL(3,2) NOT NULL CONSTRAINT DF_ProjectHoursB_HoursWorked DEFAULT CAST(CAST(CRYPT_GEN_RANDOM(1) AS TINYINT) as DECIMAL(5,2)) / 128
);

CREATE UNIQUE CLUSTERED INDEX UQ_ProjectHoursB_LastName_FirstName_MiddleName ON dbo.ProjectHours_B(LastName, FirstName, MiddleName);

CREATE TABLE dbo.ProjectHours_C(
	FirstName   NVARCHAR(50) NOT NULL,
	MiddleName  NVARCHAR(50)     NULL,
	LastName    NVARCHAR(50) NOT NULL,
	HoursWorked DECIMAL(3,2) NOT NULL CONSTRAINT DF_ProjectHoursC_HoursWorked DEFAULT CAST(CAST(CRYPT_GEN_RANDOM(1) AS TINYINT) as DECIMAL(5,2)) / 128
);

CREATE UNIQUE CLUSTERED INDEX UQ_ProjectHoursC_LastName_FirstName_MiddleName ON dbo.ProjectHours_C(LastName, FirstName, MiddleName);

CREATE TABLE dbo.ProjectHours_D(
	FirstName   NVARCHAR(50) NOT NULL,
	MiddleName  NVARCHAR(50)     NULL,
	LastName    NVARCHAR(50) NOT NULL,
	HoursWorked DECIMAL(3,2) NOT NULL CONSTRAINT DF_ProjectHoursD_HoursWorked DEFAULT CAST(CAST(CRYPT_GEN_RANDOM(1) AS TINYINT) as DECIMAL(5,2)) / 128
);

CREATE UNIQUE CLUSTERED INDEX UQ_ProjectHoursD_LastName_FirstName_MiddleName ON dbo.ProjectHours_D(LastName, FirstName, MiddleName);

CREATE TABLE dbo.ProjectHours_E(
	FirstName   NVARCHAR(50) NOT NULL,
	MiddleName  NVARCHAR(50)     NULL,
	LastName    NVARCHAR(50) NOT NULL,
	HoursWorked DECIMAL(3,2) NOT NULL CONSTRAINT DF_ProjectHoursE_HoursWorked DEFAULT CAST(CAST(CRYPT_GEN_RANDOM(1) AS TINYINT) as DECIMAL(5,2)) / 128
);

CREATE UNIQUE CLUSTERED INDEX UQ_ProjectHoursE_LastName_FirstName_MiddleName ON dbo.ProjectHours_E(LastName, FirstName, MiddleName);

CREATE TABLE dbo.ProjectHours_F(
	FirstName   NVARCHAR(50) NOT NULL,
	MiddleName  NVARCHAR(50)     NULL,
	LastName    NVARCHAR(50) NOT NULL,
	HoursWorked DECIMAL(3,2) NOT NULL CONSTRAINT DF_ProjectHoursF_HoursWorked DEFAULT CAST(CAST(CRYPT_GEN_RANDOM(1) AS TINYINT) as DECIMAL(5,2)) / 128
);

CREATE UNIQUE CLUSTERED INDEX UQ_ProjectHoursF_LastName_FirstName_MiddleName ON dbo.ProjectHours_F(LastName, FirstName, MiddleName);

CREATE TABLE dbo.ProjectHours_G(
	FirstName   NVARCHAR(50) NOT NULL,
	MiddleName  NVARCHAR(50)     NULL,
	LastName    NVARCHAR(50) NOT NULL,
	HoursWorked DECIMAL(3,2) NOT NULL CONSTRAINT DF_ProjectHoursG_HoursWorked DEFAULT CAST(CAST(CRYPT_GEN_RANDOM(1) AS TINYINT) as DECIMAL(5,2)) / 128
);

CREATE UNIQUE CLUSTERED INDEX UQ_ProjectHoursG_LastName_FirstName_MiddleName ON dbo.ProjectHours_G(LastName, FirstName, MiddleName);

CREATE TABLE dbo.ProjectHours_H(
	FirstName   NVARCHAR(50) NOT NULL,
	MiddleName  NVARCHAR(50)     NULL,
	LastName    NVARCHAR(50) NOT NULL,
	HoursWorked DECIMAL(3,2) NOT NULL CONSTRAINT DF_HoursWorked DEFAULT CAST(CAST(CRYPT_GEN_RANDOM(1) AS TINYINT) as DECIMAL(5,2)) / 128
);

CREATE UNIQUE CLUSTERED INDEX UQ_ProjectHoursH_LastName_FirstName_MiddleName ON dbo.ProjectHours_H(LastName, FirstName, MiddleName);

CREATE TABLE dbo.LaborDetail(
	EmployeeID  INT  NOT NULL,
	WorkDay     DATE NOT NULL,
	ProjAHours  DECIMAL(3,2) NOT NULL CONSTRAINT DF_LaborDetail_ProjAHours DEFAULT 0,
	ProjBHours  DECIMAL(3,2) NOT NULL CONSTRAINT DF_LaborDetail_ProjBHours DEFAULT 0,
	ProjCHours  DECIMAL(3,2) NOT NULL CONSTRAINT DF_LaborDetail_ProjCHours DEFAULT 0,
	ProjDHours  DECIMAL(3,2) NOT NULL CONSTRAINT DF_LaborDetail_ProjDHours DEFAULT 0,
	ProjEHours  DECIMAL(3,2) NOT NULL CONSTRAINT DF_LaborDetail_ProjEHours DEFAULT 0,
	ProjFHours  DECIMAL(3,2) NOT NULL CONSTRAINT DF_LaborDetail_ProjFHours DEFAULT 0,
	ProjGHours  DECIMAL(3,2) NOT NULL CONSTRAINT DF_LaborDetail_ProjGHours DEFAULT 0,
	ProjHHours  DECIMAL(3,2) NOT NULL CONSTRAINT DF_LaborDetail_ProjHHours DEFAULT 0,
	HoursWorked DECIMAL(4,2) NOT NULL CONSTRAINT DF_LaborDetail_LaborCost DEFAULT 0,
	LaborCost   DECIMAL(5,2) NOT NULL CONSTRAINT DF_LaborCost DEFAULT 0,
	RecordBiggerInRealLife NCHAR(2000) NOT NULL CONSTRAINT DF_RecordBiggerInRealLife DEFAULT '',
	CONSTRAINT PK_LaborDetail PRIMARY KEY CLUSTERED(WorkDay, EmployeeID)
);

CREATE TABLE dbo.TimeHistogram(
	WorkDay  DATE NOT NULL,
	LoadTime INT  NOT NULL,
	CalcTime INT  NOT NULL,
	NewStats BIT  NOT NULL,
	TotalTime AS LoadTime + CalcTime,
	CONSTRAINT PK_TimeHistogram PRIMARY KEY CLUSTERED(WorkDay)
);

I used three stored procedures to generate the data for this writeup. SimulateDailyLoad and SimulateDailyCalc simulate the process of loading and transforming a day’s worth of data. To simplify my life, the third procedure, GenerateHistogram, loops through 3/4 of a year’s worth of calls to these routines and accumulates timing data into the TimeHistogram table.

CREATE PROCEDURE dbo.SimulateDailyLoad
	@LoadDate DATE
AS
BEGIN
	TRUNCATE TABLE dbo.ProjectHours_A;
	TRUNCATE TABLE dbo.ProjectHours_B;
	TRUNCATE TABLE dbo.ProjectHours_C;
	TRUNCATE TABLE dbo.ProjectHours_D;
	TRUNCATE TABLE dbo.ProjectHours_E;
	TRUNCATE TABLE dbo.ProjectHours_F;
	TRUNCATE TABLE dbo.ProjectHours_G;
	TRUNCATE TABLE dbo.ProjectHours_H;
	INSERT INTO dbo.LaborDetail(EmployeeID, WorkDay) SELECT EmployeeID, @LoadDate FROM dbo.Employee;
	INSERT INTO dbo.ProjectHours_A(FirstName, MiddleName, LastName)
		SELECT FirstName, MiddleName, LastName FROM dbo.Employee
	INSERT INTO dbo.ProjectHours_B(FirstName, MiddleName, LastName)
		SELECT FirstName, MiddleName, LastName FROM dbo.Employee
	INSERT INTO dbo.ProjectHours_C(FirstName, MiddleName, LastName)
		SELECT FirstName, MiddleName, LastName FROM dbo.Employee
	INSERT INTO dbo.ProjectHours_D(FirstName, MiddleName, LastName)
		SELECT FirstName, MiddleName, LastName FROM dbo.Employee
	INSERT INTO dbo.ProjectHours_E(FirstName, MiddleName, LastName)
		SELECT FirstName, MiddleName, LastName FROM dbo.Employee
	INSERT INTO dbo.ProjectHours_F(FirstName, MiddleName, LastName)
		SELECT FirstName, MiddleName, LastName FROM dbo.Employee
	INSERT INTO dbo.ProjectHours_G(FirstName, MiddleName, LastName)
		SELECT FirstName, MiddleName, LastName FROM dbo.Employee
	INSERT INTO dbo.ProjectHours_H(FirstName, MiddleName, LastName)
		SELECT FirstName, MiddleName, LastName FROM dbo.Employee
END;

GO

CREATE PROCEDURE dbo.SimulateCalculation
	@LoadDate DATE
WITH RECOMPILE AS
BEGIN
	UPDATE d SET d.ProjAHours = h.HoursWorked
	FROM dbo.LaborDetail d
		INNER JOIN dbo.Employee e ON d.EmployeeID = e.EmployeeID
		INNER JOIN dbo.ProjectHours_A h ON e.FirstName = h.FirstName AND e.MiddleName = h.MiddleName AND e.LastName = h.LastName
	WHERE d.WorkDay = @LoadDate
	UPDATE d SET d.ProjBHours = h.HoursWorked
	FROM dbo.LaborDetail d
		INNER JOIN dbo.Employee e ON d.EmployeeID = e.EmployeeID
		INNER JOIN dbo.ProjectHours_B h ON e.FirstName = h.FirstName AND e.MiddleName = h.MiddleName AND e.LastName = h.LastName
	WHERE d.WorkDay = @LoadDate
	UPDATE d SET d.ProjCHours = h.HoursWorked
	FROM dbo.LaborDetail d
		INNER JOIN dbo.Employee e ON d.EmployeeID = e.EmployeeID
		INNER JOIN dbo.ProjectHours_C h ON e.FirstName = h.FirstName AND e.MiddleName = h.MiddleName AND e.LastName = h.LastName
	WHERE d.WorkDay = @LoadDate
	UPDATE d SET d.ProjDHours = h.HoursWorked
	FROM dbo.LaborDetail d
		INNER JOIN dbo.Employee e ON d.EmployeeID = e.EmployeeID
		INNER JOIN dbo.ProjectHours_D h ON e.FirstName = h.FirstName AND e.MiddleName = h.MiddleName AND e.LastName = h.LastName
	WHERE d.WorkDay = @LoadDate
	UPDATE d SET d.ProjEHours = h.HoursWorked
	FROM dbo.LaborDetail d
		INNER JOIN dbo.Employee e ON d.EmployeeID = e.EmployeeID
		INNER JOIN dbo.ProjectHours_E h ON e.FirstName = h.FirstName AND e.MiddleName = h.MiddleName AND e.LastName = h.LastName
	WHERE d.WorkDay = @LoadDate
	UPDATE d SET d.ProjFHours = h.HoursWorked
	FROM dbo.LaborDetail d
		INNER JOIN dbo.Employee e ON d.EmployeeID = e.EmployeeID
		INNER JOIN dbo.ProjectHours_F h ON e.FirstName = h.FirstName AND e.MiddleName = h.MiddleName AND e.LastName = h.LastName
	WHERE d.WorkDay = @LoadDate
	UPDATE d SET d.ProjGHours = h.HoursWorked
	FROM dbo.LaborDetail d
		INNER JOIN dbo.Employee e ON d.EmployeeID = e.EmployeeID
		INNER JOIN dbo.ProjectHours_G h ON e.FirstName = h.FirstName AND e.MiddleName = h.MiddleName AND e.LastName = h.LastName
	WHERE d.WorkDay = @LoadDate
	UPDATE d SET d.ProjHHours = h.HoursWorked
	FROM dbo.LaborDetail d
		INNER JOIN dbo.Employee e ON d.EmployeeID = e.EmployeeID
		INNER JOIN dbo.ProjectHours_H h ON e.FirstName = h.FirstName AND e.MiddleName = h.MiddleName AND e.LastName = h.LastName
	WHERE d.WorkDay = @LoadDate
	UPDATE dbo.LaborDetail SET HoursWorked = ProjAHours + ProjBHours + ProjCHours + ProjDHours + ProjEHours + ProjFHours + ProjGHours + ProjHHours
		WHERE WorkDay = @LoadDate;
	UPDATE d SET LaborCost = d.HoursWorked * e.Wage, RecordBiggerInRealLife = e.AGazillionMoreFields
		FROM dbo.LaborDetail d INNER JOIN dbo.Employee e ON d.EmployeeID = e.EmployeeID
		WHERE d.WorkDay = @LoadDate;
END;

GO

CREATE PROCEDURE dbo.GenerateHistogram AS
BEGIN
	DECLARE @dt DATE;
	DECLARE @t0 DATETIME;
	DECLARE @t1 DATETIME;
	DECLARE @t2 DATETIME;
	DECLARE @t3 DATETIME;
	DECLARE @flag BIT;

	TRUNCATE TABLE dbo.LaborDetail;
	TRUNCATE TABLE dbo.TimeHistogram;
	CHECKPOINT;
	DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS;
	SET @dt = '20130101';

	WHILE @dt < '20131001'
	BEGIN
		SET @t0 = CURRENT_TIMESTAMP;
		EXEC dbo.SimulateDailyLoad @dt;
		SET @t1 = CURRENT_TIMESTAMP;
		CHECKPOINT;
		DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS;
		SET @t2 = CURRENT_TIMESTAMP;
		EXEC dbo.SimulateCalculation @dt;
		SET @t3 = CURRENT_TIMESTAMP;
		SET @flag = 0;
		IF STATS_DATE(OBJECT_ID('dbo.LaborDetail'), 1) >= @t0
		BEGIN
			SET @flag = 1;
		END;
		INSERT INTO dbo.TimeHistogram(WorkDay, LoadTime, CalcTime, NewStats)
			VALUES(@dt, DATEDIFF(ms, @t0, @t1), DATEDIFF(ms, @t2, @t3), @flag);
		SET @dt = DATEADD(d, 1, @dt);
	END;
END;

The CHECKPOINT and DROPCLEANBUFFERS in the last stored procedure are probably completely unnecessary. They were originally added when I was debugging an issue and I simply forgot to remove them before the last time I ran this code.