I Finally Get Cross Apply!

For years I’ve seen CROSS APPLY in the FROM clause of queries online: sample code, diagnostic queries using DMVs, and the like. But I never really understood what it was for or how it worked, because I never saw it applied to something I was actually working on.

Finally, this week I had a breakthrough. I was updating a batch of data, but the update was breaking on a small subset of it. I was attempting to JOIN two tables on fields that should have been INTs, but in a very small number of cases one side was a comma-delimited string. The user told me that someone else had run these updates in the past without hitting the problem I was seeing (so I knew it was something I was doing “wrong”), but since only a handful of updates were broken, she was OK with fixing those by hand (we were scripting this because we were updating potentially tens of thousands of records).

I am not OK with manually fixing this in the future. I wanted to know how the other DBA had done it before. I dug into some history and found CROSS APPLY. My nemesis. I was determined to figure out how to use it this time.

Setting the Stage

Let’s set up three simple tables to keep track of airports and what state each airport is in. But our developer doesn’t totally get database design, and in his state-to-airport mapping table he allows a comma-separated list of airports for each state.

    CREATE TABLE #States (
        [Id]      INT IDENTITY(1, 1),
        StateName NVARCHAR(30) NOT NULL
    );
    CREATE TABLE #Airports (
        [Id]     INT IDENTITY(1, 1),
        IATACode CHAR(3) NOT NULL
    );
    CREATE TABLE #StateAirports (
        StateId  INT PRIMARY KEY NOT NULL,
        Airports NVARCHAR(50)
    );
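
To follow along, let’s load a little sample data. The airport codes are real; the Id values are just what the IDENTITY columns hand out, starting at 1.

    INSERT INTO #States (StateName)
    VALUES (N'New York'), (N'Illinois'), (N'Washington');

    INSERT INTO #Airports (IATACode)
    VALUES ('JFK'), ('LGA'), ('ALB'), ('ORD'), ('MDW'), ('SEA');

    -- Comma-separated airport Ids, just as our developer designed it
    INSERT INTO #StateAirports (StateId, Airports)
    VALUES (1, N'1,2,3'), (2, N'4,5'), (3, N'6');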

This makes getting a list of airports and their associated state names tricky at best if we don’t know about CROSS APPLY. With CROSS APPLY, it’s pretty straightforward.

Solution

Here’s the finished query.

    SELECT S.StateName,
        A.IATACode
    FROM #StateAirports SA1
        CROSS APPLY STRING_SPLIT(SA1.Airports, ',') AS SA2
        JOIN #Airports A ON A.Id = SA2.value
        JOIN #States S ON S.Id = SA1.StateId;

STRING_SPLIT() is a table-valued function which we finally got in SQL Server 2016, after far too many years of having to write (or, let’s face it, copy from someone’s blog post) inefficient string-splitting functions. It returns a single-column table; the column is named value. Important note: even if your database engine is SQL Server 2016, the database you’re operating in must be at compatibility level 130 or the function won’t be available.
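
You can check where a database stands and, if nothing else depends on the old level, raise it:

    -- Check the current compatibility level
    SELECT name, compatibility_level
    FROM sys.databases
    WHERE name = DB_NAME();

    -- Raise it to 130 (SQL Server 2016) so STRING_SPLIT() is available
    ALTER DATABASE CURRENT SET COMPATIBILITY_LEVEL = 130;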

Breaking it down

If we take CROSS APPLY and break it down into its parts, it finally starts to make sense.
* APPLY the STRING_SPLIT() function to the Airports field of the #StateAirports table
* Append each row of STRING_SPLIT()’s output to the corresponding row of #StateAirports (similar to a CROSS JOIN, but not exactly)

So now I have N rows for each StateId in #StateAirports, where N is the number of values in the comma-separated field. And JOINed to each row is one of the rows from the output of STRING_SPLIT().

    SELECT *
    FROM #StateAirports SA1
        CROSS APPLY STRING_SPLIT(SA1.Airports, ',') AS SA2;
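
With the sample data loaded earlier, that intermediate result looks like this:

    StateId  Airports  value
    -------  --------  -----
    1        1,2,3     1
    1        1,2,3     2
    1        1,2,3     3
    2        4,5       4
    2        4,5       5
    3        6         6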

From there, the rest of the query is pretty normal, JOINing to the other two tables to translate the state & airport Id numbers into their text values.

Hopefully this helps others get a handle on CROSS APPLY and find useful places for it. This had been a head-scratcher for me for years, but only because I didn’t have an example that clearly broke down how to use it and what was going on. In hindsight, I probably could have used it in some analysis I did at a previous job; instead, I resorted to parsing & processing comma-separated data in a PowerShell script.


Don’t Count on Me

This post is in support of Tim Ford’s (b|t) #iwanttohelp challenge. It’s also written because this has burned me twice in the past 3 months, and by blogging about it, hopefully it’ll stick in my mind.

Setup

I’ve recently been doing a bunch of work with stored procedures, trying to improve performance that’s been suffering due to suboptimal queries. Some of this tuning has resulted in creating temporary tables. After making my changes and running the procedures in SSMS, everything looked good – data’s correct, performance metrics are all improved. Everyone wins!

Then I checked the web app. At first, it appeared to work OK. But on reloading the page with different parameters, I got no data, or the data from the previous parameters, or other data that was completely out of left field. Not good!

After a bit of head-scratching, I popped over to the SQL Server Slack and asked for ideas about why I’d be getting different results depending on how the procedure was called. After kicking a few ideas around, someone asked if the procedure included SET NOCOUNT ON. It didn’t, so I added it and my problems were solved!

Explanation

So what happened here? When you execute a query against SQL Server, both your data and some additional information are sent back to the client. This additional information is sent via a separate channel, accessible via the SqlConnection.InfoMessage event (classic ADO exposes a similar InfoMessage event on its Connection object). When you run queries in SSMS, you see this information in the Messages tab of the results pane, most often as X row(s) affected.

That’s where my new stored procedures were causing problems. The original procedures returned only one message, corresponding to the number of records returned by the single query in each procedure. But now that I’m loading temp tables, I’m getting multiple messages back – at a minimum, a count of the records affected when loading the temp table, plus a count of the records returned to the calling application.

I’m not sure what exactly my application was doing with this, but as soon as multiple messages were passed back through InfoMessage, it got very confused and started doing unexpected things with the query results. I suspect that it saw multiple events and attempted to use the data associated with the first one – of which there was none, because that first statement was just inserting into my temp table.

By starting the stored procedure with SET NOCOUNT ON, those row count messages are suppressed and the additional data isn’t transmitted to the client. It’s also said that this can improve performance (though these days the benefit is mostly reduced network traffic), but my primary interest in using it is to keep client applications that I can’t change from blowing up.

Something I find very interesting is that SSMS ships with two different templates for stored procedures: one includes SET NOCOUNT ON, while the other doesn’t.

Example

Here are three simple stored procedures to demonstrate the effect of this setting.

CREATE OR ALTER PROCEDURE dbo.GetCounties
AS
PRINT 'GetCounties';

SELECT s.name, c.countyname
FROM states s
    JOIN counties c ON s.StateId = c.StateId;
GO

CREATE OR ALTER PROCEDURE dbo.GetCounties2
AS
CREATE TABLE #StatesCounties (
    StateName NVARCHAR(100)
    ,CountyName NVARCHAR(100)
);

PRINT 'GetCounties2';

INSERT INTO #StatesCounties
SELECT s.name AS StateName, c.countyname
FROM states s
    JOIN counties c ON s.StateId = c.StateId;

SELECT StateName, CountyName FROM #StatesCounties;
GO

CREATE OR ALTER PROCEDURE dbo.GetCounties3
AS
SET NOCOUNT ON;  -- the only difference from GetCounties2

CREATE TABLE #StatesCounties (
    StateName NVARCHAR(100)
    ,CountyName NVARCHAR(100)
);

PRINT 'GetCounties3';

INSERT INTO #StatesCounties
SELECT s.name AS StateName, c.countyname
FROM states s
    JOIN counties c ON s.StateId = c.StateId;

SELECT StateName, CountyName FROM #StatesCounties;
GO

And the result of running each, from the SSMS Messages tab.

GetCounties

(122 row(s) affected)
GetCounties2

(122 row(s) affected)

(122 row(s) affected)
GetCounties3

Note how the first reports the number of rows returned, while the second reports both the number of rows inserted into the temp table and the number returned by the final query. In the last example, no row counts are returned at all. In all cases, the PRINT messages come through because they’re explicitly output by my code.

Summary

* Unless you have a very specific need to get this additional data stream in your calling application, use SET NOCOUNT ON in your stored procedures
* The next time you’re working in a stored procedure, add it if it’s not already there
* Add it to the template you use for creating new stored procedures

Name Your Defaults So SQL Server Doesn’t

Something in SQL Server that isn’t always obvious to beginners is that when you create a default value for a column on a table, SQL Server creates a constraint (much like a primary or foreign key). All constraints must have a name, and if one isn’t specified SQL Server will generate one for you. For example:

CREATE TABLE [dbo].[point_types] (
    [typeid] [int] NOT NULL DEFAULT (NEXT VALUE FOR [pointtypeid])
    ,[typename] [nvarchar](30) NOT NULL DEFAULT 'Unspecified'
    ,CONSTRAINT [PK_PointType] PRIMARY KEY CLUSTERED ([typeid] ASC)
);
GO
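
(One note: the [typeid] default assumes the sequence [pointtypeid] already exists. If you’re following along, create it first; the START/INCREMENT values here are just my assumption.)

-- Sequence feeding the [typeid] default above
CREATE SEQUENCE [pointtypeid] AS INT START WITH 1 INCREMENT BY 1;
GO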

We’ve got a simple table here and both fields have a default value set (the primary key’s value is generated from a sequence object, pointtypeid). Time goes on, and a change in policy comes up which requires that I change the default value of typename to Unknown. To do this, I have to drop the constraint and re-create it. To find the name of the constraint, I can either ask sp_help, or run this query:

SELECT all_columns.NAME
    ,default_constraints.NAME
    ,default_constraints.DEFINITION
FROM sys.all_columns
    INNER JOIN sys.tables
        ON all_columns.object_id = tables.object_id
    INNER JOIN sys.schemas
        ON tables.schema_id = schemas.schema_id
    INNER JOIN sys.default_constraints
        ON all_columns.default_object_id = default_constraints.object_id
WHERE schemas.NAME = 'dbo'
    AND tables.NAME = 'point_types';
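
The sp_help route is just as quick:

EXEC sp_help 'dbo.point_types';  -- constraint names appear in its output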

I’ve got my constraint name now, so I can drop it and re-create it:

ALTER TABLE [dbo].[point_types]
    DROP CONSTRAINT DF__point_typ__typen__21B6055D;
GO

ALTER TABLE [dbo].[point_types]
    ADD DEFAULT ('Unknown') FOR [typename];
GO

And if I re-run the above query, I can see that the constraint’s name is different.

This means that everywhere I need to change this constraint (development, test and production), I’ll need to figure out the constraint name in that particular database and drop it before re-creating it. This makes a deployment script a bit messier, as it needs extra code to find those constraint names.

DECLARE @constraintname SYSNAME;  -- constraint names are sysname, up to 128 characters

SELECT @constraintname = default_constraints.NAME
FROM sys.all_columns
    INNER JOIN sys.tables
        ON all_columns.object_id = tables.object_id
    INNER JOIN sys.schemas
        ON tables.schema_id = schemas.schema_id
    INNER JOIN sys.default_constraints
        ON all_columns.default_object_id = default_constraints.object_id
WHERE schemas.NAME = 'dbo'
    AND tables.NAME = 'point_types'
    AND all_columns.NAME = 'typename';  -- both columns have defaults, so target typename specifically

DECLARE @sql NVARCHAR(200) = N'ALTER TABLE [dbo].[point_types] DROP CONSTRAINT ' + QUOTENAME(@constraintname);

PRINT @sql;

EXECUTE sp_executesql @sql;

ALTER TABLE [dbo].[point_types]
    ADD DEFAULT ('Unknown') FOR [typename];
GO

But this doesn’t really solve my problem; it just works around it. It’s still messy and fragile, and if I need to do other operations on the default constraint, I have to go through the same exercise to find its name.

Fortunately, SQL Server lets us name default constraints just like any other constraint, and by doing so we avoid this trouble. By setting my own name for the constraint, I know what it’ll be in every database, without having to query system tables. The name can be set in both the CREATE TABLE statement and an independent ALTER TABLE.

CREATE TABLE [dbo].[point_types] (
    [typeid] [int] NOT NULL DEFAULT (NEXT VALUE FOR [pointtypeid])
    ,[typename] [nvarchar](30) NOT NULL CONSTRAINT [DF_PT_TypeName] DEFAULT 'Unspecified'
    ,CONSTRAINT [PK_PointType] PRIMARY KEY CLUSTERED ([typeid] ASC)
);
GO

ALTER TABLE [dbo].[point_types]
    DROP CONSTRAINT [DF_PT_TypeName];
GO

ALTER TABLE [dbo].[point_types]
    ADD CONSTRAINT [DF_PT_TypeName] DEFAULT ('Unknown') FOR [typename];
GO

I can also combine these in the next deployment that requires a change to the default constraint, dropping the system-generated name and establishing my own static name to make things simpler in the future.

Is explicitly naming default (or any other) constraints necessary? No, but doing so helps your database document itself, and it makes future deployment/promotion scripts simpler and less prone to breakage. SQL Server needs a name for the constraint regardless; it’s worth specifying it yourself.

Selectively Locking Down Data – Gracefully

I have a situation where I need to retrieve the data in an encrypted column, but I don’t want to give all my users access to the symmetric key used to encrypt that column. The data is of the sort where it’s important for the application to produce the required output, but if a user runs the stored procedure to see what the application is getting from it, it’s not critical that they see this one field.

The catch is that if the stored procedure is written with the assumption that the caller has permission to access the encryption key or its certificate, callers without that permission will get an error. After a bit of research and pondering, I came up with two options:

  1. Create the stored procedure with EXECUTE AS OWNER (the owner in this case is dbo). This would let all users see the encrypted data; not an ideal solution.
  2. Use SQL Server’s TRY/CATCH construct to gracefully handle the error thrown when the user attempts to open the key, but doesn’t have permission to do so.

Let’s check out option 2. This example is simplified from my actual scenario to demonstrate the idea.


DECLARE @BankId VARCHAR(6) = '123456';

-- Stage the non-sensitive columns, with a placeholder for the account number
SELECT CAST('' AS VARCHAR(50)) AS AccountNum
    ,AccountName
    ,AccountOwner
INTO #AccountData
FROM dbo.Accounts
WHERE OriginatingBank = @BankId
    AND AccountType = 'Checking';

DECLARE @AcctNo VARCHAR(30);

BEGIN TRY
    -- This throws if the caller lacks permission on the key or certificate
    OPEN SYMMETRIC KEY MyKey DECRYPTION BY CERTIFICATE My_Cert;

    SELECT @AcctNo = CONVERT(VARCHAR(30), DECRYPTBYKEY(AccountNum))
    FROM dbo.Accounts
    WHERE OriginatingBank = @BankId
        AND AccountType = 'Checking';

    CLOSE SYMMETRIC KEY MyKey;
END TRY
BEGIN CATCH
    SET @AcctNo = 'Access Restricted';
END CATCH

UPDATE #AccountData SET AccountNum = @AcctNo;

SELECT * FROM #AccountData;

DROP TABLE #AccountData;

TRY/CATCH in T-SQL works similarly to how it does in languages like C# or PowerShell. It allows you to attempt an operation and take care of any error conditions fairly easily.

In this case, I’m attempting to open the encryption key. But if the user doesn’t have permission to do so, it doesn’t terminate the stored procedure with an error. Instead, it jumps to the CATCH block, where I’ve defined an alternate way of handling the situation. Here, if the user doesn’t have the appropriate permissions, they’ll just get “Access Restricted” for the account number, and access to that sensitive data is a little more tightly controlled – while still letting users access the data they do need.
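
For the accounts that do need to decrypt, you grant access on the key and the certificate protecting it. A quick sketch, using the key and certificate names from the example and a hypothetical AppUser:

-- OPEN SYMMETRIC KEY requires VIEW DEFINITION on the key,
-- plus CONTROL on the certificate used for decryption
GRANT VIEW DEFINITION ON SYMMETRIC KEY::MyKey TO AppUser;
GRANT CONTROL ON CERTIFICATE::My_Cert TO AppUser;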

Hello GETDATE() My Old Friend…

So you’ve decided that your new web application needs to record some page load time metrics so you can keep tabs on performance. Terrific! You set up a couple of page load/complete functions to write to a logging table when a page request comes in, and then update the record when it finishes loading.

INSERT INTO PageLogs (
    RequestTime
   ,ResponseTime
   ,RemoteIP
   ,UserName
   ,PageId
   ,Parameters
   ,SessionId
) VALUES (
    GETDATE() 
    ,NULL
    ,'127.0.0.1'
    ,'Dave'
    ,'Home'
    ,'Pd=2015Q2'
    ,'883666b1-99be-48c8-bf59-5a5739bc7d1d'
);
UPDATE PageLogs
    SET ResponseTime = GETDATE()
    WHERE SessionId = '883666b1-99be-48c8-bf59-5a5739bc7d1d';

You set up an hourly job to delete any logs older than 2 weeks (just to prevent information overload) and you call it a day. Each morning, you run a report to look at the previous day’s performance, watch the trends over the past week or so, and you’re pretty happy with things. Pages are loading in a fraction of a second, according to the logs. People find the application useful, word spreads around the office, and adoption takes off. The project is a success!

Then the support calls start rolling in. Users say it’s taking “forever” to load pages (they don’t have exact numbers, but it’s just too slow). This can’t be possible. The report says everything’s running just as fast as it did in test!

You walk down the hall and visit your friendly Senior DBA. He pulls up his monitoring tools and shows you that the hourly maintenance that keeps the PageLogs table fit & trim is causing a bunch of blocking while it does lots of DELETEs. And your INSERT queries are the victims.

Here’s the thing: GETDATE() (like any other function) doesn’t get evaluated until the query actually executes. Not when you call ExecuteNonQuery(), not even when SQL Server receives the query. So even if your INSERT isn’t holding up your page (because you’ve executed it asynchronously), it won’t accurately represent when the page load started. Instead, it tells you when your query executed – which is misleading here, because it won’t tell you how long it really took for your page to load.
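
You can see this for yourself without setting up any actual blocking; in this little sketch, WAITFOR stands in for a blocked INSERT:

DECLARE @BatchStarted DATETIME = GETDATE(); -- evaluated when the batch begins

WAITFOR DELAY '00:00:05'; -- stand-in for five seconds of blocking

-- GETDATE() below evaluates when this statement finally runs,
-- roughly five seconds after the batch started
SELECT @BatchStarted AS BatchStarted,
    GETDATE() AS EvaluatedAtExecution;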

If you need to log the time an event transpired accurately, GETDATE() isn’t your friend. You need to explicitly set the time in the query.

INSERT INTO PageLogs (
    RequestTime
   ,ResponseTime
   ,RemoteIP
   ,UserName
   ,PageId
   ,Parameters
   ,SessionId
) VALUES (
    '2015-05-15T09:45:00Z' 
    ,NULL
    ,'127.0.0.1'
    ,'Dave'
    ,'Home'
    ,'Pd=2015Q2'
    ,'883666b1-99be-48c8-bf59-5a5739bc7d1d'
);
UPDATE PageLogs
    SET ResponseTime = '2015-05-15T09:45:02Z'
    WHERE SessionId = '883666b1-99be-48c8-bf59-5a5739bc7d1d';
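
In real application code you wouldn’t hard-code the literal, of course; the application captures the timestamp when the request arrives and passes it as a parameter. Here’s a sketch of the parameterized form (the parameter names and types are my assumption):

EXEC sp_executesql
    N'UPDATE PageLogs
        SET ResponseTime = @ResponseTime
        WHERE SessionId = @SessionId;'
    ,N'@ResponseTime DATETIME2(3), @SessionId UNIQUEIDENTIFIER'
    ,@ResponseTime = '2015-05-15 09:45:02'
    ,@SessionId = '883666b1-99be-48c8-bf59-5a5739bc7d1d';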

If you aren’t used to seeing significant blocking in your databases, you may not have run into this. But get into this habit anyway. At some point you probably will see blocking on a table like this, and logging with GETDATE() will make the data you attempted to write during that blocking invalid. If you can’t trust all of your data, can you trust any of it?