Thursday, April 28, 2011

Finding The Top One Value Per Group

I ran into an issue the other day where I had a table like this:

IDnamelocationcommentcomment_time
1Johnhomehello1/1/2011 01:16:23
2Johnworkcomment21/12/2011 13:23:10
3Johnhomeblah1/13/2011 02:02:02
4Billworksmurf1/15/2011 23:59:00
5Billhomejello1/17/2011 13:30:00
6Billhomesnafu1/18/2011 12:00:19
7Frankworkbatman1/19/2011 14:15:16


Somebody came up to me and said "Hey John, I would like to see the last comment made by each person in the database and where they were when they made the comment". The logical answer is to use a Group By query. The problem is that you can't include a non-aggregate field in a Group By query which means I can't get at both the location and the comment in single query. I need to get a set of data that I can query against to get the result set i need. So each row in the return set has to be unique. I would love to use this subquery:
SELECT TOP(1) id, name
FROM table1
GROUP BY name
ORDER BY comment_time desc

But you get the following error:
Msg 8120, Level 16, State 1, Line 63
Column 'dbo.table1.id' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

So I have to use an aggregate function (SUM, AVG, MIN, MAX or COUNT). At first I thought about using the MAX function on the ID to give me the highest ID but that assumes that the comment_time field is generated on INSERT and isn't updateable (for example, it wouldn't work if the comment_times for ids 2 and 3 were switched). Since I don't know and can't control that I decided to use MAX on the comment_time field, like this:

SELECT MAX(comment_time) as comment_time, name
FROM table1
GROUP BY name

Which gives me:


comment_timename
1/13/2011 02:02:02John
1/19/2011 14:15:16Frank
1/18/2011 12:00:19Bill

You have to alias the MAX(comment_time) or else it will return the column with the header "(No Column Name)" which wold make it difficult to use it in the next step.

Now I can use that set to get the result set i want like this:

SELECT name, location, comment, date
FROM table1
JOIN
(SELECT MAX(comment_time) as comment_time, name
FROM table1
GROUP BY name) as sub
ON table1.name = sub.name
AND table1.comment_time = sub.comment_time
ORDER BY
table1.comment_time

Which gives us:

namelocationcommentcomment_time
JohnHomeblah1/13/2011 02:02:02
Billhomesnafu1/18/2011 12:00:19
Frankworkbatman1/19/2011 14:15:16

The only problem might be if you have two comments by the same person in the same day at exactly the same time. If the completion time is generated by the server on insert, you shouldn't have this problem but if it is generated by the application you might (especially if the field is editable). My answer is that if two comments were made at the same time, they are both the last comment and should both be returned so my solution works fine (for my needs). The other problem is that my solution doesn't work if you need to return more than one row per group.

Let me know in the comments if you have a different solution.

Hasta Luego,
RutherfordTX