SQL Aggregation and aliases
Overview
Teaching: 50 min
Exercises: 10 minQuestions
How can I summarize my data by aggregating, filtering, or ordering query results?
How can I make sure column names from my queries make sense and aren’t too long?
Objectives
Apply aggregation to group records in SQL.
Filter and order results of a query based on aggregate functions.
Employ aliases to assign new names to items in a query.
Save a query to make a new table.
Apply filters to find missing values in SQL.
COUNT and GROUP BY
Aggregation allows us to combine results by grouping records based on value, also it is useful for calculating combined values in groups.
Let’s go to the demog table and find out how many individuals have participated in the study. Using the wildcard * simply counts the number of records (rows):
SELECT COUNT(*)
FROM demog;
We can also find out how many genetic risk alleles are present in the group:
SELECT COUNT(*), SUM(apoe4)
FROM demog;
If we’re feeling fancy, we can combine these to output the average number of risk alleles in the sample, rounding to two decimal places.
Notice the factor of 1.0 in here. Both SUM and COUNT in this case will return integer values, and if we left that ratio as it is, we would get integer division. Multiplying it by 1.0 tells the computer that we want the answer represented as a real number instead.
SELECT COUNT(*), 1.0*SUM(APOE4)/COUNT(*)
FROM demog;
You might notice something strange about this result, though – this column represents a count, so how is the average value negative? Let’s investigate using another aggregate function, MIN
.
SELECT MIN(APOE4)
FROM demog;
You may recall from the OpenRefine lesson that -4 represents a missing value. So a better answer would limit the rows included in this calculation to nonmissing values.
Challenge
Create a query that computes the average number of risk alleles in the sample, using WHERE to get a better answer.
Solution
SELECT COUNT(APOE4), ROUND(1.0*SUM(APOE4)/COUNT(APOE4),2) FROM demog WHERE APOE4 != -4;
Now let’s go back to the cog table and try out some of these aggregate functions there. We can start by counting the observations:
SELECT COUNT(*)
FROM cog;
We can see that this is many more rows than we had in the demog table. The reason for this
is that this table is structured at the person-visit level: that is, it has one row per
person per visit completed. If we want to know how many individuals were measured at each
visit, we can use GROUP BY
:
SELECT VISCODE, COUNT(*)
FROM cog
GROUP BY VISCODE;
Now instead of counting the number of observations overall, we’re grouping them by visit, and then counting those. Notice that the count of baseline visits (VISCODE=’bl’) matches the count of individuals we obtained earlier using COUNT(*) on the demog table.
Challenge
Write a query that returns the maximum, minimum, and average MMSE scores across all observations, rounding the average to the nearest tenth. Can you modify it so that it outputs these values only for non-missing observations from the 6-month followup visit?
Solution
-- All observations SELECT ROUND(AVG(MMSE),1), MAX(MMSE), MIN(MMSE) FROM cog; -- Only non-missing observations at m06 SELECT ROUND(AVG(MMSE),1), MAX(MMSE), MIN(MMSE) FROM cog WHERE VISCODE='m06' AND MMSE != -4;
If we want to group by multiple fields, we give GROUP BY
a comma separated list. For instance, maybe we also want to get separate counts by study phase:
SELECT COLPROT, VISCODE, COUNT(*)
FROM cog
GROUP BY COLPROT, VISCODE;
Challenge
Write queries that return:
- How many people in each diagnostic group (‘DX’) did we see at baseline?
- How many people have we seen in each diagnostic group at each visit?
- What was the average MMSE score for each diagnostic group at each visit?
Can you get the answer to both 2 and 3 in a single query?
Solution of 1
SELECT DX, COUNT(*) FROM cog WHERE VISCODE='bl' GROUP BY DX;
Solution of 2 and 3
SELECT DX, VISCODE, COUNT(*), AVG(MMSE) FROM cog GROUP BY DX, VISCODE;
Ordering Aggregated Results
We can order the results of our aggregation by a specific column, including the aggregated column. Let’s count the number of visits each person has had, ordering from most to least followup:
SELECT PTID, COUNT(*)
FROM cog
GROUP BY PTID
ORDER BY COUNT(*) DESC;
Aliases
As queries get more complex names can get long and unwieldy. To help make things clearer we can use aliases to assign new names to things in the query.
We can use aliases in column names or table names using AS
:
SELECT MAX(MMSE) AS max_MMSE
FROM cog AS c;
The AS
isn’t technically required, so you could do
SELECT MAX(MMSE) max_MMSE
FROM cog c;
but using AS
is much clearer so it is good style to include it.
The HAVING
keyword
In the previous episode, we have seen the keyword WHERE
, allowing to
filter the results according to some criteria. SQL offers a mechanism to
filter the results based on aggregate functions, through the HAVING
keyword.
For example, we can request to only return information about patients who have returned for at least one followup visit:
SELECT PTID, COUNT(VISCODE)
FROM cog
GROUP BY PTID
HAVING COUNT(VISCODE) > 2;
The HAVING
keyword works exactly like the WHERE
keyword, but uses
aggregate functions instead of database fields to filter.
If you use AS
in your query to rename a column, HAVING
you can use this
information to make the query more readable. For example, in the above
query, we can call the COUNT(VISCODE)
by another name, like
n_visits
. This can be written this way:
SELECT PTID, COUNT(VISCODE) AS n_visits
FROM cog
GROUP BY PTID
HAVING n_visits > 2;
Note that in both queries, HAVING
comes after GROUP BY
. One way to
think about this is: the data are retrieved (SELECT
), which can be filtered
(WHERE
), then joined in groups (GROUP BY
); finally, we can filter again based on some
of these groups (HAVING
).
Challenge
Write a query that returns, from the
cog
table, the number of patients observed in each baseline diagnostic group (DX_bl
) at each visit (VISCODE
), only for groups containing at least 100 patients.Solution
SELECT DX_bl, VISCODE, COUNT(*) AS n_patients FROM cog GROUP BY DX_bl, VISCODE HAVING n_patients > 100;
Saving Queries for Future Use
It is not uncommon to repeat the same operation more than once, for example for monitoring or reporting purposes. SQL comes with a very powerful mechanism to do this by creating views. Views are a form of query that is saved in the database, and can be used to look at, filter, and even update information. One way to think of views is as a table, that can read, aggregate, and filter information from several places before showing it to you.
Creating a view from a query requires to add CREATE VIEW viewname AS
before the query itself. For example, imagine that my project only covers
patients who started with a diagnosis of CN
. That query would look like:
SELECT *
FROM cog
WHERE DX_bl = 'CN';
But we don’t want to have to type that every time we want to ask a question about that particular subset of data. Hence, we can benefit from a view:
CREATE VIEW CN_baseline AS
SELECT *
FROM cog
WHERE DX_bl = 'CN';
You can also add a view using Create View in the View menu and see the results in the Views tab, the same way as creating a table from the menu.
Using a view we will be able to access these results with a much shorter notation:
SELECT *
FROM CN_baseline
WHERE MMSE < 10;
Key Points
Use the
GROUP BY
keyword to aggregate data.Functions like
MIN
,MAX
,AVERAGE
,SUM
,COUNT
, etc. operate on aggregated data.Aliases can help shorten long queries. To write clear and readible queries, use the
AS
keyword when creating aliases.Use the
HAVING
keyword to filter on aggregate properties.Use a
VIEW
to access the result of a query as though it was a new table.