SQL Aggregation and aliases

Overview

Teaching: 50 min
Exercises: 10 min
Questions
  • How can I summarize my data by aggregating, filtering, or ordering query results?

  • How can I make sure column names from my queries make sense and aren’t too long?

Objectives
  • Apply aggregation to group records in SQL.

  • Filter and order results of a query based on aggregate functions.

  • Employ aliases to assign new names to items in a query.

  • Save a query to make a new table.

  • Apply filters to find missing values in SQL.

COUNT and GROUP BY

Aggregation allows us to combine results by grouping records based on value, also it is useful for calculating combined values in groups.

Let’s go to the demog table and find out how many individuals have participated in the study. Using the wildcard * simply counts the number of records (rows):

SELECT COUNT(*)
FROM demog;

We can also find out how many genetic risk alleles are present in the group:

SELECT COUNT(*), SUM(apoe4)
FROM demog;

If we’re feeling fancy, we can combine these to output the average number of risk alleles in the sample, rounding to two decimal places.

Notice the factor of 1.0 in here. Both SUM and COUNT in this case will return integer values, and if we left that ratio as it is, we would get integer division. Multiplying it by 1.0 tells the computer that we want the answer represented as a real number instead.

SELECT COUNT(*), 1.0*SUM(APOE4)/COUNT(*)
FROM demog;

You might notice something strange about this result, though – this column represents a count, so how is the average value negative? Let’s investigate using another aggregate function, MIN.

SELECT MIN(APOE4)
FROM demog;

You may recall from the OpenRefine lesson that -4 represents a missing value. So a better answer would limit the rows included in this calculation to nonmissing values.

Challenge

Create a query that computes the average number of risk alleles in the sample, using WHERE to get a better answer.

Solution

SELECT COUNT(APOE4), ROUND(1.0*SUM(APOE4)/COUNT(APOE4),2)
FROM demog
WHERE APOE4 != -4;

Now let’s go back to the cog table and try out some of these aggregate functions there. We can start by counting the observations:

SELECT COUNT(*)
FROM cog;

We can see that this is many more rows than we had in the demog table. The reason for this is that this table is structured at the person-visit level: that is, it has one row per person per visit completed. If we want to know how many individuals were measured at each visit, we can use GROUP BY:

SELECT VISCODE, COUNT(*)
FROM cog
GROUP BY VISCODE;

Now instead of counting the number of observations overall, we’re grouping them by visit, and then counting those. Notice that the count of baseline visits (VISCODE=’bl’) matches the count of individuals we obtained earlier using COUNT(*) on the demog table.

Challenge

Write a query that returns the maximum, minimum, and average MMSE scores across all observations, rounding the average to the nearest tenth. Can you modify it so that it outputs these values only for non-missing observations from the 6-month followup visit?

Solution

-- All observations
SELECT ROUND(AVG(MMSE),1), MAX(MMSE), MIN(MMSE)
FROM cog;

-- Only non-missing observations at m06
SELECT ROUND(AVG(MMSE),1), MAX(MMSE), MIN(MMSE)
FROM cog
WHERE VISCODE='m06' AND MMSE != -4;

If we want to group by multiple fields, we give GROUP BY a comma separated list. For instance, maybe we also want to get separate counts by study phase:

SELECT COLPROT, VISCODE, COUNT(*)
FROM cog
GROUP BY COLPROT, VISCODE;

Challenge

Write queries that return:

  1. How many people in each diagnostic group (‘DX’) did we see at baseline?
  2. How many people have we seen in each diagnostic group at each visit?
  3. What was the average MMSE score for each diagnostic group at each visit?

Can you get the answer to both 2 and 3 in a single query?

Solution of 1

SELECT DX, COUNT(*)
FROM cog
WHERE VISCODE='bl'
GROUP BY DX;

Solution of 2 and 3

SELECT DX, VISCODE, COUNT(*), AVG(MMSE) 
FROM cog
GROUP BY DX, VISCODE;

Ordering Aggregated Results

We can order the results of our aggregation by a specific column, including the aggregated column. Let’s count the number of visits each person has had, ordering from most to least followup:

SELECT PTID, COUNT(*)
FROM cog
GROUP BY PTID
ORDER BY COUNT(*) DESC;

Aliases

As queries get more complex names can get long and unwieldy. To help make things clearer we can use aliases to assign new names to things in the query.

We can use aliases in column names or table names using AS:

SELECT MAX(MMSE) AS max_MMSE
FROM cog AS c;

The AS isn’t technically required, so you could do

SELECT MAX(MMSE) max_MMSE
FROM cog c;

but using AS is much clearer so it is good style to include it.

The HAVING keyword

In the previous episode, we have seen the keyword WHERE, allowing to filter the results according to some criteria. SQL offers a mechanism to filter the results based on aggregate functions, through the HAVING keyword.

For example, we can request to only return information about patients who have returned for at least one followup visit:

SELECT PTID, COUNT(VISCODE)
FROM cog
GROUP BY PTID
HAVING COUNT(VISCODE) > 2;

The HAVING keyword works exactly like the WHERE keyword, but uses aggregate functions instead of database fields to filter.

If you use AS in your query to rename a column, HAVING you can use this information to make the query more readable. For example, in the above query, we can call the COUNT(VISCODE) by another name, like n_visits. This can be written this way:

SELECT PTID, COUNT(VISCODE) AS n_visits
FROM cog
GROUP BY PTID
HAVING n_visits > 2;

Note that in both queries, HAVING comes after GROUP BY. One way to think about this is: the data are retrieved (SELECT), which can be filtered (WHERE), then joined in groups (GROUP BY); finally, we can filter again based on some of these groups (HAVING).

Challenge

Write a query that returns, from the cog table, the number of patients observed in each baseline diagnostic group (DX_bl) at each visit (VISCODE), only for groups containing at least 100 patients.

Solution

SELECT DX_bl, VISCODE, COUNT(*) AS n_patients
FROM cog
GROUP BY DX_bl, VISCODE
HAVING n_patients > 100;

Saving Queries for Future Use

It is not uncommon to repeat the same operation more than once, for example for monitoring or reporting purposes. SQL comes with a very powerful mechanism to do this by creating views. Views are a form of query that is saved in the database, and can be used to look at, filter, and even update information. One way to think of views is as a table, that can read, aggregate, and filter information from several places before showing it to you.

Creating a view from a query requires to add CREATE VIEW viewname AS before the query itself. For example, imagine that my project only covers patients who started with a diagnosis of CN. That query would look like:

SELECT *
FROM cog
WHERE DX_bl = 'CN';

But we don’t want to have to type that every time we want to ask a question about that particular subset of data. Hence, we can benefit from a view:

CREATE VIEW CN_baseline AS
SELECT *
FROM cog
WHERE DX_bl = 'CN';

You can also add a view using Create View in the View menu and see the results in the Views tab, the same way as creating a table from the menu.

Using a view we will be able to access these results with a much shorter notation:

SELECT *
FROM CN_baseline
WHERE MMSE < 10;

Key Points

  • Use the GROUP BY keyword to aggregate data.

  • Functions like MIN, MAX, AVERAGE, SUM, COUNT, etc. operate on aggregated data.

  • Aliases can help shorten long queries. To write clear and readible queries, use the AS keyword when creating aliases.

  • Use the HAVING keyword to filter on aggregate properties.

  • Use a VIEW to access the result of a query as though it was a new table.