Joins
Overview
Teaching: 15 min
Exercises: 10 minQuestions
How do I bring data together from separate tables?
Objectives
Employ joins to combine data from two tables.
Apply functions to manipulate individual values.
Employ aliases to assign new names to tables and columns in a query.
Joins
To combine data from two tables we use the SQL JOIN
command, which comes after
the FROM
command.
The JOIN
command on its own will result in a cross product, where each row in
the first table is paired with each row in the second table. Usually this is not
what is desired when combining two tables with data that is related in some way.
For that, we need to tell the computer which columns provide the link between the two
tables using the word ON
. What we want is to join the data with the same
patient id.
SELECT demog.*, cog.MMSE, cog.MMSE_bl
FROM demog
JOIN cog
ON demog.ptid = cog.ptid;
ON
is like WHERE
, it filters things out according to a test condition. We use
the table.colname
format to tell the manager what column in which table we are
referring to.
The output of the JOIN
command will have all the columns specified from both tables.
Here, by selecting demog.*, we tell it we want all the variables from the first table, plus the
specified columns from the second table. For the above command, the output will be a table
that has the following column names:
RID | PTID | SITE | COLPROT | ORIGPROT | PTGENDER | PTEDUCAT | PTETHCAT | PTRACCAT_mod | PTMARRY | APOE4 | MMSE | MMSE_bl |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 011_S_0002 | 11 | ADNI1 | ADNI1 | Male | 16 | Not Hisp/Latino | White | Married | 0 | -4 | 28 |
Alternatively, we can use the word USING
, as a short-hand. USING
only
works on columns which share the same name. In this case we are
telling the manager that we want to combine demog
with cog
and that
the common column is ptid
.
SELECT demog.*, cog.MMSE, cog.MMSE_bl
FROM demog
JOIN cog
USING (ptid);
The output will only have one ptid column
|RID|PTID|SITE|COLPROT|ORIGPROT|PTGENDER|PTEDUCAT|PTETHCAT|PTRACCAT_mod|PTMARRY|APOE4|MMSE|MMSE_bl| |—|—|—|—|—|—|—|—|—|—|—|—|—|
We often won’t want all of the fields from both tables, so as you see in this example,
anywhere we would have used a field name in a non-join query, we can use table.colname
.
For example, what if we wanted information on which visits each person had completed, but instead of the visit code, we wanted a more fully descriptive label including which study phase it was in and the full name of the visit.
SELECT cog.ptid, cog.viscode, visits.phase, visits.visname
FROM cog
JOIN visits
ON cog.viscode = visits.viscode
ORDER BY cog.ptid, visits.phase, visits.visorder;
Note that you can order by a column that is not included in the SELECT statement. Here, that choice allows us to order each person’s participation in the study chronologically.
Many databases, including SQLite, also support a join through the WHERE
clause of a query.
For example, you may see the query above written without an explicit JOIN.
SELECT cog.ptid, cog.viscode, visits.phase, visits.visname
FROM cog, visits
WHERE cog.viscode = visits.viscode
ORDER BY cog.ptid, visits.phase, visits.visorder;
For the remainder of this lesson, we’ll stick with the explicit use of the JOIN
keyword for
joining tables in SQL.
Challenge:
- Write a query that returns the patient ID, the visit name, and the hippocampal volume of every person at each visit where it was measured.
Solution
SELECT mri.ptid, visits.visname, mri.Hippocampus FROM visits JOIN mri ON visits.viscode = mri.viscode;
Multi-column join
The joins we’ve done up til this point have been one to many: each patient in one table can be matched to multiple rows in another table, but joining on the patient ID is sufficient to get the matches right. However, sometimes you want to join two tables that each have multiple rows per patient. In this case, you will need to join on multiple identifiers.
SELECT cog.ptid, cog.viscode, cog.mmse, mri.Hippocampus
FROM cog
JOIN mri
ON cog.ptid = mri.ptid AND cog.viscode = mri.viscode
ORDER BY cog.ptid, cog.viscode;
Different join types
We can count the number of records returned by the join query above.
SELECT COUNT(*)
FROM cog
JOIN mri
ON cog.ptid = mri.ptid AND cog.viscode = mri.viscode
ORDER BY cog.ptid, cog.viscode;
Notice that this number is smaller than the number of records present in the cog data.
SELECT COUNT(*) FROM cog;
This is because, by default, SQL only returns records where the joining value
is present in the joined columns of both tables (i.e. it takes the intersection
of the two join columns). This joining behaviour is known as an INNER JOIN
.
In fact the JOIN
command is simply shorthand for INNER JOIN
and the two
terms can be used interchangably as they will produce the same result.
We can also tell the computer that we wish to keep all the records in the first
table by using the command LEFT OUTER JOIN
, or LEFT JOIN
for short.
Challenge:
- Re-write the original query to keep all the entries present in the
cogs
table. How many records are returned by this query?Solution
SELECT * FROM cog LEFT JOIN mri ON cog.ptid = mri.ptid AND cog.viscode = mri.viscode;
Challenge:
- Count the number of person-visits in the
cog
table that have aNULL
value in theHippocampus
column.Solution
SELECT COUNT(*) FROM cog LEFT JOIN mri ON cog.ptid = mri.ptid AND cog.viscode = mri.viscode WHERE Hippocampus IS NULL;
Remember: In SQL a NULL
value in one table can never be joined to a NULL
value in a
second table because NULL
is not equal to anything, not even itself.
Combining joins with sorting and aggregation
Joins can be combined with sorting, filtering, and aggregation. So, if we wanted average hippocampal volume of men and women, we would write:
SELECT demog.ptgender, AVG(mri.Hippocampus)
FROM demog
JOIN mri
ON demog.ptid = mri.ptid
GROUP BY demog.ptgender;
Challenge:
- Write a query that returns the number of MRI observations in each diagnostic category at each visit. Order the results by descending number of observations in each group.
Solution
SELECT cog.dx, cog.viscode, COUNT(*) AS number_obs FROM cog JOIN mri ON cog.ptid = mri.ptid and cog.viscode = mri.viscode GROUP BY cog.dx, cog.viscode ORDER BY number_obs DESC;
Challenge:
- Write a query that finds the average hippocampal volume by visit, and the number of hippocampal volume records obtained, for patients with a baseline MMSE score less than 24.
Solution
SELECT mri.viscode, COUNT(mri.Hippocampus), AVG(mri.Hippocampus) FROM mri JOIN cog ON mri.ptid = cog.PTID WHERE cog.mmse_bl<24 GROUP BY mri.viscode;
Functions IFNULL
and NULLIF
and more
SQL includes numerous functions for manipulating data. You’ve already seen some
of these being used for aggregation (SUM
and COUNT
) but there are functions
that operate on individual values as well. Probably the most important of these
are IFNULL
and NULLIF
. IFNULL
allows us to specify a value to use in
place of NULL
.
We can represent unknown sexes with -4
instead of NULL
:
SELECT ptid, Hippocampus, IFNULL(Hippocampus, -4)
FROM cog
LEFT JOIN mri
ON cog.ptid = mri.ptid AND cog.viscode = mri.viscode;
The lone “Hippocampus” column is only included in the query above to illustrate where IFNULL
has changed values; this isn’t a usage requirement.
The inverse of IFNULL
is NULLIF
. This returns NULL
if the first argument
is equal to the second argument. If the two are not equal, the first argument
is returned. This is useful for “nulling out” specific values.
We can “null out” values equal to -4, which is a missing value indicator in these data:
SELECT ptid, viscode, mmse, NULLIF(mmse, -4) as mmse_null
FROM cog;
Some more functions which are common to SQL databases are listed in the table below:
Function | Description |
---|---|
ABS(n) |
Returns the absolute (positive) value of the numeric expression n |
LENGTH(s) |
Returns the length of the string expression s |
LOWER(s) |
Returns the string expression s converted to lowercase |
NULLIF(x, y) |
Returns NULL if x is equal to y, otherwise returns x |
ROUND(n) or ROUND(n, x) |
Returns the numeric expression n rounded to x digits after the decimal point (0 by default) |
TRIM(s) |
Returns the string expression s without leading and trailing whitespace characters |
UPPER(s) |
Returns the string expression s converted to uppercase |
Finally, some useful functions which are particular to SQLite are listed in the table below:
Function | Description |
---|---|
IFNULL(x, y) |
Returns x if it is non-NULL, otherwise returns y |
RANDOM() |
Returns a random integer between -9223372036854775808 and +9223372036854775807. |
REPLACE(s, f, r) |
Returns the string expression s in which every occurrence of f has been replaced with r |
SUBSTR(s, x, y) or SUBSTR(s, x) |
Returns the portion of the string expression s starting at the character position x (leftmost position is 1), y characters long (or to the end of s if y is omitted) |
To practice we have some optional challenges for you.
Challenge (optional):
SQL queries help us ask specific questions which we want to answer about our data. The real skill with SQL is to know how to translate our scientific questions into a sensible SQL query (and subsequently visualize and interpret our results).
Have a look at the following questions; these questions are written in plain English. Can you translate them to SQL queries and give a suitable answer?
How many sites are there for each study phase?
How many patients are of each sex are there for each visit code?
How many patients with each diagnosis participated in each study phase? Be careful to count only one visit per study phase per person.
What is the average baseline hippocampal volume of those with 0, 1, and 2 APOE4 risk alleles? Round to some convenient length.
What are the minimum, maximum and average MMSE for each diagnostic group at each visit? Round the average to some convenient length.
What is the average change in MMSE from baseline to m12 followup for men and women? Round to some convenient length.
Proposed solutions:
- Solution:
select colprot, count(distinct site) as N from demog group by colprot;
- Solution:
select demog.ptgender, cog.viscode, count(*) as N from demog join cog on demog.ptid = cog.ptid group by ptgender, viscode;
- Solution:
select colprot, dx_bl, count(*) as N from cog where viscode = 'bl' and dx_bl != -4 group by colprot, dx_bl;
- Solution:
select demog.APOE4, round(avg(mri.Hippocampus_bl), 2) as mean_hc from demog join mri on demog.ptid = mri.ptid where mri.viscode = 'bl' and demog.APOE4 != -4 group by demog.APOE4 order by mean_hc desc;
- Solution:
select dx, viscode, min(mmse) as min_mmse, max(mmse) as max_mmse, round(avg(mmse), 2) as avg_mmse from cog where dx != -4 and mmse > -4 group by dx, viscode;
- Solution:
select demog.ptgender, round(avg(cog.mmse - cog.mmse_bl), 2) as mmse_change from demog left join cog on demog.ptid = cog.PTID where cog.viscode = 'm12' group by ptgender;
Key Points
Use the
JOIN
command to combine data from two tables—theON
orUSING
keywords specify which columns link the tables.Regular
JOIN
returns only matching rows. Other join commands provide different behavior, e.g.,LEFT JOIN
retains all rows of the table on the left side of the command.
IFNULL
allows you to specify a value to use in place ofNULL
, which can help in joins
NULLIF
can be used to replace certain values withNULL
in resultsMany other functions like
IFNULL
andNULLIF
can operate on individual values.