Joins

Overview

Teaching: 15 min
Exercises: 10 min
Questions
  • How do I bring data together from separate tables?

Objectives
  • Employ joins to combine data from two tables.

  • Apply functions to manipulate individual values.

  • Employ aliases to assign new names to tables and columns in a query.

Joins

To combine data from two tables we use the SQL JOIN command, which comes after the FROM command.

The JOIN command on its own will result in a cross product, where each row in the first table is paired with each row in the second table. Usually this is not what is desired when combining two tables with data that is related in some way.

For that, we need to tell the computer which columns provide the link between the two tables using the word ON. What we want is to join the data with the same patient id.

SELECT demog.*, cog.MMSE, cog.MMSE_bl
FROM demog
JOIN cog
ON demog.ptid = cog.ptid;

ON is like WHERE, it filters things out according to a test condition. We use the table.colname format to tell the manager what column in which table we are referring to.

The output of the JOIN command will have all the columns specified from both tables. Here, by selecting demog.*, we tell it we want all the variables from the first table, plus the specified columns from the second table. For the above command, the output will be a table that has the following column names:

RID PTID SITE COLPROT ORIGPROT PTGENDER PTEDUCAT PTETHCAT PTRACCAT_mod PTMARRY APOE4 MMSE MMSE_bl
2 011_S_0002 11 ADNI1 ADNI1 Male 16 Not Hisp/Latino White Married 0 -4 28

Alternatively, we can use the word USING, as a short-hand. USING only works on columns which share the same name. In this case we are telling the manager that we want to combine demog with cog and that the common column is ptid.

  SELECT demog.*, cog.MMSE, cog.MMSE_bl
FROM demog
JOIN cog
USING (ptid);

The output will only have one ptid column

|RID|PTID|SITE|COLPROT|ORIGPROT|PTGENDER|PTEDUCAT|PTETHCAT|PTRACCAT_mod|PTMARRY|APOE4|MMSE|MMSE_bl| |—|—|—|—|—|—|—|—|—|—|—|—|—|

We often won’t want all of the fields from both tables, so as you see in this example, anywhere we would have used a field name in a non-join query, we can use table.colname.

For example, what if we wanted information on which visits each person had completed, but instead of the visit code, we wanted a more fully descriptive label including which study phase it was in and the full name of the visit.

SELECT cog.ptid, cog.viscode, visits.phase, visits.visname
FROM cog
JOIN visits
ON cog.viscode = visits.viscode
ORDER BY cog.ptid, visits.phase, visits.visorder;

Note that you can order by a column that is not included in the SELECT statement. Here, that choice allows us to order each person’s participation in the study chronologically.

Many databases, including SQLite, also support a join through the WHERE clause of a query.
For example, you may see the query above written without an explicit JOIN.

  SELECT cog.ptid, cog.viscode, visits.phase, visits.visname
  FROM cog, visits
WHERE cog.viscode = visits.viscode
ORDER BY cog.ptid, visits.phase, visits.visorder;

For the remainder of this lesson, we’ll stick with the explicit use of the JOIN keyword for joining tables in SQL.

Challenge:

  • Write a query that returns the patient ID, the visit name, and the hippocampal volume of every person at each visit where it was measured.

Solution

SELECT mri.ptid, visits.visname, mri.Hippocampus
FROM visits
JOIN mri
ON visits.viscode = mri.viscode;

Multi-column join

The joins we’ve done up til this point have been one to many: each patient in one table can be matched to multiple rows in another table, but joining on the patient ID is sufficient to get the matches right. However, sometimes you want to join two tables that each have multiple rows per patient. In this case, you will need to join on multiple identifiers.

  SELECT cog.ptid, cog.viscode, cog.mmse, mri.Hippocampus
  FROM cog
  JOIN mri
ON cog.ptid = mri.ptid AND cog.viscode = mri.viscode
ORDER BY cog.ptid, cog.viscode;

Different join types

We can count the number of records returned by the join query above.

SELECT COUNT(*)
FROM cog
  JOIN mri
ON cog.ptid = mri.ptid AND cog.viscode = mri.viscode
ORDER BY cog.ptid, cog.viscode;

Notice that this number is smaller than the number of records present in the cog data.

SELECT COUNT(*) FROM cog;

This is because, by default, SQL only returns records where the joining value is present in the joined columns of both tables (i.e. it takes the intersection of the two join columns). This joining behaviour is known as an INNER JOIN. In fact the JOIN command is simply shorthand for INNER JOIN and the two terms can be used interchangably as they will produce the same result.

We can also tell the computer that we wish to keep all the records in the first table by using the command LEFT OUTER JOIN, or LEFT JOIN for short.

Challenge:

  • Re-write the original query to keep all the entries present in the cogs table. How many records are returned by this query?

Solution

SELECT * FROM cog
LEFT JOIN mri
ON cog.ptid = mri.ptid AND cog.viscode = mri.viscode;

Challenge:

  • Count the number of person-visits in the cog table that have a NULL value in the Hippocampus column.

Solution

SELECT COUNT(*)
FROM cog
LEFT JOIN mri
ON cog.ptid = mri.ptid AND cog.viscode = mri.viscode
WHERE Hippocampus IS NULL;

Remember: In SQL a NULL value in one table can never be joined to a NULL value in a second table because NULL is not equal to anything, not even itself.

Combining joins with sorting and aggregation

Joins can be combined with sorting, filtering, and aggregation. So, if we wanted average hippocampal volume of men and women, we would write:

SELECT demog.ptgender, AVG(mri.Hippocampus)
FROM demog
JOIN mri
ON demog.ptid = mri.ptid
GROUP BY demog.ptgender;

Challenge:

  • Write a query that returns the number of MRI observations in each diagnostic category at each visit. Order the results by descending number of observations in each group.

Solution

SELECT cog.dx, cog.viscode, COUNT(*) AS number_obs
FROM cog
JOIN mri
ON cog.ptid = mri.ptid and cog.viscode = mri.viscode
GROUP BY cog.dx, cog.viscode
ORDER BY number_obs DESC;

Challenge:

  • Write a query that finds the average hippocampal volume by visit, and the number of hippocampal volume records obtained, for patients with a baseline MMSE score less than 24.

Solution

SELECT mri.viscode, COUNT(mri.Hippocampus), AVG(mri.Hippocampus)
FROM mri
JOIN cog
ON mri.ptid = cog.PTID
WHERE cog.mmse_bl<24
GROUP BY mri.viscode;

Functions IFNULL and NULLIF and more

SQL includes numerous functions for manipulating data. You’ve already seen some of these being used for aggregation (SUM and COUNT) but there are functions that operate on individual values as well. Probably the most important of these are IFNULL and NULLIF. IFNULL allows us to specify a value to use in place of NULL.

We can represent unknown sexes with -4 instead of NULL:

SELECT ptid, Hippocampus, IFNULL(Hippocampus, -4)
FROM cog
LEFT JOIN mri
ON cog.ptid = mri.ptid AND cog.viscode = mri.viscode;

The lone “Hippocampus” column is only included in the query above to illustrate where IFNULL has changed values; this isn’t a usage requirement.

The inverse of IFNULL is NULLIF. This returns NULL if the first argument is equal to the second argument. If the two are not equal, the first argument is returned. This is useful for “nulling out” specific values.

We can “null out” values equal to -4, which is a missing value indicator in these data:

SELECT ptid, viscode, mmse, NULLIF(mmse, -4) as mmse_null
FROM cog;

Some more functions which are common to SQL databases are listed in the table below:

Function Description
ABS(n) Returns the absolute (positive) value of the numeric expression n
LENGTH(s) Returns the length of the string expression s
LOWER(s) Returns the string expression s converted to lowercase
NULLIF(x, y) Returns NULL if x is equal to y, otherwise returns x
ROUND(n) or ROUND(n, x) Returns the numeric expression n rounded to x digits after the decimal point (0 by default)
TRIM(s) Returns the string expression s without leading and trailing whitespace characters
UPPER(s) Returns the string expression s converted to uppercase

Finally, some useful functions which are particular to SQLite are listed in the table below:

Function Description
IFNULL(x, y) Returns x if it is non-NULL, otherwise returns y
RANDOM() Returns a random integer between -9223372036854775808 and +9223372036854775807.
REPLACE(s, f, r) Returns the string expression s in which every occurrence of f has been replaced with r
SUBSTR(s, x, y) or SUBSTR(s, x) Returns the portion of the string expression s starting at the character position x (leftmost position is 1), y characters long (or to the end of s if y is omitted)

To practice we have some optional challenges for you.

Challenge (optional):

SQL queries help us ask specific questions which we want to answer about our data. The real skill with SQL is to know how to translate our scientific questions into a sensible SQL query (and subsequently visualize and interpret our results).

Have a look at the following questions; these questions are written in plain English. Can you translate them to SQL queries and give a suitable answer?

  1. How many sites are there for each study phase?

  2. How many patients are of each sex are there for each visit code?

  3. How many patients with each diagnosis participated in each study phase? Be careful to count only one visit per study phase per person.

  4. What is the average baseline hippocampal volume of those with 0, 1, and 2 APOE4 risk alleles? Round to some convenient length.

  5. What are the minimum, maximum and average MMSE for each diagnostic group at each visit? Round the average to some convenient length.

  6. What is the average change in MMSE from baseline to m12 followup for men and women? Round to some convenient length.

Proposed solutions:

  1. Solution:
    select colprot, count(distinct site) as N
    from demog
    group by colprot;
    
  1. Solution:
    select demog.ptgender, cog.viscode, count(*) as N
    from demog join cog on demog.ptid = cog.ptid
    group by ptgender, viscode;
    
  1. Solution:
    select colprot, dx_bl, count(*) as N
    from cog
    where viscode = 'bl' and dx_bl != -4
    group by colprot, dx_bl;
    
  1. Solution:
    select demog.APOE4, round(avg(mri.Hippocampus_bl), 2) as mean_hc
    from demog join mri on demog.ptid = mri.ptid
    where mri.viscode = 'bl' and demog.APOE4 != -4
    group by demog.APOE4
    order by mean_hc desc;
    
  1. Solution:
    select dx, viscode, min(mmse) as min_mmse, max(mmse) as max_mmse, 
    round(avg(mmse), 2) as avg_mmse
    from cog
    where dx != -4 and mmse > -4
    group by dx, viscode;
    
  1. Solution:
    select demog.ptgender, round(avg(cog.mmse - cog.mmse_bl), 2) as mmse_change
    from demog left join cog on demog.ptid = cog.PTID
    where cog.viscode = 'm12'
    group by ptgender;
    

Key Points

  • Use the JOIN command to combine data from two tables—the ON or USING keywords specify which columns link the tables.

  • Regular JOIN returns only matching rows. Other join commands provide different behavior, e.g., LEFT JOIN retains all rows of the table on the left side of the command.

  • IFNULL allows you to specify a value to use in place of NULL, which can help in joins

  • NULLIF can be used to replace certain values with NULL in results

  • Many other functions like IFNULL and NULLIF can operate on individual values.