Basic Queries

Overview

Teaching: 30 min
Exercises: 5 min
Questions
  • How do I write a basic query in SQL?

Objectives
  • Write and build queries.

  • Filter data given various criteria.

  • Sort the results of a query.

Writing my first query

Let’s start by using the demog table. This table has one row for each participant in the study. The first few columns of this table record some information on each individual person’s study participation, including unique identifiers for each person and each site, and the study protocol under which they were recruited. The next group of columns demographic information for every participant in the study, including their gender, years of education, ethnicity, race, and marital status. Finally, the last column contains each person’s status on one gene that increases the risk of Alzheimer’s disease.

Let’s write an SQL query that selects only the gender column (PTGENDER) from the demog table. SQL queries can be written in the box located under the “Execute SQL” tab. Click on the right arrow above the query box to execute the query. (You can also use the keyboard shortcut “Cmd-Enter” on a Mac or “Ctrl-Enter” on a Windows machine to execute a query.) The results are displayed in the box below your query.

SELECT PTGENDER
FROM demog;

We have capitalized the words SELECT and FROM because they are SQL keywords. SQL is case insensitive, but it helps for readability, and is good style.

If we want more information, we can just add a new column to the list of fields, right after SELECT:

SELECT PTGENDER, PTEDUCAT, PTMARRY
FROM demog;

Or we can select all of the columns in a table using the wildcard *

SELECT *
FROM demog;

Limiting results

Sometimes you don’t want to see all the results, you just want to get a sense of of what’s being returned. In that case you can use the LIMIT command. In particular you would want to do this if you were working with large databases.

SELECT *
FROM demog
LIMIT 10; 

Unique values

If we want only the unique values so that we can quickly see, e.g., what the different study protocols are, we can use DISTINCT:

SELECT DISTINCT COLPROT
FROM demog;

If we select more than one column, then the distinct pairs of values are returned:

SELECT DISTINCT COLPROT, SITE
FROM demog;

Person-level tables vs person-visit-level tables

Let’s turn now to a different table, cog, which contains individual scores on cognitive tests. Some of these tests measure specific abilities, like memory for lists of words (RAVLT_learning) or for stories (LDELTOTAL), or processing speed (TRABSCOR), and others try to estimate a person’s overall cognitive health across a range of tasks (ADAS11).

Note that this table is a little different from demog, because it has one row per person for each study visit they completed. To track this, we have a new study column, VISCODE, that has a shorthand telling you which visit a given observation represents. We can use the SELECT DISTINCT trick again to see the visit schedule for each of the studies in this dataset.

SELECT DISTINCT COLPROT, VISCODE
FROM cog;

Because this table is person-visit level, meaning that its unique identifier is a combination of person ID and visit ID, it is sometimes handy to keep track of participants’ baseline values – that is, values at study entry – for each test. In this table, we see those recorded in separate columns ending in _bl. For example, ADAS11 records the unique value on this test at each visit, which can change from one visit to the next, whereas ADAS11_bl will have the same value for a given person at each visit.

SELECT PTID, VISCODE, ADAS11, ADAS11_bl
FROM cog;

Calculated values

Whereas the demog table mostly contained text columns, most of the scores in the cog table are numeric. That means that we can do calculations on them. For example, if we wanted to look at how much a person’s ADAS11 score has changed since baseline, we would write

SELECT PTID, ADAS11-ADAS11_bl
FROM cog;

When we run the query, the expression ADAS11-ADAS11_bl is evaluated for each row and appended to that row, in a new column. Expressions can use any fields and any arithmetic operators (+, -, *, and /), and can be nested.

Expressions can also use a variety of built-in functions. For example, we could round the values to the nearest whole number (0 decimal places).

SELECT PTID, ROUND(ADAS11-ADAS11_bl, 0)
FROM cog;

Challenge

  • Write a query that returns the percent change in ADAS11 from baseline, rounded to the nearest tenth.

Solution

SELECT PTID, ROUND(100*(ADAS11-ADAS11_bl)/ADAS11_bl, 1)
FROM cog;

Filtering

Databases can also filter data – selecting only the data meeting certain criteria. For example, let’s say we only want data for the original phase of ADNI, which has a protocol ID (COLPROT) of ADNI1. We need to add a WHERE clause to our query:

SELECT *
FROM cog
WHERE COLPROT='ADNI1';

We can do the same thing with numbers. Here, we only want scores of people whose baseline MMSE was at least 27.

SELECT * FROM cog
WHERE MMSE_bl >= 27;

We can use more sophisticated conditions by combining tests with AND and OR.
For example, suppose we want 6-month visit followup data for everyone with a baseline MMSE score of at least 27:

SELECT *
FROM cog
WHERE (MMSE_bl >= 27) AND (VISCODE = 'm06');

Note that the parentheses are not needed, but again, they help with readability. They also ensure that the computer combines AND and OR in the way that we intend.

If we wanted to get followup data for any post-baseline visits in the first 12 months, which have visit codes of m03, m06, and m12, we could combine the tests using OR:

SELECT *
FROM cog
WHERE (VISCODE = 'm03') OR (VISCODE = 'm06') OR (VISCODE = 'm12');

Challenge

  • Produce a table listing the data for all individuals whose baseline diagnosis was CN, telling us the patient ID, the visit code, their current diagnosis, their LDELTOTAL score, and their change in LDELTOTAL since baseline.

Solution

SELECT PTID, VISCODE, DX, LDELTOTAL, LDELTOTAL-LDELTOTAL_bl
FROM cog
WHERE DX_bl='CN';

Building more complex queries

Now, lets combine the above queries to get data for the same 3 followup visits for people with baseline MMSE scores of at least 27. This time, let’s use IN as one way to make the query easier to understand. It is equivalent to saying ` WHERE (VISCODE = ‘m03’) OR (VISCODE = ‘m06’) OR (VISCODE = ‘m12’)`, but reads more neatly:

SELECT *
FROM cog
WHERE (MMSE_bl >= 27) AND (VISCODE IN ('m03', 'm06', 'm12'));

We started with something simple, then added more clauses one by one, testing their effects as we went along. For complex queries, this is a good strategy, to make sure you are getting what you want. Sometimes it might help to take a subset of the data that you can easily see in a temporary database to practice your queries on before working on a larger or more complicated database.

When the queries become more complex, it can be useful to add comments. In SQL, comments are started by --, and end at the end of the line. For example, a commented version of the above query can be written as:

-- Get first year followup data for those with baseline MMSE of 27 or higher
-- These are in the cog table, and we are interested in all columns
SELECT * FROM cog
-- Baseline MMSE is in the column `MMSE_bl`, and we want to include 27
WHERE (MMSE_bl >= 27)
-- These visits `VISCODE` m03, m06, and m12
AND (VISCODE IN ('m03', 'm06', 'm12'));

Although SQL queries often read like plain English, it is always useful to add comments; this is especially true of more complex queries.

Sorting

We can also sort the results of our queries by using ORDER BY. For instance, perhaps we want to group together all visits from an individual patient. We can do this by ordering by patient ID.

SELECT *
FROM cog
ORDER BY PTID ASC;

The keyword ASC tells us to order it in Ascending order. We could alternately use DESC to get descending order.

SELECT *
FROM cog
ORDER BY PTID DESC;

ASC is the default.

We can also sort on several fields at once. Since patients can be in more than one phase of the ADNI study, perhaps we want to group together their visits by phase.

SELECT *
FROM cog
ORDER BY PTID ASC, COLPROT ASC;

Alternatively, perhaps we want an ordering that puts all the ADNI1 visits together, then all the ADNI2 visits, etc., with single patients’ visits grouped together inside those larger groupings.

SELECT *
FROM cog
ORDER BY COLPROT ASC, PTID ASC;

Challenge

  • Write a query that returns patient ID, diagnosis, and MMSE score from the cog table, sorted first by patient, then by protocol (descending), then by visit code.

Solution

SELECT PTID, DX, MMSE
FROM cog
ORDER BY PTID ASC, COLPROT DESC, VISCODE ASC;

Order of execution

Another note for ordering. We don’t actually have to display a column to sort by it, or to select on it. For example, let’s say we want to see all records from patients with MCI, ordered by the study phase, but we only want to see patient ID and MMSE.

SELECT PTID, MMSE
FROM cog
WHERE DX = 'MCI'
ORDER BY COLPROT ASC;

We can do this because sorting occurs earlier in the computational pipeline than field selection.

The computer is basically doing this:

  1. Filtering rows according to WHERE
  2. Sorting results according to ORDER BY
  3. Displaying requested columns or expressions.

Clauses are written in a fixed order: SELECT, FROM, WHERE, then ORDER BY. It is possible to write a query as a single line, but for readability, we recommend to put each clause on its own line.

Challenge

  • Let’s try to combine what we’ve learned so far in a single query. Using the cog table write a query to display patient ID, diagnosis, and ADAS11 (rounded to one decimal places), for individuals in ADNI1, ordered alphabetically by patient ID and visit code.
  • Write the query as a single line, then put each clause on its own line, and see how more legible the query becomes!

Solution

SELECT PTID, DX, round(ADAS11, 1)
FROM cog
WHERE COLPROT = 'ADNI1'
ORDER BY PTID, VISCODE;

Key Points

  • It is useful to apply conventions when writing SQL queries to aid readability.

  • Use logical connectors such as AND or OR to create more complex queries.

  • Calculations using mathematical symbols can also be performed on SQL queries.

  • Adding comments in SQL helps keep complex queries understandable.