Filtering and Sorting with OpenRefine
Overview
Teaching: 10 min
Exercises: 10 minQuestions
How can we select only a subset of our data to work with?
How can we sort our data?
Objectives
Employ text filter or include/exclude to filter to a subset of rows.
Sort tables by a column.
Sort tables by multiple columns.
Lesson
Filtering
There are many entries in our data table. We can filter it to work on a subset of the data in the list for the next set of operations. Please ensure you perform this step to save time during the class.
- Click the down arrow next to
CMMED
>Text filter
. ACMMED
facet will appear on the left margin. - Type in
aspirin
and press return. There are 1931 matching rows of the original 55360 rows (and these rows are selected for the subsequent steps). - At the top, change the view to
Show
50rows
. This way you will see all the matching rows.
Exercise
- Scroll over so you can see CMREASON (the indication for the prescription). What is the most frequent indication for this medication?
- How would you restrict this list to just the most frequent indication?
Solution
- Do
Facet
>Text facet
on theCMREASON
column after filtering, then select “Sort by count.” Notice that the most frequentCMREASON
values are numeric. Among the text values, the most frequent indication isgeneral health
(28 rows).- To restrict to only cases taking aspirin for general health, add a second text filter. Notice that if we don’t select the checkbox for “case sensitive,” we see 32 rows, a few more than the 28 returned above.
Excluding entries
In addition to the solutions included above, another way to narrow our filter is to include
and/or exclude
entries in a facet. If you still have your facet for CMMED
, you can use it, or use drop-down menu > Facet
> Text facet
to create a new facet. Only the entries with names that agree with your Text filter
will be included in this facet.
Faceting and filtering look very similar. A good distinction is that faceting gives you an overview description of all of the data that is currently selected, while filtering allows you to select a subset of your data for analysis.
Exercise
Use
include / exclude
to select only entries matchinggeneral health
.Solution
- Remove the filter we applied in the previous solution.
- In the facet (left margin), click on
general health
. Notice that when you click on the name, or hover over it, there are entries to the right foredit
andinclude
.- Click
include
. This will explicitly include this indication, and exclude others that are not expicitly included. Notice that the option now changes toexclude
.- Click
include
andexclude
on another indication (e.g.cardiovascular prophylaxis
) and notice how the two entries appear and disappear from the table.
Important: Remove these selection criteria from your filtered dataset before continuing with the rest of the exercises.
Sort
You can sort the data in a column by using the drop-down menu available in that column.
There you can sort by text
, numbers
, dates
or booleans
(TRUE
or FALSE
values). You can also specify what order to put Blanks
and Errors
in the sorted results.
If this is your first time sorting this table, then the drop-down menu for the selected column shows Sort...
. Select what you would like to sort by (such as numbers
). Additional options will then appear for you to fine-tune your sorting.
Exercise
Sort by USERDATE such that the latest dates are first. What happened?
Solution
In the
USERDATE
column, selectSort...
>dates
and selectlatest first
. Nothing happens! This is because it is not formatted as a date. SelectEdit cells
>Common transforms
>To date
. Then try the sort again.
If you try to re-sort a column that you have already used, the drop-down menu changes slightly, to > Sort
without the ...
, to remind you that you have already used this column. It will give you additional options:
-
Sort
>Sort...
- This option enables you to modify your original sort. -
Sort
>Reverse
- This option allows you to reverse the order of the sort. -
Sort
>Remove sort
- This option allows you to undo your sort.
Sorting by multiple columns.
You can sort by multiple columns by performing sort on additional columns. The sort will depend on the order in which you select columns to sort. To restart the sorting process with a particular column, check the sort by this column alone
box in the Sort
pop-up menu.
Exercise
You might like to look for trends in your data by study phases.
- How do you sort your data by study phase?
- How would you do this differently if you were instead trying to see all of your entries ordered by subject ID, and then by study phase?
Solution
- For the
Phase
column, click onSort...
and thentext
. This will group all ADNI1 entries, for example, regardless of subject ID.- For the
RID
column, click onSort
>Sort...
>numbers
and selectsort by this column alone
. This will undo the sorting by study phase step. Once you’ve sorted byRID
you can then apply another sorting step to sort by study phase. To do this for thePhase
column, click onSort
>text
but do not selectsort by this column alone
.
If you go back to one of the already sorted columns and select > Sort
> Remove sort
, that column is removed from your multiple sort. If it is the only column sorted, then data reverts to its original order.
Exercise
For the
Phase
column, click onSort
>Remove sort
.
Key Points
OpenRefine provides a way to sort and filter data without affecting the raw data.