Working with OpenRefine
Overview
Teaching: 15 min
Exercises: 20 minQuestions
How can we bring our data into OpenRefine?
How can we sort and summarize our data?
How can we find and correct errors in our raw data?
Objectives
Create a new OpenRefine project from a CSV file.
Recall what facets are and how they are used to sort and summarize data.
Recall what clustering is and how it is applied to group and edit typos.
Manipulate data using previous steps with undo/redo.
Employ drop-downs to split values from one column into multiple columns.
Employ drop-downs to remove white spaces from cells.
Lesson
Creating a Project
Start the program. (Double-click on the openrefine.exe file (or google-refine.exe if using an older version). Java services will start on your machine, and OpenRefine will open in your browser).
Launch OpenRefine (see Getting Started with OpenRefine).
OpenRefine can import a variety of file types, including tab separated (tsv
), comma separated (csv
), Excel (xls
, xlsx
), JSON, XML, RDF as XML, Google Spreadsheets. See the OpenRefine Importers page for more information.
In this first step, we’ll browse our computer to the sample data file for this lesson. In this case, we will use the reccmeds
CSV file. This contains info
on medication usage among all the people who participated in at least the
screening visit for ADNI. For each of those people, at each visit, there will
be one record per medication. For example, a patient who is taking three
medications at each of two visits will have six rows in this dataset.
If you haven’t already, download the data from:
Once OpenRefine is launched in your browser, the left margin has options to Create Project
, Open Project
, or Import Project
. Here we will create a new project:
- Click
Create Project
and selectGet data from
This Computer
. - Click
Choose Files
and select the filereccmeds.csv
. ClickOpen
or double-click on the filename. - Click
Next>>
under the browse button to upload the data into OpenRefine. - OpenRefine gives you a preview - a chance to show you it understood the file. If, for example, your file was really tab-delimited, the preview might look strange, so you would choose the correct separator in the box shown and click
Update Preview
(bottom right). If this is the wrong file, click<<Start Over
(upper left). - If all looks well, click
Create Project>>
(upper right).
Note that at step 1, you could upload data in a standard form from a web address by selecting Get data from
Web Addresses (URLs)
. However, this won’t work for all URLs.
Faceting
Exploring data by applying multiple filters
Facets are one of the most useful features of OpenRefine and can help both get an overview of the data in a project as well as helping you bring more consistency to the data. OpenRefine supports faceted browsing as a mechanism for
- seeing a big picture of your data, and
- filtering down to just the subset of rows that you want to change in bulk.
A ‘Facet’ groups all the like values that appear in a column, and then allow you to filter the data by these values and edit values across many records at the same time.
One type of Facet is called a ‘Text facet’. This groups all the identical text values in a column and lists each value with the number of records it appears in. The facet information always appears in the left hand panel in the OpenRefine interface.
Here we will use faceting to look for potential errors in data entry in the CMMED
column. This column contains the names of medications. Data of this
type frequently have typos, and we will see lots of examples here.
- Scroll over to the
CMMED
column. - Click the down arrow and choose
Facet
>Text facet
. - In the left panel, you’ll now see a box containing every unique value in the
CMMED
column along with a number representing how many times that value occurs in the column. - Try sorting this facet by name and by count. Do you notice any problems with the data? What are they?
- Hover the mouse over one of the names in the
Facet
list. You should see that you have anedit
function available. - You could use this to fix an error immediately, and OpenRefine will ask whether you want to make the same correction to every value it finds like that one. But OpenRefine offers even better ways to find and fix these errors, which we’ll use instead. We’ll learn about these when we talk about clustering.
Solution
There will be several near-identical entries in
CMMED
. For example, there are many misspellings of ‘Acetaminophen’, including ‘acetametophin’, ‘Acetaminophin’, and ‘Aceteminophen’, as well as variations like ‘Acetaminophen PM’ and ‘Acetaminophen ER’. We will see how to correct these misspelled and mistyped entries in a later exercise.
More on Facets
As well as ‘Text facets’ OpenRefine also supports a range of other types of facet. These include:
- Numeric facets
- Timeline facets (for dates)
- Custom facets
- Scatterplot facets
Numeric and Scatterplot facets display graphs instead of lists of values. The numeric facet graph includes ‘drag and drop’ controls you can use to set a start and end range to filter the data displayed. These facets are explored further in Examining Numbers in OpenRefine
Custom facets are a range of different types of facets. Some of the default custom facets are:
- Word facet - this breaks down text into words and counts the number of records each word appears in
- Duplicates facet - this results in a binary facet of ‘true’ or ‘false’. Rows appear in the ‘true’ facet if the value in the selected column is an exact match for a value in the same column in another row
- Text length facet - creates a numeric facet based on the length (number of characters) of the text in each row for the selected column. This can be useful for spotting incorrect or unusual data in a field where specific lengths are expected (e.g. if the values are expected to be years, any row with a text length more than 4 for that column is likely to be incorrect)
- Facet by blank - a binary facet of ‘true’ or ‘false’. Rows appear in the ‘true’ facet if they have no data present in that column. This is useful when looking for rows missing key data.
Facets are intended to group together common values and OpenRefine limits the number of values allowed in a single facet to ensure the software does not perform slowly or run out of memory. If you create a facet where there are many unique values (for example, a facet on a ‘book title’ column in a data set that has one row per book) the facet created will be very large and may either slow down the application, or OpenRefine will not create the facet.
Exercise
Using faceting, find out how many sites are represented in the dataset.
Is the column formatted as Number, Date, or Text? How does changing the format change the faceting display?
Which sites have the most and least observations?
Solution
- For the column
SITEID
doFacet
>Text facet
. A box will appear in the left panel showing that there are 87 unique entries in this column.- By default, the column
SITEID
is formatted as Text. You can change the format by doingEdit cells
>Common transforms
>To number
. DoingFacet
>Numeric facet
creates a box in the left panel that shows a histogram of the number of entries per site.- After creating a text facet, click
Sort by count
in the facet box. The site with the most observations is site 47 (1739 obs). The least is site 63 (42 obs).
Clustering
In OpenRefine, clustering means “finding groups of different values that might be alternative representations of the same thing”. For example, the two strings New York
and new york
are very likely to refer to the same concept and just have capitalization differences. Likewise, Gödel
and Godel
probably refer to the same person. Clustering is a very powerful tool for cleaning datasets which contain misspelled or mistyped entries. OpenRefine has several clustering algorithms built in. Experiment with them, and learn more about these algorithms and how they work.
- In the
CMMED
Text Facet we created in the step above, click theCluster
button. - In the resulting pop-up window, you can change the
Method
and theKeying Function
. Try different combinations to see what different mergers of values are suggested. - Select the
key collision
method andmetaphone3
keying function. It should identify 3,949 clusters. - Click the
Merge?
box beside the first cluster, then clickMerge Selected and Recluster
to apply the corrections to the dataset. - Try selecting different
Methods
andKeying Functions
again, to see what new merges are suggested. You may find there are still improvements that can be made, but don’tMerge
again; justClose
when you’re done. We’ll now see other operations that will help us detect and correct the remaining problems, and that have other, more general uses.
Important: If you Merge
using a different method or keying function, or more times than described in the instructions above, your solutions for later exercises will not be the same as shown in those exercise solutions.
Split
If data in a column needs to be split into multiple columns, and the parts are separated by a common separator (say a comma, or a space), you can use that separator to divide up the pieces into their own columns.
- Let us suppose we want to split the
CMDOSE_mod
column into separate columns for dose amount (numeric) and unit (text). - Click the down arrow at the top of the
CMDOSE_mod
column. ChooseEdit Column
>Split into several columns...
- In the pop-up, in the
Separator
box, replace the comma with a space. - Uncheck the box that says
Remove this column
. - Click
OK
. You’ll get several new columns calledCMDOSE_mod 1
,CMDOSE_mod 2
, and so on. - What properties do
CMDOSE_mod 1
andCMDOSE_mod 2
seem to have, on first glance? Are there exceptions? - Why do you think there are so many apparently blank columns?
Solution
In many cases,
CMDOSE_mod 1
andCMDOSE_mod 2
correspond to dosage amounts and units, respectively. There are some exceptions, which we will discuss in a later exercise. The entries that have data inCMDOSE_mod 3
,CMDOSE_mod 4
, etc. have extra information somewhere in the entry. One candidate is excess white spaces, which are very difficult to notice when cleaning data manually. This is another advantage of using OpenRefine to clean your data. We’ll look at how to fix excess white spaces in another exercise.
Exercise
Try to change the name of the second new column to “CMDOSE_units”.
Solution
On the
CMDOSE_mod 2
column, click the down arrow and thenEdit column
>Rename this column
. Type “CMDOSE_units” into the box that appears.
Undo / Redo
It’s common while exploring and cleaning a dataset to discover after you’ve made a change that you really should have done something else first. OpenRefine provides Undo
and Redo
operations to make this easy.
- Click where it says
Undo / Redo
on the left side of the screen. All the changes you have made so far are listed here. - Click on the step that you want to go back to, in this case the previous step. The added columns will disappear.
- Notice that you can still click on the last step and make the columns reappear, and toggle back and forth between these states.
- Leave the dataset in the state in which the
CMMED
values were clustered, but theCMDOSE_mod
values were not yet split.
Important: If you skip this step, your solutions for later exercises will not be the same as shown in those exercise solutions.
Collapse Whitespace
Words with extra spaces are particularly hard for we humans to tell from strings without, but the blank characters will make a difference to the computer. We usually want to remove these. OpenRefine provides a tool to remove blank characters from the beginning, end, or middle of any entries that have them.
- In the header for the column
CMDOSE_mod
, chooseEdit cells
>Common transforms
>Collapse consecutive whitespace
. - Notice that the
Split
step has now disappeared from theUndo / Redo
pane on the left and is replaced with aText transform on 22 cells
- Perform a
Split
operation onCMDOSE_mod
, similar to the one that you undid earlier. However, this time, cap the number of columns by entering3
into the box markedSplit into X columns at most
. This should split the columns into dose amount, dose units, and some leftovers. - To find out what the leftovers are, in the header for the column
CMDOSE_mod 3
, chooseSort
>Sort...
>Sort cell values as text
and clickOK
.
Solution
Removing the extra white spaces will solve some of the problem, but not all of it. We also have a number of observations for which the dosage appears to vary, with entries like “50 mg / 250 mg / 40 mg”. An important part of data cleaning for medications data is figuring out how you want to handle these edge cases.
Important: Undo
the splitting step before moving on to the next lesson. If you skip this step, your solutions
for later exercises will not be the same as shown in those exercise solutions.
Key Points
Faceting and clustering approaches can identify errors or outliers in data.