Summary and Schedule
Welcome to R! Working with a programming language (especially if it’s your first time) often feels intimidating, but the rewards will eventually outweigh any frustrations. An important secret of coding is that even experienced programmers find it difficult and frustrating at times – so if even the best feel that way, try not to let intimidation stop you. Given time and practice*, you will find it easier and easier to accomplish what you want.
Why learn to code? Bioinformatics – like biology – is messy. Different organisms, different systems, and different conditions all behave differently. Experiments at the bench require a variety of approaches – from tested protocols to trial-and-error. Bioinformatics is also an experimental science; otherwise, we could use the same software and the same parameters for every genome assembly. Learning to code opens up the full possibilities of computing, especially given that most bioinformatics tools exist only at the command line. Think of it this way: if you could only do molecular biology using a kit, you could probably accomplish a fair amount. However, if you don’t understand the biochemistry of the kit, how would you troubleshoot? How would you do experiments for which there are no kits?
R is one of the most widely used and powerful programming languages in bioinformatics. R especially shines where a variety of statistical tools are required (e.g. RNA-Seq, population genomics, etc.) and in the generation of publication-quality graphs and figures. Many researchers at this point ask, “Should I learn R or Python?”. In this lesson we will teach R, but keep in mind that many of the concepts you will learn apply to Python and other programming languages.
R can be difficult for some to learn. However, don’t get discouraged! The truth is that even with the modest amount of R we will cover today, you can start using some sophisticated R software packages and have a general sense of how to interpret an R script. Get through these lessons, and you are on your way to being an accomplished R user!
* We very intentionally used the word practice. One of the other “secrets” of programming is that you can only learn so much by reading about it. Do the exercises in class, re-do them on your own, and then work on your own problems.
Prerequisites
- Experimenter’s Mindset: We define the “Experimenter’s mindset” as an approach to bioinformatics that treats it like any other experiment. There are probably a variety of metaphors we could employ (data are our reagents, scripts are our protocols, etc.), but the most important idea of the mindset is to remind you that, as a researcher, you need to apply all of your training at the bench or in the field to your analyses. Evaluate results critically, and don’t expect that things will always work the first time, or that they will always work in the same way.
- Genomics Data Carpentry Instance: This lesson assumes you are using a Genomics Data Carpentry instance as described on the Genomics Workshop setup page.
Duration | Episode | Questions |
---|---|---|
Setup Instructions | Download files required for the lesson | |
00h 00m | 1. Introducing R and RStudio IDE | Why use R? Why use RStudio and how does it differ from R? |
00h 45m | 2. R Basics | What will these lessons not cover? What are the basic features of the R language? What are the most common objects in R? |
02h 05m | 3. Introduction to the example dataset and file type | What data are we using in the lesson? What are VCF files? |
02h 20m | 4. Data Wrangling and Analyses with Tidyverse | How can I manipulate data frames without repeating myself? |
03h 15m | 5. Data Visualization with ggplot2 | What is ggplot2? What is mapping, and what is aesthetics? What is the process of creating publication-quality plots with ggplot in R? |
04h 45m | 6. Getting help with R | How do I get help using R and RStudio? |
05h 00m | 7. Extra: R Basics continued - factors | How can I use an object with multiple objects in it? |
05h 40m | 8. Extra: R Basics continued - factors and data frames | How do I get started with tabular data (e.g. spreadsheets) in R? What are some best practices for reading data into R? How do I save tabular data generated in R? |
07h 10m | 9. Extra: Using packages from Bioconductor | How do I use packages from the Bioconductor repository? |
07h 23m | Finish | |
The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.
This lesson is an additional lesson to the genomics workshop. Below are detailed setup instructions for the main workshop, which can also be found on the main setup page. If you are only here for the Intro to R and RStudio for Genomics lesson and do not wish to work on the cloud, you can choose Option B below, where you only need to download the data files to the local working directory in which you will create the R project.
Genomics workshop setup directions
Overview
This workshop is designed to be run on pre-imaged Amazon Web Services (AWS) instances. With the exception of a spreadsheet program, all of the software and data used in the workshop are hosted on an Amazon Machine Image (AMI). Please follow the instructions below to prepare your computer for the workshop:
- Required additional software + Option A
OR
- Required additional software + Option B
Required additional software
This lesson requires a working spreadsheet program. If you don’t have a spreadsheet program already, you can use LibreOffice. It’s a free, open source spreadsheet program. Installation directions are included below for Windows, Mac OS X, and Linux systems. For Windows, you will also need to install Git Bash, PuTTY, or the Ubuntu Subsystem.
- Install LibreOffice by going to the installation page. The version for Windows should automatically be selected. Click Download Version X.X.X (whichever is the most recent version). You will go to a page that asks about a donation, but you don’t need to make one. Your download should begin automatically.
- Once the installer is downloaded, double click on it and LibreOffice should install.
- Download the Git for Windows installer. Run the installer and follow the steps below:
- Click on “Next” four times (two times if you’ve previously installed Git). You don’t need to change anything in the Information, location, components, and start menu screens.
- From the dropdown menu select “Use the Nano editor by default” (NOTE: you will need to scroll up to find it) and click on “Next”.
- On the page that says “Adjusting the name of the initial branch in new repositories”, ensure that “Let Git decide” is selected. This will ensure the highest level of compatibility for our lessons.
- Ensure that “Git from the command line and also from 3rd-party software” is selected and click on “Next”. (If you don’t do this, Git Bash will not work properly, and you will need to remove the Git Bash installation, re-run the installer, and select the “Git from the command line and also from 3rd-party software” option.)
- Ensure that “Use the native Windows Secure Channel Library” is selected and click on “Next”.
- Ensure that “Checkout Windows-style, commit Unix-style line endings” is selected and click on “Next”.
- Ensure that “Use Windows’ default console window” is selected and click on “Next”.
- Ensure that “Default (fast-forward or merge)” is selected and click on “Next”.
- Ensure that “Git Credential Manager Core” is selected and click on “Next”.
- Ensure that “Enable file system caching” is selected and click on “Next”.
- Click on “Install”.
- Click on “Finish”.
- Check the settings for your “HOME” environment variable.
- If your “HOME” environment variable is not set (or you don’t know what this is):
- Open command prompt (open the Start Menu, then type cmd and press [Enter]).
- Type the following line into the command prompt window exactly as shown:
setx HOME "%USERPROFILE%"
- Press [Enter]. You should see SUCCESS: Specified value was saved.
- Quit command prompt by typing exit and then pressing [Enter].
- An alternative option is to install PuTTY by going to the installation page. For most newer computers, click on putty-64bit-X.XX-installer.msi to download the 64-bit version. If you have an older laptop, you may need to get the 32-bit version, putty-X.XX-installer.msi. If you aren’t sure whether you need the 64-bit or 32-bit version, you can check your laptop version by following the instructions here. Once the installer is downloaded, double click on it, and PuTTY should install.
- Another alternative option is to use the Ubuntu Subsystem for Windows. This option is only available for Windows 10 - detailed instructions are available here.
- Install LibreOffice by going to the installation page. The version for Mac should automatically be selected. Click Download Version X.X.X (whichever is the most recent version). You will go to a page that asks about a donation, but you don’t need to make one. Your download should begin automatically.
- Once the installer is downloaded, double click on it and LibreOffice should install.
- Install LibreOffice by going to the installation page. The version for Linux should automatically be selected. Click Download Version X.X.X (whichever is the most recent version). You will go to a page that asks about a donation, but you don’t need to make one. Your download should begin automatically.
- Once the installer is downloaded, double click on it and LibreOffice should install.
Option A (Recommended): Using the lessons with Amazon Web Services (AWS)
If you are signed up to take a Genomics Data Carpentry workshop, you do not need to worry about setting up an AMI instance. The Carpentries staff will create an instance for you and this will be provided to you at no cost. This is true for both self-organized and centrally-organized workshops. Your Instructor will provide instructions for connecting to the AMI instance at the workshop.
If you would like to work through these lessons independently, outside of a workshop, you will need to start your own AMI instance. Follow these instructions on creating an Amazon instance. Use the AMI ami-04b3bc83255f918b0 (Data Carpentry Genomics with R 4.0) listed on the Community AMIs page. Please note that you must set your location as N. Virginia in order to access this community AMI. You can change your location in the upper right corner of the main AWS menu bar. The cost of using this AMI for a few days with the t2.medium instance type is very low (about USD $1.50 per user, per day). Data Carpentry has no control over the AWS pricing structure and provides this cost estimate with no guarantees. Please read the AWS documentation on pricing for up-to-date information.
If you’re an Instructor or Maintainer, or want to contribute to these lessons, please get in touch with us at team@carpentries.org and we will start instances for you.
Option B: Using the lessons on your local machine
While not recommended, it is possible to work through the lessons on your local machine (i.e. without using AWS). To do this, you will need to install all of the software used in the workshop and obtain a copy of the dataset. Instructions for doing this are below.
Data
The data used in this workshop is available on FigShare. Because this workshop works with real data, be aware that file sizes for the data are large. Please read the FigShare page linked below for information about the data and access to the data files.
FigShare Data Carpentry Genomics Beta 2.0
More information about these data will be presented in the first lesson of the workshop.
Software
Software | Version | Manual | Available for | Description |
---|---|---|---|---|
FastQC | 0.11.7 | Documentation | Linux, MacOS, Windows | Quality control tool for high throughput sequence data. |
Trimmomatic | 0.38 | Documentation | Linux, MacOS, Windows | A flexible read trimming tool for Illumina NGS data. |
BWA | 0.7.17 | Documentation | Linux, MacOS | Mapping DNA sequences against reference genome. |
SAMtools | 1.9 | Documentation | Linux, MacOS | Utilities for manipulating alignments in the SAM format. |
BCFtools | 1.8 | Documentation | Linux, MacOS | Utilities for variant calling and manipulating VCFs and BCFs. |
IGV | Documentation | Documentation | Linux, MacOS, Windows | Visualization and interactive exploration of large genomics datasets. |
QuickStart Software Installation Instructions
These are the QuickStart installation instructions. They assume familiarity with the command line and with installation in general. As there are different operating systems and many different versions of operating systems and environments, these may not work on your computer. If an installation doesn’t work for you, please refer to the user guide for the tool, listed in the table above.
We have installed software using miniconda. Miniconda is a package manager that simplifies the installation process. Please first install miniconda3 (installation instructions below), and then proceed to the installation of individual tools.
FastQC
To install FastQC, type:
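A minimal sketch of the conda command, assuming FastQC is packaged on the bioconda channel under the name fastqc and that version 0.11.7 (from the table above) is available there. The conda_install helper is hypothetical; it only prints the command unless RUN=1 is set, so you can check it before running it:

```shell
# Hypothetical helper: print (or, with RUN=1, execute) a pinned conda install.
# The channel names are an assumption; adjust them to your conda setup.
conda_install() {
    if [ "${RUN:-0}" = "1" ]; then
        conda install -y -c conda-forge -c bioconda "$1=$2"
    else
        echo "conda install -y -c conda-forge -c bioconda $1=$2"
    fi
}

# Dry run: show the command for the FastQC version used in the lessons.
conda_install fastqc 0.11.7
```

Pinning the version keeps your local tools in step with the ones the lessons were written for.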
FastQC Source Code Installation
If you prefer to install from source, follow the directions below:
BASH
$ cd ~/src
$ curl -O http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.7.zip
$ unzip fastqc_v0.11.7.zip
Link the fastqc executable to the ~/bin folder that you have already added to the path.
Due to what seems to be a packaging error, the executable flag on the fastqc program is not set, so we need to set it ourselves.
Test your installation by running:
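The three steps above (link, set the executable flag, test) can be sketched as follows. The paths are assumptions: we assume the zip file unpacked into ~/src/FastQC and that ~/bin is already on your PATH. link_tool is a hypothetical helper, not part of FastQC:

```shell
# Hypothetical helper: make a program executable and link it into a bin directory.
link_tool() {
    src="$1"      # full path to the program, e.g. ~/src/FastQC/fastqc
    bin_dir="$2"  # a directory that is already on your PATH, e.g. ~/bin
    [ -f "$src" ] || { echo "not found: $src" >&2; return 1; }
    chmod +x "$src"                               # set the missing executable flag
    mkdir -p "$bin_dir"                           # create the bin directory if needed
    ln -sf "$src" "$bin_dir/$(basename "$src")"   # symbolic link into bin_dir
}

# Only act if FastQC is where we assume it was unzipped.
if [ -f "$HOME/src/FastQC/fastqc" ]; then
    link_tool "$HOME/src/FastQC/fastqc" "$HOME/bin"
    fastqc -h    # prints the FastQC help text if the link works
fi
```

If fastqc -h prints the help text, the link and the executable flag are both in place.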
Trimmomatic
Trimmomatic Source Code Installation
If you prefer to install from source, follow the directions below:
BASH
$ cd ~/src
$ curl -O http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.38.zip
$ unzip Trimmomatic-0.38.zip
The program can be invoked via:
$ java -jar ~/src/Trimmomatic-0.38/trimmomatic-0.38.jar
The ~/src/Trimmomatic-0.38/adapters/ directory contains Illumina specific adapter sequences.
Test your installation by running: (assuming things are installed in ~/src)
SAMtools
SAMtools Versions
SAMtools has changed the command line invocation (for the better). But this means that most of the tutorials on the web indicate an older and obsolete usage.
Using SAMtools version 1.9 is important to work with the commands we present in these lessons.
SAMtools Source Code Installation
If you prefer to install from source, follow the instructions below:
BASH
$ cd ~/src
$ curl -OkL https://github.com/samtools/samtools/releases/download/1.9/samtools-1.9.tar.bz2
$ tar jxvf samtools-1.9.tar.bz2
$ cd samtools-1.9
$ make
Add directory to the path if necessary:
Test your installation by running:
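The two steps above can be sketched as follows, assuming SAMtools was built in ~/src/samtools-1.9 as in the commands above. prepend_path is a hypothetical helper that avoids adding the same directory twice:

```shell
# Hypothetical helper: put a directory at the front of PATH unless it is already listed.
prepend_path() {
    case ":$PATH:" in
        *":$1:"*) ;;             # already on PATH, nothing to do
        *) PATH="$1:$PATH" ;;    # otherwise add it at the front
    esac
    export PATH
}

# Assumed build location from the steps above.
prepend_path "$HOME/src/samtools-1.9"

# Report the version if the build succeeded; otherwise say so.
if command -v samtools >/dev/null 2>&1; then
    samtools --version | head -n 1
else
    echo "samtools not found on PATH"
fi
```

The change lasts only for the current shell session. To keep it, add the same PATH line to your shell startup file (for example ~/.bashrc).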
BCFtools
BCFtools Source Code Installation
If you prefer to install from source, follow the instructions below:
BASH
$ cd ~/src
$ curl -OkL https://github.com/samtools/bcftools/releases/download/1.8/bcftools-1.8.tar.bz2
$ tar jxvf bcftools-1.8.tar.bz2
$ cd bcftools-1.8
$ make
Add directory to the path if necessary:
Test your installation by running:
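Because the lessons depend on specific tool versions, it can help to compare the reported version with the expected one. This is a minimal sketch, assuming BCFtools was built in ~/src/bcftools-1.8 and that the first line of bcftools --version has the form "bcftools 1.8". check_version is a hypothetical helper:

```shell
# Hypothetical helper: compare the version a tool reports with the expected one.
# Assumes the first line of "<tool> --version" looks like "<tool> <version>".
check_version() {
    tool="$1"; expected="$2"
    command -v "$tool" >/dev/null 2>&1 || { echo "$tool: not on PATH"; return 1; }
    found=$("$tool" --version | head -n 1 | awk '{print $2}')
    if [ "$found" = "$expected" ]; then
        echo "$tool $found: OK"
    else
        echo "$tool $found: expected $expected"
        return 1
    fi
}

export PATH="$HOME/src/bcftools-1.8:$PATH"   # assumed build location
check_version bcftools 1.8 || true           # report the result without stopping the shell
```

The same helper should work for SAMtools, for example check_version samtools 1.9.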
IGV
- Download the IGV installation files
- Install and run IGV using the instructions for your operating system.