Extra: R Basics continued - factors

Last updated on 2024-11-19

Overview

Questions

  • How can I use an object with multiple objects in it?

Objectives

  • Be able to retrieve (subset), name, or replace values in a vector
  • Be able to use logical operators in a subsetting operation

Vectors


Vectors are probably the most commonly used object type in R. A vector is a collection of values that are all of the same type (numbers, characters, etc.). One of the most common ways to create a vector is to use the c() function - the “concatenate” or “combine” function. Inside the function you may enter one or more values; for multiple values, separate each value with a comma:

R

# Create the SNP gene name vector

snp_genes <- c("OXTR", "ACTN3", "AR", "OPRM1")

Vectors always have a mode and a length. You can check these with the mode() and length() functions respectively. Another useful function that gives both of these pieces of information is the str() (structure) function.

R

# Check the mode, length, and structure of 'snp_genes'
mode(snp_genes)

OUTPUT

[1] "character"

R

length(snp_genes)

OUTPUT

[1] 4

R

str(snp_genes)

OUTPUT

 chr [1:4] "OXTR" "ACTN3" "AR" "OPRM1"
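
Because a vector holds values of a single type, c() converts mixed inputs to one common type rather than raising an error. The short sketch below illustrates this; the object name 'mixed' is hypothetical and not part of the lesson data.

```r
# 'mixed' is a hypothetical object used only for illustration
mixed <- c(1, "a", TRUE)

# every value has been converted to a character string
mixed          # "1" "a" "TRUE"
mode(mixed)    # "character"
length(mixed)  # 3
```

This is worth keeping in mind when a vector of numbers unexpectedly reports its mode as "character": a single quoted value is enough to convert the whole vector.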

Vectors are quite important in R. Data frames, another data type that we will work with later in this lesson, are collections of vectors. What we learn here about vectors will pay off even more when we start working with data frames.

Creating and subsetting vectors


Let’s create a few more vectors to play around with:

R

# Some interesting human SNPs
# while accuracy is important, typos in the data won't hurt you here

snps <- c("rs53576", "rs1815739", "rs6152", "rs1799971")
snp_chromosomes <- c("3", "11", "X", "6")
snp_positions <- c(8762685, 66560624, 67545785, 154039662)

Once we have vectors, one thing we may want to do is retrieve one or more specific values from a vector. To do so, we use bracket notation: we type the name of the vector followed by square brackets, and inside the brackets we place the index (e.g. a number) of the value we want:

R

# get the 3rd value in the snp vector
snps[3]

OUTPUT

[1] "rs6152"

In R, every item in your vector is indexed, starting from the first item (1) through to the final item in your vector. You can also retrieve a range of items:

R

# get the 1st through 3rd value in the snp vector

snps[1:3]

OUTPUT

[1] "rs53576"   "rs1815739" "rs6152"   

If you want to retrieve several (but not necessarily sequential) items from a vector, you pass a vector of indices; a vector that has the numbered positions you wish to retrieve.

R

# get the 1st, 3rd, and 4th value in the snp vector

snps[c(1, 3, 4)]

OUTPUT

[1] "rs53576"   "rs6152"    "rs1799971"

There are additional (and perhaps less commonly used) ways of subsetting a vector (see these examples). Also, several of these subsetting expressions can be combined:

R

# get the 1st through the 3rd value, and 4th value in the snp vector
# yes, this is a little silly in a vector of only 4 values.
snps[c(1:3,4)]

OUTPUT

[1] "rs53576"   "rs1815739" "rs6152"    "rs1799971"
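
One of the additional ways of subsetting is by name. The sketch below assumes we attach gene names to a copy of the SNP vector with the names() function; the object 'named_snps' is hypothetical and introduced only for this illustration.

```r
# 'named_snps' is a hypothetical copy used only for illustration
snps <- c("rs53576", "rs1815739", "rs6152", "rs1799971")
named_snps <- snps
names(named_snps) <- c("OXTR", "ACTN3", "AR", "OPRM1")

# retrieve a value by its name instead of its position
named_snps["AR"]

# retrieve several values by passing a character vector of names
named_snps[c("OXTR", "OPRM1")]
```

Subsetting by name can be easier to read than subsetting by position, and it keeps working if the order of the items changes.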

Adding to, removing, or replacing values in existing vectors


Once you have an existing vector, you may want to add a new item to it. To do so, you can use the c() function again to add your new value:

R

# add the genes "CYP1A1" and "APOA5" to our vector of snp genes
# this overwrites our existing vector
snp_genes <- c(snp_genes, "CYP1A1", "APOA5")

We can verify that “snp_genes” contains the new gene entries:

R

snp_genes

OUTPUT

[1] "OXTR"   "ACTN3"  "AR"     "OPRM1"  "CYP1A1" "APOA5" 

Using a negative index will return a version of a vector with that index’s value removed:

R

snp_genes[-6]

OUTPUT

[1] "OXTR"   "ACTN3"  "AR"     "OPRM1"  "CYP1A1"

We can remove that value from our vector by overwriting it with this expression:

R

snp_genes <- snp_genes[-6]
snp_genes

OUTPUT

[1] "OXTR"   "ACTN3"  "AR"     "OPRM1"  "CYP1A1"

We can also explicitly replace or add a value at a specific index by combining bracket notation with assignment:

R

snp_genes[6] <- "APOA5"
snp_genes

OUTPUT

[1] "OXTR"   "ACTN3"  "AR"     "OPRM1"  "CYP1A1" "APOA5" 
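
The same assignment syntax replaces a value when the index already exists. If the index lies beyond the current end of the vector, R fills any skipped positions with NA. The sketch below shows both cases on a hypothetical vector named 'genes_copy', so the lesson's snp_genes is left untouched.

```r
# 'genes_copy' is a hypothetical vector used only for illustration
genes_copy <- c("OXTR", "ACTN3", "AR")

# replace the value at an existing index
genes_copy[2] <- "OPRM1"

# assign past the end of the vector: position 4 is filled with NA
genes_copy[5] <- "APOA5"

genes_copy          # "OXTR" "OPRM1" "AR" NA "APOA5"
length(genes_copy)  # 5
```

Checking length() after an assignment like this is a quick way to catch an index that was larger than intended.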

Exercise: Examining and subsetting vectors

Answer the following questions to test your knowledge of vectors

Which of the following are true of vectors in R?

A) All vectors have a mode or a length
B) All vectors have a mode and a length
C) Vectors may have different lengths
D) Items within a vector may be of different modes
E) You can use c() to add one or more items to an existing vector
F) You can use c() to add a vector to an existing vector

  A. False - Vectors have both of these properties
  B. True
  C. True
  D. False - Vectors have only one mode (e.g. numeric, character); all items in
    a vector must be of this mode.
  E. True
  F. True

Logical Subsetting


There is one last set of cool subsetting capabilities we want to introduce. It is possible within R to retrieve items in a vector based on a logical evaluation or numerical comparison. For example, let’s say we wanted to get all of the SNPs in our vector of SNP positions that were greater than 100,000,000. We could index using the ‘>’ (greater than) logical operator:

R

snp_positions[snp_positions > 100000000]

OUTPUT

[1] 154039662

In the square brackets you place the name of the vector followed by the comparison operator and (in this case) a numeric value. Some of the most common logical operators you will use in R are:

Operator   Description
<          less than
<=         less than or equal to
>          greater than
>=         greater than or equal to
==         exactly equal to
!=         not equal to
!x         not x
a | b      a or b
a & b      a and b
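
Comparisons can be combined with & and | to build more specific subsets. The sketch below reuses the four SNP positions defined earlier; the cutoff values are arbitrary and chosen only for illustration.

```r
snp_positions <- c(8762685, 66560624, 67545785, 154039662)

# positions between 50,000,000 and 100,000,000 (both conditions must be TRUE)
snp_positions[snp_positions > 50000000 & snp_positions < 100000000]
# returns 66560624 67545785

# positions below 10,000,000 or above 100,000,000 (either condition may be TRUE)
snp_positions[snp_positions < 10000000 | snp_positions > 100000000]
# returns 8762685 154039662

# '!' flips each TRUE/FALSE, here keeping positions that are NOT above 100,000,000
snp_positions[!(snp_positions > 100000000)]
# returns 8762685 66560624 67545785
```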

The magic of programming

The reason why the expression snp_positions[snp_positions > 100000000] works can be better understood if you examine what the expression “snp_positions > 100000000” evaluates to:

R

snp_positions > 100000000

OUTPUT

[1] FALSE FALSE FALSE  TRUE

The output above is a logical vector, the 4th element of which is TRUE. When you pass a logical vector as an index, R returns the items at the positions where the logical vector is TRUE:

R

snp_positions[c(FALSE, FALSE, FALSE, TRUE)]

OUTPUT

[1] 154039662

If you have never coded before, this type of situation starts to expose the “magic” of programming. We mentioned before that in bracket notation you take your named vector followed by brackets which contain an index: named_vector[index]. The “magic” is that the index does not have to be typed in as a literal number. As long as R can evaluate the expression inside the brackets to something it can use as an index (a vector of positions such as 1, 2, 3, or a logical vector of TRUE and FALSE values), we will get a result. Our expression snp_positions[snp_positions > 100000000] works because the comparison picks out a position in the vector. Suppose you wanted to know which index (1, 2, 3, or 4) in our vector of SNP positions holds the value greater than 100,000,000.

We can use the which() function to return the indices of any item that evaluates as TRUE in our comparison:

R

which(snp_positions > 100000000)

OUTPUT

[1] 4
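
Because which() returns positions, its result can be used to index a different vector of the same length. The sketch below assumes, as in this lesson, that snps and snp_positions are parallel vectors, so the 4th SNP ID belongs to the 4th position.

```r
snps <- c("rs53576", "rs1815739", "rs6152", "rs1799971")
snp_positions <- c(8762685, 66560624, 67545785, 154039662)

# find the matching index, then use it to look up the SNP ID
big_position_index <- which(snp_positions > 100000000)
snps[big_position_index]              # "rs1799971"

# the logical vector can also be used directly as the index
snps[snp_positions > 100000000]       # "rs1799971"
```

This only gives meaningful results when the two vectors are in the same order, which is one reason related values are usually kept together in a data frame.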

Why this is important

Often in programming we will not know what inputs and values will be used when our code is executed. Rather than put in a pre-determined value (e.g. 100000000) we can use an object that can take on whatever value we need. So for example:

R

snp_marker_cutoff <- 100000000
snp_positions[snp_positions > snp_marker_cutoff]

OUTPUT

[1] 154039662

Ultimately, it’s putting together flexible, reusable code like this that gets at the “magic” of programming!
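
One way to take this idea a step further is to wrap the subsetting expression in a function so that both the vector and the cutoff can vary. This is only a sketch; the function name 'values_above' is hypothetical and is not used elsewhere in the lesson.

```r
# 'values_above' is a hypothetical helper written for this illustration
values_above <- function(values, cutoff) {
  # keep only the items that are greater than the cutoff
  values[values > cutoff]
}

snp_positions <- c(8762685, 66560624, 67545785, 154039662)

values_above(snp_positions, 100000000)  # 154039662
values_above(snp_positions, 60000000)   # 66560624 67545785 154039662
```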

A few final vector tricks


Finally, there are a few other common retrieve or replace operations you may want to know about. First, you can check to see if any of the values of your vector are missing (i.e. are NA, which stands for “not available”). Missing data will get a more detailed treatment later, but the is.na() function will return a logical vector, with TRUE for any NA value:

R

# current value of 'snp_genes':
# chr [1:6] "OXTR" "ACTN3" "AR" "OPRM1" "CYP1A1" "APOA5"

is.na(snp_genes)

OUTPUT

[1] FALSE FALSE FALSE FALSE FALSE FALSE
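
To see is.na() return TRUE, we need a vector that actually contains missing values. The sketch below uses a hypothetical vector named 'genes_with_na' and also shows two common follow-up steps: counting the missing values and dropping them.

```r
# 'genes_with_na' is a hypothetical vector used only for illustration
genes_with_na <- c("OXTR", NA, "AR", NA)

is.na(genes_with_na)                   # FALSE TRUE FALSE TRUE

# TRUE counts as 1 and FALSE as 0, so sum() counts the missing values
sum(is.na(genes_with_na))              # 2

# keep only the values that are not missing
genes_with_na[!is.na(genes_with_na)]   # "OXTR" "AR"
```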

Sometimes, you may wish to find out if a specific value (or several values) is present in a vector. You can do this using the comparison operator %in%, which will return TRUE for any value in your collection that is in the vector you are searching:

R

# current value of 'snp_genes':
# chr [1:6] "OXTR" "ACTN3" "AR" "OPRM1" "CYP1A1" "APOA5"

# test to see if "ACTN3" or "APOA5" is in the snp_genes vector
# if you are looking for more than one value, you must pass this as a vector

c("ACTN3","APOA5") %in% snp_genes

OUTPUT

[1] TRUE TRUE
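
The order of the two sides of %in% matters: the result always has one value per item on the left-hand side. The sketch below uses hypothetical vectors named 'genes_example' and 'wanted' (including a gene, "BRCA1", that is not in the lesson data) to show both directions and how the result can be used for subsetting.

```r
# 'genes_example' and 'wanted' are hypothetical vectors used for illustration
genes_example <- c("OXTR", "ACTN3", "AR", "OPRM1", "CYP1A1", "APOA5")
wanted <- c("ACTN3", "APOA5", "BRCA1")

# one answer per item in 'wanted'
wanted %in% genes_example                  # TRUE TRUE FALSE

# one answer per item in 'genes_example'
genes_example %in% wanted                  # FALSE TRUE FALSE FALSE FALSE TRUE

# use the logical result to keep only the matching genes
genes_example[genes_example %in% wanted]   # "ACTN3" "APOA5"
```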

Review Exercise 1

What data modes are the following vectors?

a. snps
b. snp_chromosomes
c. snp_positions

R

mode(snps)

OUTPUT

[1] "character"

R

mode(snp_chromosomes)

OUTPUT

[1] "character"

R

mode(snp_positions)

OUTPUT

[1] "numeric"

Review Exercise 2

Add the following values to the specified vectors:

a. To the snps vector add: “rs662799”
b. To the snp_chromosomes vector add: 11
c. To the snp_positions vector add: 116792991

R

snps <- c(snps, "rs662799")
snps

OUTPUT

[1] "rs53576"   "rs1815739" "rs6152"    "rs1799971" "rs662799" 

R

snp_chromosomes <- c(snp_chromosomes, "11") # did you use quotes?
snp_chromosomes

OUTPUT

[1] "3"  "11" "X"  "6"  "11"

R

snp_positions <- c(snp_positions, 116792991)
snp_positions

OUTPUT

[1]   8762685  66560624  67545785 154039662 116792991

Review Exercise 3

Make the following changes to the snp_genes vector:

Hint: Your vector should look like this in ‘Environment’: chr [1:6] "OXTR" "ACTN3" "AR" "OPRM1" "CYP1A1" "APOA5". If not, recreate the vector by running this expression: snp_genes <- c("OXTR", "ACTN3", "AR", "OPRM1", "CYP1A1", "APOA5")

  1. Create a new version of snp_genes that does not contain CYP1A1 and then
  2. Add 2 NA values to the end of snp_genes

R

snp_genes <- snp_genes[-5]
snp_genes <- c(snp_genes, NA, NA)
snp_genes

OUTPUT

[1] "OXTR"  "ACTN3" "AR"    "OPRM1" "APOA5" NA      NA     

Review Exercise 4

Using indexing, create a new vector named combined that contains:

  • The 1st value in snp_genes
  • The 1st value in snps
  • The 1st value in snp_chromosomes
  • The 1st value in snp_positions

R

combined <- c(snp_genes[1], snps[1], snp_chromosomes[1], snp_positions[1])
combined

OUTPUT

[1] "OXTR"    "rs53576" "3"       "8762685"

Review Exercise 5

What type of data is combined?

R

typeof(combined)

OUTPUT

[1] "character"
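
The position was stored as a number, but combining it with character values turned it into the string "8762685". If you need to do arithmetic with it again, it has to be converted back. The sketch below uses as.numeric() on a hypothetical copy named 'combined_example'.

```r
# 'combined_example' is a hypothetical copy of the vector from Exercise 4
combined_example <- c("OXTR", "rs53576", "3", "8762685")

# convert the 4th item back to a number
position <- as.numeric(combined_example[4])

typeof(position)   # "double"
position + 1       # 8762686
```

Values that do not look like numbers (for example "OXTR") would become NA with a warning if passed to as.numeric(), so it is safest to convert only the items you know are numeric.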

Lists

Lists are quite useful in R, but we won’t be using them in the genomics lessons. That said, you may come across lists in the way that some bioinformatics programs may store and/or return data to you. One of the key attributes of a list is that, unlike a vector, a list may contain data of more than one mode. Learn more about creating and using lists using this nice tutorial. In this one example, we will create a named list and show you how to retrieve items from the list.

R

# Create a named list using the 'list' function and our SNP examples
# Note, for easy reading we have placed each item in the list on a separate line
# Nothing special about this, you can do this for any multiline commands
# To run this command, make sure the entire command (all 4 lines) is highlighted
# before running
# Note also, as we are doing all this inside the list() function use of the
# '=' sign is good style
snp_data <- list(genes = snp_genes,
                 reference_snp = snps,
                 chromosome = snp_chromosomes,
                 position = snp_positions)
# Examine the structure of the list
str(snp_data)

OUTPUT

List of 4
 $ genes        : chr [1:7] "OXTR" "ACTN3" "AR" "OPRM1" ...
 $ reference_snp: chr [1:5] "rs53576" "rs1815739" "rs6152" "rs1799971" ...
 $ chromosome   : chr [1:5] "3" "11" "X" "6" ...
 $ position     : num [1:5] 8.76e+06 6.66e+07 6.75e+07 1.54e+08 1.17e+08

To get all the values for the position object in the list, we use the $ notation:

R

# return all the values of position object

snp_data$position

OUTPUT

[1]   8762685  66560624  67545785 154039662 116792991

To get the first value in the position object, use the [] notation to index:

R

# return first value of the position object

snp_data$position[1]

OUTPUT

[1] 8762685
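
Besides the $ notation, list items can be retrieved with double brackets, either by name or by position. The sketch below uses a small hypothetical list named 'snp_list' and also shows how single brackets behave differently on a list.

```r
# 'snp_list' is a hypothetical list used only for illustration
snp_list <- list(chromosome = c("3", "11", "X"),
                 position = c(8762685, 66560624, 67545785))

names(snp_list)                # "chromosome" "position"

# double brackets return the item itself (here, a numeric vector)
snp_list[["position"]]         # same values as snp_list$position
snp_list[[2]]                  # same item, retrieved by position

# single brackets return a smaller list that still wraps the item
class(snp_list["position"])    # "list"

# double brackets can be followed by ordinary vector indexing
snp_list[["chromosome"]][3]    # "X"
```

Double brackets are useful when the name of the item is stored in another object, since $ only accepts a name typed directly.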

Key Points

  • Working with vectors effectively prepares you for understanding how data are organized in R.