DESEq2 for time series

Karina Jhingan

Mar 14, 2024

DESEq2 for time series

This protocol is a draft, published without a DOI.

Karina Jhingan¹

¹Fred Hutch

Karina Jhingan

Fred Hutch

Protocol Citation: Karina Jhingan 2024. DESEq2 for time series. protocols.io https://protocols.io/view/deseq2-for-time-series-dajh2cj6

License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited

Protocol status: Working

We use this protocol and it's working

Created: March 12, 2024

Last Modified: March 14, 2024

Protocol Integer ID: 96585

Disclaimer

This code in this protocol is an altered version of https://alexslemonade.github.io/refinebio-examples/03-rnaseq/differential-expression_rnaseq_01.html

Abstract

This is a differential analysis for a time series experiment using DESeq2.

Introduction

This protocol is heavily based off of the protocol provided by https://alexslemonade.github.io/refinebio-examples/03-rnaseq/differential-expression_rnaseq_01.html.

Note that DESeq and any other differential analysis requires replicates.

Imports/Libraries

Citation: “ashr” for LFC shrinkage
Stephens, M. (2016) False discovery rates: a new
deal. Biostatistics, 18:2. https://doi.org/10.1093/biostatistics/kxw041
if (!("DESeq2" %in% installed.packages())) {
  # Install this package if it isn't installed yet
  BiocManager::install("DESeq2", update = FALSE)
}
if (!("apeglm" %in% installed.packages())) {
  # Install this package if it isn't installed yet
  BiocManager::install("apeglm", update = FALSE)
}
if (!("ashr" %in% installed.packages())) {
  # Install this package if it isn't installed yet
  BiocManager::install("ashr", update = FALSE)
}

# Attach the DESeq2 library
library(DESeq2)

# Attach the ggplot2 library for plotting
library(ggplot2)

# We will need this so we can use the pipe: %>%
library(magrittr)

library(tidyverse)

#set seed for DESeq2::plotCounts() for reproducibility
set.seed(12345)

Data Set up

Set Project Path

#replace with your directory
projPath="/fh/fast/greenberg_p/user/kjhingan/DESeq2"

Gene Expression Data Matrix
A csv file where rows are genes and columns are samples/time stamps
# Replace with the path to your dataset file
data_file <- file.path(projPath, "gsva_data.csv")

Metadata File
A csv file with nrows(metadata) == ncols(datafile)
three columns:
-id: sample name, i.e the column names from the
data file 
-group (i.e time stamp or control group)
-replicate

# Replace with the path to your metadata file
metadata_file <- file.path(projPath, "gsva_metadata.csv")

# Check if the gene expression matrix file is at the path stored in `data_file`
file.exists(data_file)

# Check if the metadata file is at the file path stored in `metadata_file`
file.exists(metadata_file)
You should see TRUE outputted twice 

Read Data

# Read in metadata CSV file
metadata <- readr::read_csv(metadata_file)

# Read in data CSV file
expression_df <- readr::read_csv(data_file)

Set up data matrix

# ensure unique row names for the next function
expression_df <- expression_df[!(duplicated(expression_df[[1]]) | 
                                   duplicated(expression_df[[1]], fromLast=TRUE)),]


expression_df <-expression_df %>%
  # store the gene IDs as row names so we have a numeric matrix to perform calculations on later
  tibble::column_to_rownames("Gene")

Set up metadata
We are using the naive samples ("0 hours") as a reference/control. 
metadata <- metadata %>%
  dplyr::mutate(time_stamp = dplyr::case_when(
    stringr::str_detect(group, "naïve") ~ "reference",
    stringr::str_detect(group, "24") ~ "24_hours",
    stringr::str_detect(group, "48") ~ "48_hours",
    stringr::str_detect(group, "72") ~ "72_hours"
  ))

# Let's take a look at the original metadata column's info
# and our new `mutation_status` column
dplyr::select(metadata, group, time_stamp)



# Print out a preview of `mutation_status`
str(metadata$time_stamp)

# Make mutation_status a factor and set the levels appropriately
metadata <- metadata %>%
  dplyr::mutate(
    # Here we define the values our factor variable can have and their order.
    time_stamp = factor(time_stamp, levels = c("reference", "24_hours", "48_hours", "72_hours"))
  )

levels(metadata$time_stamp)
You should see the following printed out: [1] "reference" "24_hours"  "48_hours"  "72_hours" 

# Define a minimum counts cutoff and filter the data to include
# only rows (genes) that have total counts above the cutoff
filtered_expression_df <- expression_df %>%
  dplyr::filter(rowSums(.) >= 10)

# round all expression counts
gene_matrix <- round(filtered_expression_df)

DESeq2

Create DESeq2 Object
ddset <- DESeqDataSetFromMatrix(
  # Here we supply non-normalized count data
  countData = gene_matrix,
  # Supply the `colData` with our metadata data frame
  colData = metadata,
  # Supply our experimental variable to `design`
  design = ~time_stamp
)

Run DESeq2
deseq_object <- DESeq(ddset)

View Results

deseq_results <- results(deseq_object)
resultsNames(deseq_object)
[1] "Intercept"                        "time_stamp_24_hours_vs_reference" "time_stamp_48_hours_vs_reference"
[4] "time_stamp_72_hours_vs_reference"
head(deseq_results)

Normalize results by log fold change
Here we are shrinking log fold change by ashr, for more information on ashr go to  step 2 and view the citation.
deseq_results <- lfcShrink(
  deseq_object, # The original DESeq2 object after running DESeq()
  contrast=c("reference","24_hours","48_hours", "72_hours"),
  type = "ashr",
  res = deseq_results # The original DESeq2 results table
)
head(deseq_results)

Set up DESeq results for visualization
Convert the results into a dataframe
deseq_df <- deseq_results %>%
  # make into data.frame
  as.data.frame() %>%
  # the gene names are row names -- let's make them a column for easy display
  tibble::rownames_to_column("Gene") %>%
  # add a column for significance threshold results
  dplyr::mutate(threshold = padj < 0.05) %>%
  # sort by statistic -- the highest values will be genes with
  # higher expression in RPL10 mutated samples
  dplyr::arrange(dplyr::desc(log2FoldChange))

head(deseq_df)

Visualizations

Plot counts
IL2RA was only chosen as an example, change to your gene of interest into the gene parameter.
plotCounts(ddset, gene = "IL2RA", intgroup = "time_stamp")

Save Results

Save DESeq2 results
readr::write_tsv(
  deseq_df,
  file.path(
    projPath,
    "SRP123625_diff_expr_results.tsv" # Replace with a relevant output file name
  )
)

Public workspaceDESEq2 for time series

DESEq2 for time series