-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathdata-preprocessing.Rmd
More file actions
243 lines (187 loc) · 9.59 KB
/
data-preprocessing.Rmd
File metadata and controls
243 lines (187 loc) · 9.59 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
# **RNA Data preprocessing**
## **Introduction**
Transcriptomic data primarily encompasses microarray data and RNA-seq data. Microarray data are predominantly derived from platforms such as Affymetrix and Illumina, while RNA-seq data are largely generated using second-generation sequencing technologies, with third-generation sequencing becoming increasingly available. When performing TME analysis and calculating signature scores, it is essential to first annotate and normalize the gene expression matrix.
The IOBR package offers a suite of functions designed to facilitate rapid preprocessing of transcriptomic data, including:
- **count2tpm**: Performs gene annotation, removes duplicate genes, and applies TPM normalization.
- **anno_eset**: Handles duplicate gene removal and gene annotation.
- **find_outlier_samples**: Identifies and removes outlier samples.
- **iobr_pca**: Visualizes batch effects and examines biological variability.
- **remove_batcheffect**: Removes batch effects and integrates multiple datasets.
**Streamlining Gene Annotation for Downstream Analysis**
Since downstream TME analysis tools and scoring functions in IOBR primarily recognize gene symbols rather than probe IDs, Ensembl IDs, or Entrez IDs, the **`anno_eset()`** function simplifies the annotation process. It enables efficient annotation of microarray data (e.g., Affymetrix and Illumina) and RNA-seq data (e.g., Ensembl and Entrez IDs), converting these identifiers into the more familiar gene symbols.
**Improving Inter-Sample Comparability Through Normalization**
In RNA-seq data analysis, TPM (transcripts per million) is a widely adopted normalization method that converts raw read counts into relative abundance or expression levels. TPM normalization accounts not only for the read count of each gene but also for the gene length, providing a more accurate comparison of expression levels across genes. This method effectively addresses the bias introduced by varying gene lengths, improving the comparability of expression levels between different genes.
IOBR offers the **`count2tpm()`** function for TPM conversion of RNA-seq count data. For microarray data, which is typically pre-normalized, packages such as affy and limma are commonly used for preprocessing. However, due to the higher cost and relatively limited information content of microarrays, RNA-seq has become the dominant technology in transcriptomic studies.
## Loading packages
Load the IOBR package in your R session after the installation is complete:
```{r, eval=TRUE, warning=FALSE, message=FALSE}
library(IOBR)
library(tidyverse)
library(clusterProfiler)
```
## Gene Annotation
Annotation of genes in the expression matrix and removal of duplicate genes.
```{r,message=FALSE,warning=FALSE}
# Load the annotation file `anno_hug133plus2` in IOBR.
anno_hug133plus2 <- load_data("anno_hug133plus2")
head(anno_hug133plus2)
```
```{r,message=FALSE,warning=FALSE}
# Load the annotation file `anno_grch38` in IOBR.
anno_grch38 <- load_data("anno_grch38")
head(anno_grch38)
```
```{r,message=FALSE,warning=FALSE}
# Load the annotation file `anno_gc_vm32` in IOBR for mouse RNAseq data
anno_gc_vm32 <- load_data("anno_gc_vm32")
head(anno_gc_vm32)
```
## Download array data using `GEOquery`
Obtaining data set from GEO [Gastric cancer: GSE62254](https://pubmed.ncbi.nlm.nih.gov/25894828/) using `GEOquery` R package.
```{r,message=FALSE,warning=FALSE}
if (!requireNamespace("GEOquery", quietly = TRUE)) BiocManager::install("GEOquery")
library("GEOquery")
# NOTE: This process may take a few minutes which depends on the internet connection speed. Please wait for its completion.
eset_geo <- getGEO(GEO = "GSE62254", getGPL = F, destdir = "./")
eset <- eset_geo[[1]]
eset <- exprs(eset)
eset[1:5, 1:5]
```
### For Array data: HGU133PLUS-2 (Affaymetrix)
```{r,message=FALSE,warning=FALSE}
# Conduct gene annotation using `anno_hug133plus2` file; If identical gene symbols exists, these genes would be ordered by the mean expression levels. The gene symbol with highest mean expression level is selected and remove others.
eset <- anno_eset(
eset = eset,
annotation = anno_hug133plus2,
symbol = "symbol",
probe = "probe_id",
method = "mean"
)
eset[1:5, 1:3]
```
## Download RNAseq data using `UCSCXenaTools`
In this section, we are going to download RNA-seq data from The Cancer Genome Atlas (TCGA) for applying the downstream analysis workflow of **IOBR**. Particularly, we will use the convenient R package [**UCSCXenaTools**](https://cran.r-project.org/web/packages/UCSCXenaTools/vignettes/USCSXenaTools.html) to query and download the RNA-seq data of TCGA stomach cancer cohort.
Use the following code to check and install **UCSCXenaTools**.
```{r, eval=FALSE}
if (!requireNamespace("UCSCXenaTools", quietly = TRUE)) {
BiocManager::install("ropensci/UCSCXenaTools")
}
```
**UCSCXenaTools** provides an R interface to access public cancer datasets from [UCSC Xena data hubs](https://xenabrowser.net/datapages/), including multiple pan-cancer studies like TCGA and PCAWG. You can directly access information of all datasets in R.
```{r}
library(UCSCXenaTools)
data(XenaData)
head(XenaData)
# You can use view(XenaData) to find your dataset of interest
```
**UCSCXenaTools** provides [workflow functions](https://cran.r-project.org/web/packages/UCSCXenaTools/vignettes/USCSXenaTools.html#workflow) to generate object, filter, query, download and load the dataset(s) of interest. The following code show a standardized **UCSCXenaTools** data workflow to query the data from UCSC Xena data hub and load it into R.
```{r, eval=FALSE}
library(UCSCXenaTools)
# NOTE: This process may take a few minutes which depends on the internet connection speed. Please wait for its completion.
eset_stad <- XenaGenerate(subset = XenaCohorts == "GDC TCGA Stomach Cancer (STAD)") %>%
XenaFilter(filterDatasets = "TCGA-STAD.star_counts.tsv") %>%
XenaQuery() %>%
XenaDownload() %>%
XenaPrepare()
eset_stad[1:5, 1:3]
```
As the metadata of this dataset have been stored in the `XeneData` data.frame. You can easily recheck the dataset with code.
```{r, eval=FALSE}
dplyr::filter(XenaData, XenaDatasets == "TCGA-STAD.star_counts.tsv") |>
as.list()
```
## Normalization and Gene annotation
Transform gene expression matrix into TPM format, and conduct subsequent annotation.
```{r, eval=FALSE}
# Remove Ensembl IDs with the suffix '_PAR_Y'.
eset_stad <- eset_stad[!grepl("_PAR_Y$", eset_stad$Ensembl_ID), ]
# Remove the version numbers in Ensembl ID.
eset_stad$Ensembl_ID <- substring(eset_stad$Ensembl_ID, 1, 15)
eset_stad <- column_to_rownames(eset_stad, var = "Ensembl_ID")
# Revert back to original format because the data from UCSC was log2(x+1)transformed.
eset_stad <- (2^eset_stad) - 1
eset_stad <- count2tpm(countMat = eset_stad, idType = "Ensembl", org = "hsa", source = "local")
eset_stad[1:5, 1:5]
```
## Identifying outlier samples
Take ACRG microarray data for example
```{r, message = F, warning = F, fig.width= 7, fig.height=7, fig.align='center'}
res <- find_outlier_samples(eset = eset, project = "ACRG", show_plot = TRUE)
```
Removing potential outlier samples
```{r, message = F, }
eset1 <- eset[, !colnames(eset) %in% res]
```
## PCA analysis of molecular subtypes
```{r, message = F, warning = F, fig.width= 7.6, fig.height=7, fig.align='center'}
pdata_acrg <- load_data("pdata_acrg")
res <- iobr_pca(
data = eset1,
is.matrix = TRUE,
scale = TRUE,
is.log = FALSE,
pdata = pdata_acrg,
id_pdata = "ID",
group = "Subtype",
geom.ind = "point",
cols = "normal",
palette = "jama",
repel = FALSE,
ncp = 5,
axes = c(1, 2),
addEllipses = TRUE
)
res
```
## Batch effect correction
### For microarray data
Obtaining another data set from GEO [Gastric cancer: GSE57303](https://www.ncbi.nlm.nih.gov/pubmed/24935174/) using `GEOquery` R package.
```{r,message=FALSE,warning=FALSE}
# NOTE: This process may take a few minutes which depends on the internet connection speed. Please wait for its completion.
eset_geo <- getGEO(GEO = "GSE57303", getGPL = F, destdir = "./")
eset2 <- eset_geo[[1]]
eset2 <- exprs(eset2)
eset2[1:5, 1:5]
```
Annotation of genes in the expression matrix and removal of duplicate genes.
```{r,message=FALSE,warning=FALSE}
eset2 <- anno_eset(
eset = eset2,
annotation = anno_hug133plus2,
symbol = "symbol",
probe = "probe_id",
method = "mean"
)
eset2[1:5, 1:5]
```
```{r, message=FALSE, fig.width= 11, fig.height = 5, fig.align='center'}
eset_com <- remove_batcheffect(
eset1 = eset1,
eset2 = eset2,
eset3 = NULL,
id_type = "symbol",
data_type = "array",
cols = "normal",
palette = "jama",
log2 = TRUE,
check_eset = TRUE,
adjust_eset = TRUE,
repel = FALSE,
path = "result"
)
dim(eset_com)
```
### For RNAseq count data
```{r, message=FALSE, fig.width= 11, fig.height = 5, fig.align='center'}
eset_stad <- load_data("eset_stad")
head(eset_stad)
eset_blca <- load_data("eset_blca")
head(eset_blca)
eset_com <- remove_batcheffect(eset_stad, eset_blca, id_type = "ensembl", data_type = "count")
# The returned matrix is the count matrix after removing the batches.
head(eset_com)
```
## References
Wang et al., (2019). The UCSCXenaTools R package: a toolkit for accessing genomics data from UCSC Xena platform, from cancer multi-omics to single-cell RNA-seq. Journal of Open Source Software, 4(40), 1627, https://doi.org/10.21105/joss.01627
Zhang et al., ComBat-seq: batch effect adjustment for RNA-seq count data, NAR Genomics and Bioinformatics, Volume 2, Issue 3, September 2020, lqaa078, https://doi.org/10.1093/nargab/lqaa078
Leek, J. T., et al., (2012). The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics, 28(6), 882-883.