RNAshapeQC provides protocol-specific, RNA coverage-shape-based quality control (QC) metrics. This package operates at the gene-level coverage shape rather than isoform-level quantification. Effective transcript length may vary across samples due to alternative splicing. However, we define gene coordinates using the union of exon regions across annotated transcripts to ensure consistent comparability of coverage profiles across samples.
The package is designed to:
RNAshapeQC provides two complementary workflows: one for mRNA-seq and one for total RNA-seq.
mRNA-seq:
BAM → (BAM slicing) → pileup → DR → DS → PD → DII → downstream applications
Total RNA-seq:
BAM → (BAM slicing) → pileup → MCD & wCV → AUC → PD → SOI → downstream applications
In this vignette, we illustrate a minimal, fully runnable workflow using small synthetic toy datasets included in the package. These toy objects mimic typical patterns of RNA degradation and coverage nonuniformity but do not contain any real cohort data.
For detailed method descriptions and extended examples, please refer to the external user manual (RNAshapeQCdocs).
## Installation and loading
if (!requireNamespace("BiocManager", quietly=TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("RNAshapeQC")
library(RNAshapeQC)
RNAshapeQC provides the function construct_pileup(), to generate per-gene pileups as a single object from BAMs, optionally across multiple studies. For a single study, construct_pileup() saves pileup, regions, and geneRanges. For multiple studies, it saves pileupList, regions, and geneRanges.
For example, one might run:
construct_pileup(
Gene = "Gene001",
studylist = "TOY",
regionsFile = "/path/to/hg38.gene.regions.txt",
regionsFormat = "gencode.regions",
regionsCol = 2,
bamFilesList = list(TOY=toy_bam_paths),
caseIDList = list(TOY=toy_case_ids),
outFile = "pergene/TOY_Gene001_pileup.RData"
)
Once pileups are available, gen_DR(), gen_MCD(), and gen_wCV() functions help to generate processed matrices (e.g., DR for mRNA-seq or MCD & wCV for total RNA-seq). In HPC environment, each row vector of the processed matrix is saved by submitting jobs:
sbatch /inst/bash/sbatch_pergene.sh \ # script to submit a higher-level job
genelist.txt \ # list of gene symbol (rownames of a processed matrix)
TOY \ # study name
Calculation.R \ # R code including `gen_*` function
/inst/bash/submit_pergene.sh # script to submit per-gene jobs to Slurm
For large cohorts or high-throughput analyses on HPC systems, additional execution considerations are discussed in the following note.
Full HPC/Slurm workflows (e.g., per-gene batch submission) are highly dataset- and cluster-specific. Therefore, RNAshapeQC only provides simple example bash scripts under inst/bash/, and users can write their own R wrappers following the same pattern as shown in this vignette.
RNAshapeQC distinguishes between two types of functions for computing QC metrics:
get_* functions (e.g., get_DR(), get_MCD(), get_wCV())
Designed to compute a gene × sample matrix for a given gene list in a single study. They use multicore parallelization within a single R session (via parallel / doParallel). In practice, they are typically run on a single HPC node for an entire gene list. For very large lists, users may split the gene list into chunks, run multiple get_* calls in separate jobs, and combine the results.
gen_* functions (e.g., gen_DR(), gen_MCD(), gen_wCV())
Designed for per-gene jobs, suitable for HPC/Slurm submission.
They operate on a single gene at a time and can be applied to:
pileupList objects (e.g., TCGA-LUAD + TCGA-HNSC)Typical usage is one gene per job in a per-gene Slurm array, followed by combining outputs using helper functions (e.g., combine_vecObj()).
Users who do not yet have processed matrices can generate them by following the workflow described in Section 3. This section demonstrates the QC analysis using the toy datasets included in the package.
Two primary toy objects are shipped with the package:
TOY_mrna_se: synthetic mRNA-seq-like data (DR and TPM are stored as assays, and gene_length is stored in rowData().)TOY_total_se: synthetic total RNA-seq-like data (MCD and wCV are stored as assays.)RNAshapeQC functions support direct input of SummarizedExperiment objects, which is the recommended Bioconductor container format. Optionally, examples in Section 4.2 and Section 4.3 also demonstrate usage with matrix inputs (TOY_mrna_mat and TOY_total_mat). The resulting SummarizedExperiment objects from RNAshapeQC outputs can be directly integrated into standard Bioconductor workflows as examples in Section 5.
data("TOY_mrna_se")
data("TOY_total_se")
We can quickly inspect the structure as follows:
TOY_mrna_se
#> class: SummarizedExperiment
#> dim: 100 10
#> metadata(0):
#> assays(2): DR TPM
#> rownames(100): Gene001 Gene002 ... Gene099 Gene100
#> rowData names(1): gene_length
#> colnames(10): T01 T02 ... T09 T10
#> colData names(0):
TOY_total_se
#> class: SummarizedExperiment
#> dim: 100 10
#> metadata(0):
#> assays(2): MCD wCV
#> rownames(100): Gene001 Gene002 ... Gene099 Gene100
#> rowData names(0):
#> colnames(10): A01 A02 ... A09 A10
#> colData names(0):
For mRNA-seq data, RNAshapeQC uses the PD of DS as a sample-level degradation metric.
We can derive the DII using gene weights via get_DIIwt() with the toy objects:
## Using SE input directly
res_dii_se <- get_DIIwt(TOY_mrna_se)
names(res_dii_se)
#> [1] "DR2" "ds.vec" "gene.df" "s"
head(res_dii_se$ds.vec)
#> Sample DS PD DII
#> 1 T01 1.296897e-03 0.4734540 Intact
#> 2 T02 1.094145e-04 -0.4996940 Intact
#> 3 T03 -4.199701e-04 -1.4649429 Intact
#> 4 T04 8.505004e-04 0.2420750 Intact
#> 5 T05 -6.863086e-05 -0.8243315 Intact
#> 6 T06 -1.823597e-04 -1.0316980 Intact
## (Optional) Using matrix/vector input
data(TOY_mrna_mat)
res_dii_mat <- get_DIIwt(
DR = TOY_mrna_mat$DR,
TPM = TOY_mrna_mat$TPM,
genelength = TOY_mrna_mat$gene_length
)
## Compare SE and matrix inputs
all.equal(res_dii_se$DR2, res_dii_mat$DR2)
#> [1] TRUE
all.equal(res_dii_se$ds.vec, res_dii_mat$ds.vec)
#> [1] TRUE
all.equal(res_dii_se$gene.df, res_dii_mat$gene.df)
#> [1] TRUE
all.equal(res_dii_se$s, res_dii_mat$s)
#> [1] TRUE
When a SummarizedExperiment object is supplied, get_DIIwt() extracts the required assays and row-level metadata internally. As shown above, results obtained from SummarizedExperiment input and matrix/vector input are identical.
The returned list typically includes:
DR2: matrix of decay rates with filtered genesds.vec: data frame containing
Sample: sample IDDS: degradation scorePD: projection depth of DSDII: Degraded or Intact label based on a cutoffgene.df: data frame with gene information used to compute gene weightss: scale factor used to adjust gene-length and TPM mediansWe can plot the distributions of DS and PD, as well as the heatmap of DR with gene weights:
## Path to a temporary file
outfile <- file.path(tempdir(), "Fig_DIIwt.png")
plot_DIIwt(
DR = res_dii_se$DR2,
DIIresult = res_dii_se,
outFile = outfile
)
## Saving DIIwt plot to: /.../Fig_DIIwt.png
The function prints the file path of the saved figure in the console, and we can open the file directly from that location.
For total RNA-seq data, RNAshapeQC focuses on the relationship between MCD and wCV.
We can combine these metrics to measure coverage-shape optimality using get_SOI():
## Using SE input directly
## The cutoff value can be chosen by inspecting the PD distribution.
res_soi_se <- get_SOI(TOY_total_se, cutoff=1)
#> `geom_smooth()` using formula = 'y ~ x'
names(res_soi_se)
#> [1] "auc.vec" "auc.coord" "rangeMCD"
head(res_soi_se$auc.vec)
#> # A tibble: 6 × 4
#> Sample AUC PD SOI
#> <chr> <dbl> <dbl> <chr>
#> 1 A01 2.26 -0.347 Optimal
#> 2 A02 2.20 -1.11 Optimal
#> 3 A03 2.31 0.0190 Optimal
#> 4 A04 2.34 0.0438 Optimal
#> 5 A05 2.26 -0.365 Optimal
#> 6 A06 2.16 -1.67 Optimal
## (Optional) Using matrix input
data(TOY_total_mat)
res_soi_mat <- get_SOI(
MCD = TOY_total_mat$MCD,
wCV = TOY_total_mat$wCV,
cutoff = 1
)
#> `geom_smooth()` using formula = 'y ~ x'
## Compare SE and matrix inputs
all.equal(res_soi_se$auc.vec, res_soi_mat$auc.vec)
#> [1] TRUE
all.equal(res_soi_se$auc.coord, res_soi_mat$auc.coord)
#> [1] TRUE
all.equal(res_soi_se$rangeMCD, res_soi_mat$rangeMCD)
#> [1] TRUE
The results obtained from SummarizedExperiment input and matrix input are identical, and it confirms consistent behavior of get_SOI() across supported input types.
Here, auc.vec contains:
Sample: sample IDAUC: smoothed area under the curve in the MCD-wCV spacePD: projection depth of the AUCSOI: Suboptimal or Optimal label based on a cutoffThis provides a simple way to flag suboptimal total RNA-seq samples based on their coverage shape.
We can visualize the smoothed relationship using plot_SOI():
outfile <- file.path(tempdir(), "Fig_SOI.png")
## The cutoff value can be chosen by inspecting the PD distribution.
plot_SOI(
SOIresult = res_soi_se,
cutoff = 1,
outFile = outfile
)
## Saving SOI plot to: /.../Fig_SOI.png
This figure illustrates how wCV behaves as a function of MCD, and which samples are classified as suboptimal versus optimal.
RNAshapeQC outputs can be organized using the SummarizedExperiment class, which ensures consistent handling of assay and associated meta-data during subsetting. In the container, matrices of DR, TPM, MCD, and wCV can be stored as assays(), where rows denote genes and columns represent samples. The gene or sample information tables (annotations) are stored using the rowData() or colData() functions, respectively. metadata() can contain additional QC-related information such as scale factors, MCD ranges, and cutoff values.
This is an example of constructing SummarizedExperiment objects and applying sample-level QCs to filter out low-quality samples before downstream analysis. This coordinated subsetting operation is simultaneously applied to both assays and column data, allowing proper filtering and preventing mismatches between QC metrics and downstream analysis inputs, as shown in se_mrna_hq and se_total_hq.
library(SummarizedExperiment)
## Store QC results in mRNA-seq
se_mrna <- SummarizedExperiment(
assays = list(DR=assay(TOY_mrna_se, "DR"), TPM=assay(TOY_mrna_se, "TPM")),
colData = DataFrame(res_dii_se$ds.vec),
metadata = list(s=res_dii_se$s)
)
se_mrna
#> class: SummarizedExperiment
#> dim: 100 10
#> metadata(1): s
#> assays(2): DR TPM
#> rownames(100): Gene001 Gene002 ... Gene099 Gene100
#> rowData names(0):
#> colnames(10): T01 T02 ... T09 T10
#> colData names(4): Sample DS PD DII
## Subsetting to keep high-quality samples in mRNA-seq
se_mrna_hq <- se_mrna[, se_mrna$DII == "Intact"]
head(assay(se_mrna_hq, "TPM"))
#> T01 T02 T03 T04 T05
#> Gene001 2.955266 35.284653 5.611110 49.458330 2.820761e+01
#> Gene002 10098.621664 5317.070771 3255.004616 4123.260968 1.099778e+04
#> Gene003 34.015654 3.917251 1.105082 5.539151 1.177839e+01
#> Gene004 10.624285 3.459340 4.681335 21.934307 3.737044e+01
#> Gene005 5.136389 3.313921 32.509708 16.837166 9.163171e-01
#> Gene006 40856.586781 2183.009259 6505.488826 16107.565863 5.109681e+03
#> T06 T07 T08
#> Gene001 20.553214 60.726834 17.527987
#> Gene002 8715.989132 7249.783041 5149.009510
#> Gene003 11.894382 2.684087 11.855347
#> Gene004 12.187713 3.237398 16.119091
#> Gene005 4.684436 5.913339 9.131831
#> Gene006 6038.519806 6417.175512 17337.680073
This QC filtering can be applied prior to class discovery to reduce clustering instability caused by degraded samples.
library(ConsensusClusterPlus)
## TPM log transform
tpm_hq <- log2(assay(se_mrna_hq, "TPM") + 1)
## Set a random seed for reproducibility
set.seed(1)
cc <- ConsensusClusterPlus(
tpm_hq,
maxK = 6,
reps = 100,
pItem = 0.8,
pFeature = 0.8,
clusterAlg = "hc",
distance = "pearson",
plot = "png",
title = tempdir()
)
cc[[3]]$consensusClass
#> T01 T02 T03 T04 T05 T06 T07 T08
#> 1 2 2 1 1 1 3 1
The protocol-specific QC metrics stored in colData() can be included as covariates in a differential expression analysis (DESeq2) (e.g., design = ~ batch + condition + SOI).
## Store QC results in total RNA-seq
se_total <- SummarizedExperiment(
assays = list(MCD=assay(TOY_total_se, "MCD"), wCV=assay(TOY_total_se, "wCV")),
rowData = DataFrame(Gene=rownames(assay(TOY_total_se, "MCD"))),
colData = DataFrame(res_soi_se$auc.vec),
metadata = list(auc.coord=res_soi_se$auc.coord, rangeMCD=res_soi_se$rangeMCD)
)
se_total
#> class: SummarizedExperiment
#> dim: 100 10
#> metadata(2): auc.coord rangeMCD
#> assays(2): MCD wCV
#> rownames(100): Gene001 Gene002 ... Gene099 Gene100
#> rowData names(1): Gene
#> colnames(10): A01 A02 ... A09 A10
#> colData names(4): Sample AUC PD SOI
## Subsetting to keep high-quality samples in total RNA-seq
se_total_hq <- se_total[, se_total$SOI == "Optimal"]
head(assay(se_total_hq, "wCV"))
#> A01 A02 A03 A04 A05 A06 A07
#> Gene001 1.902783 2.104655 1.824006 2.216481 2.125495 1.846423 2.356645
#> Gene002 1.890622 1.966493 2.026579 2.266050 1.863042 1.982628 2.102911
#> Gene003 1.971822 2.134141 2.011843 2.081495 1.840883 1.853184 1.830847
#> Gene004 1.888715 2.061299 2.178234 1.996085 2.013771 2.268859 2.051379
#> Gene005 2.131341 1.996294 1.877507 2.489296 1.921696 1.738298 2.123472
#> Gene006 1.908249 2.354457 2.128700 2.108734 1.980232 1.914649 2.178310
This vignette demonstrated a minimal, runnable example of how to:
construct_pileup() fits into a real workflow,SummarizedExperiment datasets (TOY_mrna_se and TOY_total_se),get_DIIwt() to derive a degraded/intact index from DR,get_SOI() to derive a suboptimal/optimal index from MCD and wCV.The examples here are intentionally small and synthetic so that the vignette builds quickly and reproducibly. More detailed workflows, methodological explanations, and real-data case studies are documented separately in the full-length user manual.
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS 26.3.1
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: America/Chicago
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] ConsensusClusterPlus_1.68.0 SummarizedExperiment_1.34.0
#> [3] Biobase_2.64.0 GenomicRanges_1.56.1
#> [5] GenomeInfoDb_1.40.1 IRanges_2.38.1
#> [7] S4Vectors_0.42.1 BiocGenerics_0.50.0
#> [9] MatrixGenerics_1.16.0 matrixStats_1.5.0
#> [11] RNAshapeQC_0.99.7 knitr_1.48
#> [13] BiocStyle_2.32.1
#>
#> loaded via a namespace (and not attached):
#> [1] gld_2.6.6 readxl_1.4.3 rlang_1.1.4
#> [4] magrittr_2.0.3 clue_0.3-65 GetoptLong_1.0.5
#> [7] e1071_1.7-14 compiler_4.4.1 mgcv_1.9-1
#> [10] png_0.1-8 vctrs_0.6.5 stringr_1.5.1
#> [13] pkgconfig_2.0.3 shape_1.4.6.1 crayon_1.5.3
#> [16] fastmap_1.2.0 backports_1.5.0 magick_2.8.5
#> [19] XVector_0.44.0 labeling_0.4.3 utf8_1.2.4
#> [22] PupillometryR_0.0.5 rmarkdown_2.28 UCSC.utils_1.0.0
#> [25] purrr_1.0.2 xfun_0.50 zlibbioc_1.50.0
#> [28] cachem_1.1.0 jsonlite_1.8.8 highr_0.11
#> [31] DelayedArray_0.30.1 broom_1.0.6 parallel_4.4.1
#> [34] cluster_2.1.6 DescTools_0.99.57 R6_2.5.1
#> [37] bslib_0.8.0 stringi_1.8.4 RColorBrewer_1.1-3
#> [40] boot_1.3-31 car_3.1-3 cellranger_1.1.0
#> [43] jquerylib_0.1.4 Rcpp_1.0.13 bookdown_0.46
#> [46] iterators_1.0.14 zoo_1.8-12 Matrix_1.7-0
#> [49] splines_4.4.1 tidyselect_1.2.1 rstudioapi_0.16.0
#> [52] abind_1.4-8 yaml_2.3.10 doParallel_1.0.17
#> [55] codetools_0.2-20 lattice_0.22-6 tibble_3.2.1
#> [58] withr_3.0.1 evaluate_0.24.0 proxy_0.4-27
#> [61] circlize_0.4.16 pillar_1.9.0 BiocManager_1.30.25
#> [64] ggpubr_0.6.0.999 carData_3.0-5 foreach_1.5.2
#> [67] generics_0.1.3 ggplot2_3.5.1 rootSolve_1.8.2.4
#> [70] munsell_0.5.1 scales_1.3.0 class_7.3-22
#> [73] glue_1.7.0 lmom_3.2 tools_4.4.1
#> [76] data.table_1.16.0 ggsignif_0.6.4 Exact_3.3
#> [79] mvtnorm_1.3-1 cowplot_1.1.3 grid_4.4.1
#> [82] tidyr_1.3.1 colorspace_2.1-1 nlme_3.1-166
#> [85] GenomeInfoDbData_1.2.12 patchwork_1.3.0 Formula_1.2-5
#> [88] cli_3.6.3 expm_1.0-0 fansi_1.0.6
#> [91] S4Arrays_1.4.1 ComplexHeatmap_2.20.0 dplyr_1.1.4
#> [94] gtable_0.3.5 rstatix_0.7.2 sass_0.4.9
#> [97] digest_0.6.37 SparseArray_1.4.8 rjson_0.2.22
#> [100] farver_2.1.2 htmltools_0.5.8.1 lifecycle_1.0.4
#> [103] httr_1.4.7 GlobalOptions_0.1.2 MASS_7.3-61