Contents

1 Introduction

RNAshapeQC provides protocol-specific, RNA coverage-shape-based quality control (QC) metrics. This package operates at the gene-level coverage shape rather than isoform-level quantification. Effective transcript length may vary across samples due to alternative splicing. However, we define gene coordinates using the union of exon regions across annotated transcripts to ensure consistent comparability of coverage profiles across samples.

The package is designed to:

1.1 Overview of the workflow

RNAshapeQC provides two complementary workflows: one for mRNA-seq and one for total RNA-seq.

1.1.1 mRNA-seq

  1. Decay rate (DR)1 Choi, H.Y., Jo, H., Zhao, X. et al. SCISSOR: a framework for identifying structural changes in RNA transcripts. Nat Commun 12, 286 (2021). https://doi.org/10.1038/s41467-020-20593-3: extent of degradation, measured by the mean-corrected slope of log-transformed RNA-seq coverage (gene-by-sample level)
  2. Degradation score (DS): weighted average of detectable degradation signals (sample-level)
  3. Projection depth (PD): robust distance of each sample from the center of the DS distribution (sample-level)
  4. Degraded/Intact index (DII): binary classification of samples as Degraded or Intact (sample-level)

1.1.2 Total RNA-seq

  1. Mean coverage depth (MCD): measure of coverage magnitude (gene-by-sample level)
  2. Window coefficient of variation (wCV): measure of coverage nonuniformity (gene-by-sample level)
  3. Area under the curve (AUC): smoothed relationship between MCD and wCV (sample-level)
  4. Projection depth (PD): robust distance computed using the AUC distribution (sample-level)
  5. Suboptimal/Optimal index (SOI): binary classification of samples as Suboptimal or Optimal (sample-level)

2 Installation

## Installation and loading
if (!requireNamespace("BiocManager", quietly=TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("RNAshapeQC")

library(RNAshapeQC)

3 Data Processing

3.1 Base-level coverage

RNAshapeQC provides the function construct_pileup(), to generate per-gene pileups as a single object from BAMs, optionally across multiple studies. For a single study, construct_pileup() saves pileup, regions, and geneRanges. For multiple studies, it saves pileupList, regions, and geneRanges.

For example, one might run:

construct_pileup(
    Gene          = "Gene001",
    studylist     = "TOY",   
    regionsFile   = "/path/to/hg38.gene.regions.txt",
    regionsFormat = "gencode.regions",
    regionsCol    = 2,
    bamFilesList  = list(TOY=toy_bam_paths),
    caseIDList    = list(TOY=toy_case_ids),
    outFile       = "pergene/TOY_Gene001_pileup.RData"  
)

3.2 Processed matrices

Once pileups are available, gen_DR(), gen_MCD(), and gen_wCV() functions help to generate processed matrices (e.g., DR for mRNA-seq or MCD & wCV for total RNA-seq). In HPC environment, each row vector of the processed matrix is saved by submitting jobs:

sbatch /inst/bash/sbatch_pergene.sh \  # script to submit a higher-level job
  genelist.txt \                       # list of gene symbol (rownames of a processed matrix)
  TOY \                                # study name
  Calculation.R \                      # R code including `gen_*` function
  /inst/bash/submit_pergene.sh         # script to submit per-gene jobs to Slurm

For large cohorts or high-throughput analyses on HPC systems, additional execution considerations are discussed in the following note.

HPC and Slurm execution strategies

Full HPC/Slurm workflows (e.g., per-gene batch submission) are highly dataset- and cluster-specific. Therefore, RNAshapeQC only provides simple example bash scripts under inst/bash/, and users can write their own R wrappers following the same pattern as shown in this vignette.

RNAshapeQC distinguishes between two types of functions for computing QC metrics:

  • get_* functions (e.g., get_DR(), get_MCD(), get_wCV())
    Designed to compute a gene × sample matrix for a given gene list in a single study. They use multicore parallelization within a single R session (via parallel / doParallel). In practice, they are typically run on a single HPC node for an entire gene list. For very large lists, users may split the gene list into chunks, run multiple get_* calls in separate jobs, and combine the results.

  • gen_* functions (e.g., gen_DR(), gen_MCD(), gen_wCV())
    Designed for per-gene jobs, suitable for HPC/Slurm submission. They operate on a single gene at a time and can be applied to:

    • single-study pileup files
    • multi-study pileupList objects (e.g., TCGA-LUAD + TCGA-HNSC)

    Typical usage is one gene per job in a per-gene Slurm array, followed by combining outputs using helper functions (e.g., combine_vecObj()).

4 Data Analysis

Users who do not yet have processed matrices can generate them by following the workflow described in Section 3. This section demonstrates the QC analysis using the toy datasets included in the package.

4.1 Toy datasets

Two primary toy objects are shipped with the package:

  • TOY_mrna_se: synthetic mRNA-seq-like data (DR and TPM are stored as assays, and gene_length is stored in rowData().)
  • TOY_total_se: synthetic total RNA-seq-like data (MCD and wCV are stored as assays.)

RNAshapeQC functions support direct input of SummarizedExperiment objects, which is the recommended Bioconductor container format. Optionally, examples in Section 4.2 and Section 4.3 also demonstrate usage with matrix inputs (TOY_mrna_mat and TOY_total_mat). The resulting SummarizedExperiment objects from RNAshapeQC outputs can be directly integrated into standard Bioconductor workflows as examples in Section 5.

data("TOY_mrna_se")
data("TOY_total_se")

We can quickly inspect the structure as follows:

TOY_mrna_se
#> class: SummarizedExperiment 
#> dim: 100 10 
#> metadata(0):
#> assays(2): DR TPM
#> rownames(100): Gene001 Gene002 ... Gene099 Gene100
#> rowData names(1): gene_length
#> colnames(10): T01 T02 ... T09 T10
#> colData names(0):
TOY_total_se
#> class: SummarizedExperiment 
#> dim: 100 10 
#> metadata(0):
#> assays(2): MCD wCV
#> rownames(100): Gene001 Gene002 ... Gene099 Gene100
#> rowData names(0):
#> colnames(10): A01 A02 ... A09 A10
#> colData names(0):

4.2 mRNA-seq example

For mRNA-seq data, RNAshapeQC uses the PD of DS as a sample-level degradation metric.

We can derive the DII using gene weights via get_DIIwt() with the toy objects:

## Using SE input directly
res_dii_se <- get_DIIwt(TOY_mrna_se)

names(res_dii_se)
#> [1] "DR2"     "ds.vec"  "gene.df" "s"
head(res_dii_se$ds.vec)
#>   Sample            DS         PD    DII
#> 1    T01  1.296897e-03  0.4734540 Intact
#> 2    T02  1.094145e-04 -0.4996940 Intact
#> 3    T03 -4.199701e-04 -1.4649429 Intact
#> 4    T04  8.505004e-04  0.2420750 Intact
#> 5    T05 -6.863086e-05 -0.8243315 Intact
#> 6    T06 -1.823597e-04 -1.0316980 Intact

## (Optional) Using matrix/vector input
data(TOY_mrna_mat)

res_dii_mat <- get_DIIwt(
    DR         = TOY_mrna_mat$DR, 
    TPM        = TOY_mrna_mat$TPM, 
    genelength = TOY_mrna_mat$gene_length
)

## Compare SE and matrix inputs
all.equal(res_dii_se$DR2, res_dii_mat$DR2)
#> [1] TRUE
all.equal(res_dii_se$ds.vec, res_dii_mat$ds.vec)
#> [1] TRUE
all.equal(res_dii_se$gene.df, res_dii_mat$gene.df)
#> [1] TRUE
all.equal(res_dii_se$s, res_dii_mat$s)
#> [1] TRUE

When a SummarizedExperiment object is supplied, get_DIIwt() extracts the required assays and row-level metadata internally. As shown above, results obtained from SummarizedExperiment input and matrix/vector input are identical.

The returned list typically includes:

  • DR2: matrix of decay rates with filtered genes
  • ds.vec: data frame containing
    • Sample: sample ID
    • DS: degradation score
    • PD: projection depth of DS
    • DII: Degraded or Intact label based on a cutoff
  • gene.df: data frame with gene information used to compute gene weights
  • s: scale factor used to adjust gene-length and TPM medians

We can plot the distributions of DS and PD, as well as the heatmap of DR with gene weights:

## Path to a temporary file
outfile <- file.path(tempdir(), "Fig_DIIwt.png")

plot_DIIwt(
    DR        = res_dii_se$DR2,
    DIIresult = res_dii_se,
    outFile   = outfile
)

## Saving DIIwt plot to: /.../Fig_DIIwt.png

The function prints the file path of the saved figure in the console, and we can open the file directly from that location.

4.3 Total RNA-seq example

For total RNA-seq data, RNAshapeQC focuses on the relationship between MCD and wCV.

We can combine these metrics to measure coverage-shape optimality using get_SOI():

## Using SE input directly
## The cutoff value can be chosen by inspecting the PD distribution.
res_soi_se <- get_SOI(TOY_total_se, cutoff=1)
#> `geom_smooth()` using formula = 'y ~ x'

names(res_soi_se)
#> [1] "auc.vec"   "auc.coord" "rangeMCD"
head(res_soi_se$auc.vec)
#> # A tibble: 6 × 4
#>   Sample   AUC      PD SOI    
#>   <chr>  <dbl>   <dbl> <chr>  
#> 1 A01     2.26 -0.347  Optimal
#> 2 A02     2.20 -1.11   Optimal
#> 3 A03     2.31  0.0190 Optimal
#> 4 A04     2.34  0.0438 Optimal
#> 5 A05     2.26 -0.365  Optimal
#> 6 A06     2.16 -1.67   Optimal

## (Optional) Using matrix input
data(TOY_total_mat)

res_soi_mat <- get_SOI(
    MCD    = TOY_total_mat$MCD,
    wCV    = TOY_total_mat$wCV,
    cutoff = 1
)
#> `geom_smooth()` using formula = 'y ~ x'

## Compare SE and matrix inputs
all.equal(res_soi_se$auc.vec, res_soi_mat$auc.vec)
#> [1] TRUE
all.equal(res_soi_se$auc.coord, res_soi_mat$auc.coord)
#> [1] TRUE
all.equal(res_soi_se$rangeMCD, res_soi_mat$rangeMCD)
#> [1] TRUE

The results obtained from SummarizedExperiment input and matrix input are identical, and it confirms consistent behavior of get_SOI() across supported input types.

Here, auc.vec contains:

  • Sample: sample ID
  • AUC: smoothed area under the curve in the MCD-wCV space
  • PD: projection depth of the AUC
  • SOI: Suboptimal or Optimal label based on a cutoff

This provides a simple way to flag suboptimal total RNA-seq samples based on their coverage shape.

We can visualize the smoothed relationship using plot_SOI():

outfile <- file.path(tempdir(), "Fig_SOI.png")

## The cutoff value can be chosen by inspecting the PD distribution.
plot_SOI(
    SOIresult = res_soi_se,
    cutoff    = 1,
    outFile   = outfile 
)

## Saving SOI plot to: /.../Fig_SOI.png

This figure illustrates how wCV behaves as a function of MCD, and which samples are classified as suboptimal versus optimal.

5 Interoperability with Bioconductor

RNAshapeQC outputs can be organized using the SummarizedExperiment class, which ensures consistent handling of assay and associated meta-data during subsetting. In the container, matrices of DR, TPM, MCD, and wCV can be stored as assays(), where rows denote genes and columns represent samples. The gene or sample information tables (annotations) are stored using the rowData() or colData() functions, respectively. metadata() can contain additional QC-related information such as scale factors, MCD ranges, and cutoff values.

This is an example of constructing SummarizedExperiment objects and applying sample-level QCs to filter out low-quality samples before downstream analysis. This coordinated subsetting operation is simultaneously applied to both assays and column data, allowing proper filtering and preventing mismatches between QC metrics and downstream analysis inputs, as shown in se_mrna_hq and se_total_hq.

library(SummarizedExperiment)

## Store QC results in mRNA-seq
se_mrna <- SummarizedExperiment(
    assays   = list(DR=assay(TOY_mrna_se, "DR"), TPM=assay(TOY_mrna_se, "TPM")),
    colData  = DataFrame(res_dii_se$ds.vec),
    metadata = list(s=res_dii_se$s)
)
se_mrna
#> class: SummarizedExperiment 
#> dim: 100 10 
#> metadata(1): s
#> assays(2): DR TPM
#> rownames(100): Gene001 Gene002 ... Gene099 Gene100
#> rowData names(0):
#> colnames(10): T01 T02 ... T09 T10
#> colData names(4): Sample DS PD DII

## Subsetting to keep high-quality samples in mRNA-seq
se_mrna_hq <- se_mrna[, se_mrna$DII == "Intact"]
head(assay(se_mrna_hq, "TPM"))
#>                  T01         T02         T03          T04          T05
#> Gene001     2.955266   35.284653    5.611110    49.458330 2.820761e+01
#> Gene002 10098.621664 5317.070771 3255.004616  4123.260968 1.099778e+04
#> Gene003    34.015654    3.917251    1.105082     5.539151 1.177839e+01
#> Gene004    10.624285    3.459340    4.681335    21.934307 3.737044e+01
#> Gene005     5.136389    3.313921   32.509708    16.837166 9.163171e-01
#> Gene006 40856.586781 2183.009259 6505.488826 16107.565863 5.109681e+03
#>                 T06         T07          T08
#> Gene001   20.553214   60.726834    17.527987
#> Gene002 8715.989132 7249.783041  5149.009510
#> Gene003   11.894382    2.684087    11.855347
#> Gene004   12.187713    3.237398    16.119091
#> Gene005    4.684436    5.913339     9.131831
#> Gene006 6038.519806 6417.175512 17337.680073

This QC filtering can be applied prior to class discovery to reduce clustering instability caused by degraded samples.

library(ConsensusClusterPlus)

## TPM log transform
tpm_hq  <- log2(assay(se_mrna_hq, "TPM") + 1)

## Set a random seed for reproducibility
set.seed(1)

cc <- ConsensusClusterPlus(
    tpm_hq,
    maxK       = 6,
    reps       = 100,
    pItem      = 0.8,
    pFeature   = 0.8,
    clusterAlg = "hc",
    distance   = "pearson",
    plot       = "png",
    title      = tempdir()
)
cc[[3]]$consensusClass
#> T01 T02 T03 T04 T05 T06 T07 T08 
#>   1   2   2   1   1   1   3   1

The protocol-specific QC metrics stored in colData() can be included as covariates in a differential expression analysis (DESeq2) (e.g., design = ~ batch + condition + SOI).

## Store QC results in total RNA-seq
se_total <- SummarizedExperiment(
    assays   = list(MCD=assay(TOY_total_se, "MCD"), wCV=assay(TOY_total_se, "wCV")),
    rowData  = DataFrame(Gene=rownames(assay(TOY_total_se, "MCD"))),
    colData  = DataFrame(res_soi_se$auc.vec),
    metadata = list(auc.coord=res_soi_se$auc.coord, rangeMCD=res_soi_se$rangeMCD)
)
se_total
#> class: SummarizedExperiment 
#> dim: 100 10 
#> metadata(2): auc.coord rangeMCD
#> assays(2): MCD wCV
#> rownames(100): Gene001 Gene002 ... Gene099 Gene100
#> rowData names(1): Gene
#> colnames(10): A01 A02 ... A09 A10
#> colData names(4): Sample AUC PD SOI

## Subsetting to keep high-quality samples in total RNA-seq
se_total_hq <- se_total[, se_total$SOI == "Optimal"]
head(assay(se_total_hq, "wCV"))
#>              A01      A02      A03      A04      A05      A06      A07
#> Gene001 1.902783 2.104655 1.824006 2.216481 2.125495 1.846423 2.356645
#> Gene002 1.890622 1.966493 2.026579 2.266050 1.863042 1.982628 2.102911
#> Gene003 1.971822 2.134141 2.011843 2.081495 1.840883 1.853184 1.830847
#> Gene004 1.888715 2.061299 2.178234 1.996085 2.013771 2.268859 2.051379
#> Gene005 2.131341 1.996294 1.877507 2.489296 1.921696 1.738298 2.123472
#> Gene006 1.908249 2.354457 2.128700 2.108734 1.980232 1.914649 2.178310

6 Summary

This vignette demonstrated a minimal, runnable example of how to:

The examples here are intentionally small and synthetic so that the vignette builds quickly and reproducibly. More detailed workflows, methodological explanations, and real-data case studies are documented separately in the full-length user manual.

7 Session Information

sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS 26.3.1
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: America/Chicago
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] ConsensusClusterPlus_1.68.0 SummarizedExperiment_1.34.0
#>  [3] Biobase_2.64.0              GenomicRanges_1.56.1       
#>  [5] GenomeInfoDb_1.40.1         IRanges_2.38.1             
#>  [7] S4Vectors_0.42.1            BiocGenerics_0.50.0        
#>  [9] MatrixGenerics_1.16.0       matrixStats_1.5.0          
#> [11] RNAshapeQC_0.99.7           knitr_1.48                 
#> [13] BiocStyle_2.32.1           
#> 
#> loaded via a namespace (and not attached):
#>   [1] gld_2.6.6               readxl_1.4.3            rlang_1.1.4            
#>   [4] magrittr_2.0.3          clue_0.3-65             GetoptLong_1.0.5       
#>   [7] e1071_1.7-14            compiler_4.4.1          mgcv_1.9-1             
#>  [10] png_0.1-8               vctrs_0.6.5             stringr_1.5.1          
#>  [13] pkgconfig_2.0.3         shape_1.4.6.1           crayon_1.5.3           
#>  [16] fastmap_1.2.0           backports_1.5.0         magick_2.8.5           
#>  [19] XVector_0.44.0          labeling_0.4.3          utf8_1.2.4             
#>  [22] PupillometryR_0.0.5     rmarkdown_2.28          UCSC.utils_1.0.0       
#>  [25] purrr_1.0.2             xfun_0.50               zlibbioc_1.50.0        
#>  [28] cachem_1.1.0            jsonlite_1.8.8          highr_0.11             
#>  [31] DelayedArray_0.30.1     broom_1.0.6             parallel_4.4.1         
#>  [34] cluster_2.1.6           DescTools_0.99.57       R6_2.5.1               
#>  [37] bslib_0.8.0             stringi_1.8.4           RColorBrewer_1.1-3     
#>  [40] boot_1.3-31             car_3.1-3               cellranger_1.1.0       
#>  [43] jquerylib_0.1.4         Rcpp_1.0.13             bookdown_0.46          
#>  [46] iterators_1.0.14        zoo_1.8-12              Matrix_1.7-0           
#>  [49] splines_4.4.1           tidyselect_1.2.1        rstudioapi_0.16.0      
#>  [52] abind_1.4-8             yaml_2.3.10             doParallel_1.0.17      
#>  [55] codetools_0.2-20        lattice_0.22-6          tibble_3.2.1           
#>  [58] withr_3.0.1             evaluate_0.24.0         proxy_0.4-27           
#>  [61] circlize_0.4.16         pillar_1.9.0            BiocManager_1.30.25    
#>  [64] ggpubr_0.6.0.999        carData_3.0-5           foreach_1.5.2          
#>  [67] generics_0.1.3          ggplot2_3.5.1           rootSolve_1.8.2.4      
#>  [70] munsell_0.5.1           scales_1.3.0            class_7.3-22           
#>  [73] glue_1.7.0              lmom_3.2                tools_4.4.1            
#>  [76] data.table_1.16.0       ggsignif_0.6.4          Exact_3.3              
#>  [79] mvtnorm_1.3-1           cowplot_1.1.3           grid_4.4.1             
#>  [82] tidyr_1.3.1             colorspace_2.1-1        nlme_3.1-166           
#>  [85] GenomeInfoDbData_1.2.12 patchwork_1.3.0         Formula_1.2-5          
#>  [88] cli_3.6.3               expm_1.0-0              fansi_1.0.6            
#>  [91] S4Arrays_1.4.1          ComplexHeatmap_2.20.0   dplyr_1.1.4            
#>  [94] gtable_0.3.5            rstatix_0.7.2           sass_0.4.9             
#>  [97] digest_0.6.37           SparseArray_1.4.8       rjson_0.2.22           
#> [100] farver_2.1.2            htmltools_0.5.8.1       lifecycle_1.0.4        
#> [103] httr_1.4.7              GlobalOptions_0.1.2     MASS_7.3-61