
1 Overview

Kim et al. (2025) [1] introduced a family of centroid-based predictors of mutant (mt) versus wild-type (WT) status using the Classification to the Nearest Centroids (ClaNC) method. EGFRmSig demonstrates this approach by calculating an EGFR mutation signature (mSig) score, covering data preprocessing and distance-based prediction. For detailed background on the methods, please see the reference paper.

[1] Kim M, Lamlertthon W, Jo H, et al. Lung Adenocarcinoma Just Desserts: An Expanding Pie of Activating Oncogenes or a Layer Cake of Integrated Alterations. bioRxiv 2025. https://doi.org/10.1101/2025.09.19.677365

2 Datasets

First, let’s load EGFRmSig and check the structure of the two datasets provided in the package.

library(EGFRmSig)

2.1 All_subtypes_centroids

A 1,020 x 10 matrix of predefined centroids: rows are genes and columns are mean values. Each centroid is the mean expression, in z-score space, of samples with a given mutation status, computed in three ways: subtype-adjusted, subtype-unadjusted, and subtype-specific.

  • Notation

    • \(g\): gene
    • \(i\): sample
    • \(x_{g,i}\): expression for gene \(g\) and sample \(i\)
    • \(s \in \mathscr{S} = \{bronch, mag, sqm\}\): LUAD expression subtype (bronch, bronchioid; mag, magnoid; sqm, squamoid)
    • \(y \in \{WT, mt\}\): EGFR mutation status (WT, wild-type; mt, mutant)
    • \(n_y = |\mathscr{I}_{y}|,\) where \(\mathscr{I}_{y}\) denotes all samples with mutation status \(y\)
    • \(n_{s,y} = |\mathscr{I}_{s,y}|,\) where \(\mathscr{I}_{s,y}\) denotes samples with subtype \(s\) and mutation status \(y\)
  • mean_subtype.i_*: subtype-adjusted centroids (equal weights) \[ \mu^{\mathrm{subtype.i}}_{g,y} = \frac{1}{|\mathscr{S}|} \sum_{s \in \mathscr{S}} \left( \frac{1}{n_{s,y}} \sum_{i \in \mathscr{I}_{s,y}} x_{g,i} \right) = \sum_{s \in \mathscr{S}} \frac{1}{|\mathscr{S}|} \mu_{g,y}^s \]

  • mean_all_*: subtype-unadjusted centroids (subtype-weighted) \[ \mu^{\mathrm{all}}_{g,y} = \frac{1}{n_y} \sum_{i \in \mathscr{I}_y} x_{g,i} = \sum_{s \in \mathscr{S}} \frac{n_{s,y}}{n_y} \mu_{g,y}^s \]

  • mean_{subtype}_*: subtype-specific centroids \[ \mu^{s}_{g,y} = \frac{1}{n_{s,y}} \sum_{i \in \mathscr{I}_{s,y}} x_{g,i} \]
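
The three definitions above can be illustrated with a minimal sketch in base R. The matrix `x`, the `subtype` and `status` vectors, and the helper functions `mu_specific()`, `mu_all()`, and `mu_adjusted()` are hypothetical names and simulated values introduced for illustration; they are not part of EGFRmSig and do not reproduce how the packaged centroids were built.

```r
# Toy z-scored expression: 3 genes x 12 samples (simulated values, not package data)
set.seed(1)
x <- matrix(rnorm(3 * 12), nrow = 3,
            dimnames = list(paste0("gene", 1:3), paste0("S", 1:12)))

# Hypothetical sample annotations with unbalanced subtype sizes
subtype <- c(rep("bronch", 6), rep("mag", 4), rep("sqm", 2))
status  <- c("WT", "WT", "WT", "WT", "mt", "mt",   # bronch
             "WT", "WT", "mt", "mt",               # mag
             "WT", "mt")                           # sqm
subtypes <- unique(subtype)

# Subtype-specific centroid: mean over samples with subtype s and status y
mu_specific <- function(s, y) {
  rowMeans(x[, subtype == s & status == y, drop = FALSE])
}

# Subtype-unadjusted centroid: plain mean over all samples with status y
mu_all <- function(y) rowMeans(x[, status == y, drop = FALSE])

# Subtype-adjusted centroid: equally weighted mean of the subtype-specific centroids
mu_adjusted <- function(y) rowMeans(sapply(subtypes, mu_specific, y = y))

# Check the identity: the unadjusted centroid equals the n_{s,y} / n_y
# weighted sum of the subtype-specific centroids
w <- sapply(subtypes, function(s) sum(subtype == s & status == "WT")) /
  sum(status == "WT")
weighted <- as.vector(sapply(subtypes, mu_specific, y = "WT") %*% w)
all.equal(unname(mu_all("WT")), weighted)
```

Because the subtype sizes are unbalanced in this toy example, `mu_all()` and `mu_adjusted()` give different values: the former is dominated by the largest subtype, while the latter gives every subtype the same weight.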

data("All_subtypes_centroids")
str(All_subtypes_centroids)
#>  num [1:1020, 1:10] -0.00641 -0.09077 -0.09828 0.02295 -0.08542 ...
#>  - attr(*, "dimnames")=List of 2
#>   ..$ : chr [1:1020] "HIP1" "PIGQ" "C16orf58" "UPK3A" ...
#>   ..$ : chr [1:10] "mean_subtype.i_wt" "mean_subtype.i_mt" "mean_all_wt" "mean_all_mt" ...

2.2 TOY_expr

An example expression matrix derived from RSEM TPM values, where rows and columns correspond to 1,000 gene symbols and 50 sample IDs, respectively. The toy dataset was generated to illustrate the analysis workflow and does not contain identifiable or controlled-access data.

data("TOY_expr")
str(TOY_expr)
#>  num [1:1000, 1:50] 199.6007 1097.2179 3654.4478 21.9136 -0.0353 ...
#>  - attr(*, "dimnames")=List of 2
#>   ..$ : chr [1:1000] "RECK" "THOC3" "SCRN1" "TMEM151A" ...
#>   ..$ : chr [1:50] "T01" "T02" "T03" "T04" ...

3 Demo

The compute_mSig() function computes the predicted EGFR mSig score from an RSEM TPM matrix and predefined LUAD centroids. The expr argument is required; the centroids argument defaults to the All_subtypes_centroids data in EGFRmSig. The function reports the number of genes shared between the two inputs and, if export=TRUE, the location of the saved file.
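
As a rough illustration of what a gene-overlap count means, the sketch below intersects the row names of two small matrices. It assumes the overlap is determined by matching gene symbols in the row names; the toy matrices and the names `common`, `expr_sub`, and `centroids_sub` are hypothetical, and the internals of compute_mSig() may differ.

```r
# Hypothetical toy inputs (not the package data)
expr <- matrix(1:12, nrow = 4,
               dimnames = list(c("EGFR", "KRAS", "TP53", "STK11"),
                               c("T01", "T02", "T03")))
centroids <- matrix(0, nrow = 3, ncol = 2,
                    dimnames = list(c("TP53", "EGFR", "NKX2-1"),
                                    c("mean_all_wt", "mean_all_mt")))

# Genes present in both matrices, in the order of the centroid matrix
common <- intersect(rownames(centroids), rownames(expr))
n_genes_used <- length(common)

# Subset both matrices to the shared genes so that their rows line up
expr_sub      <- expr[common, , drop = FALSE]
centroids_sub <- centroids[common, , drop = FALSE]
```

Aligning both matrices on the same gene order before any distance calculation avoids silently comparing mismatched rows.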

3.1 Preprocessing

Expression values are log-transformed and then scaled gene-wise. The meanCol argument selects the centroid columns to use, with two options: "subtype.i" (subtype-adjusted centroids) or "all" (subtype-unadjusted centroids). The default is "all".
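
A minimal sketch of this kind of preprocessing is shown below. It assumes a log2 transform with a pseudocount of 1 and a z-score across samples for each gene; the exact transform used inside compute_mSig() is not stated here, so both choices are assumptions. The mapping from meanCol to column names is likewise inferred from the column names of All_subtypes_centroids.

```r
# Simulated TPM-like matrix: 4 genes x 6 samples (not package data)
set.seed(42)
tpm <- matrix(rexp(24, rate = 1 / 100), nrow = 4,
              dimnames = list(paste0("gene", 1:4), paste0("T0", 1:6)))

# Assumed transform: log2 with a pseudocount of 1
log_expr <- log2(tpm + 1)

# Gene-wise z-score: scale() works column-wise, so transpose, scale, transpose back
z <- t(scale(t(log_expr)))

round(rowMeans(z), 10)   # mean of each gene across samples
apply(z, 1, sd)          # standard deviation of each gene across samples

# Assumed mapping from the meanCol option to centroid column names
meanCol <- "all"
centroid_cols <- paste0("mean_", meanCol, "_", c("wt", "mt"))
```

After scaling, every gene contributes on a comparable scale to a distance calculation, regardless of its absolute expression level.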

3.2 Distance-based prediction

Each sample is compared to the WT centroid and the mt centroid using Euclidean distance and is assigned to the nearer of the two. The output is a data frame with one row per sample and three columns:

  • SampleID: sample IDs
  • EGFR_mSig_class: predicted class, either EGFR_mSig_WT_like (pred.ind=1) or EGFR_mSig_mut_like (pred.ind=2)
  • EGFR_mSig_distance: predicted EGFR mSig distance (pred.score)

compute_mSig(expr=TOY_expr)
# Overlapping genes (n_genes_used): 43
# Output saved to: ./Pred_mSig_yyyy-mm-dd.txt
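
The nearest-centroid assignment described above can be sketched in base R as follows. The matrix `z`, the two-column `centroids` matrix, and the output columns `dist_wt` and `dist_mt` are hypothetical and chosen for illustration; the sketch does not attempt to reproduce how the package derives pred.score.

```r
# Hypothetical z-scored expression: 3 genes x 4 samples
z <- matrix(c(-1.0, -0.8, -1.2,
               1.1,  0.9,  1.0,
              -0.2,  0.1, -0.1,
               0.9,  1.2,  0.7),
            nrow = 3,
            dimnames = list(paste0("gene", 1:3), paste0("T0", 1:4)))

# Hypothetical WT and mt centroids for the same genes
centroids <- cbind(wt = c(-1, -1, -1), mt = c(1, 1, 1))
rownames(centroids) <- rownames(z)

# Euclidean distance from every sample (column of z) to every centroid
dists <- sapply(colnames(centroids), function(k) {
  sqrt(colSums((z - centroids[, k])^2))
})

# Index of the nearest centroid: 1 = WT-like, 2 = mutant-like
pred_ind <- apply(dists, 1, which.min)

res_toy <- data.frame(
  SampleID        = rownames(dists),
  EGFR_mSig_class = c("EGFR_mSig_WT_like", "EGFR_mSig_mut_like")[pred_ind],
  dist_wt         = dists[, "wt"],
  dist_mt         = dists[, "mt"],
  row.names       = NULL
)
res_toy
```

Keeping both distances in the toy output makes it easy to see how close a call was: a sample such as T03, which lies between the two centroids, has similar values in both columns.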

4 Real Data Analysis

Suppose that we have TCGA_expr, a 20,501 x 576 (genes x samples) matrix from TCGA-LUAD. With export=FALSE, compute_mSig() returns the results without writing a file; here they are stored in the res object. The table below summarizes the EGFR mSig distance of the TCGA-LUAD samples by predicted class.

# TCGA expression matrix is assumed to be preloaded
# (not included due to data sharing restrictions)
res <- compute_mSig(expr=TCGA_expr, export=FALSE)

library(dplyr)
res %>%
  group_by(EGFR_mSig_class) %>%
  summarise(
    n      = n(),
    min    = min(EGFR_mSig_distance, na.rm=TRUE),
    q1     = quantile(EGFR_mSig_distance, 0.25, na.rm=TRUE),
    median = median(EGFR_mSig_distance, na.rm=TRUE),
    mean   = mean(EGFR_mSig_distance, na.rm=TRUE),
    q3     = quantile(EGFR_mSig_distance, 0.75, na.rm=TRUE),
    max    = max(EGFR_mSig_distance, na.rm=TRUE)
  )
# # A tibble: 2 × 8
#   EGFR_mSig_class        n   min    q1 median  mean    q3   max
#   <chr>              <int> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>
# 1 EGFR_mSig_WT_like    455  297.  649.   800.  949. 1125. 3951.
# 2 EGFR_mSig_mut_like   121  254.  572.   679.  759.  860. 2014.

The plot_mSig() function takes the res object returned by compute_mSig() and plots the distribution of EGFR mSig distance by class as violin plots with boxplots. In the figure, EGFR mSig WT-like samples show a higher median (800 vs. 679), a larger IQR (476 vs. 288), and a long right tail, suggesting greater transcriptomic heterogeneity.

library(ggplot2)
plot_mSig(res=res)
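
For readers who want to customize the figure, a comparable violin-plus-boxplot display can be sketched directly with ggplot2. This is not the implementation of plot_mSig(); the data frame `res_sim` below is simulated with the same three column names as the compute_mSig() output, and its values are random rather than TCGA results.

```r
library(ggplot2)

# Simulated stand-in for the compute_mSig() output (random values, not TCGA results)
set.seed(7)
res_sim <- data.frame(
  SampleID           = sprintf("S%03d", 1:150),
  EGFR_mSig_class    = rep(c("EGFR_mSig_WT_like", "EGFR_mSig_mut_like"),
                           times = c(100, 50)),
  EGFR_mSig_distance = c(rlnorm(100, meanlog = 6.7, sdlog = 0.5),
                         rlnorm(50,  meanlog = 6.5, sdlog = 0.4))
)

# Violin for the overall shape, narrow boxplot for median and quartiles
p <- ggplot(res_sim, aes(x = EGFR_mSig_class, y = EGFR_mSig_distance,
                         fill = EGFR_mSig_class)) +
  geom_violin(trim = FALSE, alpha = 0.5) +
  geom_boxplot(width = 0.15, outlier.shape = NA) +
  labs(x = NULL, y = "EGFR mSig distance") +
  theme_bw() +
  theme(legend.position = "none")
p
```

Overlaying a narrow boxplot on the violin shows the quartiles and the tail shape in one panel, which is the comparison discussed above.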

5 Session Information

sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS 26.2
#> 
#> Matrix products: default
#> BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: America/Chicago
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] EGFRmSig_0.1.0   BiocStyle_2.32.1
#> 
#> loaded via a namespace (and not attached):
#>  [1] sass_0.4.9          stringi_1.8.4       digest_0.6.37       magrittr_2.0.3      evaluate_0.24.0     bookdown_0.46      
#>  [7] pkgload_1.4.0       fastmap_1.2.0       rprojroot_2.1.1     jsonlite_1.8.8      pkgbuild_1.4.4      sessioninfo_1.2.2  
#> [13] brio_1.1.5          urlchecker_1.0.1    promises_1.3.0      BiocManager_1.30.25 purrr_1.0.2         jquerylib_0.1.4    
#> [19] cli_3.6.3           shiny_1.9.1         rlang_1.1.4         ellipsis_0.3.2      remotes_2.5.0       withr_3.0.1        
#> [25] cachem_1.1.0        yaml_2.3.10         devtools_2.4.5      tools_4.4.1         memoise_2.0.1       httpuv_1.6.15      
#> [31] vctrs_0.6.5         R6_2.5.1            mime_0.12           lifecycle_1.0.4     stringr_1.5.1       fs_1.6.4           
#> [37] htmlwidgets_1.6.4   usethis_3.0.0       miniUI_0.1.1.1      desc_1.4.3          bslib_0.8.0         later_1.3.2        
#> [43] rsconnect_1.7.0     glue_1.7.0          profvis_0.3.8       Rcpp_1.0.13         highr_0.11          xfun_0.50          
#> [49] rstudioapi_0.16.0   knitr_1.48          xtable_1.8-4        htmltools_0.5.8.1   rmarkdown_2.28      testthat_3.2.1.1   
#> [55] compiler_4.4.1