Kim et al. (2025) introduced a family of centroid-based predictors that classify samples as mutant (mt) or wild-type (WT) using the Classification to the Nearest Centroids (ClaNC) method (Kim M, Lamlertthon W, Jo H, et al. Lung Adenocarcinoma Just Desserts: An Expanding Pie of Activating Oncogenes or a Layer Cake of Integrated Alterations. bioRxiv 2025. https://doi.org/10.1101/2025.09.19.677365). EGFRmSig applies this approach to calculate an EGFR mutation signature (mSig) score, covering data preprocessing and distance-based prediction. For detailed background on the methods, please see the reference paper.
First, let’s load EGFRmSig and check the structure of the two datasets provided in the package.
library(EGFRmSig)
All_subtypes_centroids: A matrix of predefined centroids with 1,020 genes in rows and 10 columns of mean values. The centroids are averaged by mutation status and are provided as subtype-adjusted, subtype-unadjusted, and subtype-specific values in z-score space.
Notation
mean_subtype.i_*: subtype-adjusted centroids (equal weights)
\[
\mu^{\mathrm{subtype.i}}_{g,y}
=
\frac{1}{|\mathscr{S}|}
\sum_{s \in \mathscr{S}}
\left(
\frac{1}{n_{s,y}}
\sum_{i \in \mathscr{I}_{s,y}}
x_{g,i}
\right)
=
\sum_{s \in \mathscr{S}}
\frac{1}{|\mathscr{S}|}
\mu_{g,y}^s
\]
mean_all_*: subtype-unadjusted centroids (subtype-weighted)
\[
\mu^{\mathrm{all}}_{g,y}
=
\frac{1}{n_y}
\sum_{i \in \mathscr{I}_y}
x_{g,i}
=
\sum_{s \in \mathscr{S}}
\frac{n_{s,y}}{n_y}
\mu_{g,y}^s
\]
mean_{subtype}_*: subtype-specific centroids
\[
\mu^{s}_{g,y}
=
\frac{1}{n_{s,y}}
\sum_{i \in \mathscr{I}_{s,y}}
x_{g,i}
\]
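As a small worked illustration of the notation above, the following base R sketch computes the three kinds of centroid for a single gene within one mutation class. The expression values, the subtype labels (S1, S2), and all object names are hypothetical and are not taken from the package.

```r
# Hypothetical z-scored expression of one gene in six mutant (mt) samples
x       <- c(1.0, 2.0, 3.0, 4.0, 6.0, 8.0)
subtype <- c("S1", "S1", "S1", "S1", "S2", "S2")

# Subtype-specific centroids: the mean within each subtype
mu_s <- tapply(x, subtype, mean)

# Subtype-adjusted centroid: every subtype gets the same weight 1/|S|
mu_subtype_i <- mean(mu_s)

# Subtype-unadjusted centroid: plain mean over all samples, which equals
# the subtype means weighted by each subtype's share of samples (n_s / n)
mu_all          <- mean(x)
w               <- as.numeric(table(subtype)) / length(x)
mu_all_weighted <- sum(w * as.numeric(mu_s))

print(mu_s)
print(c(subtype_adjusted = mu_subtype_i, subtype_unadjusted = mu_all))
```

Because S1 contributes four of the six samples, the subtype-unadjusted centroid is pulled toward the S1 mean, whereas the subtype-adjusted centroid sits midway between the two subtype means.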
data("All_subtypes_centroids")
str(All_subtypes_centroids)
#> num [1:1020, 1:10] -0.00641 -0.09077 -0.09828 0.02295 -0.08542 ...
#> - attr(*, "dimnames")=List of 2
#> ..$ : chr [1:1020] "HIP1" "PIGQ" "C16orf58" "UPK3A" ...
#> ..$ : chr [1:10] "mean_subtype.i_wt" "mean_subtype.i_mt" "mean_all_wt" "mean_all_mt" ...
TOY_expr: An example expression matrix derived from RSEM TPM values, with 1,000 gene symbols in rows and 50 sample IDs in columns. The toy dataset was generated to illustrate the analysis workflow and does not contain identifiable or controlled-access data.
data("TOY_expr")
str(TOY_expr)
#> num [1:1000, 1:50] 199.6007 1097.2179 3654.4478 21.9136 -0.0353 ...
#> - attr(*, "dimnames")=List of 2
#> ..$ : chr [1:1000] "RECK" "THOC3" "SCRN1" "TMEM151A" ...
#> ..$ : chr [1:50] "T01" "T02" "T03" "T04" ...
The compute_mSig() function computes the predicted EGFR mSig score from an RSEM TPM matrix and predefined LUAD centroids. The expr argument is required, and the centroids argument defaults to the All_subtypes_centroids data in EGFRmSig. The function reports the number of overlapping genes and, if export=TRUE, the location of the saved file.
Log-transformed expression values are scaled gene-wise. The meanCol argument selects the centroid columns to use, with two options: "subtype.i" (subtype-adjusted centroids) or "all" (subtype-unadjusted centroids). The default is "all".
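The gene-wise scaling step can be sketched in base R as follows. This is only an illustration under stated assumptions: the log base (2), the pseudocount (1), and the small matrix are hypothetical, and the exact transform used inside compute_mSig() may differ.

```r
# Hypothetical TPM-like matrix: 3 genes (rows) x 4 samples (columns)
expr <- matrix(
  c( 10, 20, 40, 80,
      5,  5, 10, 20,
    100, 50, 25, 10),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), paste0("T0", 1:4))
)

# Assumed transform: log2 with a pseudocount of 1
log_expr <- log2(expr + 1)

# scale() standardizes columns, so transpose to put genes in columns,
# scale, then transpose back to genes x samples
z <- t(scale(t(log_expr)))

# Each gene now has mean 0 and standard deviation 1 across samples
print(round(rowMeans(z), 10))
print(apply(z, 1, sd))
```

A gene with zero variance across samples would produce NaN values under this scheme, so such genes would need to be dropped or handled separately.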
Each sample is compared with the WT centroid and the mt centroid using Euclidean distance and is assigned to the nearest centroid. The output table is a samples x 3 data frame containing:
SampleID: sample IDs
EGFR_mSig_class: EGFR_mSig_WT_like (pred.ind=1) or EGFR_mSig_mut_like (pred.ind=2)
EGFR_mSig_distance: predicted EGFR mSig distance (pred.score)
compute_mSig(expr=TOY_expr)
# Overlapping genes (n_genes_used): 43
# Output saved to: ./Pred_mSig_yyyy-mm-dd.txt
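The nearest-centroid assignment described above can be sketched in base R as follows. The matrices, gene names, and output column names are hypothetical, and this is not the package's internal code. In particular, how EGFR_mSig_distance is derived from the two distances is not specified here, so the sketch simply reports both.

```r
# Hypothetical scaled expression: 3 genes x 3 samples
z <- matrix(
  c( 1.0, -1.0, 0.2,
     0.8, -0.9, 0.1,
    -1.1,  1.2, 0.0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), c("T01", "T02", "T03"))
)

# Hypothetical centroids: 4 genes x 2 columns; geneD is absent from z
centroids <- matrix(
  c(-1.0,  1.0,
    -1.0,  1.0,
     1.0, -1.0,
     0.5, -0.5),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC", "geneD"), c("wt", "mt"))
)

# Restrict both matrices to the overlapping genes, in the same order
genes <- intersect(rownames(z), rownames(centroids))
z_use <- z[genes, , drop = FALSE]
c_use <- centroids[genes, , drop = FALSE]

# Euclidean distance from every sample to each centroid
# (the centroid vector is recycled down each sample column)
d_wt <- sqrt(colSums((z_use - c_use[, "wt"])^2))
d_mt <- sqrt(colSums((z_use - c_use[, "mt"])^2))

# Assign each sample to the nearer centroid
pred <- data.frame(
  SampleID = colnames(z_use),
  class    = ifelse(d_mt < d_wt, "mut_like", "WT_like"),
  d_wt     = d_wt,
  d_mt     = d_mt
)
print(pred)
```

Only genes present in both matrices contribute to the distance, which is why the number of overlapping genes is worth checking before interpreting the predictions.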
Suppose that we have TCGA_expr, a 20,501 x 576 matrix from TCGA-LUAD. Calling compute_mSig() with export=FALSE returns the results, which are stored in the res object here. The table below summarizes the EGFR mSig distance of the TCGA-LUAD samples by class.
# TCGA expression matrix is assumed to be preloaded
# (not included due to data sharing restrictions)
res <- compute_mSig(expr=TCGA_expr, export=FALSE)
library(dplyr)
res %>%
  group_by(EGFR_mSig_class) %>%
  summarise(
    n = n(),
    min = min(EGFR_mSig_distance, na.rm=TRUE),
    q1 = quantile(EGFR_mSig_distance, 0.25, na.rm=TRUE),
    median = median(EGFR_mSig_distance, na.rm=TRUE),
    mean = mean(EGFR_mSig_distance, na.rm=TRUE),
    q3 = quantile(EGFR_mSig_distance, 0.75, na.rm=TRUE),
    max = max(EGFR_mSig_distance, na.rm=TRUE)
  )
# # A tibble: 2 × 8
# EGFR_mSig_class n min q1 median mean q3 max
# <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EGFR_mSig_WT_like 455 297. 649. 800. 949. 1125. 3951.
# 2 EGFR_mSig_mut_like 121 254. 572. 679. 759. 860. 2014.
The plot_mSig() function plots the distribution of EGFR mSig distance by class as violin plots and boxplots, using the res object returned by compute_mSig(). In the figure, EGFR mSig WT-like samples show a higher median, a larger IQR, and a long right tail, suggesting transcriptomic heterogeneity.
library(ggplot2)
plot_mSig(res=res)
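For readers who want to customize the figure, a comparable violin and boxplot display can be sketched directly with ggplot2. This is a minimal illustration, not the implementation of plot_mSig(): the data frame toy_res is simulated, and only its column names follow the output described above.

```r
library(ggplot2)

# Simulated stand-in for the compute_mSig() output (hypothetical values)
set.seed(1)
toy_res <- data.frame(
  SampleID           = sprintf("T%02d", 1:60),
  EGFR_mSig_class    = rep(c("EGFR_mSig_WT_like", "EGFR_mSig_mut_like"),
                           times = c(40, 20)),
  EGFR_mSig_distance = c(rlnorm(40, meanlog = 6.7, sdlog = 0.5),
                         rlnorm(20, meanlog = 6.5, sdlog = 0.4))
)

# Violin for the overall shape, narrow boxplot for median and IQR
p <- ggplot(toy_res, aes(x = EGFR_mSig_class, y = EGFR_mSig_distance)) +
  geom_violin(trim = FALSE) +
  geom_boxplot(width = 0.1, outlier.shape = NA) +
  labs(x = "EGFR mSig class", y = "EGFR mSig distance") +
  theme_bw()

print(p)
```

Overlaying a narrow boxplot on the violin keeps the median and IQR readable while still showing the right tail of each distribution.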
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS 26.2
#>
#> Matrix products: default
#> BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: America/Chicago
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] EGFRmSig_0.1.0 BiocStyle_2.32.1
#>
#> loaded via a namespace (and not attached):
#> [1] sass_0.4.9 stringi_1.8.4 digest_0.6.37 magrittr_2.0.3 evaluate_0.24.0 bookdown_0.46
#> [7] pkgload_1.4.0 fastmap_1.2.0 rprojroot_2.1.1 jsonlite_1.8.8 pkgbuild_1.4.4 sessioninfo_1.2.2
#> [13] brio_1.1.5 urlchecker_1.0.1 promises_1.3.0 BiocManager_1.30.25 purrr_1.0.2 jquerylib_0.1.4
#> [19] cli_3.6.3 shiny_1.9.1 rlang_1.1.4 ellipsis_0.3.2 remotes_2.5.0 withr_3.0.1
#> [25] cachem_1.1.0 yaml_2.3.10 devtools_2.4.5 tools_4.4.1 memoise_2.0.1 httpuv_1.6.15
#> [31] vctrs_0.6.5 R6_2.5.1 mime_0.12 lifecycle_1.0.4 stringr_1.5.1 fs_1.6.4
#> [37] htmlwidgets_1.6.4 usethis_3.0.0 miniUI_0.1.1.1 desc_1.4.3 bslib_0.8.0 later_1.3.2
#> [43] rsconnect_1.7.0 glue_1.7.0 profvis_0.3.8 Rcpp_1.0.13 highr_0.11 xfun_0.50
#> [49] rstudioapi_0.16.0 knitr_1.48 xtable_1.8-4 htmltools_0.5.8.1 rmarkdown_2.28 testthat_3.2.1.1
#> [55] compiler_4.4.1