This document describes isma (Di Nanni et al 2018), an R package that, beyond site integration and filtering, provides functions for a more in-depth analysis of mutation sites (both SNV and INDELs) occurrence across subjects and tools. The results generated by isma underline common patterns (e.g. recurrent calls, typical tool consensus across subjects) and specificities (e.g. outlier samples, pipeline specific sites, genes enriched in calls from a single pipeline) as well as sites already catalogued by The Cancer Genome Atlas (TCGA) (Tomczak et al. 2015), so as to design and apply filtering strategies to screen more reliable sites. The package is available at the URL http://www.interomics.eu/tools/isma.
The following instuctions - to be executed within the R environment - refer to R packages that must be installed in case you are NOT using RStudio docker image with isma package.
install.packages(c("devtools", "gplots", "ggplot2", "igraph", "lattice", "knitr", "RColorBrewer", "rmarkdown", "stringr", "UpSetR", "vcfR"))
devtools::install_github(repo = "BioinformaticsFMRP/TCGAbiolinks")
R >= 3.5:
if(!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c("IRanges", "GenomicRanges", "GenomicFeatures", "org.Hs.eg.db", "Rsamtools", "SummarizedExperiment", "TxDb.Hsapiens.UCSC.hg19.knownGene", "TxDb.Hsapiens.UCSC.hg38.knownGene", "VariantAnnotation"))
Older versions of R:
source("https://bioconductor.org/biocLite.R")
biocLite(c("IRanges", "GenomicRanges", "GenomicFeatures", "org.Hs.eg.db", "Rsamtools", "SummarizedExperiment", "TxDb.Hsapiens.UCSC.hg19.knownGene", "TxDb.Hsapiens.UCSC.hg38.knownGene", "VariantAnnotation"))
Lastly, isma:
install.packages("/yourpath/isma_0.3.0.tar.gz", repos=NULL)
library(isma)
The package takes as input Variant Call Format (VCF) files generated by mutation callers. Currently isma supports VCF files from: mutect (versions 1 and 2) (Cibulskis et al. 2013), muse (Fan et al. 2016), somaticsniper (Larson et al. 2011), strelka (Saunder et al. 2012) and varscan2 (Koboldt et al. 2012). Alternatively isma supports tab-delimited files (see “Custom input format” section). Input file location and other parameters must be specified in a tab-delimited configuration file with the following columns:
## File Tool_ID Subject_ID Design Tumor_ID Normal_ID
## 1 S15_S14_M1.vcf mutect1 186 t1_n S15 S14
## 2 S19_S18_M1.vcf mutect1 187 t_n S19 S18
## 3 S15_S14.vcf mutect2 186 t1_n TUMOR NORMAL
## 4 S19_S18.vcf mutect2 187 t_n TUMOR NORMAL
## 5 S15_14_INDEL.vcf varscan2 186 t1_n TUMOR NORMAL
## 6 S15_S14_SNP.vcf varscan2 186 t1_n TUMOR NORMAL
## 7 S19_S18_INDEL.vcf varscan2 187 t_n TUMOR NORMAL
## 8 S19_S18_SNP.vcf varscan2 187 t_n TUMOR NORMAL
All the analysis can be run using the following functions:
sites <- pre_process(configuration_file)
site_analysis(sites, directory = "/yourpath/output", cores = 2)
sites_ann <- site_annotation(sites, genome = "hg38")
gene_analysis(sites_ann, directory = "/yourpath/output", cores = 2)
ese1 <- ese_allsubj(sites, type = "Site")
ese2 <- ese_tool_subj(sites, type = "Site")
ese3 <- ese_subj_tool(sites, type = "Site")
sites_ann<- integrate_TCGA(sites_ann, genome = "hg38", project = "BRCA", pipeline = "mutect2", gene_info = TRUE)
Note that annotated mutation sites can be also analysed at gene level, using the argument type = "Gene"
. In the following sections we will show the main functions of isma considering a small dataset (somatic mutations of 50 breast cancer subjects) from The Cancer Genome Atlas (TCGA) (Tomczak et al. 2015).
The function pre_process
reads input VCF files and generates a data.frame
, in which variants are merged from different callers, experimental evidences from different VCF files are harmonized, unique site identifiers are generated and some fields are added (the total number of reads supporting a site in each sample, variant allele frequency (VAF), INDEL size and mutation type):
sites <- pre_process(configuration_file)
To demonstrate the capabilities of isma, we implemented the function get_TCGA_sites()
, which collects somatic mutations from TCGA (Tomczak et al. 2015) and returns the same output of the function pre_process
. In the following example we collect mutation sites detected by 4 mutation callers from 50 breast cancer subjects (you can load the barcodes of the selected subjects using data(barcode_50)
, selected to include different types of outliers as well as samples without abnormal behaviours. Since data retrieval from TCGA may take some time according to your internet connection, you can load precomputed results using data(sites)
.
sites <- get_TCGA_sites(tools = c("muse", "mutect2", "varscan2","somaticsniper"), n_subjects = 50, seed_n = 2) #demo with TCGA data, use seed option to obtain each time the same subjects. NB. The barcode of subjects download are not the same of the selected 50 subjects.
data(barcode_50) #load the barcodes of the 50 selected subjects
sites <- get_TCGA_sites(tools = c("muse", "mutect2", "varscan2","somaticsniper"), barcode = barcode_50) #demo with TCGA data of the selected 50 subjects
data(sites) #precomputed results
head(sites)
## CHROM POS REF ALT Site_ID Variant_Type Subject_ID Design
## 52 chr3 179218303 G A S12 SNP A0YC WXS
## 53 chr3 179218303 G A S12 SNP A1J5 WXS
## 139 chr3 179218303 G A S12 SNP A1JN WXS
## 145 chr3 179218303 G A S12 SNP A0YC WXS
## 158 chr3 179218303 G A S12 SNP A1JN WXS
## 159 chr3 179218303 G A S12 SNP A0YC WXS
## Tumor_ref_reads Tumor_var_reads Tumor_reads Tumor_Vaf Normal_ref_reads
## 52 87 40 127 0.3149606 NA
## 53 44 20 64 0.3125000 NA
## 139 27 30 57 0.5263158 NA
## 145 90 40 130 0.3076923 NA
## 158 33 34 67 0.5074627 NA
## 159 91 40 131 0.3053435 NA
## Normal_var_reads Normal_reads Normal_Vaf size_INDEL Tool_ID
## 52 NA 133 NA 0 mutect2
## 53 NA 95 NA 0 varscan2
## 139 NA 72 NA 0 varscan2
## 145 NA 135 NA 0 somaticsniper
## 158 NA 77 NA 0 muse
## 159 NA 135 NA 0 muse
## SiteID_Subject SiteID_Subject_Tool_ID
## 52 S12_A0YC S12_A0YC_mutect2
## 53 S12_A1J5 S12_A1J5_varscan2
## 139 S12_A1JN S12_A1JN_varscan2
## 145 S12_A0YC S12_A0YC_somaticsniper
## 158 S12_A1JN S12_A1JN_muse
## 159 S12_A0YC S12_A0YC_muse
The function site_analysis
runs a series of function to obtain statistics at sites level, like the overlap among subjects and callers, the detection of outlier subjects and tools, the number of sites detected by each tool and the occurrence of mutation sites across callers and subject, calculating a consensus measure among callers in each subject. It only requires as input the data.frame
generated by pre_process
(or get_TCGA_sites()
) and the output directory.
site_analysis(sites, directory = "/yourpath/output", cores = 2)
The overall consensus plot shows the consensus among pipelines.
The detailed consensus plot displays the total amount of sites, consensus on sites for each subject and subject recognized as outlier according to site number, imbalance in site number across tools, imbalance in consensus among tool and tool consensus score.
The previous analysis on outlier detection is summarized into a table:
head(df_outlier,4)
## subj hypermutated Imbalance_in_site_number_across_tools
## 15 A0U0 1 1
## 3 A08H 1 0
## 28 A1J5 1 0
## 33 A1XQ 1 0
## Imbalance_in_consensus_among_tools CS
## 15 0 0
## 3 0 0
## 28 0 0
## 33 0 0
Site co-occurrence matrices reports quantities for all-pairs of callers and subjects. For example, the figure below underlines how mutect2 detected up to 3 times more mutation sites than other tools, while somatic sniper shared the majority of its calls with other tools.
isma annotates sites relying on other R packages, like VariantAnnotion
(Obenchain et al. 2014), TxDb.Hsapiens.UCSC.hg38.knownGene
(Team BC and Maintainer BP 2016) and org.Hs.eg.db
(Carlson M 2017).
sites_ann <- site_annotation(sites, genome = "hg38")
head(sites_ann)
## CHROM POS REF ALT Site_ID Variant_Type Subject_ID Design
## 1 chr3 179218303 G A S12 SNP A1JN WXS
## 3 chr3 179218303 G A S12 SNP A1J5 WXS
## 5 chr3 179218303 G A S12 SNP A1JN WXS
## 7 chr3 179218303 G A S12 SNP A0YC WXS
## 9 chr3 179218303 G A S12 SNP A0YC WXS
## 11 chr3 179218303 G A S12 SNP A1J5 WXS
## Tumor_ref_reads Tumor_var_reads Tumor_reads Tumor_Vaf Normal_ref_reads
## 1 27 30 57 0.5263158 NA
## 3 55 25 80 0.3125000 NA
## 5 33 34 67 0.5074627 NA
## 7 91 40 131 0.3053435 NA
## 9 74 35 109 0.3211009 NA
## 11 52 24 76 0.3157895 NA
## Normal_var_reads Normal_reads Normal_Vaf size_INDEL Tool_ID
## 1 NA 72 NA 0 varscan2
## 3 NA 111 NA 0 somaticsniper
## 5 NA 77 NA 0 muse
## 7 NA 135 NA 0 muse
## 9 NA 110 NA 0 varscan2
## 11 NA 108 NA 0 mutect2
## SiteID_Subject SiteID_Subject_Tool_ID Location locstart locend txid
## 1 S12_A1JN S12_A1JN_varscan2 coding 1633 1633 33941
## 3 S12_A1J5 S12_A1J5_somaticsniper coding 1633 1633 33941
## 5 S12_A1JN S12_A1JN_muse coding 1633 1633 33941
## 7 S12_A0YC S12_A0YC_muse coding 1633 1633 33941
## 9 S12_A0YC S12_A0YC_varscan2 coding 1633 1633 33941
## 11 S12_A1J5 S12_A1J5_mutect2 coding 1633 1633 33941
## Gene Consequence
## 1 5290 coding
## 3 5290 coding
## 5 5290 coding
## 7 5290 coding
## 9 5290 coding
## 11 5290 coding
A column is added to classify mutation sites as coding
and noncoding
, on the basis of their Consequence
. In our TCGA data most of the sites are coding.
barplot(table(sites_ann$Location, sites_ann$Tool_ID), col = c(8,1), legend = c("coding", "noncoding"), beside = T)
The statistics obtained with site_analysis
can be carried out at gene level using the gene_analysis
wrapper function, which further provides information like gene mutation frequency across subjects and its dispersion estimated considering multiple callers.
gene_analysis(sites_ann, directory = "/yourpath/output", cores = 2)
Gene co-occurence among subjects underlines similarity between mutation profiles, for example A1XQ shares more mutated genes with A0U0 subject than A2FW; the variability of such co-occurrences due to the use of different callers is quantified as the coefficient of variation (above diagonal).
The analysis of gene mutation frequency highlights genes with a more or less variable frequency according to the pipelines used.
isma contains a set of functions to study the relation between called sites and experimental evidences.
In particular three functions are provided: ese_allsubj
, ese_tool_subj
and ese_subj_tool
. By default these functions quantify site amount at varying values of Tumor_var_reads
, Normal_var_reads
, Tumor_reads
, Normal_reads
, Tumor_Vaf
and Normal_Vaf
. In particular:
ese_allsubj
calculates the variation of site amount and shows the results for each tool;ese_subj_tool
calculates the variation of site amount, considering separately each subject and returns the results for each caller.ese_tool_subj
calculates the variation of site amount, considering separately each tool and returns the results for each subject;These functions can also be run at gene level and using custom configurations to explore other experimental evidences.
The results provided by these functons are useful to select possible values to filter sites. Let us explore how the amount of sites detected by mutation callers varies with different values of tumor VAF.
For example, using ese_allsubj
, we observe that mutect2 calls more sites than the other two callers, especially at low tumor VAF.
library(isma)
ese1 <- ese_allsubj(sites, type = "Site", bin = 30)
ese1tvaf <- ese1$filtering_output$Tumor_Vaf
head(ese1tvaf)
## par mutect2 varscan2 somaticsniper muse
## 1 0.00000000 6589 2937 1615 3110
## 2 0.03448276 6534 2937 1615 3109
## 3 0.06896552 5222 2937 1615 2945
## 4 0.10344828 3857 2854 1615 2504
## 5 0.13793103 2854 2255 1615 2071
## 6 0.17241379 2141 1795 1548 1695
matplot(ese1tvaf[, -1], type = "l", xaxt = "n", col = 1:4, ylab = "#{sites | tumor VAF >= eps}", xlab = "eps", lwd=2, lty=1:4)
axis(1, at = 1:nrow(ese1tvaf), labels = t(round(ese1tvaf$par, 2)), las=2)
legend('topright', inset = .05, legend = colnames(ese1tvaf)[-1], col = 1:4, lty=1:4, lwd=2)
For each subject, ese_subj_tool
returns the amount of sites found by different tools; here is the case of subject “A1NC”, in which mutect2 calls more mutation sites than other tools, especially at lower VAF.
library(isma)
ese2 <- ese_subj_tool(sites, type = "Site")
ese2tvafa0ba <- ese2$filtering_output$Tumor_Vaf$A1NC
head(ese2tvafa0ba)
## par varscan2 mutect2 somaticsniper muse
## 1 0.03977273 67 259 36 60
## 2 0.07978469 67 209 36 60
## 3 0.11979665 60 153 36 55
## 4 0.15980861 39 98 36 45
## 5 0.19982057 31 70 29 32
## 6 0.23983254 17 43 17 20
Tool_ID <- colnames(ese2tvafa0ba[, -1])
matplot(ese2tvafa0ba[,-(1)], type = "l", xaxt = "n", ylab = "#{sites | tumor VAF >= eps}", xlab = "eps", main = "", lwd=2, col=1:4, lty=1:4)
axis(1, at = 1:nrow(ese2tvafa0ba), labels = t(round(ese2tvafa0ba$par, 2)), las=2)
legend('toprigh', legend = colnames(ese2tvafa0ba)[-1], lty = 1:4, col = 1:4, lwd=2)
Importantly, this and other analyses can be also carried out at gene level (see below).
isma offers the opportunity to collect data from the TCGA, in order to check whether the mutation sites under analysisare available at the TCGA Note that this function requires the R package TCGAbiolinks
(Tomczak et al. 2015) and an internet connection to retrieve data from TCGA servers.
Specifically, the function integrate_TCGA
adds the column Site_TCGA
, in which 1
indicates the presence of the same mutation in TCGA (currently this function does not support INDELs, because of different encoding formats: INDELs will have Site_TCGA
equal to NA
). Optionally, the presence of other mutations affecting the same gene (Gene_TCGA
) and gene mutation frequency in TCGA cohort (Gene_mut_freq
) are reported.
In general, the function can be used in the following way (we use the option to exclude hypermutated subjects):
sites_TCGA <- integrate_TCGA(sites_ann, genome = "hg38", project = "BRCA", pipeline = "mutect2", gene_info = TRUE, exclude_hypermutated = T)
colnames(sites_TCGA)
## [1] "Gene" "CHROM"
## [3] "POS" "REF"
## [5] "ALT" "Site_ID"
## [7] "Variant_Type" "Subject_ID"
## [9] "Design" "Tumor_ref_reads"
## [11] "Tumor_var_reads" "Tumor_reads"
## [13] "Tumor_Vaf" "Normal_ref_reads"
## [15] "Normal_var_reads" "Normal_reads"
## [17] "Normal_Vaf" "size_INDEL"
## [19] "Tool_ID" "SiteID_Subject"
## [21] "SiteID_Subject_Tool_ID" "Location"
## [23] "locstart" "locend"
## [25] "txid" "Consequence"
## [27] "ALT_TCGA" "Site_TCGA"
## [29] "Gene_TCGA" "Gene_mut_freq"
As expected, since we are using TCGA sites as input and asked for information on site availability among TCGA sites found by mutect2, only a few input sites are not available among TCGA sites found by mutect2, i.e. the sites that were found by other callers (muse, varscan and somatic sniper), but not by mutect2.
table(sites_TCGA$Site_TCGA)
##
## 0 1
## 10413 7717
Custom input files must have the following columns:
library(isma)
data(custom_format_example)
print(custom_format_example)
## CHROM POS REF ALT Tumor_ref_reads Tumor_var_reads Normal_ref_reads
## 1 chr1 1190867 C T 58 25 90
## 2 chr1 17030795 G A 25 7 12
## Normal_var_reads
## 1 1
## 2 0
Carlson M. 2017. “Org.Hs.eg.db: Genome Wide Annotation for Human.” R package version 3.5.0.
Cibulskis et al. 2013. “Sensitive Detection of Somatic Point Mutations in Impure and Heterogeneous Cancer Samples.” Nature Biotechnology.
Di Nanni et al. 2018. “Isma: Integrative Analysis of Mutations Detected by Multiple Calling Tools.” R package version 0.1.0.
Fan et al. 2016. “MuSE: Accounting for Tumor Heterogeneity Using a Sample-Specific Error Model Improves Sensitivity and Specificity in Mutation Calling from Sequencing Data.” Genome Biology.
Koboldt et al. 2012. “Somatic Mutation and Copy Number Alteration Discovery in Cancer by Exome Sequencing.” Genome Res.
Larson et al. 2011. “SomaticSniper: Identification of Somatic Point Mutations in Whole Genome Sequencing Data.” Bioinformatics.
Obenchain et al. 2014. “VariantAnnotation: A Bioconductor Package for Explora-Tion and Annotation of Genetic Variants.” Bioinformatics.
Saunder et al. 2012. “Strelka: Accurate Somatic Small-Variant Calling from Se-Quenced Tumor–normal Sample Pairs.” Bioinformatics.
Team BC and Maintainer BP. 2016. “TxDb.Hsapiens.UCSC.hg38.knownGene: Annotation Package for Txdb Object(s).” R package version 3.4.0.
Tomczak et al. 2015. “The Cancer Genome Atlas (Tcga): An Immeasurable Source of Knowledge.” Contemp Oncol (Pozn).