isma: an R package for the integrative analysis of mutations detected by multiple pipelines

Noemi Di Nanni & Ettore Mosca

2018-11-26

Description

This document describes isma (Di Nanni et al 2018), an R package that, beyond site integration and filtering, provides functions for a more in-depth analysis of mutation sites (both SNV and INDELs) occurrence across subjects and tools. The results generated by isma underline common patterns (e.g. recurrent calls, typical tool consensus across subjects) and specificities (e.g. outlier samples, pipeline specific sites, genes enriched in calls from a single pipeline) as well as sites already catalogued by The Cancer Genome Atlas (TCGA) (Tomczak et al. 2015), so as to design and apply filtering strategies to screen more reliable sites. The package is available at the URL http://www.interomics.eu/tools/isma.

Installation

The following instuctions - to be executed within the R environment - refer to R packages that must be installed in case you are NOT using RStudio docker image with isma package.

install.packages(c("devtools", "gplots", "ggplot2", "igraph", "lattice", "knitr", "RColorBrewer", "rmarkdown", "stringr", "UpSetR", "vcfR"))
devtools::install_github(repo = "BioinformaticsFMRP/TCGAbiolinks")

R >= 3.5:

if(!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c("IRanges", "GenomicRanges", "GenomicFeatures", "org.Hs.eg.db", "Rsamtools", "SummarizedExperiment", "TxDb.Hsapiens.UCSC.hg19.knownGene", "TxDb.Hsapiens.UCSC.hg38.knownGene", "VariantAnnotation"))

Older versions of R:

source("https://bioconductor.org/biocLite.R")
biocLite(c("IRanges", "GenomicRanges", "GenomicFeatures", "org.Hs.eg.db", "Rsamtools", "SummarizedExperiment", "TxDb.Hsapiens.UCSC.hg19.knownGene", "TxDb.Hsapiens.UCSC.hg38.knownGene", "VariantAnnotation"))

Lastly, isma:

install.packages("/yourpath/isma_0.3.0.tar.gz", repos=NULL)
library(isma)

Input

The package takes as input Variant Call Format (VCF) files generated by mutation callers. Currently isma supports VCF files from: mutect (versions 1 and 2) (Cibulskis et al. 2013), muse (Fan et al. 2016), somaticsniper (Larson et al. 2011), strelka (Saunder et al. 2012) and varscan2 (Koboldt et al. 2012). Alternatively isma supports tab-delimited files (see “Custom input format” section). Input file location and other parameters must be specified in a tab-delimited configuration file with the following columns:

##                File  Tool_ID Subject_ID Design Tumor_ID Normal_ID
## 1    S15_S14_M1.vcf  mutect1        186   t1_n      S15       S14
## 2    S19_S18_M1.vcf  mutect1        187    t_n      S19       S18
## 3       S15_S14.vcf  mutect2        186   t1_n    TUMOR    NORMAL
## 4       S19_S18.vcf  mutect2        187    t_n    TUMOR    NORMAL
## 5  S15_14_INDEL.vcf varscan2        186   t1_n    TUMOR    NORMAL
## 6   S15_S14_SNP.vcf varscan2        186   t1_n    TUMOR    NORMAL
## 7 S19_S18_INDEL.vcf varscan2        187    t_n    TUMOR    NORMAL
## 8   S19_S18_SNP.vcf varscan2        187    t_n    TUMOR    NORMAL

Summary

All the analysis can be run using the following functions:

sites <- pre_process(configuration_file)
site_analysis(sites, directory = "/yourpath/output", cores = 2)
sites_ann <- site_annotation(sites, genome = "hg38")
gene_analysis(sites_ann, directory = "/yourpath/output", cores = 2)
ese1 <- ese_allsubj(sites, type = "Site")
ese2 <- ese_tool_subj(sites, type = "Site")
ese3 <- ese_subj_tool(sites, type = "Site")
sites_ann<- integrate_TCGA(sites_ann, genome = "hg38", project = "BRCA", pipeline = "mutect2", gene_info = TRUE)

Note that annotated mutation sites can be also analysed at gene level, using the argument type = "Gene". In the following sections we will show the main functions of isma considering a small dataset (somatic mutations of 50 breast cancer subjects) from The Cancer Genome Atlas (TCGA) (Tomczak et al. 2015).

Reading input files and merging mutation sites from different callers

The function pre_process reads input VCF files and generates a data.frame, in which variants are merged from different callers, experimental evidences from different VCF files are harmonized, unique site identifiers are generated and some fields are added (the total number of reads supporting a site in each sample, variant allele frequency (VAF), INDEL size and mutation type):

sites <- pre_process(configuration_file)

To demonstrate the capabilities of isma, we implemented the function get_TCGA_sites(), which collects somatic mutations from TCGA (Tomczak et al. 2015) and returns the same output of the function pre_process. In the following example we collect mutation sites detected by 4 mutation callers from 50 breast cancer subjects (you can load the barcodes of the selected subjects using data(barcode_50), selected to include different types of outliers as well as samples without abnormal behaviours. Since data retrieval from TCGA may take some time according to your internet connection, you can load precomputed results using data(sites).

sites <- get_TCGA_sites(tools = c("muse", "mutect2", "varscan2","somaticsniper"), n_subjects = 50, seed_n = 2) #demo with TCGA data, use seed option to obtain each time the same subjects. NB. The barcode of subjects download are not the same of the selected 50 subjects.
data(barcode_50) #load the barcodes of the 50 selected subjects
sites <- get_TCGA_sites(tools = c("muse", "mutect2", "varscan2","somaticsniper"), barcode = barcode_50) #demo with TCGA data of the selected 50 subjects
data(sites) #precomputed results
head(sites)
##     CHROM       POS REF ALT Site_ID Variant_Type Subject_ID Design
## 52   chr3 179218303   G   A     S12          SNP       A0YC    WXS
## 53   chr3 179218303   G   A     S12          SNP       A1J5    WXS
## 139  chr3 179218303   G   A     S12          SNP       A1JN    WXS
## 145  chr3 179218303   G   A     S12          SNP       A0YC    WXS
## 158  chr3 179218303   G   A     S12          SNP       A1JN    WXS
## 159  chr3 179218303   G   A     S12          SNP       A0YC    WXS
##     Tumor_ref_reads Tumor_var_reads Tumor_reads Tumor_Vaf Normal_ref_reads
## 52               87              40         127 0.3149606               NA
## 53               44              20          64 0.3125000               NA
## 139              27              30          57 0.5263158               NA
## 145              90              40         130 0.3076923               NA
## 158              33              34          67 0.5074627               NA
## 159              91              40         131 0.3053435               NA
##     Normal_var_reads Normal_reads Normal_Vaf size_INDEL       Tool_ID
## 52                NA          133         NA          0       mutect2
## 53                NA           95         NA          0      varscan2
## 139               NA           72         NA          0      varscan2
## 145               NA          135         NA          0 somaticsniper
## 158               NA           77         NA          0          muse
## 159               NA          135         NA          0          muse
##     SiteID_Subject SiteID_Subject_Tool_ID
## 52        S12_A0YC       S12_A0YC_mutect2
## 53        S12_A1J5      S12_A1J5_varscan2
## 139       S12_A1JN      S12_A1JN_varscan2
## 145       S12_A0YC S12_A0YC_somaticsniper
## 158       S12_A1JN          S12_A1JN_muse
## 159       S12_A0YC          S12_A0YC_muse

Site analysis

The function site_analysis runs a series of function to obtain statistics at sites level, like the overlap among subjects and callers, the detection of outlier subjects and tools, the number of sites detected by each tool and the occurrence of mutation sites across callers and subject, calculating a consensus measure among callers in each subject. It only requires as input the data.frame generated by pre_process (or get_TCGA_sites()) and the output directory.

site_analysis(sites, directory = "/yourpath/output", cores = 2)

The overall consensus plot shows the consensus among pipelines.

The detailed consensus plot displays the total amount of sites, consensus on sites for each subject and subject recognized as outlier according to site number, imbalance in site number across tools, imbalance in consensus among tool and tool consensus score.

The previous analysis on outlier detection is summarized into a table:

head(df_outlier,4)
##    subj hypermutated Imbalance_in_site_number_across_tools
## 15 A0U0            1                                     1
## 3  A08H            1                                     0
## 28 A1J5            1                                     0
## 33 A1XQ            1                                     0
##    Imbalance_in_consensus_among_tools CS
## 15                                  0  0
## 3                                   0  0
## 28                                  0  0
## 33                                  0  0

Site co-occurrence matrices reports quantities for all-pairs of callers and subjects. For example, the figure below underlines how mutect2 detected up to 3 times more mutation sites than other tools, while somatic sniper shared the majority of its calls with other tools.

Annotating mutation sites

isma annotates sites relying on other R packages, like VariantAnnotion (Obenchain et al. 2014), TxDb.Hsapiens.UCSC.hg38.knownGene (Team BC and Maintainer BP 2016) and org.Hs.eg.db (Carlson M 2017).

sites_ann <- site_annotation(sites, genome = "hg38")
head(sites_ann)
##    CHROM       POS REF ALT Site_ID Variant_Type Subject_ID Design
## 1   chr3 179218303   G   A     S12          SNP       A1JN    WXS
## 3   chr3 179218303   G   A     S12          SNP       A1J5    WXS
## 5   chr3 179218303   G   A     S12          SNP       A1JN    WXS
## 7   chr3 179218303   G   A     S12          SNP       A0YC    WXS
## 9   chr3 179218303   G   A     S12          SNP       A0YC    WXS
## 11  chr3 179218303   G   A     S12          SNP       A1J5    WXS
##    Tumor_ref_reads Tumor_var_reads Tumor_reads Tumor_Vaf Normal_ref_reads
## 1               27              30          57 0.5263158               NA
## 3               55              25          80 0.3125000               NA
## 5               33              34          67 0.5074627               NA
## 7               91              40         131 0.3053435               NA
## 9               74              35         109 0.3211009               NA
## 11              52              24          76 0.3157895               NA
##    Normal_var_reads Normal_reads Normal_Vaf size_INDEL       Tool_ID
## 1                NA           72         NA          0      varscan2
## 3                NA          111         NA          0 somaticsniper
## 5                NA           77         NA          0          muse
## 7                NA          135         NA          0          muse
## 9                NA          110         NA          0      varscan2
## 11               NA          108         NA          0       mutect2
##    SiteID_Subject SiteID_Subject_Tool_ID Location locstart locend  txid
## 1        S12_A1JN      S12_A1JN_varscan2   coding     1633   1633 33941
## 3        S12_A1J5 S12_A1J5_somaticsniper   coding     1633   1633 33941
## 5        S12_A1JN          S12_A1JN_muse   coding     1633   1633 33941
## 7        S12_A0YC          S12_A0YC_muse   coding     1633   1633 33941
## 9        S12_A0YC      S12_A0YC_varscan2   coding     1633   1633 33941
## 11       S12_A1J5       S12_A1J5_mutect2   coding     1633   1633 33941
##    Gene Consequence
## 1  5290      coding
## 3  5290      coding
## 5  5290      coding
## 7  5290      coding
## 9  5290      coding
## 11 5290      coding

A column is added to classify mutation sites as coding and noncoding, on the basis of their Consequence. In our TCGA data most of the sites are coding.

barplot(table(sites_ann$Location, sites_ann$Tool_ID), col = c(8,1), legend = c("coding", "noncoding"), beside = T)

Gene analysis

The statistics obtained with site_analysis can be carried out at gene level using the gene_analysis wrapper function, which further provides information like gene mutation frequency across subjects and its dispersion estimated considering multiple callers.

gene_analysis(sites_ann, directory = "/yourpath/output", cores = 2)

Gene co-occurence among subjects underlines similarity between mutation profiles, for example A1XQ shares more mutated genes with A0U0 subject than A2FW; the variability of such co-occurrences due to the use of different callers is quantified as the coefficient of variation (above diagonal).

The analysis of gene mutation frequency highlights genes with a more or less variable frequency according to the pipelines used.

Exploring the relation between called sites and experimental evidences

isma contains a set of functions to study the relation between called sites and experimental evidences.
In particular three functions are provided: ese_allsubj, ese_tool_subj and ese_subj_tool. By default these functions quantify site amount at varying values of Tumor_var_reads, Normal_var_reads, Tumor_reads, Normal_reads, Tumor_Vaf and Normal_Vaf. In particular:

These functions can also be run at gene level and using custom configurations to explore other experimental evidences.

The results provided by these functons are useful to select possible values to filter sites. Let us explore how the amount of sites detected by mutation callers varies with different values of tumor VAF.

For example, using ese_allsubj, we observe that mutect2 calls more sites than the other two callers, especially at low tumor VAF.

library(isma)
ese1 <- ese_allsubj(sites, type = "Site", bin = 30)
ese1tvaf <- ese1$filtering_output$Tumor_Vaf
head(ese1tvaf)
##          par mutect2 varscan2 somaticsniper muse
## 1 0.00000000    6589     2937          1615 3110
## 2 0.03448276    6534     2937          1615 3109
## 3 0.06896552    5222     2937          1615 2945
## 4 0.10344828    3857     2854          1615 2504
## 5 0.13793103    2854     2255          1615 2071
## 6 0.17241379    2141     1795          1548 1695
matplot(ese1tvaf[, -1], type = "l", xaxt = "n", col = 1:4, ylab = "#{sites | tumor VAF >= eps}", xlab = "eps", lwd=2, lty=1:4)
axis(1, at = 1:nrow(ese1tvaf), labels = t(round(ese1tvaf$par, 2)), las=2)
legend('topright', inset = .05, legend = colnames(ese1tvaf)[-1], col = 1:4, lty=1:4, lwd=2)

For each subject, ese_subj_tool returns the amount of sites found by different tools; here is the case of subject “A1NC”, in which mutect2 calls more mutation sites than other tools, especially at lower VAF.

library(isma)
ese2 <- ese_subj_tool(sites, type = "Site")
ese2tvafa0ba <- ese2$filtering_output$Tumor_Vaf$A1NC
head(ese2tvafa0ba)
##          par varscan2 mutect2 somaticsniper muse
## 1 0.03977273       67     259            36   60
## 2 0.07978469       67     209            36   60
## 3 0.11979665       60     153            36   55
## 4 0.15980861       39      98            36   45
## 5 0.19982057       31      70            29   32
## 6 0.23983254       17      43            17   20
Tool_ID <- colnames(ese2tvafa0ba[, -1])
matplot(ese2tvafa0ba[,-(1)], type = "l", xaxt = "n",  ylab = "#{sites | tumor VAF >= eps}", xlab = "eps", main = "", lwd=2, col=1:4, lty=1:4)
axis(1, at = 1:nrow(ese2tvafa0ba), labels = t(round(ese2tvafa0ba$par, 2)), las=2)
legend('toprigh', legend = colnames(ese2tvafa0ba)[-1], lty = 1:4, col = 1:4, lwd=2)

Importantly, this and other analyses can be also carried out at gene level (see below).

Collecting data from the TCGA

isma offers the opportunity to collect data from the TCGA, in order to check whether the mutation sites under analysisare available at the TCGA Note that this function requires the R package TCGAbiolinks (Tomczak et al. 2015) and an internet connection to retrieve data from TCGA servers.

Specifically, the function integrate_TCGA adds the column Site_TCGA, in which 1 indicates the presence of the same mutation in TCGA (currently this function does not support INDELs, because of different encoding formats: INDELs will have Site_TCGA equal to NA). Optionally, the presence of other mutations affecting the same gene (Gene_TCGA) and gene mutation frequency in TCGA cohort (Gene_mut_freq) are reported.

In general, the function can be used in the following way (we use the option to exclude hypermutated subjects):

sites_TCGA <- integrate_TCGA(sites_ann, genome = "hg38", project = "BRCA", pipeline = "mutect2", gene_info = TRUE, exclude_hypermutated = T)
colnames(sites_TCGA)
##  [1] "Gene"                   "CHROM"                 
##  [3] "POS"                    "REF"                   
##  [5] "ALT"                    "Site_ID"               
##  [7] "Variant_Type"           "Subject_ID"            
##  [9] "Design"                 "Tumor_ref_reads"       
## [11] "Tumor_var_reads"        "Tumor_reads"           
## [13] "Tumor_Vaf"              "Normal_ref_reads"      
## [15] "Normal_var_reads"       "Normal_reads"          
## [17] "Normal_Vaf"             "size_INDEL"            
## [19] "Tool_ID"                "SiteID_Subject"        
## [21] "SiteID_Subject_Tool_ID" "Location"              
## [23] "locstart"               "locend"                
## [25] "txid"                   "Consequence"           
## [27] "ALT_TCGA"               "Site_TCGA"             
## [29] "Gene_TCGA"              "Gene_mut_freq"

As expected, since we are using TCGA sites as input and asked for information on site availability among TCGA sites found by mutect2, only a few input sites are not available among TCGA sites found by mutect2, i.e. the sites that were found by other callers (muse, varscan and somatic sniper), but not by mutect2.

table(sites_TCGA$Site_TCGA)
## 
##     0     1 
## 10413  7717

Custom input format

Custom input files must have the following columns:

library(isma)
data(custom_format_example)
print(custom_format_example)
##   CHROM      POS REF ALT Tumor_ref_reads Tumor_var_reads Normal_ref_reads
## 1  chr1  1190867   C   T              58              25               90
## 2  chr1 17030795   G   A              25               7               12
##   Normal_var_reads
## 1                1
## 2                0

Reference

Carlson M. 2017. “Org.Hs.eg.db: Genome Wide Annotation for Human.” R package version 3.5.0.

Cibulskis et al. 2013. “Sensitive Detection of Somatic Point Mutations in Impure and Heterogeneous Cancer Samples.” Nature Biotechnology.

Di Nanni et al. 2018. “Isma: Integrative Analysis of Mutations Detected by Multiple Calling Tools.” R package version 0.1.0.

Fan et al. 2016. “MuSE: Accounting for Tumor Heterogeneity Using a Sample-Specific Error Model Improves Sensitivity and Specificity in Mutation Calling from Sequencing Data.” Genome Biology.

Koboldt et al. 2012. “Somatic Mutation and Copy Number Alteration Discovery in Cancer by Exome Sequencing.” Genome Res.

Larson et al. 2011. “SomaticSniper: Identification of Somatic Point Mutations in Whole Genome Sequencing Data.” Bioinformatics.

Obenchain et al. 2014. “VariantAnnotation: A Bioconductor Package for Explora-Tion and Annotation of Genetic Variants.” Bioinformatics.

Saunder et al. 2012. “Strelka: Accurate Somatic Small-Variant Calling from Se-Quenced Tumor–normal Sample Pairs.” Bioinformatics.

Team BC and Maintainer BP. 2016. “TxDb.Hsapiens.UCSC.hg38.knownGene: Annotation Package for Txdb Object(s).” R package version 3.4.0.