Title: | In-Depth Characterization and Analysis of Mutational Signatures ('ICAMS') |
---|---|
Description: | Analysis and visualization of experimentally elucidated mutational signatures -- the kind of analysis and visualization in Boot et al., "In-depth characterization of the cisplatin mutational signature in human cell lines and in esophageal and liver tumors", Genome Research 2018, <doi:10.1101/gr.230219.117> and "Characterization of colibactin-associated mutational signature in an Asian oral squamous cell carcinoma and in other mucosal tumor types", Genome Research 2020 <doi:10.1101/gr.255620.119>. 'ICAMS' stands for In-depth Characterization and Analysis of Mutational Signatures. 'ICAMS' has functions to read in variant call files (VCFs) and to collate the corresponding catalogs of mutational spectra and to analyze and plot catalogs of mutational spectra and signatures. Handles both "counts-based" and "density-based" (i.e. representation as mutations per megabase) mutational spectra or signatures. |
Authors: | Steve Rozen, Nanhai Jiang, Arnoud Boot, Mo Liu, Yang Wu, Mi Ni Huang, Jia Geng Chang |
Maintainer: | Steve Rozen <[email protected]> |
License: | GPL-3 | file LICENSE |
Version: | 3.0.6 |
Built: | 2024-11-25 05:13:03 UTC |
Source: | https://github.com/steverozen/icams |
An R list with one element each for
BSgenome.Hsapiens.1000genomes.hs37d5
, BSgenome.Hsapiens.UCSC.hg38
and BSgenome.Mmusculus.UCSC.mm10
.
Each element is in turn a sub-list keyed by
exome
, transcript
,
and genome
. Each element of the sub list
is keyed by the number of rows in the catalog class (as a string, e.g.
"78"
, not 78
). The keys are:
78 (DBS78Catalog
), 96 (SBS96Catalog
), 136 (DBS136Catalog
),
144 (DBS144Catalog
), 192 (SBS192Catalog
),
and 1536 (SBS1536Catalog
). So, for example to get the exome
abundances for SBS96 catalogs for BSgenome.Hsapiens.UCSC.hg38
exomes
one would reference all.abundance[["BSgenome.Hsapiens.UCSC.hg38"]][["exome"]][["96"]]
or all.abundance$BSgenome.Hsapiens.UCSC.hg38$exome$"96"
.
The value of the abundance is an integer vector with the K-mers
as names and each value being the count of that K-mer.
all.abundance
all.abundance
See Description.
all.abundance$BSgenome.Hsapiens.UCSC.hg38$transcript$`144` # AA AC AG AT CA CC ... # 90769160 57156295 85738416 87552737 83479655 63267896 ... # There are 90769160 AAs on the sense strands of transcripts in # this genome.
all.abundance$BSgenome.Hsapiens.UCSC.hg38$transcript$`144` # AA AC AG AT CA CC ... # 90769160 57156295 85738416 87552737 83479655 63267896 ... # There are 90769160 AAs on the sense strands of transcripts in # this genome.
Add sequence context and transcript information to an in-memory DBS VCF
AnnotateDBSVCF(DBS.vcf, ref.genome, trans.ranges = NULL, name.of.VCF = NULL)
AnnotateDBSVCF(DBS.vcf, ref.genome, trans.ranges = NULL, name.of.VCF = NULL)
DBS.vcf |
An in-memory DBS VCF as a |
ref.genome |
A |
trans.ranges |
Optional. If
then the function will infer |
name.of.VCF |
Name of the VCF file. |
An in-memory DBS VCF as a data.table
. This has been annotated
with the sequence context (column name seq.21bases
) and with
transcript information in the form of a gene symbol (e.g. "TP53"
)
and transcript strand. This information is in the columns
trans.start.pos
, trans.end.pos
, trans.strand
,
trans.Ensembl.gene.ID
and trans.gene.symbol
in the output.
These columns are not added if is.null(trans.ranges)
.
file <- c(system.file("extdata/Strelka-SBS-vcf", "Strelka.SBS.GRCh37.s1.vcf", package = "ICAMS")) list.of.vcfs <- ReadAndSplitVCFs(file, variant.caller = "strelka") DBS.vcf <- list.of.vcfs$DBS[[1]] if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) { annotated.DBS.vcf <- AnnotateDBSVCF(DBS.vcf, ref.genome = "hg19", trans.ranges = trans.ranges.GRCh37)}
file <- c(system.file("extdata/Strelka-SBS-vcf", "Strelka.SBS.GRCh37.s1.vcf", package = "ICAMS")) list.of.vcfs <- ReadAndSplitVCFs(file, variant.caller = "strelka") DBS.vcf <- list.of.vcfs$DBS[[1]] if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) { annotated.DBS.vcf <- AnnotateDBSVCF(DBS.vcf, ref.genome = "hg19", trans.ranges = trans.ranges.GRCh37)}
Add sequence context to an in-memory ID (insertion/deletion) VCF, and confirm that they match the given reference genome
AnnotateIDVCF( ID.vcf, ref.genome, flag.mismatches = 0, name.of.VCF = NULL, suppress.discarded.variants.warnings = TRUE )
AnnotateIDVCF( ID.vcf, ref.genome, flag.mismatches = 0, name.of.VCF = NULL, suppress.discarded.variants.warnings = TRUE )
ID.vcf |
An in-memory ID (insertion/deletion) VCF as a
|
ref.genome |
A |
flag.mismatches |
Deprecated. If there are ID variants whose |
name.of.VCF |
Name of the VCF file. |
suppress.discarded.variants.warnings |
Logical. Whether to suppress warning messages showing information about the discarded variants. Default is TRUE. |
A list of elements:
annotated.vcf
: The original VCF data
frame with two new columns added to the input data frame:
seq.context
: The sequence embedding the variant.
seq.context.width
: The width of seq.context
to the left.
discarded.variants
: Non-NULL only if there are variants
that were excluded from the analysis. See the added extra column
discarded.reason
for more details.
file <- c(system.file("extdata/Strelka-ID-vcf/", "Strelka.ID.GRCh37.s1.vcf", package = "ICAMS")) ID.vcf <- ReadAndSplitVCFs(file, variant.caller = "strelka")$ID[[1]] if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) { list <- AnnotateIDVCF(ID.vcf, ref.genome = "hg19") annotated.ID.vcf <- list$annotated.vcf}
file <- c(system.file("extdata/Strelka-ID-vcf/", "Strelka.ID.GRCh37.s1.vcf", package = "ICAMS")) ID.vcf <- ReadAndSplitVCFs(file, variant.caller = "strelka")$ID[[1]] if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) { list <- AnnotateIDVCF(ID.vcf, ref.genome = "hg19") annotated.ID.vcf <- list$annotated.vcf}
Add sequence context and transcript information to an in-memory SBS VCF
AnnotateSBSVCF(SBS.vcf, ref.genome, trans.ranges = NULL, name.of.VCF = NULL)
AnnotateSBSVCF(SBS.vcf, ref.genome, trans.ranges = NULL, name.of.VCF = NULL)
SBS.vcf |
An in-memory SBS VCF as a |
ref.genome |
A |
trans.ranges |
Optional. If
then the function will infer |
name.of.VCF |
Name of the VCF file. |
An in-memory SBS VCF as a data.table
. This has been annotated
with the sequence context (column name seq.21bases
) and with
transcript information in the form of a gene symbol (e.g. "TP53"
)
and transcript strand. This information is in the columns
trans.start.pos
, trans.end.pos
, trans.strand
,
trans.Ensembl.gene.ID
and trans.gene.symbol
in the output.
These columns are not added if is.null(trans.ranges)
.
file <- c(system.file("extdata/Strelka-SBS-vcf", "Strelka.SBS.GRCh37.s1.vcf", package = "ICAMS")) list.of.vcfs <- ReadAndSplitVCFs(file, variant.caller = "strelka") SBS.vcf <- list.of.vcfs$SBS[[1]] if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) { annotated.SBS.vcf <- AnnotateSBSVCF(SBS.vcf, ref.genome = "hg19", trans.ranges = trans.ranges.GRCh37)}
file <- c(system.file("extdata/Strelka-SBS-vcf", "Strelka.SBS.GRCh37.s1.vcf", package = "ICAMS")) list.of.vcfs <- ReadAndSplitVCFs(file, variant.caller = "strelka") SBS.vcf <- list.of.vcfs$SBS[[1]] if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) { annotated.SBS.vcf <- AnnotateSBSVCF(SBS.vcf, ref.genome = "hg19", trans.ranges = trans.ranges.GRCh37)}
matrix
, data.frame
, or vector
Create a catalog from a matrix
, data.frame
, or vector
as.catalog( object, ref.genome = NULL, region = "unknown", catalog.type = "counts", abundance = NULL, infer.rownames = FALSE )
as.catalog( object, ref.genome = NULL, region = "unknown", catalog.type = "counts", abundance = NULL, infer.rownames = FALSE )
object |
A numeric |
ref.genome |
A |
region |
A character string designating a region, one of
|
catalog.type |
One of "counts", "density", "counts.signature", "density.signature". |
abundance |
If |
infer.rownames |
If |
A catalog as described in ICAMS
.
# Create an SBS96 catalog with all mutation counts equal to 1. object <- matrix(1, nrow = 96, ncol = 1, dimnames = list(catalog.row.order$SBS96)) catSBS96 <- as.catalog(object)
# Create an SBS96 catalog with all mutation counts equal to 1. object <- matrix(1, nrow = 96, ncol = 1, dimnames = list(catalog.row.order$SBS96)) catSBS96 <- as.catalog(object)
This function is primarily for internal use, but we export it to document the underlying logic.
Canonicalize1Del(context, del.seq, pos, trace = 0)
Canonicalize1Del(context, del.seq, pos, trace = 0)
context |
The deleted sequence plus ample surrounding
sequence on each side (at least as long as |
del.seq |
The deleted sequence in |
pos |
The position of |
trace |
If > 0, then generate messages tracing how the computation is carried out. |
See https://github.com/steverozen/ICAMS/raw/master/data-raw/PCAWG7_indel_classification_2017_12_08.xlsx for additional information on deletion mutation classification.
This function first handles deletions in homopolymers, then
handles deletions in simple repeats with
longer repeat units, (e.g. CACACACA
, see
FindMaxRepeatDel
),
and if the deletion is not in a simple repeat,
looks for microhomology (see FindDelMH
).
See the code for unexported function CanonicalizeID
and the functions it calls for handling of insertions.
A string that is the canonical representation
of the given deletion type. Return NA
and raise a warning if
there is an un-normalized representation of
the deletion of a repeat unit.
See FindDelMH
for details.
(This seems to be very rare.)
Canonicalize1Del("xyAAAqr", del.seq = "A", pos = 3) # "DEL:T:1:2" Canonicalize1Del("xyAAAqr", del.seq = "A", pos = 4) # "DEL:T:1:2" Canonicalize1Del("xyAqr", del.seq = "A", pos = 3) # "DEL:T:1:0"
Canonicalize1Del("xyAAAqr", del.seq = "A", pos = 3) # "DEL:T:1:2" Canonicalize1Del("xyAAAqr", del.seq = "A", pos = 4) # "DEL:T:1:2" Canonicalize1Del("xyAqr", del.seq = "A", pos = 3) # "DEL:T:1:0"
This data is designed for those who need to create their own catalogs from formats not supported by this package. The rownames denote the mutation types. For example, for SBS96 catalogs, the rowname AGAT represents a mutation from AGA > ATA.
catalog.row.order
catalog.row.order
A list of character vectors indicating the standard orders of row names in catalogs.
An object of class list
of length 9.
See https://github.com/steverozen/ICAMS/raw/master/data-raw/PCAWG7_indel_classification_2021_09_03.xlsx for additional information on ID (small insertions and deletions) mutation classification.
See the documentation for Canonicalize1Del
which first handles
deletions in homopolymers, then handles deletions in simple repeats with
longer repeat units, (e.g. CACACACA
, see
FindMaxRepeatDel
), and if the deletion is not in a simple
repeat, looks for microhomology (see FindDelMH
).
See the code for unexported function CanonicalizeID
and the functions it calls for handling of insertions.
In ID (small insertions and deletions) catalogs, deletion repeat sizes range from 0 to 5+, but for plotting and end-user documentation deletion repeat sizes range from 1 to 6+. In ID83 catalogs, deletion repeat sizes range from 0 to 5.
catalog.row.order$SBS96 # "ACAA" "ACCA" "ACGA" "ACTA" "CCAA" "CCCA" "CCGA" "CCTA" ... # There are altogether 96 row names to denote the mutation types # in SBS96 catalog.
catalog.row.order$SBS96 # "ACAA" "ACCA" "ACGA" "ACTA" "CCAA" "CCCA" "CCGA" "CCTA" ... # There are altogether 96 row names to denote the mutation types # in SBS96 catalog.
Take a mutational spectrum or signature catalog that is based on a fined-grained set of features (for example, single-nucleotide substitutions in the context of the preceding and following 2 bases).
Collapse it to a catalog based on a coarser-grained set of features (for example, single-nucleotide substitutions in the context of the immediately preceding and following bases).
Collapse192CatalogTo96
Collapse an SBS 192 catalog
to an SBS 96 catalog.
Collapse1536CatalogTo96
Collapse an SBS 1536 catalog
to an SBS 96 catalog.
Collapse144CatalogTo78
Collapse a DBS 144 catalog
to a DBS 78 catalog.
Collapse192CatalogTo96(catalog) Collapse1536CatalogTo96(catalog) Collapse144CatalogTo78(catalog)
Collapse192CatalogTo96(catalog) Collapse1536CatalogTo96(catalog) Collapse144CatalogTo78(catalog)
catalog |
A catalog as defined in |
A catalog as defined in ICAMS
.
# Create an SBS192 catalog and collapse it to an SBS96 catalog object <- matrix(1, nrow = 192, ncol = 1, dimnames = list(catalog.row.order$SBS192)) catSBS192 <- as.catalog(object, region = "transcript") catSBS96 <- Collapse192CatalogTo96(catSBS192)
# Create an SBS192 catalog and collapse it to an SBS96 catalog object <- matrix(1, nrow = 192, ncol = 1, dimnames = list(catalog.row.order$SBS192)) catSBS192 <- as.catalog(object, region = "transcript") catSBS96 <- Collapse192CatalogTo96(catSBS192)
Return the length of microhomology at a deletion
FindDelMH(context, deleted.seq, pos, trace = 0, warn.cryptic = TRUE)
FindDelMH(context, deleted.seq, pos, trace = 0, warn.cryptic = TRUE)
context |
The deleted sequence plus ample surrounding
sequence on each side (at least as long as |
deleted.seq |
The deleted sequence in |
pos |
The position of |
trace |
If > 0, then generate various messages showing how the computation is carried out. |
warn.cryptic |
if |
This function is primarily for internal use, but we export it to document the underlying logic.
Example:
GGCTAGTT
aligned to GGCTAGAACTAGTT
with
a deletion represented as:
GGCTAGAACTAGTT GG------CTAGTT GGCTAGTT GG[CTAGAA]CTAGTT ---- ----
Presumed repair mechanism leading to this:
.... GGCTAGAACTAGTT CCGATCTTGATCAA => .... GGCTAG TT CC GATCAA .... => GGCTAGTT CCGATCAA
Variant-caller software can represent the same deletion in several different, but completely equivalent, ways.
GGC------TAGTT GGCTAGTT GGC[TAGAAC]TAGTT * --- * --- GGCT------AGTT GGCTAGTT GGCT[AGAACT]AGTT ** -- ** -- GGCTA------GTT GGCTAGTT GGCTA[GAACTA]GTT *** - *** - GGCTAG------TT GGCTAGTT GGCTAG[AACTAG]TT **** ****
This function finds:
The maximum match of undeleted sequence to the left of the deletion that is identical to the right end of the deleted sequence, and
The maximum match of undeleted sequence to the right of the deletion that is identical to the left end of the deleted sequence.
The microhomology sequence is the concatenation of items (1) and (2).
Warning
A deletion in a repeat can also be represented
in several different ways. A deletion in a repeat
is abstractly equivalent to a deletion with microhomology that
spans the entire deleted sequence. For example;
GACTAGCTAGTT GACTA----GTT GACTAGTT GACTA[GCTA]GTT *** -*** -
is really a repeat
GACTAG----TT GACTAGTT GACTAG[CTAG]TT **** ---- GACT----AGTT GACTAGTT GACT[AGCT]AGTT ** --** --
This function only flags these "cryptic repeats" with a -1 return; it does not figure out the repeat extent.
The length of the maximum microhomology of del.sequence
in context
.
See https://github.com/steverozen/ICAMS/raw/master/data-raw/PCAWG7_indel_classification_2021_09_03.xlsx for additional information on ID (small insertions and deletions) mutation classification.
See the documentation for Canonicalize1Del
which first handles
deletions in homopolymers, then handles deletions in simple repeats with
longer repeat units, (e.g. CACACACA
, see
FindMaxRepeatDel
), and if the deletion is not in a simple
repeat, looks for microhomology (see FindDelMH
).
See the code for unexported function CanonicalizeID
and the functions it calls for handling of insertions.
# GAGAGG[CTAGAA]CTAGTT # ---- ---- FindDelMH("GGAGAGGCTAGAACTAGTTAAAAA", "CTAGAA", 8, trace = 0) # 4 # A cryptic repeat # # TAAATTATTTATTAATTTATTG # TAAATTA----TTAATTTATTG = TAAATTATTAATTTATTG # # equivalent to # # TAAATTATTTATTAATTTATTG # TAAAT----TATTAATTTATTG = TAAATTATTAATTTATTG # # and # # TAAATTATTTATTAATTTATTG # TAAA----TTATTAATTTATTG = TAAATTATTAATTTATTG FindDelMH("TAAATTATTTATTAATTTATTG", "TTTA", 8, warn.cryptic = FALSE) # -1
# GAGAGG[CTAGAA]CTAGTT # ---- ---- FindDelMH("GGAGAGGCTAGAACTAGTTAAAAA", "CTAGAA", 8, trace = 0) # 4 # A cryptic repeat # # TAAATTATTTATTAATTTATTG # TAAATTA----TTAATTTATTG = TAAATTATTAATTTATTG # # equivalent to # # TAAATTATTTATTAATTTATTG # TAAAT----TATTAATTTATTG = TAAATTATTAATTTATTG # # and # # TAAATTATTTATTAATTTATTG # TAAA----TTATTAATTTATTG = TAAATTATTAATTTATTG FindDelMH("TAAATTATTTATTAATTTATTG", "TTTA", 8, warn.cryptic = FALSE) # -1
Return the number of repeat units in which a deletion is embedded
FindMaxRepeatDel(context, rep.unit.seq, pos)
FindMaxRepeatDel(context, rep.unit.seq, pos)
context |
A string that embeds |
rep.unit.seq |
A substring of |
pos |
The position of |
This function is primarily for internal use, but we export it to document the underlying logic.
For example FindMaxRepeatDel("xyaczt", "ac", 3)
returns 0.
If
substr(context, pos, pos + nchar(rep.unit.seq) - 1) != rep.unit.seq
then stop.
If this functions returns 0, then it is necessary to
look for microhomology using the function
FindDelMH
.
Warning
This function depends on the variant caller having
"aligned" the deletion within the context of the
repeat.
For example, a deletion of CAG
in the repeat
GTCAGCAGCATGT
can have 3 "aligned" representations as follows:
CT---CAGCAGGT CTCAG---CAGGT CTCAGCAG---GT
In these cases this function will return 2. (Please
not that the return value does not include the
rep.uni.seq
in the count.)
However, the same deletion can also have an "unaligned" representation, such as
CTCAGC---AGGT
(a deletion of AGC
).
In this case this function will return 1 (a deletion of AGC
in a 2-element repeat of AGC
).
The number of repeat units in which rep.unit.seq
is
embedded, not including
the input rep.unit.seq
in the count.
See https://github.com/steverozen/ICAMS/raw/master/data-raw/PCAWG7_indel_classification_2021_09_03.xlsx for additional information on ID (small insertions and deletions) mutation classification.
See the documentation for Canonicalize1Del
which first handles
deletions in homopolymers, then handles deletions in simple repeats with
longer repeat units, (e.g. CACACACA
, see
FindMaxRepeatDel
), and if the deletion is not in a simple
repeat, looks for microhomology (see FindDelMH
).
See the code for unexported function CanonicalizeID
and the functions it calls for handling of insertions.
FindMaxRepeatDel("xyACACzt", "AC", 3) # 1 FindMaxRepeatDel("xyACACzt", "CA", 4) # 0
FindMaxRepeatDel("xyACACzt", "AC", 3) # 1 FindMaxRepeatDel("xyACACzt", "CA", 4) # 0
This data is designed to be used as an example in function PlotTransBiasGeneExp
and PlotTransBiasGeneExpToPdf
.
gene.expression.data.HepG2 gene.expression.data.MCF10A
gene.expression.data.HepG2 gene.expression.data.MCF10A
A data.table
which contains the expression values of genes.
An object of class data.table
(inherits from data.frame
) with 57736 rows and 4 columns.
An object of class data.table
(inherits from data.frame
) with 57736 rows and 4 columns.
gene.expression.data.HepG2 # Ensembl.gene.ID gene.symbol counts TPM # ENSG00000000003 TSPAN6 6007 33.922648455 # ENSG00000000005 TNMD 0 0.000000000 # ENSG00000000419 DPM1 4441 61.669371091 # ENSG00000000457 SCYL3 1368 3.334619195 # ENSG00000000460 C1orf112 916 2.416263423 # ... ... ... ...
gene.expression.data.HepG2 # Ensembl.gene.ID gene.symbol counts TPM # ENSG00000000003 TSPAN6 6007 33.922648455 # ENSG00000000005 TNMD 0 0.000000000 # ENSG00000000419 DPM1 4441 61.669371091 # ENSG00000000457 SCYL3 1368 3.334619195 # ENSG00000000460 C1orf112 916 2.416263423 # ... ... ... ...
Extract the VAFs (variant allele frequencies) and read depth information from a VCF file
GetStrelkaVAF(vcf, name.of.VCF = NULL) GetMutectVAF(vcf, name.of.VCF = NULL, tumor.col.name = NA) GetFreebayesVAF(vcf, name.of.VCF = NULL) GetPCAWGConsensusVAF(vcf, mc.cores = 1)
GetStrelkaVAF(vcf, name.of.VCF = NULL) GetMutectVAF(vcf, name.of.VCF = NULL, tumor.col.name = NA) GetFreebayesVAF(vcf, name.of.VCF = NULL) GetPCAWGConsensusVAF(vcf, mc.cores = 1)
vcf |
An in-memory VCF data frame. |
name.of.VCF |
Name of the VCF file. |
tumor.col.name |
Optional. Only applicable to Mutect VCF. Name
or index of the column in Mutect VCF which contains the tumor
sample information. It must have quotation marks if specifying the
column name. If |
mc.cores |
The number of cores to use. Not available on Windows
unless |
The original vcf
with two additional columns added which
contain the VAF(variant allele frequency) and read depth information.
GetPCAWGConsensusVAF
is analogous to GetMutectVAF
,
calculating VAF and read depth from PCAWG7 consensus vcfs
file <- c(system.file("extdata/Strelka-SBS-vcf", "Strelka.SBS.GRCh37.s1.vcf", package = "ICAMS")) MakeDataFrameFromVCF <- getFromNamespace("MakeDataFrameFromVCF", "ICAMS") df <- MakeDataFrameFromVCF(file) df1 <- GetStrelkaVAF(df)
file <- c(system.file("extdata/Strelka-SBS-vcf", "Strelka.SBS.GRCh37.s1.vcf", package = "ICAMS")) MakeDataFrameFromVCF <- getFromNamespace("MakeDataFrameFromVCF", "ICAMS") df <- MakeDataFrameFromVCF(file) df1 <- GetStrelkaVAF(df)
Analysis and visualization of experimentally elucidated mutational signatures
– the kind of analysis and visualization in Boot et al., "In-depth
characterization of the cisplatin mutational signature in human cell lines
and in esophageal and liver tumors",
Genome Research 2018 https://doi.org/10.1101/gr.230219.117 and
"Characterization of colibactin-associated mutational signature in an
Asian oral squamous cell carcinoma and in other mucosal tumor types",
Genome Research 2020, https://doi.org/10.1101/gr.255620.119.
"ICAMS" stands for In-depth Characterization and
Analysis of Mutational Signatures. "ICAMS" has functions to read in variant
call files (VCFs) and to collate the corresponding catalogs of mutational
spectra and to analyze and plot catalogs of mutational spectra and
signatures.
"ICAMS" can read in VCFs generated by Strelka, Mutect or other variant callers, and collate the mutations into "catalogs" of mutational spectra. "ICAMS" can create and plot catalogs of mutational spectra or signatures for single base substitutions (SBS), doublet base substitutions (DBS), and small insertions and deletions (ID). It can also read and write these catalogs.
A key data type in "ICAMS" is a "catalog" of mutation counts, of mutation densities (see below), or of mutational signatures.
Catalogs are S3 objects of class matrix
and one of
several additional classes that specify the types of the mutations
represented in the catalog. The additional class is one of
SBS96Catalog
(strand-agnostic single base substitutions in
trinucleotide context)
SBS192Catalog
(transcription-stranded single-base substitutions
in trinucleotide context)
SBS1536Catalog
DBS78Catalog
DBS144Catalog
DBS136Catalog
IndelCatalog
ID166Catalog
(genic-intergenic indel catalog)
as.catalog
is the main constructor.
Conceptually, a catalog also has one of the following types,
indicated by the attribute catalog.type
:
Matrix of mutation counts (one column per sample), representing
(counts-based) mutational spectra (catalog.type = "counts"
).
Matrix of mutation densities, i.e. mutations per occurrences
of source sequences (one column per sample), representing
(density-based) mutational spectra (catalog.type = "density"
).
Matrix of mutational signatures, which are similar to spectra. However where spectra consist of counts or densities of mutations in each mutation class (e.g. ACA > AAA, ACA > AGA, ACA > ATA, ACC > AAC, ...), signatures consist of the proportions of mutations in each class (with all the proportions summing to 1). A mutational signature can be based on either:
mutation counts (a "counts-based mutational signature",
catalog.type = "counts.signature"
), or
mutation densities (a "density-based mutational signature",
catalog.type = "density.signature"
).
Catalogs also have the attribute abundance
, which contains the
counts of different source sequences for mutations. For example,
for SBSs in trinucleotide context, the abundances would be the counts
of each trinucleotide in the human genome, exome, or in the transcribed
region of the genome. See TransformCatalog
for more information. Abundances logically depend on the species in
question and on the part of the genome being analyzed.
In "ICAMS"
abundances can sometimes be inferred from the
catalog
class attribute and the
function arguments region
, ref.genome
,
and catalog.type
.
Otherwise abundances can be provided as an abundance
argument.
See all.abundance
for examples.
Possible values for
region
are the strings genome
, transcript
,
exome
, and unknown
; transcript
includes entire
transcribed regions, i.e. the introns as well as the exons.
If you need to create a catalog from a source other than
this package (i.e. other than with
ReadCatalog
or VCFsToCatalogs
,
VCFsToZipFile
, etc.), then use
as.catalog
.
If user wants to subscript specific columns from a catalog, it is needed to
call library(ICAMS)
beforehand to preserve the ICAMS catalog
attribute. Otherwise writing or plotting catalog function in ICAMS may not
work properly.
VCFsToCatalogs
creates 3 SBS catalogs (96, 192, 1536), 3
DBS catalogs (78, 136, 144) and ID (small insertions and deletions) catalog
from the VCFs.
PlotCatalog
function plots mutational spectra
for one sample or plot one mutational signature.
PlotCatalogToPdf
function plots catalogs of mutational spectra or
of mutational signatures to a PDF file.
VCFsToCatalogsAndPlotToPdf
creates all types of SBS, DBS
and ID catalogs from VCFs and plots the catalogs.
VCFsToZipFile
creates a zip file which contains SBS, DBS
and ID catalogs and plot PDFs from VCF files.
ref.genome
(reference genome) argumentMany functions take the argument ref.genome
.
To create a mutational
spectrum catalog from a VCF file, "ICAMS" needs the reference genome sequence
that matches the VCF file. The ref.genome
argument
provides this.
ref.genome
must be one of
A variable from the Bioconductor BSgenome
package
that contains a particular reference genome, for example
BSgenome.Hsapiens.1000genomes.hs37d5
.
The strings "hg38"
or "GRCh38"
, which specify
BSgenome.Hsapiens.UCSC.hg38
.
The strings "hg19"
or "GRCh37"
,
which specify
BSgenome.Hsapiens.1000genomes.hs37d5
.
The strings "mm10"
or "GRCm38"
,
which specify
BSgenome.Mmusculus.UCSC.mm10
.
All needed reference genomes must be installed separately by the user.
Further instructions are at
https://bioconductor.org/packages/release/bioc/html/BSgenome.html.
Use of "ICAMS" with reference genomes other than the 2 human genomes
and 1 mouse genome specified above is restricted to
catalog.type
of counts
or counts.signature
unless the user also creates the necessary abundance vectors.
See all.abundance
.
Use available.genomes()
to get the list of available genomes.
WriteCatalog
function
writes a catalog to a file.
ReadCatalog
function
reads a file that contains a catalog in standardized format.
TransformCatalog
function transforms catalogs of mutational spectra or
signatures to account for differing abundances of the source
sequence of the mutations in the genome.
For example, mutations from ACG are much rarer in the human genome than mutations from ACC simply because CG dinucleotides are rare in the genome. Consequently, there are two possible representations of mutational spectra or signatures. One representation is based on mutation counts as observed in a given genome or exome, and this approach is widely used, as, for example, at https://cancer.sanger.ac.uk/signatures/, which presents signatures based on observed mutation counts in the human genome. We call these "counts-based spectra" or "counts-based signatures".
Alternatively, mutational spectra or signatures can be represented as mutations per source sequence, for example the number of ACT > AGT mutations occurring at all ACT 3-mers in a genome. We call these "density-based spectra" or "density-based signatures".
This function can also transform spectra based on observed genome-wide counts to "density"-based catalogs. In density-based catalogs mutations are expressed as mutations per source sequences. For example, a density-based catalog represents the proportion of ACCs mutated to ATCs, the proportion of ACGs mutated to ATGs, etc. This is different from counts-based mutational spectra catalogs, which contain the number of ACC > ATC mutations, the number of ACG > ATG mutations, etc.
This function can also transform observed-count based spectra or signatures from genome to exome based counts, or between different species (since the abundances of source sequences vary between genome and exome and between species).
CollapseCatalog
function
Takes a mutational spectrum or signature catalog that is based on a fined-grained set of features (for example, single-nucleotide substitutions in the context of the preceding and following 2 bases).
Collapses it to a catalog based on a coarser-grained set of features (for example, single-nucleotide substitutions in the context of the immediately preceding and following bases).
CatalogRowOrder
Standard order of rownames in a catalog.
The rownames encode the type of each mutation. For example, for SBS96
catalogs, the rowname AGAT represents a mutation from AGA > ATA.
TranscriptRanges
Transcript ranges and strand information
for a particular reference genome.
all.abundance
The counts of different source sequences
for mutations.
GeneExpressionData
Example gene expression data from two
cell lines.
Check whether an R object contains one of the ICAMS catalog classes
Check whether an R object contains one of the ICAMS catalog classes
IsICAMSCatalog(object) IsICAMSCatalog(object)
IsICAMSCatalog(object) IsICAMSCatalog(object)
object |
An R object. |
A logical value.
A logical value.
# Create a matrix with all values being 1 object <- matrix(1, nrow = 96, ncol = 1, dimnames = list(catalog.row.order$SBS96)) IsICAMSCatalog(object) # FALSE # Use as.catalog to add class attribute to object catalog <- as.catalog(object) IsICAMSCatalog(catalog) # TRUE # Create a matrix with all values being 1 object <- matrix(1, nrow = 96, ncol = 1, dimnames = list(catalog.row.order$SBS96)) IsICAMSCatalog(object) # FALSE # Use as.catalog to add class attribute to object catalog <- as.catalog(object) IsICAMSCatalog(catalog) # TRUE
# Create a matrix with all values being 1 object <- matrix(1, nrow = 96, ncol = 1, dimnames = list(catalog.row.order$SBS96)) IsICAMSCatalog(object) # FALSE # Use as.catalog to add class attribute to object catalog <- as.catalog(object) IsICAMSCatalog(catalog) # TRUE # Create a matrix with all values being 1 object <- matrix(1, nrow = 96, ncol = 1, dimnames = list(catalog.row.order$SBS96)) IsICAMSCatalog(object) # FALSE # Use as.catalog to add class attribute to object catalog <- as.catalog(object) IsICAMSCatalog(catalog) # TRUE
[Deprecated, use VCFsToCatalogs(variant.caller = "mutect") instead]
Create 3 SBS catalogs (96, 192, 1536), 3 DBS catalogs (78, 136, 144) and
Indel catalog from the Mutect VCFs specified by files
MutectVCFFilesToCatalog( files, ref.genome, trans.ranges = NULL, region = "unknown", names.of.VCFs = NULL, tumor.col.names = NA, flag.mismatches = 0, return.annotated.vcfs = FALSE, suppress.discarded.variants.warnings = TRUE )
MutectVCFFilesToCatalog( files, ref.genome, trans.ranges = NULL, region = "unknown", names.of.VCFs = NULL, tumor.col.names = NA, flag.mismatches = 0, return.annotated.vcfs = FALSE, suppress.discarded.variants.warnings = TRUE )
files |
Character vector of file paths to the Mutect VCF files. |
ref.genome |
A |
trans.ranges |
Optional. If
then the function will infer |
region |
A character string designating a genomic region;
see |
names.of.VCFs |
Optional. Character vector of names of the VCF files.
The order of names in |
tumor.col.names |
Optional. Vector of column names or column indices in
VCFs which contain the tumor sample information. The order of elements in
|
flag.mismatches |
Deprecated. If there are ID variants whose |
return.annotated.vcfs |
Logical. Whether to return the annotated VCFs with additional columns showing mutation class for each variant. Default is FALSE. |
suppress.discarded.variants.warnings |
Logical. Whether to suppress warning messages showing information about the discarded variants. Default is TRUE. |
This function calls VCFsToSBSCatalogs
,
VCFsToDBSCatalogs
and VCFsToIDCatalogs
A list containing the following objects:
catSBS96
, catSBS192
, catSBS1536
: Matrix of
3 SBS catalogs (one each for 96, 192, and 1536).
catDBS78
, catDBS136
, catDBS144
: Matrix of
3 DBS catalogs (one each for 78, 136, and 144).
catID
: Matrix of ID (small insertions and deletions) catalog.
discarded.variants
: Non-NULL only if there are variants
that were excluded from the analysis. See the added extra column
discarded.reason
for more details.
annotated.vcfs
:
Non-NULL only if return.annotated.vcfs
= TRUE.
A list of elements:
SBS
: SBS VCF annotated by AnnotateSBSVCF
with
three new columns SBS96.class
, SBS192.class
and
SBS1536.class
showing the mutation class for each SBS variant.
DBS
: DBS VCF annotated by AnnotateDBSVCF
with
three new columns DBS78.class
, DBS136.class
and
DBS144.class
showing the mutation class for each DBS variant.
ID
: ID VCF annotated by AnnotateIDVCF
with one
new column ID.class
showing the mutation class for each
ID variant.
If trans.ranges
is not provided by user and cannot be inferred by
ICAMS, SBS 192 and DBS 144 catalog will not be generated. Each catalog has
attributes added. See as.catalog
for more details.
See https://github.com/steverozen/ICAMS/raw/master/data-raw/PCAWG7_indel_classification_2021_09_03.xlsx for additional information on ID (small insertions and deletions) mutation classification.
See the documentation for Canonicalize1Del
which first handles
deletions in homopolymers, then handles deletions in simple repeats with
longer repeat units, (e.g. CACACACA
, see
FindMaxRepeatDel
), and if the deletion is not in a simple
repeat, looks for microhomology (see FindDelMH
).
See the code for unexported function CanonicalizeID
and the functions it calls for handling of insertions.
SBS 192 and DBS 144 catalogs include only mutations in transcribed regions. In ID (small insertions and deletions) catalogs, deletion repeat sizes range from 0 to 5+, but for plotting and end-user documentation deletion repeat sizes range from 1 to 6+.
To add or change attributes of the catalog, you can use function
attr
.
For example, attr(catalog, "abundance")
<- custom.abundance
.
## Not run: file <- c(system.file("extdata/Mutect-vcf", "Mutect.GRCh37.s1.vcf", package = "ICAMS")) if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) { catalogs <- MutectVCFFilesToCatalog(file, ref.genome = "hg19", trans.ranges = trans.ranges.GRCh37, region = "genome")} ## End(Not run)
## Not run: file <- c(system.file("extdata/Mutect-vcf", "Mutect.GRCh37.s1.vcf", package = "ICAMS")) if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) { catalogs <- MutectVCFFilesToCatalog(file, ref.genome = "hg19", trans.ranges = trans.ranges.GRCh37, region = "genome")} ## End(Not run)
[Deprecated, use VCFsToCatalogsAndPlotToPdf(variant.caller = "mutect") instead]
Create 3 SBS catalogs (96, 192, 1536), 3 DBS catalogs (78, 136, 144) and
Indel catalog from the Mutect VCFs specified by files
and plot them to
PDF
MutectVCFFilesToCatalogAndPlotToPdf( files, ref.genome, trans.ranges = NULL, region = "unknown", names.of.VCFs = NULL, tumor.col.names = NA, output.file = "", flag.mismatches = 0, return.annotated.vcfs = FALSE, suppress.discarded.variants.warnings = TRUE )
MutectVCFFilesToCatalogAndPlotToPdf( files, ref.genome, trans.ranges = NULL, region = "unknown", names.of.VCFs = NULL, tumor.col.names = NA, output.file = "", flag.mismatches = 0, return.annotated.vcfs = FALSE, suppress.discarded.variants.warnings = TRUE )
files |
Character vector of file paths to the Mutect VCF files. |
ref.genome |
A |
trans.ranges |
Optional. If
then the function will infer |
region |
A character string designating a genomic region;
see |
names.of.VCFs |
Optional. Character vector of names of the VCF files.
The order of names in |
tumor.col.names |
Optional. Vector of column names or column indices in
VCFs which contain the tumor sample information. The order of elements in
|
output.file |
Optional. The base name of the PDF files to be produced;
multiple files will be generated, each ending in |
flag.mismatches |
Deprecated. If there are ID variants whose |
return.annotated.vcfs |
Logical. Whether to return the annotated VCFs with additional columns showing mutation class for each variant. Default is FALSE. |
suppress.discarded.variants.warnings |
Logical. Whether to suppress warning messages showing information about the discarded variants. Default is TRUE. |
This function calls MutectVCFFilesToCatalog
and
PlotCatalogToPdf
A list containing the following objects:
catSBS96
, catSBS192
, catSBS1536
: Matrix of
3 SBS catalogs (one each for 96, 192, and 1536).
catDBS78
, catDBS136
, catDBS144
: Matrix of
3 DBS catalogs (one each for 78, 136, and 144).
catID
: Matrix of ID (small insertions and deletions) catalog.
discarded.variants
: Non-NULL only if there are variants
that were excluded from the analysis. See the added extra column
discarded.reason
for more details.
annotated.vcfs
:
Non-NULL only if return.annotated.vcfs
= TRUE.
A list of elements:
SBS
: SBS VCF annotated by AnnotateSBSVCF
with
three new columns SBS96.class
, SBS192.class
and
SBS1536.class
showing the mutation class for each SBS variant.
DBS
: DBS VCF annotated by AnnotateDBSVCF
with
three new columns DBS78.class
, DBS136.class
and
DBS144.class
showing the mutation class for each DBS variant.
ID
: ID VCF annotated by AnnotateIDVCF
with one
new column ID.class
showing the mutation class for each
ID variant.
If trans.ranges
is not provided by user and cannot be inferred by
ICAMS, SBS 192 and DBS 144 catalog will not be generated. Each catalog has
attributes added. See as.catalog
for more details.
SBS 192 and DBS 144 catalogs include only mutations in transcribed regions. In ID (small insertions and deletions) catalogs, deletion repeat sizes range from 0 to 5+, but for plotting and end-user documentation deletion repeat sizes range from 1 to 6+.
To add or change attributes of the catalog, you can use function
attr
.
For example, attr(catalog, "abundance")
<- custom.abundance
.
See https://github.com/steverozen/ICAMS/raw/master/data-raw/PCAWG7_indel_classification_2021_09_03.xlsx for additional information on ID (small insertions and deletions) mutation classification.
See the documentation for Canonicalize1Del
which first handles
deletions in homopolymers, then handles deletions in simple repeats with
longer repeat units, (e.g. CACACACA
, see
FindMaxRepeatDel
), and if the deletion is not in a simple
repeat, looks for microhomology (see FindDelMH
).
See the code for unexported function CanonicalizeID
and the functions it calls for handling of insertions.
## Not run: file <- c(system.file("extdata/Mutect-vcf", "Mutect.GRCh37.s1.vcf", package = "ICAMS")) if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) { catalogs <- MutectVCFFilesToCatalogAndPlotToPdf(file, ref.genome = "hg19", trans.ranges = trans.ranges.GRCh37, region = "genome", output.file = file.path(tempdir(), "Mutect"))} ## End(Not run)
## Not run: file <- c(system.file("extdata/Mutect-vcf", "Mutect.GRCh37.s1.vcf", package = "ICAMS")) if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) { catalogs <- MutectVCFFilesToCatalogAndPlotToPdf(file, ref.genome = "hg19", trans.ranges = trans.ranges.GRCh37, region = "genome", output.file = file.path(tempdir(), "Mutect"))} ## End(Not run)
[Deprecated, use VCFsToZipFile(variant.caller = "mutect") instead]
Create 3 SBS catalogs (96, 192, 1536), 3 DBS catalogs (78, 136, 144) and
Indel catalog from the Mutect VCFs specified by dir
, save the catalogs
as CSV files, plot them to PDF and generate a zip archive of all the output files.
MutectVCFFilesToZipFile( dir, zipfile, ref.genome, trans.ranges = NULL, region = "unknown", names.of.VCFs = NULL, tumor.col.names = NA, base.filename = "", flag.mismatches = 0, return.annotated.vcfs = FALSE, suppress.discarded.variants.warnings = TRUE )
MutectVCFFilesToZipFile( dir, zipfile, ref.genome, trans.ranges = NULL, region = "unknown", names.of.VCFs = NULL, tumor.col.names = NA, base.filename = "", flag.mismatches = 0, return.annotated.vcfs = FALSE, suppress.discarded.variants.warnings = TRUE )
dir |
Pathname of the directory which contains only the Mutect
VCF files. Each Mutect VCF must have a file extension ".vcf" (case
insensitive) and share the same |
zipfile |
Pathname of the zip file to be created. |
ref.genome |
A |
trans.ranges |
Optional. If
then the function will infer |
region |
A character string designating a genomic region;
see |
names.of.VCFs |
Optional. Character vector of names of the VCF files.
The order of names in |
tumor.col.names |
Optional. Vector of column names or column indices in
VCFs which contain the tumor sample information. The order of elements in
|
base.filename |
Optional. The base name of the CSV and PDF files to be
produced; multiple files will be generated, each ending in
|
flag.mismatches |
Deprecated. If there are ID variants whose |
return.annotated.vcfs |
Logical. Whether to return the annotated VCFs with additional columns showing mutation class for each variant. Default is FALSE. |
suppress.discarded.variants.warnings |
Logical. Whether to suppress warning messages showing information about the discarded variants. Default is TRUE. |
This function calls MutectVCFFilesToCatalog
,
PlotCatalogToPdf
, WriteCatalog
and
zip::zipr
.
A list containing the following objects:
catSBS96
, catSBS192
, catSBS1536
: Matrix of
3 SBS catalogs (one each for 96, 192, and 1536).
catDBS78
, catDBS136
, catDBS144
: Matrix of
3 DBS catalogs (one each for 78, 136, and 144).
catID
: Matrix of ID (small insertions and deletions) catalog.
discarded.variants
: Non-NULL only if there are variants
that were excluded from the analysis. See the added extra column
discarded.reason
for more details.
annotated.vcfs
:
Non-NULL only if return.annotated.vcfs
= TRUE.
A list of elements:
SBS
: SBS VCF annotated by AnnotateSBSVCF
with
three new columns SBS96.class
, SBS192.class
and
SBS1536.class
showing the mutation class for each SBS variant.
DBS
: DBS VCF annotated by AnnotateDBSVCF
with
three new columns DBS78.class
, DBS136.class
and
DBS144.class
showing the mutation class for each DBS variant.
ID
: ID VCF annotated by AnnotateIDVCF
with one
new column ID.class
showing the mutation class for each
ID variant.
If trans.ranges
is not provided by user and cannot be inferred by
ICAMS, SBS 192 and DBS 144 catalog will not be generated. Each catalog has
attributes added. See as.catalog
for more details.
See https://github.com/steverozen/ICAMS/raw/master/data-raw/PCAWG7_indel_classification_2021_09_03.xlsx for additional information on ID (small insertions and deletions) mutation classification.
See the documentation for Canonicalize1Del
which first handles
deletions in homopolymers, then handles deletions in simple repeats with
longer repeat units, (e.g. CACACACA
, see
FindMaxRepeatDel
), and if the deletion is not in a simple
repeat, looks for microhomology (see FindDelMH
).
See the code for unexported function CanonicalizeID
and the functions it calls for handling of insertions.
SBS 192 and DBS 144 catalogs include only mutations in transcribed regions. In ID (small insertions and deletions) catalogs, deletion repeat sizes range from 0 to 5+, but for plotting and end-user documentation deletion repeat sizes range from 1 to 6+.
To add or change attributes of the catalog, you can use function
attr
.
For example, attr(catalog, "abundance")
<- custom.abundance
.
## Not run: dir <- c(system.file("extdata/Mutect-vcf", package = "ICAMS")) if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) { catalogs <- MutectVCFFilesToZipFile(dir, zipfile = file.path(tempdir(), "test.zip"), ref.genome = "hg19", trans.ranges = trans.ranges.GRCh37, region = "genome", base.filename = "Mutect") unlink(file.path(tempdir(), "test.zip"))} ## End(Not run)
## Not run: dir <- c(system.file("extdata/Mutect-vcf", package = "ICAMS")) if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) { catalogs <- MutectVCFFilesToZipFile(dir, zipfile = file.path(tempdir(), "test.zip"), ref.genome = "hg19", trans.ranges = trans.ranges.GRCh37, region = "genome", base.filename = "Mutect") unlink(file.path(tempdir(), "test.zip"))} ## End(Not run)
Plot the spectrum of one sample or plot one signature. The
type of graph is based on attribute("catalog.type")
of the input catalog.
You can first use TransformCatalog
to get different types of
catalog and then do the plotting.
PlotCatalog( catalog, plot.SBS12 = NULL, cex = NULL, grid = NULL, upper = NULL, xlabels = NULL, ylabels = NULL, ylim = NULL )
PlotCatalog( catalog, plot.SBS12 = NULL, cex = NULL, grid = NULL, upper = NULL, xlabels = NULL, ylabels = NULL, ylim = NULL )
catalog |
A catalog as defined in |
plot.SBS12 |
Only meaningful for class |
cex |
Has the usual meaning. Taken from |
grid |
A logical value indicating whether to draw grid lines. Only implemented for SBS96Catalog, DBS78Catalog, IndelCatalog, ID166Catalog. |
upper |
A logical value indicating whether to draw horizontal lines and the names of major mutation class on top of graph. Only implemented for SBS96Catalog, DBS78Catalog, IndelCatalog, ID166Catalog. |
xlabels |
A logical value indicating whether to draw x axis labels. Only
implemented for SBS96Catalog, DBS78Catalog, IndelCatalog, ID166Catalog.
If |
ylabels |
A logical value indicating whether to draw y axis labels. Only implemented for SBS96Catalog, DBS78Catalog, IndelCatalog, ID166Catalog. |
ylim |
Has the usual meaning. Only implemented for SBS96Catalog, IndelCatalog, ID166Catalog. |
An invisible list whose first element is a logic value
indicating whether the plot is successful. For SBS96Catalog
,
SBS192Catalog
, DBS78Catalog
, DBS144Catalog
and
IndelCatalog
, the list will have a second element, which is a
numeric vector giving the coordinates of all the bar midpoints drawn,
useful for adding to the graph. For SBS192Catalog with "counts"
catalog.type and non-NULL abundance and plot.SBS12 = TRUE
, the list
will have an additional element which is a list containing the strand bias
statistics.
For SBS192Catalog with "counts" catalog.type and
non-NULL abundance and plot.SBS12 = TRUE
, the strand bias statistics
are Benjamini-Hochberg q-values based on two-sided binomial tests of the
mutation counts on the transcribed and untranscribed strands relative to
the actual abundances of C and T on the transcribed strand. On the SBS12
plot, asterisks indicate q-values as follows *, ; **,
; ***,
.
The sizes of repeats involved in deletions range from 0 to 5+ in the mutational-spectra and signature catalog rownames, but for plotting and end-user documentation deletion repeat sizes range from 1 to 6+.
file <- system.file("extdata", "strelka.regress.cat.sbs.96.csv", package = "ICAMS") catSBS96 <- ReadCatalog(file) colnames(catSBS96) <- "sample" PlotCatalog(catSBS96)
file <- system.file("extdata", "strelka.regress.cat.sbs.96.csv", package = "ICAMS") catSBS96 <- ReadCatalog(file) colnames(catSBS96) <- "sample" PlotCatalog(catSBS96)
Plot catalog to a PDF file. The type of graph is based on
attribute("catalog.type")
of the input catalog. You can first use
TransformCatalog
to get different types of catalog and then do
the plotting.
PlotCatalogToPdf( catalog, file, plot.SBS12 = NULL, cex = NULL, grid = NULL, upper = NULL, xlabels = NULL, ylabels = NULL, ylim = NULL )
PlotCatalogToPdf( catalog, file, plot.SBS12 = NULL, cex = NULL, grid = NULL, upper = NULL, xlabels = NULL, ylabels = NULL, ylim = NULL )
catalog |
A catalog as defined in |
file |
The name of the PDF file to be produced. |
plot.SBS12 |
Only meaningful for class |
cex |
Has the usual meaning. Taken from |
grid |
A logical value indicating whether to draw grid lines. Only implemented for SBS96Catalog, DBS78Catalog, IndelCatalog, ID166Catalog. |
upper |
A logical value indicating whether to draw horizontal lines and the names of major mutation class on top of graph. Only implemented for SBS96Catalog, DBS78Catalog, IndelCatalog, ID166Catalog. |
xlabels |
A logical value indicating whether to draw x axis labels. Only
implemented for SBS96Catalog, DBS78Catalog, IndelCatalog, ID166Catalog.
If |
ylabels |
A logical value indicating whether to draw y axis labels. Only implemented for SBS96Catalog, DBS78Catalog, IndelCatalog, ID166Catalog. |
ylim |
Has the usual meaning. Only implemented for SBS96Catalog, IndelCatalog, ID166Catalog. |
An invisible list whose first element is a logic value
indicating whether the plot is successful. For SBS192Catalog with
"counts" catalog.type and non-null abundance and plot.SBS12 = TRUE
,
the list will have a second element which is a list containing the strand
bias statistics.
For SBS192Catalog with "counts" catalog.type and
non-NULL abundance and plot.SBS12 = TRUE
, the strand bias statistics
are Benjamini-Hochberg q-values based on two-sided binomial tests of the
mutation counts on the transcribed and untranscribed strands relative to
the actual abundances of C and T on the transcribed strand. On the SBS12
plot, asterisks indicate q-values as follows *, ; **,
; ***,
.
The sizes of repeats involved in deletions range from 0 to 5+ in the mutational-spectra and signature catalog rownames, but for plotting and end-user documentation deletion repeat sizes range from 1 to 6+.
file <- system.file("extdata", "strelka.regress.cat.sbs.96.csv", package = "ICAMS") catSBS96 <- ReadCatalog(file) colnames(catSBS96) <- "sample" PlotCatalogToPdf(catSBS96, file = file.path(tempdir(), "test.pdf"))
file <- system.file("extdata", "strelka.regress.cat.sbs.96.csv", package = "ICAMS") catSBS96 <- ReadCatalog(file) colnames(catSBS96) <- "sample" PlotCatalogToPdf(catSBS96, file = file.path(tempdir(), "test.pdf"))
Plot transcription strand bias with respect to gene expression values
PlotTransBiasGeneExp( annotated.SBS.vcf, expression.data, Ensembl.gene.ID.col, expression.value.col, num.of.bins, plot.type, damaged.base = NULL, ymax = NULL )
PlotTransBiasGeneExp( annotated.SBS.vcf, expression.data, Ensembl.gene.ID.col, expression.value.col, num.of.bins, plot.type, damaged.base = NULL, ymax = NULL )
annotated.SBS.vcf |
An SBS VCF annotated by
|
expression.data |
A |
Ensembl.gene.ID.col |
Name of column which has the Ensembl gene ID
information in |
expression.value.col |
Name of column which has the gene expression
values in |
num.of.bins |
The number of bins that will be plotted on the graph. |
plot.type |
A character string indicating one mutation type to be plotted. It should be one of "C>A", "C>G", "C>T", "T>A", "T>C", "T>G". |
damaged.base |
One of |
ymax |
Limit for the y axis. If not specified, it defaults to NULL and the y axis limit equals 1.5 times of the maximum mutation counts in a specific mutation type. |
A list whose first element is a logic value indicating whether the plot is successful. The second element is a named numeric vector containing the p-values printed on the plot.
The p-values are calculated by logistic regression using function
glm
. The dependent variable is labeled "1" and "0" if
the mutation from annotated.SBS.vcf
falls onto the untranscribed and
transcribed strand respectively. The independent variable is the binary
logarithm of the gene expression value from expression.data
plus one,
i.e. where
stands for gene
expression value.
file <- c(system.file("extdata/Strelka-SBS-vcf/", "Strelka.SBS.GRCh37.s1.vcf", package = "ICAMS")) list.of.vcfs <- ReadAndSplitVCFs(file, variant.caller = "strelka") SBS.vcf <- list.of.vcfs$SBS[[1]] if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) { annotated.SBS.vcf <- AnnotateSBSVCF(SBS.vcf, ref.genome = "hg19", trans.ranges = trans.ranges.GRCh37) PlotTransBiasGeneExp(annotated.SBS.vcf = annotated.SBS.vcf, expression.data = gene.expression.data.HepG2, Ensembl.gene.ID.col = "Ensembl.gene.ID", expression.value.col = "TPM", num.of.bins = 4, plot.type = "C>A") }
file <- c(system.file("extdata/Strelka-SBS-vcf/", "Strelka.SBS.GRCh37.s1.vcf", package = "ICAMS")) list.of.vcfs <- ReadAndSplitVCFs(file, variant.caller = "strelka") SBS.vcf <- list.of.vcfs$SBS[[1]] if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) { annotated.SBS.vcf <- AnnotateSBSVCF(SBS.vcf, ref.genome = "hg19", trans.ranges = trans.ranges.GRCh37) PlotTransBiasGeneExp(annotated.SBS.vcf = annotated.SBS.vcf, expression.data = gene.expression.data.HepG2, Ensembl.gene.ID.col = "Ensembl.gene.ID", expression.value.col = "TPM", num.of.bins = 4, plot.type = "C>A") }
Plot transcription strand bias with respect to gene expression values to a PDF file
PlotTransBiasGeneExpToPdf( annotated.SBS.vcf, file, expression.data, Ensembl.gene.ID.col, expression.value.col, num.of.bins, plot.type = c("C>A", "C>G", "C>T", "T>A", "T>C", "T>G"), damaged.base = NULL )
PlotTransBiasGeneExpToPdf( annotated.SBS.vcf, file, expression.data, Ensembl.gene.ID.col, expression.value.col, num.of.bins, plot.type = c("C>A", "C>G", "C>T", "T>A", "T>C", "T>G"), damaged.base = NULL )
annotated.SBS.vcf |
An SBS VCF annotated by
|
file |
The name of output file. |
expression.data |
A |
Ensembl.gene.ID.col |
Name of column which has the Ensembl gene ID
information in |
expression.value.col |
Name of column which has the gene expression
values in |
num.of.bins |
The number of bins that will be plotted on the graph. |
plot.type |
A vector of character indicating types to be plotted. It can be one or more types from "C>A", "C>G", "C>T", "T>A", "T>C", "T>G". The default is to print all the six mutation types. |
damaged.base |
One of |
A list whose first element is a logic value indicating whether the plot is successful. The second element is a named numeric vector containing the p-values printed on the plot.
The p-values are calculated by logistic regression using function
glm
. The dependent variable is labeled "1" and "0" if
the mutation from annotated.SBS.vcf
falls onto the untranscribed and
transcribed strand respectively. The independent variable is the binary
logarithm of the gene expression value from expression.data
plus one,
i.e. where
stands for gene
expression value.
file <- c(system.file("extdata/Strelka-SBS-vcf/", "Strelka.SBS.GRCh37.s1.vcf", package = "ICAMS")) list.of.vcfs <- ReadAndSplitVCFs(file, variant.caller = "strelka") SBS.vcf <- list.of.vcfs$SBS[[1]] if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) { annotated.SBS.vcf <- AnnotateSBSVCF(SBS.vcf, ref.genome = "hg19", trans.ranges = trans.ranges.GRCh37) PlotTransBiasGeneExpToPdf(annotated.SBS.vcf = annotated.SBS.vcf, expression.data = gene.expression.data.HepG2, Ensembl.gene.ID.col = "Ensembl.gene.ID", expression.value.col = "TPM", num.of.bins = 4, plot.type = c("C>A","C>G","C>T","T>A","T>C"), file = file.path(tempdir(), "test.pdf")) }
file <- c(system.file("extdata/Strelka-SBS-vcf/", "Strelka.SBS.GRCh37.s1.vcf", package = "ICAMS")) list.of.vcfs <- ReadAndSplitVCFs(file, variant.caller = "strelka") SBS.vcf <- list.of.vcfs$SBS[[1]] if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) { annotated.SBS.vcf <- AnnotateSBSVCF(SBS.vcf, ref.genome = "hg19", trans.ranges = trans.ranges.GRCh37) PlotTransBiasGeneExpToPdf(annotated.SBS.vcf = annotated.SBS.vcf, expression.data = gene.expression.data.HepG2, Ensembl.gene.ID.col = "Ensembl.gene.ID", expression.value.col = "TPM", num.of.bins = 4, plot.type = c("C>A","C>G","C>T","T>A","T>C"), file = file.path(tempdir(), "test.pdf")) }
[Deprecated, use ReadAndSplitVCFs(variant.caller = "mutect") instead] Read and split Mutect VCF files
ReadAndSplitMutectVCFs( files, names.of.VCFs = NULL, tumor.col.names = NA, suppress.discarded.variants.warnings = TRUE )
ReadAndSplitMutectVCFs( files, names.of.VCFs = NULL, tumor.col.names = NA, suppress.discarded.variants.warnings = TRUE )
files |
Character vector of file paths to the Mutect VCF files. |
names.of.VCFs |
Optional. Character vector of names of the VCF files.
The order of names in |
tumor.col.names |
Optional. Vector of column names or column indices in
VCFs which contain the tumor sample information. The order of elements in
|
suppress.discarded.variants.warnings |
Logical. Whether to suppress warning messages showing information about the discarded variants. Default is TRUE. |
A list containing the following objects:
SBS
: List of VCFs with only single base substitutions.
DBS
: List of VCFs with only doublet base substitutions as called
by Mutect.
ID
: List of VCFs with only small insertions and deletions.
discarded.variants
: Non-NULL only if there are variants
that were excluded from the analysis. See the added extra column
discarded.reason
for more details.
## Not run: file <- c(system.file("extdata/Mutect-vcf", "Mutect.GRCh37.s1.vcf", package = "ICAMS")) list.of.vcfs <- ReadAndSplitMutectVCFs(file) ## End(Not run)
## Not run: file <- c(system.file("extdata/Mutect-vcf", "Mutect.GRCh37.s1.vcf", package = "ICAMS")) list.of.vcfs <- ReadAndSplitMutectVCFs(file) ## End(Not run)
[Deprecated, use ReadAndSplitVCFs(variant.caller = "strelka") instead] The function will find and merge adjacent SBS pairs into DBS if their VAFs are very similar. The default threshold value for VAF is 0.02.
ReadAndSplitStrelkaSBSVCFs( files, names.of.VCFs = NULL, suppress.discarded.variants.warnings = TRUE )
ReadAndSplitStrelkaSBSVCFs( files, names.of.VCFs = NULL, suppress.discarded.variants.warnings = TRUE )
files |
Character vector of file paths to the Strelka SBS VCF files. |
names.of.VCFs |
Optional. Character vector of names of the VCF files.
The order of names in |
suppress.discarded.variants.warnings |
Logical. Whether to suppress warning messages showing information about the discarded variants. Default is TRUE. |
A list of elements as follows:
SBS.vcfs
: List of data.frames of pure SBS mutations – no DBS or
3+BS mutations.
DBS.vcfs
: List of data.frames of pure DBS mutations – no SBS or
3+BS mutations.
discarded.variants
: Non-NULL only if there are variants
that were excluded from the analysis. See the added extra column
discarded.reason
for more details.
## Not run: file <- c(system.file("extdata/Strelka-SBS-vcf", "Strelka.SBS.GRCh37.s1.vcf", package = "ICAMS")) list.of.vcfs <- ReadAndSplitStrelkaSBSVCFs(file) ## End(Not run)
## Not run: file <- c(system.file("extdata/Strelka-SBS-vcf", "Strelka.SBS.GRCh37.s1.vcf", package = "ICAMS")) list.of.vcfs <- ReadAndSplitStrelkaSBSVCFs(file) ## End(Not run)
Read and split VCF files
ReadAndSplitVCFs( files, variant.caller = "unknown", num.of.cores = 1, names.of.VCFs = NULL, tumor.col.names = NA, filter.status = DefaultFilterStatus(variant.caller), get.vaf.function = NULL, ..., max.vaf.diff = 0.02, suppress.discarded.variants.warnings = TRUE, always.merge.SBS = FALSE, chr.names.to.process = NULL )
ReadAndSplitVCFs( files, variant.caller = "unknown", num.of.cores = 1, names.of.VCFs = NULL, tumor.col.names = NA, filter.status = DefaultFilterStatus(variant.caller), get.vaf.function = NULL, ..., max.vaf.diff = 0.02, suppress.discarded.variants.warnings = TRUE, always.merge.SBS = FALSE, chr.names.to.process = NULL )
files |
Character vector of file paths to the VCF files. |
variant.caller |
Name of the variant caller that produces the VCF, can
be either |
num.of.cores |
The number of cores to use. Not available on Windows
unless |
names.of.VCFs |
Optional. Character vector of names of the VCF files.
The order of names in |
tumor.col.names |
Optional. Only applicable to Mutect VCFs.
Vector of column names or column indices in Mutect VCFs which
contain the tumor sample information. The order of elements in
|
filter.status |
The character string in column |
get.vaf.function |
Optional. Only applicable when |
... |
Optional arguments to |
max.vaf.diff |
Not applicable if |
suppress.discarded.variants.warnings |
Logical. Whether to suppress warning messages showing information about the discarded variants. Default is TRUE. |
always.merge.SBS |
If |
chr.names.to.process |
A character vector specifying the chromosome names in VCF whose variants will be kept and processed, other chromosome variants will be discarded. If NULL(default), all variants will be kept except those on chromosomes with names that contain strings "GL", "KI", "random", "Hs", "M", "JH", "fix", "alt". |
A list containing the following objects:
SBS
: List of VCFs with only single base substitutions.
DBS
: List of VCFs with only doublet base substitutions.
ID
: List of VCFs with only small insertions and deletions.
discarded.variants
: Non-NULL only if there are variants
that were excluded from the analysis. See the added extra column
discarded.reason
for more details.
file <- c(system.file("extdata/Mutect-vcf", "Mutect.GRCh37.s1.vcf", package = "ICAMS")) list.of.vcfs <- ReadAndSplitVCFs(file, variant.caller = "mutect")
file <- c(system.file("extdata/Mutect-vcf", "Mutect.GRCh37.s1.vcf", package = "ICAMS")) list.of.vcfs <- ReadAndSplitVCFs(file, variant.caller = "mutect")
Read a catalog in standardized format from path.
ReadCatalog( file, ref.genome = NULL, region = "unknown", catalog.type = "counts", strict = NULL, stop.on.error = TRUE )
ReadCatalog( file, ref.genome = NULL, region = "unknown", catalog.type = "counts", strict = NULL, stop.on.error = TRUE )
file |
Path to a catalog on disk in a standardized format. The recognized formats are:
|
ref.genome |
A |
region |
region A character string designating a genomic region;
see |
catalog.type |
One of "counts", "density", "counts.signature", "density.signature". |
strict |
Ignored and deprecated. |
stop.on.error |
If TRUE, call |
See also WriteCatalog
A catalog as an S3 object; see as.catalog
.
To add or change attributes of the catalog, you can use function
attr
.
For example, attr(catalog, "abundance")
<- custom.abundance
.
In ID (small insertions and deletions) catalogs, deletion repeat sizes range from 0 to 5+, but for plotting and end-user documentation deletion repeat sizes range from 1 to 6+.
file <- system.file("extdata", "strelka.regress.cat.sbs.96.csv", package = "ICAMS") catSBS96 <- ReadCatalog(file)
file <- system.file("extdata", "strelka.regress.cat.sbs.96.csv", package = "ICAMS") catSBS96 <- ReadCatalog(file)
[Deprecated, use ReadAndSplitVCFs(variant.caller = "strelka") instead] Read Strelka ID (small insertions and deletions) VCF files
ReadStrelkaIDVCFs(files, names.of.VCFs = NULL)
ReadStrelkaIDVCFs(files, names.of.VCFs = NULL)
files |
Character vector of file paths to the VCF files. |
names.of.VCFs |
Character vector of names of the VCF files. The order of
names in |
A list of data frames containing data lines of the VCF files.
In ID (small insertions and deletions) catalogs, deletion repeat sizes range from 0 to 5+, but for plotting and end-user documentation deletion repeat sizes range from 1 to 6+.
## Not run: file <- c(system.file("extdata/Strelka-ID-vcf", "Strelka.ID.GRCh37.s1.vcf", package = "ICAMS")) list.of.vcfs <- ReadStrelkaIDVCFs(file) ## End(Not run)
## Not run: file <- c(system.file("extdata/Strelka-ID-vcf", "Strelka.ID.GRCh37.s1.vcf", package = "ICAMS")) list.of.vcfs <- ReadStrelkaIDVCFs(file) ## End(Not run)
Read VCF files
ReadVCFs( files, variant.caller = "unknown", num.of.cores = 1, names.of.VCFs = NULL, tumor.col.names = NA, filter.status = DefaultFilterStatus(variant.caller), get.vaf.function = NULL, ... )
ReadVCFs( files, variant.caller = "unknown", num.of.cores = 1, names.of.VCFs = NULL, tumor.col.names = NA, filter.status = DefaultFilterStatus(variant.caller), get.vaf.function = NULL, ... )
files |
Character vector of file paths to the VCF files. |
variant.caller |
Name of the variant caller that produces the VCF, can
be either |
num.of.cores |
The number of cores to use. Not available on Windows
unless |
names.of.VCFs |
Optional. Character vector of names of the VCF files.
The order of names in |
tumor.col.names |
Optional. Only applicable to Mutect VCFs.
Vector of column names or column indices in Mutect VCFs which
contain the tumor sample information. The order of elements in
|
filter.status |
The character string in column |
get.vaf.function |
Optional. Only applicable when |
... |
Optional arguments to |
A list of data frames storing data lines of the VCF files with two additional columns added which contain the VAF(variant allele frequency) and read depth information.
file <- c(system.file("extdata/Mutect-vcf", "Mutect.GRCh37.s1.vcf", package = "ICAMS")) list.of.vcfs <- ReadVCFs(file, variant.caller = "mutect")
file <- c(system.file("extdata/Mutect-vcf", "Mutect.GRCh37.s1.vcf", package = "ICAMS")) list.of.vcfs <- ReadVCFs(file, variant.caller = "mutect")
string.vec
Based on reverseComplement
.
Handles IUPAC ambiguity codes but not "u" (uracil).
(see <https://en.wikipedia.org/wiki/Nucleic_acid_notation>).
revc(string.vec)
revc(string.vec)
string.vec |
A character vector. |
A character vector with the reverse complement of every
string in string.vec
.
revc("aTgc") # GCAT # A vector and strings with ambiguity codes revc(c("ATGC", "aTGc", "wnTCb")) # GCAT GCAT VGANW ## Not run: revc("ACGU") # An error ## End(Not run)
revc("aTgc") # GCAT # A vector and strings with ambiguity codes revc(c("ATGC", "aTGc", "wnTCb")) # GCAT GCAT VGANW ## Not run: revc("ACGU") # An error ## End(Not run)
Read a VCF file into a data frame with minimal processing.
SimpleReadVCF(file)
SimpleReadVCF(file)
file |
The name/path of the VCF file, or a complete URL. |
Header lines beginning "##" are removed, and column "#CHROM" is renamed to "CHROM". Other column names are unchanged. Columns "#CHROM", "POS", "REF", and "ALT" must be in the input.
A data frame storing mutation records of a VCF file.
file <- c(system.file("extdata/Strelka-SBS-vcf", "Strelka.SBS.GRCh37.s1.vcf", package = "ICAMS")) df <- SimpleReadVCF(file)
file <- c(system.file("extdata/Strelka-SBS-vcf", "Strelka.SBS.GRCh37.s1.vcf", package = "ICAMS")) df <- SimpleReadVCF(file)
Split each VCF into SBS, DBS, and ID VCFs (plus VCF-like data frame with left-over rows)
SplitListOfVCFs( list.of.vcfs, variant.caller, max.vaf.diff = 0.02, num.of.cores = 1, suppress.discarded.variants.warnings = TRUE, always.merge.SBS = FALSE, chr.names.to.process = NULL )
SplitListOfVCFs( list.of.vcfs, variant.caller, max.vaf.diff = 0.02, num.of.cores = 1, suppress.discarded.variants.warnings = TRUE, always.merge.SBS = FALSE, chr.names.to.process = NULL )
list.of.vcfs |
List of VCFs as in-memory data frames. The VCFs should
have |
variant.caller |
Name of the variant caller that produces the VCF, can
be either |
max.vaf.diff |
The maximum difference of VAF, default value is 0.02. If
the absolute difference of VAFs for adjacent SBSs is bigger than
|
num.of.cores |
The number of cores to use. Not available on Windows
unless |
suppress.discarded.variants.warnings |
Logical. Whether to suppress warning messages showing information about the discarded variants. Default is TRUE. |
always.merge.SBS |
If |
chr.names.to.process |
A character vector specifying the chromosome
names in VCF whose variants will be kept and processed, other chromosome
variants will be discarded. If |
A list containing the following objects:
SBS
: List of VCFs with only single base substitutions.
DBS
: List of VCFs with only doublet base substitutions as called
by Mutect.
ID
: List of VCFs with only small insertions and deletions.
discarded.variants
: Non-NULL only if there are variants
that were excluded from the analysis. See the added extra column
discarded.reason
for more details.
file <- c(system.file("extdata/Mutect-vcf", "Mutect.GRCh37.s1.vcf", package = "ICAMS")) list.of.vcfs <- ReadVCFs(file, variant.caller = "mutect") split.vcfs <- SplitListOfVCFs(list.of.vcfs, variant.caller = "mutect")
file <- c(system.file("extdata/Mutect-vcf", "Mutect.GRCh37.s1.vcf", package = "ICAMS")) list.of.vcfs <- ReadVCFs(file, variant.caller = "mutect") split.vcfs <- SplitListOfVCFs(list.of.vcfs, variant.caller = "mutect")
[Deprecated, use VCFsToCatalogs(variant.caller = "strelka") instead]
Create ID (small insertions and deletions) catalog from the Strelka ID VCFs
specified by files
StrelkaIDVCFFilesToCatalog( files, ref.genome, region = "unknown", names.of.VCFs = NULL, flag.mismatches = 0, return.annotated.vcfs = FALSE, suppress.discarded.variants.warnings = TRUE )
StrelkaIDVCFFilesToCatalog( files, ref.genome, region = "unknown", names.of.VCFs = NULL, flag.mismatches = 0, return.annotated.vcfs = FALSE, suppress.discarded.variants.warnings = TRUE )
files |
Character vector of file paths to the Strelka ID VCF files. |
ref.genome |
A |
region |
A character string designating a genomic region;
see |
names.of.VCFs |
Optional. Character vector of names of the VCF files.
The order of names in |
flag.mismatches |
Deprecated. If there are ID variants whose |
return.annotated.vcfs |
Logical. Whether to return the annotated VCFs with additional columns showing mutation class for each variant. Default is FALSE. |
suppress.discarded.variants.warnings |
Logical. Whether to suppress warning messages showing information about the discarded variants. Default is TRUE. |
This function calls VCFsToIDCatalogs
A list of elements:
catalog
: The ID (small insertions and deletions) catalog with
attributes added. See as.catalog
for more details.
discarded.variants
: Non-NULL only if there are variants
that were excluded from the analysis. See the added extra column
discarded.reason
for more details.
annotated.vcfs
:
Non-NULL only if return.annotated.vcfs
= TRUE. A list of
data frames which contain the original VCF's ID mutation rows with three
additional columns seq.context.width
, seq.context
and
ID.class
added. The category assignment of each ID mutation in VCF can
be obtained from ID.class
column.
See https://github.com/steverozen/ICAMS/raw/master/data-raw/PCAWG7_indel_classification_2021_09_03.xlsx for additional information on ID (small insertions and deletions) mutation classification.
See the documentation for Canonicalize1Del
which first handles
deletions in homopolymers, then handles deletions in simple repeats with
longer repeat units, (e.g. CACACACA
, see
FindMaxRepeatDel
), and if the deletion is not in a simple
repeat, looks for microhomology (see FindDelMH
).
See the code for unexported function CanonicalizeID
and the functions it calls for handling of insertions.
In ID (small insertions and deletions) catalogs, deletion repeat sizes range from 0 to 5+, but for plotting and end-user documentation deletion repeat sizes range from 1 to 6+.
## Not run: file <- c(system.file("extdata/Strelka-ID-vcf", "Strelka.ID.GRCh37.s1.vcf", package = "ICAMS")) if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) { catID <- StrelkaIDVCFFilesToCatalog(file, ref.genome = "hg19", region = "genome")} ## End(Not run)
## Not run: file <- c(system.file("extdata/Strelka-ID-vcf", "Strelka.ID.GRCh37.s1.vcf", package = "ICAMS")) if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) { catID <- StrelkaIDVCFFilesToCatalog(file, ref.genome = "hg19", region = "genome")} ## End(Not run)
[Deprecated, use VCFsToCatalogsAndPlotToPdf(variant.caller = "strelka") instead]
Create ID (small insertions and deletions) catalog from the Strelka ID VCFs
specified by files
and plot them to PDF
StrelkaIDVCFFilesToCatalogAndPlotToPdf( files, ref.genome, region = "unknown", names.of.VCFs = NULL, output.file = "", flag.mismatches = 0, return.annotated.vcfs = FALSE, suppress.discarded.variants.warnings = TRUE )
StrelkaIDVCFFilesToCatalogAndPlotToPdf( files, ref.genome, region = "unknown", names.of.VCFs = NULL, output.file = "", flag.mismatches = 0, return.annotated.vcfs = FALSE, suppress.discarded.variants.warnings = TRUE )
files |
Character vector of file paths to the Strelka ID VCF files. |
ref.genome |
A |
region |
A character string designating a genomic region;
see |
names.of.VCFs |
Optional. Character vector of names of the VCF files.
The order of names in |
output.file |
Optional. The base name of the PDF file to be produced;
the file is ending in |
flag.mismatches |
Deprecated. If there are ID variants whose |
return.annotated.vcfs |
Logical. Whether to return the annotated VCFs with additional columns showing mutation class for each variant. Default is FALSE. |
suppress.discarded.variants.warnings |
Logical. Whether to suppress warning messages showing information about the discarded variants. Default is TRUE. |
This function calls StrelkaIDVCFFilesToCatalog
and
PlotCatalogToPdf
A list of elements:
catalog
: The ID (small insertions and deletions) catalog with
attributes added. See as.catalog
for more details.
discarded.variants
: Non-NULL only if there are variants
that were excluded from the analysis. See the added extra column
discarded.reason
for more details.
annotated.vcfs
:
Non-NULL only if return.annotated.vcfs
= TRUE. A list of
data frames which contain the original VCF's ID mutation rows with three
additional columns seq.context.width
, seq.context
and
ID.class
added. The category assignment of each ID mutation in VCF can
be obtained from ID.class
column.
See https://github.com/steverozen/ICAMS/raw/master/data-raw/PCAWG7_indel_classification_2021_09_03.xlsx for additional information on ID (small insertions and deletions) mutation classification.
See the documentation for Canonicalize1Del
which first handles
deletions in homopolymers, then handles deletions in simple repeats with
longer repeat units, (e.g. CACACACA
, see
FindMaxRepeatDel
), and if the deletion is not in a simple
repeat, looks for microhomology (see FindDelMH
).
See the code for unexported function CanonicalizeID
and the functions it calls for handling of insertions.
In ID (small insertions and deletions) catalogs, deletion repeat sizes range from 0 to 5+, but for plotting and end-user documentation deletion repeat sizes range from 1 to 6+.
## Not run: file <- c(system.file("extdata/Strelka-ID-vcf", "Strelka.ID.GRCh37.s1.vcf", package = "ICAMS")) if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) { catID <- StrelkaIDVCFFilesToCatalogAndPlotToPdf(file, ref.genome = "hg19", region = "genome", output.file = file.path(tempdir(), "StrelkaID"))} ## End(Not run)
## Not run: file <- c(system.file("extdata/Strelka-ID-vcf", "Strelka.ID.GRCh37.s1.vcf", package = "ICAMS")) if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) { catID <- StrelkaIDVCFFilesToCatalogAndPlotToPdf(file, ref.genome = "hg19", region = "genome", output.file = file.path(tempdir(), "StrelkaID"))} ## End(Not run)
[Deprecated, use VCFsToZipFile(variant.caller = "strelka") instead]
Create ID (small insertions and deletions) catalog from the Strelka ID VCFs
specified by dir
, save the catalog as CSV file, plot it to PDF and
generate a zip archive of all the output files.
StrelkaIDVCFFilesToZipFile( dir, zipfile, ref.genome, region = "unknown", names.of.VCFs = NULL, base.filename = "", flag.mismatches = 0, return.annotated.vcfs = FALSE, suppress.discarded.variants.warnings = TRUE )
StrelkaIDVCFFilesToZipFile( dir, zipfile, ref.genome, region = "unknown", names.of.VCFs = NULL, base.filename = "", flag.mismatches = 0, return.annotated.vcfs = FALSE, suppress.discarded.variants.warnings = TRUE )
dir |
Pathname of the directory which contains only the Strelka
ID VCF files. Each Strelka ID VCF must have a file extension
".vcf" (case insensitive) and share the same |
zipfile |
Pathname of the zip file to be created. |
ref.genome |
A |
region |
A character string designating a genomic region;
see |
names.of.VCFs |
Optional. Character vector of names of the VCF files.
The order of names in |
base.filename |
Optional. The base name of the CSV and PDF file to be
produced; the file is ending in |
flag.mismatches |
Deprecated. If there are ID variants whose |
return.annotated.vcfs |
Logical. Whether to return the annotated VCFs with additional columns showing mutation class for each variant. Default is FALSE. |
suppress.discarded.variants.warnings |
Logical. Whether to suppress warning messages showing information about the discarded variants. Default is TRUE. |
This function calls StrelkaIDVCFFilesToCatalog
,
PlotCatalogToPdf
, WriteCatalog
and
zip::zipr
.
A list of elements:
catalog
: The ID (small insertions and deletions) catalog with
attributes added. See as.catalog
for more details.
discarded.variants
: Non-NULL only if there are variants
that were excluded from the analysis. See the added extra column
discarded.reason
for more details.
annotated.vcfs
:
Non-NULL only if return.annotated.vcfs
= TRUE. A list of
data frames which contain the original VCF's ID mutation rows with three
additional columns seq.context.width
, seq.context
and
ID.class
added. The category assignment of each ID mutation in VCF can
be obtained from ID.class
column.
See https://github.com/steverozen/ICAMS/raw/master/data-raw/PCAWG7_indel_classification_2021_09_03.xlsx for additional information on ID (small insertions and deletions) mutation classification.
See the documentation for Canonicalize1Del
which first handles
deletions in homopolymers, then handles deletions in simple repeats with
longer repeat units, (e.g. CACACACA
, see
FindMaxRepeatDel
), and if the deletion is not in a simple
repeat, looks for microhomology (see FindDelMH
).
See the code for unexported function CanonicalizeID
and the functions it calls for handling of insertions.
In ID (small insertions and deletions) catalogs, deletion repeat sizes range from 0 to 5+, but for plotting and end-user documentation deletion repeat sizes range from 1 to 6+.
## Not run: dir <- c(system.file("extdata/Strelka-ID-vcf", package = "ICAMS")) if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) { catalogs <- StrelkaIDVCFFilesToZipFile(dir, zipfile = file.path(tempdir(), "test.zip"), ref.genome = "hg19", region = "genome", base.filename = "Strelka-ID") unlink(file.path(tempdir(), "test.zip"))} ## End(Not run)
## Not run: dir <- c(system.file("extdata/Strelka-ID-vcf", package = "ICAMS")) if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) { catalogs <- StrelkaIDVCFFilesToZipFile(dir, zipfile = file.path(tempdir(), "test.zip"), ref.genome = "hg19", region = "genome", base.filename = "Strelka-ID") unlink(file.path(tempdir(), "test.zip"))} ## End(Not run)
[Deprecated, use VCFsToCatalogs(variant.caller = "strelka") instead]
Create 3 SBS catalogs (96, 192, 1536) and 3 DBS catalogs (78, 136, 144) from
the Strelka SBS VCFs specified by files
. The function will find and
merge adjacent SBS pairs into DBS if their VAFs are very similar. The default
threshold value for VAF is 0.02.
StrelkaSBSVCFFilesToCatalog( files, ref.genome, trans.ranges = NULL, region = "unknown", names.of.VCFs = NULL, return.annotated.vcfs = FALSE, suppress.discarded.variants.warnings = TRUE )
StrelkaSBSVCFFilesToCatalog( files, ref.genome, trans.ranges = NULL, region = "unknown", names.of.VCFs = NULL, return.annotated.vcfs = FALSE, suppress.discarded.variants.warnings = TRUE )
files |
Character vector of file paths to the Strelka SBS VCF files. |
ref.genome |
A |
trans.ranges |
Optional. If
then the function will infer |
region |
A character string designating a genomic region;
see |
names.of.VCFs |
Optional. Character vector of names of the VCF files.
The order of names in |
return.annotated.vcfs |
Logical. Whether to return the annotated VCFs with additional columns showing mutation class for each variant. Default is FALSE. |
suppress.discarded.variants.warnings |
Logical. Whether to suppress warning messages showing information about the discarded variants. Default is TRUE. |
This function calls VCFsToSBSCatalogs
and
VCFsToDBSCatalogs
.
A list containing the following objects:
catSBS96
, catSBS192
, catSBS1536
: Matrix of
3 SBS catalogs (one each for 96, 192, and 1536).
catDBS78
, catDBS136
, catDBS144
: Matrix of
3 DBS catalogs (one each for 78, 136, and 144).
discarded.variants
: Non-NULL only if there are variants
that were excluded from the analysis. See the added extra column
discarded.reason
for more details.
annotated.vcfs
:
Non-NULL only if return.annotated.vcfs
= TRUE.
A list of elements:
SBS
: SBS VCF annotated by AnnotateSBSVCF
with
three new columns SBS96.class
, SBS192.class
and
SBS1536.class
showing the mutation class for each SBS variant.
DBS
: DBS VCF annotated by AnnotateDBSVCF
with
three new columns DBS78.class
, DBS136.class
and
DBS144.class
showing the mutation class for each DBS variant.
If trans.ranges
is not provided by user and cannot be inferred by
ICAMS, SBS 192 and DBS 144 catalog will not be generated. Each catalog has
attributes added. See as.catalog
for more details.
SBS 192 and DBS 144 catalogs include only mutations in transcribed regions.
To add or change attributes of the catalog, you can use function
attr
.
For example, attr(catalog, "abundance")
<- custom.abundance
.
## Not run: file <- c(system.file("extdata/Strelka-SBS-vcf", "Strelka.SBS.GRCh37.s1.vcf", package = "ICAMS")) if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) { catalogs <- StrelkaSBSVCFFilesToCatalog(file, ref.genome = "hg19", trans.ranges = trans.ranges.GRCh37, region = "genome")} ## End(Not run)
## Not run: file <- c(system.file("extdata/Strelka-SBS-vcf", "Strelka.SBS.GRCh37.s1.vcf", package = "ICAMS")) if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) { catalogs <- StrelkaSBSVCFFilesToCatalog(file, ref.genome = "hg19", trans.ranges = trans.ranges.GRCh37, region = "genome")} ## End(Not run)
[Deprecated, use VCFsToCatalogsAndPlotToPdf(variant.caller = "strelka") instead]
Create 3 SBS catalogs (96, 192, 1536) and 3 DBS catalogs (78, 136, 144) from
the Strelka SBS VCFs specified by files
and plot them to PDF. The
function will find and merge adjacent SBS pairs into DBS if their VAFs are
very similar. The default threshold value for VAF is 0.02.
StrelkaSBSVCFFilesToCatalogAndPlotToPdf( files, ref.genome, trans.ranges = NULL, region = "unknown", names.of.VCFs = NULL, output.file = "", return.annotated.vcfs = FALSE, suppress.discarded.variants.warnings = TRUE )
StrelkaSBSVCFFilesToCatalogAndPlotToPdf( files, ref.genome, trans.ranges = NULL, region = "unknown", names.of.VCFs = NULL, output.file = "", return.annotated.vcfs = FALSE, suppress.discarded.variants.warnings = TRUE )
files |
Character vector of file paths to the Strelka SBS VCF files. |
ref.genome |
A |
trans.ranges |
Optional. If
then the function will infer |
region |
A character string designating a genomic region;
see |
names.of.VCFs |
Optional. Character vector of names of the VCF files.
The order of names in |
output.file |
Optional. The base name of the PDF files to be produced;
multiple files will be generated, each ending in |
return.annotated.vcfs |
Logical. Whether to return the annotated VCFs with additional columns showing mutation class for each variant. Default is FALSE. |
suppress.discarded.variants.warnings |
Logical. Whether to suppress warning messages showing information about the discarded variants. Default is TRUE. |
This function calls StrelkaSBSVCFFilesToCatalog
and
PlotCatalogToPdf
A list containing the following objects:
catSBS96
, catSBS192
, catSBS1536
: Matrix of
3 SBS catalogs (one each for 96, 192, and 1536).
catDBS78
, catDBS136
, catDBS144
: Matrix of
3 DBS catalogs (one each for 78, 136, and 144).
discarded.variants
: Non-NULL only if there are variants
that were excluded from the analysis. See the added extra column
discarded.reason
for more details.
annotated.vcfs
:
Non-NULL only if return.annotated.vcfs
= TRUE.
A list of elements:
SBS
: SBS VCF annotated by AnnotateSBSVCF
with
three new columns SBS96.class
, SBS192.class
and
SBS1536.class
showing the mutation class for each SBS variant.
DBS
: DBS VCF annotated by AnnotateDBSVCF
with
three new columns DBS78.class
, DBS136.class
and
DBS144.class
showing the mutation class for each DBS variant.
If trans.ranges
is not provided by user and cannot be inferred by
ICAMS, SBS 192 and DBS 144 catalog will not be generated. Each catalog has
attributes added. See as.catalog
for more details.
SBS 192 and DBS 144 catalogs include only mutations in transcribed regions.
To add or change attributes of the catalog, you can use function
attr
.
For example, attr(catalog, "abundance")
<- custom.abundance
.
## Not run: file <- c(system.file("extdata/Strelka-SBS-vcf", "Strelka.SBS.GRCh37.s1.vcf", package = "ICAMS")) if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) { catalogs <- StrelkaSBSVCFFilesToCatalogAndPlotToPdf(file, ref.genome = "hg19", trans.ranges = trans.ranges.GRCh37, region = "genome", output.file = file.path(tempdir(), "StrelkaSBS"))} ## End(Not run)
## Not run: file <- c(system.file("extdata/Strelka-SBS-vcf", "Strelka.SBS.GRCh37.s1.vcf", package = "ICAMS")) if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) { catalogs <- StrelkaSBSVCFFilesToCatalogAndPlotToPdf(file, ref.genome = "hg19", trans.ranges = trans.ranges.GRCh37, region = "genome", output.file = file.path(tempdir(), "StrelkaSBS"))} ## End(Not run)
[Deprecated, use VCFsToZipFile(variant.caller = "strelka") instead]
Create 3 SBS catalogs (96, 192, 1536), 3 DBS catalogs (78, 136, 144) from the
Strelka SBS VCFs specified by dir
, save the catalogs as CSV files,
plot them to PDF and generate a zip archive of all the output files. The
function will find and merge adjacent SBS pairs into DBS if their VAFs are
very similar. The default threshold value for VAF is 0.02.
StrelkaSBSVCFFilesToZipFile( dir, zipfile, ref.genome, trans.ranges = NULL, region = "unknown", names.of.VCFs = NULL, base.filename = "", return.annotated.vcfs = FALSE, suppress.discarded.variants.warnings = TRUE )
StrelkaSBSVCFFilesToZipFile( dir, zipfile, ref.genome, trans.ranges = NULL, region = "unknown", names.of.VCFs = NULL, base.filename = "", return.annotated.vcfs = FALSE, suppress.discarded.variants.warnings = TRUE )
dir |
Pathname of the directory which contains only the Strelka
SBS VCF files. Each Strelka SBS VCF must have a file extension
".vcf" (case insensitive) and share the same |
zipfile |
Pathname of the zip file to be created. |
ref.genome |
A |
trans.ranges |
Optional. If
then the function will infer |
region |
A character string designating a genomic region;
see |
names.of.VCFs |
Optional. Character vector of names of the VCF files.
The order of names in |
base.filename |
Optional. The base name of the CSV and PDF files to be
produced; multiple files will be generated, each ending in
|
return.annotated.vcfs |
Logical. Whether to return the annotated VCFs with additional columns showing mutation class for each variant. Default is FALSE. |
suppress.discarded.variants.warnings |
Logical. Whether to suppress warning messages showing information about the discarded variants. Default is TRUE. |
This function calls StrelkaSBSVCFFilesToCatalog
,
PlotCatalogToPdf
, WriteCatalog
and
zip::zipr
.
A list containing the following objects:
catSBS96
, catSBS192
, catSBS1536
: Matrix of
3 SBS catalogs (one each for 96, 192, and 1536).
catDBS78
, catDBS136
, catDBS144
: Matrix of
3 DBS catalogs (one each for 78, 136, and 144).
discarded.variants
: Non-NULL only if there are variants
that were excluded from the analysis. See the added extra column
discarded.reason
for more details.
annotated.vcfs
:
Non-NULL only if return.annotated.vcfs
= TRUE.
A list of elements:
SBS
: SBS VCF annotated by AnnotateSBSVCF
with
three new columns SBS96.class
, SBS192.class
and
SBS1536.class
showing the mutation class for each SBS variant.
DBS
: DBS VCF annotated by AnnotateDBSVCF
with
three new columns DBS78.class
, DBS136.class
and
DBS144.class
showing the mutation class for each DBS variant.
If trans.ranges
is not provided by user and cannot be inferred by
ICAMS, SBS 192 and DBS 144 catalog will not be generated. Each catalog has
attributes added. See as.catalog
for more details.
SBS 192 and DBS 144 catalogs include only mutations in transcribed regions.
To add or change attributes of the catalog, you can use function
attr
.
For example, attr(catalog, "abundance")
<- custom.abundance
.
## Not run: dir <- c(system.file("extdata/Strelka-SBS-vcf", package = "ICAMS")) if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) { catalogs <- StrelkaSBSVCFFilesToZipFile(dir, zipfile = file.path(tempdir(), "test.zip"), ref.genome = "hg19", trans.ranges = trans.ranges.GRCh37, region = "genome", base.filename = "Strelka-SBS") unlink(file.path(tempdir(), "test.zip"))} ## End(Not run)
## Not run: dir <- c(system.file("extdata/Strelka-SBS-vcf", package = "ICAMS")) if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) { catalogs <- StrelkaSBSVCFFilesToZipFile(dir, zipfile = file.path(tempdir(), "test.zip"), ref.genome = "hg19", trans.ranges = trans.ranges.GRCh37, region = "genome", base.filename = "Strelka-SBS") unlink(file.path(tempdir(), "test.zip"))} ## End(Not run)
Transcript ranges and strand information for a particular reference genome.
trans.ranges.GRCh37 trans.ranges.GRCh38 trans.ranges.GRCm38
trans.ranges.GRCh37 trans.ranges.GRCh38 trans.ranges.GRCm38
A data.table
which contains transcript
range and strand information for a particular reference genome.
colname
s are chrom
, start
, end
, strand
,
Ensembl.gene.ID
, gene.symbol
. It uses one-based coordinates.
An object of class data.table
(inherits from data.frame
) with 19083 rows and 6 columns.
An object of class data.table
(inherits from data.frame
) with 19096 rows and 6 columns.
An object of class data.table
(inherits from data.frame
) with 20325 rows and 6 columns.
This information is needed to generate catalogs that
depend on transcriptional
strand information, for example catalogs of
class SBS192Catalog
.
trans.ranges.GRCh37
: Human GRCh37.
trans.ranges.GRCh38
: Human GRCh38.
trans.ranges.GRCm38
: Mouse GRCm38.
For these two tables, only genes that are associated with a CCDS ID are kept for transcriptional strand bias analysis.
This information is needed for StrelkaSBSVCFFilesToCatalog
, StrelkaSBSVCFFilesToCatalogAndPlotToPdf
,
MutectVCFFilesToCatalog
, MutectVCFFilesToCatalogAndPlotToPdf
,
VCFsToSBSCatalogs
and VCFsToDBSCatalogs
.
ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_30/gencode.v30.annotation.gff3.gz
ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M21/gencode.vM21.annotation.gff3.gz
trans.ranges.GRCh37 # chrom start end strand Ensembl.gene.ID gene.symbol # 1 65419 71585 + ENSG00000186092 OR4F5 # 1 367640 368634 + ENSG00000235249 OR4F29 # 1 621059 622053 - ENSG00000284662 OR4F16 # 1 859308 879961 + ENSG00000187634 SAMD11 # 1 879583 894689 - ENSG00000188976 NOC2L # ... ... ... ... ... ...
trans.ranges.GRCh37 # chrom start end strand Ensembl.gene.ID gene.symbol # 1 65419 71585 + ENSG00000186092 OR4F5 # 1 367640 368634 + ENSG00000235249 OR4F29 # 1 621059 622053 - ENSG00000284662 OR4F16 # 1 859308 879961 + ENSG00000187634 SAMD11 # 1 879583 894689 - ENSG00000188976 NOC2L # ... ... ... ... ... ...
Transform between counts and density spectrum catalogs and counts and density signature catalogs
TransformCatalog( catalog, target.ref.genome = NULL, target.region = NULL, target.catalog.type = NULL, target.abundance = NULL )
TransformCatalog( catalog, target.ref.genome = NULL, target.region = NULL, target.catalog.type = NULL, target.abundance = NULL )
catalog |
An SBS or DBS catalog as described in |
target.ref.genome |
A |
target.region |
A |
target.catalog.type |
A character string acting as a catalog type
identifier, one of "counts", "density", "counts.signature",
"density.signature"; see |
target.abundance |
A vector of counts, one for each source K-mer for mutations (e.g. for
strand-agnostic single nucleotide substitutions in trinucleotide – i.e.
3-mer – context, one count each for ACA, ACC, ACG, ... TTT). See
|
Only the following transformations are legal:
counts -> counts
(deprecated, generates a warning;
we strongly suggest that you work with densities if comparing
spectra or signatures generated from data with
different underlying abundances.)
counts -> density
counts -> (counts.signature, density.signature)
density -> counts
(the semantics are to
infer the genome-wide or exome-wide counts based on the
densities)
density -> density
(a null operation, generates
a warning)
density -> (counts.signature, density.signature)
counts.signature -> counts.signature
(used to transform
between the source abundance and target.abundance
)
counts.signature -> density.signature
counts.signature -> (counts, density)
(generates an error)
density.signature -> density.signature
(a null operation,
generates a warning)
density.signature -> counts.signature
density.signature -> (counts, density)
(generates an error)
A catalog as defined in ICAMS
.
The TransformCatalog
function transforms catalogs of mutational spectra or
signatures to account for differing abundances of the source
sequence of the mutations in the genome.
For example, mutations from ACG are much rarer in the human genome than mutations from ACC simply because CG dinucleotides are rare in the genome. Consequently, there are two possible representations of mutational spectra or signatures. One representation is based on mutation counts as observed in a given genome or exome, and this approach is widely used, as, for example, at https://cancer.sanger.ac.uk/cosmic/signatures, which presents signatures based on observed mutation counts in the human genome. We call these "counts-based spectra" or "counts-based signatures".
Alternatively, mutational spectra or signatures can be represented as mutations per source sequence, for example the number of ACT > AGT mutations occurring at all ACT 3-mers in a genome. We call these "density-based spectra" or "density-based signatures".
This function can also transform spectra based on observed genome-wide counts to "density"-based catalogs. In density-based catalogs mutations are expressed as mutations per source sequences. For example, a density-based catalog represents the proportion of ACCs mutated to ATCs, the proportion of ACGs mutated to ATGs, etc. This is different from counts-based mutational spectra catalogs, which contain the number of ACC > ATC mutations, the number of ACG > ATG mutations, etc.
This function can also transform observed-count based spectra or signatures from genome to exome based counts, or between different species (since the abundances of source sequences vary between genome and exome and between species).
file <- system.file("extdata", "strelka.regress.cat.sbs.96.csv", package = "ICAMS") if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) { catSBS96.counts <- ReadCatalog(file, ref.genome = "hg19", region = "genome", catalog.type = "counts") catSBS96.density <- TransformCatalog(catSBS96.counts, target.ref.genome = "hg19", target.region = "genome", target.catalog.type = "density")}
file <- system.file("extdata", "strelka.regress.cat.sbs.96.csv", package = "ICAMS") if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) { catSBS96.counts <- ReadCatalog(file, ref.genome = "hg19", region = "genome", catalog.type = "counts") catSBS96.density <- TransformCatalog(catSBS96.counts, target.ref.genome = "hg19", target.region = "genome", target.catalog.type = "density")}
Create 3 SBS catalogs (96, 192, 1536), 3 DBS catalogs (78, 136, 144) and
Indel catalog from the Mutect VCFs specified by files
VCFsToCatalogs( files, ref.genome, variant.caller = "unknown", num.of.cores = 1, trans.ranges = NULL, region = "unknown", names.of.VCFs = NULL, tumor.col.names = NA, filter.status = DefaultFilterStatus(variant.caller), get.vaf.function = NULL, ..., max.vaf.diff = 0.02, return.annotated.vcfs = FALSE, suppress.discarded.variants.warnings = TRUE, chr.names.to.process = NULL )
VCFsToCatalogs( files, ref.genome, variant.caller = "unknown", num.of.cores = 1, trans.ranges = NULL, region = "unknown", names.of.VCFs = NULL, tumor.col.names = NA, filter.status = DefaultFilterStatus(variant.caller), get.vaf.function = NULL, ..., max.vaf.diff = 0.02, return.annotated.vcfs = FALSE, suppress.discarded.variants.warnings = TRUE, chr.names.to.process = NULL )
files |
Character vector of file paths to the VCF files. |
ref.genome |
A |
variant.caller |
Name of the variant caller that produces the VCF, can
be either |
num.of.cores |
The number of cores to use. Not available on Windows
unless |
trans.ranges |
Optional. If
then the function will infer |
region |
A character string designating a genomic region;
see |
names.of.VCFs |
Optional. Character vector of names of the VCF files.
The order of names in |
tumor.col.names |
Optional. Only applicable to Mutect VCFs.
Vector of column names or column indices in Mutect VCFs which
contain the tumor sample information. The order of elements in
|
filter.status |
The character string in column |
get.vaf.function |
Optional. Only applicable when |
... |
Optional arguments to |
max.vaf.diff |
Not applicable if |
return.annotated.vcfs |
Logical. Whether to return the annotated VCFs with additional columns showing mutation class for each variant. Default is FALSE. |
suppress.discarded.variants.warnings |
Logical. Whether to suppress warning messages showing information about the discarded variants. Default is TRUE. |
chr.names.to.process |
A character vector specifying the chromosome names in VCF whose variants will be kept and processed, other chromosome variants will be discarded. If NULL(default), all variants will be kept except those on chromosomes with names that contain strings "GL", "KI", "random", "Hs", "M", "JH", "fix", "alt". |
This function calls VCFsToSBSCatalogs
,
VCFsToDBSCatalogs
and VCFsToIDCatalogs
A list containing the following objects:
catSBS96
, catSBS192
, catSBS1536
: Matrix of
3 SBS catalogs (one each for 96, 192, and 1536).
catDBS78
, catDBS136
, catDBS144
: Matrix of
3 DBS catalogs (one each for 78, 136, and 144).
catID
: Matrix of ID (small insertions and deletions) catalog.
discarded.variants
: Non-NULL only if there are variants
that were excluded from the analysis. See the added extra column
discarded.reason
for more details.
annotated.vcfs
:
Non-NULL only if return.annotated.vcfs
= TRUE.
A list of elements:
SBS
: SBS VCF annotated by AnnotateSBSVCF
with
three new columns SBS96.class
, SBS192.class
and
SBS1536.class
showing the mutation class for each SBS variant.
DBS
: DBS VCF annotated by AnnotateDBSVCF
with
three new columns DBS78.class
, DBS136.class
and
DBS144.class
showing the mutation class for each DBS variant.
ID
: ID VCF annotated by AnnotateIDVCF
with one
new column ID.class
showing the mutation class for each
ID variant.
If trans.ranges
is not provided by user and cannot be inferred by
ICAMS, SBS 192 and DBS 144 catalog will not be generated. Each catalog has
attributes added. See as.catalog
for more details.
See https://github.com/steverozen/ICAMS/raw/master/data-raw/PCAWG7_indel_classification_2021_09_03.xlsx for additional information on ID (small insertions and deletions) mutation classification.
See the documentation for Canonicalize1Del
which first handles
deletions in homopolymers, then handles deletions in simple repeats with
longer repeat units, (e.g. CACACACA
, see
FindMaxRepeatDel
), and if the deletion is not in a simple
repeat, looks for microhomology (see FindDelMH
).
See the code for unexported function CanonicalizeID
and the functions it calls for handling of insertions.
SBS 192 and DBS 144 catalogs include only mutations in transcribed regions. In ID (small insertions and deletions) catalogs, deletion repeat sizes range from 0 to 5+, but for plotting and end-user documentation deletion repeat sizes range from 1 to 6+.
To add or change attributes of the catalog, you can use function
attr
.
For example, attr(catalog, "abundance")
<- custom.abundance
.
file <- c(system.file("extdata/Mutect-vcf", "Mutect.GRCh37.s1.vcf", package = "ICAMS")) if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) { catalogs <- VCFsToCatalogs(file, ref.genome = "hg19", variant.caller = "mutect", region = "genome")}
file <- c(system.file("extdata/Mutect-vcf", "Mutect.GRCh37.s1.vcf", package = "ICAMS")) if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) { catalogs <- VCFsToCatalogs(file, ref.genome = "hg19", variant.caller = "mutect", region = "genome")}
Create 3 SBS catalogs (96, 192, 1536), 3 DBS catalogs (78, 136, 144) and
Indel catalog from the VCFs specified by files
and plot them to
PDF
VCFsToCatalogsAndPlotToPdf( files, output.dir, ref.genome, variant.caller = "unknown", num.of.cores = 1, trans.ranges = NULL, region = "unknown", names.of.VCFs = NULL, tumor.col.names = NA, filter.status = DefaultFilterStatus(variant.caller), get.vaf.function = NULL, ..., max.vaf.diff = 0.02, base.filename = "", return.annotated.vcfs = FALSE, suppress.discarded.variants.warnings = TRUE, chr.names.to.process = NULL )
VCFsToCatalogsAndPlotToPdf( files, output.dir, ref.genome, variant.caller = "unknown", num.of.cores = 1, trans.ranges = NULL, region = "unknown", names.of.VCFs = NULL, tumor.col.names = NA, filter.status = DefaultFilterStatus(variant.caller), get.vaf.function = NULL, ..., max.vaf.diff = 0.02, base.filename = "", return.annotated.vcfs = FALSE, suppress.discarded.variants.warnings = TRUE, chr.names.to.process = NULL )
files |
Character vector of file paths to the VCF files. |
output.dir |
The directory where the PDF files will be saved. |
ref.genome |
A |
variant.caller |
Name of the variant caller that produces the VCF, can
be either |
num.of.cores |
The number of cores to use. Not available on Windows
unless |
trans.ranges |
Optional. If
then the function will infer |
region |
A character string designating a genomic region;
see |
names.of.VCFs |
Optional. Character vector of names of the VCF files.
The order of names in |
tumor.col.names |
Optional. Only applicable to Mutect VCFs.
Vector of column names or column indices in Mutect VCFs which
contain the tumor sample information. The order of elements in
|
filter.status |
The character string in column |
get.vaf.function |
Optional. Only applicable when |
... |
Optional arguments to |
max.vaf.diff |
Not applicable if |
base.filename |
Optional. The base name of the PDF files to be produced;
multiple files will be generated, each ending in |
return.annotated.vcfs |
Logical. Whether to return the annotated VCFs with additional columns showing mutation class for each variant. Default is FALSE. |
suppress.discarded.variants.warnings |
Logical. Whether to suppress warning messages showing information about the discarded variants. Default is TRUE. |
chr.names.to.process |
A character vector specifying the chromosome names in VCF whose variants will be kept and processed, other chromosome variants will be discarded. If NULL(default), all variants will be kept except those on chromosomes with names that contain strings "GL", "KI", "random", "Hs", "M", "JH", "fix", "alt". |
This function calls VCFsToCatalogs
and
PlotCatalogToPdf
A list containing the following objects:
catSBS96
, catSBS192
, catSBS1536
: Matrix of
3 SBS catalogs (one each for 96, 192, and 1536).
catDBS78
, catDBS136
, catDBS144
: Matrix of
3 DBS catalogs (one each for 78, 136, and 144).
catID
: Matrix of ID (small insertions and deletions) catalog.
discarded.variants
: Non-NULL only if there are variants
that were excluded from the analysis. See the added extra column
discarded.reason
for more details.
annotated.vcfs
:
Non-NULL only if return.annotated.vcfs
= TRUE.
A list of elements:
SBS
: SBS VCF annotated by AnnotateSBSVCF
with
three new columns SBS96.class
, SBS192.class
and
SBS1536.class
showing the mutation class for each SBS variant.
DBS
: DBS VCF annotated by AnnotateDBSVCF
with
three new columns DBS78.class
, DBS136.class
and
DBS144.class
showing the mutation class for each DBS variant.
ID
: ID VCF annotated by AnnotateIDVCF
with one
new column ID.class
showing the mutation class for each
ID variant.
If trans.ranges
is not provided by user and cannot be inferred by
ICAMS, SBS 192 and DBS 144 catalog will not be generated. Each catalog has
attributes added. See as.catalog
for more details.
See https://github.com/steverozen/ICAMS/raw/master/data-raw/PCAWG7_indel_classification_2021_09_03.xlsx for additional information on ID (small insertions and deletions) mutation classification.
See the documentation for Canonicalize1Del
which first handles
deletions in homopolymers, then handles deletions in simple repeats with
longer repeat units, (e.g. CACACACA
, see
FindMaxRepeatDel
), and if the deletion is not in a simple
repeat, looks for microhomology (see FindDelMH
).
See the code for unexported function CanonicalizeID
and the functions it calls for handling of insertions.
SBS 192 and DBS 144 catalogs include only mutations in transcribed regions. In ID (small insertions and deletions) catalogs, deletion repeat sizes range from 0 to 5+, but for plotting and end-user documentation deletion repeat sizes range from 1 to 6+.
To add or change attributes of the catalog, you can use function
attr
.
For example, attr(catalog, "abundance")
<- custom.abundance
.
file <- c(system.file("extdata/Mutect-vcf", "Mutect.GRCh37.s1.vcf", package = "ICAMS")) if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) { catalogs <- VCFsToCatalogsAndPlotToPdf(file, ref.genome = "hg19", output.dir = tempdir(), variant.caller = "mutect", region = "genome", base.filename = "Mutect")}
file <- c(system.file("extdata/Mutect-vcf", "Mutect.GRCh37.s1.vcf", package = "ICAMS")) if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) { catalogs <- VCFsToCatalogsAndPlotToPdf(file, ref.genome = "hg19", output.dir = tempdir(), variant.caller = "mutect", region = "genome", base.filename = "Mutect")}
Create a list of 3 catalogs (one each for DBS78, DBS144 and DBS136) out of the contents in list.of.DBS.vcfs. The VCFs must not contain any type of mutation other then DBSs.
VCFsToDBSCatalogs( list.of.DBS.vcfs, ref.genome, num.of.cores = 1, trans.ranges = NULL, region = "unknown", return.annotated.vcfs = FALSE, suppress.discarded.variants.warnings = TRUE )
VCFsToDBSCatalogs( list.of.DBS.vcfs, ref.genome, num.of.cores = 1, trans.ranges = NULL, region = "unknown", return.annotated.vcfs = FALSE, suppress.discarded.variants.warnings = TRUE )
list.of.DBS.vcfs |
List of in-memory data frames of pure DBS mutations – no SBS or 3+BS mutations. The list names will be the sample ids in the output catalog. |
ref.genome |
A |
num.of.cores |
The number of cores to use. Not available on Windows
unless |
trans.ranges |
Optional. If
then the function will infer |
region |
A character string designating a genomic region;
see |
return.annotated.vcfs |
Logical. Whether to return the annotated VCFs with additional columns showing mutation class for each variant. Default is FALSE. |
suppress.discarded.variants.warnings |
Logical. Whether to suppress warning messages showing information about the discarded variants. Default is TRUE. |
A list containing the following objects:
catDBS78
, catDBS136
, catDBS144
: Matrix of
3 DBS catalogs (one each for 78, 136, and 144).
discarded.variants
: Non-NULL only if there are variants
that were excluded from the analysis. See the added extra column
discarded.reason
for more details.
annotated.vcfs
: Non-NULL only if
return.annotated.vcfs
= TRUE. DBS VCF annotated by
AnnotateDBSVCF
with three new columns DBS78.class
,
DBS136.class
and DBS144.class
showing the mutation class for
each DBS variant.
If trans.ranges
is not provided by user and cannot be inferred by
ICAMS, DBS 144 catalog will not be generated. Each catalog has
attributes added. See as.catalog
for more details.
To add or change attributes of the catalog, you can use function
attr
.
For example, attr(catalog, "abundance")
<- custom.abundance
.
DBS 144 catalog only contains mutations in transcribed regions.
file <- c(system.file("extdata/Mutect-vcf", "Mutect.GRCh37.s1.vcf", package = "ICAMS")) list.of.DBS.vcfs <- ReadAndSplitVCFs(file, variant.caller = "mutect")$DBS if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) { catalogs.DBS <- VCFsToDBSCatalogs(list.of.DBS.vcfs, ref.genome = "hg19", trans.ranges = trans.ranges.GRCh37, region = "genome")}
file <- c(system.file("extdata/Mutect-vcf", "Mutect.GRCh37.s1.vcf", package = "ICAMS")) list.of.DBS.vcfs <- ReadAndSplitVCFs(file, variant.caller = "mutect")$DBS if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) { catalogs.DBS <- VCFsToDBSCatalogs(list.of.DBS.vcfs, ref.genome = "hg19", trans.ranges = trans.ranges.GRCh37, region = "genome")}
Create ID (small insertions and deletions) catalog from ID VCFs
VCFsToIDCatalogs( list.of.vcfs, ref.genome, num.of.cores = 1, region = "unknown", flag.mismatches = 0, return.annotated.vcfs = FALSE, suppress.discarded.variants.warnings = TRUE )
VCFsToIDCatalogs( list.of.vcfs, ref.genome, num.of.cores = 1, region = "unknown", flag.mismatches = 0, return.annotated.vcfs = FALSE, suppress.discarded.variants.warnings = TRUE )
list.of.vcfs |
List of in-memory ID VCFs. The list names will be the sample ids in the output catalog. |
ref.genome |
A |
num.of.cores |
The number of cores to use. Not available on Windows
unless |
region |
A character string acting as a region identifier, one of "genome", "exome". |
flag.mismatches |
Deprecated. If there are ID variants whose |
return.annotated.vcfs |
Logical. Whether to return the annotated VCFs with additional columns showing mutation class for each variant. Default is FALSE. |
suppress.discarded.variants.warnings |
Logical. Whether to suppress warning messages showing information about the discarded variants. Default is TRUE. |
A list of elements:
catalog
: The ID (small insertions and deletions) catalog with
attributes added. See as.catalog
for details.
discarded.variants
: Non-NULL only if there are variants
that were excluded from the analysis. See the added extra column
discarded.reason
for more details.
annotated.vcfs
:
Non-NULL only if return.annotated.vcfs
= TRUE. A list of
data frames which contain the original VCF's ID mutation rows with three
additional columns seq.context.width
, seq.context
and
ID.class
added. The category assignment of each ID mutation in VCF can
be obtained from ID.class
column.
In ID (small insertions and deletions) catalogs, deletion repeat sizes range from 0 to 5+, but for plotting and end-user documentation deletion repeat sizes range from 1 to 6+.
See https://github.com/steverozen/ICAMS/raw/master/data-raw/PCAWG7_indel_classification_2021_09_03.xlsx for additional information on ID (small insertions and deletions) mutation classification.
See the documentation for Canonicalize1Del
which first handles
deletions in homopolymers, then handles deletions in simple repeats with
longer repeat units, (e.g. CACACACA
, see
FindMaxRepeatDel
), and if the deletion is not in a simple
repeat, looks for microhomology (see FindDelMH
).
See the code for unexported function CanonicalizeID
and the functions it calls for handling of insertions.
file <- c(system.file("extdata/Strelka-ID-vcf/", "Strelka.ID.GRCh37.s1.vcf", package = "ICAMS")) list.of.ID.vcfs <- ReadAndSplitVCFs(file, variant.caller = "strelka")$ID if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) { catID <- VCFsToIDCatalogs(list.of.ID.vcfs, ref.genome = "hg19", region = "genome")}
file <- c(system.file("extdata/Strelka-ID-vcf/", "Strelka.ID.GRCh37.s1.vcf", package = "ICAMS")) list.of.ID.vcfs <- ReadAndSplitVCFs(file, variant.caller = "strelka")$ID if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) { catID <- VCFsToIDCatalogs(list.of.ID.vcfs, ref.genome = "hg19", region = "genome")}
Create a list of 3 catalogs (one each for 96, 192, 1536) out of the contents in list.of.SBS.vcfs. The SBS VCFs must not contain DBSs, indels, or other types of mutations.
VCFsToSBSCatalogs( list.of.SBS.vcfs, ref.genome, num.of.cores = 1, trans.ranges = NULL, region = "unknown", return.annotated.vcfs = FALSE, suppress.discarded.variants.warnings = TRUE )
VCFsToSBSCatalogs( list.of.SBS.vcfs, ref.genome, num.of.cores = 1, trans.ranges = NULL, region = "unknown", return.annotated.vcfs = FALSE, suppress.discarded.variants.warnings = TRUE )
list.of.SBS.vcfs |
List of in-memory data frames of pure SBS mutations – no DBS or 3+BS mutations. The list names will be the sample ids in the output catalog. |
ref.genome |
A |
num.of.cores |
The number of cores to use. Not available on Windows
unless |
trans.ranges |
Optional. If
then the function will infer |
region |
A character string designating a genomic region;
see |
return.annotated.vcfs |
Logical. Whether to return the annotated VCFs with additional columns showing mutation class for each variant. Default is FALSE. |
suppress.discarded.variants.warnings |
Logical. Whether to suppress warning messages showing information about the discarded variants. Default is TRUE. |
A list containing the following objects:
catSBS96
, catSBS192
, catSBS1536
: Matrix of
3 SBS catalogs (one each for 96, 192, and 1536).
discarded.variants
: Non-NULL only if there are variants
that were excluded from the analysis. See the added extra column
discarded.reason
for more details.
annotated.vcfs
:
Non-NULL only if return.annotated.vcfs
= TRUE.
SBS VCF annotated by AnnotateSBSVCF
with
three new columns SBS96.class
, SBS192.class
and
SBS1536.class
showing the mutation class for each SBS variant.
If trans.ranges
is not provided by user and cannot be inferred by
ICAMS, SBS 192 catalog will not be generated. Each catalog has attributes
added. See as.catalog
for more details.
To add or change attributes of the catalog, you can use function
attr
.
For example, attr(catalog, "abundance")
<- custom.abundance
.
SBS 192 catalogs only contain mutations in transcribed regions.
file <- c(system.file("extdata/Mutect-vcf", "Mutect.GRCh37.s1.vcf", package = "ICAMS")) list.of.SBS.vcfs <- ReadAndSplitVCFs(file, variant.caller = "mutect")$SBS if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) { catalogs.SBS <- VCFsToSBSCatalogs(list.of.SBS.vcfs, ref.genome = "hg19", trans.ranges = trans.ranges.GRCh37, region = "genome")}
file <- c(system.file("extdata/Mutect-vcf", "Mutect.GRCh37.s1.vcf", package = "ICAMS")) list.of.SBS.vcfs <- ReadAndSplitVCFs(file, variant.caller = "mutect")$SBS if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) { catalogs.SBS <- VCFsToSBSCatalogs(list.of.SBS.vcfs, ref.genome = "hg19", trans.ranges = trans.ranges.GRCh37, region = "genome")}
Create 3 SBS catalogs (96, 192, 1536), 3 DBS catalogs (78, 136, 144) and
Indel catalog from the VCFs specified by dir
, save the catalogs
as CSV files, plot them to PDF and generate a zip archive of all the output files.
VCFsToZipFile( dir, files, zipfile, ref.genome, variant.caller = "unknown", num.of.cores = 1, trans.ranges = NULL, region = "unknown", names.of.VCFs = NULL, tumor.col.names = NA, filter.status = DefaultFilterStatus(variant.caller), get.vaf.function = NULL, ..., max.vaf.diff = 0.02, base.filename = "", return.annotated.vcfs = FALSE, suppress.discarded.variants.warnings = TRUE, chr.names.to.process = NULL )
VCFsToZipFile( dir, files, zipfile, ref.genome, variant.caller = "unknown", num.of.cores = 1, trans.ranges = NULL, region = "unknown", names.of.VCFs = NULL, tumor.col.names = NA, filter.status = DefaultFilterStatus(variant.caller), get.vaf.function = NULL, ..., max.vaf.diff = 0.02, base.filename = "", return.annotated.vcfs = FALSE, suppress.discarded.variants.warnings = TRUE, chr.names.to.process = NULL )
dir |
Pathname of the directory which contains VCFs that come from the
same variant caller. Each VCF must have a file extension
".vcf" (case insensitive) and share the same |
files |
Character vector of file paths to the VCF files. Only
one of argument |
zipfile |
Pathname of the zip file to be created. |
ref.genome |
A |
variant.caller |
Name of the variant caller that produces the VCF, can
be either |
num.of.cores |
The number of cores to use. Not available on Windows
unless |
trans.ranges |
Optional. If
then the function will infer |
region |
A character string designating a genomic region;
see |
names.of.VCFs |
Optional. Character vector of names of the VCF files.
The order of names in |
tumor.col.names |
Optional. Only applicable to Mutect VCFs.
Vector of column names or column indices in Mutect VCFs which
contain the tumor sample information. The order of elements in
|
filter.status |
The character string in column |
get.vaf.function |
Optional. Only applicable when |
... |
Optional arguments to |
max.vaf.diff |
Not applicable if |
base.filename |
Optional. The base name of the CSV and PDF files to be
produced; multiple files will be generated, each ending in
|
return.annotated.vcfs |
Logical. Whether to return the annotated VCFs with additional columns showing mutation class for each variant. Default is FALSE. |
suppress.discarded.variants.warnings |
Logical. Whether to suppress warning messages showing information about the discarded variants. Default is TRUE. |
chr.names.to.process |
A character vector specifying the chromosome names in VCF whose variants will be kept and processed, other chromosome variants will be discarded. If NULL(default), all variants will be kept except those on chromosomes with names that contain strings "GL", "KI", "random", "Hs", "M", "JH", "fix", "alt". |
This function calls VCFsToCatalogs
,
PlotCatalogToPdf
, WriteCatalog
and
zip::zipr
.
A list containing the following objects:
catSBS96
, catSBS192
, catSBS1536
: Matrix of
3 SBS catalogs (one each for 96, 192, and 1536).
catDBS78
, catDBS136
, catDBS144
: Matrix of
3 DBS catalogs (one each for 78, 136, and 144).
catID
: Matrix of ID (small insertions and deletions) catalog.
discarded.variants
: Non-NULL only if there are variants
that were excluded from the analysis. See the added extra column
discarded.reason
for more details.
annotated.vcfs
:
Non-NULL only if return.annotated.vcfs
= TRUE.
A list of elements:
SBS
: SBS VCF annotated by AnnotateSBSVCF
with
three new columns SBS96.class
, SBS192.class
and
SBS1536.class
showing the mutation class for each SBS variant.
DBS
: DBS VCF annotated by AnnotateDBSVCF
with
three new columns DBS78.class
, DBS136.class
and
DBS144.class
showing the mutation class for each DBS variant.
ID
: ID VCF annotated by AnnotateIDVCF
with one
new column ID.class
showing the mutation class for each
ID variant.
If trans.ranges
is not provided by user and cannot be inferred by
ICAMS, SBS 192 and DBS 144 catalog will not be generated. Each catalog has
attributes added. See as.catalog
for more details.
See https://github.com/steverozen/ICAMS/raw/master/data-raw/PCAWG7_indel_classification_2021_09_03.xlsx for additional information on ID (small insertions and deletions) mutation classification.
See the documentation for Canonicalize1Del
which first handles
deletions in homopolymers, then handles deletions in simple repeats with
longer repeat units, (e.g. CACACACA
, see
FindMaxRepeatDel
), and if the deletion is not in a simple
repeat, looks for microhomology (see FindDelMH
).
See the code for unexported function CanonicalizeID
and the functions it calls for handling of insertions.
SBS 192 and DBS 144 catalogs include only mutations in transcribed regions. In ID (small insertions and deletions) catalogs, deletion repeat sizes range from 0 to 5+, but for plotting and end-user documentation deletion repeat sizes range from 1 to 6+.
To add or change attributes of the catalog, you can use function
attr
.
For example, attr(catalog, "abundance")
<- custom.abundance
.
dir <- c(system.file("extdata/Mutect-vcf", package = "ICAMS")) if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) { catalogs <- VCFsToZipFile(dir, zipfile = file.path(tempdir(), "test.zip"), ref.genome = "hg19", variant.caller = "mutect", region = "genome", base.filename = "Mutect") unlink(file.path(tempdir(), "test.zip"))}
dir <- c(system.file("extdata/Mutect-vcf", package = "ICAMS")) if (requireNamespace("BSgenome.Hsapiens.1000genomes.hs37d5", quietly = TRUE)) { catalogs <- VCFsToZipFile(dir, zipfile = file.path(tempdir(), "test.zip"), ref.genome = "hg19", variant.caller = "mutect", region = "genome", base.filename = "Mutect") unlink(file.path(tempdir(), "test.zip"))}
Write a catalog to a file.
WriteCatalog(catalog, file, strict = TRUE)
WriteCatalog(catalog, file, strict = TRUE)
catalog |
A catalog as defined in |
file |
The path to the file to be created. |
strict |
If TRUE, do additional checks on the input, and stop if the checks fail. |
See also ReadCatalog
.
In ID (small insertions and deletions) catalogs, deletion repeat sizes range from 0 to 5+, but for plotting and end-user documentation deletion repeat sizes range from 1 to 6+.
file <- system.file("extdata", "strelka.regress.cat.sbs.96.csv", package = "ICAMS") catSBS96 <- ReadCatalog(file) WriteCatalog(catSBS96, file = file.path(tempdir(), "catSBS96.csv"))
file <- system.file("extdata", "strelka.regress.cat.sbs.96.csv", package = "ICAMS") catSBS96 <- ReadCatalog(file) WriteCatalog(catSBS96, file = file.path(tempdir(), "catSBS96.csv"))