Title: | Read, Manipulate and Visualize 'Pairwise mApping Format' Data |
---|---|
Description: | Provides functions to read, process and visualize pairwise sequence alignments in the 'PAF' format used by 'minimap2' and other whole-genome aligners. 'minimap2' is described by Li H. (2018) <doi:10.1093/bioinformatics/bty191>. |
Authors: | David Winter [aut, cre] , Kate Lee [ctb] , Murray Cox [ctb] |
Maintainer: | David Winter <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.0.2 |
Built: | 2024-11-03 04:58:24 UTC |
Source: | https://github.com/dwinter/pafr |
The main reason to use this function is speed up the process of reading in a large paf file that has no tags. Functions like read.table, read_delim (reader) and fread (data.table) can process a 12 column file more quickly than pafr's read_paf. If you you do not need tag data for your analyses or visualizations, it might make sense to use a fast reading function to get a 12 column data.frame, convert that data.frame into a 'pafr object with this function. The 'pafr' object can then work easily with the functions in this package.
as_paf(paf_data_frame)
as_paf(paf_data_frame)
paf_data_frame |
a data.frame object with 12 columns. Column names and types will be over-written in the returned object |
a pafr object
read_paf
Extract the sizes of all sequences in a paf alignment
chrom_sizes(ali)
chrom_sizes(ali)
ali |
pafr or tibble containing the genome alignment (as returned by
|
list with two elements (tlens and qlens) Each element is a dataframe with one column of sequence names and another column containing the length of each sequence
ali <- read_paf( system.file("extdata", "fungi.paf", package="pafr") ) chrom_sizes(ali)
ali <- read_paf( system.file("extdata", "fungi.paf", package="pafr") ) chrom_sizes(ali)
Generate a dot plot from a paf alignment
dotplot( ali, order_by = c("size", "qstart", "provided"), label_seqs = FALSE, dashes = TRUE, ordering = list(), alignment_colour = "black", xlab = "query", ylab = "target", line_size = 2 )
dotplot( ali, order_by = c("size", "qstart", "provided"), label_seqs = FALSE, dashes = TRUE, ordering = list(), alignment_colour = "black", xlab = "query", ylab = "target", line_size = 2 )
ali |
pafr or tibble containing the genome alignment (as returned by
|
order_by |
How the query and target sequences should be ordered in the
dot plot. Option must be one of 'size' (smallest-to-largest), 'qstart' (query organised
smallest to largest, target by first match in the query genome) or 'provided'
(ordering as specified in the |
label_seqs |
boolean If TRUE, label centre of query and target sequences in margins of the dot plot |
dashes |
boolean If TRUE, add dashes to borders of query and target sequences in the dot plot |
ordering |
If |
alignment_colour |
character The colour used to draw each aligned section in the dot plot (defaults to black) |
xlab |
character The x-axis label (defaults to 'query') |
ylab |
character The y-axis label (defaults to 'target') |
line_size |
The width of the line used to represent an alignment in the dot plot (defaults to 2) |
ali <- read_paf( system.file("extdata", "fungi.paf", package="pafr") ) dotplot(ali) dotplot(ali) + theme_bw() dotplot(ali, label_seqs=TRUE, order_by="qstart", alignment_colour="blue")
ali <- read_paf( system.file("extdata", "fungi.paf", package="pafr") ) dotplot(ali) dotplot(ali) + theme_bw() dotplot(ali, label_seqs=TRUE, order_by="qstart", alignment_colour="blue")
Remove secondary alignments from a pafr alignment
filter_secondary_alignments(ali, remove_inversions = FALSE)
filter_secondary_alignments(ali, remove_inversions = FALSE)
ali |
Genomic alignment in |
remove_inversions |
logical If TRUE, also remove inversions (tp flag 'I' or 'i') from the alignment |
ali <- read_paf( system.file("extdata", "fungi.paf", package="pafr") ) ali filter_secondary_alignments(ali)
ali <- read_paf( system.file("extdata", "fungi.paf", package="pafr") ) ali filter_secondary_alignments(ali)
For use with ggplot2
Gb_lab(x) Mb_lab(x) Kb_lab(x)
Gb_lab(x) Mb_lab(x) Kb_lab(x)
x |
The data (in base pairs) to be formatted as Gb, Mb or Kb |
A character vector with scale labels
## Not run: ali <- read_paf(system.file("extdata", "fungi.paf", package="pafr")) doplot(ali) + scale_x_continuous("Genomic position", label=Mb_lab) ## End(Not run)
## Not run: ali <- read_paf(system.file("extdata", "fungi.paf", package="pafr")) doplot(ali) + scale_x_continuous("Genomic position", label=Mb_lab) ## End(Not run)
This plot is intended to be used in conjunction with link{dotplot}
.
Adding higlight_query
or highlight_target
to a dotplot function call
(see examples below) will add a rectangular 'highlight' corresponding to a
particular genomic interval in the corresponding genome.
highlight_query(bed, fill = "yellow", colour = "black", alpha = 0.6) highlight_target(bed, fill = "yellow", colour = "black", alpha = 0.6)
highlight_query(bed, fill = "yellow", colour = "black", alpha = 0.6) highlight_target(bed, fill = "yellow", colour = "black", alpha = 0.6)
bed |
|
fill |
character Fill colour for highlight segment |
colour |
character Outline colour for highlight segment |
alpha |
character Opacity ([0-1]) for highlight segment |
ali <- read_paf( system.file("extdata", "fungi.paf", package="pafr") ) cen <- read_bed(system.file("extdata", "Q_centro.bed", package="pafr")) dotplot(ali) + highlight_query(cen) interval <- data.frame(chrom="T_chr3", start=2000000, end=3000000) dotplot(ali, label_seqs=TRUE) + highlight_target(interval)
ali <- read_paf( system.file("extdata", "fungi.paf", package="pafr") ) cen <- read_bed(system.file("extdata", "Q_centro.bed", package="pafr")) dotplot(ali) + highlight_query(cen) interval <- data.frame(chrom="T_chr3", start=2000000, end=3000000) dotplot(ali, label_seqs=TRUE) + highlight_target(interval)
Each sequence in the focal genome is displayed as a rectangle, with regions
covered by an alignment shaded as per the fill
argument described
below. Uncovered regions remain white.
plot_coverage( ali, target = TRUE, fill = "forestgreen", direct_label = TRUE, label_colour = "black", xlab = "Position in sequence", x_labeller = Mb_lab )
plot_coverage( ali, target = TRUE, fill = "forestgreen", direct_label = TRUE, label_colour = "black", xlab = "Position in sequence", x_labeller = Mb_lab )
ali |
alignment An alignment as read by |
target |
logical If TRUE, display coverage for the target genome; if FALSE, display coverage for the query |
fill |
character How to colour the alignment blocks. If the character
provided is the name of a column in the alignment, this column will be passed
to |
direct_label |
logical If TRUE, use geom_text to directly label the name of the focal sequences; if FALSE, no direct labels are drawn |
label_colour |
character Colour used for direct labels |
xlab |
string Name for the x-axis |
x_labeller |
function Function to be used to label the x-axis (defaults to
|
Note that this function uses theme_coverage_plot
to style the
graph. Using another ggplot theme on the plot may produce unexpected results.
ali <- read_paf( system.file("extdata", "fungi.paf", package="pafr") ) plot_coverage(ali) plot_coverage(ali, fill='qname', direct_label=FALSE) + scale_fill_brewer(palette="Set1")
ali <- read_paf( system.file("extdata", "fungi.paf", package="pafr") ) plot_coverage(ali) plot_coverage(ali, fill='qname', direct_label=FALSE) + scale_fill_brewer(palette="Set1")
Plot synteny between a query and target sequence in a PAF alignment
plot_synteny( ali, q_chrom, t_chrom, centre = TRUE, rc = FALSE, xlab = "Position in query", ylab = "", x_labeller = Mb_lab )
plot_synteny( ali, q_chrom, t_chrom, centre = TRUE, rc = FALSE, xlab = "Position in query", ylab = "", x_labeller = Mb_lab )
ali |
pafr or tibble containing the genome alignment (as returned by
|
q_chrom |
character Name for the query sequence |
t_chrom |
character Name for the target sequence |
centre |
logical If TRUE (default), adjust the position of the target sequence, so it is centred on the query. If not, both sequences start at position zero |
rc |
logical If TRUE, use the reverse and complement for the target sequence |
xlab |
string Name for the x-axis |
ylab |
string Name for the y-axis |
x_labeller |
Function to be used to label the x-axis |
A ggplot object that displays synteny between query and target sequences
ali <- read_paf( system.file("extdata", "fungi.paf", package="pafr") ) long_ali <- subset(ali, alen > 1e4) plot_synteny(long_ali, q_chrom="Q_chr3", t_chrom="T_chr4", centre=TRUE) plot_synteny(long_ali, q_chrom="Q_chr5", t_chrom="T_chr5", centre=TRUE) plot_synteny(long_ali, q_chrom="Q_chr5", t_chrom="T_chr5", centre=TRUE, rc=TRUE)
ali <- read_paf( system.file("extdata", "fungi.paf", package="pafr") ) long_ali <- subset(ali, alen > 1e4) plot_synteny(long_ali, q_chrom="Q_chr3", t_chrom="T_chr4", centre=TRUE) plot_synteny(long_ali, q_chrom="Q_chr5", t_chrom="T_chr5", centre=TRUE) plot_synteny(long_ali, q_chrom="Q_chr5", t_chrom="T_chr5", centre=TRUE, rc=TRUE)
The first three columns of the file specified by file_name
must
contain data in the standard bed format (i.e., a genomic interval
represented by 0-based half-open interval with seq-id, start and end position).
These columns will be renamed to 'chrom', 'start' and 'end', respectively. Any
other columns present in the data will be left unmodified.
read_bed(file_name, tibble = FALSE, ...)
read_bed(file_name, tibble = FALSE, ...)
file_name |
Path to the bed file to be read in |
tibble |
logical If TRUE, the genomic intervals are returned as
a tidy |
... |
Other arguments passed to |
The file is read into memory with read.table
, with the
argument sep
set to '\t'
and stringsAsFactors
set to
FALSE. All other arguments are left as default, but arguments can be passed
from read_bed
to read.table
.
Either a data.frame
or a tbl_df
with at least three
columns named 'chrom', 'start' and 'end'
bed_path <- system.file("extdata", "Q_centro.bed", package="pafr") centro <- read_bed(bed_path) centro # Can pass arguments to read.table miss_two <- read_bed(bed_path, skip=2) miss_two
bed_path <- system.file("extdata", "Q_centro.bed", package="pafr") centro <- read_bed(bed_path) centro # Can pass arguments to read.table miss_two <- read_bed(bed_path, skip=2) miss_two
See the package vignette for detailed information on the file format and its representation as an R object.
read_paf(file_name, tibble = FALSE, include_tags = TRUE)
read_paf(file_name, tibble = FALSE, include_tags = TRUE)
file_name |
Path to the .paf file |
tibble |
logical If TRUE, the genomic alignments are returned as
a tidy |
include_tags |
logical if TRUE (default) read additional information about each alignment encoded as PAF tags. Setting this to FALSE will speed up parsing of paf alignments, specially those with large CIGAR strings/ |
Either a pafr
object, which acts as a data.frame
, or a
tbl_df
containing information on genomic alignments. The contents of
this table are described in detail in the pafr package vingette.
ali <- read_paf( system.file("extdata", "fungi.paf", package="pafr") ) ali
ali <- read_paf( system.file("extdata", "fungi.paf", package="pafr") ) ali
This theme is used as the default when plot_coverage
is called,
so you should usually only call this function to modify the appearance of the
coverage plot.
theme_coverage_plot(facet_labs = TRUE, show_legend = TRUE)
theme_coverage_plot(facet_labs = TRUE, show_legend = TRUE)
facet_labs |
logical If TRUE (default), label sequences using the facet
labels; if FALSE, sequences are labeled directly using
|
show_legend |
logical If TRUE (default), label display any legend
associated with the fill parameter of |
ali <- read_paf( system.file("extdata", "fungi.paf", package="pafr") ) plot_coverage(ali) + theme_coverage_plot(show_legend=FALSE)
ali <- read_paf( system.file("extdata", "fungi.paf", package="pafr") ) plot_coverage(ali) + theme_coverage_plot(show_legend=FALSE)