The Multiple Alignment Format, described by UCSC, stores a series of multiple alignments in a single file. Suitable for whole-genome to whole-genome alignments, metadata such as source chromosome, start position, size, and strand can be stored.
Biopython 1.69 includes a MAF reader and writer accessible via
Bio.AlignIO
, and an indexer accessible via
Bio.AlignIO.MafIO
.
All examples below make use of the Multiz 30-way alignment to mouse chromosome 10 available from UCSC.
If you can’t wait for Biopython 1.69 to be released, get the latest Biopython from GitHub:
First, clone the repository with git from the command line, like so:
git clone git@github.com:biopython/biopython.git
This will give you the default master branch. Then install from source.
Parsing a MAF file is similar to any other alignment file in AlignIO
.
Additional data, however, is stored as a dict in the .annotations
property of SeqRecord
s belonging to returned MultipleSeqAlignment
objects.
Key | Type | Value |
---|---|---|
start | integer | The start position in the source sequence of this alignment |
size | integer | The ungapped length of this sequence |
strand | enum(“+”, “-“) | The strand this sequence originates from on the source sequence/chromosome |
srcSize | integer | The total length of the source sequence/chromosome |
from Bio import AlignIO
for multiple_alignment in AlignIO.parse("chr10.maf", "maf"):
print("printing a new multiple alignment")
for seqrec in multiple_alignment:
print(
"starts at %s on the %s strand of a sequence %s in length, and runs for %s bp"
% (
seqrec.annotations["start"],
seqrec.annotations["strand"],
seqrec.annotations["srcSize"],
seqrec.annotations["size"],
)
)
Biopython may soon provide an interface for fast access to the multiple
alignment of several sequences across an arbitrary interval: for
example, chr10:25,079,604-25,243,324 in mm9. As MAF files are available
for entire chromosomes, they can be indexed by chromosome position and
accessed at random. This functionality would be available in the class
Bio.AlignIO.MafIO.MafIndex
.
Indexes are created by determining the chromosome start and end position for a specific sequence name (generally a species), which must appear in every alignment block in the file. An index can be generated for only one species at a time. In whole-genome alignments generated by Multiz, the chromosome of one species is generally used as the reference to which other species are aligned. This reference species will appear in every block, and should be used as the target_seqname parameter. For UCSC multiz files, the form of species.chromosome is used.
To index a MAF file, or load an existing index, create a new
MafIO.MafIndex
object. If the index database file sqlite_file
does not exist, it will be created, otherwise it will be loaded.
# index mouse chr10 from UCSC and store it in a file for later use
from Bio.AlignIO import MafIO
# idx = MafIO.MafIndex(sqlite_file, maf_file, target_seqname)
idx = MafIO.MafIndex("chr10.mafindex", "chr10.maf", "mm9.chr10")
The MafIO.MafIndex.search()
generator function accepts a list of
start and end positions, and yields MultipleSeqAlignment
objects that
overlap the given intervals. This is particularly useful for obtaining
alignments over the multiple exons of a single transcript, eliminating
the need to retrieve an entire locus.
# count the number of bases in danRer5 (Zebrafish) that align to the
# Pcmt1 locus in mouse
from Bio.AlignIO.MafIO import MafIndex
idx = MafIndex("chr10.mafindex", "chr10.maf", "mm9.chr10")
results = idx.search([7350034], [7383048])
total_bases = 0
for multiple_alignment in results:
for seqrec in multiple_alignment:
if seqrec.id.startswith("danRer5"):
# don't count gaps as bases
total_bases += len(str(seqrec.seq).replace("-", ""))
print("a total of %s bases align" % total_bases)
The MafIO.MafIndex.get_spliced()
function accepts a list of start
and end positions representing exons, and returns a single
MultipleSeqAlignment
object of the in silico spliced transcript from
the reference and all aligned sequences. If part of the sequence range
is not found in a particular species in the alignment, dashes (“-“) are
used to fill the gaps, or “N”s if the sequence is not present in the
reference (target_seqname) sequence. If strand is opposite that in
the reference sequence, all sequences in the returned alignment will be
reverse complemented.
# convert the alignment for mouse Foxo3 (NM_019740) from MAF to FASTA
from Bio import AlignIO
idx = AlignIO.MafIO.MafIndex("chr10.mafindex", "chr10.maf", "mm9.chr10")
multiple_alignment = idx.get_spliced(
[41905591, 41916271, 41994621, 41996331],
[41906101, 41917707, 41995347, 41996548],
strand="+",
)
AlignIO.write(multiple_alignment, "mm9_foxo3.fa", "fasta")
# find every gene on chr10 in the current UCSC refGene database,
# retrieve its spliced multiple alignment, and write it to
# a FASTA file in the current directory
#
# depends: MySQLdb
import MySQLdb
from Bio import AlignIO
# connect to UCSC's live MySQL database
mysql_conn = MySQLdb.connect(
host="genome-mysql.cse.ucsc.edu", user="genome", passwd="", db="mm9"
)
db_conn = mysql_conn.cursor(MySQLdb.cursors.DictCursor)
# load MAF index
idx = AlignIO.MafIO.MafIndex("chr10.mafindex", "chr10.maf", "mm9.chr10")
# fetch all records on chr10
db_conn.execute("SELECT * FROM refGene WHERE chrom = 'chr10'")
for record in db_conn.fetchall():
multiple_alignment = idx.get_spliced(
map(int, record["exonStarts"].split(",")[:-1]),
map(int, record["exonEnds"].split(",")[:-1]),
strand=record["strand"],
)
print("writing %s.fa" % record["name"])
AlignIO.write(multiple_alignment, "%s.fa" % record["name"], "fasta")
track name=euArc visibility=pack
##maf version=1 scoring=tba.v8
# tba.v8 (((human chimp) baboon) (mouse rat))
a score=23262.0
s hg18.chr7 27578828 38 + 158545518 AAA-GGGAATGTTAACCAAATGA---ATTGTCTCTTACGGTG
s panTro1.chr6 28741140 38 + 161576975 AAA-GGGAATGTTAACCAAATGA---ATTGTCTCTTACGGTG
s baboon 116834 38 + 4622798 AAA-GGGAATGTTAACCAAATGA---GTTGTCTCTTATGGTG
s mm4.chr6 53215344 38 + 151104725 -AATGGGAATGTTAAGCAAACGA---ATTGTCTCTCAGTGTG
s rn3.chr4 81344243 40 + 187371129 -AA-GGGGATGCTAAGCCAATGAGTTGTTGTCTCTCAATGTG
a score=5062.0
s hg18.chr7 27699739 6 + 158545518 TAAAGA
s panTro1.chr6 28862317 6 + 161576975 TAAAGA
s baboon 241163 6 + 4622798 TAAAGA
s mm4.chr6 53303881 6 + 151104725 TAAAGA
s rn3.chr4 81444246 6 + 187371129 taagga
a score=6636.0
s hg18.chr7 27707221 13 + 158545518 gcagctgaaaaca
s panTro1.chr6 28869787 13 + 161576975 gcagctgaaaaca
s baboon 249182 13 + 4622798 gcagctgaaaaca
s mm4.chr6 53310102 13 + 151104725 ACAGCTGAAAATA