Multiple Alignment Format

From Biopython
(Difference between revisions)
Jump to: navigation, search
m
 
(9 intermediate revisions by 2 users not shown)
Line 1: Line 1:
 
[[Category:Formats|Formats]]
 
[[Category:Formats|Formats]]
The Multiple Alignment Format, described by [http://genome.ucsc.edu/FAQ/FAQformat.html#format5 UCSC], stores a series of multiple alignments in a single file.  Suitable for whole-genome to whole-genome alignments, metadata such as source chromosome, start position, size, and strand can be stored.  Biopython implements a MAF reader and writer accessible via '''Bio.AlignIO''', and an indexer accessible via '''Bio.AlignIO.MafIO'''.
+
The Multiple Alignment Format, described by [http://genome.ucsc.edu/FAQ/FAQformat.html#format5 UCSC], stores a series of multiple alignments in a single file.  Suitable for whole-genome to whole-genome alignments, metadata such as source chromosome, start position, size, and strand can be stored.
  
==MafIndex==
+
A branch of Biopython on GitHub (not yet in the main distribution for general use) implements a MAF reader and writer accessible via '''Bio.AlignIO''', and an indexer accessible via '''Bio.AlignIO.MafIO'''.
Biopython provides an interface for fast access to the multiple alignment of several sequences across an arbitrary interval: for example, chr10:25,079,604-25,243,324 in mm9. As MAF files are available for entire chromosomes, they can be indexed by chromosome position and accessed at random. This functionality is available in the class MafIO.MafIndex.
+
  
 
All examples below make use of the Multiz [http://hgdownload.cse.ucsc.edu/goldenPath/currentGenomes/Mus_musculus/multiz30way/maf/chr10.maf.gz 30-way alignment to mouse chromosome 10] available from UCSC.
 
All examples below make use of the Multiz [http://hgdownload.cse.ucsc.edu/goldenPath/currentGenomes/Mus_musculus/multiz30way/maf/chr10.maf.gz 30-way alignment to mouse chromosome 10] available from UCSC.
 +
 +
==Getting the AlignIO MAF branch from Github==
 +
First, clone the repository with git from the command line, like so:
 +
git clone -b alignio-maf git://github.com/polyatail/biopython.git alignio-maf
 +
 +
If you're using an older version of git, you may need to do this:
 +
 +
git clone git://github.com/polyatail/biopython.git alignio-maf
 +
cd alignio-maf
 +
git checkout -b alignio-maf origin/alignio-maf
 +
 +
To access the MAF parser, you'll need to manually specify the path to it in your code, as in:
 +
<python>
 +
import sys
 +
 +
# replace "./alignio-maf" with the full path of the alignio-maf branch
 +
# you cloned from github, or keep if it's in the current directory
 +
sys.path.insert(1, "./alignio-maf")
 +
 +
try:
 +
    from Bio.AlignIO import MafIO
 +
except ImportError:
 +
    print "oops, the import didn't work"
 +
</python>
 +
 +
For help, contact the branch maintainer Andrew Sczesnak (firstname dot lastname at med dot nyu dot edu) or the [mailto:biopython-dev@lists.open-bio.org BioPython Developers].
 +
 +
==Reading in a MAF file==
 +
Parsing a MAF file is similar to any other alignment file in AlignIO.  Additional data, however, is stored as a dict in the .annotations property of SeqRecords belonging to returned MultipleSeqAlignment objects.
 +
 +
===Annotations available in SeqRecords===
 +
{| border="1" cellpadding="4" cellspacing="0"
 +
! Key
 +
! Type
 +
! Value
 +
|-
 +
| '''start'''
 +
| integer
 +
| The start position in the source sequence of this alignment
 +
|-
 +
| '''size'''
 +
| integer
 +
| The ungapped length of this sequence
 +
|-
 +
| '''strand'''
 +
| enum("+", "-")
 +
| The strand this sequence originates from on the source sequence/chromosome
 +
|-
 +
| '''srcSize'''
 +
| integer
 +
| The total length of the source sequence/chromosome
 +
|}
 +
 +
===Example===
 +
<python>
 +
from Bio import AlignIO
 +
 +
for multiple_alignment in AlignIO.parse("chr10.maf", "maf"):
 +
    print "printing a new multiple alignment"
 +
 +
    for seqrec in multiple_alignment:
 +
        print "starts at %s on the %s strand of a sequence %s in length, and runs for %s bp" % \
 +
              (seqrec.annotations["start"],
 +
              seqrec.annotations["strand"],
 +
              seqrec.annotations["srcSize"],
 +
              seqrec.annotations["size"])
 +
</python>
 +
 +
==MafIndex==
 +
Biopython may soon provide an interface for fast access to the multiple alignment of several sequences across an arbitrary interval: for example, chr10:25,079,604-25,243,324 in mm9.  As MAF files are available for entire chromosomes, they can be indexed by chromosome position and accessed at random.  This functionality would be available in the class Bio.AlignIO.MafIO.MafIndex.
  
 
===Creating or loading a MAF index===
 
===Creating or loading a MAF index===
Line 16: Line 85:
 
# index mouse chr10 from UCSC and store it in a file for later use
 
# index mouse chr10 from UCSC and store it in a file for later use
  
from Bio.AlignIO import MafIO.MafIndex
+
from Bio.AlignIO import MafIO
  
 +
# idx = MafIO.MafIndex(sqlite_file, maf_file, target_seqname)
 
idx = MafIO.MafIndex("chr10.mafindex", "chr10.maf", "mm9.chr10")
 
idx = MafIO.MafIndex("chr10.mafindex", "chr10.maf", "mm9.chr10")
 
</python>
 
</python>
Line 29: Line 99:
 
# Pcmt1 locus in mouse
 
# Pcmt1 locus in mouse
  
from Bio.AlignIO import MafIO.MafIndex as MafIndex
+
from Bio.AlignIO.MafIO import MafIndex
  
 
idx = MafIndex("chr10.mafindex", "chr10.maf", "mm9.chr10")
 
idx = MafIndex("chr10.mafindex", "chr10.maf", "mm9.chr10")
Line 85: Line 155:
  
 
# fetch all records on chr10
 
# fetch all records on chr10
db_conn.execute ("SELECT * FROM refGene WHERE chrom = 'chr10'")
+
db_conn.execute("SELECT * FROM refGene WHERE chrom = 'chr10'")
  
 
for record in db_conn.fetchall():
 
for record in db_conn.fetchall():
Line 98: Line 168:
 
                   "fasta")
 
                   "fasta")
 
</python>
 
</python>
 +
 +
==Format==
 +
<pre>
 +
track name=euArc visibility=pack
 +
##maf version=1 scoring=tba.v8
 +
# tba.v8 (((human chimp) baboon) (mouse rat))
 +
                 
 +
a score=23262.0   
 +
s hg18.chr7    27578828 38 + 158545518 AAA-GGGAATGTTAACCAAATGA---ATTGTCTCTTACGGTG
 +
s panTro1.chr6 28741140 38 + 161576975 AAA-GGGAATGTTAACCAAATGA---ATTGTCTCTTACGGTG
 +
s baboon        116834 38 +  4622798 AAA-GGGAATGTTAACCAAATGA---GTTGTCTCTTATGGTG
 +
s mm4.chr6    53215344 38 + 151104725 -AATGGGAATGTTAAGCAAACGA---ATTGTCTCTCAGTGTG
 +
s rn3.chr4    81344243 40 + 187371129 -AA-GGGGATGCTAAGCCAATGAGTTGTTGTCTCTCAATGTG
 +
                 
 +
a score=5062.0                   
 +
s hg18.chr7    27699739 6 + 158545518 TAAAGA
 +
s panTro1.chr6 28862317 6 + 161576975 TAAAGA
 +
s baboon        241163 6 +  4622798 TAAAGA
 +
s mm4.chr6    53303881 6 + 151104725 TAAAGA
 +
s rn3.chr4    81444246 6 + 187371129 taagga
 +
 +
a score=6636.0
 +
s hg18.chr7    27707221 13 + 158545518 gcagctgaaaaca
 +
s panTro1.chr6 28869787 13 + 161576975 gcagctgaaaaca
 +
s baboon        249182 13 +  4622798 gcagctgaaaaca
 +
s mm4.chr6    53310102 13 + 151104725 ACAGCTGAAAATA
 +
</pre>

Latest revision as of 21:37, 20 April 2012

The Multiple Alignment Format, described by UCSC, stores a series of multiple alignments in a single file. Suitable for whole-genome to whole-genome alignments, metadata such as source chromosome, start position, size, and strand can be stored.

A branch of Biopython on GitHub (not yet in the main distribution for general use) implements a MAF reader and writer accessible via Bio.AlignIO, and an indexer accessible via Bio.AlignIO.MafIO.

All examples below make use of the Multiz 30-way alignment to mouse chromosome 10 available from UCSC.

Contents

Getting the AlignIO MAF branch from Github

First, clone the repository with git from the command line, like so:

git clone -b alignio-maf git://github.com/polyatail/biopython.git alignio-maf

If you're using an older version of git, you may need to do this:

git clone git://github.com/polyatail/biopython.git alignio-maf
cd alignio-maf
git checkout -b alignio-maf origin/alignio-maf

To access the MAF parser, you'll need to manually specify the path to it in your code, as in:

import sys
 
# replace "./alignio-maf" with the full path of the alignio-maf branch
# you cloned from github, or keep if it's in the current directory
sys.path.insert(1, "./alignio-maf")
 
try:
    from Bio.AlignIO import MafIO
except ImportError:
    print "oops, the import didn't work"

For help, contact the branch maintainer Andrew Sczesnak (firstname dot lastname at med dot nyu dot edu) or the BioPython Developers.

Reading in a MAF file

Parsing a MAF file is similar to any other alignment file in AlignIO. Additional data, however, is stored as a dict in the .annotations property of SeqRecords belonging to returned MultipleSeqAlignment objects.

Annotations available in SeqRecords

Key Type Value
start integer The start position in the source sequence of this alignment
size integer The ungapped length of this sequence
strand enum("+", "-") The strand this sequence originates from on the source sequence/chromosome
srcSize integer The total length of the source sequence/chromosome

Example

from Bio import AlignIO
 
for multiple_alignment in AlignIO.parse("chr10.maf", "maf"):
    print "printing a new multiple alignment"
 
    for seqrec in multiple_alignment:
        print "starts at %s on the %s strand of a sequence %s in length, and runs for %s bp" % \
              (seqrec.annotations["start"],
               seqrec.annotations["strand"],
               seqrec.annotations["srcSize"],
               seqrec.annotations["size"])

MafIndex

Biopython may soon provide an interface for fast access to the multiple alignment of several sequences across an arbitrary interval: for example, chr10:25,079,604-25,243,324 in mm9. As MAF files are available for entire chromosomes, they can be indexed by chromosome position and accessed at random. This functionality would be available in the class Bio.AlignIO.MafIO.MafIndex.

Creating or loading a MAF index

Indexes are created by determining the chromosome start and end position for a specific sequence name (generally a species), which must appear in every alignment block in the file. An index can be generated for only one species at a time. In whole-genome alignments generated by Multiz, the chromosome of one species is generally used as the reference to which other species are aligned. This reference species will appear in every block, and should be used as the target_seqname parameter. For UCSC multiz files, the form of species.chromosome is used.

To index a MAF file, or load an existing index, create a new MafIO.MafIndex object. If the index database file sqlite_file does not exist, it will be created, otherwise it will be loaded.

# index mouse chr10 from UCSC and store it in a file for later use
 
from Bio.AlignIO import MafIO
 
# idx = MafIO.MafIndex(sqlite_file, maf_file, target_seqname)
idx = MafIO.MafIndex("chr10.mafindex", "chr10.maf", "mm9.chr10")

Retrieving alignments overlapping a given interval

The MafIO.MafIndex.search() generator function accepts a list of start and end positions, and yields MultipleSeqAlignment objects that overlap the given intervals. This is particularly useful for obtaining alignments over the multiple exons of a single transcript, eliminating the need to retrieve an entire locus.

# count the number of bases in danRer5 (Zebrafish) that align to the
# Pcmt1 locus in mouse
 
from Bio.AlignIO.MafIO import MafIndex
 
idx = MafIndex("chr10.mafindex", "chr10.maf", "mm9.chr10")
results = idx.search([7350034], [7383048])
 
total_bases = 0
 
for multiple_alignment in results:
    for seqrec in multiple_alignment:
        if seqrec.id.startswith("danRer5"):
            # don't count gaps as bases
            total_bases += len(str(seqrec.seq).replace("-", ""))
 
print "a total of %s bases align" % total_bases

Retrieving a pre-spliced alignment over a given set of exons

The MafIO.MafIndex.get_spliced() function accepts a list of start and end positions representing exons, and returns a single MultipleSeqAlignment object of the in silico spliced transcript from the reference and all aligned sequences. If part of the sequence range is not found in a particular species in the alignment, dashes ("-") are used to fill the gaps, or "N"s if the sequence is not present in the reference (target_seqname) sequence. If strand is opposite that in the reference sequence, all sequences in the returned alignment will be reverse complemented.

# convert the alignment for mouse Foxo3 (NM_019740) from MAF to FASTA
 
from Bio import AlignIO
 
idx = AlignIO.MafIO.MafIndex("chr10.mafindex", "chr10.maf", "mm9.chr10")
 
multiple_alignment = idx.get_spliced([41905591, 41916271, 41994621, 41996331],
                                     [41906101, 41917707, 41995347, 41996548],
                                     strand = "+")
 
AlignIO.write(multiple_alignment, "mm9_foxo3.fa", "fasta")
# find every gene on chr10 in the current UCSC refGene database,
# retrieve its spliced multiple alignment, and write it to
# a FASTA file in the current directory
#
# depends: MySQLdb
 
import MySQLdb
from Bio import AlignIO
 
# connect to UCSC's live MySQL database
mysql_conn = MySQLdb.connect(host = "genome-mysql.cse.ucsc.edu",
                             user = "genome",
                             passwd = "", 
                             db = "mm9")
 
db_conn = mysql_conn.cursor(MySQLdb.cursors.DictCursor)
 
# load MAF index
idx = AlignIO.MafIO.MafIndex("chr10.mafindex", "chr10.maf", "mm9.chr10")
 
# fetch all records on chr10
db_conn.execute("SELECT * FROM refGene WHERE chrom = 'chr10'")
 
for record in db_conn.fetchall():
    multiple_alignment = idx.get_spliced(map(int, record["exonStarts"].split(",")[:-1]),
                                         map(int, record["exonEnds"].split(",")[:-1]),
                                         strand = record["strand"])
 
    print "writing %s.fa" % record["name"]
 
    AlignIO.write(multiple_alignment,
                  "%s.fa" % record["name"],
                  "fasta")

Format

track name=euArc visibility=pack
##maf version=1 scoring=tba.v8 
# tba.v8 (((human chimp) baboon) (mouse rat)) 
                   
a score=23262.0     
s hg18.chr7    27578828 38 + 158545518 AAA-GGGAATGTTAACCAAATGA---ATTGTCTCTTACGGTG
s panTro1.chr6 28741140 38 + 161576975 AAA-GGGAATGTTAACCAAATGA---ATTGTCTCTTACGGTG
s baboon         116834 38 +   4622798 AAA-GGGAATGTTAACCAAATGA---GTTGTCTCTTATGGTG
s mm4.chr6     53215344 38 + 151104725 -AATGGGAATGTTAAGCAAACGA---ATTGTCTCTCAGTGTG
s rn3.chr4     81344243 40 + 187371129 -AA-GGGGATGCTAAGCCAATGAGTTGTTGTCTCTCAATGTG
                   
a score=5062.0                    
s hg18.chr7    27699739 6 + 158545518 TAAAGA
s panTro1.chr6 28862317 6 + 161576975 TAAAGA
s baboon         241163 6 +   4622798 TAAAGA 
s mm4.chr6     53303881 6 + 151104725 TAAAGA
s rn3.chr4     81444246 6 + 187371129 taagga

a score=6636.0
s hg18.chr7    27707221 13 + 158545518 gcagctgaaaaca
s panTro1.chr6 28869787 13 + 161576975 gcagctgaaaaca
s baboon         249182 13 +   4622798 gcagctgaaaaca
s mm4.chr6     53310102 13 + 151104725 ACAGCTGAAAATA
Personal tools
Namespaces
Variants
Actions
Navigation
Toolbox