Multiple Alignment Format
(Draft) |
|||
| Line 1: | Line 1: | ||
| − | [[Category:Formats|Formats]] | + | [http://www.example.com link title][[Category:Formats|Formats]] |
==MafIndex== | ==MafIndex== | ||
Biopython provides an interface for fast access to the multiple alignment of several sequences across an arbitrary interval: for example, chr10:25,079,604-25,243,324 in mm9. As MAF files are available for entire chromosomes, they can be indexed by chromosome position and accessed at random. This functionality is available in the class MafIO.MafIndex. | Biopython provides an interface for fast access to the multiple alignment of several sequences across an arbitrary interval: for example, chr10:25,079,604-25,243,324 in mm9. As MAF files are available for entire chromosomes, they can be indexed by chromosome position and accessed at random. This functionality is available in the class MafIO.MafIndex. | ||
| + | |||
| + | All examples below make use of the Multiz [http://hgdownload.cse.ucsc.edu/goldenPath/currentGenomes/Mus_musculus/multiz30way/maf/chr10.maf.gz 30-way alignment to mouse chromosome 10] available from UCSC. | ||
===Creating or loading a MAF index=== | ===Creating or loading a MAF index=== | ||
| Line 8: | Line 10: | ||
Indexes are created by determining the chromosome start and end position for a specific sequence name (generally a species), which must appear in every alignment block in the file. In whole-genome alignments generated by Multiz, the chromosome of one species is generally used as the reference to which other species are aligned. This reference species will appear in every block, and should be used as the ''target_seqname'' parameter. For UCSC multiz files, the form of '''species.chromosome''' is used. | Indexes are created by determining the chromosome start and end position for a specific sequence name (generally a species), which must appear in every alignment block in the file. In whole-genome alignments generated by Multiz, the chromosome of one species is generally used as the reference to which other species are aligned. This reference species will appear in every block, and should be used as the ''target_seqname'' parameter. For UCSC multiz files, the form of '''species.chromosome''' is used. | ||
| − | To index a MAF file, or load an existing index, create a new MafIndex object. If the index database file ''sqlite_file'' does not exist it will be created, otherwise it will be loaded. | + | To index a MAF file, or load an existing index, create a new MafIndex object. If the index database file ''sqlite_file'' does not exist, it will be created, otherwise it will be loaded. |
<pre> | <pre> | ||
| + | # index mouse chr10 from UCSC and store it in a file for later use | ||
| + | |||
from Bio.AlignIO import MafIO.MafIndex | from Bio.AlignIO import MafIO.MafIndex | ||
| − | idx = MafIO.MafIndex( | + | idx = MafIO.MafIndex("chr10.mafindex", "chr10.maf", "mm9.chr10") |
</pre> | </pre> | ||
| Line 21: | Line 25: | ||
<pre> | <pre> | ||
| − | results = idx.search([ | + | # count the number of bases in danRer5 (Zebrafish) that align to the |
| + | # Pcmt1 locus in mouse | ||
| + | |||
| + | from Bio.AlignIO import MafIO.MafIndex as MafIndex | ||
| + | |||
| + | idx = MafIndex("chr10.mafindex", "chr10.maf", "mm9.chr10") | ||
| + | results = idx.search([7350034], [7383048]) | ||
| + | |||
| + | total_bases = 0 | ||
for multiple_alignment in results: | for multiple_alignment in results: | ||
| − | + | for seqrec in multiple_alignment: | |
| + | if seqrec.id.startswith("danRer5"): | ||
| + | # don't count gaps as bases | ||
| + | total_bases += len(str(seqrec.seq).replace("-", "")) | ||
| + | |||
| + | print "a total of %s bases align" % total_bases | ||
</pre> | </pre> | ||
| Line 34: | Line 51: | ||
# convert the alignment for mouse Foxo3 (NM_019740) from MAF to FASTA | # convert the alignment for mouse Foxo3 (NM_019740) from MAF to FASTA | ||
| − | from Bio | + | from Bio import AlignIO |
| − | + | ||
| − | + | ||
| − | idx = MafIO.MafIndex("chr10.mafindex", "chr10.maf", "mm9.chr10") | + | idx = AlignIO.MafIO.MafIndex("chr10.mafindex", "chr10.maf", "mm9.chr10") |
multiple_alignment = idx.get_spliced([41905591, 41916271, 41994621, 41996331], | multiple_alignment = idx.get_spliced([41905591, 41916271, 41994621, 41996331], | ||
| Line 44: | Line 59: | ||
strand = "+") | strand = "+") | ||
| − | + | AlignIO.write(multiple_alignment, "mm9_foxo3.fa", "fasta") | |
| − | + | ||
</pre> | </pre> | ||
| − | |||
| − | |||
| − | |||
| − | |||
| − | |||
| − | |||
| − | |||
| − | |||
| − | |||
| − | |||
| − | |||
| − | |||
| − | |||
| − | |||
Revision as of 17:29, 23 May 2011
Contents |
MafIndex
Biopython provides an interface for fast access to the multiple alignment of several sequences across an arbitrary interval: for example, chr10:25,079,604-25,243,324 in mm9. As MAF files are available for entire chromosomes, they can be indexed by chromosome position and accessed at random. This functionality is available in the class MafIO.MafIndex.
All examples below make use of the Multiz 30-way alignment to mouse chromosome 10 available from UCSC.
Creating or loading a MAF index
Indexes are created by determining the chromosome start and end position for a specific sequence name (generally a species), which must appear in every alignment block in the file. In whole-genome alignments generated by Multiz, the chromosome of one species is generally used as the reference to which other species are aligned. This reference species will appear in every block, and should be used as the target_seqname parameter. For UCSC multiz files, the form of species.chromosome is used.
To index a MAF file, or load an existing index, create a new MafIndex object. If the index database file sqlite_file does not exist, it will be created, otherwise it will be loaded.
# index mouse chr10 from UCSC and store it in a file for later use
from Bio.AlignIO import MafIO.MafIndex
idx = MafIO.MafIndex("chr10.mafindex", "chr10.maf", "mm9.chr10")
Retrieving alignments overlapping a given interval
The search(starts, ends) generator function accepts a list of start and end positions, and yields MultipleSeqAlignment objects that overlap the given intervals. This is particularly useful for obtaining alignments over the multiple exons of a single transcript, eliminating the need to retrieve an entire locus.
# count the number of bases in danRer5 (Zebrafish) that align to the
# Pcmt1 locus in mouse
from Bio.AlignIO import MafIO.MafIndex as MafIndex
idx = MafIndex("chr10.mafindex", "chr10.maf", "mm9.chr10")
results = idx.search([7350034], [7383048])
total_bases = 0
for multiple_alignment in results:
for seqrec in multiple_alignment:
if seqrec.id.startswith("danRer5"):
# don't count gaps as bases
total_bases += len(str(seqrec.seq).replace("-", ""))
print "a total of %s bases align" % total_bases
Retrieving a pre-spliced alignment over a given set of exons
The get_spliced(starts, ends, strand = "+") function accepts a list of start and end positions representing exons, and returns a single MultipleSeqAlignment object of the in silico spliced transcript from the reference and all aligned sequences. If part of the sequence range is not found in a particular species in the alignment, dashes ("-") are used to fill the gaps, or "N"s if the sequence is not present in the reference (target_seqname) sequence. If strand is opposite that in the reference sequence, all sequences in the returned alignment will be reverse complemented.
# convert the alignment for mouse Foxo3 (NM_019740) from MAF to FASTA
from Bio import AlignIO
idx = AlignIO.MafIO.MafIndex("chr10.mafindex", "chr10.maf", "mm9.chr10")
multiple_alignment = idx.get_spliced([41905591, 41916271, 41994621, 41996331],
[41906101, 41917707, 41995347, 41996548],
strand = "+")
AlignIO.write(multiple_alignment, "mm9_foxo3.fa", "fasta")