Bio.AlignIO.StockholmIO module

Bio.AlignIO support for “stockholm” format (used in the PFAM database).

You are expected to use this module via the Bio.AlignIO functions (or the Bio.SeqIO functions if you want to work directly with the gapped sequences).

For example, consider a Stockholm alignment file containing the following:

# STOCKHOLM 1.0
#=GC SS_cons       .................<<<<<<<<...<<<<<<<........>>>>>>>..
AP001509.1         UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-GAUGAGGGU
#=GR AP001509.1 SS -----------------<<<<<<<<---..<<-<<-------->>->>..--
AE007476.1         AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGU
#=GR AE007476.1 SS -----------------<<<<<<<<-----<<.<<-------->>.>>----

#=GC SS_cons       ......<<<<<<<.......>>>>>>>..>>>>>>>>...............
AP001509.1         CUCUAC-AGGUA-CCGUAAA-UACCUAGCUACGAAAAGAAUGCAGUUAAUGU
#=GR AP001509.1 SS -------<<<<<--------->>>>>--->>>>>>>>---------------
AE007476.1         UUCUACAAGGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU
#=GR AE007476.1 SS ------.<<<<<--------->>>>>.-->>>>>>>>---------------
//

This is a single multiple sequence alignment, so you would probably load this using the Bio.AlignIO.read() function:

>>> from Bio import AlignIO
>>> align = AlignIO.read("Stockholm/simple.sth", "stockholm")
>>> print(align)
SingleLetterAlphabet() alignment with 2 rows and 104 columns
UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-G...UGU AP001509.1
AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-C...GAU AE007476.1
>>> for record in align:
...     print("%s %i" % (record.id, len(record)))
AP001509.1 104
AE007476.1 104

This example file is clearly using RNA, so you might want the alignment object (and the SeqRecord objects it holds) to reflect this, rather than simple using the default single letter alphabet as shown above. You can do this with an optional argument to the Bio.AlignIO.read() function:

>>> from Bio import AlignIO
>>> from Bio.Alphabet import generic_rna
>>> align = AlignIO.read("Stockholm/simple.sth", "stockholm",
...                      alphabet=generic_rna)
>>> print(align)
RNAAlphabet() alignment with 2 rows and 104 columns
UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-G...UGU AP001509.1
AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-C...GAU AE007476.1

In addition to the sequences themselves, this example alignment also includes some GR lines for the secondary structure of the sequences. These are strings, with one character for each letter in the associated sequence:

>>> for record in align:
...     print(record.id)
...     print(record.seq)
...     print(record.letter_annotations['secondary_structure'])
AP001509.1
UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-GAUGAGGGUCUCUAC-AGGUA-CCGUAAA-UACCUAGCUACGAAAAGAAUGCAGUUAAUGU
-----------------<<<<<<<<---..<<-<<-------->>->>..---------<<<<<--------->>>>>--->>>>>>>>---------------
AE007476.1
AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGUUUCUACAAGGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU
-----------------<<<<<<<<-----<<.<<-------->>.>>----------.<<<<<--------->>>>>.-->>>>>>>>---------------

Any general annotation for each row is recorded in the SeqRecord’s annotations dictionary. Any per-column annotation for the entire alignment in in the alignment’s column annotations dictionary, such as the secondary structure consensus in this example:

>>> sorted(align.column_annotations.keys())
['secondary_structure']
>>> align.column_annotations["secondary_structure"]
'.................<<<<<<<<...<<<<<<<........>>>>>>>........<<<<<<<.......>>>>>>>..>>>>>>>>...............'

You can output this alignment in many different file formats using Bio.AlignIO.write(), or the MultipleSeqAlignment object’s format method:

>>> print(align.format("fasta"))
>AP001509.1
UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-GAUGAGGGUCUCUAC-A
GGUA-CCGUAAA-UACCUAGCUACGAAAAGAAUGCAGUUAAUGU
>AE007476.1
AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGUUUCUACAA
GGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU

Most output formats won’t be able to hold the annotation possible in a Stockholm file:

>>> print(align.format("stockholm"))
# STOCKHOLM 1.0
#=GF SQ 2
AP001509.1 UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-GAUGAGGGUCUCUAC-AGGUA-CCGUAAA-UACCUAGCUACGAAAAGAAUGCAGUUAAUGU
#=GS AP001509.1 AC AP001509.1
#=GS AP001509.1 DE AP001509.1
#=GR AP001509.1 SS -----------------<<<<<<<<---..<<-<<-------->>->>..---------<<<<<--------->>>>>--->>>>>>>>---------------
AE007476.1 AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGUUUCUACAAGGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU
#=GS AE007476.1 AC AE007476.1
#=GS AE007476.1 DE AE007476.1
#=GR AE007476.1 SS -----------------<<<<<<<<-----<<.<<-------->>.>>----------.<<<<<--------->>>>>.-->>>>>>>>---------------
#=GC SS_cons .................<<<<<<<<...<<<<<<<........>>>>>>>........<<<<<<<.......>>>>>>>..>>>>>>>>...............
//

Note that when writing Stockholm files, AlignIO does not break long sequences up and interleave them (as in the input file shown above). The standard allows this simpler layout, and it is more likely to be understood by other tools.

Finally, as an aside, it can sometimes be useful to use Bio.SeqIO.parse() to iterate over the alignment rows as SeqRecord objects - rather than working with Alignnment objects. Again, if you want to you can specify this is RNA:

>>> from Bio import SeqIO
>>> from Bio.Alphabet import generic_rna
>>> for record in SeqIO.parse("Stockholm/simple.sth", "stockholm",
...                           alphabet=generic_rna):
...     print(record.id)
...     print(record.seq)
...     print(record.letter_annotations['secondary_structure'])
AP001509.1
UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-GAUGAGGGUCUCUAC-AGGUA-CCGUAAA-UACCUAGCUACGAAAAGAAUGCAGUUAAUGU
-----------------<<<<<<<<---..<<-<<-------->>->>..---------<<<<<--------->>>>>--->>>>>>>>---------------
AE007476.1
AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGUUUCUACAAGGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU
-----------------<<<<<<<<-----<<.<<-------->>.>>----------.<<<<<--------->>>>>.-->>>>>>>>---------------

Remember that if you slice a SeqRecord, the per-letter-annotations like the secondary structure string here, are also sliced:

>>> sub_record = record[10:20]
>>> print(sub_record.seq)
AUCGUUUUAC
>>> print(sub_record.letter_annotations['secondary_structure'])
-------<<<

Likewise with the alignment object, as long as you are not dropping any rows, slicing specific columns of an alignment will slice any per-column-annotations:

>>> align.column_annotations["secondary_structure"]
'.................<<<<<<<<...<<<<<<<........>>>>>>>........<<<<<<<.......>>>>>>>..>>>>>>>>...............'
>>> part_align = align[:,10:20]
>>> part_align.column_annotations["secondary_structure"]
'.......<<<'

You can also see this in the Stockholm output of this partial-alignment:

>>> print(part_align.format("stockholm"))
# STOCKHOLM 1.0
#=GF SQ 2
AP001509.1 UCAACACUCU
#=GS AP001509.1 AC AP001509.1
#=GS AP001509.1 DE AP001509.1
#=GR AP001509.1 SS -------<<<
AE007476.1 AUCGUUUUAC
#=GS AE007476.1 AC AE007476.1
#=GS AE007476.1 DE AE007476.1
#=GR AE007476.1 SS -------<<<
#=GC SS_cons .......<<<
//
class Bio.AlignIO.StockholmIO.StockholmWriter(handle)

Bases: Bio.AlignIO.Interfaces.SequentialAlignmentWriter

Stockholm/PFAM alignment writer.

pfam_gr_mapping = {'active_site': 'AS', 'intron': 'IN', 'ligand_binding': 'LI', 'posterior_probability': 'PP', 'secondary_structure': 'SS', 'surface_accessibility': 'SA', 'transmembrane': 'TM'}
pfam_gc_mapping = {'model_mask': 'MM', 'reference_annotation': 'RF'}
pfam_gs_mapping = {'look': 'LO', 'organism': 'OS', 'organism_classification': 'OC'}
write_alignment(self, alignment)

Use this to write (another) single alignment to an open file.

Note that sequences and their annotation are recorded together (rather than having a block of annotation followed by a block of aligned sequences).

class Bio.AlignIO.StockholmIO.StockholmIterator(handle, seq_count=None, alphabet=SingleLetterAlphabet())

Bases: Bio.AlignIO.Interfaces.AlignmentIterator

Loads a Stockholm file from PFAM into MultipleSeqAlignment objects.

The file may contain multiple concatenated alignments, which are loaded and returned incrementally.

This parser will detect if the Stockholm file follows the PFAM conventions for sequence specific meta-data (lines starting #=GS and #=GR) and populates the SeqRecord fields accordingly.

Any annotation which does not follow the PFAM conventions is currently ignored.

If an accession is provided for an entry in the meta data, IT WILL NOT be used as the record.id (it will be recorded in the record’s annotations). This is because some files have (sub) sequences from different parts of the same accession (differentiated by different start-end positions).

Wrap-around alignments are not supported - each sequences must be on a single line. However, interlaced sequences should work.

For more information on the file format, please see: http://sonnhammer.sbc.su.se/Stockholm.html https://en.wikipedia.org/wiki/Stockholm_format http://bioperl.org/formats/alignment_formats/Stockholm_multiple_alignment_format.html

For consistency with BioPerl and EMBOSS we call this the “stockholm” format.

pfam_gr_mapping = {'AS': 'active_site', 'IN': 'intron', 'LI': 'ligand_binding', 'PP': 'posterior_probability', 'SA': 'surface_accessibility', 'SS': 'secondary_structure', 'TM': 'transmembrane'}
pfam_gc_mapping = {'MM': 'model_mask', 'RF': 'reference_annotation'}
pfam_gs_mapping = {'LO': 'look', 'OC': 'organism_classification', 'OS': 'organism'}
__next__(self)

Parse the next alignment from the handle.