Bio.Align.stockholm module

Bio.Align support for alignment files in the Stockholm file format.

You are expected to use this module via the Bio.Align functions.

For example, consider this alignment from PFAM for the HAT helix motif:

# STOCKHOLM 1.0
#=GF ID   HAT
#=GF AC   PF02184.18
#=GF DE   HAT (Half-A-TPR) repeat
#=GF AU   SMART;
#=GF SE   Alignment kindly provided by SMART
#=GF GA   21.00 21.00;
#=GF TC   21.00 21.00;
#=GF NC   20.90 20.90;
#=GF BM   hmmbuild HMM.ann SEED.ann
#=GF SM   hmmsearch -Z 57096847 -E 1000 --cpu 4 HMM pfamseq
#=GF TP   Repeat
#=GF CL   CL0020
#=GF RN   [1]
#=GF RM   9478129
#=GF RT   The HAT helix, a repetitive motif implicated in RNA processing.
#=GF RA   Preker PJ, Keller W;
#=GF RL   Trends Biochem Sci 1998;23:15-16.
#=GF DR   INTERPRO; IPR003107;
#=GF DR   SMART; HAT;
#=GF DR   SO; 0001068; polypeptide_repeat;
#=GF CC   The HAT (Half A TPR) repeat is found in several RNA processing
#=GF CC   proteins [1].
#=GF SQ   3
#=GS CRN_DROME/191-222     AC P17886.2
#=GS CLF1_SCHPO/185-216    AC P87312.1
#=GS CLF1_SCHPO/185-216    DR PDB; 3JB9 R; 185-216;
#=GS O16376_CAEEL/201-233  AC O16376.2
CRN_DROME/191-222                KEIDRAREIYERFVYVH.PDVKNWIKFARFEES
CLF1_SCHPO/185-216               HENERARGIYERFVVVH.PEVTNWLRWARFEEE
#=GR CLF1_SCHPO/185-216    SS    --HHHHHHHHHHHHHHS.--HHHHHHHHHHHHH
O16376_CAEEL/201-233             KEIDRARSVYQRFLHVHGINVQNWIKYAKFEER
#=GC SS_cons                     --HHHHHHHHHHHHHHS.--HHHHHHHHHHHHH
#=GC seq_cons                    KEIDRARuIYERFVaVH.P-VpNWIKaARFEEc
//

Parsing this file using Bio.Align stores the alignment, its annotations, as well as the sequences and their annotations:

>>> from Bio.Align import stockholm
>>> alignments = stockholm.AlignmentIterator("Stockholm/example.sth")
>>> alignment = next(alignments)
>>> alignment.shape
(3, 33)
>>> alignment[0]
'KEIDRAREIYERFVYVH-PDVKNWIKFARFEES'

Alignment meta-data are stored in alignment.annotations:

>>> alignment.annotations["accession"]
'PF02184.18'
>>> alignment.annotations["references"][0]["title"]
'The HAT helix, a repetitive motif implicated in RNA processing.'

Annotations of alignment columns are stored in alignment.column_annotations:

>>> alignment.column_annotations["consensus secondary structure"]
'--HHHHHHHHHHHHHHS.--HHHHHHHHHHHHH'

Sequences and their annotations are stored in alignment.sequences:

>>> alignment.sequences[0].id
'CRN_DROME/191-222'
>>> alignment.sequences[0].seq
Seq('KEIDRAREIYERFVYVHPDVKNWIKFARFEES')
>>> alignment.sequences[1].letter_annotations["secondary structure"]
'--HHHHHHHHHHHHHHS--HHHHHHHHHHHHH'

Slicing specific columns of an alignment will slice any per-column-annotations:

>>> alignment.column_annotations["consensus secondary structure"]
'--HHHHHHHHHHHHHHS.--HHHHHHHHHHHHH'
>>> part_alignment = alignment[:,10:20]
>>> part_alignment.column_annotations["consensus secondary structure"]
'HHHHHHS.--'

class Bio.Align.stockholm.AlignmentIterator(source)

Bases: Bio.Align.interfaces.AlignmentIterator

Alignment iterator for alignment files in the Stockholm format.

The file may contain multiple concatenated alignments, which are loaded and returned incrementally.

Alignment meta-data (lines starting with #=GF) are stored in the dictionary alignment.annotations. Column annotations (lines starting with #=GC) are stored in the dictionary alignment.column_annotations. Sequence names are stored in record.id. Sequence record meta-data (lines starting with #=GS) are stored in the dictionary record.annotations. Sequence letter annotations (lines starting with #=GR) are stored in the dictionary record.letter_annotations.

Wrap-around alignments are not supported - each sequence must be on a single line.

For more information on the file format, please see: http://sonnhammer.sbc.su.se/Stockholm.html https://en.wikipedia.org/wiki/Stockholm_format

gf_mapping = {'**': '**', 'AC': 'accession', 'AU': 'author', 'BM': 'build method', 'CB': 'calibration method', 'CC': 'comment', 'CL': 'clan', 'DE': 'definition', 'GA': 'gathering method', 'ID': 'identifier', 'NC': 'noise cutoff', 'PI': 'previous identifier', 'SE': 'source of seed', 'SM': 'search method', 'SS': 'source of structure', 'TC': 'trusted cutoff', 'TP': 'type', 'WK': 'wikipedia'}

gr_mapping = {'AS': 'active site', 'CSA': 'Catalytic Site Atlas', 'IN': 'intron', 'LI': 'ligand binding', 'PP': 'posterior probability', 'SA': 'surface accessibility', 'SS': 'secondary structure', 'TM': 'transmembrane', 'pAS': 'active site - Pfam predicted', 'sAS': 'active site - from SwissProt'}

gc_mapping = {'2L3J_B_SS': '2L3J B SS', 'AS_cons': 'consensus active site', 'CORE': 'CORE', 'CSA_cons': 'consensus Catalytic Site Atlas', 'IN_cons': 'consensus intron', 'LI_cons': 'consensus ligand binding', 'MM': 'model mask', 'PK': 'PK', 'PK_SS': 'PK SS', 'PP_cons': 'consensus posterior probability', 'RF': 'reference coordinate annotation', 'RNA_elements': 'RNA elements', 'RNA_ligand_AdoCbl': 'RNA ligand AdoCbl', 'RNA_ligand_AqCbl': 'RNA ligand AqCbl', 'RNA_ligand_FMN': 'RNA ligand FMN', 'RNA_ligand_Guanidinium': 'RNA ligand Guanidinium', 'RNA_ligand_SAM': 'RNA ligand SAM', 'RNA_ligand_THF_1': 'RNA ligand THF 1', 'RNA_ligand_THF_2': 'RNA ligand THF 2', 'RNA_ligand_TPP': 'RNA ligand TPP', 'RNA_ligand_preQ1': 'RNA ligand preQ1', 'RNA_motif_k_turn': 'RNA motif k turn', 'RNA_structural_element': 'RNA structural element', 'RNA_structural_elements': 'RNA structural elements', 'Repeat_unit': 'Repeat unit', 'SA_cons': 'consensus surface accessibility', 'SS_cons': 'consensus secondary structure', 'TM_cons': 'consensus transmembrane', 'cons': 'cons', 'pAS_cons': 'consensus active site - Pfam predicted', 'sAS_cons': 'consensus active site - from SwissProt', 'scorecons': 'consensus score', 'scorecons_70': 'consensus score 70', 'scorecons_80': 'consensus score 80', 'scorecons_90': 'consensus score 90', 'seq_cons': 'consensus sequence'}

gs_mapping = {'AC': 'accession', 'LO': 'look', 'OC': 'organism classification', 'OS': 'organism'}

__init__(source)

Create an AlignmentIterator object.

Arguments:

source - input data or file name

__abstractmethods__ = frozenset({})

key = 'IN'

keyword = 'cons'

value = 'intron'

class Bio.Align.stockholm.AlignmentWriter(target, mode='w')

Bases: Bio.Align.interfaces.AlignmentWriter

Alignment file writer for the Stockholm file format.

gf_mapping = {'**': '**', 'accession': 'AC', 'author': 'AU', 'build method': 'BM', 'calibration method': 'CB', 'clan': 'CL', 'comment': 'CC', 'definition': 'DE', 'gathering method': 'GA', 'identifier': 'ID', 'noise cutoff': 'NC', 'previous identifier': 'PI', 'search method': 'SM', 'source of seed': 'SE', 'source of structure': 'SS', 'trusted cutoff': 'TC', 'type': 'TP', 'wikipedia': 'WK'}

gs_mapping = {'accession': 'AC', 'look': 'LO', 'organism': 'OS', 'organism classification': 'OC'}

gr_mapping = {'Catalytic Site Atlas': 'CSA', 'active site': 'AS', 'active site - Pfam predicted': 'pAS', 'active site - from SwissProt': 'sAS', 'intron': 'IN', 'ligand binding': 'LI', 'posterior probability': 'PP', 'secondary structure': 'SS', 'surface accessibility': 'SA', 'transmembrane': 'TM'}

gc_mapping = {'2L3J B SS': '2L3J_B_SS', 'CORE': 'CORE', 'PK': 'PK', 'PK SS': 'PK_SS', 'RNA elements': 'RNA_elements', 'RNA ligand AdoCbl': 'RNA_ligand_AdoCbl', 'RNA ligand AqCbl': 'RNA_ligand_AqCbl', 'RNA ligand FMN': 'RNA_ligand_FMN', 'RNA ligand Guanidinium': 'RNA_ligand_Guanidinium', 'RNA ligand SAM': 'RNA_ligand_SAM', 'RNA ligand THF 1': 'RNA_ligand_THF_1', 'RNA ligand THF 2': 'RNA_ligand_THF_2', 'RNA ligand TPP': 'RNA_ligand_TPP', 'RNA ligand preQ1': 'RNA_ligand_preQ1', 'RNA motif k turn': 'RNA_motif_k_turn', 'RNA structural element': 'RNA_structural_element', 'RNA structural elements': 'RNA_structural_elements', 'Repeat unit': 'Repeat_unit', 'cons': 'cons', 'consensus Catalytic Site Atlas': 'CSA_cons', 'consensus active site': 'AS_cons', 'consensus active site - Pfam predicted': 'pAS_cons', 'consensus active site - from SwissProt': 'sAS_cons', 'consensus intron': 'IN_cons', 'consensus ligand binding': 'LI_cons', 'consensus posterior probability': 'PP_cons', 'consensus score': 'scorecons', 'consensus score 70': 'scorecons_70', 'consensus score 80': 'scorecons_80', 'consensus score 90': 'scorecons_90', 'consensus secondary structure': 'SS_cons', 'consensus sequence': 'seq_cons', 'consensus surface accessibility': 'SA_cons', 'consensus transmembrane': 'TM_cons', 'model mask': 'MM', 'reference coordinate annotation': 'RF'}

format_alignment(alignment): Return a string with a single alignment in the Stockholm format.