Bio.GenBank.Scanner module
Internal code for parsing GenBank and EMBL files (PRIVATE).
This code is NOT intended for direct use. It provides a basic scanner (for use with an event consumer such as Bio.GenBank._FeatureConsumer) to parse a GenBank or EMBL file (with their shared INSDC feature table).
It is used by Bio.GenBank to parse GenBank files. It is also used by Bio.SeqIO to parse GenBank and EMBL files.
Feature Table Documentation:
- class Bio.GenBank.Scanner.InsdcScanner(debug=0)
Bases: object
Basic functions for breaking up a GenBank/EMBL file into sub sections.
The International Nucleotide Sequence Database Collaboration (INSDC) is a collaboration between the DDBJ, EMBL, and GenBank. These organisations all use the same “Feature Table” layout in their plain text flat file formats.
However, the header and sequence sections of an EMBL file are very different in layout to those produced by GenBank/DDBJ.
- RECORD_START = 'XXX'
- HEADER_WIDTH = 3
- FEATURE_START_MARKERS = ['XXX***FEATURES***XXX']
- FEATURE_END_MARKERS = ['XXX***END FEATURES***XXX']
- FEATURE_QUALIFIER_INDENT = 0
- FEATURE_QUALIFIER_SPACER = ''
- SEQUENCE_HEADERS = ['XXX']
- __init__(debug=0)
Initialize the class.
- set_handle(handle)
Set the handle attribute.
- find_start()
Read in lines until the ID/LOCUS line is found, and return that line.
Any preamble (such as the header used by the NCBI on *.seq.gz archives) will be ignored.
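The preamble-skipping behaviour described for find_start() can be pictured with a small stand-alone sketch. This is not the Biopython implementation: `find_record_start` is a hypothetical helper, the sample text is made up, and the record start string is assumed to be "LOCUS" padded to the HEADER_WIDTH of 12 documented for GenBankScanner.

```python
import io


def find_record_start(handle, record_start):
    """Return the first line starting with record_start, or None at end of file.

    Hypothetical helper for illustration; not part of Bio.GenBank.Scanner.
    """
    for line in handle:
        if line.startswith(record_start):
            return line
    return None


# Assumption: the GenBank record start is "LOCUS" padded to 12 characters.
record_start = "LOCUS".ljust(12)

# Made-up input: two preamble lines, then a toy LOCUS line.
sample = io.StringIO(
    "EXAMPLE.SEQ    preamble written by an archive tool\n"
    "\n"
    + record_start + "DEMO0001 12 bp DNA linear\n"
    + "DEFINITION".ljust(12) + "Toy record.\n"
)

start_line = find_record_start(sample, record_start)
print(repr(start_line))
```

Because the preamble lines never start with the record start string, they are read and discarded, and only the LOCUS line is returned.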
- parse_header()
Return list of strings making up the header.
New line characters are removed.
Assumes you have just read in the ID/LOCUS line.
- parse_features(skip=False)
Return list of tuples for the features (if present).
Each feature is returned as a tuple (key, location, qualifiers) where key and location are strings (e.g. “CDS” and “complement(join(490883..490885,1..879))”) while qualifiers is a list of two string tuples (feature qualifier keys and values).
Assumes you have already read to the start of the features table.
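As a rough illustration of how a feature table can be broken into per-feature chunks before each one is parsed, the sketch below groups lines by whether the first 21 columns (FEATURE_QUALIFIER_INDENT) hold a feature key or are blank. The helper names `gb_line` and `group_feature_lines` and the toy features are assumptions made for this example; the real scanner handles more edge cases.

```python
FEATURE_QUALIFIER_INDENT = 21  # value documented for both concrete scanners


def gb_line(key, text):
    """Build a GenBank-style feature line: 5 spaces, key padded to column 21."""
    return f"     {key:<16}{text}"


def group_feature_lines(lines, indent=FEATURE_QUALIFIER_INDENT):
    """Group feature table lines into (key, [stripped lines]) pairs.

    A non-blank key column starts a new feature; a blank key column
    continues the previous feature. Hypothetical helper for illustration.
    """
    features = []
    for line in lines:
        key = line[:indent].strip()
        rest = line[indent:].rstrip()
        if key:
            features.append((key, [rest]))
        elif features:
            features[-1][1].append(rest)
    return features


# Toy feature table (made up for this example).
table = [
    gb_line("source", "1..12"),
    gb_line("", '/organism="Example organism"'),
    gb_line("CDS", "complement(1..9)"),
    gb_line("", "/codon_start=1"),
]

grouped = group_feature_lines(table)
for key, feature_lines in grouped:
    print(key, feature_lines)
```

Each (key, lines) pair has the shape that parse_feature(feature_key, lines) expects as input.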
- parse_feature(feature_key, lines)
Parse a feature given as a list of strings into a tuple.
Expects a feature as a list of strings, and returns a tuple (key, location, qualifiers).
For example given this GenBank feature:
CDS             complement(join(490883..490885,1..879))
                /locus_tag="NEQ001"
                /note="conserved hypothetical [Methanococcus jannaschii];
                COG1583:Uncharacterized ACR; IPR001472:Bipartite nuclear
                localization signal; IPR002743: Protein of unknown
                function DUF57"
                /codon_start=1
                /transl_table=11
                /product="hypothetical protein"
                /protein_id="NP_963295.1"
                /db_xref="GI:41614797"
                /db_xref="GeneID:2732620"
                /translation="MRLLLELKALNSIDKKQLSNYLIQGFIYNILKNTEYSWLHNWKK
                EKYFNFTLIPKKDIIENKRYYLIISSPDKRFIEVLHNKIKDLDIITIGLAQFQLRKTK
                KFDPKLRFPWVTITPIVLREGKIVILKGDKYYKVFVKRLEELKKYNLIKKKEPILEEP
                IEISLNQIKDGWKIIDVKDRYYDFRNKSFSAFSNWLRDLKEQSLRKYNNFCGKNFYFE
                EAIFEGFTFYKTVSIRIRINRGEAVYIGTLWKELNVYRKLDKEEREFYKFLYDCGLGS
                LNSMGFGFVNTKKNSAR"
You should then pass key="CDS" and the rest of the data as a list of strings, lines=["complement(join(490883..490885,1..879))", …, "LNSMGFGFVNTKKNSAR"], where the leading spaces and trailing newlines have been removed.
Returns a tuple containing (key as string, location string, qualifiers as list), as follows for this example:
key = "CDS", string
location = "complement(join(490883..490885,1..879))", string
qualifiers = list of string tuples:
[('locus_tag', '"NEQ001"'),
 ('note', '"conserved hypothetical [Methanococcus jannaschii];\nCOG1583:…"'),
 ('codon_start', '1'),
 ('transl_table', '11'),
 ('product', '"hypothetical protein"'),
 ('protein_id', '"NP_963295.1"'),
 ('db_xref', '"GI:41614797"'),
 ('db_xref', '"GeneID:2732620"'),
 ('translation', '"MRLLLELKALNSIDKKQLSNYLIQGFIYNILKNTEYSWLHNWKK\nEKYFNFT…"')]
In the above example, the "note" and "translation" were edited for compactness, and they would contain multiple new line characters (displayed above as \n).
If a qualifier is quoted (in this case, everything except codon_start and transl_table) then the quotes are NOT removed.
Note that no whitespace is removed.
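The input and output shapes described for parse_feature() can be mimicked with a simplified sketch. `split_feature` below is a hypothetical stand-in, not the library's code: it assumes a qualifier starts on a line beginning with "/" unless a quoted value is still open, it ignores escaped quotes, and it stores None for a qualifier without a value. The note text is abridged from the example above.

```python
def _has_open_quote(qualifiers):
    """Return True if the last qualifier value opens a quote it has not closed."""
    if not qualifiers:
        return False
    value = qualifiers[-1][1]
    if not value or not value.startswith('"'):
        return False
    return len(value) == 1 or not value.endswith('"')


def split_feature(feature_key, lines):
    """Split stripped feature lines into (key, location, qualifiers).

    Simplified illustration: quotes are kept, and continuation lines are
    joined to the qualifier value with a newline character.
    """
    location_parts = []
    qualifiers = []  # list of [name, value] pairs while building
    for line in lines:
        if not qualifiers and not line.startswith("/"):
            # Still inside the (possibly multi-line) location string.
            location_parts.append(line)
        elif line.startswith("/") and not _has_open_quote(qualifiers):
            name, sep, value = line[1:].partition("=")
            qualifiers.append([name, value if sep else None])
        else:
            # Continuation of the previous qualifier's value.
            qualifiers[-1][1] = (qualifiers[-1][1] or "") + "\n" + line
    return feature_key, "".join(location_parts), [tuple(q) for q in qualifiers]


lines = [
    "complement(join(490883..490885,1..879))",
    '/locus_tag="NEQ001"',
    '/note="conserved hypothetical [Methanococcus jannaschii];',
    'COG1583:Uncharacterized ACR"',
    "/codon_start=1",
    "/transl_table=11",
    '/product="hypothetical protein"',
]

key, location, qualifiers = split_feature("CDS", lines)
print(key, location)
for name, value in qualifiers:
    print(name, repr(value))
```

As in the documented behaviour, quoted values keep their quotes and unquoted values such as codon_start are returned as plain strings.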
- parse_footer()
Return a tuple containing a list of any misc strings, and the sequence.
- feed(handle, consumer, do_features=True)
Feed a set of data into the consumer.
This method is intended for use with the “old” code in Bio.GenBank.
- Arguments:
handle - A handle with the information to parse.
consumer - The consumer that should be informed of events.
do_features - Boolean, should the features be parsed? Skipping the features can be much faster.
- Return values:
True - A record was found and passed to the consumer.
False - No record was found.
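The scanner/consumer split used by feed() follows an event-driven pattern: the scanner walks the file and calls a method on the consumer for each piece it finds. The sketch below shows only the pattern. `RecordingConsumer`, its method names, and `feed_features` are hypothetical and chosen for this example; they should not be read as the interface of Bio.GenBank._FeatureConsumer.

```python
class RecordingConsumer:
    """Hypothetical consumer that records every event it receives."""

    def __init__(self):
        self.events = []

    def feature_key(self, key):
        self.events.append(("feature_key", key))

    def location(self, location):
        self.events.append(("location", location))

    def feature_qualifier(self, name, value):
        self.events.append(("feature_qualifier", name, value))


def feed_features(features, consumer):
    """Send one event per feature key, location, and qualifier.

    `features` is a list of (key, location, qualifiers) tuples, the shape
    described for parse_features(). Illustrative only.
    """
    for key, location, qualifiers in features:
        consumer.feature_key(key)
        consumer.location(location)
        for name, value in qualifiers:
            consumer.feature_qualifier(name, value)


features = [("CDS", "complement(1..9)", [("codon_start", "1")])]
consumer = RecordingConsumer()
feed_features(features, consumer)
print(consumer.events)
```

Keeping the file-walking logic separate from the object-building logic is what lets the same scanner serve both the older Bio.GenBank code and Bio.SeqIO.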
- parse(handle, do_features=True)
Return a SeqRecord (with SeqFeatures if do_features=True).
See also the method parse_records() for use on multi-record files.
- parse_records(handle, do_features=True)
Parse records, return a SeqRecord object iterator.
Each record (from the ID/LOCUS line to the // line) becomes a SeqRecord.
The SeqRecord objects include SeqFeatures if do_features=True.
This method is intended for use in Bio.SeqIO.
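The record boundaries mentioned above (from the ID/LOCUS line to the // line) can be illustrated with a small generator. `iter_record_lines` is a hypothetical helper that yields raw lines rather than SeqRecord objects, and the EMBL-style sample text is made up. The record start is assumed to be "ID" padded to the HEADER_WIDTH of 5 documented for EmblScanner.

```python
import io


def iter_record_lines(handle, record_start):
    """Yield each record as a list of lines, from its start line to "//".

    Hypothetical helper for illustration; text outside records is skipped.
    """
    record = None  # None means "still looking for a record start"
    for line in handle:
        line = line.rstrip("\n")
        if record is None:
            if line.startswith(record_start):
                record = [line]
        else:
            record.append(line)
            if line.strip() == "//":
                yield record
                record = None


record_start = "ID".ljust(5)  # assumption: "ID" padded to 5 characters

# Made-up two-record input.
sample = io.StringIO(
    record_start + "DEMO1; toy record\n"
    "XX\n"
    "//\n"
    + record_start + "DEMO2; another toy record\n"
    "//\n"
)

records = list(iter_record_lines(sample, record_start))
print(len(records))
```

In normal use this iteration is reached through Bio.SeqIO.parse() with the format name "genbank" or "embl", rather than by calling the scanner directly.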
- parse_cds_features(handle, alphabet=None, tags2id=('protein_id', 'locus_tag', 'product'))
Parse CDS features, return SeqRecord object iterator.
Each CDS feature becomes a SeqRecord.
- Arguments:
alphabet - Obsolete, should be left as None.
tags2id - Tuple of three strings, the feature qualifier keys to use for the record id, name and description.
This method is intended for use in Bio.SeqIO.
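One way to read the tags2id argument is as a lookup from three qualifier keys to the id, name and description of the resulting record. The sketch below shows that reading on the qualifiers from the CDS example earlier on this page. `pick_identifiers` is a hypothetical helper; taking the first value for a repeated key and stripping the surrounding quotes are assumptions of this sketch, not documented behaviour.

```python
def pick_identifiers(qualifiers, tags2id=("protein_id", "locus_tag", "product")):
    """Map three qualifier keys onto id, name and description.

    Hypothetical helper for illustration. Missing keys map to None.
    """
    first_values = {}
    for key, value in qualifiers:
        # Assumption: keep the first value seen and drop surrounding quotes.
        first_values.setdefault(key, (value or "").strip('"'))
    id_tag, name_tag, description_tag = tags2id
    return {
        "id": first_values.get(id_tag),
        "name": first_values.get(name_tag),
        "description": first_values.get(description_tag),
    }


qualifiers = [
    ("locus_tag", '"NEQ001"'),
    ("codon_start", "1"),
    ("product", '"hypothetical protein"'),
    ("protein_id", '"NP_963295.1"'),
    ("db_xref", '"GI:41614797"'),
    ("db_xref", '"GeneID:2732620"'),
]

identifiers = pick_identifiers(qualifiers)
print(identifiers)
```

Passing a different tuple, for example one that puts "locus_tag" first, would change which qualifier supplies the record id.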
- class Bio.GenBank.Scanner.EmblScanner(debug=0)
Bases: Bio.GenBank.Scanner.InsdcScanner
For extracting chunks of information in EMBL files.
- RECORD_START = 'ID   '
- HEADER_WIDTH = 5
- FEATURE_START_MARKERS = ['FH   Key             Location/Qualifiers', 'FH']
- FEATURE_END_MARKERS = ['XX']
- FEATURE_QUALIFIER_INDENT = 21
- FEATURE_QUALIFIER_SPACER = 'FT                   '
- SEQUENCE_HEADERS = ['SQ', 'CO']
- EMBL_INDENT = 5
- EMBL_SPACER = '     '
- parse_footer()
Return a tuple containing a list of any misc strings, and the sequence.
- class Bio.GenBank.Scanner.GenBankScanner(debug=0)
Bases: Bio.GenBank.Scanner.InsdcScanner
For extracting chunks of information in GenBank files.
- RECORD_START = 'LOCUS       '
- HEADER_WIDTH = 12
- FEATURE_START_MARKERS = ['FEATURES             Location/Qualifiers', 'FEATURES']
- FEATURE_END_MARKERS = []
- FEATURE_QUALIFIER_INDENT = 21
- FEATURE_QUALIFIER_SPACER = '                     '
- SEQUENCE_HEADERS = ['CONTIG', 'ORIGIN', 'BASE COUNT', 'WGS', 'TSA', 'TLS']
- GENBANK_INDENT = 12
- GENBANK_SPACER = '            '
- STRUCTURED_COMMENT_START = '-START##'
- STRUCTURED_COMMENT_END = '-END##'
- STRUCTURED_COMMENT_DELIM = ' :: '
- parse_footer()
Return a tuple containing a list of any misc strings, and the sequence.
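The three STRUCTURED_COMMENT_* constants listed for GenBankScanner suggest how a structured comment block is delimited: a line ending in "-START##" opens a named block, lines containing " :: " carry key/value pairs, and a line ending in "-END##" closes the block. The sketch below parses a made-up comment on that basis. `parse_structured_comment` is a hypothetical helper, and the nested dictionary it returns is a choice made for this example rather than a statement about what the library produces.

```python
STRUCTURED_COMMENT_START = "-START##"
STRUCTURED_COMMENT_END = "-END##"
STRUCTURED_COMMENT_DELIM = " :: "


def parse_structured_comment(lines):
    """Parse structured comment lines into {block name: {key: value}}.

    Hypothetical helper for illustration; lines outside a block are ignored.
    """
    result = {}
    current = None  # dictionary for the block being read, if any
    for raw in lines:
        line = raw.strip()
        if line.endswith(STRUCTURED_COMMENT_START):
            name = line[: -len(STRUCTURED_COMMENT_START)].lstrip("#")
            current = result.setdefault(name, {})
        elif line.endswith(STRUCTURED_COMMENT_END):
            current = None
        elif current is not None and STRUCTURED_COMMENT_DELIM in line:
            key, value = line.split(STRUCTURED_COMMENT_DELIM, 1)
            current[key.strip()] = value.strip()
    return result


# Made-up comment block in the GenBank structured comment layout.
comment_lines = [
    "##Genome-Assembly-Data-START##",
    "Assembly Method       :: Example Assembler v. 1.0",
    "Coverage              :: 20x",
    "##Genome-Assembly-Data-END##",
]

parsed = parse_structured_comment(comment_lines)
print(parsed)
```

Splitting on the delimiter only once keeps any later " :: " inside the value intact.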