Bio.Blast package

Submodules

Module contents

Code to parse and store BLAST XML output, and to invoke the NCBI BLAST web server.

This module provides code to parse and store BLAST XML output, following its definition in the associated BLAST XML DTD file: https://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd

This module also provides code to invoke the BLAST web server provided by NCBI. https://blast.ncbi.nlm.nih.gov/

Variables:

email Set the Blast email parameter (default is None).

tool Set the Blast tool parameter (default is biopython).

exception Bio.Blast.NotXMLError(message)

Bases: ValueError

Failed to parse file as XML.

__init__(message): Initialize the class.

__str__(): Return a string summary of the exception.

exception Bio.Blast.CorruptedXMLError(message)

Bases: ValueError

Corrupted XML.

__init__(message): Initialize the class.

__str__(): Return a string summary of the exception.

class Bio.Blast.HSP(sequences, coordinates=None)

Bases: Alignment

Stores an alignment of one query sequence against a target sequence.

An HSP (High-scoring Segment Pair) stores the alignment of one query sequence segment against one target (hit) sequence segment. The Bio.Blast.HSP class inherits from the Bio.Align.Alignment class.

In addition to the target and query attributes of a Bio.Align.Alignment, a Bio.Blast.HSP object has the following attributes:

score: score of HSP;

annotations: a dictionary that may contain the following keys:

‘bit score’: score (in bits) of HSP (float);

‘evalue’: e-value of HSP (float);

‘identity’: number of identities in HSP (integer);

‘positive’: number of positives in HSP (integer);

‘gaps’: number of gaps in HSP (integer);

‘midline’: formatting middle line.

A Bio.Blast.HSP object behaves the same as a Bio.Align.Alignment` object and can be used as such. However, when printing a Bio.Blast.HSP object, the BLAST e-value and bit score are included in the output (in addition to the alignment itself).

See the documentation of Bio.Blast.Record for a more detailed explanation of how the information in BLAST records is stored in Biopython.

__repr__()

Return a representation of the alignment, including its shape.

The representation cannot be used with eval() to recreate the object, which is usually possible with simple python objects. For example:

The hex string is the memory address of the object and can be used to distinguish different Alignment objects. See help(id) for more information.

>>> import numpy as np
>>> from Bio.Align import Alignment
>>> alignment = Alignment(("ACCGT", "ACGT"),
...                       coordinates = np.array([[0, 2, 3, 5],
...                                               [0, 2, 2, 4],
...                                              ]))
>>> print(alignment)
target            0 ACCGT 5
                  0 ||-|| 5
query             0 AC-GT 4

>>> alignment  
<Alignment object (2 rows x 5 columns) at 0x...>

__str__()

Return a human-readable string representation of the alignment.

For sequence alignments, each line has at most 80 columns. The first 10 columns show the (possibly truncated) sequence name, which may be the id attribute of a SeqRecord, or otherwise ‘target’ or ‘query’ for pairwise alignments. The next 10 columns show the sequence coordinate, using zero-based counting as usual in Python. The remaining 60 columns shown the sequence, using dashes to represent gaps. At the end of the alignment, the end coordinates are shown on the right of the sequence, again in zero-based coordinates.

Pairwise alignments have an additional line between the two sequences showing whether the sequences match (‘|’) or mismatch (‘.’), or if there is a gap (‘-‘). The coordinates shown for this line are the column indices, which can be useful when extracting a subalignment.

For example,

>>> from Bio.Align import PairwiseAligner
>>> aligner = PairwiseAligner()

>>> seqA = "TTAACCCCATTTG"
>>> seqB = "AAGCCCCTTT"
>>> seqC = "AAAGGGGCTT"

>>> alignments = aligner.align(seqA, seqB)
>>> len(alignments)
1
>>> alignment = alignments[0]
>>> print(alignment)
target            0 TTAA-CCCCATTTG 13
                  0 --||-||||-|||- 14
query             0 --AAGCCCC-TTT- 10

Note that seqC is the reverse complement of seqB. Aligning it to the reverse strand gives the same alignment, but the query coordinates are switched:

>>> alignments = aligner.align(seqA, seqC, strand="-")
>>> len(alignments)
1
>>> alignment = alignments[0]
>>> print(alignment)
target            0 TTAA-CCCCATTTG 13
                  0 --||-||||-|||- 14
query            10 --AAGCCCC-TTT-  0

__annotations__ = {}

class Bio.Blast.Hit(alignments=())

Bases: Alignments

Stores a single BLAST hit of one single query against one target.

The Bio.Blast.Hit class inherits from the Bio.Align.Alignments class, which is a subclass of a Python list. The Bio.Blast.Hit class stores Bio.Blast.HSP objwcts, which inherit from Bio.Align.Alignment. A Bio.Blast.Hit object is therefore effectively a list of Bio.Align.Alignment objects. Most hits consist of only 1 or a few Alignment objects.

Each Bio.Blast.Hit object has a target attribute containing the following information:

target.id: seqId of subject;

target.description: definition line of subject;

target.name: accession of subject;

len(target.seq): sequence length of subject.

See the documentation of Bio.Blast.Record for a more detailed explanation of the information stored in the alignments contained in the Bio.Blast.Hit object.

__getitem__(key): x.__getitem__(y) <==> x[y]

__repr__(): Return repr(self).

__str__(): Return a human readable summary of the Hit object.

__abstractmethods__ = frozenset({})

__annotations__ = {}

class Bio.Blast.Record

Bases: list

Stores the BLAST results for a single query.

A Bio.Blast.Record object is a list of Bio.Blast.Hit objects, each corresponding to one hit for the query in the BLAST output.

The Bio.Blast.Record object may have the following attributes:

query: A SeqRecord object which may contain some or all of the
following information:
query.id: SeqId of query;

query.description: Definition line of query;

len(query.seq): Length of the query sequence.
stat: A dictionary with summary statistics of the BLAST run. It may
contain the following keys:
‘db-num’: number of sequences in BLAST db (integer);

‘db-len’: length of BLAST db (integer);

‘hsp-len’: effective HSP length (integer);

‘eff-space’: effective search space (float);

‘kappa’: Karlin-Altschul parameter K (float);

‘lambda’: Karlin-Altschul parameter Lambda (float);

‘entropy’: Karlin-Altschul parameter H (float).
message: Some (error?) information.

Each Bio.Blast.Hit object has a target attribute containing the following information:

target.id: seqId of subject;

target.description: definition line of subject;

target.name: accession of subject;

len(target.seq): sequence length of subject.

The Bio.Blast.Hit class inherits from the Bio.Align.Alignments class, which inherits from a Python list. In this list, the Bio.Blast.Hit object stores Bio.Blast.HSP objects, which inherit from the Bio.Align.Alignment class. A Bio.Blast.Hit object is therefore effectively a list of alignment objects.

Each HSP in a Bio.Blast.Hit object has the attributes target and query attributes, as usual for of a Bio.Align.Alignment object storing a pairwise alignment, pointing to a SeqRecord object representing the target and query, respectively. For translated BLAST searches, the features attribute of the target or query may contain a SeqFeature of type CDS that stores the amino acid sequence region. The qualifiers attribute of such a feature is a dictionary with a single key ‘coded_by’; the corresponding value specifies the nucleotide sequence region, in a GenBank-style string with 1-based coordinates, that encodes the amino acid sequence.

Each Bio.Blast.HSP object has the following additional attributes:

score: score of HSP;

annotations: a dictionary that may contain the following keys:

‘bit score’: score (in bits) of HSP (float);

‘evalue’: e-value of HSP (float);

‘identity’: number of identities in HSP (integer);

‘positive’: number of positives in HSP (integer);

‘gaps’: number of gaps in HSP (integer);

‘midline’: formatting middle line.

>>> from Bio import Blast
>>> record = Blast.read("Blast/xml_2212L_blastx_001.xml")
>>> record.query
SeqRecord(seq=Seq(None, length=556), id='gi|1347369|gb|G25137.1|G25137', name='<unknown name>', description='human STS EST48004, sequence tagged site', dbxrefs=[])
>>> record.stat
{'db-num': 2934173, 'db-len': 1011751523, 'hsp-len': 0, 'eff-space': 0, 'kappa': 0.041, 'lambda': 0.267, 'entropy': 0.14}
>>> len(record)
78
>>> hit = record[0]
>>> type(hit)
<class 'Bio.Blast.Hit'>
>>> from Bio.Align import Alignments
>>> isinstance(hit, Alignments)
True
>>> hit.target
SeqRecord(seq=Seq(None, length=319), id='gi|12654095|gb|AAH00859.1|', name='AAH00859', description='Unknown (protein for IMAGE:3459481) [Homo sapiens]', dbxrefs=[])

Most hits consist of only 1 or a few Alignment objects:

>>> len(hit)
1
>>> alignment = hit[0]
>>> type(alignment)
<class 'Bio.Blast.HSP'>
>>> alignment.score
630.0
>>> alignment.annotations
{'bit score': 247.284, 'evalue': 1.69599e-64, 'identity': 122, 'positive': 123, 'gaps': 0, 'midline': 'DLQLLIKAVNLFPAGTNSRWEVIANYMNIHSSSGVKRTAKDVIGKAKSLQKLDPHQKDDINKKAFDKFKKEHGVVPQADNATPSERF GPYTDFTP TTE QKL EQAL TYPVNT ERW  IA AVPGR K+'}

Target and query information are stored in the respective attributes of the alignment:

>>> alignment.target
SeqRecord(seq=Seq({155: 'DLQLLIKAVNLFPAGTNSRWEVIANYMNIHSSSGVKRTAKDVIGKAKSLQKLDP...TKK'}, length=319), id='gi|12654095|gb|AAH00859.1|', name='AAH00859', description='Unknown (protein for IMAGE:3459481) [Homo sapiens]', dbxrefs=[])
>>> alignment.query
SeqRecord(seq=Seq('DLQLLIKAVNLFPAGTNSRWEVIANYMNIHSSSGVKRTAKDVIGKAKSLQKLDP...XKE'), id='gi|1347369|gb|G25137.1|G25137', name='<unknown name>', description='human STS EST48004, sequence tagged site', dbxrefs=[])

This was a BLASTX run, so the query sequence was translated:

>>> len(alignment.target.features)
0
>>> len(alignment.query.features)
1
>>> feature = alignment.query.features[0]
>>> feature
SeqFeature(SimpleLocation(ExactPosition(0), ExactPosition(133)), type='CDS', qualifiers=...)
>>> feature.qualifiers
{'coded_by': 'gi|1347369|gb|G25137.1|G25137:1..399'}

i.e., nucleotides 0:399 (in zero-based coordinates) encode the amino acids of the query in the alignment.

For an alignment against the reverse strand, the location in the qualifier is shown as in this example:

>>> record[72][0].query.features[0].qualifiers
{'coded_by': 'complement(gi|1347369|gb|G25137.1|G25137:345..530)'}

__init__(): Initialize the Record object.

__repr__(): Return repr(self).

__str__(): Return str(self).

__getitem__(key): x.__getitem__(y) <==> x[y]

keys(): Return a list of the target.id of each hit.

__contains__(key): Return key in self.

index(key): Return the index of the hit for which the target.id is equal to the key.

class Bio.Blast.Records(source)

Bases: UserList

Stores the BLAST results of a single BLAST run.

A Bio.Blast.Records object is an iterator. Iterating over it returns returns Bio.Blast.Record objects, each of which corresponds to one BLAST query.

Common attributes of a Bio.Blast.Records object are

source: The input data from which the Bio.Blast.Records object
was constructed.

program: The specific BLAST program that was used (e.g., ‘blastn’).

version: The version of the BLAST program (e.g., ‘BLASTN 2.2.27+’).

reference: The literature reference to the BLAST publication.

db: The BLAST database against which the query was run
(e.g., ‘nr’).

query: A SeqRecord object which may contain some or all of the

following information:

query.id: SeqId of the query;

query.description: Definition line of the query;

query.seq: The query sequence. The query sequence.
The query sequence.

param: A dictionary with the parameters used for the BLAST run.

You may find the following keys in this dictionary:

‘matrix’: the scoring matrix used in the BLAST run
(e.g., ‘BLOSUM62’) (string);

‘expect’: threshold on the expected number of chance
matches (float);

‘include’: e-value threshold for inclusion in
multipass model in psiblast (float);

‘sc-match’: score for matching nucleotides (integer);

‘sc-mismatch’: score for mismatched nucleotides
(integer);

‘gap-open’: gap opening cost (integer);

‘gap-extend’: gap extension cost (integer);

‘filter’: filtering options applied in the BLAST
run (string);

‘pattern’: PHI-BLAST pattern (string);

‘entrez-query’: Limit of request to Entrez query (string).

mbstat: A dictionary with Mega BLAST search statistics. As this
information is stored near the end of the XML file, this attribute can only be accessed after the file has been read completely (by iterating over the records until a StopIteration is issued. This dictionary can contain the same keys as the dictionary stored under the stat attribute of a Record object.

>>> from Bio import Blast
>>> path = "Blast/xml_2218_blastp_002.xml"

In a script, you would use a with block, as in

>>> with Blast.parse(path) as records:
...     print(records.source)
...
Blast/xml_2218_blastp_002.xml

to ensure that the file is closed at the end of the block. Here, we will simply do

>>> records = Blast.parse("Blast/xml_2218_blastp_002.xml")

so we can see the output of each command right away.

>>> type(records)
<class 'Bio.Blast.Records'>
>>> records.source
'Blast/xml_2218_blastp_002.xml'
>>> records.program
'blastp'
>>> records.version
'BLASTP 2.2.18+'
>>> records.reference
'Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.'
>>> records.db
'gpipe/9606/Previous/protein'
>>> records.param
{'matrix': 'BLOSUM62', 'expect': 0.01, 'gap-open': 11, 'gap-extend': 1, 'filter': 'm L; R -d repeat/repeat_9606;'}

Iterating over the records returns Bio.Blast.Record objects:

>>> record = next(records)
>>> type(record)
<class 'Bio.Blast.Record'>
>>> record.query.id
'gi|585505|sp|Q08386|MOPB_RHOCA'
>>> record = next(records)
>>> type(record)
<class 'Bio.Blast.Record'>
>>> record.query.id
'gi|129628|sp|P07175.1|PARA_AGRTU'
>>> record = next(records)  
Traceback (most recent call last):
...
StopIteration

You can also use the records as a list, for example by extracting a record by index, or by calling len or print on the records. The parser will then automatically iterate over the records and store them:

>>> records = Blast.parse("Blast/wnts.xml")
>>> record = records[3]  # this causes all records to be read in and stored
>>> record.query.id
'Query_4'
>>> len(records)
5

After the records have been read in, you can still iterate over them:

>>> for i, record in enumerate(records):
...     print(i, record.query.id)
...
0 Query_1
1 Query_2
2 Query_3
3 Query_4
4 Query_5

__init__(source): Initialize the Records object.

__enter__()

__exit__(exc_type, exc_value, exc_traceback)

__iter__()

__next__()

__getitem__(index)

property data: Overrides the data attribute of UserList.

__repr__(): Return repr(self).

__str__(): Return str(self).

__abstractmethods__ = frozenset({})

Bio.Blast.parse(source)

Parse an XML file containing BLAST output and return a Bio.Blast.Records object.

This returns an iterator object; iterating over it returns Bio.Blast.Record objects one by one.

The source can be a file stream or the path to an XML file containing the BLAST output. If a file stream, source must be in binary mode. This allows the parser to detect the encoding from the XML file,and to use it to convert any text in the XML to the correct Unicode string. The qblast function in Bio.Blast returns a file stream in binary mode. For files, please use mode “rb” when opening the file, as in

>>> from Bio import Blast
>>> stream = open("Blast/wnts.xml", "rb")  # opened in binary mode
>>> records = Blast.parse(stream)
>>> for record in records:
...     print(record.query.id, record.query.description)
...
Query_1 gi|195230749:301-1383 Homo sapiens wingless-type MMTV integration site family member 2 (WNT2), transcript variant 1, mRNA
Query_2 gi|325053704:108-1166 Homo sapiens wingless-type MMTV integration site family, member 3A (WNT3A), mRNA
Query_3 gi|156630997:105-1160 Homo sapiens wingless-type MMTV integration site family, member 4 (WNT4), mRNA
Query_4 gi|371502086:108-1205 Homo sapiens wingless-type MMTV integration site family, member 5A (WNT5A), transcript variant 2, mRNA
Query_5 gi|53729353:216-1313 Homo sapiens wingless-type MMTV integration site family, member 6 (WNT6), mRNA
>>> stream.close()

Bio.Blast.read(source)

Parse an XML file containing BLAST output for a single query and return it.

Internally, this function uses Bio.Blast.parse to obtain an iterator over BLAST records. The function then reads one record from the iterator, ensures that there are no further records, and returns the record it found as a Bio.Blast.Record object. An exception is raised if no records are found, or more than one record is found.

The source can be a file stream or the path to an XML file containing the BLAST output. If a file stream, source must be in binary mode. This allows the parser to detect the encoding from the XML file,and to use it to convert any text in the XML to the correct Unicode string. The qblast function in Bio.Blast returns a file stream in binary mode. For files, please use mode “rb” when opening the file, as in

>>> from Bio import Blast
>>> stream = open("Blast/xml_21500_blastn_001.xml", "rb")  # opened in binary mode
>>> record = Blast.read(stream)
>>> record.query.id
'Query_78041'
>>> record.query.description
'G26684.1 human STS STS_D11570, sequence tagged site'
>>> len(record)
11
>>> stream.close()

Use the Bio.Blast.parse function if you want to read a file containing BLAST output for more than one query.

Bio.Blast.write(records, destination, fmt='XML')

Write BLAST records as an XML file, and return the number of records.

Arguments:

records - A Bio.Blast.Records object.
destination - File or file-like object to write to, or filename as
string. The File object must have been opened for writing in binary mode, and must be closed (or flushed) by the caller after this function returns to ensure that all records are written.
fmt - string describing the file format to write
(case-insensitive). Currently, only “XML” and “XML2” are accepted.

Returns the number of records written (as an integer).

Bio.Blast.qblast(program, database, sequence, url_base=NCBI_BLAST_URL, auto_format=None, composition_based_statistics=None, db_genetic_code=None, endpoints=None, entrez_query='(none)', expect=10.0, filter=None, gapcosts=None, genetic_code=None, hitlist_size=50, i_thresh=None, layout=None, lcase_mask=None, matrix_name=None, nucl_penalty=None, nucl_reward=None, other_advanced=None, perc_ident=None, phi_pattern=None, query_file=None, query_believe_defline=None, query_from=None, query_to=None, searchsp_eff=None, service=None, threshold=None, ungapped_alignment=None, word_size=None, short_query=None, alignments=500, alignment_view=None, descriptions=500, entrez_links_new_window=None, expect_low=None, expect_high=None, format_entrez_query=None, format_object=None, format_type='XML', ncbi_gi=None, results_file=None, show_overview=None, megablast=None, template_type=None, template_length=None, username='blast', password=None)

BLAST search using NCBI’s QBLAST server.

Supports all parameters of the old qblast API for Put and Get.

Please note that NCBI uses the new Common URL API for BLAST searches on the internet (https://blast.ncbi.nlm.nih.gov/doc/blast-help/urlapi.html). Thus, some of the parameters used by this function are not (or are no longer) officially supported by NCBI. Although they are still functioning, this may change in the future.

Some useful parameters:

program blastn, blastp, blastx, tblastn, or tblastx (lower case)

database Which database to search against (e.g. “nr”).

sequence The sequence to search.

ncbi_gi TRUE/FALSE whether to give ‘gi’ identifier.

descriptions Number of descriptions to show. Def 500.

alignments Number of alignments to show. Def 500.

expect An expect value cutoff. Def 10.0.

matrix_name Specify an alt. matrix (PAM30, PAM70, BLOSUM80, BLOSUM45).

filter “none” turns off filtering. Default no filtering

format_type “XML” (default), “HTML”, “Text”, “XML2”, “JSON2”,
or “Tabular”.

entrez_query Entrez query to limit Blast search - only applies when searching nucleotide BLASTDBs

hitlist_size Number of hits to return. Default 50

megablast TRUE/FALSE whether to use MEga BLAST algorithm (blastn only)

short_query TRUE/FALSE whether to adjust the search parameters for a
short query sequence. Note that this will override manually set parameters like word size and e value. Turns off when sequence length is > 30 residues. Default: None.

service plain, psi, phi, rpsblast, megablast (lower case)

This function does no checking of the validity of the parameters and passes the values to the server as is. More help is available at: https://blast.ncbi.nlm.nih.gov/doc/blast-help/urlapi.html

The http.client.HTTPResponse object returned by this function has the additional attributes rid and rtoe with the Request ID and Request Time Of Execution for this BLAST search.