Bio.Blast package
Submodules
Module contents
Code to parse and store BLAST XML output, and to invoke the NCBI BLAST web server.
This module provides code to parse and store BLAST XML output, following its definition in the associated BLAST XML DTD file: https://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd
This module also provides code to invoke the BLAST web server provided by NCBI. https://blast.ncbi.nlm.nih.gov/
Variables:
email Set the Blast email parameter (default is None).
tool Set the Blast tool parameter (default is
biopython
).
- exception Bio.Blast.NotXMLError(message)
Bases:
ValueError
Failed to parse file as XML.
- __init__(message)
Initialize the class.
- __str__()
Return a string summary of the exception.
- exception Bio.Blast.CorruptedXMLError(message)
Bases:
ValueError
Corrupted XML.
- __init__(message)
Initialize the class.
- __str__()
Return a string summary of the exception.
- class Bio.Blast.HSP(sequences, coordinates=None)
Bases:
Alignment
Stores an alignment of one query sequence against a target sequence.
An HSP (High-scoring Segment Pair) stores the alignment of one query sequence segment against one target (hit) sequence segment. The
Bio.Blast.HSP
class inherits from theBio.Align.Alignment
class.In addition to the
target
andquery
attributes of aBio.Align.Alignment
, aBio.Blast.HSP
object has the following attributes:score: score of HSP;
- annotations: a dictionary that may contain the following keys:
‘bit score’: score (in bits) of HSP (float);
‘evalue’: e-value of HSP (float);
‘identity’: number of identities in HSP (integer);
‘positive’: number of positives in HSP (integer);
‘gaps’: number of gaps in HSP (integer);
‘midline’: formatting middle line.
A
Bio.Blast.HSP
object behaves the same as a Bio.Align.Alignment` object and can be used as such. However, when printing aBio.Blast.HSP
object, the BLAST e-value and bit score are included in the output (in addition to the alignment itself).See the documentation of
Bio.Blast.Record
for a more detailed explanation of how the information in BLAST records is stored in Biopython.- __repr__()
Return a representation of the alignment, including its shape.
The representation cannot be used with eval() to recreate the object, which is usually possible with simple python objects. For example:
<Alignment object (2 rows x 14 columns) at 0x10403d850>
The hex string is the memory address of the object and can be used to distinguish different Alignment objects. See help(id) for more information.
>>> import numpy as np >>> from Bio.Align import Alignment >>> alignment = Alignment(("ACCGT", "ACGT"), ... coordinates = np.array([[0, 2, 3, 5], ... [0, 2, 2, 4], ... ])) >>> print(alignment) target 0 ACCGT 5 0 ||-|| 5 query 0 AC-GT 4 >>> alignment <Alignment object (2 rows x 5 columns) at 0x...>
- __str__()
Return a human-readable string representation of the alignment.
For sequence alignments, each line has at most 80 columns. The first 10 columns show the (possibly truncated) sequence name, which may be the id attribute of a SeqRecord, or otherwise ‘target’ or ‘query’ for pairwise alignments. The next 10 columns show the sequence coordinate, using zero-based counting as usual in Python. The remaining 60 columns shown the sequence, using dashes to represent gaps. At the end of the alignment, the end coordinates are shown on the right of the sequence, again in zero-based coordinates.
Pairwise alignments have an additional line between the two sequences showing whether the sequences match (‘|’) or mismatch (‘.’), or if there is a gap (‘-‘). The coordinates shown for this line are the column indices, which can be useful when extracting a subalignment.
For example,
>>> from Bio.Align import PairwiseAligner >>> aligner = PairwiseAligner()
>>> seqA = "TTAACCCCATTTG" >>> seqB = "AAGCCCCTTT" >>> seqC = "AAAGGGGCTT"
>>> alignments = aligner.align(seqA, seqB) >>> len(alignments) 1 >>> alignment = alignments[0] >>> print(alignment) target 0 TTAA-CCCCATTTG 13 0 --||-||||-|||- 14 query 0 --AAGCCCC-TTT- 10
Note that seqC is the reverse complement of seqB. Aligning it to the reverse strand gives the same alignment, but the query coordinates are switched:
>>> alignments = aligner.align(seqA, seqC, strand="-") >>> len(alignments) 1 >>> alignment = alignments[0] >>> print(alignment) target 0 TTAA-CCCCATTTG 13 0 --||-||||-|||- 14 query 10 --AAGCCCC-TTT- 0
- __annotations__ = {}
- class Bio.Blast.Hit(alignments=())
Bases:
Alignments
Stores a single BLAST hit of one single query against one target.
The
Bio.Blast.Hit
class inherits from theBio.Align.Alignments
class, which is a subclass of a Python list. TheBio.Blast.Hit
class storesBio.Blast.HSP
objwcts, which inherit fromBio.Align.Alignment
. ABio.Blast.Hit
object is therefore effectively a list ofBio.Align.Alignment
objects. Most hits consist of only 1 or a few Alignment objects.Each
Bio.Blast.Hit
object has atarget
attribute containing the following information:target.id: seqId of subject;
target.description: definition line of subject;
target.name: accession of subject;
len(target.seq): sequence length of subject.
See the documentation of
Bio.Blast.Record
for a more detailed explanation of the information stored in the alignments contained in theBio.Blast.Hit
object.- __getitem__(key)
x.__getitem__(y) <==> x[y]
- __repr__()
Return repr(self).
- __str__()
Return a human readable summary of the Hit object.
- __abstractmethods__ = frozenset({})
- __annotations__ = {}
- class Bio.Blast.Record
Bases:
list
Stores the BLAST results for a single query.
A
Bio.Blast.Record
object is a list ofBio.Blast.Hit
objects, each corresponding to one hit for the query in the BLAST output.- The
Bio.Blast.Record
object may have the following attributes: - query: A
SeqRecord
object which may contain some or all of the - following information:
query.id: SeqId of query;
query.description: Definition line of query;
len(query.seq): Length of the query sequence.
- query: A
- stat: A dictionary with summary statistics of the BLAST run. It may
- contain the following keys:
‘db-num’: number of sequences in BLAST db (integer);
‘db-len’: length of BLAST db (integer);
‘hsp-len’: effective HSP length (integer);
‘eff-space’: effective search space (float);
‘kappa’: Karlin-Altschul parameter K (float);
‘lambda’: Karlin-Altschul parameter Lambda (float);
‘entropy’: Karlin-Altschul parameter H (float).
message: Some (error?) information.
Each
Bio.Blast.Hit
object has atarget
attribute containing the following information:target.id: seqId of subject;
target.description: definition line of subject;
target.name: accession of subject;
len(target.seq): sequence length of subject.
The
Bio.Blast.Hit
class inherits from theBio.Align.Alignments
class, which inherits from a Python list. In this list, theBio.Blast.Hit
object storesBio.Blast.HSP
objects, which inherit from theBio.Align.Alignment
class. ABio.Blast.Hit
object is therefore effectively a list of alignment objects.Each HSP in a
Bio.Blast.Hit
object has the attributestarget
andquery
attributes, as usual for of aBio.Align.Alignment
object storing a pairwise alignment, pointing to aSeqRecord
object representing the target and query, respectively. For translated BLAST searches, thefeatures
attribute of the target or query may contain aSeqFeature
of type CDS that stores the amino acid sequence region. Thequalifiers
attribute of such a feature is a dictionary with a single key ‘coded_by’; the corresponding value specifies the nucleotide sequence region, in a GenBank-style string with 1-based coordinates, that encodes the amino acid sequence.Each
Bio.Blast.HSP
object has the following additional attributes:score: score of HSP;
- annotations: a dictionary that may contain the following keys:
‘bit score’: score (in bits) of HSP (float);
‘evalue’: e-value of HSP (float);
‘identity’: number of identities in HSP (integer);
‘positive’: number of positives in HSP (integer);
‘gaps’: number of gaps in HSP (integer);
‘midline’: formatting middle line.
>>> from Bio import Blast >>> record = Blast.read("Blast/xml_2212L_blastx_001.xml") >>> record.query SeqRecord(seq=Seq(None, length=556), id='gi|1347369|gb|G25137.1|G25137', name='<unknown name>', description='human STS EST48004, sequence tagged site', dbxrefs=[]) >>> record.stat {'db-num': 2934173, 'db-len': 1011751523, 'hsp-len': 0, 'eff-space': 0, 'kappa': 0.041, 'lambda': 0.267, 'entropy': 0.14} >>> len(record) 78 >>> hit = record[0] >>> type(hit) <class 'Bio.Blast.Hit'> >>> from Bio.Align import Alignments >>> isinstance(hit, Alignments) True >>> hit.target SeqRecord(seq=Seq(None, length=319), id='gi|12654095|gb|AAH00859.1|', name='AAH00859', description='Unknown (protein for IMAGE:3459481) [Homo sapiens]', dbxrefs=[])
Most hits consist of only 1 or a few Alignment objects:
>>> len(hit) 1 >>> alignment = hit[0] >>> type(alignment) <class 'Bio.Blast.HSP'> >>> alignment.score 630.0 >>> alignment.annotations {'bit score': 247.284, 'evalue': 1.69599e-64, 'identity': 122, 'positive': 123, 'gaps': 0, 'midline': 'DLQLLIKAVNLFPAGTNSRWEVIANYMNIHSSSGVKRTAKDVIGKAKSLQKLDPHQKDDINKKAFDKFKKEHGVVPQADNATPSERF GPYTDFTP TTE QKL EQAL TYPVNT ERW IA AVPGR K+'}
Target and query information are stored in the respective attributes of the alignment:
>>> alignment.target SeqRecord(seq=Seq({155: 'DLQLLIKAVNLFPAGTNSRWEVIANYMNIHSSSGVKRTAKDVIGKAKSLQKLDP...TKK'}, length=319), id='gi|12654095|gb|AAH00859.1|', name='AAH00859', description='Unknown (protein for IMAGE:3459481) [Homo sapiens]', dbxrefs=[]) >>> alignment.query SeqRecord(seq=Seq('DLQLLIKAVNLFPAGTNSRWEVIANYMNIHSSSGVKRTAKDVIGKAKSLQKLDP...XKE'), id='gi|1347369|gb|G25137.1|G25137', name='<unknown name>', description='human STS EST48004, sequence tagged site', dbxrefs=[])
This was a BLASTX run, so the query sequence was translated:
>>> len(alignment.target.features) 0 >>> len(alignment.query.features) 1 >>> feature = alignment.query.features[0] >>> feature SeqFeature(SimpleLocation(ExactPosition(0), ExactPosition(133)), type='CDS', qualifiers=...) >>> feature.qualifiers {'coded_by': 'gi|1347369|gb|G25137.1|G25137:1..399'}
i.e., nucleotides 0:399 (in zero-based coordinates) encode the amino acids of the query in the alignment.
For an alignment against the reverse strand, the location in the qualifier is shown as in this example:
>>> record[72][0].query.features[0].qualifiers {'coded_by': 'complement(gi|1347369|gb|G25137.1|G25137:345..530)'}
- __init__()
Initialize the Record object.
- __repr__()
Return repr(self).
- __str__()
Return str(self).
- __getitem__(key)
x.__getitem__(y) <==> x[y]
- keys()
Return a list of the target.id of each hit.
- __contains__(key)
Return key in self.
- index(key)
Return the index of the hit for which the target.id is equal to the key.
- The
- class Bio.Blast.Records(source)
Bases:
UserList
Stores the BLAST results of a single BLAST run.
A
Bio.Blast.Records
object is an iterator. Iterating over it returns returnsBio.Blast.Record
objects, each of which corresponds to one BLAST query.Common attributes of a
Bio.Blast.Records
object are- source: The input data from which the
Bio.Blast.Records
object was constructed.
- source: The input data from which the
program: The specific BLAST program that was used (e.g., ‘blastn’).
version: The version of the BLAST program (e.g., ‘BLASTN 2.2.27+’).
reference: The literature reference to the BLAST publication.
- db: The BLAST database against which the query was run
(e.g., ‘nr’).
- query: A
SeqRecord
object which may contain some or all of the - following information:
query.id: SeqId of the query;
query.description: Definition line of the query;
- query.seq: The query sequence. The query sequence.
The query sequence.
- query: A
- param: A dictionary with the parameters used for the BLAST run.
- You may find the following keys in this dictionary:
- ‘matrix’: the scoring matrix used in the BLAST run
(e.g., ‘BLOSUM62’) (string);
- ‘expect’: threshold on the expected number of chance
matches (float);
- ‘include’: e-value threshold for inclusion in
multipass model in psiblast (float);
‘sc-match’: score for matching nucleotides (integer);
- ‘sc-mismatch’: score for mismatched nucleotides
(integer);
‘gap-open’: gap opening cost (integer);
‘gap-extend’: gap extension cost (integer);
- ‘filter’: filtering options applied in the BLAST
run (string);
‘pattern’: PHI-BLAST pattern (string);
‘entrez-query’: Limit of request to Entrez query (string).
- mbstat: A dictionary with Mega BLAST search statistics. As this
information is stored near the end of the XML file, this attribute can only be accessed after the file has been read completely (by iterating over the records until a
StopIteration
is issued. This dictionary can contain the same keys as the dictionary stored under thestat
attribute of aRecord
object.
>>> from Bio import Blast >>> path = "Blast/xml_2218_blastp_002.xml"
In a script, you would use a
with
block, as in>>> with Blast.parse(path) as records: ... print(records.source) ... Blast/xml_2218_blastp_002.xml
to ensure that the file is closed at the end of the block. Here, we will simply do
>>> records = Blast.parse("Blast/xml_2218_blastp_002.xml")
so we can see the output of each command right away.
>>> type(records) <class 'Bio.Blast.Records'> >>> records.source 'Blast/xml_2218_blastp_002.xml' >>> records.program 'blastp' >>> records.version 'BLASTP 2.2.18+' >>> records.reference 'Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.' >>> records.db 'gpipe/9606/Previous/protein' >>> records.param {'matrix': 'BLOSUM62', 'expect': 0.01, 'gap-open': 11, 'gap-extend': 1, 'filter': 'm L; R -d repeat/repeat_9606;'}
Iterating over the records returns Bio.Blast.Record objects:
>>> record = next(records) >>> type(record) <class 'Bio.Blast.Record'> >>> record.query.id 'gi|585505|sp|Q08386|MOPB_RHOCA' >>> record = next(records) >>> type(record) <class 'Bio.Blast.Record'> >>> record.query.id 'gi|129628|sp|P07175.1|PARA_AGRTU' >>> record = next(records) Traceback (most recent call last): ... StopIteration
You can also use the records as a list, for example by extracting a record by index, or by calling
len
orprint
on the records. The parser will then automatically iterate over the records and store them:>>> records = Blast.parse("Blast/wnts.xml") >>> record = records[3] # this causes all records to be read in and stored >>> record.query.id 'Query_4' >>> len(records) 5
After the records have been read in, you can still iterate over them:
>>> for i, record in enumerate(records): ... print(i, record.query.id) ... 0 Query_1 1 Query_2 2 Query_3 3 Query_4 4 Query_5
- __init__(source)
Initialize the Records object.
- __enter__()
- __exit__(exc_type, exc_value, exc_traceback)
- __iter__()
- __next__()
- __getitem__(index)
- property data
Overrides the data attribute of UserList.
- __repr__()
Return repr(self).
- __str__()
Return str(self).
- __abstractmethods__ = frozenset({})
- Bio.Blast.parse(source)
Parse an XML file containing BLAST output and return a Bio.Blast.Records object.
This returns an iterator object; iterating over it returns Bio.Blast.Record objects one by one.
The source can be a file stream or the path to an XML file containing the BLAST output. If a file stream, source must be in binary mode. This allows the parser to detect the encoding from the XML file,and to use it to convert any text in the XML to the correct Unicode string. The qblast function in Bio.Blast returns a file stream in binary mode. For files, please use mode “rb” when opening the file, as in
>>> from Bio import Blast >>> stream = open("Blast/wnts.xml", "rb") # opened in binary mode >>> records = Blast.parse(stream) >>> for record in records: ... print(record.query.id, record.query.description) ... Query_1 gi|195230749:301-1383 Homo sapiens wingless-type MMTV integration site family member 2 (WNT2), transcript variant 1, mRNA Query_2 gi|325053704:108-1166 Homo sapiens wingless-type MMTV integration site family, member 3A (WNT3A), mRNA Query_3 gi|156630997:105-1160 Homo sapiens wingless-type MMTV integration site family, member 4 (WNT4), mRNA Query_4 gi|371502086:108-1205 Homo sapiens wingless-type MMTV integration site family, member 5A (WNT5A), transcript variant 2, mRNA Query_5 gi|53729353:216-1313 Homo sapiens wingless-type MMTV integration site family, member 6 (WNT6), mRNA >>> stream.close()
- Bio.Blast.read(source)
Parse an XML file containing BLAST output for a single query and return it.
Internally, this function uses Bio.Blast.parse to obtain an iterator over BLAST records. The function then reads one record from the iterator, ensures that there are no further records, and returns the record it found as a Bio.Blast.Record object. An exception is raised if no records are found, or more than one record is found.
The source can be a file stream or the path to an XML file containing the BLAST output. If a file stream, source must be in binary mode. This allows the parser to detect the encoding from the XML file,and to use it to convert any text in the XML to the correct Unicode string. The qblast function in Bio.Blast returns a file stream in binary mode. For files, please use mode “rb” when opening the file, as in
>>> from Bio import Blast >>> stream = open("Blast/xml_21500_blastn_001.xml", "rb") # opened in binary mode >>> record = Blast.read(stream) >>> record.query.id 'Query_78041' >>> record.query.description 'G26684.1 human STS STS_D11570, sequence tagged site' >>> len(record) 11 >>> stream.close()
Use the Bio.Blast.parse function if you want to read a file containing BLAST output for more than one query.
- Bio.Blast.write(records, destination, fmt='XML')
Write BLAST records as an XML file, and return the number of records.
- Arguments:
records - A
Bio.Blast.Records
object.- destination - File or file-like object to write to, or filename as
string. The File object must have been opened for writing in binary mode, and must be closed (or flushed) by the caller after this function returns to ensure that all records are written.
- fmt - string describing the file format to write
(case-insensitive). Currently, only “XML” and “XML2” are accepted.
Returns the number of records written (as an integer).
- Bio.Blast.qblast(program, database, sequence, url_base=NCBI_BLAST_URL, auto_format=None, composition_based_statistics=None, db_genetic_code=None, endpoints=None, entrez_query='(none)', expect=10.0, filter=None, gapcosts=None, genetic_code=None, hitlist_size=50, i_thresh=None, layout=None, lcase_mask=None, matrix_name=None, nucl_penalty=None, nucl_reward=None, other_advanced=None, perc_ident=None, phi_pattern=None, query_file=None, query_believe_defline=None, query_from=None, query_to=None, searchsp_eff=None, service=None, threshold=None, ungapped_alignment=None, word_size=None, short_query=None, alignments=500, alignment_view=None, descriptions=500, entrez_links_new_window=None, expect_low=None, expect_high=None, format_entrez_query=None, format_object=None, format_type='XML', ncbi_gi=None, results_file=None, show_overview=None, megablast=None, template_type=None, template_length=None, username='blast', password=None)
BLAST search using NCBI’s QBLAST server.
Supports all parameters of the old qblast API for Put and Get.
Please note that NCBI uses the new Common URL API for BLAST searches on the internet (https://blast.ncbi.nlm.nih.gov/doc/blast-help/urlapi.html). Thus, some of the parameters used by this function are not (or are no longer) officially supported by NCBI. Although they are still functioning, this may change in the future.
Some useful parameters:
program blastn, blastp, blastx, tblastn, or tblastx (lower case)
database Which database to search against (e.g. “nr”).
sequence The sequence to search.
ncbi_gi TRUE/FALSE whether to give ‘gi’ identifier.
descriptions Number of descriptions to show. Def 500.
alignments Number of alignments to show. Def 500.
expect An expect value cutoff. Def 10.0.
matrix_name Specify an alt. matrix (PAM30, PAM70, BLOSUM80, BLOSUM45).
filter “none” turns off filtering. Default no filtering
- format_type “XML” (default), “HTML”, “Text”, “XML2”, “JSON2”,
or “Tabular”.
entrez_query Entrez query to limit Blast search - only applies when searching nucleotide BLASTDBs
hitlist_size Number of hits to return. Default 50
megablast TRUE/FALSE whether to use MEga BLAST algorithm (blastn only)
- short_query TRUE/FALSE whether to adjust the search parameters for a
short query sequence. Note that this will override manually set parameters like word size and e value. Turns off when sequence length is > 30 residues. Default: None.
service plain, psi, phi, rpsblast, megablast (lower case)
This function does no checking of the validity of the parameters and passes the values to the server as is. More help is available at: https://blast.ncbi.nlm.nih.gov/doc/blast-help/urlapi.html
The http.client.HTTPResponse object returned by this function has the additional attributes rid and rtoe with the Request ID and Request Time Of Execution for this BLAST search.