Bio.Entrez package
Submodules
- Bio.Entrez.Parser module
NoneElement
IntegerElement
StringElement
ListElement
DictionaryElement
OrderedListElement
ErrorElement
NotXMLError
CorruptedXMLError
ValidationError
DataHandlerMeta
DataHandler
DataHandler.global_dtd_dir
DataHandler.global_xsd_dir
DataHandler.local_dtd_dir
DataHandler.local_xsd_dir
DataHandler.__init__()
DataHandler.read()
DataHandler.parse()
DataHandler.xmlDeclHandler()
DataHandler.handleMissingDocumentDefinition()
DataHandler.startNamespaceDeclHandler()
DataHandler.endNamespaceDeclHandler()
DataHandler.schemaHandler()
DataHandler.startElementHandler()
DataHandler.startRawElementHandler()
DataHandler.startSkipElementHandler()
DataHandler.endStringElementHandler()
DataHandler.endRawElementHandler()
DataHandler.endSkipElementHandler()
DataHandler.endErrorElementHandler()
DataHandler.endElementHandler()
DataHandler.endIntegerElementHandler()
DataHandler.characterDataHandlerRaw()
DataHandler.characterDataHandlerEscape()
DataHandler.skipCharacterDataHandler()
DataHandler.parse_xsd()
DataHandler.elementDecl()
DataHandler.open_dtd_file()
DataHandler.open_xsd_file()
DataHandler.save_dtd_file()
DataHandler.save_xsd_file()
DataHandler.externalEntityRefHandler()
Module contents
Provides code to access NCBI over the WWW.
The main Entrez web page is available at: http://www.ncbi.nlm.nih.gov/Entrez/
Entrez Programming Utilities web page is available at: http://www.ncbi.nlm.nih.gov/books/NBK25501/
This module provides a number of functions like efetch (short for
Entrez Fetch) which will return the data as a handle object. This is
a standard interface used in Python for reading data from a file, or
in this case a remote network connection. A handle provides methods
like .read() and supports iteration over the contents line by line.
See also “What the heck is a handle?” in the Biopython Tutorial and
Cookbook:
http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
The handle returned by these functions can be either in text mode or
in binary mode, depending on the data requested and the results
returned by NCBI Entrez. Typically, XML data will be in binary mode
while other data will be in text mode, as required by the downstream
parser to parse the data.
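As a minimal sketch of the handle interface described above, the following uses in-memory streams from the standard library as stand-ins for the handles returned by Bio.Entrez. The helper name read_as_text and the UTF-8 default are assumptions made for illustration; they are not part of Biopython.

```python
import io


def read_as_text(handle, encoding="utf-8"):
    """Read a whole handle and return str (hypothetical helper, not Biopython API)."""
    data = handle.read()
    # Binary-mode handles return bytes; text-mode handles return str.
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data


# Stand-ins for a text-mode handle and a binary-mode (XML) handle.
text_handle = io.StringIO("LOCUS line\nDEFINITION line\n")
binary_handle = io.BytesIO(b"<?xml version='1.0'?><eInfoResult/>")

# Iterating over a handle yields its contents line by line.
first_lines = [line.rstrip("\n") for line in text_handle]

# .read() returns everything that is left in the handle.
xml_text = read_as_text(binary_handle)
```

When the data is XML, it is usually simpler to pass the binary handle straight to the parser rather than decoding it yourself.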
Unlike a handle to a file on disk from the open(filename) function,
which has a .name attribute giving the filename, the handles from
Bio.Entrez all have a .url attribute instead giving the URL used to
connect to the NCBI Entrez API.
The epost, efetch, and esummary tools take an “id” parameter
which corresponds to one or more database UIDs (or accession.version
identifiers in the case of sequence databases such as “nuccore” or
“protein”). The Python value of the “id” keyword passed to these functions
may be either a single ID as a string or integer or multiple IDs as an
iterable of strings/integers. You may also pass a single string containing
multiple IDs delimited by commas. The elink tool also accepts multiple
IDs but the argument is handled differently from the other three. See that
function’s docstring for more information.
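The accepted forms of the “id” value can be illustrated with a small sketch. The function normalize_ids below is a hypothetical helper written for this example; it is not Biopython's internal code, and it only shows how the different input forms could all reduce to one comma-delimited string.

```python
def normalize_ids(ids):
    """Return a comma-delimited ID string (hypothetical helper for illustration)."""
    # A single ID given as a string or integer is used as-is.
    if isinstance(ids, (str, int)):
        return str(ids)
    # Any other iterable of strings/integers is joined with commas.
    return ",".join(str(uid) for uid in ids)


single = normalize_ids(19304878)
several = normalize_ids([19304878, "14630660"])
already_joined = normalize_ids("19304878,14630660")
```

All three values would be acceptable as the id keyword of epost, efetch, or esummary according to the description above.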
All the functions that send requests to the NCBI Entrez API will
automatically respect the NCBI rate limit (of 3 requests per second
without an API key, or 10 requests per second with an API key) and
will automatically retry when encountering transient failures
(i.e. connection failures or HTTP 5XX codes). By default, Biopython
does a maximum of three tries before giving up, and sleeps for 15
seconds between tries. You can tweak these parameters by setting
Bio.Entrez.max_tries and Bio.Entrez.sleep_between_tries.
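The retry behaviour described above can be sketched in plain Python. This is an illustration of the general pattern under stated assumptions, not Biopython's implementation: TransientError, call_with_retries, and the injectable sleep argument are all hypothetical names introduced here.

```python
import time


class TransientError(Exception):
    """Stand-in for a connection failure or an HTTP 5XX response."""


def call_with_retries(request, max_tries=3, sleep_between_tries=15, sleep=time.sleep):
    """Call request() up to max_tries times, pausing between failed tries."""
    for attempt in range(1, max_tries + 1):
        try:
            return request()
        except TransientError:
            # Give up after the last permitted try.
            if attempt == max_tries:
                raise
            sleep(sleep_between_tries)


# Demonstration with a request that fails twice and then succeeds.
attempts = []


def flaky_request():
    attempts.append(1)
    if len(attempts) < 3:
        raise TransientError("temporary failure")
    return "ok"


# The sleep function is replaced so that the demonstration does not wait.
result = call_with_retries(flaky_request, sleep=lambda seconds: None)
```

With the default of three tries, the third attempt is the last one; a request that still fails at that point propagates its error to the caller.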
The Entrez module also provides an XML parser which takes a handle as input.
Variables:
email Set the Entrez email parameter (default is not set).
tool Set the Entrez tool parameter (default is biopython).
api_key Personal API key from NCBI. If not set, only 3 queries per second are allowed; with a valid API key, 10 queries per second are allowed.
max_tries Configures how many times failed requests will be automatically retried on error (default is 3).
sleep_between_tries The delay, in seconds, before retrying a request on error (default is 15).
Functions:
efetch Retrieves records in the requested format from a list of one or more primary IDs or from the user’s environment.
epost Posts a file containing a list of primary IDs for future use in the user’s environment to use with subsequent search strategies.
esearch Searches and retrieves primary IDs (for use in EFetch, ELink, and ESummary) and term translations and optionally retains results for future use in the user’s environment.
elink Checks for the existence of an external or Related Articles link from a list of one or more primary IDs. Retrieves primary IDs and relevancy scores for links to Entrez databases or Related Articles; creates a hyperlink to the primary LinkOut provider for a specific ID and database, or lists LinkOut URLs and Attributes for multiple IDs.
einfo Provides field index term counts, last update, and available links for each database.
esummary Retrieves document summaries from a list of primary IDs or from the user’s environment.
egquery Provides Entrez database counts in XML for a single search using Global Query.
espell Retrieves spelling suggestions.
ecitmatch Retrieves PubMed IDs (PMIDs) that correspond to a set of input citation strings.
read Parses the XML results returned by any of the above functions. Alternatively, the XML data can be read from a file opened in binary mode. Typical usage is:
>>> from Bio import Entrez
>>> Entrez.email = "Your.Name.Here@example.org"
>>> handle = Entrez.einfo() # or esearch, efetch, ...
>>> record = Entrez.read(handle)
>>> handle.close()
where record is now a Python dictionary or list.
parse Parses the XML results returned by those of the above functions which can return multiple records - such as efetch, esummary and elink. Typical usage is:
>>> handle = Entrez.esummary(db="pubmed", id="19304878,14630660", retmode="xml")
>>> records = Entrez.parse(handle)
>>> for record in records:
...     # each record is a Python dictionary or list.
...     print(record['Title'])
Biopython: freely available Python tools for computational molecular biology and bioinformatics.
PDB file parser and structure class implemented in Python.
>>> handle.close()
This function is appropriate only if the XML file contains multiple records, and is particularly useful for large files.
_open Internally used function.
- Bio.Entrez.epost(db, **keywds)
Post a file of identifiers for future use.
Posts a file containing a list of UIs for future use in the user’s environment to use with subsequent search strategies.
See the online documentation for an explanation of the parameters: http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EPost
- Returns:
Handle to the results.
- Raises:
urllib.error.URLError – If there’s a network error.
- Bio.Entrez.efetch(db, **keywords)
Fetch Entrez results which are returned as a handle.
EFetch retrieves records in the requested format from a list or set of one or more UIs or from user’s environment.
See the online documentation for an explanation of the parameters: http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch
Short example:
>>> from Bio import Entrez
>>> Entrez.email = "Your.Name.Here@example.org"
>>> handle = Entrez.efetch(db="nucleotide", id="AY851612", rettype="gb", retmode="text")
>>> print(handle.readline().strip())
LOCUS AY851612 892 bp DNA linear PLN 10-APR-2007
>>> handle.close()
This will automatically use an HTTP POST rather than HTTP GET if there are over 200 identifiers as recommended by the NCBI.
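The switch between GET and POST can be pictured with the following sketch. The function choose_http_method and its threshold argument are hypothetical and exist only to illustrate the rule stated above; how Biopython counts identifiers internally may differ.

```python
def choose_http_method(ids, threshold=200):
    """Return "POST" when there are more than `threshold` IDs, else "GET"."""
    # Accept either a comma-delimited string or an iterable of IDs.
    id_list = ids.split(",") if isinstance(ids, str) else list(ids)
    return "POST" if len(id_list) > threshold else "GET"


small_request = choose_http_method(["AY851612", "EF590893.1"])
large_request = choose_http_method(range(1, 202))  # 201 identifiers
```

Long identifier lists are sent with POST because they would otherwise produce very long request URLs.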
Warning: The NCBI changed the default retmode in Feb 2012, so many databases which previously returned text output now give XML.
- Returns:
Handle to the results.
- Raises:
urllib.error.URLError – If there’s a network error.
- Bio.Entrez.esearch(db, term, **keywds)
Run an Entrez search and return a handle to the results.
ESearch searches and retrieves primary IDs (for use in EFetch, ELink and ESummary) and term translations, and optionally retains results for future use in the user’s environment.
See the online documentation for an explanation of the parameters: http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch
Short example:
>>> from Bio import Entrez
>>> Entrez.email = "Your.Name.Here@example.org"
>>> handle = Entrez.esearch(
...     db="nucleotide", retmax=10, idtype="acc",
...     term="opuntia[ORGN] accD 2007[Publication Date]"
... )
...
>>> record = Entrez.read(handle)
>>> handle.close()
>>> int(record["Count"]) >= 2
True
>>> "EF590893.1" in record["IdList"]
True
>>> "EF590892.1" in record["IdList"]
True
- Returns:
Handle to the results, which are always in XML format.
- Raises:
urllib.error.URLError – If there’s a network error.
- Bio.Entrez.elink(**keywds)
Check for linked external articles and return a handle.
ELink checks for the existence of an external or Related Articles link from a list of one or more primary IDs; retrieves IDs and relevancy scores for links to Entrez databases or Related Articles; creates a hyperlink to the primary LinkOut provider for a specific ID and database, or lists LinkOut URLs and attributes for multiple IDs.
See the online documentation for an explanation of the parameters: http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ELink
Note that ELink treats the “id” parameter differently than the other tools when multiple values are given. You should generally pass multiple UIDs as a list of strings or integers. This will provide a “one-to-one” mapping from source database UIDs to destination database UIDs in the result. If multiple source UIDs are passed as a single comma-delimited string all destination UIDs will be mixed together in the result.
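The difference between the two ways of passing multiple UIDs can be seen at the level of the query string. The sketch below uses only urllib.parse.urlencode from the standard library; it illustrates the two encodings and is not a copy of how Biopython builds its requests.

```python
from urllib.parse import urlencode

# A list of UIDs encoded as repeated "id" parameters (one per source UID).
separate = urlencode(
    {"dbfrom": "pubmed", "id": ["19304878", "14630660"]}, doseq=True
)

# A single comma-delimited string encoded as one "id" parameter.
combined = urlencode({"dbfrom": "pubmed", "id": "19304878,14630660"})

print(separate)  # dbfrom=pubmed&id=19304878&id=14630660
print(combined)  # dbfrom=pubmed&id=19304878%2C14630660
```

Repeated parameters keep each source UID distinct, which corresponds to the “one-to-one” mapping described above, whereas the single comma-delimited value is treated as one batch.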
This example finds articles related to the Biopython application note’s entry in the PubMed database:
>>> from Bio import Entrez
>>> Entrez.email = "Your.Name.Here@example.org"
>>> pmid = "19304878"
>>> handle = Entrez.elink(dbfrom="pubmed", id=pmid, linkname="pubmed_pubmed")
>>> record = Entrez.read(handle)
>>> handle.close()
>>> print(record[0]["LinkSetDb"][0]["LinkName"])
pubmed_pubmed
>>> linked = [link["Id"] for link in record[0]["LinkSetDb"][0]["Link"]]
>>> "14630660" in linked
True
This is explained in much more detail in the Biopython Tutorial.
- Returns:
Handle to the results, by default in XML format.
- Raises:
urllib.error.URLError – If there’s a network error.
- Bio.Entrez.einfo(**keywds)
Return a summary of the Entrez databases as a results handle.
EInfo provides field names, index term counts, last update, and available links for each Entrez database.
See the online documentation for an explanation of the parameters: http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EInfo
Short example:
>>> from Bio import Entrez
>>> Entrez.email = "Your.Name.Here@example.org"
>>> record = Entrez.read(Entrez.einfo())
>>> 'pubmed' in record['DbList']
True
- Returns:
Handle to the results, by default in XML format.
- Raises:
urllib.error.URLError – If there’s a network error.
- Bio.Entrez.esummary(**keywds)
Retrieve document summaries as a results handle.
ESummary retrieves document summaries from a list of primary IDs or from the user’s environment.
See the online documentation for an explanation of the parameters: http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESummary
This example discovers more about entry 19923 in the structure database:
>>> from Bio import Entrez
>>> Entrez.email = "Your.Name.Here@example.org"
>>> handle = Entrez.esummary(db="structure", id="19923")
>>> record = Entrez.read(handle)
>>> handle.close()
>>> print(record[0]["Id"])
19923
>>> print(record[0]["PdbDescr"])
CRYSTAL STRUCTURE OF E. COLI ACONITASE B
- Returns:
Handle to the results, by default in XML format.
- Raises:
urllib.error.URLError – If there’s a network error.
- Bio.Entrez.egquery(**keywds)
Provide Entrez database counts for a global search (DEPRECATED).
EGQuery provided Entrez database counts in XML for a single search using Global Query. However, the NCBI are no longer maintaining this function and suggest using esearch on each database of interest.
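One possible way to follow the suggestion of running esearch on each database of interest is sketched below. The function count_hits and its search hook are hypothetical names introduced for this example. The default branch assumes Biopython is installed, Entrez.email has been set, and network access is available. It uses the retmax=0 parameter of ESearch so that only the count is requested.

```python
def count_hits(term, databases, search=None):
    """Return a {database: hit count} dictionary for one search term.

    `search` is a hypothetical hook, callable(db, term) -> int, which lets
    the function be exercised without contacting the NCBI.
    """
    if search is None:
        # Imported lazily so the sketch can be loaded without Biopython.
        from Bio import Entrez

        def search(db, term):
            handle = Entrez.esearch(db=db, term=term, retmax=0)
            record = Entrez.read(handle)
            handle.close()
            return int(record["Count"])

    return {db: search(db, term) for db in databases}


# Offline demonstration with a stand-in search function.
fake_counts = {"pubmed": 40, "pmc": 70}
counts = count_hits(
    "biopython", ["pubmed", "pmc"], search=lambda db, term: fake_counts[db]
)
```

Calling count_hits("biopython", ["pubmed", "pmc"]) without the search argument would issue one ESearch request per database.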
See the online documentation for an explanation of the parameters: http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EGQuery
This quick example based on a longer version from the Biopython Tutorial just checks there are over 60 matches for ‘Biopython’ in PubMedCentral:
>>> from Bio import Entrez
>>> Entrez.email = "Your.Name.Here@example.org"
>>> handle = Entrez.egquery(term="biopython")
>>> record = Entrez.read(handle)
>>> handle.close()
>>> for row in record["eGQueryResult"]:
...     if "pmc" in row["DbName"]:
...         print(int(row["Count"]) > 60)
True
- Returns:
Handle to the results, by default in XML format.
- Raises:
urllib.error.URLError – If there’s a network error.
- Bio.Entrez.espell(**keywds)
Retrieve spelling suggestions as a results handle.
ESpell retrieves spelling suggestions, if available.
See the online documentation for an explanation of the parameters: http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESpell
Short example:
>>> from Bio import Entrez
>>> Entrez.email = "Your.Name.Here@example.org"
>>> record = Entrez.read(Entrez.espell(term="biopythooon"))
>>> print(record["Query"])
biopythooon
>>> print(record["CorrectedQuery"])
biopython
- Returns:
Handle to the results, by default in XML format.
- Raises:
urllib.error.URLError – If there’s a network error.
- Bio.Entrez.ecitmatch(**keywds)
Retrieve PMIDs for input citation strings, returned as a handle.
ECitMatch retrieves PubMed IDs (PMIDs) that correspond to a set of input citation strings.
See the online documentation for an explanation of the parameters: http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ECitMatch
Short example:
>>> from Bio import Entrez
>>> Entrez.email = "Your.Name.Here@example.org"
>>> citation_1 = {"journal_title": "proc natl acad sci u s a",
...               "year": "1991", "volume": "88", "first_page": "3248",
...               "author_name": "mann bj", "key": "citation_1"}
>>> handle = Entrez.ecitmatch(db="pubmed", bdata=[citation_1])
>>> print(handle.read().strip().split("|"))
['proc natl acad sci u s a', '1991', '88', '3248', 'mann bj', 'citation_1', '2014248']
>>> handle.close()
- Returns:
Handle to the results, by default in plain text.
- Raises:
urllib.error.URLError – If there’s a network error.
- Bio.Entrez.read(source, validate=True, escape=False, ignore_errors=False)
Parse an XML file from the NCBI Entrez Utilities into Python objects.
This function parses an XML file created by NCBI’s Entrez Utilities, returning a multilevel data structure of Python lists and dictionaries. Most XML files returned by NCBI’s Entrez Utilities can be parsed by this function, provided its DTD is available. Biopython includes the DTDs for most commonly used Entrez Utilities.
The argument source must be a file or file-like object opened in binary mode, or a filename. The parser detects the encoding from the XML file, and uses it to convert all text in the XML to the correct Unicode string. The functions in Bio.Entrez to access NCBI Entrez will automatically return XML data in binary mode. For files, use mode “rb” when opening the file, as in
>>> from Bio import Entrez
>>> path = "Entrez/esearch1.xml"
>>> stream = open(path, "rb") # opened in binary mode
>>> record = Entrez.read(stream)
>>> print(record['QueryTranslation'])
biopython[All Fields]
>>> stream.close()
Alternatively, you can use the filename directly, as in
>>> record = Entrez.read(path)
>>> print(record['QueryTranslation'])
biopython[All Fields]
which is safer, as the file stream will automatically be closed after the record has been read, or if an error occurs.
If validate is True (default), the parser will validate the XML file against the DTD, and raise an error if the XML file contains tags that are not represented in the DTD. If validate is False, the parser will simply skip such tags.
If escape is True, all characters that are not valid HTML are replaced by HTML escape characters to guarantee that the returned strings are valid HTML fragments. For example, a less-than sign (<) is replaced by &lt;. If escape is False (default), the string is returned as is.
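The effect of HTML escaping can be illustrated with html.escape from the Python standard library. This is only an analogy for what escape=True is described as doing; the exact set of characters that the Biopython parser escapes is not specified here and may differ.

```python
import html

raw = "expression < 0.05 & fold change > 2"

# quote=False leaves quotation marks alone and escapes only <, > and &.
escaped = html.escape(raw, quote=False)

print(escaped)  # expression &lt; 0.05 &amp; fold change &gt; 2
```

Escaped strings of this kind can be placed inside an HTML page without being misread as markup.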
If ignore_errors is False (default), any error messages in the XML file will raise a RuntimeError. If ignore_errors is True, error messages will be stored as ErrorElement items, without raising an exception.
Whereas the data structure seems to consist of generic Python lists, dictionaries, strings, and so on, each of these is actually a class derived from the base type. This allows us to store the attributes (if any) of each element in a dictionary my_element.attributes, and the tag name in my_element.tag.
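The idea of deriving from a built-in type while carrying extra metadata can be sketched as follows. The classes TaggedString and TaggedList are hypothetical names written for this illustration; they only mimic the tag and attributes members mentioned above and are not the classes defined in Bio.Entrez.Parser.

```python
class TaggedString(str):
    """A str that also remembers its XML tag name and attributes."""

    def __new__(cls, value, tag, attributes=None):
        # str is immutable, so the value is set in __new__.
        obj = super().__new__(cls, value)
        obj.tag = tag
        obj.attributes = dict(attributes or {})
        return obj


class TaggedList(list):
    """A list that also remembers its XML tag name and attributes."""

    def __init__(self, items=(), tag=None, attributes=None):
        super().__init__(items)
        self.tag = tag
        self.attributes = dict(attributes or {})


uid = TaggedString("19304878", tag="Id", attributes={"IdType": "pubmed"})
id_list = TaggedList([uid], tag="IdList")

# The objects behave like ordinary strings and lists ...
is_plain_string = isinstance(uid, str) and uid == "19304878"
# ... while still exposing the XML metadata.
metadata = (uid.tag, uid.attributes["IdType"], id_list.tag)
```

Because the objects are subclasses, code that expects ordinary lists, dictionaries, or strings keeps working unchanged.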
- Bio.Entrez.parse(source, validate=True, escape=False, ignore_errors=False)
Parse an XML file from the NCBI Entrez Utilities into Python objects.
This function parses an XML file created by NCBI’s Entrez Utilities, returning a multilevel data structure of Python lists and dictionaries. This function is suitable for XML files that (in Python) can be represented as a list of individual records. Whereas ‘read’ reads the complete file and returns a single Python list, ‘parse’ is a generator function that returns the records one by one. This function is therefore particularly useful for parsing large files.
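The contrast between returning one complete list and yielding records one by one can be shown with a standard-library analogy. The sketch below uses xml.etree.ElementTree.iterparse on a made-up XML document; the element names Records, Record, and Title and both function names are assumptions for illustration, and the Biopython parser itself works differently.

```python
import io
import xml.etree.ElementTree as ET

# A small, made-up XML document standing in for a multi-record result.
XML = b"""<?xml version="1.0"?>
<Records>
  <Record><Title>First</Title></Record>
  <Record><Title>Second</Title></Record>
</Records>
"""


def iter_titles(stream, record_tag="Record"):
    """Yield one title per record as soon as that record is complete."""
    for _event, element in ET.iterparse(stream, events=("end",)):
        if element.tag == record_tag:
            yield element.findtext("Title")
            # Free the memory held by a record once it has been consumed.
            element.clear()


def read_titles(stream, record_tag="Record"):
    """Collect every record's title into a single list."""
    return list(iter_titles(stream, record_tag))


titles = iter_titles(io.BytesIO(XML))
first = next(titles)  # only the first record has been processed so far
everything = read_titles(io.BytesIO(XML))
```

The generator version never needs to hold more than one record at a time, which is the property that makes a record-by-record parser attractive for large files.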
Most XML files returned by NCBI’s Entrez Utilities can be parsed by this function, provided its DTD is available. Biopython includes the DTDs for most commonly used Entrez Utilities.
The argument source must be a file or file-like object opened in binary mode, or a filename. The parser detects the encoding from the XML file, and uses it to convert all text in the XML to the correct Unicode string. The functions in Bio.Entrez to access NCBI Entrez will automatically return XML data in binary mode. For files, use mode “rb” when opening the file, as in
>>> from Bio import Entrez
>>> path = "Entrez/pubmed1.xml"
>>> stream = open(path, "rb") # opened in binary mode
>>> records = Entrez.parse(stream)
>>> for record in records:
...     print(record['MedlineCitation']['Article']['Journal']['Title'])
...
Social justice (San Francisco, Calif.)
Biochimica et biophysica acta
>>> stream.close()
Alternatively, you can use the filename directly, as in
>>> records = Entrez.parse(path)
>>> for record in records:
...     print(record['MedlineCitation']['Article']['Journal']['Title'])
...
Social justice (San Francisco, Calif.)
Biochimica et biophysica acta
which is safer, as the file stream will automatically be closed after all the records have been read, or if an error occurs.
If validate is True (default), the parser will validate the XML file against the DTD, and raise an error if the XML file contains tags that are not represented in the DTD. If validate is False, the parser will simply skip such tags.
If escape is True, all characters that are not valid HTML are replaced by HTML escape characters to guarantee that the returned strings are valid HTML fragments. For example, a less-than sign (<) is replaced by &lt;. If escape is False (default), the string is returned as is.
If ignore_errors is False (default), any error messages in the XML file will raise a RuntimeError. If ignore_errors is True, error messages will be stored as ErrorElement items, without raising an exception.
Whereas the data structure seems to consist of generic Python lists, dictionaries, strings, and so on, each of these is actually a class derived from the base type. This allows us to store the attributes (if any) of each element in a dictionary my_element.attributes, and the tag name in my_element.tag.