Source Code for Package Bio.Entrez

# Copyright 1999-2000 by Jeffrey Chang.  All rights reserved.
# Copyright 2008-2013 by Michiel de Hoon.  All rights reserved.
# Revisions copyright 2011-2016 by Peter Cock. All rights reserved.
# Revisions copyright 2015 by Eric Rasche. All rights reserved.
# Revisions copyright 2015 by Carlos Pena. All rights reserved.
# This code is part of the Biopython distribution and governed by its
# license.  Please see the LICENSE file that should have been included
# as part of this package.

"""Provides code to access NCBI over the WWW.

The main Entrez web page is available at:
http://www.ncbi.nlm.nih.gov/Entrez/

Entrez Programming Utilities web page is available at:
http://www.ncbi.nlm.nih.gov/books/NBK25501/

This module provides a number of functions like ``efetch`` (short for
Entrez Fetch) which will return the data as a handle object. This is
a standard interface used in Python for reading data from a file, or
in this case a remote network connection, and provides methods like
``.read()`` or offers iteration over the contents line by line. See
also "What the heck is a handle?" in the Biopython Tutorial and
Cookbook: http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf

Unlike a handle to a file on disk from the ``open(filename)`` function,
which has a ``.name`` attribute giving the filename, the handles from
``Bio.Entrez`` all have a ``.url`` attribute instead giving the URL
used to connect to the NCBI Entrez API.

The Entrez module also provides an XML parser which takes a handle
as input.

Variables:

    - email        Set the Entrez email parameter (default is not set).
    - tool         Set the Entrez tool parameter (default is ``biopython``).

Functions:

    - efetch       Retrieves records in the requested format from a list of one or
      more primary IDs or from the user's environment
    - epost        Posts a file containing a list of primary IDs for future use in
      the user's environment to use with subsequent search strategies
    - esearch      Searches and retrieves primary IDs (for use in EFetch, ELink,
      and ESummary) and term translations and optionally retains
      results for future use in the user's environment.
    - elink        Checks for the existence of an external or Related Articles link
      from a list of one or more primary IDs.  Retrieves primary IDs
      and relevancy scores for links to Entrez databases or Related
      Articles;  creates a hyperlink to the primary LinkOut provider
      for a specific ID and database, or lists LinkOut URLs
      and Attributes for multiple IDs.
    - einfo        Provides field index term counts, last update, and available
      links for each database.
    - esummary     Retrieves document summaries from a list of primary IDs or from
      the user's environment.
    - egquery      Provides Entrez database counts in XML for a single search
      using Global Query.
    - espell       Retrieves spelling suggestions.
    - ecitmatch    Retrieves PubMed IDs (PMIDs) that correspond to a set of
      input citation strings.

    - read         Parses the XML results returned by any of the above functions.
      Typical usage is:

          >>> from Bio import Entrez
          >>> Entrez.email = "Your.Name.Here@example.org"
          >>> handle = Entrez.einfo() # or esearch, efetch, ...
          >>> record = Entrez.read(handle)
          >>> handle.close()

       where record is now a Python dictionary or list.

    - parse        Parses the XML results returned by those of the above functions
      which can return multiple records - such as efetch, esummary
      and elink. Typical usage is:

          >>> handle = Entrez.esummary(db="pubmed", id="19304878,14630660", retmode="xml")
          >>> records = Entrez.parse(handle)
          >>> for record in records:
          ...     # each record is a Python dictionary or list.
          ...     print(record['Title'])
          Biopython: freely available Python tools for computational molecular biology and bioinformatics.
          PDB file parser and structure class implemented in Python.
          >>> handle.close()

      This function is appropriate only if the XML file contains
      multiple records, and is particularly useful for large files.

    - _open        Internally used function.

"""
from __future__ import print_function

import time
import warnings

# Importing these functions with leading underscore as not intended for reuse
from Bio._py3k import urlopen as _urlopen
from Bio._py3k import urlencode as _urlencode
from Bio._py3k import HTTPError as _HTTPError

from Bio._py3k import _binary_to_string_handle, _as_bytes


email = None
tool = "biopython"


# XXX retmode?
def epost(db, **keywds):
    """Post a file of identifiers for future use.

    Posts a file containing a list of UIs for future use in the user's
    environment to use with subsequent search strategies.

    See the online documentation for an explanation of the parameters:
    http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EPost

    Return a handle to the results.

    Raises an IOError exception if there's a network error.
    """
    cgi = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi'
    variables = {'db': db}
    variables.update(keywds)
    return _open(cgi, variables, post=True)


def efetch(db, **keywords):
    """Fetches Entrez results which are returned as a handle.

    EFetch retrieves records in the requested format from a list of one or
    more UIs or from user's environment.

    See the online documentation for an explanation of the parameters:
    http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch

    Return a handle to the results.

    Raises an IOError exception if there's a network error.

    Short example:

    >>> from Bio import Entrez
    >>> Entrez.email = "Your.Name.Here@example.org"
    >>> handle = Entrez.efetch(db="nucleotide", id="AY851612", rettype="gb", retmode="text")
    >>> print(handle.readline().strip())
    LOCUS       AY851612                 892 bp    DNA     linear   PLN 10-APR-2007
    >>> handle.close()

    This will automatically use an HTTP POST rather than HTTP GET if there
    are over 200 identifiers as recommended by the NCBI.

    **Warning:** The NCBI changed the default retmode in Feb 2012, so many
    databases which previously returned text output now give XML.
    """
    cgi = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'
    variables = {'db': db}
    variables.update(keywords)
    post = False
    try:
        ids = variables["id"]
    except KeyError:
        pass
    else:
        if isinstance(ids, list):
            ids = ",".join(ids)
            variables["id"] = ids
        elif isinstance(ids, int):
            ids = str(ids)
            variables["id"] = ids

        if ids.count(",") >= 200:
            # NCBI prefers an HTTP POST instead of an HTTP GET if there are
            # more than about 200 IDs
            post = True
    return _open(cgi, variables, post=post)


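# Illustrative sketch, not part of Bio.Entrez: the ID handling inside
# ``efetch`` can be read as a small pure function. ``normalize_ids`` and
# ``post_threshold`` are hypothetical names introduced here; the logic mirrors
# the branch above (lists are comma-joined, integers become strings, and POST
# is chosen once the string holds at least 200 commas).

```python
def normalize_ids(ids, post_threshold=200):
    """Return (id_string, use_post) for an ``id`` argument.

    Hypothetical helper that mirrors the branch inside ``efetch``.
    """
    if isinstance(ids, list):
        ids = ",".join(ids)   # list of ID strings -> "1,2,3"
    elif isinstance(ids, int):
        ids = str(ids)        # single integer ID -> "19923"
    # N commas mean N + 1 identifiers, so POST starts at 201 identifiers.
    use_post = ids.count(",") >= post_threshold
    return ids, use_post


print(normalize_ids(["19304878", "14630660"]))
print(normalize_ids([str(i) for i in range(1, 202)])[1])  # 201 identifiers
```

# Counting commas rather than identifiers means exactly 200 IDs still go out
# as a GET, which matches the "over 200 identifiers" wording in the docstring.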
def esearch(db, term, **keywds):
    """ESearch runs an Entrez search and returns a handle to the results.

    ESearch searches and retrieves primary IDs (for use in EFetch, ELink
    and ESummary) and term translations, and optionally retains results
    for future use in the user's environment.

    See the online documentation for an explanation of the parameters:
    http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch

    Return a handle to the results which are always in XML format.

    Raises an IOError exception if there's a network error.

    Short example:

    >>> from Bio import Entrez
    >>> Entrez.email = "Your.Name.Here@example.org"
    >>> handle = Entrez.esearch(db="nucleotide", retmax=10, term="opuntia[ORGN] accD", idtype="acc")
    >>> record = Entrez.read(handle)
    >>> handle.close()
    >>> int(record["Count"]) >= 2
    True
    >>> "EF590893.1" in record["IdList"]
    True
    >>> "EF590892.1" in record["IdList"]
    True

    """
    cgi = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
    variables = {'db': db,
                 'term': term}
    variables.update(keywds)
    return _open(cgi, variables)


def einfo(**keywds):
    """EInfo returns a summary of the Entrez databases as a results handle.

    EInfo provides field names, index term counts, last update, and
    available links for each Entrez database.

    See the online documentation for an explanation of the parameters:
    http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EInfo

    Return a handle to the results, by default in XML format.

    Raises an IOError exception if there's a network error.

    Short example:

    >>> from Bio import Entrez
    >>> Entrez.email = "Your.Name.Here@example.org"
    >>> record = Entrez.read(Entrez.einfo())
    >>> 'pubmed' in record['DbList']
    True

    """
    cgi = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi'
    variables = {}
    variables.update(keywds)
    return _open(cgi, variables)


def esummary(**keywds):
    """ESummary retrieves document summaries as a results handle.

    ESummary retrieves document summaries from a list of primary IDs or
    from the user's environment.

    See the online documentation for an explanation of the parameters:
    http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESummary

    Return a handle to the results, by default in XML format.

    Raises an IOError exception if there's a network error.

    This example discovers more about entry 19923 in the structure
    database:

    >>> from Bio import Entrez
    >>> Entrez.email = "Your.Name.Here@example.org"
    >>> handle = Entrez.esummary(db="structure", id="19923")
    >>> record = Entrez.read(handle)
    >>> handle.close()
    >>> print(record[0]["Id"])
    19923
    >>> print(record[0]["PdbDescr"])
    Crystal Structure Of E. Coli Aconitase B

    """
    cgi = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi'
    variables = {}
    variables.update(keywds)
    return _open(cgi, variables)


def egquery(**keywds):
    """EGQuery provides Entrez database counts for a global search.

    EGQuery provides Entrez database counts in XML for a single search
    using Global Query.

    See the online documentation for an explanation of the parameters:
    http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EGQuery

    Return a handle to the results in XML format.

    Raises an IOError exception if there's a network error.

    This quick example based on a longer version from the Biopython
    Tutorial just checks there are over 60 matches for 'Biopython'
    in PubMedCentral:

    >>> from Bio import Entrez
    >>> Entrez.email = "Your.Name.Here@example.org"
    >>> handle = Entrez.egquery(term="biopython")
    >>> record = Entrez.read(handle)
    >>> handle.close()
    >>> for row in record["eGQueryResult"]:
    ...     if "pmc" in row["DbName"]:
    ...         print(int(row["Count"]) > 60)
    True

    """
    cgi = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/egquery.fcgi'
    variables = {}
    variables.update(keywds)
    return _open(cgi, variables)


def espell(**keywds):
    """ESpell retrieves spelling suggestions, returned in a results handle.

    ESpell retrieves spelling suggestions, if available.

    See the online documentation for an explanation of the parameters:
    http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESpell

    Return a handle to the results, by default in XML format.

    Raises an IOError exception if there's a network error.

    Short example:

    >>> from Bio import Entrez
    >>> Entrez.email = "Your.Name.Here@example.org"
    >>> record = Entrez.read(Entrez.espell(term="biopythooon"))
    >>> print(record["Query"])
    biopythooon
    >>> print(record["CorrectedQuery"])
    biopython

    """
    cgi = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/espell.fcgi'
    variables = {}
    variables.update(keywds)
    return _open(cgi, variables)


def _update_ecitmatch_variables(keywds):
    # XML is the only supported value, and it actually returns TXT.
    variables = {'retmode': 'xml'}
    citation_keys = ('journal_title', 'year', 'volume', 'first_page', 'author_name', 'key')

    # Accept pre-formatted strings
    if isinstance(keywds['bdata'], str):
        variables.update(keywds)
    else:
        # Alternatively accept a nicer interface
        variables['db'] = keywds['db']
        bdata = []
        for citation in keywds['bdata']:
            formatted_citation = '|'.join([citation.get(key, "") for key in citation_keys])
            bdata.append(formatted_citation)
        variables['bdata'] = '\r'.join(bdata)
    return variables


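# Illustrative sketch, not part of Bio.Entrez: how a list of citation
# dictionaries becomes the pipe-delimited ``bdata`` string, following the
# "nicer interface" branch of ``_update_ecitmatch_variables``. The name
# ``format_bdata`` is hypothetical, and ``citation_2`` is a made-up,
# deliberately incomplete entry used only to show that missing keys turn
# into empty fields.

```python
CITATION_KEYS = ("journal_title", "year", "volume",
                 "first_page", "author_name", "key")


def format_bdata(citations):
    """Join each citation's fields with '|' and separate citations with '\\r'."""
    lines = []
    for citation in citations:
        # Missing keys fall back to "", leaving an empty field between pipes.
        lines.append("|".join(citation.get(key, "") for key in CITATION_KEYS))
    return "\r".join(lines)


citation_1 = {"journal_title": "proc natl acad sci u s a",
              "year": "1991", "volume": "88", "first_page": "3248",
              "author_name": "mann bj", "key": "citation_1"}
citation_2 = {"journal_title": "science", "year": "1987", "key": "citation_2"}

for line in format_bdata([citation_1, citation_2]).split("\r"):
    print(line)
```

# The field order is fixed by the key tuple, so the dictionaries themselves
# may list their keys in any order.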
def ecitmatch(**keywds):
    """ECitMatch retrieves PMIDs-Citation linking.

    ECitMatch retrieves PubMed IDs (PMIDs) that correspond to a set of input citation strings.

    See the online documentation for an explanation of the parameters:
    http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ECitMatch

    Return a handle to the results, by default in plain text.

    Raises an IOError exception if there's a network error.

    Short example:

    >>> from Bio import Entrez
    >>> Entrez.email = "Your.Name.Here@example.org"
    >>> citation_1 = {"journal_title": "proc natl acad sci u s a",
    ...               "year": "1991", "volume": "88", "first_page": "3248",
    ...               "author_name": "mann bj", "key": "citation_1"}
    >>> handle = Entrez.ecitmatch(db="pubmed", bdata=[citation_1])
    >>> print(handle.read().strip().split("|"))
    ['proc natl acad sci u s a', '1991', '88', '3248', 'mann bj', 'citation_1', '2014248']
    >>> handle.close()

    """
    cgi = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/ecitmatch.cgi'
    variables = _update_ecitmatch_variables(keywds)
    return _open(cgi, variables, ecitmatch=True)


def read(handle, validate=True):
    """Parses an XML file from the NCBI Entrez Utilities into Python objects.

    This function parses an XML file created by NCBI's Entrez Utilities,
    returning a multilevel data structure of Python lists and dictionaries.
    Most XML files returned by NCBI's Entrez Utilities can be parsed by
    this function, provided its DTD is available. Biopython includes the
    DTDs for most commonly used Entrez Utilities.

    If validate is True (default), the parser will validate the XML file
    against the DTD, and raise an error if the XML file contains tags that
    are not represented in the DTD. If validate is False, the parser will
    simply skip such tags.

    Whereas the data structure seems to consist of generic Python lists,
    dictionaries, strings, and so on, each of these is actually a class
    derived from the base type. This allows us to store the attributes
    (if any) of each element in a dictionary my_element.attributes, and
    the tag name in my_element.tag.
    """
    from .Parser import DataHandler
    handler = DataHandler(validate)
    record = handler.read(handle)
    return record


def parse(handle, validate=True):
    """Parses an XML file from the NCBI Entrez Utilities into Python objects.

    This function parses an XML file created by NCBI's Entrez Utilities,
    returning a multilevel data structure of Python lists and dictionaries.
    This function is suitable for XML files that (in Python) can be represented
    as a list of individual records. Whereas 'read' reads the complete file
    and returns a single Python list, 'parse' is a generator function that
    returns the records one by one. This function is therefore particularly
    useful for parsing large files.

    Most XML files returned by NCBI's Entrez Utilities can be parsed by
    this function, provided its DTD is available. Biopython includes the
    DTDs for most commonly used Entrez Utilities.

    If validate is True (default), the parser will validate the XML file
    against the DTD, and raise an error if the XML file contains tags that
    are not represented in the DTD. If validate is False, the parser will
    simply skip such tags.

    Whereas the data structure seems to consist of generic Python lists,
    dictionaries, strings, and so on, each of these is actually a class
    derived from the base type. This allows us to store the attributes
    (if any) of each element in a dictionary my_element.attributes, and
    the tag name in my_element.tag.
    """
    from .Parser import DataHandler
    handler = DataHandler(validate)
    records = handler.parse(handle)
    return records


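# Illustrative sketch, not part of Bio.Entrez: the difference between
# reading everything at once and yielding records one by one, as described
# for ``read`` and ``parse`` above. This uses the standard library's
# ``xml.etree.ElementTree`` rather than Biopython's ``DataHandler``, and the
# XML layout (``Records``/``Record``/``Title``) is invented for the
# example; real Entrez output is validated against NCBI DTDs.

```python
import io
import xml.etree.ElementTree as ET

SAMPLE_XML = b"""<?xml version="1.0"?>
<Records>
  <Record><Title>first</Title></Record>
  <Record><Title>second</Title></Record>
</Records>"""


def read_titles(handle):
    """Load the whole document, then return every title in one list."""
    root = ET.parse(handle).getroot()
    return [record.findtext("Title") for record in root.iter("Record")]


def parse_titles(handle):
    """Yield titles one at a time as each Record element is completed."""
    for _event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag == "Record":
            yield elem.findtext("Title")
            elem.clear()  # drop the children we no longer need


print(read_titles(io.BytesIO(SAMPLE_XML)))
for title in parse_titles(io.BytesIO(SAMPLE_XML)):
    print(title)
```

# The generator keeps only the record in progress populated, which is why
# the one-by-one style suits large result sets.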
def _open(cgi, params=None, post=None, ecitmatch=False):
    """Helper function to build the URL and open a handle to it (PRIVATE).

    Open a handle to Entrez.  cgi is the URL for the cgi script to access.
    params is a dictionary with the options to pass to it.  Does some
    simple error checking, and will raise an IOError if it encounters one.

    The argument post should be a boolean to explicitly control if an HTTP
    POST should be used rather than an HTTP GET based on the query length.
    By default (post=None), POST is used if the URL encoded parameters would
    be over 1000 characters long.

    This function also enforces the "up to three queries per second rule"
    to avoid abusing the NCBI servers.
    """
    # NCBI requirement: At most three queries per second.
    # Equivalently, at least a third of a second between queries
    delay = 0.333333334
    current = time.time()
    wait = _open.previous + delay - current
    if wait > 0:
        time.sleep(wait)
        _open.previous = current + wait
    else:
        _open.previous = current

    params = _construct_params(params)
    options = _encode_options(ecitmatch, params)

    # By default, post is None. Set to a boolean to over-ride length choice:
    if post is None and len(options) > 1000:
        post = True
    cgi = _construct_cgi(cgi, post, options)

    try:
        if post:
            handle = _urlopen(cgi, data=_as_bytes(options))
        else:
            handle = _urlopen(cgi)
    except _HTTPError as exception:
        raise exception

    return _binary_to_string_handle(handle)

_open.previous = 0


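# Illustrative sketch, not part of Bio.Entrez: the rate-limit bookkeeping in
# ``_open`` separated from the clock and from ``time.sleep`` so it can be
# followed with plain numbers. ``schedule`` is a hypothetical name; the
# arithmetic is the same as above (wait = previous + delay - now).

```python
def schedule(previous, now, delay=1.0 / 3):
    """Return (seconds_to_wait, timestamp_to_store_as_previous).

    ``previous`` is when the last request was allowed to start and
    ``now`` is the current time, both in seconds.
    """
    wait = previous + delay - now
    if wait > 0:
        # Too soon: wait out the remainder; the request starts at now + wait.
        return wait, now + wait
    # Enough time has passed: go immediately.
    return 0.0, now


# A request arriving 0.1 s after the previous one must wait the remainder.
print(schedule(previous=10.0, now=10.1, delay=0.5))
# A request arriving a full second later is not delayed.
print(schedule(previous=10.0, now=11.0, delay=0.5))
```

# Storing ``now + wait`` rather than ``now`` keeps consecutive requests
# spaced by the full delay even when several arrive in a burst.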
def _construct_params(params):
    if params is None:
        params = {}

    # Remove None values from the parameters
    for key, value in list(params.items()):
        if value is None:
            del params[key]
    # Tell Entrez that we are using Biopython (or whatever the user has
    # specified explicitly in the parameters or by changing the default)
    if "tool" not in params:
        params["tool"] = tool
    # Tell Entrez who we are
    if "email" not in params:
        if email is not None:
            params["email"] = email
        else:
            warnings.warn("""
Email address is not specified.

To make use of NCBI's E-utilities, NCBI requires you to specify your
email address with each request.  As an example, if your email address
is A.N.Other@example.com, you can specify it as follows:
   from Bio import Entrez
   Entrez.email = 'A.N.Other@example.com'
In case of excessive usage of the E-utilities, NCBI will attempt to contact
a user at the email address provided before blocking access to the
E-utilities.""", UserWarning)
    return params


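# Illustrative sketch, not part of Bio.Entrez: the parameter clean-up in
# ``_construct_params`` with the module-level ``tool`` and ``email`` passed
# in as arguments so the behaviour can be tried without Biopython.
# ``build_params`` is a hypothetical name. Unlike the function above, it
# builds a new dictionary instead of modifying the caller's, and the
# warning text is shortened.

```python
import warnings


def build_params(params=None, tool="biopython", email=None):
    """Drop None values, then fill in the ``tool`` and ``email`` defaults."""
    cleaned = {key: value for key, value in (params or {}).items()
               if value is not None}
    # An explicit "tool" or "email" in params wins over the defaults.
    cleaned.setdefault("tool", tool)
    if "email" not in cleaned:
        if email is not None:
            cleaned["email"] = email
        else:
            warnings.warn("Email address is not specified.", UserWarning)
    return cleaned


print(build_params({"db": "pubmed", "retmax": None},
                   email="Your.Name.Here@example.org"))
```

# Dropping None values first means an optional argument left as None never
# appears in the encoded request.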
def _encode_options(ecitmatch, params):
    # Open a handle to Entrez.
    options = _urlencode(params, doseq=True)
    # _urlencode encodes pipes, which NCBI expects in ECitMatch
    if ecitmatch:
        options = options.replace('%7C', '|')
    return options


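# Illustrative sketch, not part of Bio.Entrez: what the encoding step in
# ``_encode_options`` produces. It assumes Python 3 and calls
# ``urllib.parse.urlencode`` directly instead of the ``Bio._py3k`` wrapper
# used above; ``encode_options`` is a hypothetical stand-in name.

```python
from urllib.parse import urlencode


def encode_options(params, ecitmatch=False):
    """URL-encode params; for ECitMatch, turn encoded pipes back into '|'."""
    # doseq=True expands a list value into repeated key=value pairs.
    options = urlencode(params, doseq=True)
    if ecitmatch:
        options = options.replace("%7C", "|")
    return options


print(encode_options({"db": "pubmed", "id": ["19304878", "14630660"]}))
print(encode_options({"bdata": "science|1987|"}))
print(encode_options({"bdata": "science|1987|"}, ecitmatch=True))
```

# Only the ECitMatch path restores the literal pipe; every other request
# keeps the percent-encoded form.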
def _construct_cgi(cgi, post, options):
    if not post:
        # HTTP GET
        cgi += "?" + options
    return cgi


if __name__ == "__main__":
    from Bio._utils import run_doctest
    run_doctest()