Source Code for Package Bio.Entrez

# Copyright 1999-2000 by Jeffrey Chang.  All rights reserved.
# Copyright 2008-2013 by Michiel de Hoon.  All rights reserved.
# Revisions copyright 2011-2016 by Peter Cock. All rights reserved.
# Revisions copyright 2015 by Eric Rasche. All rights reserved.
# Revisions copyright 2015 by Carlos Pena. All rights reserved.
# This code is part of the Biopython distribution and governed by its
# license.  Please see the LICENSE file that should have been included
# as part of this package.

"""Provides code to access NCBI over the WWW.

The main Entrez web page is available at:
http://www.ncbi.nlm.nih.gov/Entrez/

Entrez Programming Utilities web page is available at:
http://www.ncbi.nlm.nih.gov/books/NBK25501/

This module provides a number of functions like ``efetch`` (short for
Entrez Fetch) which will return the data as a handle object. This is
a standard interface used in Python for reading data from a file, or
in this case a remote network connection, and provides methods like
``.read()`` or offers iteration over the contents line by line. See
also "What the heck is a handle?" in the Biopython Tutorial and
Cookbook: http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf

Unlike a handle to a file on disk from the ``open(filename)`` function,
which has a ``.name`` attribute giving the filename, the handles from
``Bio.Entrez`` all have a ``.url`` attribute instead giving the URL
used to connect to the NCBI Entrez API.

The Entrez module also provides an XML parser which takes a handle
as input.

Variables:

    - email        Set the Entrez email parameter (default is not set).
    - tool         Set the Entrez tool parameter (default is ``biopython``).

Functions:

    - efetch       Retrieves records in the requested format from a list of one or
      more primary IDs or from the user's environment
    - epost        Posts a file containing a list of primary IDs for future use in
      the user's environment to use with subsequent search strategies
    - esearch      Searches and retrieves primary IDs (for use in EFetch, ELink,
      and ESummary) and term translations and optionally retains
      results for future use in the user's environment.
    - elink        Checks for the existence of an external or Related Articles link
      from a list of one or more primary IDs.  Retrieves primary IDs
      and relevancy scores for links to Entrez databases or Related
      Articles;  creates a hyperlink to the primary LinkOut provider
      for a specific ID and database, or lists LinkOut URLs
      and Attributes for multiple IDs.
    - einfo        Provides field index term counts, last update, and available
      links for each database.
    - esummary     Retrieves document summaries from a list of primary IDs or from
      the user's environment.
    - egquery      Provides Entrez database counts in XML for a single search
      using Global Query.
    - espell       Retrieves spelling suggestions.
    - ecitmatch    Retrieves PubMed IDs (PMIDs) that correspond to a set of
      input citation strings.

    - read         Parses the XML results returned by any of the above functions.
      Typical usage is:

          >>> from Bio import Entrez
          >>> Entrez.email = "Your.Name.Here@example.org"
          >>> handle = Entrez.einfo() # or esearch, efetch, ...
          >>> record = Entrez.read(handle)
          >>> handle.close()

      where record is now a Python dictionary or list.

    - parse        Parses the XML results returned by those of the above functions
      which can return multiple records - such as efetch, esummary
      and elink. Typical usage is:

          >>> handle = Entrez.efetch("pubmed", id="19304878,14630660", retmode="xml")
          >>> records = Entrez.parse(handle)
          >>> for record in records:
          ...     # each record is a Python dictionary or list.
          ...     print(record['MedlineCitation']['Article']['ArticleTitle'])
          Biopython: freely available Python tools for computational molecular biology and bioinformatics.
          PDB file parser and structure class implemented in Python.
          >>> handle.close()

      This function is appropriate only if the XML file contains
      multiple records, and is particularly useful for large files.

    - _open        Internally used function.

"""
from __future__ import print_function

import time
import warnings

# Importing these functions with leading underscore as not intended for reuse
from Bio._py3k import urlopen as _urlopen
from Bio._py3k import urlencode as _urlencode
from Bio._py3k import HTTPError as _HTTPError

from Bio._py3k import _binary_to_string_handle, _as_bytes


email = None
tool = "biopython"


# XXX retmode?
def epost(db, **keywds):
    """Post a file of identifiers for future use.

    Posts a file containing a list of UIs for future use in the user's
    environment to use with subsequent search strategies.

    See the online documentation for an explanation of the parameters:
    http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EPost

    Return a handle to the results.

    Raises an IOError exception if there's a network error.
    """
    cgi = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi'
    variables = {'db': db}
    variables.update(keywds)
    return _open(cgi, variables, post=True)


def efetch(db, **keywords):
    """Fetches Entrez results which are returned as a handle.

    EFetch retrieves records in the requested format from a list of one or
    more UIs or from user's environment.

    See the online documentation for an explanation of the parameters:
    http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch

    Return a handle to the results.

    Raises an IOError exception if there's a network error.

    Short example:

    >>> from Bio import Entrez
    >>> Entrez.email = "Your.Name.Here@example.org"
    >>> handle = Entrez.efetch(db="nucleotide", id="57240072", rettype="gb", retmode="text")
    >>> print(handle.readline().strip())
    LOCUS AY851612 892 bp DNA linear PLN 10-APR-2007
    >>> handle.close()

    This will automatically use an HTTP POST rather than HTTP GET if there
    are over 200 identifiers as recommended by the NCBI.

    **Warning:** The NCBI changed the default retmode in Feb 2012, so many
    databases which previously returned text output now give XML.
    """
    cgi = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'
    variables = {'db': db}
    variables.update(keywords)
    post = False
    try:
        ids = variables["id"]
    except KeyError:
        pass
    else:
        if isinstance(ids, list):
            ids = ",".join(ids)
            variables["id"] = ids
        elif isinstance(ids, int):
            ids = str(ids)
            variables["id"] = ids

        if ids.count(",") >= 200:
            # NCBI prefers an HTTP POST instead of an HTTP GET if there are
            # more than about 200 IDs
            post = True
    return _open(cgi, variables, post=post)


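The `id` handling in `efetch` can be sketched on its own. This is a minimal illustration, not part of Biopython: `normalize_ids` and `should_post` are hypothetical helper names, and unlike the listing the sketch passes each list element through `str()`, so a list of integer IDs is accepted as well.

```python
def normalize_ids(ids):
    """Return the IDs as one comma-separated string (hypothetical helper)."""
    if isinstance(ids, (list, tuple)):
        # Assumption: convert each element, so integers in a list also work.
        return ",".join(str(i) for i in ids)
    return str(ids)


def should_post(id_string, threshold=200):
    """Mirror the listing's check: POST once there are >= threshold commas."""
    return id_string.count(",") >= threshold


few = normalize_ids([57240072, "57240071"])
many = normalize_ids(list(range(1, 302)))  # 301 IDs, hence 300 commas

print(few, should_post(few))
print(len(many.split(",")), should_post(many))
```

Because the check counts commas, the switch to POST happens at 201 identifiers (200 commas), which matches the "more than about 200 IDs" comment in the listing.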
def esearch(db, term, **keywds):
    """ESearch runs an Entrez search and returns a handle to the results.

    ESearch searches and retrieves primary IDs (for use in EFetch, ELink
    and ESummary) and term translations, and optionally retains results
    for future use in the user's environment.

    See the online documentation for an explanation of the parameters:
    http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch

    Return a handle to the results which are always in XML format.

    Raises an IOError exception if there's a network error.

    Short example:

    >>> from Bio import Entrez
    >>> Entrez.email = "Your.Name.Here@example.org"
    >>> handle = Entrez.esearch(db="nucleotide", retmax=10, term="opuntia[ORGN] accD")
    >>> record = Entrez.read(handle)
    >>> handle.close()
    >>> record["Count"] >= 2
    True
    >>> "156535671" in record["IdList"]
    True
    >>> "156535673" in record["IdList"]
    True

    """
    cgi = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
    variables = {'db': db,
                 'term': term}
    variables.update(keywds)
    return _open(cgi, variables)


def einfo(**keywds):
    """EInfo returns a summary of the Entrez databases as a results handle.

    EInfo provides field names, index term counts, last update, and
    available links for each Entrez database.

    See the online documentation for an explanation of the parameters:
    http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EInfo

    Return a handle to the results, by default in XML format.

    Raises an IOError exception if there's a network error.

    Short example:

    >>> from Bio import Entrez
    >>> Entrez.email = "Your.Name.Here@example.org"
    >>> record = Entrez.read(Entrez.einfo())
    >>> 'pubmed' in record['DbList']
    True

    """
    cgi = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi'
    variables = {}
    variables.update(keywds)
    return _open(cgi, variables)


def esummary(**keywds):
    """ESummary retrieves document summaries as a results handle.

    ESummary retrieves document summaries from a list of primary IDs or
    from the user's environment.

    See the online documentation for an explanation of the parameters:
    http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESummary

    Return a handle to the results, by default in XML format.

    Raises an IOError exception if there's a network error.

    This example discovers more about entry 30367 in the journals database:

    >>> from Bio import Entrez
    >>> Entrez.email = "Your.Name.Here@example.org"
    >>> handle = Entrez.esummary(db="journals", id="30367")
    >>> record = Entrez.read(handle)
    >>> handle.close()
    >>> print(record[0]["Id"])
    30367
    >>> print(record[0]["Title"])
    Computational biology and chemistry

    """
    cgi = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi'
    variables = {}
    variables.update(keywds)
    return _open(cgi, variables)


def egquery(**keywds):
    """EGQuery provides Entrez database counts for a global search.

    EGQuery provides Entrez database counts in XML for a single search
    using Global Query.

    See the online documentation for an explanation of the parameters:
    http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EGQuery

    Return a handle to the results in XML format.

    Raises an IOError exception if there's a network error.

    This quick example based on a longer version from the Biopython
    Tutorial just checks there are over 60 matches for 'Biopython'
    in PubMedCentral:

    >>> from Bio import Entrez
    >>> Entrez.email = "Your.Name.Here@example.org"
    >>> handle = Entrez.egquery(term="biopython")
    >>> record = Entrez.read(handle)
    >>> handle.close()
    >>> for row in record["eGQueryResult"]:
    ...     if "pmc" in row["DbName"]:
    ...         print(row["Count"] > 60)
    True

    """
    cgi = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/egquery.fcgi'
    variables = {}
    variables.update(keywds)
    return _open(cgi, variables)


def espell(**keywds):
    """ESpell retrieves spelling suggestions, returned in a results handle.

    ESpell retrieves spelling suggestions, if available.

    See the online documentation for an explanation of the parameters:
    http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESpell

    Return a handle to the results, by default in XML format.

    Raises an IOError exception if there's a network error.

    Short example:

    >>> from Bio import Entrez
    >>> Entrez.email = "Your.Name.Here@example.org"
    >>> record = Entrez.read(Entrez.espell(term="biopythooon"))
    >>> print(record["Query"])
    biopythooon
    >>> print(record["CorrectedQuery"])
    biopython

    """
    cgi = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/espell.fcgi'
    variables = {}
    variables.update(keywds)
    return _open(cgi, variables)


def _update_ecitmatch_variables(keywds):
    # XML is the only supported value, and it actually returns TXT.
    variables = {'retmode': 'xml'}
    citation_keys = ('journal_title', 'year', 'volume', 'first_page', 'author_name', 'key')

    # Accept pre-formatted strings
    if isinstance(keywds['bdata'], str):
        variables.update(keywds)
    else:
        # Alternatively accept a nicer interface
        variables['db'] = keywds['db']
        bdata = []
        for citation in keywds['bdata']:
            formatted_citation = '|'.join([citation.get(key, "") for key in citation_keys])
            bdata.append(formatted_citation)
        variables['bdata'] = '\r'.join(bdata)
    return variables


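The "nicer interface" branch of `_update_ecitmatch_variables` turns a list of citation dictionaries into a single `bdata` string. The sketch below reproduces only that formatting step so the resulting layout is visible; `format_bdata` and `citation_2` are hypothetical names introduced for illustration and are not part of Biopython.

```python
# Field order taken from the citation_keys tuple in the listing.
CITATION_KEYS = ("journal_title", "year", "volume",
                 "first_page", "author_name", "key")


def format_bdata(citations):
    """Join each citation's fields with pipes, and citations with a carriage return."""
    rows = ["|".join(c.get(k, "") for k in CITATION_KEYS) for c in citations]
    return "\r".join(rows)


citation_1 = {
    "journal_title": "proc natl acad sci u s a",
    "year": "1991", "volume": "88", "first_page": "3248",
    "author_name": "mann bj", "key": "citation_1",
}
# Hypothetical second citation: missing fields become empty slots.
citation_2 = {"journal_title": "science", "year": "1987", "key": "citation_2"}

bdata = format_bdata([citation_1, citation_2])
print(repr(bdata))
```

Missing keys are kept as empty fields rather than dropped, so every citation has the same number of pipe-separated positions.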
def ecitmatch(**keywds):
    """ECitMatch retrieves PMIDs-Citation linking.

    ECitMatch retrieves PubMed IDs (PMIDs) that correspond to a set of input citation strings.

    See the online documentation for an explanation of the parameters:
    http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ECitMatch

    Return a handle to the results, by default in plain text.

    Raises an IOError exception if there's a network error.

    Short example:

    >>> from Bio import Entrez
    >>> Entrez.email = "Your.Name.Here@example.org"
    >>> citation_1 = {
    ...    "journal_title": "proc natl acad sci u s a",
    ...    "year": "1991", "volume": "88", "first_page": "3248",
    ...    "author_name": "mann bj", "key": "citation_1"}
    >>> handle = Entrez.ecitmatch(db="pubmed", bdata=[citation_1])
    >>> result = handle.read()  # plain text, not XML, so Entrez.read does not apply
    >>> handle.close()
    """
    cgi = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/ecitmatch.cgi'
    variables = _update_ecitmatch_variables(keywds)
    return _open(cgi, variables, ecitmatch=True)


def read(handle, validate=True):
    """Parses an XML file from the NCBI Entrez Utilities into Python objects.

    This function parses an XML file created by NCBI's Entrez Utilities,
    returning a multilevel data structure of Python lists and dictionaries.
    Most XML files returned by NCBI's Entrez Utilities can be parsed by
    this function, provided its DTD is available. Biopython includes the
    DTDs for most commonly used Entrez Utilities.

    If validate is True (default), the parser will validate the XML file
    against the DTD, and raise an error if the XML file contains tags that
    are not represented in the DTD. If validate is False, the parser will
    simply skip such tags.

    Whereas the data structure seems to consist of generic Python lists,
    dictionaries, strings, and so on, each of these is actually a class
    derived from the base type. This allows us to store the attributes
    (if any) of each element in a dictionary my_element.attributes, and
    the tag name in my_element.tag.
    """
    from .Parser import DataHandler
    handler = DataHandler(validate)
    record = handler.read(handle)
    return record


def parse(handle, validate=True):
    """Parses an XML file from the NCBI Entrez Utilities into Python objects.

    This function parses an XML file created by NCBI's Entrez Utilities,
    returning a multilevel data structure of Python lists and dictionaries.
    This function is suitable for XML files that (in Python) can be represented
    as a list of individual records. Whereas 'read' reads the complete file
    and returns a single Python list, 'parse' is a generator function that
    returns the records one by one. This function is therefore particularly
    useful for parsing large files.

    Most XML files returned by NCBI's Entrez Utilities can be parsed by
    this function, provided its DTD is available. Biopython includes the
    DTDs for most commonly used Entrez Utilities.

    If validate is True (default), the parser will validate the XML file
    against the DTD, and raise an error if the XML file contains tags that
    are not represented in the DTD. If validate is False, the parser will
    simply skip such tags.

    Whereas the data structure seems to consist of generic Python lists,
    dictionaries, strings, and so on, each of these is actually a class
    derived from the base type. This allows us to store the attributes
    (if any) of each element in a dictionary my_element.attributes, and
    the tag name in my_element.tag.
    """
    from .Parser import DataHandler
    handler = DataHandler(validate)
    records = handler.parse(handle)
    return records


def _open(cgi, params=None, post=None, ecitmatch=False):
    """Helper function to build the URL and open a handle to it (PRIVATE).

    Open a handle to Entrez.  cgi is the URL for the cgi script to access.
    params is a dictionary with the options to pass to it.  Does some
    simple error checking, and will raise an IOError if it encounters one.

    The argument post should be a boolean to explicitly control if an HTTP
    POST should be used rather than an HTTP GET based on the query length.
    By default (post=None), POST is used if the URL encoded parameters would
    be over 1000 characters long.

    This function also enforces the "up to three queries per second rule"
    to avoid abusing the NCBI servers.
    """
    # NCBI requirement: At most three queries per second.
    # Equivalently, at least a third of a second between queries
    delay = 0.333333334
    current = time.time()
    wait = _open.previous + delay - current
    if wait > 0:
        time.sleep(wait)
        _open.previous = current + wait
    else:
        _open.previous = current

    params = _construct_params(params)
    options = _encode_options(ecitmatch, params)

    # By default, post is None. Set to a boolean to over-ride length choice:
    if post is None and len(options) > 1000:
        post = True
    cgi = _construct_cgi(cgi, post, options)

    try:
        if post:
            handle = _urlopen(cgi, data=_as_bytes(options))
        else:
            handle = _urlopen(cgi)
    except _HTTPError as exception:
        raise exception

    return _binary_to_string_handle(handle)

_open.previous = 0


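The throttling at the top of `_open` stores the time of the last request on the function object itself. The same idea can be sketched as a small standalone class. This is an illustration under stated assumptions rather than Biopython code: `Throttle` is a hypothetical name, and the injectable `clock` and `sleep` arguments exist only to make the behaviour easy to check without real waiting.

```python
import time


class Throttle:
    """Enforce a minimum delay between calls (hypothetical helper)."""

    def __init__(self, delay=1.0 / 3, clock=time.time, sleep=time.sleep):
        self.delay = delay      # seconds required between two requests
        self.clock = clock      # returns the current time in seconds
        self.sleep = sleep      # blocks for the given number of seconds
        self.previous = 0.0     # time at which the last request was allowed

    def wait(self):
        """Sleep if the last request was too recent; return seconds slept."""
        current = self.clock()
        pause = self.previous + self.delay - current
        if pause > 0:
            self.sleep(pause)
            self.previous = current + pause
        else:
            pause = 0.0
            self.previous = current
        return pause


throttle = Throttle()
for n in range(3):
    throttle.wait()  # the first call returns at once, later calls pause
    print("request", n, "allowed")
```

As in the listing, the recorded time after a pause is `current + pause` rather than a fresh clock reading, so the spacing between requests is at least `delay` even if `sleep` returns slightly early.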
def _construct_params(params):
    if params is None:
        params = {}

    # Remove None values from the parameters
    for key, value in list(params.items()):
        if value is None:
            del params[key]
    # Tell Entrez that we are using Biopython (or whatever the user has
    # specified explicitly in the parameters or by changing the default)
    if "tool" not in params:
        params["tool"] = tool
    # Tell Entrez who we are
    if "email" not in params:
        if email is not None:
            params["email"] = email
        else:
            warnings.warn("""
Email address is not specified.

To make use of NCBI's E-utilities, NCBI requires you to specify your
email address with each request.  As an example, if your email address
is A.N.Other@example.com, you can specify it as follows:
   from Bio import Entrez
   Entrez.email = 'A.N.Other@example.com'
In case of excessive usage of the E-utilities, NCBI will attempt to contact
a user at the email address provided before blocking access to the
E-utilities.""", UserWarning)
    return params


def _encode_options(ecitmatch, params):
    # Open a handle to Entrez.
    options = _urlencode(params, doseq=True)
    # _urlencode encodes pipes, which NCBI expects in ECitMatch
    if ecitmatch:
        options = options.replace('%7C', '|')
    return options


def _construct_cgi(cgi, post, options):
    if not post:
        # HTTP GET
        cgi += "?" + options
    return cgi


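Taken together, `_encode_options` and `_construct_cgi` decide what is sent and whether it travels in the URL (GET) or in the request body (POST). The following sketch combines both steps using only the Python 3 standard library. `build_request` is a hypothetical name, and `urllib.parse.urlencode` stands in for the `Bio._py3k` wrapper used in the listing.

```python
from urllib.parse import urlencode  # Python 3 standard library


def build_request(cgi, params, post=None, ecitmatch=False, limit=1000):
    """Return (url, body); body is None for an HTTP GET (hypothetical helper)."""
    options = urlencode(params, doseq=True)
    if ecitmatch:
        # urlencode escapes "|" as %7C; ECitMatch expects literal pipes.
        options = options.replace("%7C", "|")
    if post is None:
        # No explicit choice: use POST only for long queries.
        post = len(options) > limit
    if post:
        return cgi, options.encode("ascii")
    return cgi + "?" + options, None


base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

url, body = build_request(base + "esearch.fcgi",
                          {"db": "pubmed", "term": "biopython"})
print(url, body)

long_url, long_body = build_request(base + "efetch.fcgi",
                                    {"db": "pubmed",
                                     "id": ",".join(["1"] * 600)})
print(long_url, len(long_body))
```

No network request is made here; the sketch only shows how the final URL and body would be assembled before being handed to a URL opener.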
def _test():
    """Run the module's doctests (PRIVATE)."""
    print("Running doctests...")
    import doctest
    doctest.testmod()
    print("Done")


if __name__ == "__main__":
    _test()