Source Code for Package Bio.Entrez

# Copyright 1999-2000 by Jeffrey Chang.  All rights reserved.
# Copyright 2008-2013 by Michiel de Hoon.  All rights reserved.
# Revisions copyright 2011-2015 by Peter Cock. All rights reserved.
# Revisions copyright 2015 by Eric Rasche. All rights reserved.
# This code is part of the Biopython distribution and governed by its
# license.  Please see the LICENSE file that should have been included
# as part of this package.

"""Provides code to access NCBI over the WWW.

The main Entrez web page is available at:
http://www.ncbi.nlm.nih.gov/Entrez/

Entrez Programming Utilities web page is available at:
http://www.ncbi.nlm.nih.gov/books/NBK25501/

This module provides a number of functions like ``efetch`` (short for
Entrez Fetch) which will return the data as a handle object. This is
a standard interface used in Python for reading data from a file, or
in this case a remote network connection, and provides methods like
``.read()`` as well as iteration over the contents line by line. See
also "What the heck is a handle?" in the Biopython Tutorial and
Cookbook: http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf

Unlike a handle to a file on disk from the ``open(filename)`` function,
which has a ``.name`` attribute giving the filename, the handles from
``Bio.Entrez`` all have a ``.url`` attribute instead giving the URL
used to connect to the NCBI Entrez API.

The Entrez module also provides an XML parser which takes a handle
as input.

Variables:

    - email        Set the Entrez email parameter (default is not set).
    - tool         Set the Entrez tool parameter (default is ``biopython``).

Functions:

    - efetch       Retrieves records in the requested format from a list of one or
      more primary IDs or from the user's environment
    - epost        Posts a file containing a list of primary IDs for future use in
      the user's environment to use with subsequent search strategies
    - esearch      Searches and retrieves primary IDs (for use in EFetch, ELink,
      and ESummary) and term translations and optionally retains
      results for future use in the user's environment.
    - elink        Checks for the existence of an external or Related Articles link
      from a list of one or more primary IDs.  Retrieves primary IDs
      and relevancy scores for links to Entrez databases or Related
      Articles;  creates a hyperlink to the primary LinkOut provider
      for a specific ID and database, or lists LinkOut URLs
      and Attributes for multiple IDs.
    - einfo        Provides field index term counts, last update, and available
      links for each database.
    - esummary     Retrieves document summaries from a list of primary IDs or from
      the user's environment.
    - egquery      Provides Entrez database counts in XML for a single search
      using Global Query.
    - espell       Retrieves spelling suggestions.
    - ecitmatch    Retrieves PubMed IDs (PMIDs) that correspond to a set of
      input citation strings.

    - read         Parses the XML results returned by any of the above functions.
      Typical usage is:

          >>> from Bio import Entrez
          >>> Entrez.email = "Your.Name.Here@example.org"
          >>> handle = Entrez.einfo() # or esearch, efetch, ...
          >>> record = Entrez.read(handle)
          >>> handle.close()

      where record is now a Python dictionary or list.

    - parse        Parses the XML results returned by those of the above functions
      which can return multiple records - such as efetch, esummary
      and elink. Typical usage is:

          >>> handle = Entrez.efetch("pubmed", id="19304878,14630660", retmode="xml")
          >>> records = Entrez.parse(handle)
          >>> for record in records:
          ...     # each record is a Python dictionary or list.
          ...     print(record['MedlineCitation']['Article']['ArticleTitle'])
          Biopython: freely available Python tools for computational molecular biology and bioinformatics.
          PDB file parser and structure class implemented in Python.
          >>> handle.close()

      This function is appropriate only if the XML file contains
      multiple records, and is particularly useful for large files.

    - _open        Internally used function.

"""
from __future__ import print_function

import time
import warnings
import os.path

# Importing these functions with leading underscore as not intended for reuse
from Bio._py3k import urlopen as _urlopen
from Bio._py3k import urlencode as _urlencode
from Bio._py3k import HTTPError as _HTTPError

from Bio._py3k import _binary_to_string_handle, _as_bytes


email = None
tool = "biopython"


# XXX retmode?
def epost(db, **keywds):
    """Post a file of identifiers for future use.

    Posts a file containing a list of UIs for future use in the user's
    environment to use with subsequent search strategies.

    See the online documentation for an explanation of the parameters:
    http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EPost

    Return a handle to the results.

    Raises an IOError exception if there's a network error.
    """
    cgi = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi'
    variables = {'db': db}
    variables.update(keywds)
    return _open(cgi, variables, post=True)


def efetch(db, **keywords):
    """Fetches Entrez results which are returned as a handle.

    EFetch retrieves records in the requested format from a list of one or
    more UIs or from user's environment.

    See the online documentation for an explanation of the parameters:
    http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch

    Return a handle to the results.

    Raises an IOError exception if there's a network error.

    Short example:

    >>> from Bio import Entrez
    >>> Entrez.email = "Your.Name.Here@example.org"
    >>> handle = Entrez.efetch(db="nucleotide", id="57240072", rettype="gb", retmode="text")
    >>> print(handle.readline().strip())
    LOCUS       AY851612                 892 bp    DNA     linear   PLN 10-APR-2007
    >>> handle.close()

    This will automatically use an HTTP POST rather than HTTP GET if there
    are over 200 identifiers as recommended by the NCBI.

    **Warning:** The NCBI changed the default retmode in Feb 2012, so many
    databases which previously returned text output now give XML.
    """
    cgi = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'
    variables = {'db': db}
    variables.update(keywords)
    post = False
    try:
        ids = variables["id"]
    except KeyError:
        pass
    else:
        if isinstance(ids, list):
            ids = ",".join(ids)
            variables["id"] = ids
        elif isinstance(ids, int):
            ids = str(ids)
            variables["id"] = ids

        if ids.count(",") >= 200:
            # NCBI prefers an HTTP POST instead of an HTTP GET if there are
            # more than about 200 IDs
            post = True
    return _open(cgi, variables, post=post)


def esearch(db, term, **keywds):
    """ESearch runs an Entrez search and returns a handle to the results.

    ESearch searches and retrieves primary IDs (for use in EFetch, ELink
    and ESummary) and term translations, and optionally retains results
    for future use in the user's environment.

    See the online documentation for an explanation of the parameters:
    http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch

    Return a handle to the results which are always in XML format.

    Raises an IOError exception if there's a network error.

    Short example:

    >>> from Bio import Entrez
    >>> Entrez.email = "Your.Name.Here@example.org"
    >>> handle = Entrez.esearch(db="nucleotide", retmax=10, term="opuntia[ORGN] accD")
    >>> record = Entrez.read(handle)
    >>> handle.close()
    >>> int(record["Count"]) >= 2
    True
    >>> "156535671" in record["IdList"]
    True
    >>> "156535673" in record["IdList"]
    True

    """
    cgi = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
    variables = {'db': db,
                 'term': term}
    variables.update(keywds)
    return _open(cgi, variables)


def einfo(**keywds):
    """EInfo returns a summary of the Entrez databases as a results handle.

    EInfo provides field names, index term counts, last update, and
    available links for each Entrez database.

    See the online documentation for an explanation of the parameters:
    http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EInfo

    Return a handle to the results, by default in XML format.

    Raises an IOError exception if there's a network error.

    Short example:

    >>> from Bio import Entrez
    >>> Entrez.email = "Your.Name.Here@example.org"
    >>> record = Entrez.read(Entrez.einfo())
    >>> 'pubmed' in record['DbList']
    True

    """
    cgi = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi'
    variables = {}
    variables.update(keywds)
    return _open(cgi, variables)


def esummary(**keywds):
    """ESummary retrieves document summaries as a results handle.

    ESummary retrieves document summaries from a list of primary IDs or
    from the user's environment.

    See the online documentation for an explanation of the parameters:
    http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESummary

    Return a handle to the results, by default in XML format.

    Raises an IOError exception if there's a network error.

    This example discovers more about entry 30367 in the journals database:

    >>> from Bio import Entrez
    >>> Entrez.email = "Your.Name.Here@example.org"
    >>> handle = Entrez.esummary(db="journals", id="30367")
    >>> record = Entrez.read(handle)
    >>> handle.close()
    >>> print(record[0]["Id"])
    30367
    >>> print(record[0]["Title"])
    Computational biology and chemistry

    """
    cgi = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi'
    variables = {}
    variables.update(keywds)
    return _open(cgi, variables)


def egquery(**keywds):
    """EGQuery provides Entrez database counts for a global search.

    EGQuery provides Entrez database counts in XML for a single search
    using Global Query.

    See the online documentation for an explanation of the parameters:
    http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EGQuery

    Return a handle to the results in XML format.

    Raises an IOError exception if there's a network error.

    This quick example based on a longer version from the Biopython
    Tutorial just checks there are over 60 matches for 'Biopython'
    in PubMedCentral:

    >>> from Bio import Entrez
    >>> Entrez.email = "Your.Name.Here@example.org"
    >>> handle = Entrez.egquery(term="biopython")
    >>> record = Entrez.read(handle)
    >>> handle.close()
    >>> for row in record["eGQueryResult"]:
    ...     if "pmc" in row["DbName"]:
    ...         print(int(row["Count"]) > 60)
    True

    """
    cgi = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/egquery.fcgi'
    variables = {}
    variables.update(keywds)
    return _open(cgi, variables)


def espell(**keywds):
    """ESpell retrieves spelling suggestions, returned in a results handle.

    ESpell retrieves spelling suggestions, if available.

    See the online documentation for an explanation of the parameters:
    http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESpell

    Return a handle to the results, by default in XML format.

    Raises an IOError exception if there's a network error.

    Short example:

    >>> from Bio import Entrez
    >>> Entrez.email = "Your.Name.Here@example.org"
    >>> record = Entrez.read(Entrez.espell(term="biopythooon"))
    >>> print(record["Query"])
    biopythooon
    >>> print(record["CorrectedQuery"])
    biopython

    """
    cgi = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/espell.fcgi'
    variables = {}
    variables.update(keywds)
    return _open(cgi, variables)


def ecitmatch(**keywds):
    """ECitMatch retrieves PMIDs-Citation linking.

    ECitMatch retrieves PubMed IDs (PMIDs) that correspond to a set of input
    citation strings.

    See the online documentation for an explanation of the parameters:
    http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ECitMatch

    Return a handle to the results, by default in plain text.

    Raises an IOError exception if there's a network error.

    Short example (the result is plain text, so it is read from the handle
    directly rather than passed to ``Entrez.read``):

    >>> from Bio import Entrez
    >>> Entrez.email = "Your.Name.Here@example.org"
    >>> citation_1 = {
    ...    "journal_title": "proc natl acad sci u s a",
    ...    "year": "1991", "volume": "88", "first_page": "3248",
    ...    "author_name": "mann bj", "key": "citation_1"}
    >>> handle = Entrez.ecitmatch(db="pubmed", bdata=[citation_1])
    >>> text = handle.read()
    >>> handle.close()
    """
    cgi = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/ecitmatch.cgi'
    # XML is the only supported value, and it actually returns TXT.
    variables = {'retmode': 'xml'}
    citation_keys = ('journal_title', 'year', 'volume', 'first_page', 'author_name', 'key')

    # Accept pre-formatted strings
    if isinstance(keywds['bdata'], str):
        variables.update(keywds)
    else:
        # Alternatively accept a nicer interface
        variables['db'] = keywds['db']
        bdata = []
        for citation in keywds['bdata']:
            formatted_citation = '|'.join([citation.get(key, "") for key in citation_keys])
            bdata.append(formatted_citation)
        variables['bdata'] = '\r'.join(bdata)

    return _open(cgi, variables, ecitmatch=True)


def read(handle, validate=True):
    """Parses an XML file from the NCBI Entrez Utilities into python objects.

    This function parses an XML file created by NCBI's Entrez Utilities,
    returning a multilevel data structure of Python lists and dictionaries.
    Most XML files returned by NCBI's Entrez Utilities can be parsed by
    this function, provided its DTD is available. Biopython includes the
    DTDs for most commonly used Entrez Utilities.

    If validate is True (default), the parser will validate the XML file
    against the DTD, and raise an error if the XML file contains tags that
    are not represented in the DTD. If validate is False, the parser will
    simply skip such tags.

    Whereas the data structure seems to consist of generic Python lists,
    dictionaries, strings, and so on, each of these is actually a class
    derived from the base type. This allows us to store the attributes
    (if any) of each element in a dictionary my_element.attributes, and
    the tag name in my_element.tag.
    """
    from .Parser import DataHandler
    handler = DataHandler(validate)
    record = handler.read(handle)
    return record


def parse(handle, validate=True):
    """Parses an XML file from the NCBI Entrez Utilities into python objects.

    This function parses an XML file created by NCBI's Entrez Utilities,
    returning a multilevel data structure of Python lists and dictionaries.
    This function is suitable for XML files that (in Python) can be represented
    as a list of individual records. Whereas 'read' reads the complete file
    and returns a single Python list, 'parse' is a generator function that
    returns the records one by one. This function is therefore particularly
    useful for parsing large files.

    Most XML files returned by NCBI's Entrez Utilities can be parsed by
    this function, provided its DTD is available. Biopython includes the
    DTDs for most commonly used Entrez Utilities.

    If validate is True (default), the parser will validate the XML file
    against the DTD, and raise an error if the XML file contains tags that
    are not represented in the DTD. If validate is False, the parser will
    simply skip such tags.

    Whereas the data structure seems to consist of generic Python lists,
    dictionaries, strings, and so on, each of these is actually a class
    derived from the base type. This allows us to store the attributes
    (if any) of each element in a dictionary my_element.attributes, and
    the tag name in my_element.tag.
    """
    from .Parser import DataHandler
    handler = DataHandler(validate)
    records = handler.parse(handle)
    return records


def _open(cgi, params=None, post=None, ecitmatch=False):
    """Helper function to build the URL and open a handle to it (PRIVATE).

    Open a handle to Entrez.  cgi is the URL for the cgi script to access.
    params is a dictionary with the options to pass to it.  Does some
    simple error checking, and will raise an IOError if it encounters one.

    The argument post should be a boolean to explicitly control if an HTTP
    POST should be used rather than an HTTP GET based on the query length.
    By default (post=None), POST is used if the URL-encoded parameters
    would be over 1000 characters long.

    If ecitmatch is True, pipe characters in the encoded parameters are
    left unescaped, as NCBI expects for ECitMatch.

    This function also enforces the "up to three queries per second rule"
    to avoid abusing the NCBI servers.
    """
    if params is None:
        params = {}
    # NCBI requirement: At most three queries per second.
    # Equivalently, at least a third of second between queries
    delay = 0.333333334
    current = time.time()
    wait = _open.previous + delay - current
    if wait > 0:
        time.sleep(wait)
        _open.previous = current + wait
    else:
        _open.previous = current
    # Remove None values from the parameters
    for key, value in list(params.items()):
        if value is None:
            del params[key]
    # Tell Entrez that we are using Biopython (or whatever the user has
    # specified explicitly in the parameters or by changing the default)
    if "tool" not in params:
        params["tool"] = tool
    # Tell Entrez who we are
    if "email" not in params:
        if email is not None:
            params["email"] = email
        else:
            warnings.warn("""
Email address is not specified.

To make use of NCBI's E-utilities, NCBI requires you to specify your
email address with each request.  As an example, if your email address
is A.N.Other@example.com, you can specify it as follows:
   from Bio import Entrez
   Entrez.email = 'A.N.Other@example.com'
In case of excessive usage of the E-utilities, NCBI will attempt to contact
a user at the email address provided before blocking access to the
E-utilities.""", UserWarning)

    # Open a handle to Entrez.
    options = _urlencode(params, doseq=True)
    # _urlencode encodes pipes, which NCBI expects in ECitMatch
    if ecitmatch:
        options = options.replace('%7C', '|')
    # print cgi + "?" + options

    # By default, post is None. Set to a boolean to over-ride length choice:
    if post is None and len(options) > 1000:
        post = True
    try:
        if post:
            # HTTP POST
            handle = _urlopen(cgi, data=_as_bytes(options))
        else:
            # HTTP GET
            cgi += "?" + options
            handle = _urlopen(cgi)
    except _HTTPError as exception:
        raise exception

    return _binary_to_string_handle(handle)

_open.previous = 0


def _test():
    """Run the module's doctests (PRIVATE)."""
    print("Running doctests...")
    import doctest
    doctest.testmod()
    print("Done")

if __name__ == "__main__":
    _test()
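
The request throttling inside _open can be looked at in isolation. The sketch below is not part of Bio.Entrez: the Throttle class and its clock and sleep parameters are hypothetical names introduced here so the arithmetic can be exercised without a network connection. It follows the same steps as the listing above, assuming the same minimum gap of about a third of a second between calls.

```python
import time


class Throttle(object):
    """Hypothetical helper mirroring the delay arithmetic used in _open."""

    def __init__(self, delay=0.333333334, clock=time.time, sleep=time.sleep):
        self.delay = delay      # minimum number of seconds between calls
        self.clock = clock      # injectable so the logic can be tested offline
        self.sleep = sleep
        self.previous = 0       # time of the previous call, like _open.previous

    def wait(self):
        """Pause if the previous call was less than `delay` seconds ago.

        Returns the number of seconds slept (0 if no pause was needed).
        """
        current = self.clock()
        wait = self.previous + self.delay - current
        if wait > 0:
            self.sleep(wait)
            # Record the time at which the pause is expected to end.
            self.previous = current + wait
            return wait
        self.previous = current
        return 0
```

Calling wait() immediately before each request would space requests at least delay seconds apart. Keeping the state on an instance rather than on a function attribute is a choice made for this sketch only.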