Bio.Entrez.Parser module

Parser for XML results returned by NCBI’s Entrez Utilities.

This parser is used by the read() function in Bio.Entrez, and is not intended to be used directly.

The question is how to represent an XML file as Python objects. Some XML files returned by NCBI look like lists, others look like dictionaries, and others look like a mix of lists and dictionaries.

My approach is to classify each possible element in the XML as a plain string, an integer, a list, a dictionary, or a structure. A structure is a dictionary in which the same key can occur multiple times; in Python, it is represented as a dictionary in which that key occurs once and points to a list of the values found in the XML file.
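As a rough illustration of the "structure" case, the sketch below collects the children of an XML element into a dictionary in which repeated tags map to a list of values. It is not the parser's own code: the function name children_to_structure, the way repeated_tags is used here, and the sample XML are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def children_to_structure(element, repeated_tags):
    """Map child tags to text; tags in repeated_tags map to a list of texts."""
    # Repeated tags always get a list, even if they occur zero or one times.
    result = {tag: [] for tag in repeated_tags}
    for child in element:
        if child.tag in repeated_tags:
            result[child.tag].append(child.text)
        else:
            result[child.tag] = child.text
    return result


# Hypothetical sample document: "Id" occurs twice, "Title" once.
xml_text = "<Record><Title>Example</Title><Id>11</Id><Id>12</Id></Record>"
record = children_to_structure(ET.fromstring(xml_text), repeated_tags={"Id"})
print(record["Title"], record["Id"])
```

Because the key "Id" appears only once and always points to a list, calling code can iterate over record["Id"] without checking how many times the tag occurred.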

The parser then goes through the XML and creates the appropriate Python object for each element. The different levels encountered in the XML are preserved on the Python side. So a subelement of a subelement of an element is a value in a dictionary that is stored in a list which is a value in some other dictionary (or a value in a list which itself belongs to a list which is a value in a dictionary, and so on). Attributes encountered in the XML are stored as a dictionary in a member .attributes of each element, and the tag name is saved in a member .tag.
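To make the nesting concrete, here is a hand-built stand-in for a parsed result, in which a string sits in a dictionary that is stored in a list that is a value in another dictionary. The class TaggedString and the sample data are hypothetical; they only mimic the idea of a value that carries .tag and .attributes and are not the classes defined in this module.

```python
class TaggedString(str):
    """Hypothetical stand-in: a string that also carries a tag and attributes."""

    def __new__(cls, value, tag, attributes):
        # str is immutable, so the text itself is fixed in __new__.
        return super().__new__(cls, value)

    def __init__(self, value, tag, attributes):
        self.tag = tag
        self.attributes = attributes


# A dictionary holding a list of dictionaries, each holding a string.
result = {
    "IdList": [
        {"Id": TaggedString("12345", tag="Id", attributes={"source": "example"})},
    ]
}

item = result["IdList"][0]["Id"]
print(item, item.tag, item.attributes)
```

The value still behaves as an ordinary string, so code that ignores the XML metadata does not need to know about it.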

To decide which kind of Python object corresponds to each element in the XML, the parser analyzes the DTD referred to at the top of (almost) every XML file returned by the Entrez Utilities. This is preferred over a hand-written solution, since the number of DTDs is rather large and their contents may change over time. About half the code in this parser deals with parsing the DTD, and the other half with the XML itself.

class Bio.Entrez.Parser.NoneElement(tag, attributes, key)

Bases: object

NCBI Entrez XML element mapped to None.

__init__(tag, attributes, key): Create a NoneElement.

__eq__(other): Define equality with other None objects.

__ne__(other): Define non-equality.

__repr__(): Return a string representation of the object.

__hash__ = None

class Bio.Entrez.Parser.IntegerElement(value, *args, **kwargs)

Bases: int

NCBI Entrez XML element mapped to an integer.

static __new__(cls, value, *args, **kwargs): Create an IntegerElement.

__init__(value, tag, attributes, key): Initialize an IntegerElement.

__repr__(): Return a string representation of the object.

class Bio.Entrez.Parser.StringElement(value, *args, **kwargs)

Bases: str

NCBI Entrez XML element mapped to a string.

static __new__(cls, value, *args, **kwargs): Create a StringElement.

__init__(value, tag, attributes, key): Initialize a StringElement.

__repr__(): Return a string representation of the object.

class Bio.Entrez.Parser.ListElement(tag, attributes, allowed_tags, key=None)

Bases: list

NCBI Entrez XML element mapped to a list.

__init__(tag, attributes, allowed_tags, key=None): Create a ListElement.

__repr__(): Return a string representation of the object.

store(value): Append an element to the list, checking tags.

class Bio.Entrez.Parser.DictionaryElement(tag, attrs, allowed_tags, repeated_tags=None, key=None)

Bases: dict

NCBI Entrez XML element mapped to a dictionary.

__init__(tag, attrs, allowed_tags, repeated_tags=None, key=None): Create a DictionaryElement.

__repr__(): Return a string representation of the object.

store(value): Add an entry to the dictionary, checking tags.

class Bio.Entrez.Parser.OrderedListElement(tag, attributes, allowed_tags, first_tag, key=None)

Bases: list

NCBI Entrez XML element mapped to a list of lists.

OrderedListElement is used to describe a list of repeating elements such as A, B, C, A, B, C, A, B, C … where each set of A, B, C forms a group. This is then stored as [[A, B, C], [A, B, C], [A, B, C], …]
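The grouping described above can be sketched in a few lines. This is an illustration of the idea only, assuming each incoming item is a (tag, value) pair and that a new group starts whenever first_tag is seen; the helper name group_by_first_tag is hypothetical and is not part of this module.

```python
def group_by_first_tag(items, first_tag):
    """Group (tag, value) pairs into sublists, starting a new one at first_tag."""
    groups = []
    for tag, value in items:
        # Open a new group at first_tag (or if no group exists yet).
        if tag == first_tag or not groups:
            groups.append([])
        groups[-1].append(value)
    return groups


flat = [("A", 1), ("B", 2), ("C", 3), ("A", 4), ("B", 5), ("C", 6)]
groups = group_by_first_tag(flat, first_tag="A")
print(groups)  # [[1, 2, 3], [4, 5, 6]]
```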

__init__(tag, attributes, allowed_tags, first_tag, key=None): Create an OrderedListElement.

__repr__(): Return a string representation of the object.

store(value): Append an element to the list, checking tags.

class Bio.Entrez.Parser.ErrorElement(value, *args, **kwargs)

Bases: str

NCBI Entrez XML element containing an error message.

static __new__(cls, value, *args, **kwargs): Create an ErrorElement.

__init__(value, tag): Initialize an ErrorElement.

__repr__(): Return the error message as a string.

exception Bio.Entrez.Parser.NotXMLError(message)

Bases: ValueError

Failed to parse file as XML.

__init__(message): Initialize the class.

__str__(): Return a string summary of the exception.

exception Bio.Entrez.Parser.CorruptedXMLError(message)

Bases: ValueError

Corrupted XML.

__init__(message): Initialize the class.

__str__(): Return a string summary of the exception.

exception Bio.Entrez.Parser.ValidationError(name)

Bases: ValueError

XML tag found which was not defined in the DTD.

Validating parsers raise this error if the parser finds a tag in the XML that is not defined in the DTD. Non-validating parsers do not raise this error. The Bio.Entrez.read and Bio.Entrez.parse functions use validating parsers by default (see those functions for more information).

__init__(name): Initialize the class.

__str__(): Return a string summary of the exception.

class Bio.Entrez.Parser.DataHandlerMeta(*args, **kwargs)

Bases: type

A metaclass is needed until Python supports @classproperty.

__init__(*args, **kwargs): Initialize the class.

property directory: Directory for caching XSD and DTD files.

__annotations__ = {}

class Bio.Entrez.Parser.DataHandler(validate, escape, ignore_errors)

Bases: object

Data handler for parsing NCBI XML from Entrez.

global_dtd_dir = '/home/circleci/.pyenv/versions/3.10.19/lib/python3.10/site-packages/Bio/Entrez/DTDs'

global_xsd_dir = '/home/circleci/.pyenv/versions/3.10.19/lib/python3.10/site-packages/Bio/Entrez/XSDs'

local_dtd_dir = '/home/circleci/.config/biopython/Bio/Entrez/DTDs'

local_xsd_dir = '/home/circleci/.config/biopython/Bio/Entrez/XSDs'

__init__(validate, escape, ignore_errors): Create a DataHandler object.

read(source): Set up the parser and let it read the XML results.

parse(source): Set up the parser and let it read the XML results.

xmlDeclHandler(version, encoding, standalone): Set XML handlers when an XML declaration is found.

handleMissingDocumentDefinition(tag, attrs): Raise an Exception if neither a DTD nor an XML Schema is found.

startNamespaceDeclHandler(prefix, uri): Handle start of an XML namespace declaration.

endNamespaceDeclHandler(prefix): Handle end of an XML namespace declaration.

schemaHandler(name, attrs): Process the XML schema (before processing the element).

startElementHandler(tag, attrs): Handle start of an XML element.

startRawElementHandler(name, attrs): Handle start of an XML raw element.

startSkipElementHandler(name, attrs): Handle start of an XML skip element.

endStringElementHandler(tag): Handle end of an XML string element.

endRawElementHandler(name): Handle end of an XML raw element.

endSkipElementHandler(name): Handle end of an XML skip element.

endErrorElementHandler(tag): Handle end of an XML error element.

endElementHandler(name): Handle end of an XML element.

endIntegerElementHandler(tag): Handle end of an XML integer element.

characterDataHandlerRaw(content): Handle character data as-is (raw).

characterDataHandlerEscape(content): Handle character data by encoding it.

skipCharacterDataHandler(content): Handle character data by skipping it.

parse_xsd(root): Parse an XSD file.

elementDecl(name, model)

Callback function invoked for each element declaration in a DTD.

This is used for each element declaration in a DTD like:

<!ELEMENT       name          (...)>

The purpose of this function is to determine whether this element should be regarded as a string, integer, list, dictionary, structure, or error.
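For readers unfamiliar with this kind of callback, the sketch below uses the standard library's xml.parsers.expat to receive element declarations from a small internal DTD. The sample document and the two-way classification are assumptions made for illustration; the rules this module actually applies to choose between string, integer, list, dictionary, structure, and error are not reproduced here.

```python
from xml.parsers import expat

declarations = {}


def element_decl(name, model):
    # expat passes the content model as a nested tuple:
    # (type, quantifier, name, children)
    declarations[name] = model


parser = expat.ParserCreate()
parser.ElementDeclHandler = element_decl

# Hypothetical document with an internal DTD subset.
document = b"""<?xml version="1.0"?>
<!DOCTYPE IdList [
<!ELEMENT IdList (Id*)>
<!ELEMENT Id (#PCDATA)>
]>
<IdList><Id>1</Id><Id>2</Id></IdList>"""
parser.Parse(document, True)

for name, model in declarations.items():
    # Simplified two-way split; EMPTY and ANY models are not handled here.
    if model[0] == expat.model.XML_CTYPE_MIXED:
        kind = "text content"
    else:
        kind = "child elements"
    print(name, "->", kind)
```

An element declared as (#PCDATA) is reported with the XML_CTYPE_MIXED type, which is one signal a handler can use to treat it as text rather than as a container.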

open_dtd_file(filename): Open specified DTD file.

open_xsd_file(filename): Open specified XSD file.

save_dtd_file(filename, text): Save DTD file to cache.

save_xsd_file(filename, text): Save XSD file to cache.

externalEntityRefHandler(context, base, systemId, publicId)

Handle external entity reference in order to cache DTD locally.

The purpose of this function is to load the DTD locally, instead of downloading it from the URL specified in the XML. Using the local DTD results in much faster parsing. If the DTD is not found locally, we try to download it. If new DTDs become available from NCBI, putting them in Bio/Entrez/DTDs will allow the parser to see them.
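The look-up order described above (local copy first, download as a fallback, then cache) can be sketched as follows. This is a simplified illustration under stated assumptions, not the handler's implementation: load_definition, its parameters, and fake_download are hypothetical names, and the download step is replaced by a stand-in so that the example runs without network access.

```python
import tempfile
from pathlib import Path


def load_definition(filename, search_dirs, cache_dir, download):
    """Return the bytes of filename, preferring local copies over download."""
    for directory in search_dirs:
        path = Path(directory) / filename
        if path.is_file():
            return path.read_bytes()
    # Not found locally: fetch it, then cache it for next time.
    data = download(filename)
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    (cache_dir / filename).write_bytes(data)
    return data


calls = []


def fake_download(filename):
    # Stand-in for a network request; records each call.
    calls.append(filename)
    return b"<!ELEMENT Id (#PCDATA)>"


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "DTDs"
    first = load_definition("example.dtd", [cache], cache, fake_download)
    second = load_definition("example.dtd", [cache], cache, fake_download)

print(len(calls))  # the second call is served from the cache
```

In this sketch the second request never reaches the download function, which mirrors the reason given above for keeping DTDs on disk.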