Package BioSQL :: Module Loader :: Class DatabaseLoader

Class DatabaseLoader

Object used to load SeqRecord objects into a BioSQL database.
Instance Methods

__init__(self, adaptor, dbid, fetch_NCBI_taxonomy=False)
Initialize with connection information for the database.

load_seqrecord(self, record)
Load a Biopython SeqRecord into the database.

_get_ontology_id(self, name, definition=None)
Returns the identifier for the named ontology (PRIVATE).

_get_term_id(self, name, ontology_id=None, definition=None, identifier=None)
Get the id that corresponds to a term (PRIVATE).

_add_dbxref(self, dbname, accession, version)
Insert a dbxref and return its id.

_get_taxon_id(self, record)
Get the taxon id for this record (PRIVATE).

_fix_name_class(self, entrez_name)
Map Entrez name terms to those used in taxdump (PRIVATE).

_get_taxon_id_from_ncbi_taxon_id(self, ncbi_taxon_id, scientific_name=None, common_name=None)
Get the taxon id for this record from the NCBI taxon ID (PRIVATE).

_get_taxon_id_from_ncbi_lineage(self, taxonomic_lineage)
This is recursive! (PRIVATE).

_load_bioentry_table(self, record)
Fill the bioentry table with sequence information (PRIVATE).

_load_bioentry_date(self, record, bioentry_id)
Add the effective date of the entry into the database.

_load_biosequence(self, record, bioentry_id)
Record a SeqRecord's sequence and alphabet in the database (PRIVATE).

_load_comment(self, record, bioentry_id)
Record a SeqRecord's annotated comment in the database (PRIVATE).

_load_annotations(self, record, bioentry_id)
Record a SeqRecord's misc annotations in the database (PRIVATE).

_load_reference(self, reference, rank, bioentry_id)
Record a SeqRecord's annotated references in the database (PRIVATE).

_load_seqfeature(self, feature, feature_rank, bioentry_id)
Load a Biopython SeqFeature into the database (PRIVATE).

_load_seqfeature_basic(self, feature_type, feature_rank, bioentry_id)
Load the first tables of a seqfeature and returns the id (PRIVATE).

_load_seqfeature_locations(self, feature, seqfeature_id)
Load all of the locations for a SeqFeature into tables (PRIVATE).

_insert_location(self, location, rank, seqfeature_id)
Add a location of a SeqFeature to the seqfeature_location table (PRIVATE).

_load_seqfeature_qualifiers(self, qualifiers, seqfeature_id)
Insert the (key, value) pair qualifiers relating to a feature (PRIVATE).

_load_seqfeature_dbxref(self, dbxrefs, seqfeature_id)
Add database crossreferences of a SeqFeature to the database (PRIVATE).

_get_dbxref_id(self, db, accession)
Finds and returns the dbxref_id for the passed data (returns Int).

_get_seqfeature_dbxref(self, seqfeature_id, dbxref_id, rank)
Check for a pre-existing seqfeature_dbxref entry with the passed seqfeature_id and dbxref_id.

_add_seqfeature_dbxref(self, seqfeature_id, dbxref_id, rank)
Insert a seqfeature_dbxref row and return the seqfeature_id and dbxref_id.

_load_dbxrefs(self, record, bioentry_id)
Load any sequence level cross references into the database (PRIVATE).

_get_bioentry_dbxref(self, bioentry_id, dbxref_id, rank)
Check for a pre-existing bioentry_dbxref entry with the passed bioentry_id and dbxref_id.

_add_bioentry_dbxref(self, bioentry_id, dbxref_id, rank)
Insert a bioentry_dbxref row and return the bioentry_id and dbxref_id.

Method Details

__init__(self, adaptor, dbid, fetch_NCBI_taxonomy=False)
(Constructor)

Initialize with connection information for the database.

Creating a DatabaseLoader object is normally handled via the BioSeqDatabase DBServer object, for example:

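from BioSQL import BioSeqDatabase

# Connect to the BioSQL server (credentials here are example values).
server = BioSeqDatabase.open_database(driver="MySQLdb", user="gbrowse",
                                      passwd="biosql", host="localhost",
                                      db="test_biosql")
# Open the "test" sub-database, creating it if it does not exist yet.
try:
    db = server["test"]
except KeyError:
    db = server.new_database("test", description="For testing GBrowse")
# Calling db.load(records) then creates the DatabaseLoader internally.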

_get_ontology_id(self, name, definition=None)

Returns the identifier for the named ontology (PRIVATE).

This looks through the ontology table for the given entry name. If it is not found, a row is added for this ontology (using the definition if supplied). In either case, the id corresponding to the provided name is returned, so that you can reference it in another table.
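
The look-up-or-insert behaviour described above can be sketched with the standard library's sqlite3 module. This is only an illustration, not the Biopython implementation: the one-table schema is a simplified stand-in for the BioSQL ontology table, and the function name get_ontology_id and the sample ontology name are assumptions made for the example.

```python
import sqlite3


def get_ontology_id(conn, name, definition=None):
    """Return the id for the named ontology, inserting a row if needed."""
    row = conn.execute(
        "SELECT ontology_id FROM ontology WHERE name = ?", (name,)
    ).fetchone()
    if row is not None:
        # The ontology already exists: reuse its id.
        return row[0]
    # Not found: add a row (with the definition, if supplied).
    cur = conn.execute(
        "INSERT INTO ontology (name, definition) VALUES (?, ?)",
        (name, definition),
    )
    return cur.lastrowid


# Simplified stand-in for the BioSQL ontology table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ontology ("
    "ontology_id INTEGER PRIMARY KEY, name TEXT UNIQUE, definition TEXT)"
)

first = get_ontology_id(conn, "Annotation Tags", "example definition")
second = get_ontology_id(conn, "Annotation Tags")  # same id, no new row
```

Either way the caller gets an id it can store as a foreign key in another table, which is the point of the pattern.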

_get_term_id(self, name, ontology_id=None, definition=None, identifier=None)

Get the id that corresponds to a term (PRIVATE).

This looks through the term table for the given term. If it is not found, a new id corresponding to this term is created. In either case, the id corresponding to that term is returned, so that you can reference it in another table.

The ontology_id should be used to disambiguate the term.

_get_taxon_id(self, record)

Get the taxon id for this record (PRIVATE).

record - a SeqRecord object

This searches the taxon/taxon_name tables using the NCBI taxon ID, scientific name and common name to find the matching taxon table entry's id.

If the species isn't in the taxon table, and at least one of the NCBI taxon ID, scientific name or common name is available, a minimal stub entry is created in the table.

Returns the taxon id (database key for the taxon table, not an NCBI taxon ID), or None if the taxonomy information is missing.

See also the BioSQL script load_ncbi_taxonomy.pl which will populate and update the taxon/taxon_name tables with the latest information from the NCBI.

_fix_name_class(self, entrez_name)

Map Entrez name terms to those used in taxdump (PRIVATE).

We need to make this conversion to match the taxon_name.name_class values used by the BioSQL load_ncbi_taxonomy.pl script.

e.g. "ScientificName" -> "scientific name", "EquivalentName" -> "equivalent name", "Synonym" -> "synonym".
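
The mapping in these examples amounts to putting a space before each capital letter and lower-casing the result. A minimal sketch of that idea follows; the standalone function name fix_name_class is an assumption for illustration, and the real method may handle special cases differently.

```python
def fix_name_class(entrez_name):
    """Convert a CamelCase Entrez name class to the taxdump spelling."""
    # Prefix each capital letter with a space and lower-case it,
    # then strip the leading space produced by the first capital.
    spaced = "".join(
        " " + letter.lower() if letter.isupper() else letter
        for letter in entrez_name
    )
    return spaced.strip()


for name in ("ScientificName", "EquivalentName", "Synonym"):
    print(name, "->", fix_name_class(name))
```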

_get_taxon_id_from_ncbi_taxon_id(self, ncbi_taxon_id, scientific_name=None, common_name=None)

Get the taxon id for this record from the NCBI taxon ID (PRIVATE).

ncbi_taxon_id - string containing an NCBI taxon id
scientific_name - string, used if a stub entry is recorded
common_name - string, used if a stub entry is recorded

This searches the taxon table using ONLY the NCBI taxon ID to find the matching taxon table entry's ID (database key).

If the species isn't in the taxon table, and the fetch_NCBI_taxonomy flag is true, Biopython will attempt to go online using Bio.Entrez to fetch the official NCBI lineage, recursing up the tree until an existing entry is found in the database or the full lineage has been fetched.

Otherwise the NCBI taxon ID, scientific name and common name are recorded as a minimal stub entry in the taxon and taxon_name tables. Any partial information about the lineage from the SeqRecord is NOT recorded. This should mean that (re)running the BioSQL script load_ncbi_taxonomy.pl can fill in the taxonomy lineage.

Returns the taxon id (database key for the taxon table, not an NCBI taxon ID).

_get_taxon_id_from_ncbi_lineage(self, taxonomic_lineage)

This is recursive! (PRIVATE).

taxonomic_lineage - list of taxonomy dictionaries from Bio.Entrez

The first dictionary in the list is the taxonomy root; the last (highest index) is the species. Each dictionary includes:

- TaxID (string, NCBI taxon id)
- Rank (string, e.g. "species", "genus", ..., "phylum", ...)
- ScientificName (string)

(and that is all at the time of writing)

This method will record all the lineage given, returning the taxon id (database key, not NCBI taxon id) of the final entry (the species).
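
The recursion can be sketched as follows, using a plain dictionary in place of the taxon table. This is an illustration under stated assumptions, not the Biopython code: record_lineage, the dictionary layout and the toy auto-increment key are all hypothetical, and the sample lineage is heavily abbreviated.

```python
def record_lineage(taxon_table, lineage):
    """Record a lineage (root first); return the key of its last entry."""
    if not lineage:
        return None  # nothing above the root, so it has no parent
    entry = lineage[-1]
    ncbi_id = entry["TaxID"]
    if ncbi_id in taxon_table:
        # Already stored: stop recursing up the tree.
        return taxon_table[ncbi_id]["taxon_id"]
    # Record all ancestors first so the parent key is known.
    parent_id = record_lineage(taxon_table, lineage[:-1])
    taxon_id = len(taxon_table) + 1  # toy auto-increment database key
    taxon_table[ncbi_id] = {
        "taxon_id": taxon_id,
        "parent_taxon_id": parent_id,
        "rank": entry["Rank"],
        "scientific_name": entry["ScientificName"],
    }
    return taxon_id


# Abbreviated example lineage in the Bio.Entrez dictionary layout.
lineage = [
    {"TaxID": "1", "Rank": "no rank", "ScientificName": "root"},
    {"TaxID": "9605", "Rank": "genus", "ScientificName": "Homo"},
    {"TaxID": "9606", "Rank": "species", "ScientificName": "Homo sapiens"},
]
table = {}
species_key = record_lineage(table, lineage)
```

Note that the returned value is the table's own key for the species, not the NCBI taxon id, matching the distinction drawn in the text.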

_load_bioentry_table(self, record)

Fill the bioentry table with sequence information (PRIVATE).

record - SeqRecord object to add to the database.

_load_bioentry_date(self, record, bioentry_id)

Add the effective date of the entry into the database.

record - a SeqRecord object with an annotated date
bioentry_id - corresponding database identifier

_load_biosequence(self, record, bioentry_id)

Record a SeqRecord's sequence and alphabet in the database (PRIVATE).

record - a SeqRecord object with a seq property
bioentry_id - corresponding database identifier

_load_comment(self, record, bioentry_id)

Record a SeqRecord's annotated comment in the database (PRIVATE).

record - a SeqRecord object with an annotated comment
bioentry_id - corresponding database identifier

_load_annotations(self, record, bioentry_id)

Record a SeqRecord's misc annotations in the database (PRIVATE).

The annotation strings are recorded in the bioentry_qualifier_value table, except for special cases like the reference, comment and taxonomy which are handled with their own tables.

record - a SeqRecord object with an annotations dictionary
bioentry_id - corresponding database identifier

_load_reference(self, reference, rank, bioentry_id)

Record a SeqRecord's annotated references in the database (PRIVATE).

record - a SeqRecord object with annotated references
bioentry_id - corresponding database identifier

_load_seqfeature_basic(self, feature_type, feature_rank, bioentry_id)

Load the first tables of a seqfeature and returns the id (PRIVATE).

This loads the "key" of the seqfeature (e.g. CDS, gene) and the basic seqfeature table itself.

_load_seqfeature_locations(self, feature, seqfeature_id)

Load all of the locations for a SeqFeature into tables (PRIVATE).

This adds the locations related to the SeqFeature into the seqfeature_location table. Fuzzy locations are not handled at present. For a simple location, e.g. (1..2), there is a single table row with seq_start = 1, seq_end = 2, location_rank = 1.

For split locations, e.g. (1..2, 3..4, 5..6), there would be three table rows with:

start = 1, end = 2, rank = 1
start = 3, end = 4, rank = 2
start = 5, end = 6, rank = 3
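
The rank numbering shown above can be sketched in a few lines. This is a simplified illustration: the parts are given as plain (start, end) tuples rather than Biopython location objects, and location_rows and its dictionary keys are hypothetical names chosen for the example.

```python
def location_rows(parts, seqfeature_id):
    """Build one row per location part, ranked in order from 1."""
    return [
        {
            "seqfeature_id": seqfeature_id,
            "start": start,
            "end": end,
            "rank": rank,
        }
        for rank, (start, end) in enumerate(parts, start=1)
    ]


# The split location (1..2, 3..4, 5..6) from the text above.
rows = location_rows([(1, 2), (3, 4), (5, 6)], seqfeature_id=7)
for row in rows:
    print(row)
```

A simple location is just the one-part case, giving a single row with rank 1.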

_insert_location(self, location, rank, seqfeature_id)

Add a location of a SeqFeature to the seqfeature_location table (PRIVATE).

TODO - Add location operator to location_qualifier_value?

_load_seqfeature_qualifiers(self, qualifiers, seqfeature_id)

Insert the (key, value) pair qualifiers relating to a feature (PRIVATE).

Qualifiers should be a dictionary of the form:
{key : [value1, value2]}
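
As a rough sketch of how such a dictionary flattens into table rows, each value can be paired with its key and a rank giving its position in the list. The function name qualifier_rows and the tuple layout are assumptions for illustration. In the actual schema the key would presumably be stored as a term id rather than as text.

```python
def qualifier_rows(qualifiers, seqfeature_id):
    """Flatten {key: [value1, value2]} into (id, key, rank, value) rows."""
    rows = []
    for key, values in qualifiers.items():
        # Rank preserves the order of values sharing the same key.
        for rank, value in enumerate(values, start=1):
            rows.append((seqfeature_id, key, rank, value))
    return rows


example = {"gene": ["abc"], "note": ["first note", "second note"]}
rows = qualifier_rows(example, seqfeature_id=7)
```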

_load_seqfeature_dbxref(self, dbxrefs, seqfeature_id)

Add database crossreferences of a SeqFeature to the database (PRIVATE).

dbxrefs - List, dbxref data from the source file in the format <database>:<accession>
seqfeature_id - Int, the identifier for the seqfeature in the seqfeature table

Insert dbxref qualifier data for a seqfeature into the seqfeature_dbxref and, if required, dbxref tables. The dbxref_id qualifier/value sets go into the dbxref table as (dbname, accession, version) tuples, with dbxref.dbxref_id being assigned automatically, and into the seqfeature_dbxref table as (seqfeature_id, dbxref_id, rank) tuples.
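
Splitting a <database>:<accession> string into its two parts might look like the sketch below. The helper name split_dbxref and the sample value are hypothetical, and splitting on the first colon only is a choice made for this example. The real method may treat whitespace and extra colons differently.

```python
def split_dbxref(value):
    """Split '<database>:<accession>' into a (database, accession) pair."""
    # Split on the first colon only, so the accession may contain colons.
    db, sep, accession = value.strip().partition(":")
    if not sep or not db or not accession:
        raise ValueError("Expected <database>:<accession>, got %r" % value)
    return db, accession


# Illustrative value only, not taken from a real record.
db, accession = split_dbxref("GeneID:12345")
```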

_get_dbxref_id(self, db, accession)

db - String, the name of the external database containing the accession number
accession - String, the accession of the dbxref data

Finds and returns the dbxref_id for the passed data. The method attempts to find an existing record first, and inserts the data if there is no record.

Returns: Int

_get_seqfeature_dbxref(self, seqfeature_id, dbxref_id, rank)

Check for a pre-existing seqfeature_dbxref entry with the passed seqfeature_id and dbxref_id. If one does not exist, insert new data.

_load_dbxrefs(self, record, bioentry_id)

Load any sequence level cross references into the database (PRIVATE).

See table bioentry_dbxref.

_get_bioentry_dbxref(self, bioentry_id, dbxref_id, rank)

Check for a pre-existing bioentry_dbxref entry with the passed bioentry_id and dbxref_id. If one does not exist, insert new data.