Package UniGene

Parse UniGene flat file format files.

Here is an overview of the flat file format that this parser deals with:

Line types/qualifiers:

ID           UniGene cluster ID
TITLE        Title for the cluster
GENE         Gene symbol
CYTOBAND     Cytological band
EXPRESS      Tissues of origin for ESTs in cluster
RESTR_EXPR   Single tissue or development stage contributes
             more than half the total EST frequency for this gene.
GNM_TERMINUS Genomic confirmation of presence of a 3' terminus;
             T if a non-templated polyA tail is found among
             a cluster's sequences; else
             I if templated As are found in genomic sequence or
             S if a canonical polyA signal is found on
               the genomic sequence
GENE_ID      Entrez gene identifier associated with at least one
             sequence in this cluster;
             to be used instead of LocusLink.
LOCUSLINK    LocusLink identifier associated with at least one
             sequence in this cluster;
             deprecated in favor of GENE_ID
HOMOL        Homology
CHROMOSOME   Chromosome.  For plants, CHROMOSOME refers to mapping
             on the Arabidopsis genome.
STS          STS
     ACC=         GenBank/EMBL/DDBJ accession number of STS
                  [optional field]
     UNISTS=      Identifier in NCBI's UniSTS database
TXMAP        Transcript map interval
     MARKER=      Marker found on at least one sequence in this
                  cluster
     RHPANEL=     Radiation Hybrid panel used to place marker
PROTSIM      Protein Similarity data for the sequence with
             highest-scoring protein similarity in this cluster
     ORG=         Organism
     PROTGI=      Sequence GI of protein
     PROTID=      Sequence ID of protein
     PCT=         Percent alignment
     ALN=         Length of aligned region (aa)
SCOUNT       Number of sequences in the cluster
SEQUENCE     Sequence
     ACC=         GenBank/EMBL/DDBJ accession number of sequence
     NID=         Unique nucleotide sequence identifier (gi)
     PID=         Unique protein sequence identifier (used for
                  non-ESTs)
     CLONE=       Clone identifier (used for ESTs only)
     END=         End (5'/3') of clone insert read (used for
                  ESTs only)
     LID=         Library ID; used to look up the library name
                  and tissue
     MGC=         5' CDS-completeness indicator; if present, the
                  clone associated with this sequence is believed
                  CDS-complete. A value greater than 511 is the gi
                  of the CDS-complete mRNA matched by the EST,
                  otherwise the value is an indicator of the
                  reliability of the test indicating CDS
                  completeness; higher values indicate more
                  reliable CDS-completeness predictions.
     SEQTYPE=     Description of the nucleotide sequence.
                  Possible values are mRNA, EST and HTC.
     TRACE=       The Trace ID of the EST sequence, as provided by
                  NCBI Trace Archive
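
The layout above (a line tag followed by optional KEY=value sub-fields) can be illustrated with a small standalone sketch. This is not the package's own parser. It assumes that sub-fields are separated by semicolons, and the example line and all of its values are made up for illustration. The helper names parse_line, parse_subfields and interpret_mgc are hypothetical.

```python
def parse_line(line):
    """Split one flat-file line into (tag, value)."""
    tag, _, value = line.rstrip("\n").partition(" ")
    return tag, value.strip()


def parse_subfields(value, sep=";"):
    """Turn 'ACC=X; NID=Y' into {'ACC': 'X', 'NID': 'Y'}.

    The ';' separator is an assumption, not stated in the overview.
    """
    fields = {}
    for item in value.split(sep):
        item = item.strip()
        if not item:
            continue
        key, _, val = item.partition("=")
        fields[key] = val
    return fields


def interpret_mgc(value):
    """Apply the MGC rule from the overview to an integer value.

    Values greater than 511 are read as the gi of the matched mRNA;
    anything else is read as a reliability indicator.
    """
    if value > 511:
        return ("gi", value)
    return ("reliability", value)


# Hypothetical SEQUENCE line; every value here is invented.
line = ("SEQUENCE    ACC=XX000001.1; NID=g1234567; CLONE=EXAMPLE:1; "
        "END=5'; LID=42; SEQTYPE=EST; MGC=6")

tag, value = parse_line(line)
fields = parse_subfields(value)
mgc_kind = interpret_mgc(int(fields["MGC"]))
```

Keeping the sub-fields in a plain dictionary makes the optional ones (such as ACC= on STS lines) easy to test for with `in` or `dict.get`.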

Classes:

SequenceLine   Store the information for one SEQUENCE line from a UniGene file
ProtsimLine    Store the information for one PROTSIM line from a UniGene file
STSLine        Store the information for one STS line from a UniGene file
Record         Store a UniGene record

Functions:

parse(handle)
read(handle)
_read(handle)
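
The names parse(handle) and read(handle) suggest the common split between iterating over many records and reading exactly one. The following is a minimal sketch of that pattern only, not the package's implementation. The function names parse_records and read_record are hypothetical, the "//" record terminator is an assumption, and the sample data is invented.

```python
import io


def parse_records(handle):
    """Yield one dict of tag -> list of values per record.

    Assumes each record ends with a line containing only '//'.
    """
    record = {}
    for line in handle:
        line = line.rstrip("\n")
        if line.strip() == "//":
            if record:
                yield record
            record = {}
            continue
        if not line.strip():
            continue
        tag, _, value = line.partition(" ")
        record.setdefault(tag, []).append(value.strip())
    if record:
        # Tolerate a final record with no terminator line.
        yield record


def read_record(handle):
    """Return the single record in handle, or raise ValueError."""
    records = list(parse_records(handle))
    if len(records) != 1:
        raise ValueError(
            f"expected exactly one record, found {len(records)}")
    return records[0]


# Invented two-record sample in the assumed layout.
text = (
    "ID          Xx.1\n"
    "TITLE       Example cluster\n"
    "SCOUNT      2\n"
    "//\n"
    "ID          Xx.2\n"
    "TITLE       Another example\n"
    "SCOUNT      1\n"
    "//\n"
)

records = list(parse_records(io.StringIO(text)))
single = read_record(io.StringIO(text.split("//\n")[0] + "//\n"))
```

Writing the multi-record function as a generator keeps memory use flat on large files, while the single-record function can fail loudly when the input does not hold exactly one record.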
