Edit this page on GitHub

Mapping genetic coordinates with the Bio.SeqUtils.Mapper module.

Contributed by Lenna X. Peterson

Mapping genetic locations between different coordinate systems is an essential part of much genomic analysis. This is a brief introduction to the SeqMapper module.

Note: This uses the coordinate mapper currently available on my branch. This in turn is a branch of Peter’s branch introducing CompoundLocations. It will not work with the old style of SeqFeature.

Configuration

The mapper and mapping positions store all positions in Python coordinates, i.e. 0-based. Many users are working with data that is 1-based, such as GenBank or HGVS. Each map position type has a “to_hgvs” and “to_genbank” method for easy conversion. It is also possible to set the default print representation of mapping positions to either of these formats:

from Bio.SeqUtils.Mapper import MapPosition


def use_genbank(self):
    return str(self.to_genbank())


MapPosition.__str__ = use_genbank

Mapping

The first step to using the mapper is to get the exons from a GenBank or similar file. The mapper will accept exons as a sequence of pairs, a SeqRecord with a CDS feature, or a CDS SeqFeature. The file used in this example is located in the Tests directory of the Biopython source code.

from Bio.SeqUtils.Mapper import CoordinateMapper
from Bio import SeqIO


def get_first_CDS(parser):
    for rec in parser:
        for feat in rec.features:
            if feat.type == "CDS" and len(feat.location.parts) > 1:
                return feat


exons = get_first_CDS(SeqIO.parse("GenBank/cor6_6.gb", "genbank"))
cm = CoordinateMapper(exons)

Once the mapper is constructed, its methods can be used to transform positions located within the given CDS. Note that attempting to transcribe and translate a genomic position that is not within CDS will raise an exception. Also note that printing the list will show the repr of the positions, which are in Python 0-based coordinates.

sample_g_values = [50, 150, 250, 350, 450, 550, 650]

protein_positions = []
for raw_g_pos in sample_g_values:
    # EAFP method
    try:
        # Dialect argument converts g_pos from Genbank to Python coordinates
        p_pos = cm.g2p(raw_g_pos, dialect="genbank")
    except ValueError:
        p_pos = None
    protein_positions.append(p_pos)
print(protein_positions)

Here’s an example function that prints a table of the genomic, CDS, and protein coordinates given a coordinate mapper and a list of genomic values.

from Bio.SeqUtils.Mapper import GenomePosition


def gcp_table(mapper, g_list):
    """Print a table of genomic, CDS, and protein coordinates"""
    # Print header
    print("%4s | %6s | %2s" % ("g", "CDS", "p"))
    print("-" * 20)
    for g_pos in g_list:
        # Directly convert g_pos from Genbank to Python coordinates
        g_pos = GenomePosition.from_dialect("genbank", g_pos)
        c_pos = mapper.g2c(g_pos)
        # LBYL method:
        if c_pos.pos_type == "exon":
            p_pos = mapper.c2p(c_pos)
        else:
            p_pos = ""
        # Print formatted row
        print("%4s | %6s | %2s" % (g_pos, c_pos, p_pos))


gcp_table(cm, sample_g_values)

The code shown in this example may be downloaded here: https://gist.github.com/lennax/10600113 The example file used is located in the Tests directory of the Biopython source code.