Developing statistics for the Population Genetics Module.
Introduction
A few observations about population genetics and bioinformatics
- Population genetics is used in a wide variety of settings, from
cancer studies to conservation of endangered species. Even names
might not be standardized, e.g. a human geneticist might use Short
Tandem Repeat (STR) while an animal geneticist might
use microsatellite.
- Different research communities have completely different sets of
requirements: markers change (SNPs, microsatellites/STSs, RFLPs,
sequences); number of individuals sampled change (for few units, to
thousands to very large synthetic datasets); number of populations
change; the density of markers change (from 10 million SNPs in the
HapMap for humans and full sequences appearing for many individuals
on parasites, to the opposite, where only 20-30 loci are available)
- A good implementation should support the vast majority of scenarios
above: should support as many markers as possible, big and small
datasets
- The most used format in population genetics is still the
Genepop format. This format is not used much outside the
popgen community. For an overview of the importance of the format,
read Excoffier and Heckel (2006).
This happens to be a marker-independent format.
- The popgen community while producing lots of free software cares
very little about licensing or data-standardization issues.
Different ad-hoc formats exist (again, see the above paper).
As such it is important to characterize all types of existing statistics
and to create a framework that accommodates all of them if people decide
to implement them in the future.
Restricted use cases should be handled with extreme care. If
possible people from different backgrounds should contribute.
Different dimensions
Here we characterize the different dimensions that need to be accounted
for. For a good intro see page 98 of the Arlequin3
manual.
Marker dependent versus marker independent statistics
Some statistics require a special kind of marker. For instance Tajima D
requires a sequence or a RFLP. Allelic range requires
microsatellites/STRs. Other statistics are marker independent. For
instance Fst relies on allele counts per population.
Intra-Population versus Inter-population statistics
As an example observed heterozygosity has no notion of population
structure.
Fst is an example of a statistic that exists to measure population
differentiation. These kind of statistics require some notion of
population differentiation.
Data type
Haplotypic, genotypic (phase unknown), genotypic (phase known),
genoptypic dominant, frequency only.
Say, for expected heterozygosity frequencies are enough, for observed
heterozygosity genotypic (phase unknown) data is necessary.
Single locus versus multi-loci
Single locus as in allelic richness, ExpHe, Fst.
Multi-loci as in number of polimorphic sites, LD or EHH.
Temporal/longitudinal vs single point in time
Say temporal-Fst versus Fst.
Population versus Landscape
This issue I suggest abandon for now.
Example of statistic classification
- ExpHz non-temporal, intra, single-locus, marker independent,
genotypic - gametic unk
- ObsHz non-temporal, intra, single-locus, independent, genotypic -
gametic kn
- Fst(CW) non-temporal, inter, single-locus, indep, genotypic -
gametic unk
- temporal-Fst temporal, intra, single-locus, indep, genotypic -
gametic unk
- LD(D’) non-temporal, intra, multi-locus, indep, haplo/geno
- Fk temporal, intra, single-locus, indep, geno
- S (polimorphic sites), non-temporal, intra, multi-locus, indep,
haplo/geno
- Alleic range, nt, intra, single-locus, microsat, haplo/geno
- EHH, nt, positional
- Tajima D, nt, intra, single-locus, sequence/RFLP
Design
The design will have to be able to cope with all the dimensions above
Pending issues
There is still the issue of statistical tests (say Hardy-Weinberg
deviation).