Questions the KITP program has raised about population genetics


On Monday November 24th, Paul Higgs led a discussion asking what the big questions raised by the conference were. Below are notes taken by Ryan Gutenkunst. He apologizes to anyone whose contributions he overlooked. Please add them in!

Key to participant comments:
BC: Ben Callahan
SG: Simon Gravel
RG: Ryan Gutenkunst
PH: Paul Higgs
MK: Marty Kreitman
ML: Michael Lässig
VM: Ville Mustonen
RN: Richard Neher
SS: Stephan Schiffels
BS: Boris Shraiman
GS: Gergely Szöllosi

PH (4pm Monday) Thanks Ryan - I have expanded on what I remember from the discussion and generally put words into people's mouths. I also cheated by adding things that I didn't say this morning! Everyone else please go ahead and add to the bits you are interested in.

1. How useful is Ne?

PH: Inferred values of Ne are often orders of magnitude smaller than the census size. So how can we interpret it? One reason we might need an Ne is because the real N does change (e.g. exponential expansion or bottlenecks). If this were the only effect, it would make sense to talk about Ne in terms of population size. However, the estimated Ne is strongly influenced by other factors that have nothing to do with population size - notably selection on genotype and sexual selection causing variation in reproductive success of individuals. It does not make sense to discuss the effects of these other factors as though they caused a change in N.

RN: It parameterizes the stochasticity of the evolution, so it's like the inverse of a diffusion constant.
BS: It's a proxy for diffusion, let's name it D.
ML: Yes, it would be helpful if it had a "more enigmatic name".
PH: Surely we want a LESS enigmatic name! It's true that it would make sense to talk about the strength of diffusion (genetic drift) and call it D. However, whatever we call it, the main problem is to explain why the apparent Ne is small. The apparent Ne is estimated by measuring theta (defined from the fraction of polymorphic sites, for example). The fact that theta and Ne are small is probably not because drift is large - it is because factors other than drift are influencing the estimate of theta and Ne.

MK: Ne is a scaling factor, particularly for the rate of entry of new mutations in the population. But its inferred value varies widely across the genome.
PH: The rate of entry of new mutations is Nu (i.e. the real N). This is a relevant quantity but we don't need another symbol for it.
ML: Theta is a relevant quantity because it can be directly measured from polymorphism data.
PH: Yes, we should focus on predicting quantities that can be observed directly. It makes sense to say that the density of polymorphic sites varies along the genome but not that Ne varies along the genome. This means that it is no longer true that theta = 2Ne.u. Theta should be defined as the observed fraction of polymorphic sites and not in terms of this formula (which is only true for the neutral theory).
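To make the distinction concrete, here is a minimal Python sketch of one standard way to measure theta from polymorphism data (Watterson's estimator, which the discussion did not name explicitly) and of how an apparent Ne is then backed out from theta = 2Ne.u. The function names and the numbers are hypothetical, chosen only for illustration. The last step is valid only under the neutral model, which is exactly the assumption PH questions.

```python
def harmonic_number(num_samples):
    """Return a_n = sum_{i=1}^{n-1} 1/i, the Watterson normalisation."""
    return sum(1.0 / i for i in range(1, num_samples))


def watterson_theta(num_segregating, num_samples, num_sites):
    """Per-site theta estimated from the count of polymorphic sites."""
    return num_segregating / (harmonic_number(num_samples) * num_sites)


def apparent_ne(theta, mutation_rate):
    """Invert theta = 2 * Ne * u. Only meaningful under neutrality."""
    return theta / (2.0 * mutation_rate)


# Hypothetical numbers, for illustration only.
theta = watterson_theta(num_segregating=50, num_samples=10, num_sites=10000)
ne = apparent_ne(theta, mutation_rate=1e-8)
print("theta per site:", theta)
print("apparent Ne under neutrality:", ne)
```

The first two functions use only observed data. Only apparent_ne imports the neutral formula, so applying it window by window along a genome would yield a "varying Ne" that is really just a rescaled density of polymorphic sites.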

BS: If one thinks of evolution as a birth-death process, one naturally wants a fluctuating population size. It seems that fixing N constant in the Wright-Fisher model makes the theory more painful to work with. We should use a branching process with equal birth and death rates and calculate properties with the constraint that the population has not yet gone extinct.
PH: This branching process has a constant mean population but any one trajectory will go very much higher or lower than that and will eventually go extinct. This does not seem like a good model. We need something where the population size is constrained to fluctuate within a reasonable range around the average. Or maybe Wright-Fisher is OK anyway. It seems closer to reality than the branching process.
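The behaviour PH describes can be explored with a short simulation. The sketch below is a toy version of the branching process, under assumptions that were not part of the discussion: discrete generations, in which each individual either dies or splits in two with equal probability, so that the mean size stays constant. It does not implement the conditioning on non-extinction that BS proposes; it only records how far trajectories wander and how many go extinct. All names and parameter values are hypothetical.

```python
import random


def critical_branching(n0, generations, rng):
    """Simulate one trajectory of a critical branching process.

    Each individual independently leaves 0 or 2 offspring with
    probability 1/2 each, so the expected size stays at n0.
    """
    sizes = [n0]
    n = n0
    for _ in range(generations):
        survivors = sum(1 for _ in range(n) if rng.random() < 0.5)
        n = 2 * survivors
        sizes.append(n)
        if n == 0:  # extinction is absorbing, so stop here
            break
    return sizes


rng = random.Random(0)  # fixed seed so that runs are repeatable
runs = [critical_branching(n0=100, generations=200, rng=rng) for _ in range(20)]
extinct = sum(1 for sizes in runs if sizes[-1] == 0)
print("extinct trajectories:", extinct, "of", len(runs))
print("smallest and largest sizes seen:",
      min(min(s) for s in runs), max(max(s) for s in runs))
```

A Wright-Fisher run would by construction stay at n0 in every generation, so comparing the spread printed here with that fixed value gives a rough feel for how different the two models are over a given number of generations.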

ML: Ne is a derived parameter, proportional to the population size, but modified by other parameters.
BS: But Daniel Fisher pointed out that in the tightly linked clonal interference case, theta goes like log (Ne).
ML: What you estimate for Ne also depends on your sample size. Do you see only established mutations, or also newly arising ones?
PH: Can somebody add something here, please? There was a discussion about the importance of recombination and whether Ne is still proportional to N if r is present. The fact that Ne is not even proportional to N in some cases means that it is completely ridiculous to talk about Ne as though it were a population size.

MK: Warren Ewens gave a talk a couple years ago about this. His advice: interpret at your own risk. <<<Is there a ref for this ?>>>
ML: What about the strong selection limit? Is any sort of diffusion then a poor description?

2. Given that the neutral coalescent can be a poor model, what can we conclude from inferences based on it?

MK: There is a big gap between what can be calculated using coalescent models and what is actually done when data are analyzed. Even those trained in coalescent methods don't necessarily use them on real data.
ML: Note that the connections between observables implied by the neutral coalescent model won't necessarily be true in the real data.
VM: Selection violates the fundamental assumptions of the coalescent.
BS: Carsten Wiuf's talk on rare alleles showed that you can do some limited things with selection in a coalescent framework. Rare alleles behave as though they were not constrained by total population size. Therefore the branching process model is relevant. <<<PH: is this what you said? Please expand>>>
MK: C. Neuhauser and S. M. Krone have an influential paper on selection in the coalescent where they work in the weak selection limit (link).
MK: Recombination is also difficult in the coalescent framework. Felsenstein has some methods to do the coalescent with recombination. (Marty to insert references).

2.5 What does adaptation mean?

Group was generally inspired by Ville Mustonen's talk. In a static fitness landscape, one can have lots of positively selected mutations fixing, but there's no adaptation. Those mutations are just counteracting previous deleterious mutations.
BC: The definition of fitness should include a time scale. Fitness over one generation may not correspond to fitness over long time scales.
PH: Can you be more specific about why fitness changes in time? Is it environmental change? Or selection induced by change in other species? Or changes elsewhere in the genome that cause epistatic selection on the first?
RN: Long term fitness is conditional on survival. For example, Paul Sniegowski's mutators are fitter, until they kill themselves off.
PH: The mutator stuff is interesting, but in that case selection acts indirectly on the things that are linked to the mutator allele and it does not act on the mutator directly. Therefore it is not a good example for this point.

SS: In VM's talk, fitness is defined on a population level, because we count fitness at fixation. As the allele segregates, it may not always have the same fitness landscape.
VM: Maybe static fitness landscape is a better null model than completely neutral. No adaptation, but lots of selection.

MK: Lewontin has written on the meaning of adaptation. For example, if cliff-dwelling birds have limited nesting sites, an allele leading to more eggs per nest may have a large fitness advantage, and it may sweep through the population. In the end, however, the population size is still limited by the number of nesting sites. Was that increase in egg numbers really adaptive if it didn't end up in more birds?
PH: The above example is like the Wright-Fisher model with fixed population. Yes of course it was adaptive even though the population didn't go up. We talked about the difference between relative and absolute fitnesses. Relative fitness makes sense whether the population is fixed or not. It is usually the relative fitness that matters because usually the maximum population is constrained. The only example I can think of right now where the absolute fitness is of central importance is 'mutational meltdown'. If deleterious mutations accumulate to the point where the average number of offspring per individual is less than one then the population dies by meltdown. As long as the average number of offspring for a well-fed uncrowded individual is greater than one, then the population expands to the carrying capacity where it is limited by food, space or whatever. Then we are back to the Wright-Fisher case with fixed population, where only relative fitness matters.

VM: Should adaptation be phrased as phenotype moving in a coherent direction?
MK: The alternative is "running in place".

3. What about dependence between sites?

RG: In the neutral case, expected values of single-site statistics don't change.
ML: That's expected; there's no extra force on them. But multiple site quantities are still difficult.
MK: Linkage increases the variance in even single-site quantities. This makes data interpretation difficult.

RN: Two meanings of "linkage": physical linkage along chromosome, and correlation between allele frequencies. Confusing terminology.
ML: Linkage disequilibrium is a misnomer, because LD doesn't depend on physical linkage and LD can exist in equilibrium.
PH: Let's talk about correlation when we mean correlation between allele frequencies on unlinked sites, and talk about linkage when we are really interested in linked sites on a chromosome.
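The terminology point can be illustrated with the usual two-locus statistics. The sketch below computes the standard coefficient D = p_AB - p_A.p_B and the squared correlation r^2 from haplotype and allele frequencies. The function name and the example frequencies are hypothetical. Note that nothing in the calculation refers to chromosomal position, which is why "correlation" describes the quantity better than "linkage".

```python
def linkage_disequilibrium(p_ab, p_a, p_b):
    """Return (D, r^2) for two biallelic sites.

    p_ab is the frequency of haplotypes carrying both A and B;
    p_a and p_b are the marginal frequencies of alleles A and B.
    """
    d = p_ab - p_a * p_b
    denom = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    # r^2 is undefined when either site is monomorphic.
    r2 = d * d / denom if denom > 0 else float("nan")
    return d, r2


# Hypothetical frequencies: A and B always occur together.
print(linkage_disequilibrium(p_ab=0.5, p_a=0.5, p_b=0.5))
# Hypothetical frequencies: A and B are statistically independent.
print(linkage_disequilibrium(p_ab=0.25, p_a=0.5, p_b=0.5))
```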

SS: How good is the correspondence between genetic maps inferred from population genetics and from direct experiments?
??: In humans it's pretty good.
MK: Correspondence is poor in flies. <<<PH Please can someone say more about humans flies etc. This is important.>>>

3.5 What about epistasis?

ML: How much is there? Is it a perturbation or the rule?
VM: Scanning for correlations between sites is like genome-wide association squared. Really poor statistical power.
??: Lots of sign epistasis yields a landscape so rugged that you can't evolve on it.
MK: That's clearly not what is seen in the real world. <<<PH: Please expand on this. How do we know there are not many more higher peaks out there that cannot be reached from the local maximum on which we find ourselves>>>

BS: Epistasis comes in two flavors:
1. Physiological epistasis. Introduce mutations in the lab, and see that their effects aren't additive. Systems biology makes this unsurprising. << PH: Boris - please expand on the type of experiments you mean and what was found>>
2. Epistasis in natural populations. Epistasis leads to multiple fitness optima. A population will typically only occupy a single maximum, so one won't see the potential epistasis. One would, however, expect outbreeding depression. If you mate populations that chose different epistatic optima, the epistasis will matter again and cause reduced fitness. Paul Sniegowski sees this in yeast.
SG: Hard optimization problems don't have as many local maxima as people used to think. Instead they have many ridges.
PH: What about epistasis in protein sequence evolution? We need a model that goes from structure to fitness. Protein folding must be a global problem that depends on the whole sequence. There are many interactions with each residue. There are many close contacts between residues that are far apart along the sequence. Surely there should be a lot of epistasis? A good example of ridges would be the neutral networks for RNA evolution and maybe protein evolution too. Having ridges in the landscape may be important, but still the majority of mutations take you off the ridge. There is still a lot of epistasis in models with neutral networks and ridges.

BC: So is it like the neutral theory, which just ignores the many supposed highly deleterious mutations? Here we just ignore the strongly deleterious epistatic combinations.
BS: There are many definitions of epistasis. Makes it a mess to work on.
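As one concrete instance among those many definitions, the sketch below uses a common two-locus convention: epistasis is the deviation from additivity, w11 - w10 - w01 + w00, and "sign epistasis" means that a mutation's effect changes sign depending on the allele at the other locus. This choice of definition, the additive scale (log fitness would be used for a multiplicative null), the function names, and the fitness values are all assumptions made for illustration and were not specified in the discussion.

```python
def epistasis(w00, w10, w01, w11):
    """Deviation from additivity for a two-locus, two-allele system."""
    return w11 - w10 - w01 + w00


def classify_epistasis(w00, w10, w01, w11, tol=1e-12):
    """Label the interaction as none, magnitude, sign or reciprocal sign."""
    # Does the effect of the first mutation flip sign across backgrounds?
    first_flips = (w10 - w00) * (w11 - w01) < 0
    # Does the effect of the second mutation flip sign across backgrounds?
    second_flips = (w01 - w00) * (w11 - w10) < 0
    if first_flips and second_flips:
        return "reciprocal sign"
    if first_flips or second_flips:
        return "sign"
    if abs(epistasis(w00, w10, w01, w11)) > tol:
        return "magnitude"
    return "none"


# Hypothetical fitness values: both single mutants are worse than the
# wild type, but the double mutant is better, giving two local optima.
print(classify_epistasis(w00=10, w10=9, w01=9, w11=12))
```

The two-optima case printed here is the kind of landscape in which a population sitting on one peak would show little epistasis among its segregating variants, while crosses between populations on different peaks would expose it.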

To be Continued
PH: Here is the rest of my agenda. Does anyone want to revise this before we start again next week?
4. Interaction of selection with geography/migration
5. How is human population genetics similar or different to other species?
6. What data would we like?
7. Genome evolution - duplications/deletions/inventions of new genes
8. Evolution of regulatory regions.