New Scientist wrote:At least 75 per cent of our DNA really is useless junk after all
You’re far from a perfect product. The code that makes us is at least 75 per cent rubbish, according to a study that suggests most of our DNA really is junk after all.
After 20 years of biologists arguing that most of the human genome must have some kind of function, the study calculated that in fact the vast majority of our DNA has to be useless. It came to this conclusion by calculating that, because of the way evolution works, we’d each have to have a million children, and almost all of them would need to die, if most of our DNA had a purpose.
But we each have just a few children on average, and our genetic health is mostly fine. The study therefore concludes that most of our DNA really must be junk – a suggestion that contradicts controversial claims to the contrary from a group of prominent genomics researchers in 2012.
Junk or not?
When researchers first worked out how DNA encodes the instructions for making proteins in the 1950s, they assumed that almost all DNA codes for proteins. However, by the 1970s, it was becoming clear that only a tiny proportion of a genome encodes functional proteins – about 1 per cent in the case of us humans.
Biologists realised that some of the non-coding DNA might still have an important role, such as regulating the activity of the protein-coding genes. But around 90 per cent of our genome is still junk DNA, they suggested – a term that first appeared in print in a 1972 article in New Scientist.
But throughout the 2000s, a number of studies purported to show that junk DNA was nothing of the sort, based on demonstrating that some tiny bits of non-coding DNA had some use or other. These claims proved popular with creationists, who were struggling to explain why an intelligently designed genome would consist mostly of rubbish.
The grandest claim came in 2012, when a consortium of genomics researchers called ENCODE declared that, according to their project, a huge 80 per cent of the DNA in the human genome has a function. “They had spent $400 million, they wanted something big to say,” says Dan Graur of the University of Houston.
Graur is one of many researchers who didn’t believe ENCODE’s claim. The heart of the issue is how you define functional. ENCODE defined DNA as such if it showed any “biochemical activity”, for instance, if it was copied into RNA. But Graur doesn’t think a bit of activity like this is enough to prove DNA has a meaningful use. Instead, he argues that a sequence can only be described as functional if it has evolved to do something useful, and if a mutation disrupting it would have a harmful effect.
Millions of children
Mutations to DNA happen at random for several reasons, such as UV radiation or mistakes made when DNA replicates during cell division. These mutations change one base of DNA into another – an A to a T, for example – and when they occur in a gene are more likely to be harmful than beneficial.
When we reproduce, our children inherit a shuffled bag of mutations, and those with a collection of particularly bad ones are more likely to die before having children of their own. This is how evolution stops bad mutations building up to dangerously high levels in a species.
Following Graur’s logic, if most of our DNA is functional, we would accumulate a large proportion of harmful mutations in important sequences. But if most of our DNA is junk, the majority of mutations would have no effect.
Graur’s team have now calculated how many children a couple would need to conceive so evolution could weed out enough bad mutations from our genomes as fast as they arise. If the entire genome was functional, couples would need to have around 100 million children, and almost all would have to die. Even if just a quarter of the genome is functional, each couple would still have to have nearly four children on average, with only two surviving to adulthood, to prevent harmful mutations building up to dangerous levels.
Taking into account estimates of the mutation rate and average prehistorical reproduction rate, Graur’s team calculated that only around 8 to 14 per cent of our DNA is likely to have a function.
So, let's move on to the actual paper, shall we? Which is this one:
An Upper Limit On The Functional Fraction Of The Human Genome by Dan Graur, Genome Biology & Evolution, 2017 evx121. DOI: 10.1093/gbe/evx121
This paper is at the moment freely downloadable from the journal page for the DOI given above.
We'll start as usual with the abstract:
Graur, 2017 wrote:Abstract:
For the human population to maintain a constant size from generation to generation, an increase in fertility must compensate for the reduction in the mean fitness of the population caused, among others, by deleterious mutations. The required increase in fertility due to this mutational load depends on the number of sites in the genome that are functional, the mutation rate, and the fraction of deleterious mutations among all mutations in functional regions. These dependencies and the fact that there exists a maximum tolerable replacement level fertility can be used to put an upper limit on the fraction of the human genome that can be functional. Mutational load considerations lead to the conclusion that the functional fraction within the human genome cannot exceed 25%, and is probably considerably lower.
Graur's argument is not especially complicated, but requires several steps to be taken carefully one at a time, in order to be appreciated fully. So, we start with the introduction:
Graur, 2017 wrote:Introduction:
Many evolutionary processes can cause a population to have a mean fitness lower than its theoretical maximum. For example, deleterious mutations may occur faster than selection can get rid of them; recombination may break apart favorable combinations of alleles, thus creating less fit combinations; and genetic drift may cause allele frequencies to change in a manner that is antagonistic to the effects of natural selection. Genetic load (L) is defined as the reduction in the mean fitness of a population (w) relative to the individual with the maximal fitness (wmax) in the population (Haldane 1937; Muller 1950).
L = (wmax - w)/wmax [1]
There are many kinds of genetic loads, such as the load caused by deleterious mutations, the segregation load, the substitutional load (also referred to as the “cost of natural selection”), the load due to recombination, and loads due to migration and inbreeding. In the following, we use the mutational load, i.e., the reduction in mean population fitness due to deleterious mutations, as a proxy for the overall genetic load. This is a conservative approach, as the true genetic load can only be equal to or higher than our estimate.
In conformity with standard practice, the terms involving means are, in the original paper, written with bars above them, which will have to be omitted due to limitations of the forum software with regard to rendering non-standard character sequences (there's no built in overstrike facility for one). So, in the discussion that follows, you'll have to insert the bars mentally when required.

So, the relatively simple expression above defines genetic load L in terms of the mean fitness of a population, relative to the maximum fitness enjoyed by one member thereof. By restricting attention to mutational load only, as stated above, the estimate is conservative, and consequently provides a lower bound on the actual value of L for any given population. This is important in what follows.
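As a minimal sketch of equation [1], the following Python function computes the genetic load from a mean fitness and a maximal fitness. The function name and the fitness values fed to it are hypothetical illustrations, not figures from the paper.

```python
def genetic_load(w_mean: float, w_max: float) -> float:
    """Equation [1]: L = (wmax - w) / wmax."""
    if w_max <= 0:
        raise ValueError("w_max must be positive")
    if not 0 <= w_mean <= w_max:
        raise ValueError("w_mean must lie between 0 and w_max")
    return (w_max - w_mean) / w_max


# Hypothetical fitness values, chosen only to show the shape of the formula.
for w_mean in (1.0, 0.8, 0.5, 0.0):
    print(f"w = {w_mean:.1f}, wmax = 1.0  ->  L = {genetic_load(w_mean, 1.0):.2f}")
```

A population whose mean fitness equals the maximum carries no load (L = 0), and the load approaches 1 as the mean fitness approaches 0.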
Next, Graur moves on to consider replacement fertility values, and the relationship thereof with mutational load:
Graur, 2017 wrote:The mutational load determines the mean fitness of a population, which in turn determines the mean fertility required to maintain a constant population size, i.e., the replacement level fertility, as a function of the number of functional sites in the genome. Obviously, fertility values cannot be arbitrarily large, and that there exists a relatively modest upper limit for tolerable mean fertility values in human populations.
Here, we use empirical data on genome size, mutation rates, the fraction of deleterious mutations from among all mutations in functional regions, as well as data on fertility rates to estimate an upper limit on the functional fraction of the human genome.
Graur's entire thesis, as has already been revealed in the non-technical account, is that a reasonable analysis of the relationship between mutational load and replacement fertility places tight constraints on the fraction of the genome that can consist of functional DNA. However, a key aspect of Graur's work consists of providing a proper definition of 'functional DNA', which duly follows below:
Graur, 2017 wrote:Definitions
Throughout this paper, the term “function” is used to denote selected effect function, i.e., a capacity that has been shaped by and is maintained by natural selection (Wright 1973; Graur et al. 2013, 2014; Brunet and Doolittle 2014). The selected effect function stands in contradistinction with the causal role function (or activity), which is ahistorical and nonevolutionary, and merely describes what an entity does (Cummins 1975; Amundson and Lauder 1994). A genomic segment is considered to possess a selected effect function if at least one out of all the possible mutations that can affect its sequence is deleterious (Graur 2016; pp. 492–496).
In short, Graur argues that mere reactivity with other molecules is woefully insufficient to constitute 'function' in a biological setting, and that instead, for a sequence to be genuinely functional therein, it must be suitably coupled, in a rigorous manner, to a product capable of being subject to selection processes. In particular, the product in question must be capable of being disrupted by a deleterious mutation, and for this to have, at least in principle, a quantifiable phenotypic effect, even if said quantification proves difficult in practice.
Now we're ready to move on to population fertility, dealt with succinctly thus:
Graur, 2017 wrote:The mean fertility of a population is the mean number of offspring born per individual. Here, we are interested in the mean replacement level fertility (F), i.e., the fertility required to maintain a constant population size from generation to generation.
Again, insert the bar where needed.

Now we come to the crux of the matter:
Graur, 2017 wrote:Model
The purpose of this model is to make a quantitative connection between the rate of deleterious mutation, the fraction of the genome that is functional, and replacement level fertility.
In the model we assume that the probability of a mutation occurring in a certain region of the genome is independent of the functionality or lack of functionality of the region in which the mutation arises (Luria and Delbrück 1943; Lederberg and Lederberg 1952). We also assume that all mutations occurring in the nonfunctional fraction of the genome are neutral. Mutations occurring in the functional fraction of the genome, on the other hand, are assumed to be either deleterious or neutral. Advantageous mutations are known to be extremely rare (e.g., Eyre-Walker and Keightley 2009) and, hence, unlikely to affect the results.
Although this model, as stated above, is an elementary one, it is one that is consonant with data, as supplied in the cited references.
Now the next key step is introduced, viz:
Graur, 2017 wrote:By assuming that the fitness contributions of different loci are independent from one another, i.e., that there is no epistasis, then the load of mutation can be approximated as
L ≈ Cμdel [2]
where μdel is the mean deleterious mutation rate and C is a constant between 1 for completely recessive mutations and 2 for completely dominant mutations (Crow and Kimura 1970).
Thus, the mutational load does not depend on the strength of selection against any particular mutation. This surprising result comes from the fact that alleles under strong selection are relatively rare, but their effects on mean fitness are large, while the alleles under weak purifying selection are common, but their effects on mean fitness are small. As a result, the effects of these two types of mutation neatly cancel out. To understand the magnitude of the mutational load in a population, we need only determine the deleterious mutation rate, not the distribution of fitness effects.
So, already, the model proposed allows us to relate simple population variables without having to delve into the minutiae of individual populations, and without loss of generality of application. Those familiar with the thermodynamic modelling of gases will appreciate the similarity between that approach and Graur's model in this respect.
Note that in the above, μdel is a properly constituted probability value, lying in the interval [0, 1].
Graur now moves on to consider how mean fitness is related to mean deleterious mutation rate:
Graur, 2017 wrote:The mean fitness of the population can be defined by two variables, the mean deleterious mutation rate per functional nucleotide site per generation (μdel) and the number of functional nucleotide sites (n) in the genome (Kimura 1961; Nei 2013).
w = (1 - μdel)^n [3]
Note that the larger n is, the lower w will be.
The last sentence of the quotation above deserves emphasis, because by definition, (1 - μdel) < 1 for any nonzero value of μdel within the permitted interval, and x^n decreases as n increases whenever 0 < x < 1. Consequently, for large n, w will be small, unless μdel is itself very small, and (1 - μdel) is consequently very close to 1. Numbers will be supplied later, to allow a quantitative view of this relationship.
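To see how quickly w falls as n grows, here is a small Python sketch of equation [3], with w equal to (1 - μdel) raised to the power n. The function name is my own, and the values of μdel and n are hypothetical round numbers rather than the paper's data. The power is evaluated through logarithms because (1 - μdel) is extremely close to 1.

```python
import math


def mean_fitness(mu_del: float, n: float) -> float:
    """Equation [3]: w = (1 - mu_del) ** n, computed via log1p for accuracy."""
    if not 0 <= mu_del < 1:
        raise ValueError("mu_del must lie in [0, 1)")
    return math.exp(n * math.log1p(-mu_del))


MU_DEL = 1e-9  # hypothetical deleterious mutation rate per site per generation

for n in (1e7, 1e8, 1e9, 1e10):
    print(f"n = {n:.0e}  ->  w = {mean_fitness(MU_DEL, n):.6f}")
```

Whenever the product of n and μdel reaches about 1, the mean fitness has already dropped to roughly 1/e, and each further tenfold increase in n raises that fitness to the tenth power.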
Now, Graur moves on to considering the relationship between replacement fertility and mean fitness, viz:
Graur, 2017 wrote:Let us now consider the connection between mutational load and replacement level fertility (F). If the mortality rate before reproduction age is 0 and mean fertility is 1, then the population will remain constant in size from generation to generation. In real populations, however, the mortality rate before reproduction is greater than 0 and, hence, mean fertility needs to be larger than 1 to maintain a constant population size. In the general case, for a population to maintain constant size, its replacement level fertility should be
F = 1/w [4]
(Nei 2013)
Again, insert bars where appropriate.
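Equation [4] is simple enough to sketch in a few lines of Python. The function name and the sample fitness values are assumptions made for illustration only.

```python
def replacement_fertility_from_fitness(w_mean: float) -> float:
    """Equation [4]: F = 1 / w, for a mean fitness w in (0, 1]."""
    if not 0 < w_mean <= 1:
        raise ValueError("mean fitness must lie in (0, 1]")
    return 1.0 / w_mean


# Illustrative mean fitness values (assumed, not taken from the paper).
for w in (1.0, 0.5, 0.1, 0.02):
    f = replacement_fertility_from_fitness(w)
    print(f"w = {w:<4}  ->  F = {f:g}")
```

Read this way, a population with a mean fitness of 1 needs only F = 1, as in Graur's zero-mortality case, and every halving of mean fitness doubles the fertility needed to hold the population size constant.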

Now it's time for some data ...
Graur, 2017 wrote:Data
Genome size
The maximal possible number of functional sites in the human genome equals the size of the diploid genome. The human diploid genome size has been estimated to be 6.114 × 10^9 nucleotides in length (Doležel and Greilhuber 2010).
Mutation rates
Human germline mutation rates are known to vary among different regions of the genome (Harpak et al. 2016), to be different between males and females (Li et al. 2002), and to correlate with father’s age (Kong et al. 2012). In humans, the mean germline point mutation rate at the DNA level has been inferred by many methods and by using a variety of data sets (Kondrashov and Crow 1993; Drake et al. 1998; Nachman and Crowell 2000, Kondrashov 2003; Xue et al. 2009; Roach et al. 2010; Campbell et al. 2000; Kong et al. 2012; Michaelson et al. 2012). Notwithstanding the large number of estimates and estimation methodologies, the range of recent (i.e., 2010–2016) values for the germline mutation rate varies by merely a factor of 2.5, from 1.0 × 10^-8 to 2.5 × 10^-8 mutations per nucleotide site per generation (Scally 2016).
Fraction of mutations in functional regions that are deleterious
What fraction of the mutations occurring in a functional region of the genome consists of deleterious mutation? Since at present we cannot answer this question as far as RNA-specifying and nontranscribed genes are concerned, we will use coding regions in protein-coding genes as models for functional genomic regions. Approximately, 24% of all mutations occurring in coding regions are synonymous and, hence, almost certainly not subject to purifying selection (Price and Graur 2016). If we assume that all missense and nonsense mutations are deleterious, then a maximalist estimate for the deleterious mutation rate in functional regions (μdel) will be 76% of the total mutation rate (μ). Alternatively, we may assume that only nonsense mutations are deleterious, i.e., that all amino acid replacements are neutral, in which case a minimalist estimate of the deleterious mutation rate will be 4% of the total mutation rate. Empirical data indicate that about half of all missense mutations in coding regions are deleterious (Soskine and Tawfik 2010). Adding the deleterious missense mutations to the deleterious nonsense mutations yields an empirical mean estimate for the deleterious mutation rate of about 40% of the total mutation rate.
Range of deleterious mutation rates
By multiplying the lowest mutation rate estimate by the lowest possible fraction of deleterious mutations (4%), and by multiplying the highest mutation rate estimate by the highest possible fraction of deleterious mutations (76%), we infer that the rate of deleterious mutation ranges between 4 × 10^-10 to 2 × 10^-8 mutations per nucleotide site per generation. If we use the empirical estimate for the fraction of deleterious mutations out of all mutations (40%), then the range of deleterious mutation rates becomes 4 × 10^-9 to 1 × 10^-8 mutations per nucleotide site per generation.
So, from the above data, we have an estimate for μdel, which allows us to determine what values of mean fitness w will arise, for different values of n (up to nmax, the size of the entire diploid genome) via equation [3] above, and therefore, the replacement fertility rate required via equation [4], for each of these mean fitness values.
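The arithmetic behind those ranges can be checked with a short Python sketch. The input figures are the ones quoted from the paper's Data section. The variable names are mine, and the split of the 76% into a missense share and a nonsense share is my own reading of how the "about 40%" figure arises, so treat that step as an assumption.

```python
# Inputs quoted from the paper's Data section.
MU_LOW, MU_HIGH = 1.0e-8, 2.5e-8  # total mutation rate per site per generation
FRAC_NONSYNONYMOUS = 0.76         # missense + nonsense (maximalist estimate)
FRAC_NONSENSE = 0.04              # nonsense only (minimalist estimate)

# Assumed reconstruction of the empirical estimate:
# all nonsense mutations plus half of the missense mutations.
frac_missense = FRAC_NONSYNONYMOUS - FRAC_NONSENSE
FRAC_EMPIRICAL = FRAC_NONSENSE + 0.5 * frac_missense


def deleterious_rate(mu_total: float, deleterious_fraction: float) -> float:
    """Deleterious mutation rate per site per generation."""
    return mu_total * deleterious_fraction


widest = (deleterious_rate(MU_LOW, FRAC_NONSENSE),
          deleterious_rate(MU_HIGH, FRAC_NONSYNONYMOUS))
empirical = (deleterious_rate(MU_LOW, FRAC_EMPIRICAL),
             deleterious_rate(MU_HIGH, FRAC_EMPIRICAL))

print(f"empirical deleterious fraction: {FRAC_EMPIRICAL:.2f}")
print(f"widest range:    {widest[0]:.1e} to {widest[1]:.1e}")
print(f"empirical range: {empirical[0]:.1e} to {empirical[1]:.1e}")
```

The upper end of the widest range comes out at 1.9 × 10^-8, which the paper rounds to 2 × 10^-8.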
The results are summarised thus:
Graur, 2017 wrote:Results
The required replacement level fertility was calculated for a range of deleterious mutation rate values from 4.0 × 10^-10 to 2 × 10^-8 mutations per nucleotide per generation as a function of the fraction of the human genome that is assumed to be functional. The results are shown in Table S1 in the Supplementary Material. Some columns are reproduced in Table 1. We note that F scales positively and steeply with both the deleterious mutation rate and the number of functional sites in the genome.
I downloaded that table from the journal page linked to above, and the figures are pretty astounding to view. For example, starting with the smallest mutation rate, μdel = 4.0 × 10^-10 mutations per nucleotide per generation, the replacement fertility rate required is very close to 1.0 when the functional fraction of the genome (call this G) is just 0.01 (1%). With G set to 10%, the required value of F becomes 1.3, and with G set to 20%, F becomes 1.6. However, G set to 80% (the value recently offered up by the ENCODE project) requires a replacement fertility rate of 7.1, and by the time the entire genome is functional (G set to 100%), F becomes 12.0.
We only have to increase the value of μdel to 2.0 × 10^-9 for the numbers to start climbing alarmingly. The figures are as follows:
G = 0.01 (1%) : F = 1.1
G = 0.10 (10%) : F = 3.4
G = 0.20 (20%) : F = 12
G = 0.80 (80%) : F = 19,000
G = 1.00 (100%) : F = 220,000
So, even a relatively modest deleterious mutation rate results in the replacement fertility requirement increasing enormously to compensate, if a significant fraction of the genome is actually functional instead of junk.
Once we take μdel to the maximum value consonant with empirical data, namely 2 × 10^-8, the figures are as follows:
G = 0.01 (1%) : F = 3.4
G = 0.10 (10%) : F = 220,000
G = 0.20 (20%) : F = 4.7 × 10^10
G = 0.80 (80%) : F = 4.9 × 10^42
G = 1.00 (100%) : F = 2.3 × 10^53
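The three lists above can be approximately regenerated from equations [3] and [4] with the Python sketch below. It assumes that the number of functional sites is simply G times the quoted diploid genome size, and the function name is my own. It is not the code behind Table S1, so expect agreement only to within rounding. The largest entries are exponentially sensitive to the exact genome size and mutation rate fed in, so they can differ noticeably from the table while staying within the same order of magnitude.

```python
import math

N_MAX = 6.114e9  # diploid genome size in nucleotides, as quoted from the paper


def replacement_fertility(mu_del: float, functional_fraction: float) -> float:
    """F = 1 / (1 - mu_del) ** n, with n = functional_fraction * N_MAX."""
    n = functional_fraction * N_MAX
    return math.exp(-n * math.log1p(-mu_del))


for mu_del in (4.0e-10, 2.0e-9, 2.0e-8):
    print(f"mu_del = {mu_del:.1e}")
    for g in (0.01, 0.10, 0.20, 0.80, 1.00):
        f = replacement_fertility(mu_del, g)
        print(f"  G = {g:.2f} ({g:.0%}) : F = {f:.2g}")
```

Because F is exponential in the product of G and μdel, multiplying either by ten raises F to the tenth power, which is why the third list runs away so dramatically.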
So, for the lowest empirically reasonable value of μdel, each reproducing individual within the population has to produce over 7 offspring on average in order to maintain replacement fertility, if the ENCODE assertion on the functional proportion of the genome is correct. For those creationists who assert that the entire genome is functional, that replacement fertility jumps to 12 per individual, or 24 per couple, of which 22 are destined to die or fail to reproduce.
For the intermediate value of μdel (2.0 × 10^-9), the ENCODE project assertion requires replacement fertility to be 19,000 offspring per reproducing individual, and creationist assertions require that to jump to 220,000 offspring per reproducing individual. Any organisms producing fewer offspring than these values are doomed to extinction in short order, if either ENCODE's assertion or creationist assertions are something other than fantasy.
For the maximum empirically reasonable value of μdel, the numbers are even more alarming, and indeed require frankly impossible numbers of offspring to be produced with each new generation in order to maintain replacement fertility, with vast numbers of those offspring dying before reproduction. Here's Graur's own description thereof (I invite you all to share the hilarity):
Graur, 2017 wrote:Discussion
How high a replacement level fertility value can a human population tolerate? The answer is that F values cannot be arbitrarily large. One cannot imagine F = 50, i.e., the situation in which each woman in a population gives birth to an average of 100 children of which on average 98 will die or fail to reproduce. Thus, there must exist a relatively modest upper limit for the tolerable mean replacement level fertility.
While the oldest Homo sapiens fossil is ~315,000 years old (Hublin et al. 2017), the common ancestor of all modern human populations is only 100,000–200,000 years old (Green and Shapiro 2013). Throughout this period, mean replacement level fertility remained fairly constant (Davis 1986) and varied from less than 1.05 to nearly 1.75 per person, or from less than 2.1 to nearly 3.5 per couple (Espenshade et al. 2004). Given these numbers, we decided to use F = 1.8 as the maximum tolerable value.
From Table 1, we see that even for unrealistically low estimates of deleterious mutation rates, the fraction of the genome that can be functional cannot exceed 25%. If the fraction of deleterious mutations out of all mutations in functional regions is even slightly higher than 4%, then the fraction of the genome that can be functional becomes much lower. Realistically, the functional fraction of the genome cannot exceed 10–15%. These results agree with empirical estimates in the literature on the fraction of the human genome that is evolutionarily constrained (Rands et al. 2014).
Let us now see what happens if we assume that 80% of the diploid human genome is functional, as was claimed by the ENCODE Project Consortium (2012). By using the lower bound for the deleterious mutation rate (4 × 10^-10 mutations per nucleotide per generation), the mean individual fertility required to maintain a constant population size would be F = 7.14. For 80% of the human genome to be functional, each couple in the world would have to beget on average 15 children and all but two would have to die or fail to reproduce. If we use the upper bound for the deleterious mutation rate (2 × 10^-8 mutations per nucleotide per generation), then F becomes ~5 × 10^42, i.e., the number of children that each couple would have to have to maintain a constant population size would exceed the number of stars in the visible universe by ten orders of magnitude. The absurdity of such numbers was realized by Muller (1950, 1967) who suggested that genetic load values cannot exceed L = 1. Indeed, a recent estimate of the mutational load suggest that humans have a mutational genetic load of about 0.99 (Eory et al. 2010).
Enjoy his little swipe at Francis Collins immediately after the above.
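The 25% ceiling in the quoted Discussion can be recovered by turning equations [3] and [4] around: fix the maximum tolerable F, then solve for the largest functional fraction G that stays under it. The Python sketch below does this under the same assumption as before, namely that the number of functional sites is G times the diploid genome size. The function names are mine, and the per-couple conversion simply doubles the per-individual figure.

```python
import math

N_MAX = 6.114e9  # diploid genome size quoted from the paper
F_MAX = 1.8      # maximum tolerable replacement fertility chosen in the paper


def max_functional_fraction(mu_del: float, f_max: float = F_MAX) -> float:
    """Solve (1 - mu_del) ** (-G * N_MAX) = f_max for G, capped at 1."""
    g = math.log(f_max) / (-N_MAX * math.log1p(-mu_del))
    return min(g, 1.0)


def children_per_couple(f_individual: float) -> float:
    """Assumed conversion: two parents, so a couple needs 2F offspring."""
    return 2.0 * f_individual


print(f"G_max at mu_del = 4e-10: {max_functional_fraction(4e-10):.1%}")
print(f"children per couple at F = 7.14: {children_per_couple(7.14):.1f}")
```

At the lowest deleterious mutation rate the ceiling comes out at roughly 24%, just under the 25% stated in the paper, and F = 7.14 per individual corresponds to a little over 14 children per couple, which Graur rounds up to 15.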

Meanwhile, Graur rounds off the paper with some considerations of epistasis:
Graur, 2017 wrote:Above, we assumed that the mating pattern within human population is random and that deleterious mutations have independent effects on fitness. Deviations from either of these assumptions can affect the mutational load and consequently our estimate of the mean fertility required to maintain a constant population size. For example, both inbreeding and negative fitness epistasis (also referred to as synergistic epistasis on deleterious mutations) will reduce the mutational load by increasing the number of deleterious mutations removed from the population (Kimura and Maruyama 1966; Barrett and Charlesworth 1991). On the other hand, any factor that decreases the efficacy of selection, such as positive fitness epistasis (also referred to as antagonistic epistasis on deleterious mutations) or reduction in effective population size, will increase the mutational genetic load (Kimura et al. 1963).
Let us first deal with inbreeding. Empirical data pertaining to human populations show that with the exception of some isolates in Oceania and the Americas, genomic inbreeding coefficients in human populations are quite small and, in the context of mutational genetic load, negligible (Pemberton and Rosenberg 2014). Dealing with fitness epistasis in humans and other non-model organisms is somewhat more complicated. In the largest study to date, Wang et al. (2016) took advantage of the fact that most African Americans inherited their genome from both African and European ancestors. In such a population, it is possible to discover fitness epistasis between two loci by detecting combinations of an African allele at one locus and a European allele at another locus that exist in the population at greater or lesser proportions than expected by chance. In Wang et al.’s study, more than 24 million pairwise-locus tests from more than 16,000 individuals were performed. A single case of suspected epistasis was found, indicating that epistasis is quite rare. This finding is in agreement with previous studies (e.g., Kouyos et al. 2007; Halligan and Keightley 2009), which showed that “there is little empirical evidence that net synergistic epistasis for fitness is common” (Keightley 2012).
Recently, Sohail et al. (2017) claimed to have found evidence for negative epistasis among deleterious alleles in humans. In their study, they divided mutations into synonymous, nonsynonymous (or missense), stop-codon gain, stop codon-loss, and mutations affecting splicing. The last three categories of mutations were grouped together into a category called loss-of-function (LoF). As a proxy for epistasis they used linkage disequilibrium as follows: In the absence of epistasis, alleles should contribute to the mutation burden independently, such that the variance of the mutation burden is equal to the sum of the variances at all loci, i.e., to the additive variance (VA). For rare mutant alleles, the mutation burden should follow a Poisson distribution with a variance (σ^2) equal to its mean (μ). Hence under no epistasis, VA = σ^2 or σ^2/VA = 1. If negative or synergistic epistasis on deleterious alleles operates, negative linkage disequilibrium will be observed and, as a result, the variance of the mutation burden will be reduced, leading to σ^2/VA < 1. In contrast, under positive or antagonistic epistasis on deleterious alleles positive linkage disequilibrium between deleterious alleles will be observed, leading to σ^2/VA > 1.
Sohail et al. (2017) reasonably assumed that most LoF mutations are deleterious. In the LoF category, the value of σ^2/VA was 0.930, a very slight decrease in comparison with the expectation under no epistasis. This led them to claim that synergistic epistasis is prevalent among deleterious mutations. The problem with this conclusion is that LoF mutations constitute only about 3% of the mutations in the GoNL sample. Approximately 65% of the mutations were nonsynonymous. We know that to a greater or lesser extent, many nonsynonymous mutations have deleterious effects on fitness (e.g., Eyre-Walker et al. 2006; Eyre-Walker and Keightley 2007), but for this category of mutations, σ^2/VA = 2.077, more than twice the expectation under no epistasis. This result indicates that for the vast majority of deleterious mutations, positive epistasis prevails. The existence of positive epistasis indicates that the 25% estimate for the upper limit on the functional fraction of the human genome may be exaggerated.
Finally, we note that in addition to inferring an upper limit on the functional fraction of the human genome, we can also conclude that the fraction of deleterious mutations out of all mutations in functional regions should be very small. If more than 20% of all mutations in functional regions are deleterious, then the upper limit on the functional fraction of the human genome would be less than 2%, which is clearly false.
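For anyone wanting to get a feel for the variance-ratio statistic described in the quoted passage on Sohail et al., here is a toy Python sketch. It is emphatically not their pipeline: the data layout (one list of 0/1 mutant-allele indicators per individual), the function name, and the simulated allele frequency are all assumptions made for illustration. It computes the variance of the mutation burden divided by the sum of the per-locus variances.

```python
import random
import statistics


def variance_ratio(genotypes: list[list[int]]) -> float:
    """Variance of the per-individual burden over the sum of per-locus variances."""
    burdens = [sum(individual) for individual in genotypes]
    burden_variance = statistics.pvariance(burdens)
    n_loci = len(genotypes[0])
    additive_variance = sum(
        statistics.pvariance([individual[j] for individual in genotypes])
        for j in range(n_loci)
    )
    if additive_variance == 0:
        raise ValueError("no variation at any locus")
    return burden_variance / additive_variance


# Simulated population with independent rare alleles (hypothetical parameters).
rng = random.Random(0)
independent = [
    [1 if rng.random() < 0.02 else 0 for _ in range(100)]
    for _ in range(2000)
]

print(f"independent loci:          {variance_ratio(independent):.2f}")
print(f"alleles always together:   {variance_ratio([[0, 0], [1, 1]]):.2f}")
print(f"alleles never together:    {variance_ratio([[1, 0], [0, 1]]):.2f}")
```

Independent loci give a ratio close to 1, alleles that always travel together (positive linkage disequilibrium) push it above 1, and alleles that never co-occur (negative linkage disequilibrium) push it below 1, which is the pattern the quoted passage relies upon.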
Looks like that's pretty much done and dusted then, doesn't it?
