Salmon Run: Modeling the Evolution of Skin Color

Some time back, someone asked this question on Quora: if all humans descended from a single population of ancestors in Africa, how did different skin colors come about? Pippi Maria Groving provided an answer (the first one) that I found intuitively very appealing, although she does mention (and later answers also indicate) that it may be a bit of an oversimplification.

Essentially, she states that three important genes govern skin color in humans, say A, B and C. Each human has a pair of each, and each member of a pair may be the dominant or the recessive version of the gene, denoted by uppercase and lowercase letters respectively. Thus, in this model, skin color is determined by six possible gene variants (inherited in pairs): A, a, B, b, C, c. The dominant variants produce pigments that darken the skin, so the more dominant variants one has, the darker the skin color. She also provides an 8x8 Punnett square exhaustively listing all combinations of genes (called genotypes) and their resulting skin colors (phenotypes). Binning the phenotypes into 7 distinct skin colors results in a theoretical ratio of 1:6:15:20:15:6:1 from very dark to very light.
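The ratio can be checked by brute force. The sketch below is my own illustration, not part of the Quora answer: it assumes both parents are heterozygous for all three genes (AaBbCc), which is what an 8x8 square implies, and the names `gametes` and `count_dominant` are hypothetical helpers introduced just for this check.

```python
from collections import Counter
from itertools import product

# Each AaBbCc parent can pass on one version of each gene: 2 * 2 * 2 = 8 gametes.
gametes = list(product("Aa", "Bb", "Cc"))

def count_dominant(alleles):
    """Number of uppercase (dominant) letters in a sequence of allele letters."""
    return sum(1 for allele in alleles if allele.isupper())

# The 8x8 Punnett square is every (mother gamete, father gamete) pairing.
square = Counter(
    count_dominant(mother + father)
    for mother, father in product(gametes, repeat=2)
)

# Order from 6 dominant alleles (darkest) down to 0 (lightest).
ratio = [square[k] for k in range(6, -1, -1)]
print(ratio)  # [1, 6, 15, 20, 15, 6, 1]
```

The counts are the binomial coefficients for choosing k dominant variants out of 6 slots, which is why the ratio is symmetric.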

I am currently also doing an edX course (offered by MIT) on Quantitative Biology, so I figured it would be interesting to model this as a simulation and see whether my experimental results matched the theoretical ratios. Since skin color is an adaptation to climate, with darker skin providing protection from ultraviolet (UV) light in hot and sunny climates, and lighter skin able to make more vitamin D with limited UV in colder and less sunny climates, I ran another simulation to model that situation. This post is a result of that work.

The functions used for this simulation are provided below. I describe the relevant ones later, as they are called from the main code.

from operator import itemgetter
from random import random
import math
import matplotlib.pyplot as plt
import nltk
import numpy as np

def person():
    alleles = []
    for allele in ['a','b','c']:
        pairs = []
        for pair in range(2):
            pairs.append(allele if random() <= 0.5 else allele.upper())
        alleles.append("".join(sorted(pairs)))
    return alleles

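def shuffle_and_choose(counts):
    # Pick one element of counts uniformly at random: draw a random key per
    # index, sort the indices by key, and take the first one.
    shuffled = [x[0] for x in sorted(enumerate([random() for i in
                range(len(counts))]), key=itemgetter(1))]
    return counts[shuffled[0]]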

def compute_mating_likelihood(left, right):
    left_dominant = get_num_dominant(left)
    right_dominant = get_num_dominant(right)
    diff = abs(left_dominant - right_dominant)
    return math.exp(-diff)

def mate(left, right):
    mated_alleles = []
    for i in range(3):
        child_pairs = []
        for lp in left[i]:
            for rp in right[i]:
                child_pairs.append("".join(sorted([lp, rp])))
        mated_alleles.append(shuffle_and_choose(child_pairs))
    return mated_alleles

def get_num_dominant(allele):
    return len([c for c in "".join(allele) if c == c.upper()])    
        
def produce_next_generation(curr_gen, region_filter=None):
    next_gen = []
    # Integer division keeps the slice indices whole numbers on Python 2 and 3.
    males = curr_gen[:len(curr_gen) // 2]
    females = curr_gen[len(curr_gen) // 2:]
    i = 0
    while i < len(curr_gen):
        mptr = int(random() * len(males))
        fptr = int(random() * len(females))
        offspring = mate(males[mptr], females[fptr])
        if region_filter is not None:
            num_dominant = get_num_dominant(offspring)
            if num_dominant not in region_filter:
                if random() > 0.1:
                    continue
        next_gen.append(offspring)
        i = i + 1
    return next_gen
        
SKIN_COLORS = {
  6: 0x111111, 5: 0x6B0000, 4: 0x7B3812, 3: 0xAB671D, 
  2: 0xE0AD87, 1: 0xFDDACA, 0: 0xFEF2DF 
}

def get_color_distrib(curr_gen):
    color_dist = {k:0 for k in SKIN_COLORS.keys()}
    for alleles in curr_gen:
        num_dominant = get_num_dominant(alleles)
        color_dist[num_dominant] = color_dist[num_dominant] + 1
    dist_values = []
    for k in sorted(list(color_dist.keys())):
        dist_values.append(color_dist[k])
    return np.array(dist_values)

def plot_population_chart(color_pop, gen_title):
    xs = [str(hex(SKIN_COLORS[x])).replace("0x", "#") for x in range(7)]
    plt.bar(range(len(xs)), color_pop, color=xs)
    plt.xlabel("Skin Colors")
    plt.ylabel("Frequency")
    plt.xticks([])
    plt.title("Skin Color Distribution: %s" % (gen_title))
    plt.show()

def plot_population_drift(drift_data, title):
    generations = range(drift_data.shape[1])
    xs = range(drift_data.shape[0])
    colors = [str(hex(SKIN_COLORS[x])).replace("0x", "#") for x in xs]
    plt.stackplot(generations, drift_data, baseline="zero", colors=colors)
    plt.xlabel("Generations")
    plt.ylabel("Frequency")
    plt.title("Phenotype Drift:%s" % (title))
    plt.show()

The first part models a situation where skin color is left entirely to chance. I generate a population of 2,000 individuals, each with a random set of 3 gene pairs. Then I split the population in half, giving two sets of 1,000 individuals each. A random member of one set is mated with a random member of the other to produce an offspring. To ensure randomness in the offspring, I exhaustively compute all four possible pairings for each gene and randomly pick one as the offspring's pair for that gene. Each generation produces 2,000 offspring this way. This is repeated for 100 generations, and the resulting population is binned by the 7 skin color phenotypes.
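To make the per-gene step concrete, here it is in isolation. This is a minimal sketch rather than the code used for the charts: `punnett` and `mate_one_gene` are hypothetical names, and `random.choice` stands in for the `shuffle_and_choose` helper above, on the assumption that a uniform pick is all that helper is meant to do.

```python
import random

def punnett(left_pair, right_pair):
    """All four equally likely child pairs for one gene, e.g. 'Aa' x 'Aa'."""
    return ["".join(sorted([lp, rp])) for lp in left_pair for rp in right_pair]

def mate_one_gene(left_pair, right_pair, rng=random):
    """Pick one child pair uniformly at random from the four possibilities."""
    return rng.choice(punnett(left_pair, right_pair))

print(punnett("Aa", "Aa"))  # ['AA', 'Aa', 'Aa', 'aa']
```

Sorting the two letters puts the uppercase one first, so "aA" and "Aa" are recorded as the same genotype, and the heterozygous pair turns up twice as often as either homozygous one.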

num_generations = 100

drift_data = np.zeros((len(SKIN_COLORS), num_generations + 1))
curr_gen = [person() for i in range(2000)]
drift_data[:, 0] = get_color_distrib(curr_gen)
plot_population_chart(drift_data[:, 0], "Initial")

for i in range(num_generations):
    next_gen = produce_next_generation(curr_gen)
    drift_data[:, i+1] = get_color_distrib(next_gen)
    curr_gen = next_gen
    
plot_population_chart(drift_data[:, num_generations], "Final")    

print(drift_data[:, num_generations])

As can be seen, the distribution of the skin color phenotype in the 100th generation looks remarkably similar to that of the first generation. The colors are identical to those used in Pippi's answer (thanks to this online Color Picker Tool).

The chart below shows the drift in the phenotype distribution across generations. Once again, the long term trend seems quite flat and unchanging.

The counts of individuals in the different skin color categories in the final generation were, for me, 39:182:467:637:473:169:33, which is remarkably similar to the theoretical ratio, as shown below. Note: some numbers were rounded to make them line up for easier visual comparison.

>>> from __future__ import division
>>> import numpy as np
>>> theoretical = np.array([1, 6, 15, 20, 15, 6, 1])
>>> observed = np.array([39, 182, 467, 637, 473, 169, 33])
>>> theoretical / np.sum(theoretical)
array([ 0.0156, 0.094, 0.2344, 0.3125, 0.2344, 0.0938, 0.0156])
>>> observed / np.sum(observed)
array([ 0.0195, 0.091, 0.2335, 0.3185, 0.2365, 0.0845, 0.0165])
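One way to put a number on "remarkably similar" is a chi-square goodness-of-fit statistic. This is an extra check I am adding here, not something from the original experiment. It is a plain-Python sketch, and the critical value is the standard table entry for 6 degrees of freedom at the 5% level.

```python
theoretical = [1, 6, 15, 20, 15, 6, 1]
observed = [39, 182, 467, 637, 473, 169, 33]

# Scale the theoretical ratio up to the observed population size.
total = sum(observed)
weight = float(sum(theoretical))
expected = [total * t / weight for t in theoretical]

# Pearson's chi-square statistic: sum of (O - E)^2 / E over the 7 bins.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Critical value for 6 degrees of freedom at the 5% level (standard tables).
CRITICAL_5PCT_DF6 = 12.592

print(round(chi_square, 2))  # 4.28
print(chi_square < CRITICAL_5PCT_DF6)  # True
```

A statistic well below the critical value means the final generation is statistically indistinguishable from the 1:6:15:20:15:6:1 ratio at that significance level.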

Next, I tried to simulate the situation where identically random sets of people moved to different geographical regions (5 in my case) with different levels of sunlight. In each region, natural selection would ensure that people of certain skin colors survived. I chose skin color "trigrams" for each region. Thus, counting the categories from 1, region 1 (cold and dark) gets skin color categories (1,2,3), region 2 gets categories (2,3,4), and so on; in the code these are the dominant-gene counts (0,1,2), (1,2,3), up to (4,5,6). Offspring produced in each generation whose genotypes resolved to one of the "approved" skin colors for the region would survive unconditionally, while others would survive with a 10% chance.
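The region windows and the survival rule can be sketched on their own, without NLTK. This is an illustration under my own naming (`regions`, `survives` and the constants are hypothetical), using the same 10% threshold as `produce_next_generation`.

```python
import random

NUM_CATEGORIES = 7      # 0..6 dominant gene variants
WINDOW = 3              # each region favors three adjacent categories
SURVIVAL_OUTSIDE = 0.1  # chance of surviving with a non-favored skin color

# Sliding windows over the categories: (0, 1, 2), (1, 2, 3), ..., (4, 5, 6).
regions = [tuple(range(start, start + WINDOW))
           for start in range(NUM_CATEGORIES - WINDOW + 1)]

def survives(num_dominant, region, rng=random):
    """Favored categories always survive; others survive about 10% of the time."""
    if num_dominant in region:
        return True
    return rng.random() <= SURVIVAL_OUTSIDE

print(regions[0], regions[-1])  # (0, 1, 2) (4, 5, 6)
```

Because rejected offspring are simply redrawn, each generation still ends up with the full population size; the filter only changes which genotypes fill it.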

regions = list(nltk.trigrams(range(7)))
for i in range(len(regions)):
    drift_data = np.zeros((len(SKIN_COLORS), num_generations + 1))
    curr_gen = [person() for x in range(2000)]
    # Record the starting population so generation 0 is not plotted as empty.
    drift_data[:, 0] = get_color_distrib(curr_gen)
    for j in range(num_generations):
        next_gen = produce_next_generation(curr_gen, region_filter=set(regions[i]))
        drift_data[:, j+1] = get_color_distrib(next_gen)
        curr_gen = next_gen
    plot_population_drift(drift_data, "Dispersion, Region %d" % (i+1))

This produces the following population drift charts.

As can be seen, each region seems to have a preferred skin color that begins to dominate after a while. So the model, grossly oversimplified as it is, seems to agree with the facts.

I started on this because I was curious whether I could build something that approximated reality using randomness (i.e., random.random() or flipping a coin). I had lots of fun with it, and I hope you enjoyed reading about it too. The code for this is available on my project on GitHub here.