An extensive Tutorial on Demographic Modeling with fastsimcoal2 Coalescent Simulations
In evolutionary biology, reconstructing and quantifying evolutionary history is often challenging due to the complexity of underlying processes. Demographic modeling provides a powerful framework to address this issue by simulating evolutionary scenarios under defined parameters, comparing the resulting genomic patterns to empirical data, and refining model fit through optimization algorithms. As such, demographic modeling is a highly valuable tool in evolutionary research. In this tutorial, I present a concise introduction to the topic using fastsimcoal2.
BIOINFORMATICS
Tim L. Heller
7/8/20256 min read


fastsimcoal2 is a powerful Markov coalescent simulator, modeling complex evolutionary scenarios including shifting migration dynamics, effective population sizes, and population expansion or constrictions. These models can be created by template files, permitting to introduce unknown parameter variables for estimation. Providing observed nucleotide data enables the iterative optimization of these estimated parameters by Brent algorithms aiming to increase the likelihood of the simulated fitting the observed data. While this program is obtruse and requires precision in the file preparation, it is a precious addition to a biologist’s toolbelt. In this tutorial, I will provide a well-explained introduction to the most common use cases of fastsimcoal. For more complex scenarios or further information, check the original manual.
Preparation
To start the modeling process with fastsimcoal, it is best practice to define a Prefix, which will stay relevant throughout the analysis, and assign Deme numbers to each of your included population/ species. Fastsimcoal always starts numbering from 0. Write these down and we will return to them shortly.
This could look like:
PREFIX=Heller_Tutorial
#Demes (Populations)
C_caudatus = deme 0
C_bitaeniatus = deme 1
C_voeltzkowi = deme 2
C_Hybrid = deme 3
Observed Site Frequency Spectra (SFS) are needed to improve parameter fit in the model. There are two distinct types of SFS. Both require certain format changes to seamlessly work with fastsimcoal. I have presented these in this tutorial.
All SFS files provided to fastsimcoal must follow a strict naming convention. While easySFS automatically prepares these names in the fastsimcoal-folder, the ANGSD SFS files require a manual change. The names are dependent on the SFS state. For unfolded SFS (derived allele frequency), fastsimcoal2 expects:
“$PREFIX”_jointDAF1_0.obs, signifying the two-dimensional derived allele frequence between deme 1 and 0
Analogous, the folded SFS (minor allele frequency) is named by this pattern:
“$PREFIX”_jointMAF2_1.obs, describing the two-dimensional minor allele frequency between deme 2 and 1.
For multiSFS, which is a valid input type since fastsimcoal 2.6, the x-dimensional SFS should be named “$PREFIX”_DSFS.obs or “$PREFIX”_MSFS.obs for derived or minor allele frequency spectra accordingly.
I propose to use 2dSFS, as I had more success with them and they are easier to create, particularly if you have more than three demes involved.
Template file
Once you have created all 2dSFS pairings or the multidimensional SFS and named them correctly, you can start to describe your evolutionary model in the "$PREFIX".tpl file. I propose to download a template file and adjust accordingly, without implementing empty lines, tabs, symbols, or spaces.
Careful: Fastsimcoal is very sensitive to lettercases, spaces, empty lines, hidden symbols etc. Thus, a meticulous definition of your input file is invaluable. Comments can be added via “//” and are ignored by the algorithm. Furthermore, variables can be defined freely but must avoid overlap with each other or functions of fastsimcoal itself. It is good practice to add Dollar signs to the end of each variable, omitting overlap between variables like N_A and N_ANC.
Specify demes
First, we write the number of demes and their effective population sizes. If you did not previously estimate these parameters, you can give them a variable name. For our case, this would look like this.
// Number of Demes throughout the model
4
// Effective haploid popuation sizes (diploid Ne * 2)
1500
1500
500
N_HYB$
Then we describe the haploid sample sizes and generations since sampling. Haploid sample sizes relate to twice the number of your diploid samples, unless you use haplotype data. Usually, the sampling time of one population is set to 0 and used as an anchor unless there are observations between the last genomic sampling and the current state that ought to be included in the model (e.g., the extinction of a deme).
// Haploid sample sizes and generations passed since sampling
20 5
10 0
18 10
18 0
Finally, we have to model growth rates of our populations. Note, that these are defined backwards in time. Hence, negative growth rates define a population expansion forward in time and vice-versa. Unless you have specific reasons to assume the deme underwent constant growth, I prefer to leave these at zero. Non-zero growth rates tend to introduce population size inflation or non-coalescence in models unless priors are restrictive.
//Growth rates
0
0
0
0
Define Migration Matrices
After this, we can already define our first migration matrices. This must equally be done with great care to only model migration between existing demes and not to itself. The migration matrix in fastsimcoal are Nij matrices with i and j being the number of demes. Like all events and dynamics in fastsimcoal2, migration matrices are specified backwards in time. So the recipient population receives migrants if you reverse the flow of time. In chronological order, the recipient population is therefore the source and the donor population the sink of migrants. While this notation may confuse you at first, it is still necessary to consider both perspectives as fastsimcoal2 defines the migration as the likelihood of genes from the source population ending up in the sink. By reversing time, it quantifies how many migrants of the recipient population’s Ne migrated per generation.
So, in our example we could model an event, where 300 generations ago, a few individuals of C. bitaeniatus (deme 1) migrated towards the island Juan de Nova, home to C. caudatus (deme 0):
// Number of migration matrices
2
// Migration Matrix 0 :
0 0 0 0
N_10$ 0 0 0
0 0 0 0
0 0 0 0
As you remember, we defined two migration matrices. It is good practice to limit migration before the divergence of species to prevent self-migration, hence:
// NoMig Matrix 1:
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
Include historical Events
Next, we must define our evolutionary events. As evolutionary events, we can define extinction events, migration pulses, divergence, or population size changes (e.g. sudden bottlenecks). Furthermore, these are useful to set changes in evolutionary dynamics as limiting migration after a certain period or changes in growth coefficients. Like the migration matrices, historical events are specified backwards in time. Consider that the time points of the events are defined by their time stamp and not their order in the template.
The events are defined by the following space-delimited categories.
TIME_(GEN) SOURCE SINK MIGRATION NEW_DEME_SIZE NEW GROWTH NEW_MIG_MATRIX
We now model the following scenarios:
-Divergence of two deme 100 years ago
-Introgression (Migration) to one deme
-Extinction of a deme
-Hybrid speciation
Here, I elaborate the model I used in my first interaction with fastsimcoal2. It combines many evolutionary events, such as successive deme divergence, introgression to one deme, and hybrid speciation. The "historical event" is not a typo, it must be written like this and is part of the syntax.
//Historical event: time (in gen), source, sink, proportion of migrants, new deme size, new growth rate, new migration matrix
6 historical event
1 2 0 M_HYB3J$ 1 0 1
1 2 1 M_HYB3E$ 1 0 1
5 2 0 M_HYBJ$ 1 0 1
5 2 1 1 1 0 keep
T_DIV31$ 3 1 1 1 0 keep
T_DIV01$ 0 1 1 1 0 keep
Describe the Genome
Finally, we specify the number of unlinked blocks. To avoid errors, it is recommended to just use 1 for the whole SFS and define it as following:
//Number of independent (unlinked) chromosomes, and "chromosome structure" flag: 0 for identical structure across chromosomes, and 1 for different structures on different chromosomes.
1 0
//Number of contiguous linkage blocks in chromosome 1:
1
Then, SFS data is termed as FREQ, unless you have insights into recombination rates in your specimens, leave it at zero. Remember, rather have a simple than an overfitted model. The mutation rate can be derived for your focal group. E.g., for reptiles it is calculated to be 1.7e-8.
//per Block: data type, num loci, rec. rate and mut rate + optional parameters
FREQ 1 0 1.7e-8 OUTEXP
Estimation Files
Estimation Files are a necessary to specify model priors of the unspecified variables from the template file, which itself is intended to describe the evolutionary model. They are composed of a distribution information, a minimum, and a maximum value and must be named as $PREFIX.est.
The columns are:
Integer (1 = no, 0 = yes)
Variable Name (as defined in $PREFIX.tpl)
Distribution (uniform or loguniform)
Minimum Value
Maximum Value
"output" (if you provide this keyword, the estimated values are shown in your output files)
"bounded" (Meaning the model parameter cannot escape the prior spectrum)
"reference" (relevant for one parameter, as an anchor. I recommend to use the best researched information)
The file can look as following with the columns separated by simple spaces.
[PARAMETERS]
//#isInt? #name #dist. #min #max
//all N are in number of haploid individuals
1 N_JN$ logunif 9500 18000 output
1 N_EU$ logunif 9500 18000 output
1 N_TR$ unif 20 38 output bounded reference
1 N_SM$ logunif 450 650 output
1 T_DIV01$ logunif 9000 14000 output
1 T_DIV31$ logunif 4500 6500 output
0 M_HYBJ$ unif 0.2 0.8 output
0 M_HYB3E$ unif 0 0.8 output
0 M_HYB3J$ unif 0 0.5 output
All that is left to do, is to run the model. Of course, you can vary the parameters. However, I would be careful about playing around with them as they can easily stop fastsimcoal from working properly.
A command can look something like this:
../../fsc28_linux64/fsc28 -t Test_FSC.tpl -e Test_FSC.est -M -L 20 -n 10000 -d -I -x -y 3 -q > $output.$i.log