A Customizable Script for Hybrid Triangles from Genotype Likelihoods

Hybrid triangles are a reliable tool to infer the demographic history of hybrid populations. By incorporating the hybrid index (γ) and the interclass heterozygosity between the parents, this analysis distinguishes F1 from multi-generational hybrids and exposes backcrossing with either parent taxon. Established tools rely on VCF files for hybrid triangle generation, necessitating the hard-calling of SNPs. Here, I present a customizable script to perform the hybrid triangle analysis on genotype likelihood using a multi-step approach in ANGSD.

BIOINFORMATICS

Tim L. Heller

12/4/20255 min read

As I was recently confronted with a hybrid dataset of extremely low coverage (2-3X), I developed a customizable script to infer hybrid triangles from genotype likelihoods using ANGSD allele state inference and NGSadmix. In brief, I determined fixed (minor allele frequency > 0.95), divergent sites relative to an ancestral reference sequence for both parental taxa and performed individual SNP calling on this subset of sites for the hybrid swarm. To calculate the hybrid index (γ), I ran NGSadmix on the hybrid parents and the hybrids with K = 2 and extracted individual admixture proportions in each hybrid individual.

But why go through all that trouble? First, hybrid triangles might be the most straightforward way to establish, if a hybrid has backcrossed either to itself or with either one of the parents. This is particularly useful to determine, if a hybrid population is fertile (non-fertile hybrids would not reproduce past F1) and if there is still genetic exchange with the parents. In F1 hybrids, we expect equal contribution from both parents (γ = 0.5), as recombination between the genetic material of the parents had no time to act. Additionally, in diploid organisms, we assume all divergent fixed sites between the parents to be heterozygous in the hybrid's genome. This immediately exposes F1 hybrids in the triangle. Further backcrossing to itself might lead to deviations in γ and according to the Hardy-Weinberg equilibrium to an interclass heterozygosity of 0.5 (p = 0.5, q = 0.5; i.e., heterozygosity = 2pq). If a hybrid backcrosses to either parent, γ will be shifted towards that direction.

To start calculating hybrid triangles, we can first run NGSadmix on our hybrids together with both parents. It is advised to keep sample sizes of all three groups in a similar magnitude to avoid biases in ADMIXTURE analysis.
NGSadmix requires beagle-type likelihoods which can be generated from .bam-files with ANGSD like:
angsd -nThreads 6 -bam ./path/to/list/of/bam/files -out ./out/prefix -rf ./path/to/list/of/regions/chromosomes -uniqueOnly 1 -minMapQ 20-minQ 20 -GL 2 -doGlf 2 -doMajorMinor 1 -skipTriallelic 1 -doMaf 1 -minMaf 0.05 -SNP_pval 1e-6 -doCounts 1 -setMinDepthInd 1

Then, we execute NGSadmix.
./NGSadmix -likes /path/to/beagle/gz -K 2 -o ./out_prefix -P $threads

We will keep this output-file (.qopt) for later. However, you need to make sure to keep the order of individuals as provided in the bam-list (-bam) for the beagle likelihoods.

The next step is a bit more complex. We need to run multiple scripts to get this to work properly. We first repeat the complete SNP calling on both parents by setting an outgroup as a reference to determine the divergent sites between the hybrid parents. By setting an outgroup, we ensure that ANGSD does not skip the fixed (monomorphic) sites in either parent population. Of course, many of these sites have diverged from the outgroup in the most recent common ancestor of the parent species which is why we join this list afterwards. All that we need is a list of the sites that are different (diverged) between the parent.

conda activate angsd

## Infer diverged, fixed sites in both parents

files=("Europe" "JuandeNova")

for file in "${files[@]}"; do

echo "Processing file: ${file}_BAM.list"

angsd \

-nThreads 4 \

-bam "${file}_BAM.list" \

-out /dss/dsslegfs01/pr53da/pr53da-dss-0039/crypto/03.variant_calling/code/00.angsd/Fst/Heterozygosity/Tromelin/"${file}" \

-GL 2 -doMaf 1 -minMapQ 20 -doMajorMinor 5 -doCounts 1 -minQ 20 -remove_bads 1 -minMAF 0.95 -SNP_pval 1e-3 -minInd 10 \

-anc /dss/dsslegfs01/pr53da/pr53da-dss-0039/crypto/02.reads_mapped/code/nf-umap/reference/rCryege1.pri.20221122.fasta

done

cd Heterozygosity/Tromelin/

zcat Europe.mafs.gz > Europe.mafs

zcat JuandeNova.mafs.gz > JuandeNova.mafs

## Input .mafs files

EUROPE="Europe.mafs"

JUANDE="JuandeNova.mafs"

## Output file

OUTPUT="fixed_diff_sites.txt"

## Extract chromo, position, minor allele from each .mafs file

awk 'NR > 1 {print $1"\t"$2"\t"$4}' "$EUROPE" > europe.minor

awk 'NR > 1 {print $1"\t"$2"\t"$4}' "$JUANDE" > juande.minor

## Sort by chromo and position

sort -k1,1 -k2,2n europe.minor > europe.minor.sorted

sort -k1,1 -k2,2n juande.minor > juande.minor.sorted

## Join the two files on chromo and position (handle missing entries with NA)

join -t $'\t' -a1 -a2 -e "NA" -o 0,1.2,1.3,2.3 europe.minor juande.minor | awk '($3 != $4) && ($3 != "NA" && $4 != "NA" || $3 == "NA" || $4 == "NA") {print $1"\t"$2}' > "$OUTPUT"

## Clean up intermediate files (optional)

rm europe.minor juande.minor

echo "Done. Divergent fixed sites written to $OUTPUT"

angsd sites index fixed_diff_sites.txt

Subsequently, we employ this list to calculate the interclass heterozygosity in the hybrids with the following script:


conda activate angsd

angsd sites index fixed_diff_sites.txt

## Set paths

BAM_LIST="../../Tromelin_no6_BAM.list"

SITES_FILE="fixed_diff_sites.txt"

ANGSD_PATH="angsd" # If ANGSD is in your PATH, otherwise provide full path

INTERCLASS_OUTPUT="interclass_heterozygosity.txt"

rm $INTERCLASS_OUTPUT

for files in ./*.mafs.gz; do

name=$(basename $files .mafs.gz)

sites=$(zcat $files | wc -l)

Hets=$(zcat $files | awk '$5 > 0.45 && $5 < 0.55' | wc -l)

if [ "$sites" -ne 0 ]; then

ratio=$(echo "scale=5; $Hets / $sites" | bc)

else

ratio="NA"

fi

echo -e "$name\t$Hets\t$sites\t$ratio" >> $INTERCLASS_OUTPUT

done

Now, with the list of the samples for the NGSadmix run (check -bam list -> Popfile.txt), the admixture proportions in the .qopt file (hybrids.qopt) from NGSadmix, and the interclass_heterozygosity.txt, we can switch over from Linux to R.

library(ggplot2)
library(tidyr)
libary(dplyr)
setwd("~/Thesis/Hybrid Triangle")

interclass = read.table("Interclass_Heterozygosity.txt", sep = "\t")

admix = read.table("hybrids.qopt", sep = "\t")

Pops = read.table("Popfile.txt", sep="\t")

final = cbind(Pops,admix)

colnames(final) = c("Sample", "Pop", "V1","V2")

palette_expanded_3 <- c(

"#8B0000", # Dark Charcoal

"#B4C72B", # Dark Olive Green (Replaced Dark Salmon)

"#4682B4", # Steel Blue

"#2E4A25", # Dark Forest Green

"#E7A86B", # Terra Cotta

"#2F4F4F", # Dark Slate Gray

"#8B3E3E", # Mahogany

"#A76E1B", # Saddle Brown

"#B4C72B", # Olive Drab

"#CD5C5C", # Indian Red

"#8B0000", # Dark Red

"#708090", # Slate Gray

"#6A5ACD", # Slate Blue

"#C0C0C0" # Silver

)

Tromelin = final %>% filter(Pop=="TR") %>% cbind(interclass)

colnames(Tromelin)=c("Sample", "Pop", "V1", "V2", "Sam", "Het", "Sites")

Tromelin$ICH = Tromelin$Het / Tromelin$Sites

Rest = final %>% filter(!Pop=="TR") %>% mutate(ICH = 0)

Tromelin_filt = Tromelin[,c(1,2,3,4,8)]

Final = rbind(Rest, Tromelin_filt)

# Open triangle (without bottom edge)

triangle_df <- data.frame(

x = c(0.5, 1, 0),

y = c(1, 0, 0)

)

# Hardy-Weinberg Equilibrium (HWE) curve: Interclass heterozygosity (p12) as a function of hybrid index (p1)

p1_vals <- seq(0, 1, length.out = 100) # Hybrid index values from 0 to 1

hwe_df <- data.frame(

x = p1_vals,

y = 2 p1_vals (1 - p1_vals) # HWE formula for interclass heterozygosity (p12)

)

# Plot

ggplot() +

geom_point(data = Final, aes(x = V1, y = ICH, color = Pop), size = 2) +

geom_polygon(data = triangle_df, aes(x = x, y = y), fill = NA, color = "black", linewidth=1) + # Open triangle

geom_line(data = hwe_df, aes(x = x, y = y), linetype = "dashed", color = "black") + # HWE curve

geom_text(data = filter(Final, Pop == "C. sp. \"Tromelin\""), aes(x = V1, y = ICH, label = Sample), vjust = -0.5, color = "black") + # Label TR samples

coord_fixed(ratio = 1) +

labs(x = "Hybrid Index", y = "Interclass Heterozygosity") +

theme_classic() + scale_color_manual(values=palette_expanded_3)

Please let me know, if it has worked for you too. I am happy to help if problems arise!