SynFlow: A workflow for synteny and chromosomal rearrangement

Overview
Software requirements
Installation
1. Install pixi
2. Clone the repository
3. Install dependencies
Pixi usage
SLURM execution
Using snakemake.sh
SLURM profile (slurm/config.yaml)
Configuration
Workflow config file (config.yaml)
Workflow steps
Output files
Primary outputs
Intermediate files (tmp/)
Troubleshooting
Citation
License

Overview

A Snakemake workflow for synteny detection and chromosomal rearrangement analysis between two or more genome assemblies.

Software requirements

All dependencies are managed by pixi. No manual conda/pip setup is needed.

Tool	Version	Purpose
Snakemake	7.32.4	Workflow management
SyRI	≥1.7.1	Synteny & rearrangement detection
minimap2	≥2.30	Fast sequence alignment
MUMmer4	≥4.0.1	Nucmer genome alignment
gffread	≥0.12.7	GFF/GTF processing
DIAMOND	≥2.1.24	Protein sequence alignment
MCScanX	≥1.0.0	Collinearity detection
Biopython	≥1.83	Sequence utilities
---

Installation

1. Install pixi

curl -fsSL https://pixi.sh/install.sh | bash

Restart your shell or run source ~/.bashrc (or ~/.zshrc) to make pixi available.

2. Clone the repository

git clone https://gitlab.cirad.fr/agap/cluster/snakemake/synflow.git
cd synflow

3. Install dependencies

pixi install

This reads pixi.toml and installs all tools into an isolated environment under .pixi/.

Pixi usage

Pixi manages the environment and provides shortcut commands. You do not need to activate any environment manually.

# Install or update the environment
pixi install --manifest-path pixi.toml

# Run snakemake through the pixi environment
pixi run snakemake --configfile config.yaml --cores 8

# Open an interactive shell inside the pixi environment
pixi shell

# Check installed tool versions
pixi run snakemake --version
pixi run minimap2 --version

SLURM execution

Using `snakemake.sh`

The snakemake.sh script is designed for HPC submission via SLURM. It handles:

Automatic pixi installation if not present on the node
Environment setup (pixi install) before running
Snakemake dispatch with a set of predefined commands

# Submit to SLURM
sbatch snakemake.sh run

# Dry-run (check the workflow without executing)
sbatch snakemake.sh dry

# Generate DAG graph (requires graphviz dot)
sbatch snakemake.sh dag

# Unlock a locked working directory
sbatch snakemake.sh unlock

# Delete all outputs (with confirmation)
sbatch snakemake.sh clear

Available options:

| Option | Default | Description |
|--------|---------|-------------|
| -p PROFILE | slurm | Snakemake profile directory |
| -j NJOBS | 200 | Maximum number of SLURM jobs |
| -c NCORES | 700 | Maximum total cores |
| -r FILE | config.yaml | Workflow config file |
| -o FILE | images/dag.png | DAG output image |

Example:


sbatch snakemake.sh run -p slurm -j 50 -c 200 -r config.yaml

The script wraps snakemake so that all calls go through pixi run, ensuring the correct environment is used even on compute nodes that do not have snakemake in their PATH.

SLURM profile (`slurm/config.yaml`)

The SLURM profile controls job submission parameters:

cluster:
  mkdir -p logs/ &&
  sbatch
    --partition={resources.partition}
    --account={resources.account}
    --cpus-per-task={threads}
    --mem={resources.mem_mb}
    --time={resources.time}
    --job-name=smk-{rule}-{wildcards}
    --output=logs/{rule}-%j.out
    --error=logs/{rule}-%j.err

set-resources:
  - minimap2:mem_mb=32384
  - minimap2:time=2000
  - nucmer:mem_mb=32384
  - nucmer:time=2000
  - syri_minimap2:mem_mb=18096
  - syri_nucmer:mem_mb=18096

default-resources:
  - mem_mb=4096
  - partition=cpu-dedicated
  - account=dedicated-cpu@cirad
  - time=60

set-threads:
  - minimap2=8
  - nucmer=8
  - syri_minimap2=8
  - syri_nucmer=8
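
How these sections combine for a single rule can be sketched in Python. The values are copied from the profile above. `resolve_job` and `sbatch_options` are hypothetical helpers written for illustration; they are not part of Snakemake or SynFlow. The sketch relies only on the Snakemake behavior that set-resources and set-threads entries override the defaults for the named rule, and it assumes a rule with no set-threads entry runs with 1 thread.

```python
# Values copied from the slurm/config.yaml profile shown above.
DEFAULT_RESOURCES = {
    "mem_mb": 4096,
    "partition": "cpu-dedicated",
    "account": "dedicated-cpu@cirad",
    "time": 60,
}
SET_RESOURCES = {
    "minimap2": {"mem_mb": 32384, "time": 2000},
    "nucmer": {"mem_mb": 32384, "time": 2000},
    "syri_minimap2": {"mem_mb": 18096},
    "syri_nucmer": {"mem_mb": 18096},
}
SET_THREADS = {"minimap2": 8, "nucmer": 8, "syri_minimap2": 8, "syri_nucmer": 8}


def resolve_job(rule):
    """Return (resources, threads) for a rule: per-rule values win over defaults."""
    resources = {**DEFAULT_RESOURCES, **SET_RESOURCES.get(rule, {})}
    threads = SET_THREADS.get(rule, 1)  # assumption: 1 thread when not listed
    return resources, threads


def sbatch_options(rule):
    """Fill the resource placeholders of the cluster command for one rule."""
    resources, threads = resolve_job(rule)
    return [
        f"--partition={resources['partition']}",
        f"--account={resources['account']}",
        f"--cpus-per-task={threads}",
        f"--mem={resources['mem_mb']}",
        f"--time={resources['time']}",
    ]


if __name__ == "__main__":
    print(" ".join(sbatch_options("nucmer")))
```

A rule that is not listed under set-resources, such as gff2bed, falls back entirely to default-resources.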

Configuration

Workflow config file (`config.yaml`)

Create a config.yaml file (or copy from test/) with the following structure:

# --- Input genomes (required) ---
input_genomes:
  genome1: "path/to/genome1.fasta"
  genome2: "path/to/genome2.fasta"
  # Add more genomes as needed

# --- Alignment method ---
method: "nucmer"  # "nucmer" (default) or "minimap2"

# --- Nucmer alignment parameters ---
min_length_match: 100  # Minimum match length (-l)
min_length_cluster: 500  # Minimum cluster length (-c)
min_distance_extension: 500  # Minimum distance for extension (-b)
min_alignment_identity: 90  # Minimum identity % for delta-filter (-i)
min_alignment_length: 5000  # Minimum alignment length for delta-filter (-l)

# --- Minimap2 parameters (used only if method: minimap2) ---
preset: "asm5"  # Minimap2 preset (e.g. asm5, asm10, asm20)
bandwidth: 500  # Bandwidth for chaining (-r)
secondary: "no"  # Report secondary alignments (yes/no)

# --- Optional: GFF annotations (required for MCScanX) ---
# Provide GFF files for all genomes when chromosome counts differ
input_gff:
  genome1: "path/to/genome1.gff3"
  genome2: "path/to/genome2.gff3"

Notes:

- At least 2 genomes are required. All pairwise combinations are processed.
- input_gff is optional in general, but it becomes required when any two genomes have different chromosome counts, because such pairs are routed to MCScanX.
- method applies only to the SyRI pipeline (pairs with the same chromosome count).
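
The pairing and routing rules in these notes can be sketched in Python. This is a minimal illustration, not the workflow's actual code: `plan_pairs` and its arguments are hypothetical names, and the sketch assumes unordered pairs, since the notes do not say whether each pair is also run in the reverse direction.

```python
from itertools import combinations


def plan_pairs(chr_counts, gff_genomes):
    """Assign each genome pair to the SyRI or MCScanX path.

    chr_counts: dict mapping genome name -> number of sequences.
    gff_genomes: set of genome names listed under input_gff.
    """
    if len(chr_counts) < 2:
        raise ValueError("at least 2 genomes are required")
    plan = {}
    for ref, qry in combinations(sorted(chr_counts), 2):
        if chr_counts[ref] == chr_counts[qry]:
            # Same chromosome count: whole-genome alignment, then SyRI.
            plan[(ref, qry)] = "syri"
        elif ref in gff_genomes and qry in gff_genomes:
            # Different counts: gene-based collinearity with MCScanX.
            plan[(ref, qry)] = "mcscanx"
        else:
            raise ValueError(
                f"{ref} vs {qry}: chromosome counts differ, so input_gff "
                "must list both genomes"
            )
    return plan


if __name__ == "__main__":
    counts = {"genome1": 11, "genome2": 11, "genome3": 9}
    for pair, path in plan_pairs(counts, {"genome1", "genome2", "genome3"}).items():
        print(pair, path)
```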

Workflow steps

The workflow is organized as follows:

1. `preprocess_fasta` (checkpoint)

Normalizes sequence IDs in each FASTA file (replaces special characters)
Produces a JSON mapping of original → normalized IDs
Counts the number of sequences (chromosomes/scaffolds) per genome
Output: tmp/{genome}_processed.fasta, tmp/{genome}_id_mapping.json, tmp/{genome}_chr_count.txt
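
The three tasks of this step can be sketched in Python as follows. This is an illustration only: the function name and the replacement rule (any character outside letters, digits, and underscore becomes an underscore) are assumptions, and the actual rule may normalize IDs differently.

```python
import json
import re


def normalize_fasta(lines):
    """Return (normalized_lines, id_mapping, sequence_count).

    id_mapping maps each original sequence ID to its normalized form.
    """
    mapping = {}
    out = []
    for line in lines:
        if line.startswith(">"):
            # The sequence ID is the first whitespace-delimited token of the header.
            original = line[1:].split()[0]
            # Assumed rule: replace anything that is not a letter, digit, or underscore.
            normalized = re.sub(r"[^A-Za-z0-9_]", "_", original)
            mapping[original] = normalized
            out.append(">" + normalized)
        else:
            out.append(line)
    return out, mapping, len(mapping)


if __name__ == "__main__":
    fasta = [">chr01|v2 assembled", "ACGT", ">chr-02", "GGCC"]
    processed, id_mapping, count = normalize_fasta(fasta)
    print("\n".join(processed))
    print(json.dumps(id_mapping, indent=2))
    print(count)
```

The sketch does not guard against two original IDs collapsing to the same normalized ID; a production version would need to detect that case.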

2. `gff2bed` (optional)

Converts GFF3 annotation files to BED format
Applies the ID mapping from step 1 to normalize sequence names
Output: {genome}.bed, tmp/{genome}.gff3
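
A minimal Python sketch of this conversion is shown below. It assumes that only `gene` features are kept and that the output uses a six-column BED layout; both are assumptions for illustration, and `gff_genes_to_bed` is a hypothetical name. The coordinate shift reflects the formats themselves: GFF3 is 1-based and inclusive, BED is 0-based and half-open.

```python
def gff_genes_to_bed(gff_lines, id_mapping):
    """Convert GFF3 gene features to BED tuples with normalized sequence names.

    id_mapping maps original sequence IDs to normalized IDs (from step 1).
    """
    bed = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives, and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue  # assumption: only gene features are exported
        # Parse the attribute column (key=value pairs separated by ';').
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        seqid = id_mapping.get(cols[0], cols[0])
        start = int(cols[3]) - 1  # GFF3 1-based start -> BED 0-based start
        end = int(cols[4])        # inclusive end equals half-open end
        bed.append((seqid, start, end, attrs.get("ID", "."), ".", cols[6]))
    return bed


if __name__ == "__main__":
    gff = [
        "##gff-version 3",
        "chr-01\tsrc\tgene\t100\t200\t.\t+\t.\tID=g1;Name=foo",
        "chr-01\tsrc\tmRNA\t100\t200\t.\t+\t.\tID=g1.1;Parent=g1",
    ]
    for row in gff_genes_to_bed(gff, {"chr-01": "chr_01"}):
        print("\t".join(map(str, row)))
```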

3. `gff2fasta` (optional)

Extracts protein sequences from the GFF3 annotations using gffread
Output: tmp/{genome}.prot

4a. `nucmer` (SyRI path)

Aligns two genomes with nucmer (MUMmer4)
Filters the delta file with delta-filter (identity and length thresholds)
Generates a .coords file for SyRI
Output: tmp/mummer/{ref}_{qry}.filtered.delta, tmp/mummer/{ref}_{qry}.coords

4b. `minimap2` (SyRI path, alternative)

Aligns two genomes with minimap2 in SAM format
Output: tmp/minimap2/{ref}_{qry}.sam

5. `syri` (SyRI path)

Runs SyRI on the alignment output (nucmer or minimap2)
Detects syntenic regions, inversions, translocations, duplications
Restores original sequence IDs via restore_ids.py
Output: tmp/syri/{ref}_{qry}.out
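
Restoring the original IDs amounts to inverting the mapping from step 1 and applying it to the two chromosome columns. The sketch below is not the actual restore_ids.py; `restore_ids` here is a hypothetical function that assumes rows are already split into fields, with the reference chromosome in column 1 and the query chromosome in column 6.

```python
def restore_ids(rows, ref_mapping, qry_mapping):
    """Replace normalized chromosome names with the original ones.

    ref_mapping / qry_mapping map original ID -> normalized ID, as in the
    JSON files produced by preprocess_fasta.
    """
    # Invert the mappings: normalized ID -> original ID.
    ref_back = {norm: orig for orig, norm in ref_mapping.items()}
    qry_back = {norm: orig for orig, norm in qry_mapping.items()}
    restored = []
    for row in rows:
        row = list(row)
        row[0] = ref_back.get(row[0], row[0])  # column 1: reference chromosome
        row[5] = qry_back.get(row[5], row[5])  # column 6: query chromosome
        restored.append(row)
    return restored


if __name__ == "__main__":
    rows = [["chr_01", "10", "500", "-", "-", "Chr1_v2", "20", "510",
             "SYN1", "-", "SYN", "-"]]
    print(restore_ids(rows, {"chr-01": "chr_01"}, {"Chr1.v2": "Chr1_v2"}))
```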

6. `diamond_prepdb` + `diamond_blastp` (MCScanX path)

Builds a DIAMOND protein database from the reference proteome
Runs bidirectional BLASTP between both proteomes
Output: tmp/{genome}.prot.dmnd, tmp/diamond/{ref}_{qry}.out

7. `bbmh4mcsanx` (MCScanX path)

Identifies best bidirectional hits (BBH) from the DIAMOND results
Produces the .homology file and a combined .gff for MCScanX
Output: tmp/{ref}_{qry}/{ref}_{qry}.homology, tmp/{ref}_{qry}/{ref}_{qry}.gff
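
The best-bidirectional-hit idea can be sketched in Python. The sketch assumes DIAMOND's default 12-column tabular output, where columns 1 and 2 are the query and subject IDs and column 12 is the bit score, and it ranks hits by bit score; the actual rule may use a different ranking or tie-breaking. Function names are hypothetical.

```python
def best_hits(rows):
    """Map each query ID to its highest-bitscore subject ID."""
    best = {}
    for row in rows:
        query, subject, bitscore = row[0], row[1], float(row[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(ref_vs_qry, qry_vs_ref):
    """Return sorted (ref_gene, qry_gene) pairs that are each other's best hit."""
    forward = best_hits(ref_vs_qry)
    reverse = best_hits(qry_vs_ref)
    return sorted(
        (ref_gene, qry_gene)
        for ref_gene, qry_gene in forward.items()
        if reverse.get(qry_gene) == ref_gene
    )


if __name__ == "__main__":
    def hit(query, subject, bits):
        # Build one tabular row; only columns 1, 2, and 12 matter here.
        return [query, subject, "90.0", "100", "0", "0",
                "1", "100", "1", "100", "1e-50", str(bits)]

    fwd = [hit("r1", "q1", 200), hit("r1", "q2", 150), hit("r2", "q2", 180)]
    rev = [hit("q1", "r1", 200), hit("q2", "r1", 190), hit("q2", "r2", 180)]
    print(bidirectional_best_hits(fwd, rev))
```

In this toy input, r2's best hit is q2, but q2's best hit is r1, so only the r1/q1 pair is reciprocal.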

8. `mcscanx` (MCScanX path)

Runs MCScanX_h to detect collinear gene blocks
Falls back to a looser parameter set if the output is below a size threshold
Output: tmp/{ref}_{qry}/{ref}_{qry}.collinearity

9. `collinearity2bedpe` (MCScanX path)

Converts the MCScanX collinearity file to BEDPE format (SyRI-compatible)
Output: tmp/mcscanx/{ref}_{qry}.out

10. `collinearity2anchors` (optional, MCScanX path)

Extracts syntenic anchor gene pairs from the collinearity file
Output: {ref}_{qry}.anchors
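
A Python sketch of the extraction is given below. It assumes the usual MCScanX collinearity layout, in which each block starts with a `## Alignment` header and each following line holds tab-separated fields with the two gene IDs in the second and third positions. The function name is hypothetical, and how the pairs are then written to the .anchors file is not shown because the source does not specify it.

```python
def collinearity_to_anchors(lines):
    """Group gene pairs from an MCScanX collinearity file by alignment block."""
    blocks = []
    for line in lines:
        if line.startswith("## Alignment"):
            blocks.append([])  # a new collinear block begins
        elif line.startswith("#") or not line.strip():
            continue  # parameter header, statistics, or blank line
        elif blocks:
            fields = line.rstrip("\n").split("\t")
            # fields[0] is the "block-index:" label; genes follow.
            blocks[-1].append((fields[1].strip(), fields[2].strip()))
    return blocks


if __name__ == "__main__":
    example = [
        "############### Parameters ###############",
        "# MATCH_SCORE: 50",
        "## Alignment 0: score=250.0 e_value=1e-10 N=2 chr01&chr01 plus",
        "  0-  0:\tgeneA1\tgeneB1\t  1e-50",
        "  0-  1:\tgeneA2\tgeneB2\t      0",
    ]
    for block in collinearity_to_anchors(example):
        for ref_gene, qry_gene in block:
            print(f"{ref_gene}\t{qry_gene}")
```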

11. `finalize_bedpe`

Copies the appropriate result (SyRI or MCScanX) to the final output
Output: {ref}_{qry}.out

Output files

Tip: All primary output files (.out, .bed, .anchors) can be directly loaded into the SynFlow web interface via the Upload Files section for interactive visualization.

Primary outputs

`{reference}_{query}.out` — Main synteny result (BEDPE format)

SyRI-style BEDPE output describing syntenic blocks and structural rearrangements between each genome pair.

Columns 1–3: coordinates on the reference genome
Columns 6–8: coordinates on the query genome
Columns 9–12: structural annotation (SYN, INV, TRANS, DUP, etc.)

For the full format specification, see: https://schneebergerlab.github.io/syri/fileformat.html

Example:

chr01 33618167 33780988 - - chr01 39357103 39440924 SYN2 - SYN BLOCK
chr01 33618167 33647271 - - chr01 39357103 39380755 Eg01_t020990 Macma4_01_g26690.1 SYN2 -
chr01 33725928 33733383 - - chr01 39391015 39394907 Eg01_t021030 Macma4_01_g26700.1 SYN2 -
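
Reading such a file in Python could look like the sketch below. `OutRecord` and `parse_out_line` are hypothetical names, and the sketch assumes 12 whitespace-separated columns per line as in the example above. A `-` in a coordinate column is stored as `None`, on the assumption that some rows may lack a coordinate.

```python
from typing import NamedTuple, Optional, Tuple


class OutRecord(NamedTuple):
    ref_chr: str
    ref_start: Optional[int]
    ref_end: Optional[int]
    qry_chr: str
    qry_start: Optional[int]
    qry_end: Optional[int]
    annotation: Tuple[str, ...]  # columns 9-12, kept as raw strings


def _pos(value):
    """Convert a coordinate field to int, treating '-' as missing."""
    return None if value == "-" else int(value)


def parse_out_line(line):
    """Parse one line of a {reference}_{query}.out file."""
    cols = line.split()
    if len(cols) != 12:
        raise ValueError(f"expected 12 columns, found {len(cols)}")
    return OutRecord(
        cols[0], _pos(cols[1]), _pos(cols[2]),   # columns 1-3: reference
        cols[5], _pos(cols[6]), _pos(cols[7]),   # columns 6-8: query
        tuple(cols[8:12]),                       # columns 9-12: annotation
    )


if __name__ == "__main__":
    line = "chr01 33618167 33780988 - - chr01 39357103 39440924 SYN2 - SYN BLOCK"
    record = parse_out_line(line)
    print(record.ref_chr, record.ref_end - record.ref_start, record.annotation)
```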

`{genome}.bed` — Gene coordinates (if GFF provided)

BED file with gene positions for each genome, with normalized sequence IDs.

`{reference}_{query}.anchors` — Syntenic anchor pairs (if GFF provided, MCScanX path)

Tab-separated file listing syntenic gene pairs detected by MCScanX.

Intermediate files (`tmp/`)

File	Description
`tmp/{genome}_processed.fasta`	Normalized FASTA
`tmp/{genome}_id_mapping.json`	ID mapping (original → normalized)
`tmp/{genome}_chr_count.txt`	Number of sequences
`tmp/{genome}.gff3`	Filtered/normalized GFF3
`tmp/{genome}.prot`	Extracted protein sequences
`tmp/mummer/{ref}_{qry}.coords`	Nucmer alignment coordinates
`tmp/mummer/{ref}_{qry}.filtered.delta`	Filtered nucmer delta
`tmp/minimap2/{ref}_{qry}.sam`	Minimap2 SAM alignment
`tmp/syri/{ref}_{qry}.out`	Raw SyRI output
`tmp/diamond/{ref}_{qry}.out`	DIAMOND BLASTP results
`tmp/{ref}_{qry}/{ref}_{qry}.collinearity`	MCScanX collinearity
`tmp/mcscanx/{ref}_{qry}.out`	MCScanX BEDPE output
---

Troubleshooting

Memory errors: Increase mem_mb for alignment rules in slurm/config.yaml
Missing GFF files: Required for MCScanX when genomes have different chromosome counts
Locked directory: Run sbatch snakemake.sh unlock or pixi run snakemake --unlock
Permission errors: Ensure write access to the working directory and logs/

Debugging

# Dry-run to check the workflow plan
pixi run snakemake --configfile config.yaml --cores 1 -n

# Run with verbose shell commands
pixi run snakemake --configfile config.yaml --cores 8 --printshellcmds

# Generate workflow DAG
pixi run snakemake --configfile config.yaml --dag | dot -Tpdf > workflow.pdf

Citation

If you use SynFlow, please cite the relevant tools:

- SyRI: Goel M. et al. Genome Biol 20, 277 (2019). doi:10.1186/s13059-019-1911-0
- MCScanX: Wang Y. et al. Nucleic Acids Res. 40(7):e49 (2012). doi:10.1093/nar/gkr1293
- DIAMOND: Buchfink B. et al. Nature Methods 18, 366–368 (2021). doi:10.1038/s41592-021-01101-x
- minimap2: Li H. Bioinformatics 34(18):3094–3100 (2018). doi:10.1093/bioinformatics/bty191
- MUMmer4: Marçais G. et al. PLoS Comput Biol 14(1):e1005944 (2018).

License

This workflow is distributed under the GNU General Public License v3.0.
