All the scripts to reproduce the results in the Splam paper are found in `benchmark/src_generalization_test/`. In the paths below, `{#}_output/` denotes a numbered output directory and `{name}` refers to the name of the gene database.

## Positive Dataset
The positive data is extracted from the complete genomic GFF annotation files of each species.
To start, retrieve all introns from the GFF file (a sketch of this step follows the output list below):
$ python 1_get_pos_set.py
Inputs:
- `{name}.gff` = annotations corresponding to a genome
- `{name}_genomic.fa` = the genome's FASTA file
- `{name}_annotation_report.txt` = annotation report
These were downloaded together from the NCBI Genome Database.

Outputs:
- `databases/{name}.db` = sqlite3-style databases parsed from the `.gff` annotation files
- `{name}_introns.bed` = extracted introns
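For orientation, here is a minimal sketch of how the intron extraction could look with `gffutils`; the actual `1_get_pos_set.py` may differ in details such as feature filtering and naming, and the database name used below is only an example.

```python
# Minimal sketch (not the exact 1_get_pos_set.py): build a gffutils database
# from the GFF annotation and write the inferred introns to a BED file.
import gffutils

name = "GRCm39"  # hypothetical database name; substitute your own {name}

# Parse the GFF into an sqlite3-style database (the databases/{name}.db step).
db = gffutils.create_db(
    f"{name}.gff",
    dbfn=f"databases/{name}.db",
    force=True,
    keep_order=True,
    merge_strategy="merge",
)

with open(f"{name}_introns.bed", "w") as out:
    # Infer introns from the gaps between consecutive exons of each transcript.
    for intron in db.create_introns():
        # BED is 0-based, half-open; GFF coordinates are 1-based, inclusive.
        out.write(f"{intron.seqid}\t{intron.start - 1}\t{intron.end}\t.\t0\t{intron.strand}\n")
```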
Extract all the sequences from the introns, tailored to Splam's input specifications. This step also performs checks and filters to ensure high-quality positive data (the donor/acceptor windowing is sketched after the output list below).

$ python 2_extract_pos_splam.py
Inputs:
- `{name}_introns.bed`
- `{name}_genomic.fa`
- `{name}_annotation_report.txt`
Outputs:
- `donor.bed`, `acceptor.bed` = BED files referring to the specific 400nt donor and acceptor sequences
- `donor_seq.fa`, `acceptor_seq.fa` = FASTA files containing the 400nt sequences
- `d_a.bed` = BED file referring to the intron coordinates of the splice junction (midpoint of the donor and acceptor coordinates)
- `input_neg_random.fa` = FASTA file containing the 800nt sequence that is given to Splam
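As a rough illustration of the windowing above (not the exact code in `2_extract_pos_splam.py`), the 800nt Splam input can be thought of as a 400nt window centred on the donor site concatenated with a 400nt window centred on the acceptor site. The flank size, coordinate handling, and strand handling below are assumptions, and quality filters are omitted.

```python
# Hypothetical sketch of assembling an 800nt Splam input from one intron.
# Assumes 200nt of flank on each side of the donor and acceptor sites (400nt windows);
# strand handling and the checks in the real 2_extract_pos_splam.py are omitted.
from pyfaidx import Fasta

FLANK = 200  # half-window around each splice site (assumption)

genome = Fasta("GRCm39_genomic.fa")  # hypothetical {name}_genomic.fa

def splam_input(chrom, intron_start, intron_end):
    """Return the 800nt donor+acceptor sequence for a 0-based, half-open intron."""
    donor = genome[chrom][intron_start - FLANK:intron_start + FLANK].seq
    acceptor = genome[chrom][intron_end - FLANK:intron_end + FLANK].seq
    return (donor + acceptor).upper()

print(splam_input("chr1", 100_200, 105_750))
```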
Extract the same sequences, but with the specifications for SpliceAI (the flanking-sequence construction is sketched after the output list below).

$ python 3_extract_pos_spliceai.py
Inputs:
- `d_a.bed`
- `{name}_genomic.fa`
- `{name}_annotation_report.txt`
Outputs:
- `coords.bed` = BED file referring to the start and end positions of the whole SpliceAI input
- `seq_noN.fa` = FASTA file containing the SpliceAI input, with the full flanking sequence (coordinates refer to the splice junction)
- `seq_N.fa` = FASTA file containing the SpliceAI input, with repeating N flanking sequence (coordinates refer to the splice junction)
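To illustrate the difference between the two FASTA outputs above: SpliceAI expects long flanking context around the splice junction, which can be filled either with real genomic sequence (`seq_noN.fa`) or with repeated `N` characters (`seq_N.fa`). The sketch below is hypothetical, not the exact `3_extract_pos_spliceai.py` logic, and the flank size is an assumption.

```python
# Hypothetical sketch contrasting the seq_noN and seq_N inputs for a single junction.
# FLANK is an assumed context size; the real scripts may use a different value.
from pyfaidx import Fasta

FLANK = 5_000  # assumed flanking context per side

genome = Fasta("GRCm39_genomic.fa")  # hypothetical {name}_genomic.fa

def spliceai_inputs(chrom, junc_start, junc_end):
    """Return (seq_noN, seq_N) for a 0-based, half-open splice junction."""
    core = genome[chrom][junc_start:junc_end].seq.upper()
    # seq_noN: real genomic flanking sequence on both sides of the junction.
    with_flank = genome[chrom][junc_start - FLANK:junc_end + FLANK].seq.upper()
    # seq_N: the same core sequence, but the flanks are replaced by repeating N's.
    with_n = "N" * FLANK + core + "N" * FLANK
    return with_flank, with_n

seq_no_n, seq_n = spliceai_inputs("chr1", 100_200, 105_750)
```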
## Negative Dataset

You will need to randomly generate the negative splice-junction dataset. This works by taking the existing protein-coding genes, selecting the opposite strand to guarantee unique sequences, and creating pseudo-splice-junctions from GT-AG pairs found on that strand, as sketched below.
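A minimal sketch of the pseudo-junction idea follows. It assumes the gene-body sequence has already been reverse-complemented to give the opposite strand, and it applies only a simple length constraint; the real scripts add further filtering and sampling.

```python
# Hypothetical sketch: pair GT donors with downstream AG acceptors on the opposite
# strand of a protein-coding gene to form pseudo-splice-junctions.
import random
import re

def pseudo_junctions(opposite_strand_seq, n=5, min_len=200, seed=0):
    """Return up to n random (donor_pos, acceptor_end) GT..AG pairs (0-based)."""
    rng = random.Random(seed)
    seq = opposite_strand_seq.upper()
    donors = [m.start() for m in re.finditer("GT", seq)]
    acceptors = [m.start() for m in re.finditer("AG", seq)]
    pairs = []
    for d in rng.sample(donors, min(len(donors), 50)):
        # Pick an AG far enough downstream that the pseudo-intron has a realistic length.
        downstream = [a for a in acceptors if a - d >= min_len]
        if downstream:
            pairs.append((d, rng.choice(downstream) + 2))  # +2 = end of the AG dinucleotide
    return pairs[:n]

print(pseudo_junctions("CT" * 50 + "GT" + "A" * 300 + "AG" + "CT" * 50))
```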
To start, retrieve the protein-coding genes from all four genomes, making use of the `gffutils` library (a sketch of the gene filtering follows the output list below):
$ python 1_get_neg_set.py
Inputs:
- `{name}.gff` = annotations corresponding to a genome
- `{name}_genomic.fa` = the genome's FASTA file
- `{name}_annotation_report.txt` = annotation report
These were downloaded together from the NCBI Genome Database.

Outputs:
- `databases/{name}.db` = sqlite3-style databases parsed from the `.gff` annotation files
- `{name}_genes.bed` = protein-coding genes
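A minimal sketch of pulling protein-coding genes out of the `gffutils` database follows; the exact filtering in `1_get_neg_set.py` may differ, and the `gene_biotype` attribute key is an assumption based on common NCBI GFF conventions.

```python
# Hypothetical sketch (not the exact 1_get_neg_set.py): write protein-coding genes
# from an existing gffutils database to a BED file.
import gffutils

name = "GRCm39"  # hypothetical {name}
db = gffutils.FeatureDB(f"databases/{name}.db")

with open(f"{name}_genes.bed", "w") as out:
    for gene in db.features_of_type("gene"):
        # NCBI GFF files usually tag coding genes with gene_biotype=protein_coding
        # (assumption; adjust to your annotation's attribute key).
        if gene.attributes.get("gene_biotype", [""])[0] == "protein_coding":
            out.write(f"{gene.seqid}\t{gene.start - 1}\t{gene.end}\t{gene.id}\t0\t{gene.strand}\n")
```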
Then, generate the dataset of splice junctions and process it into a format readable by Splam.

$ python 2_extract_neg_splam.py
Inputs:
- `{name}_genes.bed`
- `{name}_genomic.fa`
- `{name}_annotation_report.txt`
Outputs:
- `donor.bed`, `acceptor.bed` = BED files referring to the specific 400nt donor and acceptor sequences
- `donor_seq.fa`, `acceptor_seq.fa` = FASTA files containing the 400nt sequences
- `d_a.bed` = BED file referring to the intron coordinates of the splice junction (midpoint of the donor and acceptor coordinates)
- `input_neg_random.fa` = FASTA file containing the 800nt sequence that is given to Splam
Now process the same splice junctions into a format readable by SpliceAI (it takes a subset of the junctions for efficiency).

$ python 3_extract_neg_spliceai.py
Inputs:
- `d_a.bed`
- `{name}_genomic.fa`
- `{name}_annotation_report.txt`
Outputs:
- `coords.bed` = BED file referring to the start and end positions of the whole SpliceAI input
- `seq_noN.fa` = FASTA file containing the SpliceAI input, with the full flanking sequence (coordinates refer to the splice junction)
- `seq_N.fa` = FASTA file containing the SpliceAI input, with repeating N flanking sequence (coordinates refer to the splice junction)
Run the following three steps in both folders of the pipeline. They are essentially the same for both the positive and negative datasets.

Run Splam:
$ ./4_splam_runner.sh
Inputs:
- `input_neg_random.fa`

Outputs:
- `score.bed` = BED file containing the Splam-scored splice junctions
Run SpliceAI. Depending on your system, this may take several days, so you can run each dataset separately:

$ ./5_spliceai_prediction_wrapper.sh {name}
Inputs:
- `seq_noN.fa`
- `seq_N.fa`

Outputs: there are 5 model output folders, each containing 4 folders named after the databases.
- `spliceai_all_seq.name.noN.{name}.tsv`, `spliceai_all_seq.name.N.{name}.tsv` = names and identifiers for the scored splice junctions
- `spliceai_all_seq.score.a.noN.{name}.tsv`, `spliceai_all_seq.score.a.N.{name}.tsv` = acceptor-site scores for every nt in the sequence
- `spliceai_all_seq.score.d.noN.{name}.tsv`, `spliceai_all_seq.score.d.N.{name}.tsv` = donor-site scores for every nt in the sequence
- `spliceai_all_seq.score.n.noN.{name}.tsv`, `spliceai_all_seq.score.n.N.{name}.tsv` = neutral (neither) scores for every nt in the sequence
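The per-nucleotide score files can be summarised in different ways; below is a hypothetical sketch that reduces each sequence to its maximum donor and acceptor scores, assuming one tab-separated row of scores per sequence, in the same order as the corresponding `.name.*` file. The actual layout written by `5_spliceai_prediction_wrapper.sh` and consumed by `6_compile_data.py` may differ.

```python
# Hypothetical sketch: collapse per-nucleotide SpliceAI scores to one summary score
# per sequence. Assumes each row of the .score.* TSVs is a tab-separated list of
# floats for one input sequence; the real layout may differ.
import csv

def max_scores(score_tsv):
    """Yield the maximum score of each row in a per-nucleotide score TSV."""
    with open(score_tsv) as fh:
        for row in csv.reader(fh, delimiter="\t"):
            yield max(float(x) for x in row if x)

name = "GRCm39"  # hypothetical {name}
donor_max = list(max_scores(f"spliceai_all_seq.score.d.noN.{name}.tsv"))
acceptor_max = list(max_scores(f"spliceai_all_seq.score.a.noN.{name}.tsv"))
```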
Post-process the Splam and SpliceAI scores into a single file for comparison:

$ python 6_compile_data.py
Inputs:
- `seq_noN.fa`, `seq_N.fa`
- `spliceai_all_seq.name.noN.{name}.tsv`, `spliceai_all_seq.name.N.{name}.tsv`
- `spliceai_all_seq.score.a.noN.{name}.tsv`, `spliceai_all_seq.score.a.N.{name}.tsv`
- `spliceai_all_seq.score.d.noN.{name}.tsv`, `spliceai_all_seq.score.d.N.{name}.tsv`
- `score.bed`
Outputs:
- `Splam/{name}.splam_data.csv` = the compiled Splam scores with various identifiers and metrics
- `SpliceAI/spliceai_data.v{#}.noN.{name}.csv`, `SpliceAI/spliceai_data.v{#}.N.{name}.csv` = the compiled SpliceAI scores with various identifiers and metrics
- `combine/aggregate_data.noN.{name}.csv`, `combine/aggregate_data.N.{name}.csv` = the complete data compiled from both Splam and SpliceAI, with all data included (no averaging)
- `combine/averaged_data.noN.{name}.csv`, `combine/averaged_data.N.{name}.csv` = scores averaged across the 5 SpliceAI models; the most convenient form to work with
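For downstream analysis, the averaged CSV is the easiest file to load. Here is a brief, hypothetical example using pandas; inspect the file's header for the real column names.

```python
# Hypothetical example of loading the averaged comparison table with pandas.
import pandas as pd

name = "GRCm39"  # hypothetical {name}
df = pd.read_csv(f"combine/averaged_data.noN.{name}.csv")

print(df.columns.tolist())  # check the actual columns first
print(df.describe())        # quick summary of the score distributions
```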
Finally, navigate to the `figures/` folder. There is a subdirectory for every figure we generated; simply navigate to the figure you want and run the `plot.py` script.
For `f1_scores`, run `stats.py`, which will generate a `results.csv` with various performance metrics at a specified threshold (which you can edit in the code). A hypothetical sketch of such a threshold-based evaluation follows.
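This sketch illustrates the kind of computation `stats.py` performs under assumptions: the column names (`label`, `splam_score`) and the 0.5 threshold are hypothetical, not the script's actual defaults.

```python
# Hypothetical sketch of threshold-based metrics like those produced by stats.py.
# Column names and the threshold are assumptions; adjust to the real CSV.
import pandas as pd
from sklearn.metrics import precision_score, recall_score, f1_score

THRESHOLD = 0.5  # assumed score cutoff; stats.py lets you edit this in the code

name = "GRCm39"  # hypothetical {name}
df = pd.read_csv(f"combine/averaged_data.noN.{name}.csv")

y_true = df["label"]                                   # 1 = positive junction, 0 = negative
y_pred = (df["splam_score"] >= THRESHOLD).astype(int)  # thresholded Splam predictions

results = pd.DataFrame([{
    "threshold": THRESHOLD,
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
}])
results.to_csv("results.csv", index=False)
```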