Parameters & usage

Predicting HGT is computationally expensive. HGTstart provides built-in computing resources for analyzing a small number of sequences, but for analyses at the scale of multiple genomes we recommend running the pipeline locally. You can download the latest version of the Routinetree pipeline (for macOS or Linux) and the background database REFAL (which may exceed 60 GB) from the HGTstart platform at https://hgtstart.cn/download/.

To manage the complexity of the calculation, the Routinetree pipeline is divided into seven steps, numbered 0 to 6. Overall, the workflow involves several key stages: splitting the FASTA file into individual sequence files, identifying homologous sequences, performing multiple sequence alignment, constructing phylogenetic trees, locating nested positions, validating horizontal gene transfers (HGTs) with the AI, hU, and hBL metrics, and annotating the results. The function of each step is described below. We recommend running the steps in two parts, 01234 and 56, because the fourth step takes a very long time.
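The two-part run recommended above can be sketched as follows. This is only an illustration: `build_command` is a hypothetical helper, the values `example`, `<NCBITaxId>` and `database` are the same placeholders used in the commands on this page, and the option names are taken from the parameter list below.

```python
import shlex

def build_command(fasta, taxid, database, steps, extra=()):
    """Return a routinetree.pl invocation as an argument list (hypothetical helper)."""
    return [
        "perl", "routinetree.pl",
        "--file", fasta,
        "--taxid", str(taxid),
        "--database", database,
        "--step", steps,
        *extra,
    ]

# Placeholder inputs; replace them with your own file, taxonomy ID and database path.
part_one = build_command("example", "<NCBITaxId>", "database", "01234", ["--threads", "3"])
part_two = build_command("example", "<NCBITaxId>", "database", "56")

# Print the two command lines instead of running them (a dry run).
for command in (part_one, part_two):
    print(shlex.join(command))
```

To actually launch the pipeline, each list could be passed to `subprocess.run(command, check=True)`, so that the second part starts only if the first part finished without error.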

step0 Split the seed FASTA file into files containing one sequence each and unify the sequence numbering format.
step1 Search for homologs against the background database REFAL.
step2 Perform multiple sequence alignment.
step3 Build the gene tree.
step4 Screen for nested positions in the gene tree, assess HGT, and give a predicted annotation.
step5 Filter HGTs according to their flanking genes and collate the results.
step6 Compare the gene structure and codon usage bias between HGT and CORE genes.
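As an illustration of what step0 does, the sketch below splits FASTA text into one record per sequence and assigns uniform IDs. It is not the pipeline's own code: the function name `split_fasta`, the `seq000001`-style numbering and the `.fasta` file names are assumptions made for this example.

```python
def split_fasta(text, prefix="seq"):
    """Split FASTA text into single-record entries with uniform IDs.

    Returns (files, id_map): files maps a file name to its one-record FASTA
    content; id_map maps each new ID back to the original header line.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))

    files, id_map = {}, {}
    for number, (original, sequence) in enumerate(records, start=1):
        new_id = f"{prefix}{number:06d}"  # hypothetical numbering scheme
        files[f"{new_id}.fasta"] = f">{new_id}\n{sequence}\n"
        id_map[new_id] = original
    return files, id_map

example = ">geneA some description\nMKT\nLLV\n>geneB\nMSS\n"
files, id_map = split_fasta(example)
print(sorted(files))
print(id_map["seq000001"])
```

Keeping a map from the new IDs back to the original headers makes it possible to report results under the names used in the input file.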

 

Quick browse for all parameters:

Mandatory:

  -db,--database <string>       Path of the database directory to query against

  -fl,--file <string>           FASTA file of sequences

  -id,--taxid <numeric>         NCBI taxonomy ID of the query species

One of the following genome annotation files must be provided:

  -gff3,--gff3file <str>        Genome annotation file in GFF3 format

  -gtf,--gtffile <str>          Genome annotation file in GTF format

  -gff,--gfffile <str>          Genome annotation file in GFF format

  -gnm,--genomefile <str>       Genome file

 

Optional:

  <parameters of step 0 to 2>

  -th,--threads <numeric>       number of threads to use (default=1)

  -tt,--total <numeric>         total number of seqs included in the tree (default=60)

  -py,--phylum <numeric>        maximal number of seqs in each phylum (default=10)

  -cs,--class <numeric>         maximal number of seqs in each class (default=10)

  -sf,--self <numeric>          maximal number of seqs from the query species itself (default=6)

  -le,--length <num>            minimal sequence length to start with (default=40)

  -sst,--seqsearchtool <str>    tool used to search for homologs <diamond|blastp, default=diamond>

  -tbt,--treebuildtool <str>    tool used to build the tree <FastTree|iqtree, default=FastTree>

  <parameters of step 3>

  -don,--donor <str>                    Donor(s); separate multiple donors with comma

  -cut,--cutoff <num>                   Node support Cutoff (default=0)

  -opt,--optional <str>                 Optional taxa allowed in monophyletic ingroup

  -ign,--ignore <str>                   taxa to be Ignored while screening trees

  -ssn,--ssnode <num>                   minimal Strongly Supported Nodes uniting query and donors (default=1)

  -asn,--asnode <num>                   minimal number of All Supporting Nodes uniting query and donors (default=2)

  -mrn,--minimalReceptorNumber <num>    minimal number of Receptors in a nested position (default=2)

  -mdn,--minimalDonorNumber <num>       minimal number of Donors in a nested position (default=2)

  -ogs,--outgroupsize <num>             minimal OutGroup Size for a tree to be considered valid (default=5)

  <parameters of step 4>

  -minAI,--minimalAI <num>              minimal value of Alien Index to filter HGT events (default=0)

  -minHBL,--minmalBrchLnth <num>        minimal value of Index of Branch Length to filter HGT events (default=2)

  -minHU,--minmalHU <num>               minimal value of BLAST Score Index to filter HGT events (default=0)

  <parameters of step 5>

  -txr,--taxonresolution <str>          taxonomic level used to determine exogenous origin from BLAST hits;

                                        1 -> phylum, 2 -> kingdom, 3 -> superkingdom (default)
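The mandatory options listed above can be checked before launching a long run. The sketch below is a hypothetical pre-flight check, not part of the pipeline; it assumes the options are held in a dictionary keyed by the long option names from the list, and it treats the genome file as one of the four alternatives, following the grouping in that list.

```python
MANDATORY = ("database", "file", "taxid")
ANNOTATION = ("gff3file", "gtffile", "gfffile", "genomefile")

def missing_options(options):
    """Return a list of problems found in a dict of long option name -> value."""
    problems = [f"--{name} is required" for name in MANDATORY if not options.get(name)]
    if not any(options.get(name) for name in ANNOTATION):
        alternatives = ", ".join(f"--{name}" for name in ANNOTATION)
        problems.append(f"one of {alternatives} is required")
    return problems

# Hypothetical option sets for illustration.
complete = {"database": "database", "file": "example", "taxid": "<NCBITaxId>",
            "gff3file": "example.gff3"}
incomplete = {"file": "example"}

print(missing_options(complete))
for problem in missing_options(incomplete):
    print(problem)
```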

 

#########################################################################################

 

Details for all parameters:

Mandatory parameters

--file (-fl)

To specify the FASTA file containing all protein sequences to be screened for HGTs.

perl routinetree.pl --file example

--taxid (-id)

To specify the NCBI taxonomy ID of the query species.

perl routinetree.pl --file example --taxid <NCBITaxId>

--database (-db)

To specify the path of the database used for prediction.

perl routinetree.pl --file example --taxid <NCBITaxId> --database database

Optional parameters

--threads (-th)

To specify the number of threads to use (default = 1).

perl routinetree.pl --file example --taxid <NCBITaxId> --database database --step 012345 --threads 3

This command executes a scan using three threads.

--total (-tt)

To specify the total number of sequences included in tree (default=60).

perl routinetree.pl --file example --taxid <NCBITaxId> --database database --step 012345 --total 50

This command stipulates 50 sequences included in tree.

--phylum (-py)

To specify the maximal number of sequences in each phylum (default = 10).

perl routinetree.pl --file example --taxid <NCBITaxId> --database database --step 012345 --phylum 8

This command stipulates no more than 8 sequences in each phylum included in tree.

--class (-cs)

To specify the maximal number of sequences in each class (default = 10).

perl routinetree.pl --file example --taxid <NCBITaxId> --database database --step 012345 --class 8

This command stipulates no more than 8 sequences in each class included in tree.

--self (-sf)

To specify the maximal number of sequences from the query species itself (default = 6).

perl routinetree.pl --file example --taxid <NCBITaxId> --database database --step 012345 --self 4

This command stipulates that no more than 4 sequences from the query species itself are included in the tree.
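How the four limits (--total, --phylum, --class, --self) might interact can be sketched as below. This is an assumption-laden illustration, not the pipeline's actual selection logic: it assumes hits are processed from best to worst, that query-species hits are governed only by the --self limit, and that all other hits must satisfy both the phylum and class limits. The function and field names are hypothetical.

```python
from collections import Counter

def select_hits(hits, query_species, total=60, phylum=10, cls=10, self_max=6):
    """Pick hit IDs in ranked order while respecting the per-taxon caps.

    hits: list of dicts with keys 'id', 'species', 'phylum', 'class',
    already sorted from best to worst match.
    """
    per_phylum, per_class = Counter(), Counter()
    from_self = 0
    selected = []
    for hit in hits:
        if len(selected) >= total:
            break
        if hit["species"] == query_species:
            # Assumption: query-species hits are limited only by --self.
            if from_self >= self_max:
                continue
            from_self += 1
        else:
            if per_phylum[hit["phylum"]] >= phylum or per_class[hit["class"]] >= cls:
                continue
            per_phylum[hit["phylum"]] += 1
            per_class[hit["class"]] += 1
        selected.append(hit["id"])
    return selected

# Hypothetical ranked hits.
hits = [
    {"id": "q1", "species": "Q", "phylum": "P1", "class": "C1"},
    {"id": "q2", "species": "Q", "phylum": "P1", "class": "C1"},
    {"id": "a1", "species": "A", "phylum": "P2", "class": "C2"},
    {"id": "a2", "species": "B", "phylum": "P2", "class": "C2"},
    {"id": "a3", "species": "C", "phylum": "P2", "class": "C3"},
    {"id": "b1", "species": "D", "phylum": "P3", "class": "C4"},
]
print(select_hits(hits, "Q", phylum=2, cls=1, self_max=1))
```

Capping each taxonomic group keeps a single densely sampled lineage from filling the tree, which leaves room for sequences from other groups.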

--seqsearchtool (-sst)

We provide two tools for searching for homologs, diamond and blastp; choose whichever you prefer. The default tool is diamond.

perl routinetree.pl --file example --taxid <NCBITaxId> --database database --step 012345 --seqsearchtool blastp

This command chooses blastp to search for homologs.

--treebuildtool (-tbt)

We provide two tools for building trees, FastTree and iqtree; choose whichever you prefer. The default tool is FastTree.

perl routinetree.pl --file example --taxid <NCBITaxId> --database database --step 012345 --treebuildtool iqtree

This command chooses iqtree to build the tree.

--minimumAI (-minAI)

To specify the minimum value of AI. The default value for this parameter is 0.0.

perl routinetree.pl --file example --taxid <NCBITaxId> --database database --step 012345 --minimumAI 100.5

This command selects only HGTs with an AI greater than 100.5.

--minimumHU (-minHU)

To specify the minimum value of hU. The default value for this parameter is 0.0.

perl routinetree.pl --file example --taxid <NCBITaxId> --database database --step 012345 --minimumHU 200.5

This command selects only HGTs with an hU greater than 200.5.

--minimumHBL (-minHBL)

To specify the minimum value of hBL. The default value for this parameter is 0.0.

perl routinetree.pl --file example --taxid <NCBITaxId> --database database --step 012345 --minimumHBL 50.5

This command selects only HGTs with an hBL greater than 50.5.
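The three thresholds above can be pictured as a simple filter over candidate HGTs. The sketch below is illustrative only: the candidate records and their metric values are made up, the comparison is strict because the text says "more than", and combining the three tests with a logical AND is an assumption.

```python
def passes_thresholds(candidate, min_ai=0.0, min_hu=0.0, min_hbl=0.0):
    """True when all three metrics are strictly greater than their minimums."""
    return (
        candidate["AI"] > min_ai
        and candidate["hU"] > min_hu
        and candidate["hBL"] > min_hbl
    )

# Made-up candidates for illustration.
candidates = [
    {"id": "g1", "AI": 120.0, "hU": 250.0, "hBL": 60.0},
    {"id": "g2", "AI": 100.5, "hU": 300.0, "hBL": 80.0},  # AI equals the minimum
    {"id": "g3", "AI": 150.0, "hU": 150.0, "hBL": 70.0},  # hU below the minimum
]

kept = [
    c["id"]
    for c in candidates
    if passes_thresholds(c, min_ai=100.5, min_hu=200.5, min_hbl=50.5)
]
print(kept)
```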

--donor (-don)

To specify the HGT donor taxa. Here, "donor" means a taxon that does not contain the query species; it is not necessarily the biological donor, which must be judged based on timing. When this parameter is not set, the pipeline searches for nested positions in which the donor is Amoebozoa, Apusozoa, Bacteria, Excavata, Opisthokonta, Plantae, Chromalveolata, or Viruses, in turn.

You can also set this parameter yourself. This parameter is case sensitive.

perl routinetree.pl --file example --taxid <NCBITaxId> --database database --step 4 --donor Bacteria

This command makes the pipeline search for nested positions in which the donor is Bacteria only.

If there is more than one donor taxon, separate the donors with commas, e.g., "Bacteria,Archaea":

perl routinetree.pl --file example --taxid <NCBITaxId> --database database --step 4 --donor Bacteria,Archaea
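The comma-separated, case-sensitive donor value can be illustrated with a short sketch. The helpers `parse_donors` and `is_donor` are hypothetical and only show the matching behavior described above, not how the pipeline parses the option internally.

```python
def parse_donors(value):
    """Split a --donor value such as 'Bacteria,Archaea' into a list of names."""
    donors = [name for name in value.split(",") if name]
    if not donors:
        raise ValueError("at least one donor taxon is required")
    return donors

def is_donor(taxon, donors):
    """Case-sensitive membership test: 'bacteria' does not match 'Bacteria'."""
    return taxon in donors

donors = parse_donors("Bacteria,Archaea")
print(donors)
print(is_donor("Bacteria", donors))
print(is_donor("bacteria", donors))
```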

--cutoff (-cut)

To specify cutoff to define strongly supported nodes (default = 0).

perl routinetree.pl --file example --taxid <NCBITaxId> --database database --step 4 --cutoff 70

If it is set to 70, all interior nodes supporting query-donor monophyly with support values of no less than 70 are considered strongly supported nodes.

--optional (-opt)

To specify the optional taxa allowed to be present in the query-donor monophyletic ingroup. When this parameter is not set, it defaults to the kingdom of the query species.

You can also set this parameter yourself. This parameter is case sensitive.

perl routinetree.pl --file example --taxid <NCBITaxId> --database database --step 4 --optional Cyanidioschyzon

If there is more than one optional taxon, separate them with commas, e.g., "Cyanidioschyzon,Galdieria":

perl routinetree.pl --file example --taxid <NCBITaxId> --database database --step 4 --optional Cyanidioschyzon,Galdieria

This parameter makes it possible to search for more ancient HGTs that are shared between the query taxon and its closely related taxa. The sequences of optional taxa are recorded and exported.

--ignore (-ign)

To specify the taxa to be ignored while screening trees.

perl routinetree.pl --file example --taxid <NCBITaxId> --database database --step 4 --ignore Xenopus

This parameter lets users ignore sequences from taxa that they think might be problematic. The sequences of ignored taxa are skipped during tree processing and are not recorded by the program.

--ssnode (-ssn)

To specify the minimal number of strongly supported nodes (supporting value > cutoff) that support query-donor monophyly (default = 1).

perl routinetree.pl --file example --taxid <NCBITaxId> --database database --step 4 --ssnode 2

This command scans for trees with two or more strongly supported nodes supporting query-donor monophyly (enforced nested position requirement).

--asnode (-asn)

To specify the minimal number of all supporting nodes (regardless of supporting value) that support query-donor monophyly (default = 2).

perl routinetree.pl --file example --taxid <NCBITaxId> --database database --step 4 --asnode 1

This command scans for trees with one or more interior nodes supporting query-donor monophyly (turning off the nested position requirement).

perl routinetree.pl --file example --taxid <NCBITaxId> --database database --step 4 --asnode 3

This command scans for trees with three or more interior nodes supporting query-donor monophyly (a stricter nested position requirement).
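The interplay of --cutoff, --ssnode and --asnode can be sketched as a single check. This is a simplified illustration under stated assumptions: the input is just the list of support values of the interior nodes uniting query and donors, a node counts as strongly supported when its value is no less than the cutoff (following the --cutoff description), and both criteria must hold. The function name is hypothetical.

```python
def meets_node_criteria(supports, cutoff=0, ssnode=1, asnode=2):
    """Check the node-count requirements for one candidate tree.

    supports: support values of the interior nodes uniting query and donors.
    """
    strong = sum(1 for value in supports if value >= cutoff)
    return strong >= ssnode and len(supports) >= asnode

# Two uniting nodes, one of which reaches a cutoff of 70.
print(meets_node_criteria([95, 40], cutoff=70))
# Requiring two strongly supported nodes rejects the same tree.
print(meets_node_criteria([95, 40], cutoff=70, ssnode=2))
# A single uniting node passes only when --asnode is lowered to 1.
print(meets_node_criteria([95], cutoff=70, asnode=1))
```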

--minimalReceptorNumber (-mrn)

To specify the minimal number of biological receptors judged based on time in a nested position (default = 2).

perl routinetree.pl --file example --taxid <NCBITaxId> --database database --step 4 --mrn 3

This command scans for trees with three or more biological receptors judged based on time.

--minimalDonorNumber (-mdn)

To specify the minimal number of biological donors judged based on time in a nested position (default = 2).

perl routinetree.pl --file example --taxid <NCBITaxId> --database database --step 4 --mdn 3

This command scans for trees with three or more biological donors judged based on time.

--outgroupsize (-ogs)

To specify the minimal number of sequences in outgroup for a tree to be considered valid (default = 0).

To consider only trees with 4 or more sequences in the outgroup as valid, type:

perl routinetree.pl --file example --taxid <NCBITaxId> --database database --step 4 --ogs 4
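The three size requirements (--minimalReceptorNumber, --minimalDonorNumber and --outgroupsize) can be read as minimum counts that a candidate tree must reach. The sketch below is a hypothetical illustration that assumes the counts have already been extracted from the tree; the outgroup minimum is passed explicitly, as in the command above.

```python
def passes_size_filters(n_receptors, n_donors, n_outgroup, *, mrn=2, mdn=2, ogs):
    """True when receptor, donor and outgroup counts all reach their minimums."""
    return n_receptors >= mrn and n_donors >= mdn and n_outgroup >= ogs

# Hypothetical counts taken from three candidate trees.
print(passes_size_filters(3, 2, 4, ogs=4))
print(passes_size_filters(3, 2, 3, ogs=4))
print(passes_size_filters(1, 2, 10, ogs=4))
```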