The command line interface
Note
In the following we have turned off progress bars for clarity in the documentation.
Creating a demo dataset
The demo-data command writes demo.fa, a FASTA-formatted file containing 55 unaligned sequences. If you provide an alignment, diverse-seq applications will remove any gaps before processing.
$ dvs demo-data
Wrote 'demo.fa'
The first few lines of that file are shown below.
$ head demo.fa
>FlyingFox
TGTGGCACAAATGCTCATGCCAGCTCTTTACAGCATGAGAACAGTTTATTATACACTAAA
GACAGAATGAATGTAGAAAAGACTGACTTCTGTAATAAAAGCAAACAGCCTGGCTTAGCA
AGGAGCCAACAGAACAGATGGGTTGAAACTAAGGAAACATGTAACGATATGCAGACTTCC
AGCACAGAGAAAAAGGTAGTTCTGAATGCTGATCCCCTGAATGGGAGAATAGAACTGAAT
AAGCAGAAACCTCCATGCTCTGACAGTCCTAGAGATTCTCAAGATATTTCTTGGATAACA
CGGAATAGTAGCATACAGAAAGTTAATGAGTGGTTTTCCAGACGTGATGAAATATTAACT
TCTGATGTCTCACCTGATGGGAGGTCTGAATCAAATGTGGTAGAAGTTCCAAATGAAGTA
GATGGATACTCTGGTGCTTCAGAGAAAATAGCTTTAAAGGCCAATGATCCTCATGGTGCT
TTAATGTGCGAAAGAGTTCACTCCAAACTGGTAGAAAGTAATATTGAAGATAAAATATTT
The prep command
This command converts the sequences in either a single file or a directory of files into an HDF5-format file, which is more efficient for analysis.
$ dvs prep -s demo.fa -o demo.dvseqsz
Successfully created 'demo.dvseqsz'
The nmost command
This command selects the n most diverse sequences and writes them to a .tsv file. We specify the k-mer size (-k 6) and the value of n (-n 10).
$ dvs nmost -s demo.dvseqsz -o demo-nmost.tsv -k 6 -n 10
10 divergent sequences IDs written to 'demo-nmost.tsv'
The output file has two columns. The first is the name of the file the sequence came from. The second is the delta_jsd value, which is the contribution of that sequence to the Jensen-Shannon divergence of the final collection.
$ head demo-nmost.tsv
names delta_jsd
Madagascar 0.02453210566960351
Pangolin -0.0063949886714098625
Mouse 0.021825308704462643
Bandicoot 0.01877395697167472
Phascogale 0.020401650759581003
LesserEle 0.005735980102070215
Mole 0.001033600174714877
RoundEare 0.043475911914189425
TreeShrew 0.0020294328580998666
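The two-column layout shown above can be read with Python's standard csv module. The following is a minimal sketch, assuming the columns are tab-separated (as the .tsv extension suggests) and that the header row is exactly names and delta_jsd. The helper name read_delta_jsd is hypothetical, and the sample rows are copied from the output above so that the block runs without the file.

```python
import csv
import io


def read_delta_jsd(handle) -> dict[str, float]:
    """Parse a two-column, tab-separated table that has a header row.

    Assumes the header names are 'names' and 'delta_jsd'.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["names"]: float(row["delta_jsd"]) for row in reader}


# Three rows copied from the nmost output above, used as sample input.
sample = io.StringIO(
    "names\tdelta_jsd\n"
    "Madagascar\t0.02453210566960351\n"
    "Pangolin\t-0.0063949886714098625\n"
    "Mouse\t0.021825308704462643\n"
)
scores = read_delta_jsd(sample)

# Rank the sequences from largest to smallest contribution.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, value in ranked:
    print(f"{name}\t{value:.4f}")
```

To read the real file, pass an open file handle instead of the sample, for example `open("demo-nmost.tsv", newline="")`.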
Selecting the maximally diverse sequences using the max command
The max command maximises the standard deviation of delta_jsd in a collection. The user specifies the minimum (-z 5) and maximum (-zp 10) size of the final collection. We also set the random number seed (--seed 1741676171). If the verbose flag is set (-v), the command will show the random seed used (which defaults to the system time).
$ dvs max -s demo.dvseqsz -o demo-max.tsv -k 6 -z 5 -zp 10 --seed 1741676171
5 divergent sequences IDs written to 'demo-max.tsv'
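To make the quantities named above concrete, the following sketch computes the Jensen-Shannon divergence of k-mer frequency distributions, a per-sequence delta_jsd, and the standard deviation of those values. It is an illustration under stated assumptions, not the dvs implementation: delta_jsd is assumed here to be the JSD of the full collection minus the JSD of the collection without that sequence, all sequences are weighted equally, and the function names are hypothetical.

```python
import math
import statistics
from collections import Counter


def kmer_freqs(seq: str, k: int) -> dict[str, float]:
    """Relative frequencies of all overlapping k-mers in seq."""
    counts = Counter(seq[i : i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def entropy(freqs: dict[str, float]) -> float:
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in freqs.values() if p > 0)


def jsd(dists: list[dict[str, float]]) -> float:
    """Equal-weight JSD: entropy of the mean minus the mean of the entropies."""
    n = len(dists)
    mean: dict[str, float] = {}
    for dist in dists:
        for kmer, p in dist.items():
            mean[kmer] = mean.get(kmer, 0.0) + p / n
    return entropy(mean) - sum(entropy(d) for d in dists) / n


def delta_jsd(dists: list[dict[str, float]], index: int) -> float:
    """Assumed definition: JSD of the full set minus JSD without one member.

    Requires at least two distributions.
    """
    rest = dists[:index] + dists[index + 1 :]
    return jsd(dists) - jsd(rest)


# Toy collection: two identical sequences and one with different k-mers.
toy = [kmer_freqs(s, k=2) for s in ("AAAA", "AAAA", "CCCC")]
deltas = [delta_jsd(toy, i) for i in range(len(toy))]
spread = statistics.stdev(deltas)  # the quantity max is described as maximising
print(deltas, spread)
```

In this toy collection the redundant sequences receive a negative delta_jsd and the distinct one a positive value, which is consistent with the negative value reported for Pangolin in the nmost output.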
Estimating a tree from Mash distances using ctree
The ctree command produces an approximate tree from a collection of unaligned sequences using either the Euclidean distance or the Mash distance. We specify the k-mer size (-k 12), the sketch size (--sketch-size 3000), and the distance metric (-d mash). This command outputs a Newick-formatted tree string to a file.
$ dvs ctree -s demo.dvseqsz -o demo-ctree.nwk -k 12 -d mash --sketch-size 3000
Newick tree written to 'demo-ctree.nwk'
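The roles of the k-mer size and the sketch size can be illustrated with a small MinHash example. The following is a sketch of the published Mash distance, D = -ln(2j / (1 + j)) / k, where j is the Jaccard index estimated from the smallest hash values of each sequence. It is not the dvs implementation: it hashes k-mers with BLAKE2 from hashlib, ignores reverse complements, and uses hypothetical function names. The parameters k and size play the roles of -k and --sketch-size.

```python
import hashlib
import math


def sketch(seq: str, k: int, size: int) -> set[int]:
    """Bottom-`size` MinHash sketch: the smallest distinct k-mer hash values."""
    hashes = {
        int.from_bytes(
            hashlib.blake2b(seq[i : i + k].encode(), digest_size=8).digest(),
            "big",
        )
        for i in range(len(seq) - k + 1)
    }
    return set(sorted(hashes)[:size])


def mash_distance(a: set[int], b: set[int], k: int, size: int) -> float:
    """Mash distance from the Jaccard estimate of two bottom-`size` sketches."""
    merged = sorted(a | b)[:size]
    if not merged:
        raise ValueError("both sketches are empty; sequences shorter than k?")
    shared = sum(1 for h in merged if h in a and h in b)
    j = shared / len(merged)
    if j == 0:
        return 1.0  # no shared hashes: maximum distance
    # max() avoids returning -0.0 when the sketches are identical
    return max(0.0, -math.log(2 * j / (1 + j)) / k)


k, size = 4, 1000
s1 = "ACGTTGCAAGGCTTAACCGGTTAC"
s2 = s1[:12] + "TTTTTTTTTTTT"  # shares only the first half with s1
d_same = mash_distance(sketch(s1, k, size), sketch(s1, k, size), k, size)
d_part = mash_distance(sketch(s1, k, size), sketch(s2, k, size), k, size)
print(d_same, d_part)
```

A larger sketch size gives a more precise Jaccard estimate at the cost of more memory and time per comparison, while a larger k makes shared k-mers rarer between divergent sequences.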