The Python API
We provide programmatic access to diverse_seq functions as cogent3 apps. The dvs apps mirror the capabilities of their command line counterparts, with two key differences: their inputs and their outputs. There is no data transformation step. Just use cogent3 to load the sequence collection (aligned or otherwise) and pass it to an app instance. The dvs_nmost and dvs_max apps identify the sequences to keep and return an object of the same type as the input, containing just those sequences.
What apps are available?
We use the cogent3 capabilities for displaying the installed apps and getting help on them.
import cogent3
cogent3.available_apps("dvs")
| package | name | composable | doc | input type | output type | licenses |
|---|---|---|---|---|---|---|
| diverse_seq | dvs_ctree | True | Create a cluster tree from kmer distances. | Alignment, SequenceCollection | PhyloNode | BSD |
| diverse_seq | dvs_delta_jsd | True | returns delta_jsd for a sequence | ByteSequence, DnaSequence, ProteinSequence, ProteinWithStopSequence, RnaSequence, Sequence | float, str | BSD |
| diverse_seq | dvs_max | True | select the maximally divergent seqs from a sequence collection | Alignment, SequenceCollection | Alignment, ByteSequence, DictArray, DistanceMatrix, DnaSequence, PhyloNode, ProteinSequence, ProteinWithStopSequence, RnaSequence, Sequence, SequenceCollection, SerialisableType, Table, bootstrap_result, generic_result, hypothesis_result, model_result, tabular_result | BSD |
| diverse_seq | dvs_nmost | True | select the n-most diverse seqs from a sequence collection | Alignment, SequenceCollection | Alignment, ByteSequence, DictArray, DistanceMatrix, DnaSequence, PhyloNode, ProteinSequence, ProteinWithStopSequence, RnaSequence, Sequence, SequenceCollection, SerialisableType, Table, bootstrap_result, generic_result, hypothesis_result, model_result, tabular_result | BSD |
| diverse_seq | dvs_par_ctree | False | Create a cluster tree from kmer distances in parallel. | Alignment, SequenceCollection | PhyloNode | BSD |
5 rows x 7 columns
Getting help on an app
We show the help for the dvs_nmost app only.
cogent3.app_help("dvs_nmost")
Overview
--------
select the n-most diverse seqs from a sequence collection
Options for making the app
--------------------------
dvs_nmost_app = get_app(
    'dvs_nmost',
    n=10,
    moltype='dna',
    include=None,
    k=6,
    seed=None,
)
Parameters
----------
n
    the number of divergent sequences
moltype
    molecular type of the sequences
k
    k-mer size
include
    sequence names to include in the final result
seed
    random number seed
Notes
-----
If called with an alignment, the ungapped sequences are used.
The order of the sequences is randomised. If include is not None, the
named sequences are added to the final result.
Returns
-------
The same type as the input sequence collection.
Input type
----------
Alignment, SequenceCollection
Output type
-----------
hypothesis_result, model_result, DistanceMatrix, PhyloNode, DnaSequence, RnaSequence, SequenceCollection, ProteinWithStopSequence, SerialisableType, ProteinSequence, generic_result, ByteSequence, Sequence, Alignment, bootstrap_result, tabular_result, Table, DictArray
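The k option in the help above sets the k-mer size used to summarise each sequence. As an illustration of what a k-mer count is, here is a minimal pure-Python sketch. It is not the diverse_seq implementation: the function name `kmer_counts` is hypothetical, and skipping k-mers that contain characters other than A, C, G and T is an assumption made for this example.

```python
from collections import Counter


def kmer_counts(seq: str, k: int = 6) -> Counter:
    """Count overlapping k-mers in seq.

    Assumption for this sketch: k-mers containing anything other than
    A, C, G or T (for example N) are skipped.
    """
    seq = seq.upper()
    canonical = set("ACGT")
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i : i + k]
        if set(kmer) <= canonical:
            counts[kmer] += 1
    return counts


# The first 21 bases of the FlyingFox sequence shown in this document
counts = kmer_counts("TGTGGCACAAATGCTCATGCC", k=6)
print(len(counts), "distinct 6-mers from", sum(counts.values()), "windows")
```

Larger values of k give more distinct k-mers and therefore sparser counts per sequence.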
Using dvs_nmost
Load the sample data as a cogent3 SequenceCollection of unaligned sequences.
Note: The apps accept either an alignment or a cogent3 sequence collection (of unaligned sequences).
import diverse_seq
seqs = diverse_seq.load_sample_data()
seqs
| 0 | |
| FlyingFox | TGTGGCACAAATGCTCATGCCAGCTCTTTACAGCATGAGAACAGTTTATTATACACTAAA |
| DogFaced | TGTGGCACAAATACTCATGCCAACTCATTACAGCATGAGAACAGCAGTTTATTATACACT |
| FreeTaile | TGTGGCACAGATACTCATGCCAGCTCATTACAGCATGAGAACAGCAGTTTACTACTCACT |
| LittleBro | TGTGGCACAGATACTCATGCCAGCTCATTACAGCATGAGAACAGCAGTTTACTACTCACC |
| TombBat | TGTGGCACAAGTACTCATGCCAGCTCAGTACAGCATGAGAACAGCAGTTTACTACTCACT |
| RoundEare | NGCTCATTANAGCNTGAGAACAGCAGTTTACTGCTCACTGAGGACCAGATGAGTGTGGGA |
| FalseVamp | TGTGGCACAAATACTCATGCCAGCTCATTACAGCATGAGAACAGCAATTTATTACTGACT |
| LeafNose | TGTGGCACAAATACTCATGCCAGCTCTTTACATTATGAGCACAGCAGTTTATTACTCACT |
| Horse | TGTGGCACAAATACTCATGCCAGCTCATTGCAGCATGAGAACAGCAGTTTATTACTCACT |
| Rhino | TGTGGCACGAATACTCATGCCAGCTCATTGCAGCATGAGAACAGCAGTGTATTACTCACT |
55 x {min=2382, median=2789.0, max=2889} dna sequence collection
When we create an app, we can see all the parameter settings (including defaults) as follows.
nmost = cogent3.get_app("dvs_nmost", n=10, seed=123)
nmost
dvs_nmost(n=10, moltype='dna', include=None, k=6, seed=123)
result = nmost(seqs)
result
| 0 | |
| Anteater | TGTGGCACAAATATTCATGCCAGCTCATTACAGCATGAGAACAGCAGTTTATTACTCACT |
| TombBat | TGTGGCACAAGTACTCATGCCAGCTCAGTACAGCATGAGAACAGCAGTTTACTACTCACT |
| Tenrec | TGTGGCACACGTACGCTTGCCAGCTCGGCACAGCGCGAGGACTGCAGCTTATTACTCACC |
| Madagascar | TGTGGAACAAATACGCTTGCCAACTCATTACAGCGTGAGAACTACAGTTTATTACTCACT |
| Galago | TGTGGCAAAAATACTCATGCCAGCTCATTACAGCATGAGAGCAGCAGTTTATTACTCACT |
| Bandicoot | NACTCATTAATGCTTGAAACCAGCAGTTTATTGTCCAACATAGACAGAATGACTACAAAA |
| RockHyrax | TGTGGCACAGATACTTGTGCCAGCTCGTTACAGCATGAGAACAGCAGTTTATTACTCACT |
| TreeShrew | TGTGGCATAAATACTTATGCCAGCTCATTACAGCATGAGAACAGCAGTTTATTACTCACT |
| Cat | TGTGCCACAAATACTCGTGCCAGCTCATTACAGCATGAGAACAGCAGTTTATTACTCACT |
| Mole | TGTGGCATAAATACTCATGCCAGCTTATTACAGCATGAAAACAGCAGTTTATTACTCACT |
10 x {min=2533, median=2763.5, max=2844} dna sequence collection
Using dvs_max
dvs_max = cogent3.get_app("dvs_max", max_size=10, seed=123)
dvs_max
dvs_max(min_size=5, max_size=10, stat='stdev', moltype='dna', include=None, k=6, seed=123)
result = dvs_max(seqs)
result
| 0 | |
| Anteater | TGTGGCACAAATATTCATGCCAGCTCATTACAGCATGAGAACAGCAGTTTATTACTCACT |
| Cat | TGTGCCACAAATACTCGTGCCAGCTCATTACAGCATGAGAACAGCAGTTTATTACTCACT |
| Mole | TGTGGCATAAATACTCATGCCAGCTTATTACAGCATGAAAACAGCAGTTTATTACTCACT |
| OldWorld | AGCCAACAGAGCAGGTGGGCTGAGAGCAAGGAGAGGTGCCATGACAGGCAGGCTCCTGGC |
| FlyingLem | TGTGGCACAAATACTCATGCCAGCTCATTACAGCATGAGAACAGCAGTTTATTACGCACT |
| Bandicoot | NACTCATTAATGCTTGAAACCAGCAGTTTATTGTCCAACATAGACAGAATGACTACAAAA |
| Human | TGTGGCACAAATACTCATGCCAGCTCATTACAGCATGAGAACAGCAGTTTATTACTCACT |
7 x {min=2533, median=2835.0, max=2847} dna sequence collection
Using dvs_delta_jsd
The dvs_delta_jsd app computes the delta JSD value for a single sequence against a reference set of sequences. It returns a tuple of (sequence name, delta JSD value).
Note: There is no command line interface for this app.
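To make the term concrete, the following is a minimal pure-Python sketch of one way to compute a delta JSD. It assumes delta JSD is the change in the Jensen-Shannon divergence (JSD) of the reference set's k-mer frequency distributions when the query sequence is added, with equal weights and log base 2. The function names are hypothetical and the diverse_seq implementation may differ in weighting, log base and its handling of ambiguous characters.

```python
import math
from collections import Counter


def kmer_freqs(seq: str, k: int) -> dict[str, float]:
    """Relative frequencies of the overlapping k-mers in seq."""
    counts = Counter(seq[i : i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence is shorter than k")
    return {kmer: n / total for kmer, n in counts.items()}


def entropy(freqs: dict[str, float]) -> float:
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in freqs.values() if p > 0)


def jsd(seqs: list[str], k: int) -> float:
    """Equal-weight JSD: entropy of the mean distribution minus the mean entropy."""
    dists = [kmer_freqs(s, k) for s in seqs]
    mean = Counter()
    for dist in dists:
        for kmer, p in dist.items():
            mean[kmer] += p / len(dists)
    return entropy(mean) - sum(entropy(d) for d in dists) / len(dists)


def delta_jsd(ref_seqs: list[str], query: str, k: int) -> float:
    """Assumed definition: JSD(ref + query) - JSD(ref)."""
    return jsd(ref_seqs + [query], k) - jsd(ref_seqs, k)


# Toy sequences invented for this example
refs = ["ACGTACGTACGT", "ACGTACGTACGA"]
for label, query in [("similar", "ACGTACGTACGT"), ("different", "GGGGCCCCGGGG")]:
    print(label, delta_jsd(refs, query, k=3))
```

Under this definition, a query whose k-mer composition differs from the reference set increases the JSD more than a query that resembles a reference sequence.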
Say we have a reference group of sequences, ref_seqs. We want to evaluate each sequence in a set of query sequences to see what their delta JSD values are against the reference set. These values allow us, for example, to select a sequence that is highly diverged from all sequences in the reference set, or one which is very similar to a sequence in the reference set.
For this example, we define ref_seqs as the first 10 sequences in our sample data.
ref_seqs = seqs.take_seqs(seqs.names[:10])
ref_seqs
| 0 | |
| FlyingFox | TGTGGCACAAATGCTCATGCCAGCTCTTTACAGCATGAGAACAGTTTATTATACACTAAA |
| DogFaced | TGTGGCACAAATACTCATGCCAACTCATTACAGCATGAGAACAGCAGTTTATTATACACT |
| FreeTaile | TGTGGCACAGATACTCATGCCAGCTCATTACAGCATGAGAACAGCAGTTTACTACTCACT |
| LittleBro | TGTGGCACAGATACTCATGCCAGCTCATTACAGCATGAGAACAGCAGTTTACTACTCACC |
| TombBat | TGTGGCACAAGTACTCATGCCAGCTCAGTACAGCATGAGAACAGCAGTTTACTACTCACT |
| RoundEare | NGCTCATTANAGCNTGAGAACAGCAGTTTACTGCTCACTGAGGACCAGATGAGTGTGGGA |
| FalseVamp | TGTGGCACAAATACTCATGCCAGCTCATTACAGCATGAGAACAGCAATTTATTACTGACT |
| LeafNose | TGTGGCACAAATACTCATGCCAGCTCTTTACATTATGAGCACAGCAGTTTATTACTCACT |
| Horse | TGTGGCACAAATACTCATGCCAGCTCATTGCAGCATGAGAACAGCAGTTTATTACTCACT |
| Rhino | TGTGGCACGAATACTCATGCCAGCTCATTGCAGCATGAGAACAGCAGTGTATTACTCACT |
10 x {min=2738, median=2783.0, max=2841} dna sequence collection
We define our query group as the remaining sequences.
query_seqs = seqs.take_seqs(seqs.names[:10], negate=True)
query_seqs
| 0 | |
| Pangolin | TGTGGCACAAATACTCATGCCAGCTCATTACAGCATGAGAACAGCAGTTTATTACTCACT |
| Cat | TGTGCCACAAATACTCGTGCCAGCTCATTACAGCATGAGAACAGCAGTTTATTACTCACT |
| Dog | TGTGGCACAAATACTCATGCCAGCTCATTACAGCATGAGAACAGCAGTTTATTACTCACT |
| Llama | TGTGGCACAGATACTCATGCCAGCTCATTACAGCATGAGAACAGCAGTTTATTACTCACT |
| Pig | TGTGGCACAGATACTCATGCCAGCTCGTTACAGCATGAGAACAGCAGTTTATTACTCACT |
| Cow | TGTGGCACAGATACTCATGCCAGCTCATTACAGCATGAGAACAGCAGTTTATTGCTCACT |
| Hippo | TGTGGCACAGATACTCGTGCCAGCTCATTACAGCATGAGAACAGCAGTTTATTACTCACT |
| SpermWhale | TGTGGCACAGATACTCATGCCAGCTCATTACAGCATGAAAACAGCAGTTTATTACTCACT |
| HumpbackW | TGTGGCACAGATACTCATGCCAGCTCATTACAACATGAAAACAGCAGTTTATTACTCACT |
| Mole | TGTGGCATAAATACTCATGCCAGCTTATTACAGCATGAAAACAGCAGTTTATTACTCACT |
45 x {min=2382, median=2792.0, max=2889} dna sequence collection
dvs_djsd = cogent3.get_app("dvs_delta_jsd", seqs=ref_seqs, moltype="dna", k=8)
dvs_djsd
dvs_delta_jsd(seqs=10x dna seqcollection: (RoundEare[NGCTCATTAN...], ..., Rhino[TGTGGCACGA...]), moltype='dna', k=8)
We now compute the delta JSD values for each sequence in query_seqs against ref_seqs and make a table from the results. We just display the top few records.
name_deltas = [dvs_djsd(seq) for seq in query_seqs.seqs]
table = cogent3.make_table(
    header=["seqname", "delta_jsd"], data=name_deltas, index_name="seqname"
)
table = table.sorted(reverse="delta_jsd")
table.head()
| seqname | delta_jsd |
|---|---|
| Bandicoot | 1.7413 |
| Phascogale | 1.7402 |
| Wombat | 1.7383 |
| Caenolest | 1.7374 |
| Tenrec | 1.7354 |
Top 5 rows from 45 rows x 2 columns
We conclude that the Bandicoot (a marsupial) sequence is the most divergent from the reference set, whose members are all eutherian ("placental") mammals.
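Because each call to the app returns a (name, delta JSD) tuple, the extremes can also be picked out with plain Python, without building a table. The sketch below uses only the five rows displayed above, so its "lowest" entry is the lowest of those five and not of all 45 query sequences.

```python
# (name, delta_jsd) pairs copied from the five table rows displayed above
name_deltas = [
    ("Bandicoot", 1.7413),
    ("Phascogale", 1.7402),
    ("Wombat", 1.7383),
    ("Caenolest", 1.7374),
    ("Tenrec", 1.7354),
]

# Highest delta JSD: the most diverged from the reference set
most_diverged = max(name_deltas, key=lambda pair: pair[1])
# Lowest delta JSD among these five rows
least_diverged = min(name_deltas, key=lambda pair: pair[1])

print(most_diverged, least_diverged)
```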
Using dvs_ctree
dvs_ctree = cogent3.get_app("dvs_ctree")
dvs_ctree
dvs_ctree(k=12, sketch_size=3000, moltype='dna', distance_mode='mash', mash_canonical_kmers=None, show_progress=False)
result = dvs_ctree(seqs)
result
Tree("((Caenolest,(Bandicoot,(Phascogale,Wombat))),(OldWorld,((Madagascar,Tenrec),((Mouse,Rat),(RoundEare,((LesserEle,GiantElep),(Hedgehog,(Jackrabbit,((RockHyrax,TreeHyrax),(TreeShrew,(GoldenMol,(Mole,((Galago,(FlyingSqu,(HowlerMon,(Rhesus,(Orangutan,(Human,(Gorilla,Chimpanzee))))))),(((Anteater,(Sloth,(NineBande,HairyArma))),(Aardvark,((AfricanEl,AsianElep),(Dugong,Manatee)))),((Cat,Dog),((Cow,(Pig,(Llama,(Hippo,(SpermWhale,HumpbackW))))),((TombBat,(FreeTaile,LittleBro)),(FalseVamp,((FlyingFox,DogFaced),(FlyingLem,(Pangolin,(LeafNose,(Horse,Rhino)))))))))))))))))))))));")
dnd = result.get_figure()
dnd.show(renderer="svg", height=700, width=400)
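The app settings shown above include distance_mode='mash', k and sketch_size. As background, here is a minimal pure-Python sketch of the general Mash technique: keep the smallest hashed k-mers of each sequence as a "bottom sketch", estimate the Jaccard index j from the sketches, and convert it to a distance with D = -ln(2j / (1 + j)) / k. This is not the diverse_seq implementation. The blake2b hash and the function names are choices made for this example, and canonical (strand-independent) k-mers are not handled.

```python
import hashlib
import math


def kmer_hashes(seq: str, k: int) -> set[int]:
    """Hash every overlapping k-mer of seq to a 64-bit integer."""
    return {
        int.from_bytes(
            hashlib.blake2b(seq[i : i + k].encode(), digest_size=8).digest(), "big"
        )
        for i in range(len(seq) - k + 1)
    }


def bottom_sketch(hashes: set[int], size: int) -> list[int]:
    """Keep the `size` smallest hash values."""
    return sorted(hashes)[:size]


def mash_distance(seq_a: str, seq_b: str, k: int = 12, sketch_size: int = 3000) -> float:
    a = bottom_sketch(kmer_hashes(seq_a, k), sketch_size)
    b = bottom_sketch(kmer_hashes(seq_b, k), sketch_size)
    # Bottom sketch of the union of the two sketches
    union = sorted(set(a) | set(b))[:sketch_size]
    if not union:
        raise ValueError("both sequences are shorter than k")
    shared = set(a) & set(b)
    # Jaccard estimate: fraction of the union sketch present in both sketches
    j = sum(1 for h in union if h in shared) / len(union)
    if j == 0:
        return 1.0  # no shared k-mers: report the maximum distance
    return max(0.0, -math.log(2 * j / (1 + j)) / k)


# Toy sequences invented for this example
print(mash_distance("AAAACCCC", "AAAAGGGG", k=4))
```

A smaller sketch_size trades accuracy of the Jaccard estimate for less memory and faster comparisons.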
dvs_par_ctree is for parallel processing
We don't use it here, but it is best suited to collections containing a large number of sequences.