Genetic sequence mutation generator

Source code here.
Last update 03/13/2008.

Compile:
javac SequenceMutator.java

Run:
java SequenceMutator [seq] [gen]

Example:
java SequenceMutator gctacgc 100

Why?

In my studies on phylogenetics, and now on sequence search methods, I have missed a program, API, or method that applies pseudo-random mutations to a sequence and produces a new set of similar sequences. There are classes in the biojava library that do simple base substitution, and biopython has similar classes and methods, but they do not perform insertion, deletion, duplication, dislocation, inversion, or duplication with inversion, and they do not drive all these operations from probabilities given by the user.

How?

The algorithm is pretty straightforward. It has a set of methods, one for each possible mutation: base change, insertion, deletion, duplication, inversion, and duplication with inversion. The user gives the probability of each of these mutations occurring (or uses the defaults). The probabilities can range from 0.001 (0.00001%) to 10,000.00 (100%), and they populate a probability vector very similar to those used in genetic algorithms. The user also gives the number of generations, that is, the number of iterations the algorithm will run over the given sequence. At each generation, the algorithm draws a pseudo-random number and checks the probability vector to see whether the number falls on a mutation and, if so, which one. If the value represents a mutation, the algorithm applies that mutation method to the sequence, and it continues while generations remain.
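
The selection step described above can be sketched roughly as follows. This is only an illustration, not the code of SequenceMutator itself: the class name MutationLoopSketch, the method names, the restriction to three mutation types, and the example probabilities are all assumptions made for the sketch. It keeps the 0 to 10,000 scale mentioned above and stores the probabilities as cumulative thresholds, so one pseudo-random draw per generation either lands in the slot of a mutation or in the remaining "no mutation" range.

```java
import java.util.Random;

// Illustrative sketch only; names and structure are hypothetical,
// not taken from SequenceMutator.java.
class MutationLoopSketch {
    enum Mutation { SUBSTITUTION, INSERTION, DELETION }

    static final double SCALE = 10_000.0; // 10,000.00 corresponds to 100%
    static final char[] BASES = {'a', 'c', 'g', 't'};

    /** Turns per-mutation probabilities (in Mutation.values() order) into cumulative thresholds. */
    static double[] cumulative(double[] probabilities) {
        if (probabilities.length != Mutation.values().length) {
            throw new IllegalArgumentException("one probability per mutation type is required");
        }
        double[] thresholds = new double[probabilities.length];
        double sum = 0.0;
        for (int i = 0; i < probabilities.length; i++) {
            sum += probabilities[i];
            thresholds[i] = sum;
        }
        if (sum > SCALE) {
            throw new IllegalArgumentException("probabilities add up to more than 100%");
        }
        return thresholds;
    }

    /** Returns the mutation whose slot contains the draw, or null if the draw hits no slot. */
    static Mutation select(double[] thresholds, double draw) {
        Mutation[] all = Mutation.values();
        for (int i = 0; i < thresholds.length; i++) {
            if (draw < thresholds[i]) {
                return all[i];
            }
        }
        return null;
    }

    /** Applies one mutation at a pseudo-random position and returns the new sequence. */
    static String apply(Mutation mutation, String seq, Random rng) {
        if (seq.isEmpty() && mutation != Mutation.INSERTION) {
            return seq; // nothing to substitute or delete
        }
        switch (mutation) {
            case SUBSTITUTION: {
                int pos = rng.nextInt(seq.length());
                char base = BASES[rng.nextInt(BASES.length)]; // may equal the old base
                return seq.substring(0, pos) + base + seq.substring(pos + 1);
            }
            case INSERTION: {
                int pos = rng.nextInt(seq.length() + 1);
                char base = BASES[rng.nextInt(BASES.length)];
                return seq.substring(0, pos) + base + seq.substring(pos);
            }
            case DELETION: {
                int pos = rng.nextInt(seq.length());
                return seq.substring(0, pos) + seq.substring(pos + 1);
            }
            default:
                return seq;
        }
    }

    /** Runs the given number of generations; each one draws once and mutates at most once. */
    static String evolve(String seq, int generations, double[] probabilities, Random rng) {
        double[] thresholds = cumulative(probabilities);
        for (int g = 0; g < generations; g++) {
            Mutation mutation = select(thresholds, rng.nextDouble() * SCALE);
            if (mutation != null) {
                seq = apply(mutation, seq, rng);
            }
        }
        return seq;
    }

    public static void main(String[] args) {
        double[] probabilities = {500.0, 100.0, 100.0}; // 5%, 1%, 1% (made-up values)
        System.out.println(evolve("gctacgc", 100, probabilities, new Random()));
    }
}
```

With all probabilities set to zero the sequence comes back unchanged, and passing a seeded Random makes a run repeatable, which helps when comparing results between experiments.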

When?

The algorithm and its implementation are useful for studying phylogenies and phylogenetic methods, and also for sequence search algorithms. Example: given five different sequences, generate five more sequences from each one, using different probability ratios. After the mutated sequences are generated, run the phylogenetic algorithm; the expected result is a tree with five main groups, one for each primary sequence. Another example: given a sequence database, randomly choose one of its sequences and search the database for it. For each following search, generate a new mutated sequence from the previously searched sequence and run a new search with it. After a number of searches, analyse all the results to obtain the precision and the sensitivity of the search algorithm.
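
For the second example, the final analysis could be computed along these lines. This is a small sketch under assumptions: the class SearchEvaluationSketch and its methods are hypothetical, and it assumes you already know, for each query, which database entries should have been found (the "relevant" set) and which ones the search actually returned. Precision is then the fraction of returned entries that are relevant, and sensitivity is the fraction of relevant entries that were returned.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only; not part of SequenceMutator.java.
class SearchEvaluationSketch {
    /** Number of returned entries that are also relevant. */
    static int truePositives(Set<String> returned, Set<String> relevant) {
        Set<String> hits = new HashSet<>(returned);
        hits.retainAll(relevant);
        return hits.size();
    }

    /** Fraction of returned entries that are relevant (0.0 if nothing was returned). */
    static double precision(Set<String> returned, Set<String> relevant) {
        if (returned.isEmpty()) {
            return 0.0;
        }
        return (double) truePositives(returned, relevant) / returned.size();
    }

    /** Fraction of relevant entries that were returned (0.0 if nothing is relevant). */
    static double sensitivity(Set<String> returned, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        return (double) truePositives(returned, relevant) / relevant.size();
    }
}
```

Averaging these two values over all the searches, or plotting them against the number of generations used to mutate the query, gives an idea of how quickly the search method degrades as the query drifts away from the original sequence.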

Final considerations

The main idea, the algorithm, and its implementation are constantly being improved. The code has bugs, can be optimized, and needs comments, but it is working very well for me. If you have a suggestion, bug report, or criticism, I will be glad to hear (or read) it. The code is structured with a lot of static methods so that you and I can use the mutation methods individually: they all follow the flow sequence -> mutation -> new_sequence, with no dependencies between them. I am planning some improvements, such as reading FASTA files as input and saving the results as FASTA files too, but I prefer not to promise anything, since it is very easy to do with the biojava library. For anything else, contact me at felipe.albrecht@gmail.com.
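
To give an idea of what the sequence -> mutation -> new_sequence flow looks like with independent static methods, here is a rough sketch. The class MutationOpsSketch and the method names and signatures are assumptions for illustration and may not match SequenceMutator.java. The segment is given explicitly as [start, end) to keep the example deterministic, and inversion is done as a plain reversal of the segment, which is also an assumption (a reverse complement would be another reasonable choice).

```java
// Illustrative sketch only; method names and signatures are hypothetical.
class MutationOpsSketch {
    /** Reverses the segment [start, end) in place. */
    static String invert(String seq, int start, int end) {
        String reversed = new StringBuilder(seq.substring(start, end)).reverse().toString();
        return seq.substring(0, start) + reversed + seq.substring(end);
    }

    /** Inserts a copy of the segment [start, end) right after the segment. */
    static String duplicate(String seq, int start, int end) {
        return seq.substring(0, end) + seq.substring(start, end) + seq.substring(end);
    }

    /** Inserts a reversed copy of the segment [start, end) right after the segment. */
    static String duplicateInverted(String seq, int start, int end) {
        String reversed = new StringBuilder(seq.substring(start, end)).reverse().toString();
        return seq.substring(0, end) + reversed + seq.substring(end);
    }

    public static void main(String[] args) {
        String seq = "gctacgc";
        System.out.println(invert(seq, 1, 4));            // gatccgc
        System.out.println(duplicate(seq, 1, 4));         // gctactacgc
        System.out.println(duplicateInverted(seq, 1, 4)); // gctaatccgc
        // Each method takes a sequence and returns a new one, so calls compose freely.
        System.out.println(duplicate(invert(seq, 1, 4), 0, 2)); // gagatccgc
    }
}
```

Because no method keeps state, any of them can be called on its own from other code, for example to apply only inversions to a set of sequences.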