This thesis presents a mathematical solution to a biochemical problem: we develop an algorithm that finds the optimal sequence of codons in a protein for expression in E. coli in real time. More precisely, we seek the codon sequence that minimizes the difference between the experimentally obtained synthesis times of the protein and the translation times of the individual amino acids, each computed as the local average over three consecutive codons.
We solve the problem with dynamic programming. We first derive a recursive formula that solves the optimization problem and then develop it into a computer algorithm, which we implement in Python. We first verify the algorithm on a small sample of three existing proteins. The sequence of target translation times is taken from the chosen proteins themselves, so we can check directly whether the algorithm's output matches the input data being approximated, and hence how successful it is. On this sample the algorithm's accuracy is 100 %. We then extend the test data to 900 randomly generated proteins, and the accuracy remains 100 %. We also analyze the running time of the algorithm, which is linear.
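The dynamic-programming idea can be illustrated with a minimal sketch. It is not the recursive formula derived in the thesis. It assumes that one target time is given for each window of three consecutive codons (so the two end positions have no target of their own), that the cost is the sum of the p-th powers of the absolute deviations, and that the state of the recursion is the pair of the two most recent codons. The names `optimize_codons`, `TOY_CODONS` and `TOY_TIMES`, and all numeric translation times, are hypothetical.

```python
# Hypothetical toy data: a few synonymous codons and made-up translation times.
TOY_CODONS = {"M": ["ATG"], "K": ["AAA", "AAG"], "L": ["CTG", "TTA"]}
TOY_TIMES = {"ATG": 1.0, "AAA": 2.0, "AAG": 4.0, "CTG": 1.0, "TTA": 5.0}


def optimize_codons(protein, codons_for, time_of, targets, p=2):
    """Return (codon list, total cost) minimizing sum |window average - target|**p.

    Assumption: targets[k] belongs to the window of codons k, k+1, k+2,
    so len(targets) == len(protein) - 2.
    """
    n = len(protein)
    if n < 3 or len(targets) != n - 2:
        raise ValueError("need at least 3 amino acids and n - 2 targets")

    # best[(a, b)] = minimal cost of a prefix ending with codons a, b.
    best = {(a, b): 0.0
            for a in codons_for[protein[0]]
            for b in codons_for[protein[1]]}
    back = []  # back[k][(b, c)] = codon a preceding b, c on the best path

    for i in range(2, n):
        nxt, ptr = {}, {}
        for (a, b), cost in best.items():
            for c in codons_for[protein[i]]:
                avg = (time_of[a] + time_of[b] + time_of[c]) / 3
                new_cost = cost + abs(avg - targets[i - 2]) ** p
                if (b, c) not in nxt or new_cost < nxt[(b, c)]:
                    nxt[(b, c)] = new_cost
                    ptr[(b, c)] = a
        back.append(ptr)
        best = nxt

    # Follow the back-pointers from the cheapest final state.
    state = min(best, key=best.get)
    total = best[state]
    reversed_path = [state[1], state[0]]
    for ptr in reversed(back):
        first = ptr[state]
        reversed_path.append(first)
        state = (first, state[0])
    return reversed_path[::-1], total
```

Because the number of synonymous codons per amino acid is bounded by a constant, each position costs a bounded amount of work in this sketch, which is consistent with a running time linear in the protein length.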
In the last part of the thesis we add noise (at rates of 1 %, 5 % and 10 %) to the data, so that the target times differ from the values the algorithm can compute from the given codon translation times. We test the accuracy of the algorithm again and find that it decreases as the noise rate increases. We conclude that the choice of norm in the recursive formula plays a key role in the algorithm's accuracy. We test the first, second and infinity norms and find that the algorithm is most accurate with the second norm, while the first and infinity norms give identical accuracy.
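As a rough illustration of the noise experiment and of the three norms, the following sketch perturbs a list of target times and measures the deviation between computed and target times. The noise model (independent, uniform, multiplicative noise within plus or minus the given rate) is an assumption, since the exact model is not stated here, and the function names `add_noise` and `deviation` are hypothetical.

```python
import math
import random


def add_noise(targets, rate, rng=None):
    """Scale each target by a random factor in [1 - rate, 1 + rate] (assumed model)."""
    rng = rng or random.Random()
    return [t * (1 + rng.uniform(-rate, rate)) for t in targets]


def deviation(computed, targets, norm=2):
    """Distance between computed and target times in the 1-, 2- or infinity norm."""
    diffs = [abs(c - t) for c, t in zip(computed, targets)]
    if norm == 1:
        return sum(diffs)
    if norm == 2:
        return math.sqrt(sum(d * d for d in diffs))
    if norm == math.inf:
        return max(diffs)
    raise ValueError("norm must be 1, 2 or math.inf")
```

For example, `add_noise(targets, 0.05, random.Random(0))` would produce a reproducible 5 % perturbation of `targets` under this assumed model.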