Sunday, 14 November 2021

RNA Splicing (Rosalind | English)

rosalind


Hi everyone, how are you? This time, I want to discuss about one problem (bioinformatics problem) that exist in rosalind.info's web. The title is "RNA Splicing". For the reference, you can first check out the problem that will be discussed (here). 

Overview

In this problem, we will given some DNA strings based on FASTA format. Those strings are consist of one main string (first string) and the others are the intron strings (the amount is unknown). The intron itself is a substring derived from the main string along with exon. 

Our task is to translate the exon string into protein string. As we know, a main string is string consists of introns and exons. Hence, if we want to find the exon then we have to remove the intron from the main string. 


A main string is consist of introns and exons.

The Code 

This is the code for solving this problem (with java language): 
  1. static int tot = 0;
  2. static void makeItDot (int awal, int akhir, char[] pat) {
  3. for (int i = awal; i <= akhir; i++) {
  4. pat[i] = '.';
  5. }
  6. }
  7. static int[] makeRev (char[] sample) {
  8. int[] rev = new int[sample.length]; rev[0] = 0; rev[1] = 0;
  9. int pvt = 0;
  10. for (int i = 2; i < sample.length; i++) {
  11. if (sample[i] != sample[pvt + 1]) {
  12. if (sample[i] == sample[1]) pvt = 1;
  13. else pvt = 0;
  14. }
  15. else {
  16. pvt++;
  17. }
  18. rev[i] = pvt;
  19. }
  20. return rev;
  21. }
  22. static void match (char[] pat, char[] sample) {
  23. int[] rev = makeRev(sample);
  24. int pvt = 0;
  25. for (int i = 1; i < pat.length; i++) {
  26. while (pat[i] != sample[pvt + 1] && pvt != 0) {
  27. pvt = rev[pvt];
  28. }
  29. if (pat[i] == sample[pvt + 1]) pvt++;
  30. if (pvt == sample.length - 1) {
  31. makeItDot(i - (sample.length - 1) + 1, i, pat);
  32. tot += sample.length - 1;
  33. break;
  34. }
  35. }
  36. }
  37.     static void solve() {
  38. HashMap<String, String> codonTable = new HashMap<>();
  39. codonTable.put("UUU", "F");      
  40. codonTable.put("UUC", "F");      
  41. codonTable.put("UUA", "L");      
  42. codonTable.put("UUG", "L");      
  43. codonTable.put("UCU", "S");      
  44. codonTable.put("UCC", "S");      
  45. codonTable.put("UCA", "S");      
  46. codonTable.put("UCG", "S");      
  47. codonTable.put("UAU", "Y");      
  48. codonTable.put("UAC", "Y");      
  49. codonTable.put("UAA", "Stop");   
  50. codonTable.put("UAG", "Stop");   
  51. codonTable.put("UGU", "C");      
  52. codonTable.put("UGC", "C");      
  53. codonTable.put("UGA", "Stop");   
  54. codonTable.put("UGG", "W");      
  55. codonTable.put("CUU", "L");      
  56. codonTable.put("CUC", "L");      
  57. codonTable.put("CUA", "L");      
  58. codonTable.put("CUG", "L");      
  59. codonTable.put("CCU", "P");      
  60. codonTable.put("CCC", "P");      
  61. codonTable.put("CCA", "P");      
  62. codonTable.put("CCG", "P");      
  63. codonTable.put("CAU", "H");      
  64. codonTable.put("CAC", "H");      
  65. codonTable.put("CAA", "Q");      
  66. codonTable.put("CAG", "Q");      
  67. codonTable.put("CGU", "R");      
  68. codonTable.put("CGC", "R");      
  69. codonTable.put("CGA", "R");      
  70. codonTable.put("CGG", "R");      
  71. codonTable.put("AUU", "I");      
  72. codonTable.put("AUC", "I");      
  73. codonTable.put("AUA", "I");      
  74. codonTable.put("AUG", "M");      
  75. codonTable.put("ACU", "T");      
  76. codonTable.put("ACC", "T");      
  77. codonTable.put("ACA", "T");      
  78. codonTable.put("ACG", "T");      
  79. codonTable.put("AAU", "N");      
  80. codonTable.put("AAC", "N");      
  81. codonTable.put("AAA", "K");      
  82. codonTable.put("AAG", "K");      
  83. codonTable.put("AGU", "S");      
  84. codonTable.put("AGC", "S");      
  85. codonTable.put("AGA", "R");      
  86. codonTable.put("AGG", "R");      
  87. codonTable.put("GUU", "V");
  88. codonTable.put("GUC", "V");
  89. codonTable.put("GUA", "V");
  90. codonTable.put("GUG", "V");
  91. codonTable.put("GCU", "A");
  92. codonTable.put("GCC", "A");
  93. codonTable.put("GCA", "A");
  94. codonTable.put("GCG", "A");
  95. codonTable.put("GAU", "D");
  96. codonTable.put("GAC", "D");
  97. codonTable.put("GAA", "E");
  98. codonTable.put("GAG", "E");
  99. codonTable.put("GGU", "G");
  100. codonTable.put("GGC", "G");
  101. codonTable.put("GGA", "G");
  102. codonTable.put("GGG", "G"); 
  103. Scanner sc = new Scanner(System.in);
  104. Vector<String> dnas = new Vector<>();
  105. while (sc.hasNext()) {
  106. String s = "";
  107. while (sc.hasNext()) {
  108. String masuk = sc.next();
  109. if (masuk.charAt(0) == '>') break;
  110. s += masuk;
  111. }
  112. if (s != "") dnas.add(s); 
  113. }
  114. Collections.sort(dnas, new Comparator<String>(){
  115. public int compare (String a, String b) {
  116. return b.length() - a.length();
  117. }
  118. });
  119. char[] pat = ("." + dnas.remove(0)).toCharArray();
  120. while (!dnas.isEmpty()) {
  121. char[] sample = ("." + dnas.remove(0)).toCharArray();
  122. match(pat, sample);
  123. }
  124. String s = "";
  125. for (char ch : pat) {
  126. if (ch != '.') s += ch;
  127. }
  128. for (int i = 0; i < s.length(); i+= 3) {
  129. String prot = codonTable.get(s.substring(i, i + 3).replace('T', 'U'));
  130. if (prot != "Stop") out.print(prot);
  131. }
  132.     }
  133.      
First, we make a RNA (or DNA) codon table in HashMap form (line 38-102). In sake of efficiency, we could use a DNA codon table; thus, we can directly translate the DNA into the protein string. Here, I chose to use RNA codon table. 

(In line 115 to 119 I used the "sort" function. This sort function is actually not needed here because the problem setter already guaranteed there is no intron relates each other [so that the answer is only one]. I used this kind of function just for anticipation.) 

The next step is matching those introns with the main string (line 120-124). For matching, I used Knuth-Morris-Pratt algorithm that written in line 2 to 36. Here, I was matching those strings and replacing the suitable substring (in the main string) with '.' (dot) in the same time. 

That dot character was removed in line 125 to 128 (I made a new main-string without '.' (dot), replacing the previous one). 

The last is translating the new main-string into a protein string, just like in "Translating RNA into Protein" problem. 

Input and Output

In the code above, I used next() function for entering the string-form dataset (line 109). 

I also used another input function called hasNext() (line 108). That function is very useful especially if we need to process an unknown-amount of data, just like a FASTA format data.

And for the output I used out.print() function (line 131). That function is a modification from System.out.print() function which is very familiar in java. You can see the additional code for that modification (input and output) in my complete code at github

That's it. If you want to ask something, you can write it in the comment section below. I hope this article is useful and see you in the next article!


Reference :
Source of Image 1 :https://www.facebook.com/ProjectRosalind/
Source of Image 2 :https://commons.wikimedia.org/wiki/File:DNA_exons_introns.gif

No comments:

Post a Comment