Monday 15 November 2021

Finding a Shared Motif (Rosalind | English)

rosalind


Hi everyone, how are you? This time, I want to discuss with you about one problem (bioinformatics problem) that exist in rosalind.info's web. The title is "Finding a Shared Motif". For the reference, you can first check out the problem that will be discussed (here). 

Overview

In this problem, we will given some DNA strings based on FASTA format. Our task is to find a longest common substring obtained from those strings. A longest common substring itself is a subtring that located in all given strings. 

The Code 

This is the code for solving this problem (with java language): 
  1. static void solve() {
  2. Scanner sc = new Scanner(System.in);
  3. sc.next();
  4. int flow = 0;
  5. TreeMap<String, Integer> subString = new TreeMap<>();
  6. while (sc.hasNext()) {
  7. flow++;
  8. HashMap<String, Boolean> substringYgUdah = new HashMap<>();
  9. String s = "";
  10. while (sc.hasNext()) {
  11. String masuk = sc.next();
  12. if (masuk.charAt(0) == '>') break;
  13. s += masuk;
  14. }
  15. if (flow == 1) {  
  16. for (int i = 0; i < s.length(); i++) {
  17. String walk = "";
  18. for (int j = i; j >= 0; j--) {
  19. walk = s.charAt(j) + walk;
  20. subString.put(walk, 1);
  21. }
  22. }
  23. }
  24. {
  25. for (int i = 0; i < s.length(); i++) {
  26. String walk = "";
  27. for (int j = i; j >= 0; j--) {
  28. walk = s.charAt(j) + walk;
  29. if (!substringYgUdah.containsKey(walk)) {
  30. if (subString.containsKey(walk)) subString.put(walk, subString.get(walk) + 1);
  31. substringYgUdah.put(walk, true);
  32. }
  33. }
  34. }
  35. }
  36. }
  37. String res = "";
  38. while (!subString.isEmpty()) {
  39. String s = subString.firstKey();
  40. if (subString.get(s) == flow) {
  41. if (s.length() > res.length()) {
  42. res = s;
  43. }
  44. }
  45. subString.pollFirstEntry();
  46. }
  47. out.println(res);
  48. }
  49.      
Because of what we looking for is a longest common substring, so the first thing what we should do is finding all of the candidate-answer substring. For easier, it is enough to pick a string from the first sample, let's call it l string. Here, we pick all substrings that located in l string (line 15-23). 

The second step is, after picking all l string's substring, then we just make sure whether those substrings are being located in all given strings (line 25-34). 

The last step is to find the satisfying substrings (substrings that present in each string), and then pick the longest one (line 37-47). 

Input and Output

In the code above, I used next() function for entering the string-form dataset (line 11). 

I also used another input function called hasNext() function (line 10). I used that because that function is very useful especially if we need to process an unknown-amount of data, just like a FASTA format data. 

And for the output I used out.println() function (line 47). That function is a modification from System.out.println() function which is very familiar in java. You can see the additional code for that modification (input and output) in my complete code at github

That's it. If you want to ask something, you can write it in the comment section below. I hope this article is useful and see you in the next article!


Reference :
Source of Image 1 :https://www.facebook.com/ProjectRosalind/

No comments:

Post a Comment