Sunday 22 May 2022

Genome Assembly as Shortest Superstring (Rosalind | English)

     

   rosalind


Hi everyone, how are you? This time, I want to discuss about one problem that exists in rosalind.info's web. The title is "Genome Assembly as Shortest Superstring". For the reference, you can first check out the problem that will be discussed (here). 




Fragment assembly: rearrange the separated DNA.


Overview 

In this problem we will given a several strings (let's call them array string S[]) in FASTA format. Our task is to find a superstring from the strings that exist in S[]. The superstring is, in analogy, if a collection of substring concatenates into a string, then a collection of string concatenates into a superstring. 

The idea for solving it is, first, we ordering the strings in S[] based on their prefix and suffix. After it ordered, remove the overlapping parts on S[] to prevent double writing. 

The Code 

This is the code for solving this problem using java language: 
  1. static class Data {
  2. int ances = -1;
  3. int success = -1;
  4. int diambil = 0;
  5. String strand;
  6. Data (String s) {
  7. this.strand = s;
  8. }
  9. }
  10. static void dirajang (Vector<Data> vStrand, Map<String, Vector<Integer>> cekPrev, Map<String, Vector<Integer>> cekSuff, int giliran) {
  11. String s = vStrand.lastElement().strand;
  12. int n = s.length();
  13. String prev = "";
  14. String suff = "";
  15. int diambil = 1;
  16. for (int i = 0; i < n / 2; i++) {
  17. prev += s.charAt(i);
  18. suff = s.charAt(n - 1 - i) + suff;
  19. }
  20. for (int i = n / 2; i < n; i++) {
  21. if (cekSuff.containsKey(prev += s.charAt(i))) {
  22. int cekSuffInt = cekSuff.get(prev).get(0);
  23. vStrand.lastElement().ances = cekSuffInt;
  24. vStrand.get(cekSuffInt).success = giliran;
  25. vStrand.lastElement().diambil = i + 1;
  26. cekSuff.get(prev).remove(0);
  27. }
  28. if (cekPrev.containsKey(suff = s.charAt(n - 1 - i) + suff)) {
  29. int cekPrevInt = cekPrev.get(suff).get(0);
  30. vStrand.lastElement().success = cekPrevInt;
  31. vStrand.get(cekPrevInt).ances = giliran;
  32. vStrand.get(cekPrevInt).diambil = i + 1;
  33. cekPrev.get(suff).remove(0);
  34. }
  35. if (!cekPrev.containsKey(prev)) cekPrev.put(prev, new Vector<>());
  36. cekPrev.get(prev).add(giliran);
  37. if (!cekSuff.containsKey(suff)) cekSuff.put(suff, new Vector<>());
  38. cekSuff.get(suff).add(giliran);
  39. diambil++;
  40. }
  41. }
  42. static void slashAndPrint (Data data) {
  43. for (int i = data.diambil; i < data.strand.length(); i++) {
  44. out.print(data.strand.charAt(i));
  45. }
  46. }
  47. static void solve() {
  48. Scanner sc = new Scanner(System.in);
  49. Map<String, Vector<Integer>> cekPrev = new HashMap<>();
  50. Map<String, Vector<Integer>> cekSuff = new HashMap<>();
  51. Vector<Data> vStrand = new Vector<>();
  52. int giliran = 0;
  53. sc.next();
  54. while (sc.hasNext()) {
  55. String s = "";
  56. while (sc.hasNext()) {
  57. String masuk = sc.next();
  58. if (masuk.charAt(0) == '>') break;
  59. s += masuk;
  60. }
  61. vStrand.add(new Data(s));
  62. dirajang(vStrand, cekPrev, cekSuff, giliran);
  63. giliran++;
  64. }
  65. for (int i = 0; i < vStrand.size(); i++) {
  66. if (vStrand.get(i).ances == -1) {
  67. int pvt = i;
  68. do {
  69. slashAndPrint(vStrand.get(pvt));
  70. pvt = vStrand.get(pvt).success;
  71. } while (pvt != -1);
  72. }
  73. }
  74.     
The Code Description

We start from line 47 (static void solve()). First, we enter the string S[] (line 54-64). While entering the string, we can also find a suitable prefix & suffix and combine them (line 62). 

In static void dirajang() function (line 10-41) I am giving the sequence mark (1, 2, 3, ...) to those string in string array S[]. That mark is used for marking their sequence while printing the answer. 

The last step is printing the answer. Put the strings of S[] based on their mark order. Combine them into one superstring and don't forget to erase the overlapping part of that. 

Input dan Output

As you can see, in the code above I used next() function for entering the string-form dataset (line 57). 

I also used another input function called hasNext() (line 56) because of that ability to entering the unknown amount of data. That function is very suitable to be used for entering a FASTA format-form data. 

While for the output I used out.print() function (line 44). That function is a modification function from System.out.print() which is the default function in java. For more details, you can see the additional code for that modification (input and output) in my complete code at github

That's it. If you want to ask something, you can write it in the comment section below. I hope this article is useful and see you in the next article! 


Reference :
Source image 1: https://www.facebook.com/ProjectRosalind/
Source image 2: https://www.researchgate.net/figure/DNA-assemblyDNA-assembly-1st-step-The-DNA-is-purified-and-extracted-2nd-step-DNA-is_fig9_252323690

No comments:

Post a Comment