Monday 8 November 2021

Consensus and Profile (Rosalind | English)

rosalind


Hi everyone, how are you? This time, I want to discuss a bioinformatics problem from the rosalind.info website. Its title is "Consensus and Profile". For reference, you can first check out the problem that will be discussed (here).

Overview

In this problem, we are given a set of strings in FASTA format that represent an unknown number of DNA strings, all of the same length. Our task is to determine the consensus string and the profile matrix of those DNA strings.

The consensus is a string formed by taking the most frequent character at each position.

The profile matrix is a matrix that records how often each DNA character (A, C, G, T) appears at each position. Here is how to find the consensus and the profile matrix:

Sample case:

              A T C C A G C T
              G G G C A A C T
              A T G G A T C T
DNA           A A G C A A C C
              T T G G A A C T
              A T G C C A T T
              A T G G C A C T

          A   5 1 0 0 5 5 0 0
Profile   C   0 0 1 4 2 0 6 1
          G   1 1 6 3 0 1 0 0
          T   1 5 0 0 0 1 1 6

Consensus     A T G C A A C T
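
To make the table above concrete, here is a minimal sketch that counts the characters column by column and then picks the most frequent one for each position. The class name ConsensusProfileSketch and the method names profile and consensus are my own placeholders, not part of the problem statement. I also assume that all strings have the same length and that a tie goes to the first character in the order A, C, G, T.

```java
import java.util.List;

class ConsensusProfileSketch {
    // Row order of the profile matrix: A, C, G, T.
    static final String SYMBOLS = "ACGT";

    // Counts how often each symbol appears in each column.
    // Assumes every string has the same length as the first one.
    public static int[][] profile(List<String> dna) {
        int m = dna.get(0).length();
        int[][] counts = new int[SYMBOLS.length()][m];
        for (String s : dna) {
            for (int j = 0; j < m; j++) {
                int row = SYMBOLS.indexOf(s.charAt(j));
                if (row >= 0) counts[row][j]++; // ignore anything that is not A, C, G or T
            }
        }
        return counts;
    }

    // Picks the symbol with the highest count in each column.
    // The strict '>' means ties are won by the earlier symbol in A, C, G, T order.
    public static String consensus(int[][] counts) {
        StringBuilder sb = new StringBuilder();
        for (int j = 0; j < counts[0].length; j++) {
            int best = 0;
            for (int r = 1; r < counts.length; r++) {
                if (counts[r][j] > counts[best][j]) best = r;
            }
            sb.append(SYMBOLS.charAt(best));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
            "ATCCAGCT", "GGGCAACT", "ATGGATCT", "AAGCAACC",
            "TTGGAACT", "ATGCCATT", "ATGGCACT");
        int[][] p = profile(sample);
        System.out.println(consensus(p)); // prints ATGCAACT
    }
}
```

For the sample case, the first row of the matrix returned by profile is 5 1 0 0 5 5 0 0, which matches the A row of the table.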

The Code 

This is the code for solving this problem (in Java):
  1. static void solve() {
  2.     Scanner sc = new Scanner(System.in);
  3.     Vector<String> vs = new Vector<>();
  4.     sc.next();
  5.     while (sc.hasNext()) {
  6.         String s = "";
  7.         while (sc.hasNext()) {
  8.             String line = sc.next();
  9.             if (line.charAt(0) == '>') break;
  10.             s += line;
  11.         }
  12.         vs.add(s);
  13.     }
  14.     int n = vs.size();
  15.     int m = vs.get(0).length();
  16.     int[][] karung = new int[4][m];
  17.     for (int i = 0; i < n; i++) {
  18.         for (int j = 0; j < m; j++) {
  19.             if (vs.get(i).charAt(j) == 'A') karung[0][j]++;
  20.             else if (vs.get(i).charAt(j) == 'C') karung[1][j]++;
  21.             else if (vs.get(i).charAt(j) == 'G') karung[2][j]++;
  22.             else if (vs.get(i).charAt(j) == 'T') karung[3][j]++;
  23.         }
  24.     }
  25.     char[] mol = {'A', 'C', 'G', 'T'};
  26.     String con = "";
  27.     for (int i = 0; i < m; i++) {
  28.         int Max = 0;
  29.         char winner = '.';
  30.         for (int j = 0; j < 4; j++) {
  31.             if (karung[j][i] > Max) {
  32.                 Max = karung[j][i];
  33.                 winner = mol[j];
  34.             }
  35.         }
  36.         con += winner;
  37.     }
  38.     out.println(con);
  39.     for (int i = 0; i < 4; i++) {
  40.         out.print(mol[i] + ": ");
  41.         for (int j = 0; j < m; j++) {
  42.             out.print(karung[i][j] + " ");
  43.         }
  44.         out.println();
  45.     }
  46. }

First, read the input data (lines 2-13). The first header is skipped on line 4, and each later header (a token starting with '>') ends the current string.

Then, build the profile matrix by counting the characters at each position (lines 14-24).

Build the consensus string from the profile matrix by picking the most frequently occurring character at each position (lines 25-37).

Print the consensus string and the profile matrix (lines 38-46).

Input and Output

In the code above, I used the next() method to read the dataset as strings (line 8).

I also used another input method called hasNext() (line 5). It is very useful when we need to process an unknown amount of data, such as data in FASTA format.
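
As a small illustration of hasNext() and next() working together, here is a sketch that reads FASTA records from a Scanner until the input runs out. The class name FastaTokens and the method name readSequences are placeholders of mine. I assume that each header is a single token without spaces (like ">Rosalind_1"); a header containing spaces would need line-based reading instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class FastaTokens {
    // Reads whitespace-separated tokens until the input ends.
    // A token starting with '>' opens a new record; every other token
    // is appended to the record that is currently open.
    public static List<String> readSequences(Scanner sc) {
        List<String> sequences = new ArrayList<>();
        StringBuilder current = null;
        while (sc.hasNext()) {            // true as long as another token exists
            String token = sc.next();     // never empty, so charAt(0) is safe
            if (token.charAt(0) == '>') {
                if (current != null) sequences.add(current.toString());
                current = new StringBuilder();
            } else if (current != null) {
                current.append(token);    // a sequence may span several lines
            }
        }
        if (current != null) sequences.add(current.toString());
        return sequences;
    }

    public static void main(String[] args) {
        String input = ">Rosalind_1\nATCCAGCT\n>Rosalind_2\nGGGC\nAACT\n";
        System.out.println(readSequences(new Scanner(input))); // prints [ATCCAGCT, GGGCAACT]
    }
}
```

Passing new Scanner(System.in) instead of a Scanner over a string reads the same format from standard input.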

For the output, I used the out.print() and out.println() methods (lines 38 and 40). They are modified versions of System.out.print() and System.out.println(), which are very familiar in Java. You can see the additional code for that modification (input and output) in my complete code on GitHub.
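
The helper itself is not shown in this article, so the following is only a guess at what such a modification commonly looks like: a PrintWriter wrapped around System.out with a buffer, which has to be flushed at the end. The class name BufferedOutSketch and the methods stdout and printRow are hypothetical; the actual code on GitHub may differ.

```java
import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;

class BufferedOutSketch {
    // Hypothetical stand-in for the article's `out`:
    // a buffered PrintWriter on top of System.out.
    public static PrintWriter stdout() {
        return new PrintWriter(new BufferedWriter(new OutputStreamWriter(System.out)));
    }

    // Prints one profile row in the same "A: 5 1 0 ..." shape as the code above.
    public static void printRow(PrintWriter out, char symbol, int[] counts) {
        out.print(symbol + ": ");
        for (int c : counts) {
            out.print(c + " ");
        }
        out.println();
    }

    public static void main(String[] args) {
        PrintWriter out = stdout();
        out.println("ATGCAACT");
        printRow(out, 'A', new int[] {5, 1, 0, 0, 5, 5, 0, 0});
        out.flush(); // without this, buffered output may never reach the console
    }
}
```

The usual reason for such a wrapper is that buffered output can be faster than many small System.out calls when a lot of data is printed.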

That's it. If you have any questions, you can write them in the comment section below. I hope this article is useful, and see you in the next article!


Reference:
Source of image 1: https://www.facebook.com/ProjectRosalind/
