module ProteinTranslation
let proteins (rna: string): string list =
let rec doProteins (rna: string) (proteins: string list): string list =
match rna[0..2] with
| "AUG" -> doProteins rna[3..] ("Methionine" :: proteins)
| "UUC" | "UUU" -> doProteins rna[3..] ("Phenylalanine" :: proteins)
| "UUA" | "UUG" -> doProteins rna[3..] ("Leucine" :: proteins)
| "UCU" | "UCC" | "UCA" | "UCG" -> doProteins rna[3..] ("Serine" :: proteins)
| "UAU" | "UAC" -> doProteins rna[3..] ("Tyrosine" :: proteins)
| "UGU" | "UGC" -> doProteins rna[3..] ("Cysteine" :: proteins)
| "UGG" -> doProteins rna[3..] ("Tryptophan" :: proteins)
| "UAA" | "UAG" | "UGA" | "" -> List.rev proteins
| _ -> failwith "Unknown coding"
doProteins rna []
In this approach, we'll define a recursive function that will recursively process the RNA sequence and keep track of the translated proteins.
Recursive translation
To use (tail call) recursion to translate the RNA to proteins, we'll introduce a helper function: doProteins
.
This function takes the remaining, unprocessed RNA and a list of translated proteins (the accumulator value):
let rec doProteins (rna: string) (proteins: string list): string list
We'll define this function inside the proteins
function (also known as a nested function), but it could just as well have been defined outside the proteins
function.
That said, its implementation is merely a helper to the proteins
function and is thus tied to that function, so to have it be close to where it is called often makes sense (it signals to the reader that the function should only be used within its parent function).
To allow a function to recursively call itself, the rec
modified must be added.
In other words: by default, functions cannot call themselves.
Translating
As each codon is three letters long, the doProteins
function looks at the first three letters of its codons
parameter.
For each translateable codon, we recursively call the doProteins
function, with the remainder of the codons (skipping the first three letters) and the codon's protein added to the proteins accumulator value as arguments.
match rna[0..2] with
| "AUG" -> doProteins rna[3..] ("Methionine" :: proteins)
| "UUC" -> doProteins rna[3..] ("Phenylalanine" :: proteins)
| "UUU" -> doProteins rna[3..] ("Phenylalanine" :: proteins)
| "UUA" -> doProteins rna[3..] ("Leucine" :: proteins)
| "UUG" -> doProteins rna[3..] ("Leucine" :: proteins)
| "UCU" -> doProteins rna[3..] ("Serine" :: proteins)
| "UCC" -> doProteins rna[3..] ("Serine" :: proteins)
| "UCA" -> doProteins rna[3..] ("Serine" :: proteins)
| "UCG" -> doProteins rna[3..] ("Serine" :: proteins)
| "UAU" -> doProteins rna[3..] ("Tyrosine" :: proteins)
| "UAC" -> doProteins rna[3..] ("Tyrosine" :: proteins)
| "UGU" -> doProteins rna[3..] ("Cysteine" :: proteins)
| "UGC" -> doProteins rna[3..] ("Cysteine" :: proteins)
| "UGG" -> doProteins rna[3..] ("Tryptophan" :: proteins)
Stopping
Next up is to handle the "STOP" proteins. We'll add branches for each of the three "STOP" proteins, which stop the recursion and instead return the (reversed) accumulator value:
| "UAA" -> List.rev proteins
| "UAG" -> List.rev proteins
| "UGA" -> List.rev proteins
There is one additional case we need to process, and that is when there are no codons left to process:
| "" -> List.rev proteins
We need to use List.rev
to reverse the proteins, as translated proteins are added at the head of the list (in the front).
Prepending an element to a list is much faster than appending an element.
In fact, it is so much faster that the penalty of having to reverse the list ends up being well worth it.
Unknown input
At this point, the F# compiler lets us know that we haven't handled all possible cases, which is true. The easiest thing to do is to throw an error if none of the previous branches matched:
| _ -> failwith "Unknown coding"
Combining patterns
You might have noticed that many of the branches end up doing the exact same thing. F# allows one to "chain" multiple patterns to reduce the duplication:
match rna[0..2] with
| "AUG" -> doProteins rna[3..] ("Methionine" :: proteins)
| "UUC" | "UUU" -> doProteins rna[3..] ("Phenylalanine" :: proteins)
| "UUA" | "UUG" -> doProteins rna[3..] ("Leucine" :: proteins)
| "UCU" | "UCC" | "UCA" | "UCG" -> doProteins rna[3..] ("Serine" :: proteins)
| "UAU" | "UAC" -> doProteins rna[3..] ("Tyrosine" :: proteins)
| "UGU" | "UGC" -> doProteins rna[3..] ("Cysteine" :: proteins)
| "UGG" -> doProteins rna[3..] ("Tryptophan" :: proteins)
| "UAA" | "UAG" | "UGA" | "" -> List.rev proteins
| _ -> failwith "Unknown coding"
Alignment
While definitely not needed, aligning the code vertically makes it more clear that the codon patterns all end up doing basically the same thing, but with a different protein:
match rna[0..2] with
| "AUG" -> doProteins rna[3..] ("Methionine" :: proteins)
| "UUC" | "UUU" -> doProteins rna[3..] ("Phenylalanine" :: proteins)
| "UUA" | "UUG" -> doProteins rna[3..] ("Leucine" :: proteins)
| "UCU" | "UCC" | "UCA" | "UCG" -> doProteins rna[3..] ("Serine" :: proteins)
| "UAU" | "UAC" -> doProteins rna[3..] ("Tyrosine" :: proteins)
| "UGU" | "UGC" -> doProteins rna[3..] ("Cysteine" :: proteins)
| "UGG" -> doProteins rna[3..] ("Tryptophan" :: proteins)
| "UAA" | "UAG" | "UGA" | "" -> List.rev proteins
| _ -> failwith "Unknown coding"
A downside of vertical alignment is that changes to the code require more work, as you'll need to ensure everything is still aligned. For this particular case, it isn't really an issue, as the codons are fixed and the code is thus unlikely to change.
Putting it all together
The final step is to call our recursive helper function:
doProteins rna []
And with that, we have a working, tail recursive implementation that translates the RNA to proteins.
Tail recursion prevents stack overflows when a recursive function is called many times. While the exercise does not have large test cases that would cause a stack overflow, it is good practice to always use using tail recursion when implementing a recursive functions. If you'd like to read more about tail recursion, this MSDN article goes into more detail. Another good resource on tail recursion is this blog post.