module ProteinTranslation
open System
let rec private doProteins (rna: ReadOnlySpan<char>) (proteins: string list): string list =
if rna.StartsWith("AUG") then doProteins (rna.Slice(3)) ("Methionine" :: proteins)
elif rna.StartsWith("UUC") then doProteins (rna.Slice(3)) ("Phenylalanine" :: proteins)
elif rna.StartsWith("UUU") then doProteins (rna.Slice(3)) ("Phenylalanine" :: proteins)
elif rna.StartsWith("UUA") then doProteins (rna.Slice(3)) ("Leucine" :: proteins)
elif rna.StartsWith("UUG") then doProteins (rna.Slice(3)) ("Leucine" :: proteins)
elif rna.StartsWith("UCU") then doProteins (rna.Slice(3)) ("Serine" :: proteins)
elif rna.StartsWith("UCC") then doProteins (rna.Slice(3)) ("Serine" :: proteins)
elif rna.StartsWith("UCA") then doProteins (rna.Slice(3)) ("Serine" :: proteins)
elif rna.StartsWith("UCG") then doProteins (rna.Slice(3)) ("Serine" :: proteins)
elif rna.StartsWith("UAU") then doProteins (rna.Slice(3)) ("Tyrosine" :: proteins)
elif rna.StartsWith("UAC") then doProteins (rna.Slice(3)) ("Tyrosine" :: proteins)
elif rna.StartsWith("UGU") then doProteins (rna.Slice(3)) ("Cysteine" :: proteins)
elif rna.StartsWith("UGC") then doProteins (rna.Slice(3)) ("Cysteine" :: proteins)
elif rna.StartsWith("UGG") then doProteins (rna.Slice(3)) ("Tryptophan" :: proteins)
elif rna.StartsWith("UAA") then List.rev proteins
elif rna.StartsWith("UAG") then List.rev proteins
elif rna.StartsWith("UGA") then List.rev proteins
elif rna.IsEmpty then List.rev proteins
else failwith "Unknown coding"
let proteins (rna: string): string list =
doProteins (rna.AsSpan()) []
In this approach, we'll define a recursive function that will recursively process the codons and keep track of the translated proteins.
The codons will be passed as ReadOnlySpan<char>
instances instead of string
instances, to prevent re-allocating a new string for each recursive call.
Recursive translation
To use (tail call) recursion to translate the RNA to proteins, we'll introduce a helper function: doProteins
.
This function takes the remaining, unprocessed RNA as a ReadOnlySpan<char>
and a list of translated proteins strings (the accumulator value):
let rec doProteins (rna: ReadOnlySpan<char>) (proteins: string list): string list
We'll define this function inside the proteins
function (also known as a nested function), but it could just as well have been defined outside the proteins
function.
That said, its implementation is merely a helper to the proteins
function and is thus tied to that function, so to have it be close to where it is called often makes sense (it signals to the reader that the function should only be used within its parent function).
To allow a function to recursively call itself, the rec
modified must be added.
In other words: by default, functions cannot call themselves.
Translating
As each codon is three letters long, the doProteins
function looks at the first three letters of its codons
parameter via its StartsWith()
method.
For each translateable codon, we recursively call the doProteins
function, with the remainder of the codons (skipping the first three letters) and the codon's protein added to the proteins accumulator value as arguments.
We skip over the first three letters via the Slice()
method, which does not allocate a new string
but only a new ReadOnlySpan
.
The underlying string
remains the same, but the view of that string is offset by 3.
if rna.StartsWith("AUG") then doProteins (rna.Slice(3)) ("Methionine" :: proteins)
elif rna.StartsWith("UUC") then doProteins (rna.Slice(3)) ("Phenylalanine" :: proteins)
elif rna.StartsWith("UUU") then doProteins (rna.Slice(3)) ("Phenylalanine" :: proteins)
elif rna.StartsWith("UUA") then doProteins (rna.Slice(3)) ("Leucine" :: proteins)
elif rna.StartsWith("UUG") then doProteins (rna.Slice(3)) ("Leucine" :: proteins)
elif rna.StartsWith("UCU") then doProteins (rna.Slice(3)) ("Serine" :: proteins)
elif rna.StartsWith("UCC") then doProteins (rna.Slice(3)) ("Serine" :: proteins)
elif rna.StartsWith("UCA") then doProteins (rna.Slice(3)) ("Serine" :: proteins)
elif rna.StartsWith("UCG") then doProteins (rna.Slice(3)) ("Serine" :: proteins)
elif rna.StartsWith("UAU") then doProteins (rna.Slice(3)) ("Tyrosine" :: proteins)
elif rna.StartsWith("UAC") then doProteins (rna.Slice(3)) ("Tyrosine" :: proteins)
elif rna.StartsWith("UGU") then doProteins (rna.Slice(3)) ("Cysteine" :: proteins)
elif rna.StartsWith("UGC") then doProteins (rna.Slice(3)) ("Cysteine" :: proteins)
elif rna.StartsWith("UGG") then doProteins (rna.Slice(3)) ("Tryptophan" :: proteins)
Stopping
Next up is to handle the "STOP" proteins. We'll add conditions for each of the three "STOP" proteins, which stop the recursion and instead return the (reversed) accumulator value:
elif rna.StartsWith("UAA") then List.rev proteins
elif rna.StartsWith("UAG") then List.rev proteins
elif rna.StartsWith("UGA") then List.rev proteins
There is one additional case we need to process, and that is when there are no codons left to process (for which we use its IsEmpty
property):
elif rna.IsEmpty then List.rev protein
We need to use List.rev
to reverse the proteins, as translated proteins are added at the head of the list (in the front).
Prepending an element to a list is much faster than appending an element.
In fact, it is so much faster that the penalty of having to reverse the list ends up being well worth it.
Unknown input
At this point, the F# compiler lets us know that we haven't handled all possible cases, which is true. The easiest thing to do is to throw an error if none of the previous branches matched:
else failwith "Unknown coding"
Alignment
While definitely not needed, aligning the code vertically makes it more clear that the codon patterns all end up doing basically the same thing, but with a different protein:
if rna.StartsWith("AUG") then doProteins (rna.Slice(3)) ("Methionine" :: proteins)
elif rna.StartsWith("UUC") then doProteins (rna.Slice(3)) ("Phenylalanine" :: proteins)
elif rna.StartsWith("UUU") then doProteins (rna.Slice(3)) ("Phenylalanine" :: proteins)
elif rna.StartsWith("UUA") then doProteins (rna.Slice(3)) ("Leucine" :: proteins)
elif rna.StartsWith("UUG") then doProteins (rna.Slice(3)) ("Leucine" :: proteins)
elif rna.StartsWith("UCU") then doProteins (rna.Slice(3)) ("Serine" :: proteins)
elif rna.StartsWith("UCC") then doProteins (rna.Slice(3)) ("Serine" :: proteins)
elif rna.StartsWith("UCA") then doProteins (rna.Slice(3)) ("Serine" :: proteins)
elif rna.StartsWith("UCG") then doProteins (rna.Slice(3)) ("Serine" :: proteins)
elif rna.StartsWith("UAU") then doProteins (rna.Slice(3)) ("Tyrosine" :: proteins)
elif rna.StartsWith("UAC") then doProteins (rna.Slice(3)) ("Tyrosine" :: proteins)
elif rna.StartsWith("UGU") then doProteins (rna.Slice(3)) ("Cysteine" :: proteins)
elif rna.StartsWith("UGC") then doProteins (rna.Slice(3)) ("Cysteine" :: proteins)
elif rna.StartsWith("UGG") then doProteins (rna.Slice(3)) ("Tryptophan" :: proteins)
elif rna.StartsWith("UAA") then List.rev proteins
elif rna.StartsWith("UAG") then List.rev proteins
elif rna.StartsWith("UGA") then List.rev proteins
elif rna.IsEmpty then List.rev proteins
else failwith "Unknown coding"
A downside of vertical alignment is that changes to the code require more work, as you'll need to ensure everything is still aligned. For this particular case, it isn't really an issue, as the codons are fixed and the code is thus unlikely to change.
Putting it all together
The final step is to call our recursive helper function, converting our RNA string
to a ReadOnlySpan<char>
using its AsSpan()
method:
doProteins (rna.AsSpan()) []
And with that, we have a working, tail recursive implementation that translates the RNA to proteins whilst minimizing string allocations.
Tail recursion prevents stack overflows when a recursive function is called many times. While the exercise does not have large test cases that would cause a stack overflow, it is good practice to always use using tail recursion when implementing a recursive functions. If you'd like to read more about tail recursion, this MSDN article goes into more detail. Another good resource on tail recursion is this blog post.