Span<T>

Protein Translation
Protein Translation in F#
module ProteinTranslation

open System

let rec private doProteins (rna: ReadOnlySpan<char>) (proteins: string list): string list =
    if   rna.StartsWith("AUG") then doProteins (rna.Slice(3)) ("Methionine"    :: proteins)
    elif rna.StartsWith("UUC") then doProteins (rna.Slice(3)) ("Phenylalanine" :: proteins)
    elif rna.StartsWith("UUU") then doProteins (rna.Slice(3)) ("Phenylalanine" :: proteins)
    elif rna.StartsWith("UUA") then doProteins (rna.Slice(3)) ("Leucine"       :: proteins)
    elif rna.StartsWith("UUG") then doProteins (rna.Slice(3)) ("Leucine"       :: proteins)
    elif rna.StartsWith("UCU") then doProteins (rna.Slice(3)) ("Serine"        :: proteins)
    elif rna.StartsWith("UCC") then doProteins (rna.Slice(3)) ("Serine"        :: proteins)
    elif rna.StartsWith("UCA") then doProteins (rna.Slice(3)) ("Serine"        :: proteins)
    elif rna.StartsWith("UCG") then doProteins (rna.Slice(3)) ("Serine"        :: proteins)
    elif rna.StartsWith("UAU") then doProteins (rna.Slice(3)) ("Tyrosine"      :: proteins)
    elif rna.StartsWith("UAC") then doProteins (rna.Slice(3)) ("Tyrosine"      :: proteins)
    elif rna.StartsWith("UGU") then doProteins (rna.Slice(3)) ("Cysteine"      :: proteins)
    elif rna.StartsWith("UGC") then doProteins (rna.Slice(3)) ("Cysteine"      :: proteins)
    elif rna.StartsWith("UGG") then doProteins (rna.Slice(3)) ("Tryptophan"    :: proteins)
    elif rna.StartsWith("UAA") then List.rev proteins
    elif rna.StartsWith("UAG") then List.rev proteins
    elif rna.StartsWith("UGA") then List.rev proteins
    elif rna.IsEmpty           then List.rev proteins
    else failwith "Unknown coding"

let proteins (rna: string): string list =
    doProteins (rna.AsSpan()) []

In this approach, we'll define a recursive function that will recursively process the codons and keep track of the translated proteins. The codons will be passed as ReadOnlySpan<char> instances instead of string instances, to prevent re-allocating a new string for each recursive call.

Recursive translation

To use (tail call) recursion to translate the RNA to proteins, we'll introduce a helper function: doProteins. This function takes the remaining, unprocessed RNA as a ReadOnlySpan<char> and a list of translated proteins strings (the accumulator value):

let rec doProteins (rna: ReadOnlySpan<char>) (proteins: string list): string list

We'll define this function inside the proteins function (also known as a nested function), but it could just as well have been defined outside the proteins function. That said, its implementation is merely a helper to the proteins function and is thus tied to that function, so to have it be close to where it is called often makes sense (it signals to the reader that the function should only be used within its parent function).

Note

To allow a function to recursively call itself, the rec modified must be added. In other words: by default, functions cannot call themselves.

Translating

As each codon is three letters long, the doProteins function looks at the first three letters of its codons parameter via its StartsWith() method. For each translateable codon, we recursively call the doProteins function, with the remainder of the codons (skipping the first three letters) and the codon's protein added to the proteins accumulator value as arguments.

Note

We skip over the first three letters via the Slice() method, which does not allocate a new string but only a new ReadOnlySpan. The underlying string remains the same, but the view of that string is offset by 3.

if   rna.StartsWith("AUG") then doProteins (rna.Slice(3)) ("Methionine"    :: proteins)
elif rna.StartsWith("UUC") then doProteins (rna.Slice(3)) ("Phenylalanine" :: proteins)
elif rna.StartsWith("UUU") then doProteins (rna.Slice(3)) ("Phenylalanine" :: proteins)
elif rna.StartsWith("UUA") then doProteins (rna.Slice(3)) ("Leucine"       :: proteins)
elif rna.StartsWith("UUG") then doProteins (rna.Slice(3)) ("Leucine"       :: proteins)
elif rna.StartsWith("UCU") then doProteins (rna.Slice(3)) ("Serine"        :: proteins)
elif rna.StartsWith("UCC") then doProteins (rna.Slice(3)) ("Serine"        :: proteins)
elif rna.StartsWith("UCA") then doProteins (rna.Slice(3)) ("Serine"        :: proteins)
elif rna.StartsWith("UCG") then doProteins (rna.Slice(3)) ("Serine"        :: proteins)
elif rna.StartsWith("UAU") then doProteins (rna.Slice(3)) ("Tyrosine"      :: proteins)
elif rna.StartsWith("UAC") then doProteins (rna.Slice(3)) ("Tyrosine"      :: proteins)
elif rna.StartsWith("UGU") then doProteins (rna.Slice(3)) ("Cysteine"      :: proteins)
elif rna.StartsWith("UGC") then doProteins (rna.Slice(3)) ("Cysteine"      :: proteins)
elif rna.StartsWith("UGG") then doProteins (rna.Slice(3)) ("Tryptophan"    :: proteins)

Stopping

Next up is to handle the "STOP" proteins. We'll add conditions for each of the three "STOP" proteins, which stop the recursion and instead return the (reversed) accumulator value:

elif rna.StartsWith("UAA") then List.rev proteins
elif rna.StartsWith("UAG") then List.rev proteins
elif rna.StartsWith("UGA") then List.rev proteins

There is one additional case we need to process, and that is when there are no codons left to process (for which we use its IsEmpty property):

elif rna.IsEmpty then List.rev protein
Note

We need to use List.rev to reverse the proteins, as translated proteins are added at the head of the list (in the front). Prepending an element to a list is much faster than appending an element. In fact, it is so much faster that the penalty of having to reverse the list ends up being well worth it.

Unknown input

At this point, the F# compiler lets us know that we haven't handled all possible cases, which is true. The easiest thing to do is to throw an error if none of the previous branches matched:

else failwith "Unknown coding"

Alignment

While definitely not needed, aligning the code vertically makes it more clear that the codon patterns all end up doing basically the same thing, but with a different protein:

if   rna.StartsWith("AUG") then doProteins (rna.Slice(3)) ("Methionine"    :: proteins)
elif rna.StartsWith("UUC") then doProteins (rna.Slice(3)) ("Phenylalanine" :: proteins)
elif rna.StartsWith("UUU") then doProteins (rna.Slice(3)) ("Phenylalanine" :: proteins)
elif rna.StartsWith("UUA") then doProteins (rna.Slice(3)) ("Leucine"       :: proteins)
elif rna.StartsWith("UUG") then doProteins (rna.Slice(3)) ("Leucine"       :: proteins)
elif rna.StartsWith("UCU") then doProteins (rna.Slice(3)) ("Serine"        :: proteins)
elif rna.StartsWith("UCC") then doProteins (rna.Slice(3)) ("Serine"        :: proteins)
elif rna.StartsWith("UCA") then doProteins (rna.Slice(3)) ("Serine"        :: proteins)
elif rna.StartsWith("UCG") then doProteins (rna.Slice(3)) ("Serine"        :: proteins)
elif rna.StartsWith("UAU") then doProteins (rna.Slice(3)) ("Tyrosine"      :: proteins)
elif rna.StartsWith("UAC") then doProteins (rna.Slice(3)) ("Tyrosine"      :: proteins)
elif rna.StartsWith("UGU") then doProteins (rna.Slice(3)) ("Cysteine"      :: proteins)
elif rna.StartsWith("UGC") then doProteins (rna.Slice(3)) ("Cysteine"      :: proteins)
elif rna.StartsWith("UGG") then doProteins (rna.Slice(3)) ("Tryptophan"    :: proteins)
elif rna.StartsWith("UAA") then List.rev proteins
elif rna.StartsWith("UAG") then List.rev proteins
elif rna.StartsWith("UGA") then List.rev proteins
elif rna.IsEmpty           then List.rev proteins
else failwith "Unknown coding"
Note

A downside of vertical alignment is that changes to the code require more work, as you'll need to ensure everything is still aligned. For this particular case, it isn't really an issue, as the codons are fixed and the code is thus unlikely to change.

Putting it all together

The final step is to call our recursive helper function, converting our RNA string to a ReadOnlySpan<char> using its AsSpan() method:

doProteins (rna.AsSpan()) []

And with that, we have a working, tail recursive implementation that translates the RNA to proteins whilst minimizing string allocations.

Note

Tail recursion prevents stack overflows when a recursive function is called many times. While the exercise does not have large test cases that would cause a stack overflow, it is good practice to always use using tail recursion when implementing a recursive functions. If you'd like to read more about tail recursion, this MSDN article goes into more detail. Another good resource on tail recursion is this blog post.

30th Oct 2024 · Found it useful?