Seq module

Protein Translation
Protein Translation in F#
module ProteinTranslation

let private codonToProtein (codon: string): string =
    match codon with
    | "AUG"                         -> "Methionine"
    | "UUC" | "UUU"                 -> "Phenylalanine"
    | "UUA" | "UUG"                 -> "Leucine"
    | "UCU" | "UCC" | "UCA" | "UCG" -> "Serine"
    | "UAU" | "UAC"                 -> "Tyrosine"
    | "UGU" | "UGC"                 -> "Cysteine"
    | "UGG"                         -> "Tryptophan"
    | "UAA" | "UAG" | "UGA"         -> "STOP"
    | _ -> failwith "Invalid codon"

let proteins (rna: string): string list =
    rna
    |> Seq.chunkBySize 3
    |> Seq.map System.String
    |> Seq.map codonToProtein
    |> Seq.takeWhile (fun protein -> protein <> "STOP")
    |> Seq.toList

This approach combines a number of functions from the Seq module to build up a pipeline to translate the RNA to proteins.

Translating

Let's define a codonToProtein function that takes a string parameter representing the codon and returns the translated protein, also as a string:

let private codonToProtein (codon: string): string
Note

We could have defined this function as a nested function within the proteins function, but as it could reasonably be used outside the proteins function, we chose not to.

Within the function, we simply pattern match on the codon to translate it to its corresponding protein:

match codon with
| "AUG" -> "Methionine"
| "UUC" -> "Phenylalanine"
| "UUU" -> "Phenylalanine"
| "UUA" -> "Leucine"
| "UUG" -> "Leucine"
| "UCU" -> "Serine"
| "UCC" -> "Serine"
| "UCA" -> "Serine"
| "UCG" -> "Serine"
| "UAU" -> "Tyrosine"
| "UAC" -> "Tyrosine"
| "UGU" -> "Cysteine"
| "UGC" -> "Cysteine"
| "UGG" -> "Tryptophan"
| "UAA" -> "STOP"
| "UAG" -> "STOP"
| "UGA" -> "STOP"
| _ -> failwith "Invalid codon"

Combining patterns

You might have noticed that many of the branches end up doing the exact same thing. F# allows one to "chain" multiple patterns to reduce the duplication:

match codon with
| "AUG" -> "Methionine"
| "UUC" | "UUU" -> "Phenylalanine"
| "UUA" | "UUG" -> "Leucine"
| "UCU" | "UCC" | "UCA" | "UCG" -> "Serine"
| "UAU" | "UAC" -> "Tyrosine"
| "UGU" | "UGC" -> "Cysteine"
| "UGG" -> "Tryptophan"
| "UAA" | "UAG" | "UGA" -> "STOP"
| _ -> failwith "Unknown coding"

Alignment

While definitely not needed, aligning the code vertically makes it more clear that the codon patterns all end up doing basically the same thing, but with a different protein:

match codon with
| "AUG"                         -> "Methionine"
| "UUC" | "UUU"                 -> "Phenylalanine"
| "UUA" | "UUG"                 -> "Leucine"
| "UCU" | "UCC" | "UCA" | "UCG" -> "Serine"
| "UAU" | "UAC"                 -> "Tyrosine"
| "UGU" | "UGC"                 -> "Cysteine"
| "UGG"                         -> "Tryptophan"
| "UAA" | "UAG" | "UGA"         -> "STOP"
| _ -> failwith "Unknown coding"

Using function instead of match

As the codonToProtein function has one argument on which the function then immediately starts pattern matching, one could rewrite it using the function keyword:

let private codonToProtein =
    function
    | "AUG"                         -> "Methionine"
    | "UUC" | "UUU"                 -> "Phenylalanine"
    | "UUA" | "UUG"                 -> "Leucine"
    | "UCU" | "UCC" | "UCA" | "UCG" -> "Serine"
    | "UAU" | "UAC"                 -> "Tyrosine"
    | "UGU" | "UGC"                 -> "Cysteine"
    | "UGG"                         -> "Tryptophan"
    | "UAA" | "UAG" | "UGA"         -> "STOP"
    | _ -> failwith "Invalid codon"

What function does is to return a one-parameter function that pattern matches on that parameter. The following definitions are thus functional equivalent:

let private codonToProtein codon =
    match codon with
    | "AUG" -> "Methionine"

let private codonToProtein =
    function
    | "AUG" -> "Methionine"

let private codonToProtein =
    fun codon ->
        match codon with
        | "AUG" -> "Methionine"

Whilst they are all valid, the first option (function with parameter and explicit match) is preferred for readability.

Putting it all together

Now that we can translated RNA to proteins, let's build up a working solution.

Split RNA sequence into codons

First, let's split our rna parameter into chunks of three letters:

rna
|> Seq.chunkBySize 3
|> Seq.map System.String

The Seq.chunkBySize function will split the string into a sequence of three letters. We then use Seq.map to convert those three letter sequences into strings (basically: concatenating the letters), which represent the codons.

Converting to proteins

Next up is converting the codons to proteins, once again using Seq.map:

|> Seq.map codonToProtein

Stopping

Then we'll need to stop processing when we encounter a "STOP" protein. For that we can use Seq.takeWhile, where we pass in a lambda function that checks if the protein is not the "STOP" protein. It is isn't the element is preserved, and the next element is checked, until either there are no elements or a "STOP" protein is found (this protein is not included in the results).

|> Seq.takeWhile (fun protein -> protein <> "STOP")
Note

One could also write the above as:

|> Seq.takeWhile ((<>) "STOP")

However, this is arguably less readable.

Converting to a list

Finally, we convert the sequence to a list via Seq.toList:

|> Seq.toList

Combining it all

This gives us the following pipeline:

let proteins (rna: string): string list =
    rna
    |> Seq.chunkBySize 3
    |> Seq.map System.String
    |> Seq.map codonToProtein
    |> Seq.takeWhile (fun protein -> protein <> "STOP")
    |> Seq.toList

We now have a working implementation that translates the RNA to proteins.

30th Oct 2024 · Found it useful?

Other Approaches to Protein Translation in F#

Other ways our community solved this exercise
match rna[0..2] with
| "AUG" ->                         doProteins rna[3..] ("Methionine"    :: proteins)
| "UUC" | "UUU" ->                 doProteins rna[3..] ("Phenylalanine" :: proteins)
| "UUA" | "UUG" ->                 doProteins rna[3..] ("Leucine"       :: proteins)
| "UCU" | "UCC" | "UCA" | "UCG" -> doProteins rna[3..] ("Serine"        :: proteins)
| "UGU" | "UGC" ->                 doProteins rna[3..] ("Cysteine"      :: proteins)
| "UGG" ->                         doProteins rna[3..] ("Tryptophan"    :: proteins)
| "UAA" | "UAG" | "UGA" | "" ->    List.rev proteins
Recursion

Use recursion to translate the RNA to proteins.

let doProteins (rna: string): (string * string) option =
    match rna[0..2] with
    | "AUG" ->                      Some ("Methionine",    rna[3..])
    | "UUC" | "UUU" ->              Some ("Phenylalanine", rna[3..])
    | "UGU" | "UGC" ->              Some ("Cysteine",      rna[3..])
    | "UAA" | "UAG" | "UGA" | "" -> None

let proteins (rna: string): string list = List.unfold doProteins rna
Unfold

Use List.unfold to translate the RNA to proteins

let rec doProteins (rna: ReadOnlySpan<char>) (proteins: string list): string list =
    if   rna.StartsWith("AUG") then doProteins (rna.Slice(3)) ("Methionine"    :: proteins)
    elif rna.StartsWith("UUC") then doProteins (rna.Slice(3)) ("Phenylalanine" :: proteins)
    elif rna.StartsWith("UAC") then doProteins (rna.Slice(3)) ("Tyrosine"      :: proteins)
    elif rna.StartsWith("UGU") then doProteins (rna.Slice(3)) ("Cysteine"      :: proteins)
    elif rna.StartsWith("UGC") then doProteins (rna.Slice(3)) ("Cysteine"      :: proteins)
    elif rna.StartsWith("UGA") then List.rev proteins
    elif rna.IsEmpty           then List.rev proteins
Span<T>

Use Span<T> to efficiently translate the RNA to proteins