import scala.annotation.tailrec
object ProteinTranslation {
def proteins(input: String): Seq[String] = {
codonsToProteins(input, Seq())
}
@tailrec
private def codonsToProteins(
input: String,
proteins: Seq[String]
): Seq[String] = {
if (input.length < 3)
proteins
else
codonToProtein(input.take(3)) match {
case "STOP" => proteins
case protein =>
codonsToProteins(input.drop(3), proteins :+ protein)
}
}
private def codonToProtein(codon: String): String = {
codon match {
case "AUG" => "Methionine"
case "UUU" | "UUC" => "Phenylalanine"
case "UUA" | "UUG" => "Leucine"
case "UCU" | "UCC" | "UCA" | "UCG" => "Serine"
case "UAU" | "UAC" => "Tyrosine"
case "UGU" | "UGC" => "Cysteine"
case "UGG" => "Tryptophan"
case "UAA" | "UAG" | "UGA" => "STOP"
}
}
}
This approach starts by importing from packages for what is needed.
The proteins()
method starts by having calling the codonsToProteins()
method,
passing in the input String
and en empty Sequence.
The codonsToProteins()
method is annotated with the @tailrec
annotation to verify that the method can be compiled
with tail call optimization.
A tail call is a particular form of recursion where the last call in the method is a call to the same method and nothing else.
In other words, if the last call in recurMe()
is recurMe(arg1, arg2) + 1
, the + 1
makes the recursion non-tail recursive.
If the last call in recurMe()
is recurMe(arg1, arg2, acc + 1)
, then the recursion is a tail call, because only the method is being called
with no other operation being peformed on it.
If the length of the input String
is less than 3
, the method returns the Sequence
of proteins.
Otherwise, a match
expression is used to perform pattern matching on the result of passing the codon
to the codonToProtein()
method.
The codon is set by using the take()
method to get the first three Char
s of the input String
.
The codonToProtein()
method uses a match
to look up the protein for the codon and returns the protein from the method.
If the protein is for a STOP
codon, the match
returns the Sequence
of proteins from the method.
Otherwise, the method calls itself, using the drop()
method to pass in all but the first 3
Char
s of the input String
,
and passing in the existing Sequence
of proteins with the protein added to the end of it with the :+
operator.
The take()
and drop()
methods treat a string as a plain sequence of Char code units and makes no attempt to keep surrogate pairs or codepoint sequences together.
The user is responsible for making sure such cases are handled correctly.
Failing to do so may result in an invalid Unicode string.
The take()
and drop()
methods work here because all of the characters are ASCII.