Approach: use re.sub
import re
def abbreviate_regex_sub(to_abbreviate):
    pattern = re.compile(r"(?<!_)\B[\w']+|[ ,\-_]")
    return re.sub(pattern, "", to_abbreviate).upper()

###OR###

def abbreviate_regex_sub(to_abbreviate):
    return re.sub(r"(?<!_)\B[\w']+|[ ,\-_]", "", to_abbreviate.upper())
This approach begins by using the re.sub() method from the re module to "scrub" (remove) unwanted characters such as ', -, _, and white space, along with all but the first letter of each word, from to_abbreviate.
Python's re module provides support for regular expressions within the language, and has many useful methods for searching, parsing, and modifying text.
sub() searches text for all matches of a pattern, substituting a replacement string for each one (in our case, an empty string).
Regular expression matching starts at the left-hand side of the input and travels toward the right.
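As a minimal sketch of the substitution step on its own, the snippet below removes every match of a simplified pattern by replacing it with an empty string. The sample string and the reduced pattern are assumptions chosen for illustration and are not part of the solution.

```python
import re

# Hypothetical sample input, used only to illustrate re.sub().
sample = "a-b_c, d"

# Each space, comma, hyphen, or underscore is replaced with the empty string.
cleaned = re.sub(r"[ ,\-_]", "", sample)
print(cleaned)  # abcd
```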
While it is a fun experiment to see if the entire problem can be more or less solved with a single regex, the excessive backtracking used in this solution slows down performance considerably. This solution tested the slowest of all solutions during benchmarking, taking 652 steps in the regex engine to find and replace 82 matches.
A more performant method of cleaning would be to use re.findall() or re.finditer() to pick the words out of the phrase, and then process the results with a list comprehension or loop to extract the first letter of each word.
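One way this could look is sketched below. The function name abbreviate_findall and the pattern are assumptions rather than the benchmarked solution, and the sketch assumes words are made of ASCII letters and apostrophes and never start with an apostrophe.

```python
import re

def abbreviate_findall(to_abbreviate):
    # Hypothetical helper: collect runs of letters and apostrophes as "words".
    # Spaces, commas, hyphens, and underscores are not in the class, so they
    # act as separators and never appear in the results.
    words = re.findall(r"[A-Za-z']+", to_abbreviate)
    # Take the first letter of each word and uppercase the joined result.
    return "".join(word[0] for word in words).upper()

print(abbreviate_findall("The Road _Not_ Taken"))  # TRNT
```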
to_abbreviate.replace("_", " ").replace("-", " ").upper().split() can also be used, and is even more performant here for cleaning test inputs.
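A minimal sketch of how that cleaning chain might be combined with first-letter extraction follows. The function name abbreviate_split is an assumption for illustration.

```python
def abbreviate_split(to_abbreviate):
    # Turn underscores and hyphens into spaces, uppercase, then split on
    # runs of white space. Empty strings never appear in the result of split().
    words = to_abbreviate.replace("_", " ").replace("-", " ").upper().split()
    # Join the first character of each remaining word.
    return "".join(word[0] for word in words)

print(abbreviate_split("The Road _Not_ Taken"))  # TRNT
```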
However, if nothing but a regular expression will do, the third-party regex module provides more tools for lookarounds, recursion, partial matches, and nested sets. Experimenting with that third-party library in your local environment (the Exercism Python track does not support third-party libraries) could aid in optimizing this complicated regular expression and help with extracting first letters to form acronyms.
The regular expression (?<!_)\B[\w']+|[ ,\-_] in the code example above has two alternatives for matching.
For convenience and reuse, the regex is compiled using re.compile().
Alternatives are separated with the pipe (|) symbol:
- (?<!_) is a negative lookbehind, which prevents the first alternative from matching at a position immediately preceded by _. Because _ is itself a word character, this keeps the first letter of a word such as _Not from being removed (the leading _ is removed by the second alternative instead).
- \B[\w']+ can only start matching at a position that is not a word boundary, that is, inside a word, so the first letter of each word is skipped. It then looks for any character in the class [\w'], which covers letters, digits, the underscore, and the apostrophe. The + operator is a 'greedy' quantifier that matches the preceding character class one to unlimited times. This means that this expression will match any run of these characters, but will not match on anything else.
- [ ,\-_] matches any one of the characters space, comma, hyphen, or underscore.
Because these matches are used in the re.sub() method, an empty string is substituted, so the matches are removed from the result.
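To see what the negative lookbehind contributes, the sketch below runs the substitution with and without it on the same phrase. The variable names are assumptions for illustration. The second pattern is the original with only (?<!_) removed.

```python
import re

phrase = "The Road _Not_ Taken"

# Full pattern: the N directly after the underscore is protected.
with_lookbehind = re.sub(r"(?<!_)\B[\w']+|[ ,\-_]", "", phrase)

# Lookbehind removed: the position between _ and N is not a word boundary,
# so "Not_" is matched and removed as a whole.
without_lookbehind = re.sub(r"\B[\w']+|[ ,\-_]", "", phrase)

print(with_lookbehind)     # TRNT
print(without_lookbehind)  # TRT
```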
As an example, for the input phrase The Road _Not_ Taken, the regex will match he, a space, oad, a space, _, ot_ (the trailing underscore is included because \w also matches _), a space, and aken, replacing each match with ''.
The result is the string TRNT.
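The individual matches can be inspected with re.findall(), as in the short sketch below. The variable names are assumptions for illustration. The pattern has no capturing groups, so findall() returns the full text of each match.

```python
import re

pattern = re.compile(r"(?<!_)\B[\w']+|[ ,\-_]")

# '_' is a word character, so the underscore after "ot" is part of that match.
matches = pattern.findall("The Road _Not_ Taken")
print(matches)  # ['he', ' ', 'oad', ' ', '_', 'ot_', ' ', 'aken']

# Removing the same matches leaves only the first letters.
print(pattern.sub("", "The Road _Not_ Taken"))  # TRNT
```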
To ensure that all results are capitalized for any input, the approach then chains .upper() to re.sub() on the return line to produce the final acronym.
To play with this regex and see a more in-depth explanation, you can use it on regex101.