Regex Sub

Acronym in Python

Approach: use re.sub

import re


def abbreviate_regex_sub(to_abbreviate):
    pattern = re.compile(r"(?<!_)\B[\w']+|[ ,\-_]")

    return re.sub(pattern, "", to_abbreviate).upper()


###OR###

def abbreviate_regex_sub(to_abbreviate):
    return re.sub(r"(?<!_)\B[\w']+|[ ,\-_]", "", to_abbreviate.upper())

This approach begins by using the re.sub() function from the re module to "scrub" (remove) unwanted characters such as ', -, _, commas, white space, and all but the first letter of each word from to_abbreviate. Python's re module provides support for regular expressions within the language, and has many useful functions for searching, parsing, and modifying text.

sub() searches text for all non-overlapping matches of the pattern, substituting a replacement string (in our case, an empty string) for each one. Regular expression matching starts at the left-hand side of the input and travels toward the right.
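As a small illustration of that left-to-right substitution, here is a sketch that uses a made-up pattern and sample string (neither is part of the solution) to show an empty replacement deleting every match:

```python
import re

# Hypothetical sample input: letters mixed with runs of digits.
text = "a1b22c333"

# re.sub(pattern, replacement, string) replaces each non-overlapping match,
# scanning left to right. An empty replacement deletes the match.
cleaned = re.sub(r"\d+", "", text)

# re.subn() performs the same substitution and also reports
# how many replacements were made.
cleaned_again, count = re.subn(r"\d+", "", text)

print(cleaned)  # abc
print(count)    # 3
```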

Caution

While it is a fun experiment to see if the entire problem can be more or less solved with a single regex, the excessive backtracking used in this solution slows down performance considerably. This solution tested the slowest of all solutions during benchmarking, taking 652 steps in the regex engine to find and replace 82 matches.
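One way to check a claim like this on your own machine is with the standard-library timeit module. The sketch below is not the benchmark the figures above come from: the sample phrase, the function names, and the repeat count are all assumptions, and the timings will vary by machine and Python version. It compares the re.sub() solution against a plain str.replace()/str.split() pipeline.

```python
import re
import timeit

PHRASE = "Complementary metal-oxide semiconductor"  # made-up sample input


def with_regex_sub(text):
    # The single-regex approach from this page.
    return re.sub(r"(?<!_)\B[\w']+|[ ,\-_]", "", text).upper()


def with_split(text):
    # Regex-free cleaning, then first letters via a generator expression.
    words = text.replace("_", " ").replace("-", " ").upper().split()
    return "".join(word[0] for word in words)


# Both functions should agree before their timings are compared.
assert with_regex_sub(PHRASE) == with_split(PHRASE) == "CMOS"

regex_seconds = timeit.timeit(lambda: with_regex_sub(PHRASE), number=10_000)
split_seconds = timeit.timeit(lambda: with_split(PHRASE), number=10_000)

print(f"re.sub:     {regex_seconds:.4f}s")
print(f"split/join: {split_seconds:.4f}s")
```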

A more performant method of cleaning would be to use re.findall() or re.finditer() to pull the words out of the phrase, and then process the results with a list comprehension or loop to extract the first letter of each word. to_abbreviate.replace("_", " ").replace("-", " ").upper().split() can also be used, and is even more performant here for cleaning test inputs.
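A minimal sketch of the re.finditer() route might look like the following. The function name is hypothetical, and the pattern assumes that words are runs of ASCII letters and apostrophes and that no word begins with an apostrophe.

```python
import re


def abbreviate_finditer(to_abbreviate):
    # Each match is one "word": a run of letters and apostrophes.
    # Spaces, commas, hyphens, and underscores act as separators
    # simply because the pattern never matches them.
    words = re.finditer(r"[a-zA-Z']+", to_abbreviate)

    # match.group() is the matched word; keep only its first character.
    return "".join(match.group()[0] for match in words).upper()


print(abbreviate_finditer("The Road _Not_ Taken"))  # TRNT
print(abbreviate_finditer("Halley's Comet"))        # HC
```

Because finditer() yields match objects lazily, no intermediate list of words is built before the first letters are joined.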

However, if nothing but a regular expression will do, the third-party regex module provides more tools for lookarounds, recursion, partial matches, and nested sets. Experimenting with that third-party library in your local environment (the Exercism Python track does not support third-party libraries) could aid in optimizing this complicated regular expression and help with extracting first letters to form acronyms.

The regular expression (?<!_)\B[\w']+|[ ,\-_] in the code example above has two alternatives for matching, separated by the pipe (|) symbol. For convenience and reuse, the first version of the function compiles the regex using re.compile(). The first alternative is made up of the parts described in items 1 and 2 below; the second alternative is item 3:

  1. (?<!_) is a negative lookbehind, which prevents a match from starting immediately after an underscore. This protects the first letter of a word that is wrapped in underscores (for example, in _Not_ the N is left alone, while the leading _, which follows a space, is still matched by the second alternative).
  2. \B[\w']+ starts at a non-word boundary (in practice, inside a word) and looks for any character in the group abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_'. By default in Python 3, \w also matches Unicode letters and digits. The + quantifier is 'greedy' and matches a character from the preceding group one to unlimited times. This means that this expression will match any run of word characters (plus '), but will not match anything else.
  3. [ ,\-_] matches any one of the characters space, comma, hyphen, or underscore, once.

Because these matches are used in the re.sub() method, an empty string is substituted - so the matches are removed from the result.
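To see what the negative lookbehind contributes, the sketch below runs the pattern with and without it on the same phrase. The variant without the lookbehind is a hypothetical pattern written only for this comparison.

```python
import re

phrase = "The Road _Not_ Taken"

with_lookbehind = r"(?<!_)\B[\w']+|[ ,\-_]"
# Hypothetical variant for comparison only: the same pattern minus (?<!_).
without_lookbehind = r"\B[\w']+|[ ,\-_]"

# With the lookbehind, the N after the underscore survives.
print(re.sub(with_lookbehind, "", phrase).upper())     # TRNT

# Without it, "Not_" is matched as a whole (an underscore is a word
# character, so the position between _ and N is not a word boundary),
# and the N is removed along with the rest of the word.
print(re.sub(without_lookbehind, "", phrase).upper())  # TRT
```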

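As an example, for the input phrase The Road _Not_ Taken, the regex will match he, a space, oad, a space, _, ot_ (the trailing underscore is a word character, so it is consumed together with ot), a space, and aken, replacing each match with ''. After .upper() is applied, the result is the string TRNT.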
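The matches for that phrase can be inspected directly with re.findall(), which returns whole matches here because the pattern has no capturing groups. The variable names in this sketch are illustrative only.

```python
import re

PATTERN = re.compile(r"(?<!_)\B[\w']+|[ ,\-_]")
phrase = "The Road _Not_ Taken"

# Every substring that re.sub() would replace, in left-to-right order.
matches = PATTERN.findall(phrase)
print(matches)
# ['he', ' ', 'oad', ' ', '_', 'ot_', ' ', 'aken']

# Removing those matches leaves only the first letters.
print(PATTERN.sub("", phrase).upper())  # TRNT
```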

To ensure that all results are capitalized for any input, the approach then chains .upper() to re.sub() on the return line to produce the final acronym.

To play with this regex and see a more in-depth explanation, you can use it on regex101.

20th Nov 2024

Other Approaches to Acronym in Python

Other ways our community solved this exercise
from functools import reduce

def abbreviate(to_abbreviate):
    phrase = to_abbreviate.replace("_", " ").replace("-", " ").upper().split()

    return reduce(lambda start, word: start + word[0], phrase, "")
Functools Reduce

Use functools.reduce() to form an acronym from text cleaned using str.replace().

def abbreviate(to_abbreviate):
    phrase = to_abbreviate.replace('-', ' ').replace('_', ' ').upper().split()

    # note the lack of square brackets around the comprehension.
    return ''.join(word[0] for word in phrase)
Generator Expression

Use a generator expression with str.join() to form an acronym from text cleaned using str.replace().

def abbreviate(to_abbreviate):
    phrase = to_abbreviate.replace('-', ' ').replace('_', ' ').upper().split()

    return ''.join([word[0] for word in phrase])
List Comprehension

Use a list comprehension with str.join() to form an acronym from text cleaned using str.replace().

def abbreviate(to_abbreviate):
    phrase = to_abbreviate.replace('-', ' ').replace('_', ' ').upper().split()
    acronym = ''

    for word in phrase:
        acronym += word[0]

    return acronym
Loop

Use str.replace() to clean the input string and a loop with string concatenation to form the acronym.

def abbreviate(to_abbreviate):
    phrase = to_abbreviate.replace("_", " ").replace("-", " ").upper().split()
    
    return ''.join(map(lambda word: word[0], phrase))
Map Built-in

Use the built-in map() function to form an acronym after cleaning the input string with str.replace().

import re

def abbreviate(phrase):
    removed = re.findall(r"[a-zA-Z']+", phrase)

    return ''.join(word[0] for word in removed).upper()
Regex join

Use regex to clean the input string and form the acronym with str.join().