Regex join

Acronym in Python
import re

###re.findall###

def abbreviate(to_abbreviate):
    #Uppercase the input before cleaning.
    removed = re.findall(r"[a-zA-Z']+", to_abbreviate.upper())
    
    return ''.join(word[0] for word in removed)

#OR#

def abbreviate(to_abbreviate):
    #Uppercase the result after joining.
    return ''.join(word[0] for word in
                   re.findall(r"[a-zA-Z']+", to_abbreviate)).upper()
                   
###re.finditer###

def abbreviate(to_abbreviate):
    #Uppercase the input before cleaning.
    removed = re.finditer(r"[a-zA-Z']+", to_abbreviate.upper())

    #word.group(0)[0] (first letter of Matched word) can also be written as
    #word[0][0], with the first bracketed number referring to Match group 0.
    return ''.join(word.group(0)[0] for word in removed)

#OR#

def abbreviate(to_abbreviate):
    #Uppercase the output after joining.
    #Use bracket notation for Match group.
    return ''.join(word[0][0] for word in
                   re.finditer(r"[a-zA-Z']+", to_abbreviate)).upper()

This approach begins by using the re.findall() function from the re module to "scrub" (remove) non-letter characters such as -, _, and white space from to_abbreviate. Apostrophes are kept, because the pattern treats ' as part of a word. Python's re module provides support for regular expressions within the language, and has many useful functions for searching, parsing, and modifying text. Regular expression matching starts at the left-hand side of the input and travels toward the right.

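re.findall() searches text for all non-overlapping matches of a pattern and, when the pattern has no capture groups, returns them as a list of strings. Empty matches are included in the result, although the pattern used here cannot produce any because + requires at least one character.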
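As a minimal sketch of that return value, the snippet below runs the same pattern over an illustrative phrase (the sample text is an assumption, not part of the exercise code):

```python
import re

phrase = "Portable Network Graphics"  # illustrative input

# With no capture groups in the pattern, findall() returns a list of strings.
words = re.findall(r"[a-zA-Z']+", phrase)
print(words)  # ['Portable', 'Network', 'Graphics']

# When nothing matches, the result is an empty list rather than None.
no_words = re.findall(r"[a-zA-Z']+", "123 456")
print(no_words)  # []
```

Because the result is an ordinary list, it can be indexed, sliced, or iterated more than once.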

The re.finditer() function works in the same fashion as re.findall(), but returns results as a lazy iterator over Match objects. This means that re.finditer() produces matches on demand instead of building the whole list in memory, but both the iterator and each Match object have to be unpacked to get at the matched text.
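The difference is easier to see by stepping through the iterator by hand. This is only a sketch with an illustrative phrase; the bracket form match[0] assumes Python 3.6 or later:

```python
import re

matches = re.finditer(r"[a-zA-Z']+", "Portable Network Graphics")

# finditer() hands back an iterator, so matches are pulled one at a time.
first = next(matches)
print(first.group(0))  # Portable
print(first[0][0])     # P  (first[0] is shorthand for first.group(0))
print(first.span())    # (0, 8)

# The remaining matches are still waiting in the iterator.
rest = [match[0] for match in matches]
print(rest)  # ['Network', 'Graphics']
```

Once the iterator has been consumed it yields nothing further, which is worth keeping in mind if the matches are needed more than once.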

The regular expression r"[a-zA-Z']+" in the code example uses a character class that matches any single character in the range a-z lowercase or A-Z uppercase, plus the ' (apostrophe) character. The + is a 'greedy' quantifier that matches the preceding character class one or more times. This means that the expression will match any run of letters and apostrophes (a word), but will not match white space or other 'non-letter' characters, such as \t, \n, a space, _, or -.

For example, in Complementary metal-oxide semiconductor, the regex will match Complementary, metal, oxide, and semiconductor. The regex will not match the spaces or the -. The result returned by findall() will then be ['Complementary', 'metal', 'oxide', 'semiconductor'].
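The same behavior can be checked on a few inputs. In this sketch the constant name WORD and the second and third phrases are illustrative assumptions:

```python
import re

WORD = r"[a-zA-Z']+"  # hypothetical name for the pattern used above

hyphenated = re.findall(WORD, "Complementary metal-oxide semiconductor")
print(hyphenated)  # ['Complementary', 'metal', 'oxide', 'semiconductor']

# The apostrophe is inside the character class, so it stays part of the word.
possessive = re.findall(WORD, "Halley's Comet")
print(possessive)  # ["Halley's", 'Comet']

# Underscores are not in the class, so they are dropped like other separators.
underscored = re.findall(WORD, "The Road _Not_ Taken")
print(underscored)  # ['The', 'Road', 'Not', 'Taken']
```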

Note

to_abbreviate.replace("_", " ").replace("-", " ").upper().split() can also be used to 'scrub' to_abbreviate and turn the results into a list. The .replace() approach benchmarked faster than using re.findall()/re.finditer() to 'scrub', most likely due to overhead in importing the re module and in the backtracking behavior of regex searching and matching.
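One rough way to compare the two cleaning strategies is with timeit. The helper names scrub_with_replace and scrub_with_regex are hypothetical, and this is a sketch rather than the benchmark referred to above; timings will vary with machine, input, and Python version:

```python
import re
import timeit

def scrub_with_replace(text):
    # Hypothetical helper: swap separators for spaces, then split on white space.
    return text.replace("_", " ").replace("-", " ").upper().split()

def scrub_with_regex(text):
    # Hypothetical helper: collect runs of letters and apostrophes.
    return re.findall(r"[a-zA-Z']+", text.upper())

sample = "Complementary metal-oxide semiconductor"

# Both helpers produce the same word list for this input.
same_words = scrub_with_replace(sample) == scrub_with_regex(sample)

replace_time = timeit.timeit(lambda: scrub_with_replace(sample), number=10_000)
regex_time = timeit.timeit(lambda: scrub_with_regex(sample), number=10_000)
print(f"replace: {replace_time:.4f}s  regex: {regex_time:.4f}s")
```

The two helpers are not interchangeable on every input: the .replace() version leaves punctuation such as commas attached to words, while the regex version drops it.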

Once findall() or finditer() completes, a generator expression is used to iterate through the results and select the first letter of each word via bracket notation. Note that when using finditer(), the Match object has to be unpacked via match.group(0)/match[0] before the first letter can be selected.

Generator expressions are short-form generators - lazy iterators that produce their values on demand, instead of saving them to memory. This generator expression is consumed by str.join(), which joins the generated letters together using an empty string. Other "separator" strings can be used with str.join() - see string-methods for some additional examples.
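A small sketch of how the generator expression and str.join() interact, using an illustrative word list:

```python
words = ["Portable", "Network", "Graphics"]  # illustrative input

# The generator yields one first letter at a time as join() asks for it.
letters = (word[0] for word in words)
acronym = "".join(letters)
print(acronym)  # PNG

# A generator can only be consumed once; a second join() finds nothing left.
leftover = "".join(letters)
print(repr(leftover))  # ''

# Any string can act as the separator.
dotted = ".".join(word[0] for word in words)
print(dotted)  # P.N.G
```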

Finally, the result of .join() is uppercased using the chained .upper(). Alternatively, .upper() can be called on to_abbreviate inside the findall()/finditer() call, to uppercase the input before cleaning. Since the generator expression, join, and upper are fairly succinct together, they can be placed directly on the return line rather than assigning and returning an intermediate variable for the acronym.

This approach was less performant in benchmarks than the approaches using a loop, map(), a list comprehension, and reduce().

20th Nov 2024

Other Approaches to Acronym in Python

Other ways our community solved this exercise
from functools import reduce

def abbreviate(to_abbreviate):
    phrase = to_abbreviate.replace("_", " ").replace("-", " ").upper().split()

    return reduce(lambda start, word: start + word[0], phrase, "")
Functools Reduce

Use functools.reduce() to form an acronym from text cleaned using str.replace().

def abbreviate(to_abbreviate):
    phrase = to_abbreviate.replace('-', ' ').replace('_', ' ').upper().split()

    # note the lack of square brackets around the comprehension.
    return ''.join(word[0] for word in phrase)
Generator Expression

Use a generator expression with str.join() to form an acronym from text cleaned using str.replace().

def abbreviate(to_abbreviate):
    phrase = to_abbreviate.replace('-', ' ').replace('_', ' ').upper().split()

    return ''.join([word[0] for word in phrase])
List Comprehension

Use a list comprehension with str.join() to form an acronym from text cleaned using str.replace().

def abbreviate(to_abbreviate):
    phrase = to_abbreviate.replace('-', ' ').replace('_', ' ').upper().split()
    acronym = ''

    for word in phrase:
        acronym += word[0]

    return acronym
Loop

Use str.replace() to clean the input string and a loop with string concatenation to form the acronym.

def abbreviate(to_abbreviate):
    phrase = to_abbreviate.replace("_", " ").replace("-", " ").upper().split()
    
    return ''.join(map(lambda word: word[0], phrase))
Map Built-in

Use the built-in map() function to form an acronym after cleaning the input string with str.replace().

import re

def abbreviate_regex_sub(to_abbreviate):
    pattern = re.compile(r"(?<!_)\B[\w']+|[ ,\-_]")

    return re.sub(pattern, "", to_abbreviate.upper())
Regex Sub

Use re.sub() to clean the input string and create the acronym in one step.