Filter with All on a HashSet

Isogram
Isogram in Rust
use std::collections::HashSet;

pub fn check(candidate: &str) -> bool {
    let mut hs = HashSet::new();
    candidate
        .bytes()
        .filter(|c| c.is_ascii_alphabetic())
        .map(|c| c.to_ascii_lowercase())
        .all(|c| hs.insert(c))
}

With this approach you will instantiate and update a HashSet to keep track of the used letters.

A use declaration allows directly calling HashSet instead of calling it with its entire namespace. Without the use declaration, the HashSet would be instantiated like so

let mut hs = std::collections::HashSet::new();

After the HashSet is instantiated, a series of functions are chained from the candidate &str.

  • Since all of the characters are ASCII, they can be iterated with the bytes method. Each byte is iterated as a u8, which is an unsigned 8-bit integer.
  • The filter method borrows each byte as a reference to a u8 (&u8). Inside of its closure it tests each byte to see if it is_ascii_alphabetic. Only bytes which are ASCII letters will survive the filter to be passed on to the map method.
  • The map method calls to_ascii_lowercase on each byte.
  • Each lowercased byte is then tested by the all method by using the insert method of HashSet. all will return true if every call to insert returns true. If a call to insert returns false then all will "short-circuit" and immediately return false. The insert method returns whether the value is newly inserted. So, for the word "alpha", insert will return true when the first a is inserted, but will return false when the second a is inserted.

Refactoring

using the str method to_ascii_lowercase and no map

You might want to to call the str method to_ascii_lowercase and save calling map, like so

candidate
    .to_ascii_lowercase()
    .bytes()
    .filter(|c| c.is_ascii_alphabetic())
    .all(|c| hs.insert(c))

However, changing the case of all characters in a str raised the average benchmark a few nanoseconds. It is a bit faster to filter out non-ASCII letters and to change the case of each surviving byte. Since the performance is fairly close, either may be prefered.

using filter_map

Since filter and map are used, this approach could be refactored using the filter_map method.

use std::collections::HashSet;

pub fn check(candidate: &str) -> bool {
    let mut hs = HashSet::new();
    candidate
        .bytes()
        .filter_map(|c| c.is_ascii_alphabetic().then(|| c.to_ascii_lowercase()))
        .all(|c| hs.insert(c))
}

By chaining the then method to the result of is_ascii_alphabetic, and calling to_ascii_lowercase in the closure for then, the filter map passes only lowercased ASCII letter bytes to the all method. In benchmarking, this approach was slightly slower, but its style may be prefered.

supporting Unicode

By substituting the chars method for the bytes method, and by using the char methods is_alphabetic and to_lowercase, this approach can support Unicode characters.

use std::collections::HashSet;

pub fn check(candidate: &str) -> bool {
    let mut hs = std::collections::HashSet::new();
    candidate
        .chars()
        .filter(|c| c.is_alphabetic())
        .map(|c| c.to_lowercase().to_string())
        .all(|c| hs.insert(c))
}

Usually an approach that supports Unicode will be slower than one that supports only bytes. However the benchmark for this approach was significantly slower, taking more than twice as long as the bytes approach. It can be further refactored to use the str to_lowercase method and remove the map method to cut the benchmark down closer to the byte approach.

use std::collections::HashSet;

pub fn check(candidate: &str) -> bool {
    let mut hs = std::collections::HashSet::new();
    candidate
        .to_lowercase()
        .chars()
        .filter(|c| c.is_alphabetic())
        .all(|c| hs.insert(c))
}

To more completely support Unicode, an external crate, such as unicode-segmentation, could be used. This is becasue the std::char can not fully handle things such as grapheme clusters.

11th Sep 2024 · Found it useful?