Counting characters

Today, we are interested into counting the number of occurrences of every character in a text. For example, in the string "foobar!!", the characters 'f', 'b', 'a', and 'r' appear once each, while the characters 'o' and '!' each appear twice.

Exercise 0.a: Create a new Rust binary project named char-counter in which our code will be placed.

The count_chars() function

A text is usually made of several lines. We need a function which takes those lines and returns the occurrences of the various characters.

Exercise 0.b: Write a function with the following prototype:

fn count_chars(input: &[&str]) -> HashMap<char, usize>

This function will build a HashMap whose keys are the various characters and the values are the number of occurrences of each character. To do so, it will loop over the &str of input, and over the characters of each &str.

Tips for implementing count_chars()

The str::chars() method may be useful: it returns an iterator over the characters of a string.

Also, if h is a HashMap<K, V> and V implements Default, the following idiomatic construct may be useful:

  h.entry(some_key).or_default()

It returns a &mut value where value is the existing value for key some_key; if the entry for some_key does not exist in the map, it is created with V::default(), the default value for the type V, and a mutable reference to this new inserted value is returned.

So when implementing a counter, the following construct can often be found, since all integer types have a default value of 0:

  *h.entry(some_key).or_default() += 1;

This increments the value corresponding to some_key, or set it to 1 if it didn't exist in the map before (0 which is the default value + 1).

Note: do not stay stuck on this one. If after 15 minutes you feel that you will take too much time to write this function, you can get inspiration from this code and go on with the lab.

Testing the function

Exercise 0.b: Print the occurrences of every character in the text made from the two lines "Hello, world!", and "I am happy to be here.".

Check by hand that you get the expected result.

Given that HashMap<K, V> implements the Debug trait if K and V do, you can use (:#? pretty-prints the debug information):

fn main() {
    let text = ["Hello world!", "I am happy to be here."];
    let freq = count_chars(&text);
    println!("{freq:#?}");
}

You should obtain something like:

{
    'H': 1,
    'b': 1,
    'y': 1,
    'a': 2,
    ' ': 6,
    '!': 1,
    't': 1,
    'l': 3,
    'd': 1,
    'I': 1,
    'p': 2,
    'o': 3,
    'r': 2,
    'w': 1,
    'm': 1,
    'e': 4,
    '.': 1,
    'h': 2,
}

You will probably agree that this is not very pleasant to read. We can do better.

Prettier display

Note: if you feel that you do not have the time to implement this part, feel free to get some code from here (this is not the point of this lab).

In order to make our program useful, we want to:

  • sort the characters by occurrences;
  • display only the 10 most frequent characters (in this two lines the choice will be arbitrary, with larger text it will get interesting).

How can we do that? Learn or remember that:

  • Iterating over a HashMap<K, V> gives you an iterator over pairs (K, V). Those pairs could be collected into a Vec<(K, V)>.
  • Vectors can be sorted. By using the sort_by_key() method, we can sort it by the key we choose. For example, vec.sort_by_key(|(_k, v)| v) will sort the vector elements according to the second element v of the pair.
  • Iterating over a vector can be reversed: vec.into_iter().rev() gives an iterator over the vector from the end to the beginning.
  • An iterator can be limited to its first N elements using the .take(N) method.

Exercise 0.c: display only the 10 most frequent characters in the text. Keeping our example, it should print something like:

  - ' ': 6 occurrences
  - 'e': 4 occurrences
  - 'l': 3 occurrences
  - 'o': 3 occurrences
  - 'a': 2 occurrences
  - 'r': 2 occurrences
  - 'h': 2 occurrences
  - 'p': 2 occurrences
  - 'y': 1 occurrences
  - 'w': 1 occurrences

And no, we do not care about the extra "s" in "1 occurrences". Fix it if you want but this has 0 importance.