1.6 example most common words in a text

Here is a programming task.

Find the n most common words in a given text.

Input: Text which is a list of Chars (so a String), where spaces (' ') and newlines ('\n') are treated as blank characters.
Output: You should output a sorted list of the number of times a word occurs in a the input text. More specifically, if there are n unique words, there are n lines of the format f"{word}: {occurrences}."

In python, see counter.py:

"""
This module demonstrates the use of the Counter class
from the collections module to count the occurrences
of words in a given input string.
"""

from collections import Counter

N = 5
HILL = """
When the Anglo-saxons invaded Britain it is clear that
they took over many place names as names, without understanding 
their meaning. The evidence is to be found in names like Penhill, 
where Old English hyll was added unnecessarily to a word which 
represented Old Welsh pann, hill. A Penhill in Lancashire developed 
into Pendle Hill, a name which means hill-hill-hill. England also has 
a Torpenhow Hill, or hill-hill-hill-hill."""

data = HILL.split()
HILL = HILL.lower()
numerical_data = Counter(data)
sorted_data = sorted(
    list(numerical_data.keys()),
    key = lambda x : numerical_data[x],
    reverse = True
)

for word in sorted_data[:max(N, len(sorted_data))]:
    print(f"{word}: {numerical_data[word]}")

Haskell doesn't let you abstract-ify away what's actually going on, but this python code is still useful to have a rough understanding of how to approach it.

Ultimately, whatever program we write is of the form:

mostCommon :: Int -> [Char] -> [Char] -- typedef
mostCommon n input_string -- using it

For pseduocode:

  1. convert all Chars to lower case
  2. split by all whitespace characters to get all words
  3. sort the list into alphabetical order
    1. In the python code this is abstracted away by the Counter instance.
  4. split the list into sections, where each section has the same word
  5. replace each section by the number of times each word appears in it
  6. sort this in reverse descending order
  7. take the first n and convert each one of these into the output format.

Great! So we should first go off with defining some types, which will be borrowed from the course notes:

> type Text = [Char]
> type Word = [Char]
> type Run = (Int, Word)

We utilize these Runs as (occurrence, word) pairs. Then, we want a lowercase function, which will act as a map onto the [Char]s:

map canonical :: Text -> Text

Then we want a way to convert text into words via breaking of whitespace:

words :: Text -> [Word]

We then want to sort it, so we utilize the built-in function

sort :: Ord a => [a] -> [a]

Notice this new notation here. Everything to the left of the => is a condition on the type class a, namely that there has to exist an Ordering on the type.

Then we want a function to convert runs of identical words to their corresponding Run type, returning the length of the run and the actual word. Again, we have a built-in function:

group :: Eq a => [a] -> [[a]]

This function requires an equality on the type a, and returns a list of lists where each sub-list is a collection of those types a. Specifically, group splits up [a] into maximal lists of adjacent equal elements.

So, we need our function to ultimately look like:

codeRun :: [Word] -> Run

To convert each sub-list to its Run type.

Then we sort the runs in descending order, as Haskell has a reverse function built in. Take the first n elements with the take function, and then handle output via a function showRun :: Run -> String and utilize concat . map showRun on the Run's and hey presto! That's the program.

In the code from Oxford, it looks like:

mostCommon :: Int -> Text -> String
mostCommon n =  concat . -- :: [String] -> String
        map showRun .    -- :: [Run] -> [String]
        take n .         -- :: [Run] -> [Run]
        reverse .        -- :: [Run] -> [Run]
        sort .           -- :: [Run] -> [Run]
        map codeRun .    -- :: [Word] -> [Run]
        group .          -- :: [Word] -> [[Word]]
        sort .           -- :: [Word] -> [Word]
        words .          -- :: Text -> [Word]
        map canonical    -- :: Text -> Text

Using the power of hlint, we can simplify this down to:

mostCommon :: Int -> Text -> String
mostCommon n =  concatMap showRun . take n . -- :: [Run] -> String
    sortBy (comparing Data.Ord.Down) . -- :: [Run] -> [Run]
    map codeRun . -- :: [Word] -> [Run]
    group . -- :: [Word] -> [[Word]]
    sort . -- :: [Word] -> [Word]
    words . -- :: Text -> [Word]
    map canonical -- :: Text -> Text

The simplifications here are the utilization of concatMap and sortBy (comparing Data.Ord.Down), which simplify the concat . map and reverse . sort respectively. Then, we need the canonical function, which will act on characters:

canonical :: Char -> Char
canonical c | isLower c = c
            | isUpper c = toLower c
            | otherwise = ' '

This uses some more built-ins, but it is also an example of a switch statement in Haskell, with the | acting as case statements, and otherwise acting as an else. The codeRun function looks like:

codeRun :: [Word] -> (Int, Word)
codeRun ws = (length ws, head ws)

Here we see that the length keyword acts on element of the [a] type, giving us their length, and head gives us the first element in the list. As we never input an empty list type, head is always well defined. The last function is straightforward enough:

showRun :: Run -> String
showRun (n, w) = concat [w, ": ", show n, "\n"]

Putting it all together inside of counter.hs, and handling some specific imports:

import Data.Char (isLower, isUpper, toLower)
import Data.List (group, sort, sortBy)
import Data.Ord (comparing, Down (Down))
import Prelude hiding (Word)

type Text = String
type Word = String
type Run = (Int, Word)

canonical :: Char -> Char
canonical c | isLower c = c
            | isUpper c = toLower c
            | otherwise = ' '

codeRun :: [Word] -> (Int, Word)
codeRun ws = (length ws, head ws)

showRun :: Run -> String
showRun (n, w) = concat [w, ": ", show n, "\n"]

mostCommon :: Int -> Text -> String
mostCommon n =  concatMap showRun . take n . -- :: [Run] -> String
    sortBy (comparing Data.Ord.Down) . -- :: [Run] -> [Run]
    map codeRun . -- :: [Word] -> [Run]
    group . -- :: [Word] -> [[Word]]
    sort . -- :: [Word] -> [Word]
    words . -- :: Text -> [Word]
    map canonical -- :: Text -> Text

To run this code, open up your terminal and do:

ghci ./counter.hs
ghci> let hill = "When the Anglo-saxons invaded Britain it is clear that they took over many place names as names, without understanding their meaning. The evidence is to be found in names like Penhill, where Old English hyll was added unnecessarily to a word which represented Old Welsh pann, hill. A Penhill in Lancashire developed into Pendle Hill, a name which means hill-hill-hill. England also has a Torpenhow Hill, or hill-hill-hill-hill."
ghci > putStr (mostCommon 5 hill)
hill: 10
a: 4
names: 3
which: 2
to: 2

You've run your first 'nontrivial' haskell program!