1.6 example most common words in a text
Here is a programming task.
Find the
most common words in a given text. Input: Text which is a list of Chars (so a String), where spaces (' ') and newlines ('\n') are treated as blank characters.
Output: You should output a sorted list of the number of times a word occurs in a the input text. More specifically, if there areunique words, there are lines of the format f"{word}: {occurrences}."
In python, see counter.py:
"""
This module demonstrates the use of the Counter class
from the collections module to count the occurrences
of words in a given input string.
"""
from collections import Counter
N = 5
HILL = """
When the Anglo-saxons invaded Britain it is clear that
they took over many place names as names, without understanding
their meaning. The evidence is to be found in names like Penhill,
where Old English hyll was added unnecessarily to a word which
represented Old Welsh pann, hill. A Penhill in Lancashire developed
into Pendle Hill, a name which means hill-hill-hill. England also has
a Torpenhow Hill, or hill-hill-hill-hill."""
data = HILL.split()
HILL = HILL.lower()
numerical_data = Counter(data)
sorted_data = sorted(
list(numerical_data.keys()),
key = lambda x : numerical_data[x],
reverse = True
)
for word in sorted_data[:max(N, len(sorted_data))]:
print(f"{word}: {numerical_data[word]}")
Haskell doesn't let you abstract-ify away what's actually going on, but this python code is still useful to have a rough understanding of how to approach it.
Ultimately, whatever program we write is of the form:
mostCommon :: Int -> [Char] -> [Char] -- typedef
mostCommon n input_string -- using it
For pseduocode:
- convert all Chars to lower case
- split by all whitespace characters to get all words
- sort the list into alphabetical order
- In the python code this is abstracted away by the
Counter
instance.
- In the python code this is abstracted away by the
- split the list into sections, where each section has the same word
- replace each section by the number of times each word appears in it
- sort this in reverse descending order
- take the first
and convert each one of these into the output format.
Great! So we should first go off with defining some types, which will be borrowed from the course notes:
> type Text = [Char]
> type Word = [Char]
> type Run = (Int, Word)
We utilize these Runs
as (occurrence, word)
pairs. Then, we want a lowercase function, which will act as a map onto the [Char]
s:
map canonical :: Text -> Text
Then we want a way to convert text into words via breaking of whitespace:
words :: Text -> [Word]
We then want to sort it, so we utilize the built-in function
sort :: Ord a => [a] -> [a]
Notice this new notation here. Everything to the left of the => is a condition on the type class
Then we want a function to convert runs of identical words to their corresponding Run
type, returning the length of the run and the actual word. Again, we have a built-in function:
group :: Eq a => [a] -> [[a]]
This function requires an equality on the type
So, we need our function to ultimately look like:
codeRun :: [Word] -> Run
To convert each sub-list to its Run type.
Then we sort the runs in descending order, as Haskell has a reverse
function built in. Take the first take
function, and then handle output via a function showRun :: Run -> String
and utilize concat . map showRun
on the Run
's and hey presto! That's the program.
In the code from Oxford, it looks like:
mostCommon :: Int -> Text -> String
mostCommon n = concat . -- :: [String] -> String
map showRun . -- :: [Run] -> [String]
take n . -- :: [Run] -> [Run]
reverse . -- :: [Run] -> [Run]
sort . -- :: [Run] -> [Run]
map codeRun . -- :: [Word] -> [Run]
group . -- :: [Word] -> [[Word]]
sort . -- :: [Word] -> [Word]
words . -- :: Text -> [Word]
map canonical -- :: Text -> Text
Using the power of hlint
, we can simplify this down to:
mostCommon :: Int -> Text -> String
mostCommon n = concatMap showRun . take n . -- :: [Run] -> String
sortBy (comparing Data.Ord.Down) . -- :: [Run] -> [Run]
map codeRun . -- :: [Word] -> [Run]
group . -- :: [Word] -> [[Word]]
sort . -- :: [Word] -> [Word]
words . -- :: Text -> [Word]
map canonical -- :: Text -> Text
The simplifications here are the utilization of concatMap
and sortBy (comparing Data.Ord.Down)
, which simplify the concat . map
and reverse . sort
respectively. Then, we need the canonical function, which will act on characters:
canonical :: Char -> Char
canonical c | isLower c = c
| isUpper c = toLower c
| otherwise = ' '
This uses some more built-ins, but it is also an example of a switch statement in Haskell, with the codeRun
function looks like:
codeRun :: [Word] -> (Int, Word)
codeRun ws = (length ws, head ws)
Here we see that the length
keyword acts on element of the
showRun :: Run -> String
showRun (n, w) = concat [w, ": ", show n, "\n"]
Putting it all together inside of counter.hs, and handling some specific imports:
import Data.Char (isLower, isUpper, toLower)
import Data.List (group, sort, sortBy)
import Data.Ord (comparing, Down (Down))
import Prelude hiding (Word)
type Text = String
type Word = String
type Run = (Int, Word)
canonical :: Char -> Char
canonical c | isLower c = c
| isUpper c = toLower c
| otherwise = ' '
codeRun :: [Word] -> (Int, Word)
codeRun ws = (length ws, head ws)
showRun :: Run -> String
showRun (n, w) = concat [w, ": ", show n, "\n"]
mostCommon :: Int -> Text -> String
mostCommon n = concatMap showRun . take n . -- :: [Run] -> String
sortBy (comparing Data.Ord.Down) . -- :: [Run] -> [Run]
map codeRun . -- :: [Word] -> [Run]
group . -- :: [Word] -> [[Word]]
sort . -- :: [Word] -> [Word]
words . -- :: Text -> [Word]
map canonical -- :: Text -> Text
To run this code, open up your terminal and do:
ghci ./counter.hs
ghci> let hill = "When the Anglo-saxons invaded Britain it is clear that they took over many place names as names, without understanding their meaning. The evidence is to be found in names like Penhill, where Old English hyll was added unnecessarily to a word which represented Old Welsh pann, hill. A Penhill in Lancashire developed into Pendle Hill, a name which means hill-hill-hill. England also has a Torpenhow Hill, or hill-hill-hill-hill."
ghci > putStr (mostCommon 5 hill)
hill: 10
a: 4
names: 3
which: 2
to: 2
You've run your first 'nontrivial' haskell program!