Analysing Alice in Wonderland with C#

One of my most recent programming interview tasks was to develop a program that looked at the frequency of words used in a book. In particular, we were analysing Alice in Wonderland with C# to find the most commonly occurring words. I had thirty minutes to complete the challenge.

The approach

My first thought was to re-use code I already had. The programming language frequency project had a ton of crossover. It already did word identification and word frequency counts, so it would have been a fantastic fit. Unfortunately, the test banned code re-use. It’s not as if we want programmers to re-use code after all…

Instead, the solution was to fall back on one of our old friends, Regex, to do the word extraction.

Regex in Wonderland

Initially, we want to read the text from a text file.

var data = File.ReadAllText(_filePath);

Luckily, the System.IO.File class has a “ReadAllText” method that reads the whole content of a file into a string. Perfect!
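
As a rough sketch of the file-reading step, the snippet below writes a small sample to a temporary file and reads it back. The temporary path and the sample sentence are stand-ins for illustration, not the file used in the interview task.

```csharp
using System;
using System.IO;

// Hypothetical stand-in for _filePath: a temporary file holding one sample sentence.
string filePath = Path.GetTempFileName();
File.WriteAllText(filePath, "Alice was beginning to get very tired");

// ReadAllText opens the file, reads everything into one string and closes it again.
string data = File.ReadAllText(filePath);
Console.WriteLine(data);

// Tidy up the temporary file.
File.Delete(filePath);
```

ReadAllText throws a FileNotFoundException if the file does not exist, so a real tool would probably check File.Exists first or catch the exception.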

Next, we look to split the chapter of data into individual words. The easiest way to start is to use the String.Split method like so:

data.Split(' ');

Unfortunately, this leaves us with a lot of words like “(Dinah”. To try and sanitise this data, I created a method that would remove any special characters in a string.

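// Requires: using System.Text.RegularExpressions;
private string cleanString(String value)
{
    // Character class listing the special characters to strip: * " , _ & # ^ @
    Regex reg = new Regex("[*\",_&#^@]");
    var cleanedWord = reg.Replace(value, string.Empty);
    return cleanedWord;
}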

This method takes a string, looks for any occurrence of the special characters listed in the pattern and replaces each one with an empty string.
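
The character class above only covers a handful of symbols, so brackets and full stops would still slip through. One possible variant, sketched here as an assumption rather than the code from the interview, removes everything that is not a letter, a digit or a straight apostrophe. The name CleanWord is made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical variant of the cleaning step: keep letters, digits and apostrophes only.
Regex notWordChars = new Regex(@"[^\p{L}\p{N}']");

string CleanWord(string value) => notWordChars.Replace(value, string.Empty);

Console.WriteLine(CleanWord("(Dinah"));   // Dinah
Console.WriteLine(CleanWord("cat.)"));    // cat
Console.WriteLine(CleanWord("don't"));    // don't
```

Curly apostrophes (’) are not in the keep list, so a text that uses them would need that character added to the pattern.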

Linqing it all together

Sorry, but I’m proud of that pun…

Now that we’ve got some code to clean a string, we need to build it into the splitting method.

var cleanedWords = data.Split(' ')
    .Select(p => cleanString(p))
    .Where(r => !r.Equals(String.Empty)); // Remove any empty occurrences

This code splits the string, strips the special characters from each word and removes any strings that end up completely empty.
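
One thing to watch with Split(' ') is that line breaks are not treated as separators, so the last word on one line stays glued to the first word on the next. Here is a minimal sketch of one way around that, assuming it is acceptable to split on spaces, tabs and line breaks and to drop empty entries up front:

```csharp
using System;

// Sample text with a line break and a double space; the wording is made up for illustration.
string data = "down the\r\nrabbit  hole";

// Splitting on a space alone leaves "the\r\nrabbit" as a single entry, plus an empty one.
string[] bySpace = data.Split(' ');

// Splitting on all common whitespace and discarding empty entries gives clean words.
char[] separators = { ' ', '\t', '\r', '\n' };
string[] words = data.Split(separators, StringSplitOptions.RemoveEmptyEntries);

Console.WriteLine(string.Join("|", words)); // down|the|rabbit|hole
```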

Then, we group the words by their text and count each group, collecting the results in a dictionary:

var groupedWords = cleanedWords.GroupBy(p => p).ToDictionary(b => b.Key, b => b.Count());
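
Grouping on the raw string treats "The" and "the" as different words. If that is not wanted, one option (an assumption here, not something the original code does) is to pass a case-insensitive comparer to both GroupBy and ToDictionary:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sample words standing in for the cleaned output of the earlier steps.
string[] cleanedWords = { "The", "cat", "the", "Cat", "sat" };

// StringComparer.OrdinalIgnoreCase makes "The" and "the" fall into the same group.
Dictionary<string, int> groupedWords = cleanedWords
    .GroupBy(p => p, StringComparer.OrdinalIgnoreCase)
    .ToDictionary(g => g.Key, g => g.Count(), StringComparer.OrdinalIgnoreCase);

Console.WriteLine(groupedWords["the"]); // 2
Console.WriteLine(groupedWords["CAT"]); // 2
Console.WriteLine(groupedWords["sat"]); // 1
```

Lower-casing every word before grouping would work too. The comparer approach simply keeps the first spelling seen as the dictionary key.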

Finally, we sort the dictionary by the frequency of the words:

var wordsByPopularity = groupedWords.OrderByDescending(p => p.Value);
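
To actually see the result, the sorted sequence can be trimmed to the top few entries and printed. The sketch below assumes a small hand-made dictionary in place of the real counts, and adds a ThenBy on the word so that ties come out in a predictable order:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hand-made counts standing in for groupedWords; the numbers are invented for the example.
var groupedWords = new Dictionary<string, int>
{
    ["the"] = 5,
    ["Alice"] = 3,
    ["and"] = 3,
    ["rabbit"] = 1
};

// Sort by count, break ties by ordinal string order, and keep the three most frequent words.
List<KeyValuePair<string, int>> topWords = groupedWords
    .OrderByDescending(p => p.Value)
    .ThenBy(p => p.Key, StringComparer.Ordinal)
    .Take(3)
    .ToList();

foreach (var entry in topWords)
{
    Console.WriteLine($"{entry.Key}: {entry.Value}");
}
```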

Conclusion to Analysing Alice in Wonderland

The full source code can be found over on git:

In the 30 minutes I had to put this code together, I think it made an OK tool. However, there are definitely improvements that could have been made. If you’ve got any suggestions, leave me a comment below. I’d love to know your thoughts on Analysing Alice in Wonderland with C#.
