AlgoDaily - Find Duplicate Words

Good morning! Here's our prompt for today.

A very common interview challenge is to determine how often words appear in a given string or set of strings. Here's a version of this: return a list of duplicate words in a sentence.

For example, given 'The dog is the best', returns ["the"].

Likewise, given 'Happy thanksgiving, I am so full', you would return an empty array. This is because there are no duplicates in the string.

Constraints

The total length of the string <= 100000
The string will consist of ASCII characters (may be some or all)
Expected time complexity : O(n) where n are the number of words in the string
Expected space complexity : O(n)

Try to solve this here or in Interactive Mode.

How do I practice this challenge?

xxxxxxxxxx
​
 
using NUnit.Framework;
using System.Collections.Generic;
​
public static class Test
{
    public static string[] FindDuplicates(string input)
    {
        // your code here
        return new string[] { };
    }
}
​
[TestFixture]
class Tests
{
    [Test]
    public void TestOne()
    {
​
        Assert.AreEqual(Test.FindDuplicates("The dog is the best"), new string[] { "the" });
    }
​
    [Test]
    public void TestTwo()
    {
​
        Assert.AreEqual(Test.FindDuplicates("Happy thanksgiving, I am so full"), new string[] { });
    }
}

Here's how we would solve this problem...

How do I use this guide?

The Foundation: Understanding the Problem

To find duplicate words in a sentence, we need a systematic approach for analyzing the occurrences of each word. Essentially, our task is to examine each word in the given string and keep a tab on how often it shows up.

Step 1: Tokenizing the Sentence into Words

What Does Tokenizing Mean?

Tokenization is the process of converting a sequence of text into individual "tokens" or units. In our context, this means splitting the given sentence into individual words.

How to Tokenize?

We can start by converting the entire string to lowercase to make our function case-insensitive. Then, we'll use the split function to break it into an array of words.

1string[] split_s = s.ToLower().Split(' ');

xxxxxxxxxx
 
using System;
using System.Collections.Generic;
​
class Program {
  public static void Main(string[] args) {
    string s = "Original String";
    string[] splitS = s.ToLower().Split(' ');
    Dictionary<string, int> occurrences = new Dictionary<string, int>();
​
    foreach (var word in splitS) {
      if (!occurrences.ContainsKey(word)) {
        occurrences[word] = 1;
      } else {
        occurrences[word]++;
      }
    }
​
    foreach (KeyValuePair<string, int> entry in occurrences) {
      Console.WriteLine(entry.Key + ": " + entry.Value);
    }
  }
}

Step 2: Counting the Words Using a HashMap

Why a HashMap?

A HashMap provides an efficient way to store key-value pairs. In our case, the word will be the key, and its frequency will be the value. This will enable us to look up the frequency of a word in constant time.

How to Count?

We'll initialize an empty HashMap (or an object in JavaScript, or a dictionary in Python) to store the occurrences of each word. Let's call this HashMap occurrences.

Code Snippet for Counting

Here's how you can populate the occurrences HashMap:

1Dictionary<string, int> occurrences = new Dictionary<string, int>();
2foreach(var word in split_s) {
3    if (occurrences.ContainsKey(word)) {
4        occurrences[word]++;
5    } else {
6        occurrences[word] = 1;
7    }
8}

Step 3: Identifying Duplicate Words

What are Duplicate Words?

Duplicate words are the keys in our HashMap with a value greater than 1, as these words have appeared more than once in the sentence.

How to Find Them?

We can loop through our occurrences HashMap to find these duplicates.

1List<string> duplicates = new List<string>();
2foreach (KeyValuePair<string, int> entry in occurrences)
3{
4    if (entry.Value > 1){
5        duplicates.Add(entry.Key);
6    }
7}

Step 4: Unveiling the Duplicates

Identifying Duplicates Made Easy

After painstakingly counting the frequency of each word, finding duplicates becomes a cakewalk. All we have to do is sift through our occurrences HashMap and pinpoint the words that have a frequency of 2 or more. These are our coveted duplicates.

Code Snippet for Identifying Duplicates

Here's how you can extract the duplicates from our occurrences HashMap:

1List<string> duplicates = new List<string>();
2foreach (KeyValuePair<string, int> occurrence in occurrences)
3{
4    if (occurrence.Value > 1)
5    {
6        duplicates.Add(occurrence.Key);
7    }
8}

Analyzing the Complexity of Our Final Solution

Time Complexity Revisited

Let n be the number of words in our input string s.

Populating the occurrences HashMap took O(n) time.
Traversing the occurrences HashMap to identify duplicates also takes O(n) time.

Summing these up, our algorithm works in linear O(n) time, which is remarkably efficient.

Space Complexity Revisited

We've used a HashMap (occurrences) to store the frequency of each word, and a list (duplicates) to store the duplicate words. In a worst-case scenario, each word in the sentence is unique, making our space complexity linear O(n).

xxxxxxxxxx
 
using System;
using System.Collections.Generic;
public class Program {
    public static void Main() {
        string s = "Original String Original String";
        string[] split_s = s.ToLower().Split(' ');
​
        Dictionary<string, int> occurrences = new Dictionary<string, int>();
​
        foreach (string word in split_s) {
            if (!occurrences.ContainsKey(word)) {
                occurrences[word] = 1;
            } else {
                occurrences[word] += 1;
            }
        }
​
        List<string> dupes = new List<string>();
        foreach (string k in occurrences.Keys) {
            if (occurrences[k] == 2) {
                dupes.Add(k);
            }
        }
​
        Console.WriteLine(string.Join(", ", dupes));
    }
}

One Pager Cheat Sheet

This challenge requires returning a list of duplicate words in a sentence that consists of ASCII characters with a time complexity of O(n) and a space complexity of O(n).
We can use a hashmap (or JS object or Python dict) to store each word's frequency by looping through each one in the array and recording its occurrence.
We can easily find the duplicates by counting the occurrences of each word and looking for those with 2!

This is our final solution.

To visualize the solution and step through the below code, click Visualize the Solution on the right-side menu or the VISUALIZE button in Interactive Mode.

xxxxxxxxxx
}
 
using NUnit.Framework;
using System.Collections.Generic;
​
public static class Test
{
    public static string[] FindDuplicates(string input)
    {
        input = input.ToLower();
​
        if (input == null || string.IsNullOrEmpty(input))
        {
            return new string[] { };
        }
        HashSet<string> duplicates = new HashSet<string>();
​
        string[] words = input.Split();
        HashSet<string> set = new HashSet<string>();
​
        foreach (string word in words)
        {
            if (!set.Add(word))
            {
                duplicates.Add(word);
            }
        }
​
        string[] result = new string[duplicates.Count];
        duplicates.CopyTo(result);
​
        return result;
    }
}

Alright, well done! Try another walk-through.

If you had any problems with this tutorial, check out the main forum thread here.