AlgoDaily - Detect Substring in String

Good evening! Here's our prompt for today.

How would you write a function to detect a substring in a string?

If the substring can be found in the string, return the index at which it starts. Otherwise, return -1.

JAVASCRIPT

1function detectSubstring(str, subStr) {
2  return -1;
3}

Important-- do not use the native String class's built-in substring or substr method. This exercise is to understand the underlying implementation of that method.

Constraints

Length of both the given strings <=100000
The strings would never be null
The strings will only consist of lowercase letters
Expected time complexity : O(n)
Expected space complexity : O(1)

Try to solve this here or in Interactive Mode.

How do I practice this challenge?

xxxxxxxxxx
 
def detectSubstring(txt, subStr):
    # fill in
    return -1
​
​
import unittest
​
​
class Test(unittest.TestCase):
    def test_1(self):
        assert detectSubstring("thepigflewwow", "flew") == 6
        print("PASSED: `detectSubstring('thepigflewwow', 'flew')` should return `6`")
​
    def test_2(self):
        assert detectSubstring("twocanplay", "two") == 0
        print("PASSED: `detectSubstring('twocanplay', 'two')` should return `0`")
​
    def test_3(self):
        assert detectSubstring("wherearemyshorts", "pork") == -1
        print(
            "PASSED: `detectSubstring('wherearemyshorts', 'pork')` should return `-1`"
        )
​
​
if __name__ == "__main__":
    unittest.main(verbosity=2)
    print("Nice job, 3/3 tests passed!")
​

Tired of reading? Watch this video explanation!

To change the speed of the video or see it in full screen, click the icons to the right of the progress bar.

Here's how we would solve this problem...

How do I use this guide?

We should begin by reasoning out what we need to do in plain English before diving into the logic.

Let's take a simple example:

If we wanted to know if car was in carnival, that's easy. The first few letters are the same.

SNIPPET

1c a r
2c a r n i v a l
3^ ^ ^

So we'd write a for-loop and check that each index is equal until the shorter word is completed.

What if we wanted to know if graph was in geography? We would still write a for-loop that compares the characters at each index, but we need to account for if there is a match later on.

Here, we can use the two-pointer technique we learned about prior! Why don't we use another pointer for the purpose of seeing how far we can "extend our match" when we find an initial one?

xxxxxxxxxx
 
idx_of_start = 0
j = 0
s = 'digginn'
sub_s = 'inn'
​
for i in range(len(s)):
  # if match, compare next character of sub_s with next of string
  if s[i] == sub_s[j]:
    j += 1
  else:
    j = 0

So given string geography and substring graph, we'd step it through like this:

g matches g, so j = 1
e does not match r, so j = 0
o does not match g, so j = 0
g does match g, so j = 1
r does match r, so j = 2
a does match a, so j = 3 and so on...

At the same time, we also need to keep track of the actual index that the match starts. The j index will help us get the matching, but we need a idxOfStart variable for just the start.

Putting it all together:

xxxxxxxxxx
 
idx_of_start = 0
j = 0
s = 'digginn'
sub_s = 'inn'
​
for i in range(len(s)):
  # if match, compare next character of sub_s with next of string
  if s[i] == sub_s[j]:
    j += 1
  else:
    j = 0
​
  # keep track of where match starts and ends
  if j == 0:
    idx_of_start = i
  elif j == len(sub_s):
    print(idx_of_start)
    # we know this is the start of match

There are still two slight issues with this code. Can you see what they are?

Let's take the examples of a string ggraph and substring graph. On the first iteration, g will match g, which is what we want. However, on the next iteration, we will try to match the second g in ggraph with r in graph.

What should happen is if we don't find a match, we ought to reset i a bit as well-- this is to accommodate duplicate characters. We can reset it specifically to the length of the last processed substring (or j).

The other thing is regarding the index of the start of the match-- we can simply get it from i - (subStr.length - 1). Note that you need to subtract 1 from the substring's length in order to get to the correct index.

Complexity of Final Solution

Let n be the length of the string. In the worst-case scenario, we iterate through all n characters of the string and do not find a match. Many of us will then declare that the time complexity is linear O(n). But think carefully, for each partial matching of substrings, we are traversing through the substring, and then again resetting to the initial pointer i.

To get a more clear insight into this, let's consider an example:

TEXT

1str = "aaaaaa"
2subStr = "aab"

The final solution code of the method explained above is given.

xxxxxxxxxx
 
function detectSubstring(str, subStr) {
  let idxOfStart = 0,
    j = 0;
​
  for (i = 0; i < str.length; i++) {
    // if match, compare next character of subStr with next of string
    if (str[i] == subStr[j]) {
      j++;
      if (j == subStr.length) {
        return i - (subStr.length - 1);
      }
    } else {
      i -= j;
      j = 0;
    }
  }
​
  return -1;
}

Let's test your knowledge. Click the correct answer from the options.

How many comparisons will happen for the given code in the given test case?

TEXT

1str = "aaaaaa"
2subStr = "aab"

Click the option that best answers the question.

As you can see, there are total 15 comparisons for the given test case. This is not close to the linear 6 comparison time complexity. In fact, this is a complexity of close to 6*3 comparisons. The reason for this is for each starting letter of the main string, we are comparing the whole substring and then resetting back to the initial character. See the image below:

So the time complexity will be O(mn) where m and n are the lengths of the string and substring. For memory, using the two-pointer method, we use a constant O(1) space for our solution.

There is another established algorithm named Rabin-Karp algorithm which uses hashing and two-pointer technique to solve this problem in O(n) time complexity. Feel free to try to implement that algorithm and tell us in the discussion about it.

One Pager Cheat Sheet

Write a function detectSubstring to detect a substring within a given string, avoiding String class's built-in substring or substr method, with O(n) time complexity and O(1) space complexity, and handle strings of length ≤ 100000.
We should reason out what we need to do before diving into the logic of solving a simple example like whether a car is in carnival.
We can use the two-pointer technique to compare two strings and check if one is a substring of the other by writing a for-loop to compare the characters at each index and discover how far we can extend our match when we find an initial one.
We can use an idxOfStart variable and a j index to determine if a string contains a substring by checking character-by-character whether they match or not.
The code needs two slight adjustments to handle duplicate characters and accurately obtain the index of the start of matches, and its complexity is more accurately expressed as O(n*m), where m is the length of the substring.
The number of comparisons made for the given code in the given test case is 12, which is n * m (6 * 3).
With the two-pointer method, the time complexity for solving the given test case problem is O(mn) and a constant O(1) space is used, although the Rabin-Karp algorithm may provide a more optimal O(n) time complexity solution.

This is our final solution.

To visualize the solution and step through the below code, click Visualize the Solution on the right-side menu or the VISUALIZE button in Interactive Mode.

xxxxxxxxxx
    print("Nice job, 3/3 tests passed!")
 
def computeLPS(subStr, subStrLength, lps):
    length = 0  # length of previous longest prefix suffix
​
    lps[0] = 0  # lps[0] is always 0
    i = 1
​
    # the loop calculates lps[i] for i = 1 to subStrLength-1
    while i < subStrLength:
        if subStr[i] == subStr[length]:
            length += 1
            lps[i] = length
            i += 1
        else:
            if length != 0:
                length = lps[length - 1]
            else:
                lps[i] = 0
                i += 1
​
​
def detectSubstring(txt, subStr):
    subStrLength = len(subStr)
    strLength = len(txt)
​
    lps = [0] * subStrLength  # longest prefix suffix values for subStr
    j = 0  # index for subStr
​
    computeLPS(subStr, subStrLength, lps)  # preprocess the subStr
​
    i = 0  # index for str
    while i < strLength:
        if subStr[j] == txt[i]:

Alright, well done! Try another walk-through.

If you had any problems with this tutorial, check out the main forum thread here.