Jake's CS Notes - Data Structures / Algorithms

//*******************************************************************//
//***********Rabin Karp String Search - April 5th, 2017*************//
//*****************************************************************//

-Well, when it rains, it pours...and generates tornado warnings
    -In other news, CS 2051 test: I HAS TO STUDY NOWS!!!!
    -...but, gripped by the endless curse of colossal laziness, I'm kinda hoping for a storm delay (which, c'mon, won't happen)
        -Note From Future Me: It did not happen (the storm delay, I mean; the test very much did happen)
-Test 3 is in ONE WEEK
    -AVL Coding questions ARE FAIR GAME! Will probably focus on rotations
    -Will include matching questions, a few where you have to trace pattern matching, and instead of 1 big coding question, 2 or 3 smaller coding questions
        -...of course, be prepared for anything
--------------------------------------------------------------

-Well, let's start with the last pattern-matching algorithm: Rabin-Karp!
    -Hey, I spelled it right this time!

-The 3 previous searches we did:
    -BRUTE FORCE
        -The ONLY one we went over that has no pre-processing step
        -It's considered a "naive" algorithm; it takes the most obvious, simple approach
    -BOYER-MOORE
        -Uses a last-occurrence table
        -Starts from the back of the pattern, uses LOT to shift efficiently up
        -Lines up matching characters, but still has to check them
    -KMP
        -Generates a failure table
        -Uses prefixes / suffixes

-RABIN-KARP is different from these 3 since it does NOT directly compare characters; instead, it uses a "rolling hash" to calculate a hash pattern for the text substring we're searching through, and we compare the HASHCODES
    -NOTE: If the hashes match, it's possible we still have a collision, so we STILL have to check the characters to verify it's a valid match
    -O(n) to generate the initial hash, but O(1) to shift it over

    -To generate the initial hash for the pattern, we take the sum from 0 to m-1 of character(i) * BASE^((m-1) - i)
        -In code:
            String pattern;
            int hash = 0;
            int base = 10;
            for (int i = 0; i < m.length; i++) {
                hash += pattern.getCharAt(i) * pow(base, m.length - 1 - i);
            }
        -e.g. for the message "347," it would be hash = 3*100 + 4*10 + 7 = 347
            -This function is sometimes called "Rabin's fingerprint function"
    -We then calculate this hash for the substring from 0 to pattern.length - 1 in the text we're searching through
        -To "roll" this hash to the next character, we say:
            newHash = base * (oldHash - (oldChar * base^(pattern.length - 1))) + newChar
    -So, let's do an example, with CHARACTERS this time around
        -Let's say "a"  =1, "b" = 2, "c" = 3, etc., to make the math easier
            -In your actual code, you'll just use the ASCII values of each character
        -Let's also say our "base" is 2 (for example's sake)
            pattern:    bac
            text:       abcbabac
        -The FIRST THING we should check is to see if pattern.length > text.length; if it is, then we'll NEVER find a match, so we shouldn't waste time preprocessing and should just return "false" right away
            -In this case, "bac" is smaller than the text, so let's move on
        -To generate the hash for "bac", we say:
            -"hash" = "b" * 2^(3-1) + "a" * 2^(3-2) + "c" * 2^(3-3)
                = 2*4 + 1*2 + 3*1 = 13
        -We then do the same thing for the first 3 characters in the text, "abc," and get 11
            -This doesn't match our pattern's hash, so we "roll" the hash over by 1
        -"bcb"s hash = (11 - 1*2^(3-1))*2 + 2 = 16
            -Still not match, so we keep going
        -Once we DO find a match, then we are NOT done; we might have just happened to find a collision
            -So, we go in and compare the characters old-fashioned style; if all the characters match, YAY! Otherwise, we have to keep looking
        -We keep going until we hit the end of the text; at this point, the substring will be smaller than our pattern, so we're done

    -So, what's performance for Rabin-Karp?
        -The best case is O(m), where "m" is the length of the pattern string
            -This is in the event that the first "m" characters match; this best case exists for pretty much all the sorts that we just looked through
        -In the WORST-CASE, where we have a bad hash function and all of the hashes match, we end up with O(m*n); the algorithm basically is just brute-force search at this point
        -In the AVERAGE case, Rabin-Karp is O(m + n)

    -Another example (illustrate the worst case):
        pattern: c a b
        text:    a b c b a c
    -With a base of 2, the "abc" hash is 1*4 + 2*2 + 3*1 = 11, and "cab" is