# Regular Expressions and NFAs

## February 11th, 2020

- Alright, today's plan:
    - Convert regular expressions to NFAs
    - Convert NFAs to regular languages
    - Pumping Lemma
        - "This is getting very far into proof-territory, and you really can't think in terms of input examples or the input alphabet, since they'll be infinite; instead, you have to think of these in terms of your known states"
        - With a small number of states, we can accept any number of strings

- Also, don't forget homework 3 is due this Thursday!
--------------------------------------------------------------------------------

- As we mentioned, regular expressions are defined recursively as regular languages with 3 operations:
    - UNION, accepting either of 2 possible strings (e.g. accept "(b or c)at")
    - CONCATENATION, sticking 2 strings together
    - KLEEN STAR (*), which means there are 0+ copies of the preceding character/characters
        - "L*" means 0 or more copies of a word in some language "L"
        - "$\Sigma$*" means the language also containing the characters from $\Sigma$
            - Essentially, $\Sigma$ is the set of characters, but we treat each character as its own string so it's technically treated as a language
            - So, if L = {000}, then L* = {$\epsilon$, 000, 000000, ...}

- Alright; let's keep trying to prove that regular expressions give regular languages, meaning that they can be accepted by an NFA/DFA!
    - We do this by starting out with the base case of the empty string $\epsilon$, a single character $\Sigma$, or the empty set (which always rejects, since there are NO states/characters now)
    - Then, by induction, we assume the hypothesis is true and show that this holds as we add more characters!
        - For Kleen Star, in words, the proof is like this:
            - Let M be an NFA such that L(M) = L with starting state q0 and some accepting states
            - Then, create M' for L* by:
                - Creating a new starting state q'
                - Adding $\epsilon$ transitions from:
                    - q' to q0
                    - All the accepting states to q0
            - However, we still need to prove that if w is accepted by M', then w can be written as a sequence w1 ... wk, such that each is in "L"
                - We ALSO need to show the other way: if a string "w" is in L*, that it WILL be accepted by M'
            - "You CAN do this proof by bijections instead, but I wouldn't do it, because it's confusing"
        - For union, it's simpler; you just take Q_start, then split into 2 separate branches, one using L1's machine "M1" and the other using L2's machine "M2"
        - For concatenation, you go from your accepting state to the start of the "L2" string (using M2s accepting machine) via $\epsilon$-transitions
            - There's 1 tricky case that's similar to our 1st homework: if L1 = {a, aa} and L2 = {ab}, then you don't want to falsely reject "aab"
            - To deal with this, you have to go from ALL accepting states in the 1st language to the starting state of the second language
    - So, in all cases, we can convert it into a DFA - so they're equivalent!

- Okay, let's look at an example!
    - Let's suppose we have a fairly simple regular expression:

            (a or b)*aba

        - So, to start off with our union, we'd do an $\epsilon$-transition split from our starting state to either "a" or "b," and go to an accepting state like so (no character for transition means $\epsilon$-transition):

                q1--a->q2_accept
                |
                q0
                |
                q3--b->q4_accept

        - To handle the Kleen Star "*", then, we need to create a new accepting state "q5" and take an $\epsilon$-transition from our current accepting states to it - and make our old accepting states no longer accept - to make that part optional:

                q1--a----->q2
                |           |
                q0<---q5_accept
                |           |
                q3--b----->q4

            - "Note that, right now, q2/q4 are kinda redundant, since you can skip it and go straight to q5"
                - So, we could simplify this NFA, but that would make our transformations we're doing less clear
        - Finally, to handle concatenation, we need to go from our current accepting states (i.e. q5) to the set of new states that'll accept the concatenated string "aba":

                q1--a----->q2
                |           |
                q0<---q5_accept--->q6--a->q7--b->q8--a->q9_accept
                |           |
                q3--b----->q4

- Okay, let's now cover the other direction and convert from NFAs to regular expressions
    - First, we can simplify our NFA to a single accepting state by adding $\epsilon$-transitions, making it a GENERALIZED NFA
        - This means that instead of single characters being our transitions, we can think of entire regular expressions being our transitions
        - e.g.:

                q0--x->q1--y->q2
                       ||
                        z

            - The "z" here is a self-loop on q1 when we see a "z" character
        - Goes to:

                q0---xz*y---->q2

            - "Implicitly, there's also the union here of all other transitions from q0 to q2 that we DON'T see"
            - Side-note: (a or b)* is NOT the same thing as a\*b\*, since in the latter, we can't have strings like "aba" where the characters are mixed
    - Once we've done this, we can keep eliminating edges more and more until we eventually just end up with 2 states: the starting state, the accepting state, and the regular expression for the NFA connecting them

- "Okay - what I want to talk about for the rest of this class is how to prove a language is NOT regular"
    - Here's an example of a non-regular language:

            L = {0^n1^n | n >= 0}

        - How can we do this? Well, first, let's assume "L" is regular, meaning it has a DFA "M" that can accept L
            - Remember: DFAs by definition have a FINITE number of states
        - Let M have "p" number of states
        - Then, consider the string 0^(p+1)1^(p+1)
            - Since there are p+1 "0" characters in this string, but only "p" states, it MUST have looped at some point - in fact, it must have looped when it was ONLY reading 0s
            - THEREFORE, we CAN'T keep track of the exact number of 0s we've seen using only our "p" states, meaning the DFA can't verify that the number of 0s matches the number of 1s
                - TMs can handle this since they can write stuff down/edit the string, but DFAs ONLY have states - they're stuck!
        - Therefore, no DFA "M" can decide this language, so it isn't regular!
            - "Notice, here, that you don't get to design your machine around some known input - instead, I get to be mean and AFTER you've designed your machine, I can pick ANY input that causes it to fail!"
    - So, we've got a proof by contradiction that this language can't be regular!

- This looping idea is at the core of something super important, called the PUMPING LEMMA!
    - ...which states: "If A is regular, then there exists some p such that for any w $\in$ A with |w| > p, there is a split of w = xyz with |xy| <= p such that for all i >= 0, $xy^iz$ $\in$ A"
        - Breaking this down:
            - If A is regular,
            - then there exists some p such that
            - for any w $\in$ A with |w| >= p
            - there is a split w = xyz with |xy| <= p, |y| > 0
            - such that for all i >= 0
            - $xy^iz$ $\in$ A
        - TODO: Double-check I copied this correctly?

- We'll go through PLENTY more descriptions of this over the next week; stick with me, and let's go!