# Regular Expressions and NFAs
## February 11th, 2020
- Alright, today's plan:
- Convert regular expressions to NFAs
- Convert NFAs to regular languages
- Pumping Lemma
- "This is getting very far into proof-territory, and you really can't think in terms of input examples or the input alphabet, since they'll be infinite; instead, you have to think of these in terms of your known states"
- With a small number of states, we can accept any number of strings
- Also, don't forget homework 3 is due this Thursday!
--------------------------------------------------------------------------------
- As we mentioned, regular expressions are defined recursively as regular languages with 3 operations:
- UNION, accepting either of 2 possible strings (e.g. accept "(b or c)at")
- CONCATENATION, sticking 2 strings together
- KLEEN STAR (*), which means there are 0+ copies of the preceding character/characters
- "L*" means 0 or more copies of a word in some language "L"
- "$\Sigma$*" means the language also containing the characters from $\Sigma$
- Essentially, $\Sigma$ is the set of characters, but we treat each character as its own string so it's technically treated as a language
- So, if L = {000}, then L* = {$\epsilon$, 000, 000000, ...}
- Alright; let's keep trying to prove that regular expressions give regular languages, meaning that they can be accepted by an NFA/DFA!
- We do this by starting out with the base case of the empty string $\epsilon$, a single character $\Sigma$, or the empty set (which always rejects, since there are NO states/characters now)
- Then, by induction, we assume the hypothesis is true and show that this holds as we add more characters!
- For Kleen Star, in words, the proof is like this:
- Let M be an NFA such that L(M) = L with starting state q0 and some accepting states
- Then, create M' for L* by:
- Creating a new starting state q'
- Adding $\epsilon$ transitions from:
- q' to q0
- All the accepting states to q0
- However, we still need to prove that if w is accepted by M', then w can be written as a sequence w1 ... wk, such that each is in "L"
- We ALSO need to show the other way: if a string "w" is in L*, that it WILL be accepted by M'
- "You CAN do this proof by bijections instead, but I wouldn't do it, because it's confusing"
- For union, it's simpler; you just take Q_start, then split into 2 separate branches, one using L1's machine "M1" and the other using L2's machine "M2"
- For concatenation, you go from your accepting state to the start of the "L2" string (using M2s accepting machine) via $\epsilon$-transitions
- There's 1 tricky case that's similar to our 1st homework: if L1 = {a, aa} and L2 = {ab}, then you don't want to falsely reject "aab"
- To deal with this, you have to go from ALL accepting states in the 1st language to the starting state of the second language
- So, in all cases, we can convert it into a DFA - so they're equivalent!
- Okay, let's look at an example!
- Let's suppose we have a fairly simple regular expression:
(a or b)*aba
- So, to start off with our union, we'd do an $\epsilon$-transition split from our starting state to either "a" or "b," and go to an accepting state like so (no character for transition means $\epsilon$-transition):
q1--a->q2_accept
|
q0
|
q3--b->q4_accept
- To handle the Kleen Star "*", then, we need to create a new accepting state "q5" and take an $\epsilon$-transition from our current accepting states to it - and make our old accepting states no longer accept - to make that part optional:
q1--a----->q2
| |
q0<---q5_accept
| |
q3--b----->q4
- "Note that, right now, q2/q4 are kinda redundant, since you can skip it and go straight to q5"
- So, we could simplify this NFA, but that would make our transformations we're doing less clear
- Finally, to handle concatenation, we need to go from our current accepting states (i.e. q5) to the set of new states that'll accept the concatenated string "aba":
q1--a----->q2
| |
q0<---q5_accept--->q6--a->q7--b->q8--a->q9_accept
| |
q3--b----->q4
- Okay, let's now cover the other direction and convert from NFAs to regular expressions
- First, we can simplify our NFA to a single accepting state by adding $\epsilon$-transitions, making it a GENERALIZED NFA
- This means that instead of single characters being our transitions, we can think of entire regular expressions being our transitions
- e.g.:
q0--x->q1--y->q2
||
z
- The "z" here is a self-loop on q1 when we see a "z" character
- Goes to:
q0---xz*y---->q2
- "Implicitly, there's also the union here of all other transitions from q0 to q2 that we DON'T see"
- Side-note: (a or b)* is NOT the same thing as a\*b\*, since in the latter, we can't have strings like "aba" where the characters are mixed
- Once we've done this, we can keep eliminating edges more and more until we eventually just end up with 2 states: the starting state, the accepting state, and the regular expression for the NFA connecting them
- "Okay - what I want to talk about for the rest of this class is how to prove a language is NOT regular"
- Here's an example of a non-regular language:
L = {0^n1^n | n >= 0}
- How can we do this? Well, first, let's assume "L" is regular, meaning it has a DFA "M" that can accept L
- Remember: DFAs by definition have a FINITE number of states
- Let M have "p" number of states
- Then, consider the string 0^(p+1)1^(p+1)
- Since there are p+1 "0" characters in this string, but only "p" states, it MUST have looped at some point - in fact, it must have looped when it was ONLY reading 0s
- THEREFORE, we CAN'T keep track of the exact number of 0s we've seen using only our "p" states, meaning the DFA can't verify that the number of 0s matches the number of 1s
- TMs can handle this since they can write stuff down/edit the string, but DFAs ONLY have states - they're stuck!
- Therefore, no DFA "M" can decide this language, so it isn't regular!
- "Notice, here, that you don't get to design your machine around some known input - instead, I get to be mean and AFTER you've designed your machine, I can pick ANY input that causes it to fail!"
- So, we've got a proof by contradiction that this language can't be regular!
- This looping idea is at the core of something super important, called the PUMPING LEMMA!
- ...which states: "If A is regular, then there exists some p such that for any w $\in$ A with |w| > p, there is a split of w = xyz with |xy| <= p such that for all i >= 0, $xy^iz$ $\in$ A"
- Breaking this down:
- If A is regular,
- then there exists some p such that
- for any w $\in$ A with |w| >= p
- there is a split w = xyz with |xy| <= p, |y| > 0
- such that for all i >= 0
- $xy^iz$ $\in$ A
- TODO: Double-check I copied this correctly?
- We'll go through PLENTY more descriptions of this over the next week; stick with me, and let's go!