Github link: https://github.com/hamdaankhalid/Compilers/blob/main/regex/regex.cpp
If you have worked on any sort of validation logic in web services or frontend forms you are bound to have come across Regular Expressions. We use them for all sorts of string matching tasks such as checking if a given string matches a phone number pattern, if a string is a valid email address, etc. Not known to many regular expressions are also commonly used for lexical analysis and breaking a string written into your IDE into small "tokens" that have a meaning to further parts in a compiler.
Fun side quest: Looks at Lex and Flex (spoiler alert: they let you create a lexical analyzer by letting you define your regular expressions for different parts of your language.
The common problem category that all above examples have is a problem of pattern matching. You define a pattern, something like this:
\d{3}-\d{3}-\d{4}
and then you take a string "224-607-4296" and check if it matches against the pattern you described.
TLDR:
The regex interpreter takes your regex definition and constructs an Automata from it. An Automata is a model of states and transitions, where the transitions help you move from state A to state B given an input (or no input in what is called an epsilon transition). Think of it like a state machine that is dynamically constructed based on your input. We then iterate over you candidate string and model each character transitioning over the states. At the end of the input and when we have run out of state transitions we check if we happened to land on an accepting state (a state symbolizing the end of a valid input).
Now if you read the TLDR and you know a bit about automatas and interpreters on automatas you would want to throw a punch at me, for skipping over too much. Well don't get excited I don't know enough Automata theory to sit and write a paper on it, but I do know just enough to help make you see how regular expressions depend on Automatas and run them to tell you whether or your javascript function should show your user an error for their username with an emoji in the middle lol.
We don't need to be an expert but we do need to have an understanding of the following concepts to understand how regular expressions work:
Automata:
Specific kind of state machine used to model a computation wiht states, and transitions.
Deterministic Finite Automata:
An Automata where every state given an input will only have one state it can transition into next. So if we are state0, there can only be on transition if we see a symbol "a" to state1, we cannot have any other transitions from state0 given the single transition.
The finite stands for "A finite set of states with transitions based on input symbols".
Non-Deterministic Finite Automata:
An Automata that confirms to the same concept of finiteness we described above, but given a symbol a state may or may not have multiple transitions.
Language Definition:
- Literal: any symbol that is not already a reserverd character is treated as a literal. Example 'a', 'c', 'p'. A language such as "c" matches a string that is the exact character as the literal 'c'.
- Concatenation "+": This operator is used to express combining literals or "sub-expressions" sequentially. For example "c+a+t" is a concatenation of the literals 'c', 'a', 't', and it matches a string that has the characters c,a,t all together sequentially -> cat.
- Alternator "|": This one is a logical "or" on two subexpressions or literals. for example the expression "a|b" will match any string that is either a single character "a" or a single character "b". We can also use Alternators on subexpressions.
- Parenthesis "(..)": Parenthesis are used similar to how we use them in math to create subexpressions that take the highest precedence in evaluation. For example the expression "(A|B)+C" will match the strings "AC" or "BC", because the first character must comply to the the subexpression "(A|B)" and then the result must be followed by a concatenation of the symbol C/
- Kleene "\*": The kleene closure is preceeded by a literal or a subexpression and it expresses the occurence of what it is preceeded by 0 or as many times as possible. Example "A\*" matches -> "", "A", "AA", "AAAAAAAA....A".
Similarity with Arithmetic:
When I started working on this problem, I realized that evaluating the language was very similar to performing Arithmetic operations on an expression. Operations "concatenation" & "alterantors" were binary operators (2 arguments to the function), and the "Kleene" operator was an unary operator. The Parenthesis definitely behaved the same way they do in math. Handling Parenthesis for subexpressions is a commmon problem that can be solved using a well known notation format that removes the need for Parenthesis altogether, the Reverse-Polish-Notation format or the post fix format. The Post Fix format just puts the operand(arguments) before the operator. So "a+b", becomes "ab+", or "(c|b)+(a|b)" becomes "c b | a b + |". For the unary operator of a kleene we can express it pretty as we do it already "a\*" -> "a\*".
This gets clearer in the function responsible for turning infix to postfix format which is very creatively named as "infixToPostfixTranslation" by me.
Now from post fix notation we just need to start going left to right, and when we come across an operator we just need to pop the last two elements and apply the operator on them, and continue evaluating. This will get clearer in the code under the function "buildNfa" which processes the postfix translated expression.
Well we know how to evaluate an expression but how do we handle these expressions? What does concatenating and alternting, or kleen"ing" really do? For that we are going to be diving deeper into the world of Nondeterministic Finite Automata representation of these operations.
NFA representation:
* Quick note: You will see the word "epsilon transition" being used below, this just means that you can move from stateA to stateB without needing any symbol.
* Empty expression
"State 0" --- epsilon transition ----> "State 1 (Acceptable Ending State)"
* Literal expression (using the literal 'a' as an example)
"State 0" --- "a" ---> "State 1 (Acceptable Ending State)"
* Concatenation expression
Given two subexpressions that have a start state and end state. Literal expression 1: "State 0" --- "a" ---> "State 1 (Acceptable Ending State)" Literal expression 2: "State 2" --- "b" ---> "State 3 (Acceptable Ending State)" Concatenation of Expr1 and Expr2 "State 0" --- "a" ---> "State 1" --- epsilon transition ---> "State 2" --- "b" ---> "State 3 (Accept)" )"
* Alternator expression
Given two subexpressions that have a start state and end state. Literal expression 1: "State 1" --- "a" ---> "State 2 (Acceptable Ending State)" Literal expression 2: "State 3" --- "b" ---> "State 4 (Acceptable Ending State)" Alternator of Expr1 and Expr2 -----epsilon transition----> "State 1" ----- "a" ----> "State 2"--epsilon transition | | "State 0" | "State 5 (Acceptable Ending)" | | -----epsilon transition----> "State 3" ----- "b" ----> "State 4"--epsilon transition
* Kleene expression
Literal expression 1: "State 1" --- "a" ---> "State 2 (Acceptable Ending State)" Kleene of expression 1 ------<------Epsilon Transition----<- | | "State 0" -----"a" ---> "State 1 (Acceptable Ending State)" | | ------->----Epsilon Transition----->
Now that we have our primitive operators combined creating more complex ones on top of these base abstraction layers is fairly simple the pseudo-code looks something like this:
func process_post_fixed_regular_language(Array<Symbol> postfixed): symbolmapping = Map<string, tuple<int, int> for item in postfixed: if item is literal: create a graph that satisfies the single literal graph store the start and end states of the above graph in symbolMapping stack = [] for item in postfixed: if item is literal or subexpression: push literal or subexpression on stack else if item is concatenation operator: pop the last two items on stack get their start and end state from symbol mapping use their start and end state to make a new graph for the subexpression save the subexpression and its start and end state into symbol mapping pusgh subexpression on stack else if item is alternator operator: ....same as concatenation but but instead of making subexpression for concatenation, make graph that complies with alternator else if item is kleene closure: ....same as concatenation but but instead of making subexpression for concatenation, pop only one item from current stack and create kleene's graph from it. return top most item in stack along with the graphs attached to it
```
The actual implementation of the Non Finite State Automata is as a class that store the states and their relations in a hashmaps and vectors. Look at the class "NFA" in the source code. Once we go throught the above procedure ther NFA class is populated with all necessary states and transitions for the entire graph of the given regular expression. Now we can evaluate candidate string against it and tell whether a given string matches the pattern defined by the regular expression or not!
## Evaluation:
The evaluation of a string against the NFA(Non-deterministic fintie automata) is performed by an depth frist search on the NFA as we iterate over each character of the candidate string using it as a transition or using any epsilon transitions as we go over different states in our NFA. At the very end we check if we finish on an Acceptable state, and that is our answer to whether the candidate matches or not! Look at the method runSimulation in the class NFA to see the iterative implementation of Depth first search for this problem.
## Improvements:
Going back to our discussion of Deterministic vs Nondeterministic Finite Automatas we can see the computational cost of simulating state transitions on NFAs being much higher than on DFA's for complex enough expressions. Based on the need of the hour if the cost of constructing the graph is something we are willing to incur, in order for faster evaluation on candidate strings, we can convert NFAs into DFAs using different algorithms! I will leave this up to you to explore further! Based on the regular expressions we built today we can build other syntactic sugar oeprators such as "?" and "[]" to make our regular expressions less verbose.
CODE:
#include <algorithm> #include <cassert> #include <iostream> #include <memory> #include <stdexcept> #include <string> #include <unordered_map> #include <unordered_set> #include <utility> #include <vector> /* * Primitives: * - For now if something is not an operator we treat it as a literal * * Concatenation: * - "+" * * Alternation: * - "|" * * Kleene Star: * - "*" * * Grouping & Paranthesis: * - "( )" * */ /* * Epsilon NFA Explanation Copy Pasta: * *An ε-NFA (epsilon-Nondeterministic Finite Automaton) is a type of finite automaton used in theoretical computer science and formal language theory to describe and recognize regular languages. It is an extension of a traditional NFA (Nondeterministic Finite Automaton) with the addition of epsilon transitions (also known as ε-transitions or lambda transitions). Let's break down the components and concepts of an ε-NFA in absolute detail: States: An ε-NFA consists of a finite set of states, often denoted by Q. Each state represents a specific condition or configuration of the automaton. Alphabet: There is a finite input alphabet Σ, which consists of symbols or characters that the automaton reads from an input string. These symbols are the basic building blocks of the language. Transitions: Unlike a traditional DFA (Deterministic Finite Automaton), an ε-NFA can have multiple transitions for a given state and input symbol. These transitions are non-deterministic, meaning that the automaton can move to multiple states at once when processing a particular symbol. Additionally, ε-NFAs can have epsilon transitions (ε-transitions), which are transitions that occur without consuming any input symbol. An ε-transition allows the automaton to move from one state to another without reading any input, effectively adding an extra dimension to its computation. Start State: There is a designated start state (often denoted as q0) where the automaton begins its computation. Accept States: The ε-NFA also has a set of accept states (or final states), which are states that, when reached after processing the entire input string, indicate that the automaton recognizes the input string as a valid member of the language. Epsilon Transitions: Epsilon transitions (ε-transitions) are transitions that are not associated with any input symbol. They are represented using the ε symbol. When the automaton encounters an ε-transition, it can move from the current state to the target state without consuming any input symbol. Epsilon transitions allow for non-determinism, as the automaton can make choices without needing any input cues. Language Recognition: To determine whether a given input string belongs to the language recognized by the ε-NFA, the automaton explores all possible paths through its states, considering both regular transitions (based on input symbols) and ε-transitions. It accepts the input string if there exists at least one path that leads to an accept state after processing the entire input. Equivalence to Regular Languages: ε-NFAs are capable of recognizing the same class of languages as DFAs and NFAs. This means that any language recognized by an ε-NFA can also be recognized by a regular NFA or DFA. Similarly, any language recognized by a regular NFA or DFA can be recognized by an ε-NFA. * */ char groupingAndRelations[5] = {'*', '|', '+', '(', ')'}; bool isOperatorOrGroups(char a) { for (int i = 0; i < 5; i++) { if (groupingAndRelations[i] == a) { return true; } } return false; } bool isOperatorOrGroups(const std::string &x) { if (x.size() != 1) { return false; } return isOperatorOrGroups(x[0]); } class Nfa { private: std::unordered_set<int> finalStates; std::unordered_map<int, std::unordered_set<int>> epsilonTransitions; std::unordered_map<int, std::unordered_map<std::string, std::unordered_set<int>>> transitions; int stateCounter = 0; public: const int startState = 0; // Based on how states are create 0 is always the init state const int endAcceptanceState = 1; // Based on how states are created 1 is always the accepting state static std::unique_ptr<Nfa> createEpsilonNfa() { std::unique_ptr<Nfa> base = std::make_unique<Nfa>(); base->addEpsilonTransition(base->createNewState(), base->createNewState()); base->addFinalState(base->endAcceptanceState); return base; } Nfa() {} int createNewState() { this->stateCounter++; return this->stateCounter - 1; } void addTransition(int fromState, const std::string &symbol, int toState) { this->transitions[fromState][symbol].insert(toState); } void addEpsilonTransition(int fromState, int toState) { this->epsilonTransitions[fromState].insert(toState); } bool hasEpsilonTransitions(int state) { std::unordered_map<int, std::unordered_set<int>>::iterator found = this->epsilonTransitions.find(state); if (found == this->epsilonTransitions.end()) { return false; } return !this->epsilonTransitions[state].empty(); } void transferEpsilonTransitions(int owner, int transferTo) { std::unordered_map<int, std::unordered_set<int>>::iterator found = this->epsilonTransitions.find(owner); if (found == this->epsilonTransitions.end()) { std::cout << "nothing transfered" << std::endl; return; } std::unordered_set<int> epsilons = this->epsilonTransitions[owner]; this->epsilonTransitions[transferTo] = epsilons; this->epsilonTransitions[owner] = std::unordered_set<int>(); } void removeEpsilonTransition(int fromState, int toState) { std::unordered_map<int, std::unordered_set<int>>::iterator found = this->epsilonTransitions.find(fromState); if (found == this->epsilonTransitions.end()) { return; } this->epsilonTransitions[fromState].erase(toState); } void addFinalState(int state) { finalStates.insert(state); } /* * Given an input string iterate over each character * for each character we are going to explore a next path * if there are any explorations that result in an accepted state * we return true, otherwise false. */ bool runSimulation(const std::string &input) { int hardLimitSelfLoop = 100; std::unordered_map<int, int> selfLoopTransitions; int transitionEndsAt = input.size(); // stack stores the current state and the current character index being // explored std::vector<std::pair<int, int>> stack; // add the intial state onto the stack std::pair<int, int> initial = std::make_pair(this->startState, 0); stack.push_back(initial); while (!stack.empty()) { std::pair<int, int> candidate = stack[stack.size() - 1]; stack.pop_back(); if (candidate.second == transitionEndsAt && candidate.first == this->endAcceptanceState) { return true; } // epsilon transitions std::unordered_map<int, std::unordered_set<int>>::iterator found = this->epsilonTransitions.find(candidate.first); if (found != this->epsilonTransitions.end()) { std::unordered_set<int> epsilonTransitions = this->epsilonTransitions[candidate.first]; for (int nextState : epsilonTransitions) { std::pair<int, int> toExplore = std::make_pair(nextState, candidate.second); // Check if already looped before and if at max limit std::unordered_map<int, int>::iterator loopedBefore = selfLoopTransitions.find(nextState); if (loopedBefore != selfLoopTransitions.end() && selfLoopTransitions[nextState] > hardLimitSelfLoop) { continue; } if (loopedBefore != selfLoopTransitions.end()) { selfLoopTransitions[nextState] += 1; } else { selfLoopTransitions[nextState] = 1; } stack.push_back(toExplore); } } // when we are on last index of input string only epsilon transitions are // allowed if (candidate.second == transitionEndsAt) { continue; } // action based transitions std::unordered_map< int, std::unordered_map<std::string, std::unordered_set<int>>>::iterator foundTransitionAbleState = this->transitions.find(candidate.first); if (foundTransitionAbleState != this->transitions.end()) { std::unordered_map<std::string, std::unordered_set<int>> actionTransitions = this->transitions[candidate.first]; std::string symbol = std::string(1, input.at(candidate.second)); std::unordered_map<std::string, std::unordered_set<int>>::iterator foundSymbol = actionTransitions.find(symbol); if (foundSymbol != actionTransitions.end()) { std::unordered_set<int> nextStates = actionTransitions[symbol]; int nextIdx = candidate.second + 1; for (int nextState : nextStates) { std::pair<int, int> exploration = std::make_pair(nextState, nextIdx); stack.push_back(exploration); } } } } return false; } void print() { std::cout << "NFA { \n"; std::cout << "Final States: \n"; for (int finalState : this->finalStates) { std::cout << "+ " << finalState << std::endl; } std::cout << "EpsilonTransitions \n"; for (std::pair<int, std::unordered_set<int>> epsilonTransition : this->epsilonTransitions) { std::cout << "+ " << epsilonTransition.first << " -> [ "; for (const int &state : epsilonTransition.second) { std::cout << state << ", "; } std::cout << " ] \n"; } std::cout << "Transitions \n"; for (std::pair<int, std::unordered_map<std::string, std::unordered_set<int>>> transition : this->transitions) { std::cout << "+ From " << transition.first << std::endl; std::cout << "\t"; for (std::pair<std::string, std::unordered_set<int>> symbolToStates : transition.second) { std::cout << "Given Symbol " << symbolToStates.first << ": [ "; for (const int &toState : symbolToStates.second) { std::cout << toState << ", "; } std::cout << " ] \n"; } } std::cout << "} \n"; } }; std::vector<std::string> preProcessRegex(const std::string ®ex) { std::vector<std::string> splitted; std::string builder; for (char curr : regex) { if (isOperatorOrGroups(curr)) { if (!builder.empty()) { splitted.push_back(builder); } builder.clear(); // insert the delimitter splitted.push_back(std::string(1, curr)); } else { builder += curr; } } if (!builder.empty()) { splitted.push_back(builder); } int uniqueness = 0; for (std::string §ion : splitted) { if (!isOperatorOrGroups(section)) { section += "_" + std::to_string(uniqueness); uniqueness++; } } return splitted; } std::vector<std::string> infixToPostFixTranslation(std::vector<std::string> infix) { std::vector<std::string> pushers = std::vector<std::string>{"+", "|", "(", "*"}; std::string popOn = ")"; std::vector<std::string> outputQueue; std::vector<std::string> stack; for (const std::string &candidate : infix) { std::vector<std::string>::iterator it = std::find(pushers.begin(), pushers.end(), candidate); if (it != pushers.end()) { stack.push_back(candidate); } else if (candidate == popOn) { // while the stack is not empty and the current top most element is an // operator of higher than the left one keep popping the top of while (!stack.empty()) { std::string top = stack[stack.size() - 1]; if (top == "(") { stack.pop_back(); break; } else { outputQueue.push_back(top); stack.pop_back(); } } } else { outputQueue.push_back(candidate); } } // Anything remaining on stack? while (!stack.empty()) { if (stack[stack.size() - 1] == "(") { std::__throw_invalid_argument("Invalid regex"); } outputQueue.push_back(stack[stack.size() - 1]); stack.pop_back(); } std::cout << "POSTFIX: "; for (const std::string &i : outputQueue) { std::cout << i; } std::cout << "\n"; return outputQueue; } // Create an NFA and change its topology as we iterate through the regular // expression return the NFA back to the invoker. This Nfa now has all the // states and transitions needed to be able to match a pattern std::unique_ptr<Nfa> buildNfa(std::vector<std::string> postFixed) { // all our primitive operators can be treated as binary so I will // solve them the same way we do mathematical binary operators at // this point we can only have primitive operators of Concatenation // and alternations infix to postfix took care of handling groupings std::unique_ptr<Nfa> nfa = Nfa::createEpsilonNfa(); const int lastStateStart = nfa->startState; int lastStateEnd = nfa->endAcceptanceState; std::unordered_map<std::string, std::pair<int, int>> symbolMapping; int i = 0; for (const std::string &candidate : postFixed) { if (!isOperatorOrGroups(candidate)) { int startState = nfa->createNewState(); int endState = nfa->createNewState(); std::pair<int, int> startEndStatePair = std::make_pair(startState, endState); // remove the issue of having a candidate get demolished by the index of // duplicates std::string uniqueCandidate = candidate + std::to_string(i); symbolMapping.insert(std::make_pair(candidate, startEndStatePair)); // these are the actual non epsilon transitions // in the actual transition remove the uniqueness postfix std::string postFixedRemoved = candidate.substr(0, 1); nfa->addTransition(startState, postFixedRemoved, endState); // when we iterate over our actual regex we will be adding these real // transitions between the existing start and end state to create // topologies that reflect the expressions and sub-expressions i++; } } std::vector<std::string> stack; for (const std::string &primitiveCandidate : postFixed) { if (primitiveCandidate == "|") { // pop the last two std::string y = stack.at(stack.size() - 1); stack.pop_back(); std::string x = stack.at(stack.size() - 1); stack.pop_back(); // make a fork connecting the alternator // x and y represent a state, this state is already in our symbol mapping std::pair<int, int> statesForMatchX = symbolMapping.at(x); std::pair<int, int> statesForMatchY = symbolMapping.at(y); // when doing a fork we basically break the connection between the current // start state and end state then we add our fork in between int preForkCommonState = nfa->createNewState(); int postForkCommonState = nfa->createNewState(); // create the alternator nfa->addEpsilonTransition(preForkCommonState, statesForMatchX.first); nfa->addEpsilonTransition(preForkCommonState, statesForMatchY.first); if (nfa->hasEpsilonTransitions(statesForMatchX.second)) { std::cout << "Attempt transfer from " << statesForMatchX.second << " to " << postForkCommonState << std::endl; nfa->transferEpsilonTransitions(statesForMatchX.second, postForkCommonState); } // when we do a concat, and then another concat and then do an alternator // the concats are sequenced.... this aint good if (nfa->hasEpsilonTransitions(statesForMatchY.second)) { std::cout << "Attempt transfer from " << statesForMatchY.second << " to " << postForkCommonState << std::endl; nfa->transferEpsilonTransitions(statesForMatchY.second, postForkCommonState); } nfa->addEpsilonTransition(statesForMatchX.second, postForkCommonState); nfa->addEpsilonTransition(statesForMatchY.second, postForkCommonState); // find the x and y states start and ends for both and then create an // alternator then take their combined start and end and store in the // mapping std::string alternated = "(" + x + ")|(" + y + ")"; // add the state and metadata back into mappings symbolMapping.insert(std::make_pair( alternated, std::make_pair(preForkCommonState, postForkCommonState))); stack.push_back(alternated); } else if (primitiveCandidate == "+") { // pop the last two std::string y = stack.at(stack.size() - 1); stack.pop_back(); std::string x = stack.at(stack.size() - 1); stack.pop_back(); // x and y represent a state, this state is already in our symbol mapping std::pair<int, int> statesForMatchX = symbolMapping.at(x); std::pair<int, int> statesForMatchY = symbolMapping.at(y); if (nfa->hasEpsilonTransitions(statesForMatchX.second)) { nfa->transferEpsilonTransitions(statesForMatchX.second, statesForMatchY.second); } // connect x and y nfa->addEpsilonTransition(statesForMatchX.second, statesForMatchY.first); std::string concatenated = "(" + x + ")+(" + y + ")"; symbolMapping.insert( std::make_pair(concatenated, std::make_pair(statesForMatchX.first, statesForMatchY.second))); stack.push_back(concatenated); } else if (primitiveCandidate == "*") { // Kleene is a unary and is to be treated differently std::string x = stack.at(stack.size() - 1); stack.pop_back(); std::pair<int, int> statesForMatchX = symbolMapping.at(x); // construct a kleene then add it into the nfa int preForkCommonState = nfa->createNewState(); int pseudoEnd = nfa->createNewState(); // transfer everything in second to pseudoEnd if (nfa->hasEpsilonTransitions(statesForMatchX.second)) { nfa->transferEpsilonTransitions(statesForMatchX.second, pseudoEnd); } nfa->addEpsilonTransition(statesForMatchX.second, pseudoEnd); // loop back on the kleene nfa->addEpsilonTransition(statesForMatchX.second, statesForMatchX.first); nfa->addEpsilonTransition(preForkCommonState, pseudoEnd); nfa->addEpsilonTransition(preForkCommonState, statesForMatchX.first); std::string kleene = "(" + x + "*)"; symbolMapping.insert(std::make_pair( kleene, std::make_pair(preForkCommonState, pseudoEnd))); stack.push_back(kleene); } else { stack.push_back(primitiveCandidate); } std::cout << "State of NFA" << std::endl; nfa->print(); } // connect everything in stack as concat on main NFA if (stack.size() > 0) { std::cout << "Concatenating Remaining Sub-Automatas" << std::endl; while (!stack.empty()) { std::string x = stack.at(stack.size() - 1); stack.pop_back(); std::pair<int, int> xState = symbolMapping.at(x); nfa->removeEpsilonTransition(lastStateStart, lastStateEnd); nfa->addEpsilonTransition(lastStateStart, xState.first); nfa->addEpsilonTransition(xState.second, lastStateEnd); lastStateEnd = xState.first; } } return nfa; } void test(const std::string ®ex, std::vector<std::string> matches, std::vector<std::string> notMatches) { std::cout << "INFIX Regex: " << regex << std::endl; std::cout << "######## COMPILE TO NFA ########" << std::endl; std::vector<std::string> processed = preProcessRegex(regex); std::vector<std::string> postFixed = infixToPostFixTranslation(processed); std::unique_ptr<Nfa> nfa = buildNfa(postFixed); std::cout << "######## Completed Nfa: for " << regex << " #########" << std::endl; nfa->print(); std::cout << "######## Evaluate against NFA #########" << std::endl; for (const std::string &trues : matches) { bool mustBeTrue = nfa->runSimulation(trues); if (!mustBeTrue) { std::cout << "Test case " << trues << " FAILED against Regex " << regex << std::endl; assert(false); } } for (const std::string &falses : notMatches) { bool mustBeFalse = nfa->runSimulation(falses); if (mustBeFalse) { std::cout << "Test case " << falses << " FAILED against Regex " << regex << std::endl; assert(false); } } std::cout << "-------- Test Passes ---------" << std::endl; } int main() { std::cout << "############ REGULAR EXPRESSION PARSER & COMPILER #############" << std::endl; // single element std::string singleA = "a"; std::vector<std::string> singleFalses = {"b", "c", "ab"}; std::vector<std::string> singleTrues = {"a"}; test(singleA, singleTrues, singleFalses); // concat as a binary operator std::string concatenationBinary = "a+b"; std::vector<std::string> concatenationBinaryFalses = {"", "a", "b", "c", "ac", "acb", "cba"}; std::vector<std::string> concatenationBinaryTrues = {"ab"}; test(concatenationBinary, concatenationBinaryTrues, concatenationBinaryFalses); // triple concats std::string concatenationRegex = "(a+b)+c"; std::vector<std::string> concatenationFalses = {"a", "b", "c", "ab", "ac", "acb", "cba"}; std::vector<std::string> concatenationTrues = {"abc"}; test(concatenationRegex, concatenationTrues, concatenationFalses); // quad concat std::string quadConcatenationRegex = "a+b+c+d"; std::vector<std::string> quadConcatenationFalses = { "", "a", "b", "c", "ab", "ac", "acb", "cba", "acbd", "abc"}; std::vector<std::string> quadConcatenationTrues = {"abcd"}; test(quadConcatenationRegex, quadConcatenationTrues, quadConcatenationFalses); // binary alternator std::string binaryAlternatorRegex = "a|b"; std::vector<std::string> binaryAlternatorFalses = {"", "c", "ab", "ac"}; std::vector<std::string> binaryAlternatorTrues = {"a", "b"}; test(binaryAlternatorRegex, binaryAlternatorTrues, binaryAlternatorFalses); // quad alternator std::string quadAlternatorRegex = "a|b|c|d"; std::vector<std::string> quadAlternatorFalses = {"", "f", "ab", "ac"}; std::vector<std::string> quadAlternatorTrues = {"a", "b", "c", "d"}; test(quadAlternatorRegex, quadAlternatorTrues, quadAlternatorFalses); // alternator and concatenation combos std::string altConCombo1Regex = "(a+b)|(c+d)"; std::vector<std::string> altConCombo1Falses = {"", "a", "b", "c", "d", "ac", "ad", "dca"}; std::vector<std::string> altConCombo1Trues = {}; // {"ab", "cd"}; test(altConCombo1Regex, altConCombo1Trues, altConCombo1Falses); std::string altConCombo2Regex = "(a+b)|c"; std::vector<std::string> altConCombo2Falses = {"a" "ac", " ", ""}; std::vector<std::string> altConCombo2Trues{"ab", "c"}; test(altConCombo2Regex, altConCombo2Trues, altConCombo2Falses); std::string altConCombo3Regex = "a+(b|c)+d"; std::vector<std::string> altConCombo3Falses = {"", "a", "ab", "ac", "ad", "abe", "ace"}; std::vector<std::string> altConCombo3Trues{"abd", "acd"}; test(altConCombo3Regex, altConCombo3Trues, altConCombo3Falses); // kleene std::string simpleKleene = "a*"; std::vector<std::string> simpleKleeneFalses = {"b", "gdfsd"}; std::vector<std::string> simpleKleeneTrues = {"", "a", "aa", "aaaaaaaaaaaaaaa"}; test(simpleKleene, simpleKleeneTrues, simpleKleeneFalses); // concatenate kleene combo std::string concatKleeneCombo1Regex = "(c+d)*"; std::vector<std::string> concatKleeneCombo1Falses = {"a", "bc" "de", "d"}; std::vector<std::string> concatKleeneCombo1Trues = {"", "cd", "cdcdcd"}; test(concatKleeneCombo1Regex, concatKleeneCombo1Trues, concatKleeneCombo1Falses); std::string concatKleeneCombo2Regex = "c+(o)*"; std::vector<std::string> concatKleeneCombo2Falses = {"cool", "a", "dog", ""}; std::vector<std::string> concatKleeneCombo2Trues = {"c", "co", "coooooo"}; test(concatKleeneCombo2Regex, concatKleeneCombo2Trues, concatKleeneCombo2Falses); // alternator kleene combo std::string altKleeneCombo1 = "(c|o)*"; std::vector<std::string> altKleeneCombo1Falses = {"a", "aa"}; std::vector<std::string> altKleeneCombo1Trues = { "", "cc", "oo", "coco", "coocoo", "cooo", "ooooc"}; test(altKleeneCombo1, altKleeneCombo1Trues, altKleeneCombo1Falses); // alternator concat kleen combo std::string allCombo = "d+(a*)"; std::vector<std::string> allComboFalses = {"", "do", "dao", "a"}; std::vector<std::string> allComboTrues = {"d", "da", "daaa"}; test(allCombo, allComboTrues, allComboFalses); std::string allCombo2 = "d+(a*)+(n|o)+i"; std::vector<std::string> allCombo2Falses = {"", "dock", "day", "a" "daaa", "da", "d", "dano", "dno"}; std::vector<std::string> allCombo2Trues = {"dni", "dani", "daaani", "daoi", "doi"}; test(allCombo2, allCombo2Trues, allCombo2Falses); // all pattern mega regex std::string finalTestRegex = "k+h+a+l+i+d+.+h+a+m+d+(a*)+n+@+((g+m+a+i+l)|(m+" "i+c+r+o+s+o+f+t))+.+c+o+m"; std::vector<std::string> finalTestRegexFalses = {"khalid.hamdaan@yahoo.com", "khalid.ham"}; std::vector<std::string> finalTestRegexTrues = { "khalid.hamdaan@gmail.com", "khalid.hamdaan@microsoft.com", "khalid.hamdn@gmail.com", "khalid.hamdn@microsoft.com", "khalid.hamdaaaaaaaaaaaaaaaaaaaan@gmail.com"}; test(finalTestRegex, finalTestRegexTrues, finalTestRegexFalses); return 0; }
https://en.wikipedia.org/wiki/Thompson%27s_construction#:~:text=In%20computer%20science%2C%20Thompson's%20construction,nondeterministic%20finite%20automaton%20(NFA).