Regular Expression Interpreter: Building & Traversing Automatas


Github link: https://github.com/hamdaankhalid/Compilers/blob/main/regex/regex.cpp


If you have worked on any sort of validation logic in web services or frontend forms you are bound to have come across Regular Expressions. We use them for all sorts of string matching tasks such as checking if a given string matches a phone number pattern, if a string is a valid email address, etc. Not known to many regular expressions are also commonly used for lexical analysis and breaking a string written into your IDE into small "tokens" that have a meaning to further parts in a compiler.



Fun side quest: Looks at Lex and Flex (spoiler alert: they let you create a lexical analyzer by letting you define your regular expressions for different parts of your language.


The common problem category that all above examples have is a problem of pattern matching. You define a pattern, something like this:

\d{3}-\d{3}-\d{4}

and then you take a string "224-607-4296" and check if it matches against the pattern you described.


TLDR:

The regex interpreter takes your regex definition and constructs an Automata from it. An Automata is a model of states and transitions, where the transitions help you move from state A to state B given an input (or no input in what is called an epsilon transition). Think of it like a state machine that is dynamically constructed based on your input. We then iterate over you candidate string and model each character transitioning over the states. At the end of the input and when we have run out of state transitions we check if we happened to land on an accepting state (a state symbolizing the end of a valid input).



Now if you read the TLDR and you know a bit about automatas and interpreters on automatas you would want to throw a punch at me, for skipping over too much. Well don't get excited I don't know enough Automata theory to sit and write a paper on it, but I do know just enough to help make you see how regular expressions depend on Automatas and run them to tell you whether or your javascript function should show your user an error for their username with an emoji in the middle lol.


We don't need to be an expert but we do need to have an understanding of the following concepts to understand how regular expressions work:


  • Automata
  • Deterministic Finite Automata
  • Non-deterministic Finite Automata



Automata:

Specific kind of state machine used to model a computation wiht states, and transitions.


Deterministic Finite Automata:

An Automata where every state given an input will only have one state it can transition into next. So if we are state0, there can only be on transition if we see a symbol "a" to state1, we cannot have any other transitions from state0 given the single transition.

The finite stands for "A finite set of states with transitions based on input symbols".


Non-Deterministic Finite Automata:

An Automata that confirms to the same concept of finiteness we described above, but given a symbol a state may or may not have multiple transitions.



Language Definition:

- Literal: any symbol that is not already a reserverd character is treated as a literal. Example 'a', 'c', 'p'. A language such as "c" matches a string that is the exact character as the literal 'c'.

- Concatenation "+": This operator is used to express combining literals or "sub-expressions" sequentially. For example "c+a+t" is a concatenation of the literals 'c', 'a', 't', and it matches a string that has the characters c,a,t all together sequentially -> cat.

- Alternator "|": This one is a logical "or" on two subexpressions or literals. for example the expression "a|b" will match any string that is either a single character "a" or a single character "b". We can also use Alternators on subexpressions.

- Parenthesis "(..)": Parenthesis are used similar to how we use them in math to create subexpressions that take the highest precedence in evaluation. For example the expression "(A|B)+C" will match the strings "AC" or "BC", because the first character must comply to the the subexpression "(A|B)" and then the result must be followed by a concatenation of the symbol C/

- Kleene "\*": The kleene closure is preceeded by a literal or a subexpression and it expresses the occurence of what it is preceeded by 0 or as many times as possible. Example "A\*" matches -> "", "A", "AA", "AAAAAAAA....A".



Similarity with Arithmetic:

When I started working on this problem, I realized that evaluating the language was very similar to performing Arithmetic operations on an expression. Operations "concatenation" & "alterantors" were binary operators (2 arguments to the function), and the "Kleene" operator was an unary operator. The Parenthesis definitely behaved the same way they do in math. Handling Parenthesis for subexpressions is a commmon problem that can be solved using a well known notation format that removes the need for Parenthesis altogether, the Reverse-Polish-Notation format or the post fix format. The Post Fix format just puts the operand(arguments) before the operator. So "a+b", becomes "ab+", or "(c|b)+(a|b)" becomes "c b | a b + |". For the unary operator of a kleene we can express it pretty as we do it already "a\*" -> "a\*". 

This gets clearer in the function responsible for turning infix to postfix format which is very creatively named as "infixToPostfixTranslation" by me.


Now from post fix notation we just need to start going left to right, and when we come across an operator we just need to pop the last two elements and apply the operator on them, and continue evaluating. This will get clearer in the code under the function "buildNfa" which processes the postfix translated expression. 


Well we know how to evaluate an expression but how do we handle these expressions? What does concatenating and alternting, or kleen"ing" really do? For that we are going to be diving deeper into the world of Nondeterministic Finite Automata representation of these operations.



NFA representation:


* Quick note: You will see the word "epsilon transition" being used below, this just means that you can move from stateA to stateB without needing any symbol.


* Empty expression 

"State 0" --- epsilon transition ----> "State 1 (Acceptable Ending State)"



* Literal expression (using the literal 'a' as an example)

"State 0" --- "a" ---> "State 1 (Acceptable Ending State)"



* Concatenation expression


Given two subexpressions that have a start state and end state.

Literal expression 1:
"State 0" --- "a" ---> "State 1 (Acceptable Ending State)"

Literal expression 2:
"State 2" --- "b" ---> "State 3 (Acceptable Ending State)"

Concatenation of Expr1 and Expr2
"State 0" --- "a" ---> "State 1" --- epsilon transition ---> "State 2" --- "b" ---> "State 3 (Accept)" )"



* Alternator expression


Given two subexpressions that have a start state and end state.

Literal expression 1:
"State 1" --- "a" ---> "State 2 (Acceptable Ending State)"

Literal expression 2:
"State 3" --- "b" ---> "State 4 (Acceptable Ending State)"

Alternator of Expr1 and Expr2
          -----epsilon transition----> "State 1" ----- "a" ----> "State 2"--epsilon transition
          |                                                                                  |
"State 0" |                                                             "State 5 (Acceptable Ending)" 
          |                                                                                  |	
          -----epsilon transition----> "State 3" ----- "b" ----> "State 4"--epsilon transition




* Kleene expression

Literal expression 1:
"State 1" --- "a" ---> "State 2 (Acceptable Ending State)"

Kleene of expression 1
    ------<------Epsilon Transition----<-	
    |                                   |
"State 0" -----"a" ---> "State 1 (Acceptable Ending State)"
    |                                   |
    ------->----Epsilon Transition----->



Now that we have our primitive operators combined creating more complex ones on top of these base abstraction layers is fairly simple the pseudo-code looks something like this:

func process_post_fixed_regular_language(Array<Symbol> postfixed):
  symbolmapping = Map<string, tuple<int, int>

  for item in postfixed:
		if item is literal:
			create a graph that satisfies the single literal graph
			store the start and end states of the above graph in symbolMapping

  stack = []	
  for item in postfixed:
		if item is literal or subexpression:
			push literal or subexpression on stack
		else if item is concatenation operator:
			pop the last two items on stack
			get their start and end state from symbol mapping
			use their start and end state to make a new graph for the subexpression
			save the subexpression and its start and end state into symbol mapping
			pusgh subexpression on stack
		else if item is alternator operator:
			....same as concatenation but but instead of making subexpression for concatenation, make graph that complies with alternator
		else if item is kleene closure:
			....same as concatenation but but instead of making subexpression for concatenation, pop only one item from current stack and create kleene's graph from it.
	

	return top most item in stack along with the graphs attached to it

```


The actual implementation of the Non Finite State Automata is as a class that store the states and their relations in a hashmaps and vectors. Look at the class "NFA" in the source code. Once we go throught the above procedure ther NFA class is populated with all necessary states and transitions for the entire graph of the given regular expression. Now we can evaluate candidate string against it and tell whether a given string matches the pattern defined by the regular expression or not!



## Evaluation:


The evaluation of a string against the NFA(Non-deterministic fintie automata) is performed by an depth frist search on the NFA as we iterate over each character of the candidate string using it as a transition or using any epsilon transitions as we go over different states in our NFA. At the very end we check if we finish on an Acceptable state, and that is our answer to whether the candidate matches or not! Look at the method runSimulation in the class NFA to see the iterative implementation of Depth first search for this problem.



## Improvements:


Going back to our discussion of Deterministic vs Nondeterministic Finite Automatas we can see the computational cost of simulating state transitions on NFAs being much higher than on DFA's for complex enough expressions. Based on the need of the hour if the cost of constructing the graph is something we are willing to incur, in order for faster evaluation on candidate strings, we can convert NFAs into DFAs using different algorithms! I will leave this up to you to explore further! Based on the regular expressions we built today we can build other syntactic sugar oeprators such as "?" and "[]" to make our regular expressions less verbose.



CODE:


#include <algorithm>
#include <cassert>
#include <iostream>
#include <memory>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>


/*
 * Primitives:
 * - For now if something is not an operator we treat it as a literal
 *
 * Concatenation:
 * - "+"
 *
 * Alternation:
 * - "|"
 *
 * Kleene Star:
 * - "*"
 *
 * Grouping & Paranthesis:
 * - "( )"
 * */


/*
 * Epsilon NFA Explanation Copy Pasta:
 *
 *An ε-NFA (epsilon-Nondeterministic Finite Automaton) is a type of finite
automaton used in theoretical computer science and formal language theory to
describe and recognize regular languages. It is an extension of a traditional
NFA (Nondeterministic Finite Automaton) with the addition of epsilon transitions
(also known as ε-transitions or lambda transitions).


Let's break down the components and concepts of an ε-NFA in absolute detail:


States: An ε-NFA consists of a finite set of states, often denoted by Q. Each
state represents a specific condition or configuration of the automaton.


Alphabet: There is a finite input alphabet Σ, which consists of symbols or
characters that the automaton reads from an input string. These symbols are the
basic building blocks of the language.


Transitions: Unlike a traditional DFA (Deterministic Finite Automaton), an ε-NFA
can have multiple transitions for a given state and input symbol. These
transitions are non-deterministic, meaning that the automaton can move to
multiple states at once when processing a particular symbol. Additionally,
ε-NFAs can have epsilon transitions (ε-transitions), which are transitions that
occur without consuming any input symbol. An ε-transition allows the automaton
to move from one state to another without reading any input, effectively adding
an extra dimension to its computation.


Start State: There is a designated start state (often denoted as q0) where the
automaton begins its computation.


Accept States: The ε-NFA also has a set of accept states (or final states),
which are states that, when reached after processing the entire input string,
indicate that the automaton recognizes the input string as a valid member of the
language.


Epsilon Transitions: Epsilon transitions (ε-transitions) are transitions that
are not associated with any input symbol. They are represented using the ε
symbol. When the automaton encounters an ε-transition, it can move from the
current state to the target state without consuming any input symbol. Epsilon
transitions allow for non-determinism, as the automaton can make choices without
needing any input cues.


Language Recognition: To determine whether a given input string belongs to the
language recognized by the ε-NFA, the automaton explores all possible paths
through its states, considering both regular transitions (based on input
symbols) and ε-transitions. It accepts the input string if there exists at least
one path that leads to an accept state after processing the entire input.


Equivalence to Regular Languages: ε-NFAs are capable of recognizing the same
class of languages as DFAs and NFAs. This means that any language recognized by
an ε-NFA can also be recognized by a regular NFA or DFA. Similarly, any language
recognized by a regular NFA or DFA can be recognized by an ε-NFA.
 * */


char groupingAndRelations[5] = {'*', '|', '+', '(', ')'};


bool isOperatorOrGroups(char a) {
  for (int i = 0; i < 5; i++) {
    if (groupingAndRelations[i] == a) {
      return true;
    }
  }
  return false;
}


bool isOperatorOrGroups(const std::string &x) {
  if (x.size() != 1) {
    return false;
  }
  return isOperatorOrGroups(x[0]);
}


class Nfa {
private:
  std::unordered_set<int> finalStates;
  std::unordered_map<int, std::unordered_set<int>> epsilonTransitions;
  std::unordered_map<int,
                     std::unordered_map<std::string, std::unordered_set<int>>>
      transitions;


  int stateCounter = 0;


public:
  const int startState =
      0; // Based on how states are create 0 is always the init state
  const int endAcceptanceState =
      1; // Based on how states are created 1 is always the accepting state


  static std::unique_ptr<Nfa> createEpsilonNfa() {
    std::unique_ptr<Nfa> base = std::make_unique<Nfa>();
    base->addEpsilonTransition(base->createNewState(), base->createNewState());


    base->addFinalState(base->endAcceptanceState);


    return base;
  }


  Nfa() {}


  int createNewState() {
    this->stateCounter++;
    return this->stateCounter - 1;
  }


  void addTransition(int fromState, const std::string &symbol, int toState) {
    this->transitions[fromState][symbol].insert(toState);
  }


  void addEpsilonTransition(int fromState, int toState) {
    this->epsilonTransitions[fromState].insert(toState);
  }


  bool hasEpsilonTransitions(int state) {
    std::unordered_map<int, std::unordered_set<int>>::iterator found =
        this->epsilonTransitions.find(state);
    if (found == this->epsilonTransitions.end()) {
      return false;
    }


    return !this->epsilonTransitions[state].empty();
  }


  void transferEpsilonTransitions(int owner, int transferTo) {
    std::unordered_map<int, std::unordered_set<int>>::iterator found =
        this->epsilonTransitions.find(owner);
    if (found == this->epsilonTransitions.end()) {
      std::cout << "nothing transfered" << std::endl;
      return;
    }
    std::unordered_set<int> epsilons = this->epsilonTransitions[owner];
    this->epsilonTransitions[transferTo] = epsilons;
    this->epsilonTransitions[owner] = std::unordered_set<int>();
  }


  void removeEpsilonTransition(int fromState, int toState) {
    std::unordered_map<int, std::unordered_set<int>>::iterator found =
        this->epsilonTransitions.find(fromState);
    if (found == this->epsilonTransitions.end()) {
      return;
    }


    this->epsilonTransitions[fromState].erase(toState);
  }


  void addFinalState(int state) { finalStates.insert(state); }


  /*
   * Given an input string iterate over each character
   * for each character we are going to explore a next path
   * if there are any explorations that result in an accepted state
   * we return true, otherwise false.
   */
  bool runSimulation(const std::string &input) {
    int hardLimitSelfLoop = 100;
    std::unordered_map<int, int> selfLoopTransitions;


    int transitionEndsAt = input.size();


    // stack stores the current state and the current character index being
    // explored
    std::vector<std::pair<int, int>> stack;
    // add the intial state onto the stack
    std::pair<int, int> initial = std::make_pair(this->startState, 0);
    stack.push_back(initial);


    while (!stack.empty()) {
      std::pair<int, int> candidate = stack[stack.size() - 1];
      stack.pop_back();


      if (candidate.second == transitionEndsAt &&
          candidate.first == this->endAcceptanceState) {
        return true;
      }


      // epsilon transitions
      std::unordered_map<int, std::unordered_set<int>>::iterator found =
          this->epsilonTransitions.find(candidate.first);
      if (found != this->epsilonTransitions.end()) {
        std::unordered_set<int> epsilonTransitions =
            this->epsilonTransitions[candidate.first];
        for (int nextState : epsilonTransitions) {
          std::pair<int, int> toExplore =
              std::make_pair(nextState, candidate.second);
          // Check if already looped before and if at max limit
          std::unordered_map<int, int>::iterator loopedBefore =
              selfLoopTransitions.find(nextState);
          if (loopedBefore != selfLoopTransitions.end() &&
              selfLoopTransitions[nextState] > hardLimitSelfLoop) {
            continue;
          }


          if (loopedBefore != selfLoopTransitions.end()) {
            selfLoopTransitions[nextState] += 1;
          } else {
            selfLoopTransitions[nextState] = 1;
          }


          stack.push_back(toExplore);
        }
      }


      // when we are on last index of input string only epsilon transitions are
      // allowed
      if (candidate.second == transitionEndsAt) {
        continue;
      }


      // action based transitions
      std::unordered_map<
          int,
          std::unordered_map<std::string, std::unordered_set<int>>>::iterator
          foundTransitionAbleState = this->transitions.find(candidate.first);
      if (foundTransitionAbleState != this->transitions.end()) {
        std::unordered_map<std::string, std::unordered_set<int>>
            actionTransitions = this->transitions[candidate.first];


        std::string symbol = std::string(1, input.at(candidate.second));


        std::unordered_map<std::string, std::unordered_set<int>>::iterator
            foundSymbol = actionTransitions.find(symbol);


        if (foundSymbol != actionTransitions.end()) {
          std::unordered_set<int> nextStates = actionTransitions[symbol];
          int nextIdx = candidate.second + 1;
          for (int nextState : nextStates) {
            std::pair<int, int> exploration =
                std::make_pair(nextState, nextIdx);
            stack.push_back(exploration);
          }
        }
      }
    }


    return false;
  }


  void print() {
    std::cout << "NFA { \n";


    std::cout << "Final States: \n";


    for (int finalState : this->finalStates) {
      std::cout << "+ " << finalState << std::endl;
    }


    std::cout << "EpsilonTransitions \n";


    for (std::pair<int, std::unordered_set<int>> epsilonTransition :
         this->epsilonTransitions) {
      std::cout << "+ " << epsilonTransition.first << " -> [ ";
      for (const int &state : epsilonTransition.second) {
        std::cout << state << ", ";
      }
      std::cout << " ] \n";
    }


    std::cout << "Transitions \n";


    for (std::pair<int,
                   std::unordered_map<std::string, std::unordered_set<int>>>
             transition : this->transitions) {
      std::cout << "+ From " << transition.first << std::endl;
      std::cout << "\t";
      for (std::pair<std::string, std::unordered_set<int>> symbolToStates :
           transition.second) {
        std::cout << "Given Symbol " << symbolToStates.first << ": [ ";
        for (const int &toState : symbolToStates.second) {
          std::cout << toState << ", ";
        }
        std::cout << " ] \n";
      }
    }


    std::cout << "} \n";
  }
};


std::vector<std::string> preProcessRegex(const std::string &regex) {
  std::vector<std::string> splitted;
  std::string builder;
  for (char curr : regex) {
    if (isOperatorOrGroups(curr)) {
      if (!builder.empty()) {
        splitted.push_back(builder);
      }
      builder.clear();
      // insert the delimitter
      splitted.push_back(std::string(1, curr));
    } else {
      builder += curr;
    }
  }


  if (!builder.empty()) {
    splitted.push_back(builder);
  }


  int uniqueness = 0;
  for (std::string &section : splitted) {
    if (!isOperatorOrGroups(section)) {
      section += "_" + std::to_string(uniqueness);
      uniqueness++;
    }
  }
  return splitted;
}


std::vector<std::string>
infixToPostFixTranslation(std::vector<std::string> infix) {
  std::vector<std::string> pushers =
      std::vector<std::string>{"+", "|", "(", "*"};
  std::string popOn = ")";


  std::vector<std::string> outputQueue;
  std::vector<std::string> stack;


  for (const std::string &candidate : infix) {
    std::vector<std::string>::iterator it =
        std::find(pushers.begin(), pushers.end(), candidate);
    if (it != pushers.end()) {
      stack.push_back(candidate);
    } else if (candidate == popOn) {
      // while the stack is not empty and the current top most element is an
      // operator of higher than the left one keep popping the top of
      while (!stack.empty()) {
        std::string top = stack[stack.size() - 1];
        if (top == "(") {
          stack.pop_back();
          break;
        } else {
          outputQueue.push_back(top);
          stack.pop_back();
        }
      }
    } else {
      outputQueue.push_back(candidate);
    }
  }


  // Anything remaining on stack?
  while (!stack.empty()) {
    if (stack[stack.size() - 1] == "(") {
      std::__throw_invalid_argument("Invalid regex");
    }
    outputQueue.push_back(stack[stack.size() - 1]);
    stack.pop_back();
  }


  std::cout << "POSTFIX: ";
  for (const std::string &i : outputQueue) {
    std::cout << i;
  }


  std::cout << "\n";
  return outputQueue;
}


// Create an NFA and change its topology as we iterate through the regular
// expression return the NFA back to the invoker. This Nfa now has all the
// states and transitions needed to be able to match a pattern
std::unique_ptr<Nfa> buildNfa(std::vector<std::string> postFixed) {
  // all our primitive operators can be treated as binary so I will
  // solve them the same way we do mathematical binary operators at
  // this point we can only have primitive operators of Concatenation
  // and alternations infix to postfix took care of handling groupings
  std::unique_ptr<Nfa> nfa = Nfa::createEpsilonNfa();


  const int lastStateStart = nfa->startState;
  int lastStateEnd = nfa->endAcceptanceState;


  std::unordered_map<std::string, std::pair<int, int>> symbolMapping;
  int i = 0;
  for (const std::string &candidate : postFixed) {
    if (!isOperatorOrGroups(candidate)) {
      int startState = nfa->createNewState();
      int endState = nfa->createNewState();


      std::pair<int, int> startEndStatePair =
          std::make_pair(startState, endState);
      // remove the issue of having a candidate get demolished by the index of
      // duplicates
      std::string uniqueCandidate = candidate + std::to_string(i);
      symbolMapping.insert(std::make_pair(candidate, startEndStatePair));
      // these are the actual non epsilon transitions
      // in the actual transition remove the uniqueness postfix
      std::string postFixedRemoved = candidate.substr(0, 1);
      nfa->addTransition(startState, postFixedRemoved, endState);
      // when we iterate over our actual regex we will be adding these real
      // transitions between the existing start and end state to create
      // topologies that reflect the expressions and sub-expressions
      i++;
    }
  }


  std::vector<std::string> stack;


  for (const std::string &primitiveCandidate : postFixed) {
    if (primitiveCandidate == "|") {
      // pop the last two
      std::string y = stack.at(stack.size() - 1);
      stack.pop_back();


      std::string x = stack.at(stack.size() - 1);
      stack.pop_back();


      // make a fork connecting the alternator
      // x and y represent a state, this state is already in our symbol mapping
      std::pair<int, int> statesForMatchX = symbolMapping.at(x);
      std::pair<int, int> statesForMatchY = symbolMapping.at(y);


      // when doing a fork we basically break the connection between the current
      // start state and end state then we add our fork in between
      int preForkCommonState = nfa->createNewState();
      int postForkCommonState = nfa->createNewState();


      // create the alternator
      nfa->addEpsilonTransition(preForkCommonState, statesForMatchX.first);
      nfa->addEpsilonTransition(preForkCommonState, statesForMatchY.first);


      if (nfa->hasEpsilonTransitions(statesForMatchX.second)) {
        std::cout << "Attempt transfer from " << statesForMatchX.second
                  << " to " << postForkCommonState << std::endl;
        nfa->transferEpsilonTransitions(statesForMatchX.second,
                                        postForkCommonState);
      }


      // when we do a concat, and then another concat and then do an alternator
      // the concats are sequenced.... this aint good
      if (nfa->hasEpsilonTransitions(statesForMatchY.second)) {
        std::cout << "Attempt transfer from " << statesForMatchY.second
                  << " to " << postForkCommonState << std::endl;
        nfa->transferEpsilonTransitions(statesForMatchY.second,
                                        postForkCommonState);
      }


      nfa->addEpsilonTransition(statesForMatchX.second, postForkCommonState);
      nfa->addEpsilonTransition(statesForMatchY.second, postForkCommonState);


      // find the x and y states start and ends for both and then create an
      // alternator then take their combined start and end and store in the
      // mapping
      std::string alternated = "(" + x + ")|(" + y + ")";


      // add the state and metadata back into mappings
      symbolMapping.insert(std::make_pair(
          alternated, std::make_pair(preForkCommonState, postForkCommonState)));


      stack.push_back(alternated);
    } else if (primitiveCandidate == "+") {
      // pop the last two
      std::string y = stack.at(stack.size() - 1);
      stack.pop_back();


      std::string x = stack.at(stack.size() - 1);
      stack.pop_back();


      // x and y represent a state, this state is already in our symbol mapping
      std::pair<int, int> statesForMatchX = symbolMapping.at(x);
      std::pair<int, int> statesForMatchY = symbolMapping.at(y);


      if (nfa->hasEpsilonTransitions(statesForMatchX.second)) {
        nfa->transferEpsilonTransitions(statesForMatchX.second,
                                        statesForMatchY.second);
      }
      // connect x and y
      nfa->addEpsilonTransition(statesForMatchX.second, statesForMatchY.first);


      std::string concatenated = "(" + x + ")+(" + y + ")";
      symbolMapping.insert(
          std::make_pair(concatenated, std::make_pair(statesForMatchX.first,
                                                      statesForMatchY.second)));


      stack.push_back(concatenated);
    } else if (primitiveCandidate == "*") {
      // Kleene is a unary and is to be treated differently
      std::string x = stack.at(stack.size() - 1);
      stack.pop_back();


      std::pair<int, int> statesForMatchX = symbolMapping.at(x);


      // construct a kleene then add it into the nfa
      int preForkCommonState = nfa->createNewState();
      int pseudoEnd = nfa->createNewState();


      // transfer everything in second to pseudoEnd
      if (nfa->hasEpsilonTransitions(statesForMatchX.second)) {
        nfa->transferEpsilonTransitions(statesForMatchX.second, pseudoEnd);
      }


      nfa->addEpsilonTransition(statesForMatchX.second, pseudoEnd);


      // loop back on the kleene
      nfa->addEpsilonTransition(statesForMatchX.second, statesForMatchX.first);
      nfa->addEpsilonTransition(preForkCommonState, pseudoEnd);
      nfa->addEpsilonTransition(preForkCommonState, statesForMatchX.first);


      std::string kleene = "(" + x + "*)";


      symbolMapping.insert(std::make_pair(
          kleene, std::make_pair(preForkCommonState, pseudoEnd)));


      stack.push_back(kleene);


    } else {
      stack.push_back(primitiveCandidate);
    }


    std::cout << "State of NFA" << std::endl;
    nfa->print();
  }


  // connect everything in stack as concat on main NFA
  if (stack.size() > 0) {
    std::cout << "Concatenating Remaining Sub-Automatas" << std::endl;
    while (!stack.empty()) {
      std::string x = stack.at(stack.size() - 1);
      stack.pop_back();


      std::pair<int, int> xState = symbolMapping.at(x);


      nfa->removeEpsilonTransition(lastStateStart, lastStateEnd);


      nfa->addEpsilonTransition(lastStateStart, xState.first);


      nfa->addEpsilonTransition(xState.second, lastStateEnd);


      lastStateEnd = xState.first;
    }
  }


  return nfa;
}


void test(const std::string &regex, std::vector<std::string> matches,
          std::vector<std::string> notMatches) {
  std::cout << "INFIX Regex: " << regex << std::endl;


  std::cout << "######## COMPILE TO NFA ########" << std::endl;


  std::vector<std::string> processed = preProcessRegex(regex);
  std::vector<std::string> postFixed = infixToPostFixTranslation(processed);


  std::unique_ptr<Nfa> nfa = buildNfa(postFixed);


  std::cout << "######## Completed Nfa: for " << regex << " #########"
            << std::endl;


  nfa->print();


  std::cout << "######## Evaluate against NFA #########" << std::endl;


  for (const std::string &trues : matches) {
    bool mustBeTrue = nfa->runSimulation(trues);
    if (!mustBeTrue) {
      std::cout << "Test case " << trues << " FAILED against Regex " << regex
                << std::endl;
      assert(false);
    }
  }


  for (const std::string &falses : notMatches) {
    bool mustBeFalse = nfa->runSimulation(falses);
    if (mustBeFalse) {
      std::cout << "Test case " << falses << " FAILED against Regex " << regex
                << std::endl;
      assert(false);
    }
  }


  std::cout << "-------- Test Passes ---------" << std::endl;
}


int main() {
  std::cout << "############ REGULAR EXPRESSION PARSER & COMPILER #############"
            << std::endl;


  // single element
  std::string singleA = "a";
  std::vector<std::string> singleFalses = {"b", "c", "ab"};
  std::vector<std::string> singleTrues = {"a"};
  test(singleA, singleTrues, singleFalses);


  // concat as a binary operator
  std::string concatenationBinary = "a+b";
  std::vector<std::string> concatenationBinaryFalses = {"",   "a",   "b",  "c",
                                                        "ac", "acb", "cba"};
  std::vector<std::string> concatenationBinaryTrues = {"ab"};


  test(concatenationBinary, concatenationBinaryTrues,
       concatenationBinaryFalses);


  // triple concats
  std::string concatenationRegex = "(a+b)+c";
  std::vector<std::string> concatenationFalses = {"a",  "b",   "c",  "ab",
                                                  "ac", "acb", "cba"};
  std::vector<std::string> concatenationTrues = {"abc"};


  test(concatenationRegex, concatenationTrues, concatenationFalses);


  // quad concat
  std::string quadConcatenationRegex = "a+b+c+d";
  std::vector<std::string> quadConcatenationFalses = {
      "", "a", "b", "c", "ab", "ac", "acb", "cba", "acbd", "abc"};
  std::vector<std::string> quadConcatenationTrues = {"abcd"};


  test(quadConcatenationRegex, quadConcatenationTrues, quadConcatenationFalses);


  // binary alternator
  std::string binaryAlternatorRegex = "a|b";
  std::vector<std::string> binaryAlternatorFalses = {"", "c", "ab", "ac"};
  std::vector<std::string> binaryAlternatorTrues = {"a", "b"};


  test(binaryAlternatorRegex, binaryAlternatorTrues, binaryAlternatorFalses);


  // quad alternator
  std::string quadAlternatorRegex = "a|b|c|d";
  std::vector<std::string> quadAlternatorFalses = {"", "f", "ab", "ac"};


  std::vector<std::string> quadAlternatorTrues = {"a", "b", "c", "d"};


  test(quadAlternatorRegex, quadAlternatorTrues, quadAlternatorFalses);


  // alternator and concatenation combos
  std::string altConCombo1Regex = "(a+b)|(c+d)";
  std::vector<std::string> altConCombo1Falses = {"",  "a",  "b",  "c",
                                                 "d", "ac", "ad", "dca"};
  std::vector<std::string> altConCombo1Trues = {}; // {"ab", "cd"};


  test(altConCombo1Regex, altConCombo1Trues, altConCombo1Falses);


  std::string altConCombo2Regex = "(a+b)|c";
  std::vector<std::string> altConCombo2Falses = {"a"
                                                 "ac",
                                                 " ", ""};
  std::vector<std::string> altConCombo2Trues{"ab", "c"};


  test(altConCombo2Regex, altConCombo2Trues, altConCombo2Falses);


  std::string altConCombo3Regex = "a+(b|c)+d";
  std::vector<std::string> altConCombo3Falses = {"",   "a",   "ab", "ac",
                                                 "ad", "abe", "ace"};
  std::vector<std::string> altConCombo3Trues{"abd", "acd"};


  test(altConCombo3Regex, altConCombo3Trues, altConCombo3Falses);


  // kleene
  std::string simpleKleene = "a*";
  std::vector<std::string> simpleKleeneFalses = {"b", "gdfsd"};
  std::vector<std::string> simpleKleeneTrues = {"", "a", "aa",
                                                "aaaaaaaaaaaaaaa"};


  test(simpleKleene, simpleKleeneTrues, simpleKleeneFalses);


  // concatenate kleene combo
  std::string concatKleeneCombo1Regex = "(c+d)*";
  std::vector<std::string> concatKleeneCombo1Falses = {"a",
                                                       "bc"
                                                       "de",
                                                       "d"};
  std::vector<std::string> concatKleeneCombo1Trues = {"", "cd", "cdcdcd"};


  test(concatKleeneCombo1Regex, concatKleeneCombo1Trues,
       concatKleeneCombo1Falses);


  std::string concatKleeneCombo2Regex = "c+(o)*";
  std::vector<std::string> concatKleeneCombo2Falses = {"cool", "a", "dog", ""};
  std::vector<std::string> concatKleeneCombo2Trues = {"c", "co", "coooooo"};


  test(concatKleeneCombo2Regex, concatKleeneCombo2Trues,
       concatKleeneCombo2Falses);


  // alternator kleene combo
  std::string altKleeneCombo1 = "(c|o)*";
  std::vector<std::string> altKleeneCombo1Falses = {"a", "aa"};
  std::vector<std::string> altKleeneCombo1Trues = {
      "", "cc", "oo", "coco", "coocoo", "cooo", "ooooc"};


  test(altKleeneCombo1, altKleeneCombo1Trues, altKleeneCombo1Falses);


  // alternator concat kleen combo
  std::string allCombo = "d+(a*)";
  std::vector<std::string> allComboFalses = {"", "do", "dao", "a"};
  std::vector<std::string> allComboTrues = {"d", "da", "daaa"};


  test(allCombo, allComboTrues, allComboFalses);


  std::string allCombo2 = "d+(a*)+(n|o)+i";
  std::vector<std::string> allCombo2Falses = {"",
                                              "dock",
                                              "day",
                                              "a"
                                              "daaa",
                                              "da",
                                              "d",
                                              "dano",
                                              "dno"};
  std::vector<std::string> allCombo2Trues = {"dni", "dani", "daaani", "daoi",
                                             "doi"};


  test(allCombo2, allCombo2Trues, allCombo2Falses);


  // all pattern mega regex
  std::string finalTestRegex = "k+h+a+l+i+d+.+h+a+m+d+(a*)+n+@+((g+m+a+i+l)|(m+"
                               "i+c+r+o+s+o+f+t))+.+c+o+m";
  std::vector<std::string> finalTestRegexFalses = {"khalid.hamdaan@yahoo.com",
                                                   "khalid.ham"};
  std::vector<std::string> finalTestRegexTrues = {
      "khalid.hamdaan@gmail.com", "khalid.hamdaan@microsoft.com",
      "khalid.hamdn@gmail.com", "khalid.hamdn@microsoft.com",
      "khalid.hamdaaaaaaaaaaaaaaaaaaaan@gmail.com"};


  test(finalTestRegex, finalTestRegexTrues, finalTestRegexFalses);


  return 0;
}


Comments

Commenter: Hamdaan Khalid

https://en.wikipedia.org/wiki/Thompson%27s_construction#:~:text=In%20computer%20science%2C%20Thompson's%20construction,nondeterministic%20finite%20automaton%20(NFA).

Add a comment: