Transforming human-readable language into a representation a computer can work with is an intimidating challenge. Compilers that take such languages and perform these transformations can feel like black magic written by wizards, not software engineers. These languages range from programming languages such as Python, Java, and C, which are written by humans and brought down through a chain of steps to instructions your CPU can execute, to data definition languages, which seem to grow a new flavor every month: YAML, TOML, JSON, and so on.
Both data languages and programming languages share the following common traits:
- Clear syntax definitions
- Custom data representations that the human-readable text is transformed into (think variables, classes, and functions, or JSON arrays and JSON objects)
So, in my quest to uncover the abstractions of modern-day programming, I decided to explore writing a parser that takes a popular language and transforms it into node representations that respect the language's syntax rules. I chose JSON since it is a data format that has come to dominate the industry (and also because I wanted to learn parsing without spending a month on it; I want to get back to my HTTP lib after this detour).
Parsing of all languages (that I know of) breaks down into two initial phases that are present in all cases: tokenizing and parsing. Together, these two phases take a program from a stream of characters to nodes that can be traversed for further processing. In data languages such as JSON, parsing is often the end goal, whereas in programming languages we build a parse tree and then hand it to a chain of further operations that eventually produce an executable or an interpretation. Tokenizing and parsing remain fairly comparable across both language types.
Tokenizing is responsible for taking a character iterable and converting it into tokens. Tokens are chunks of characters that represent the keywords, primitives, and syntactical components of the language. A tokenizer will do things like ignore extra whitespace and comments, and group characters together into a single token based on the language's rules.
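For example, given the input below, the tokenizer we build later in this post reduces it to the following token sequence (whitespace discarded; the names match the TokenType enum in the implementation, which reuses LeftParenthesis/RightParenthesis for the braces):

```
{"id": 27, "ok": true}
⇒ LeftParenthesis String("id") Colon Number(27) Comma String("ok") Colon True RightParenthesis
```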
The tokenizer provides a way to peek at the next token, as well as a way to consume that token from the stream and proceed to the one after it. The peek operation is what gives us the ability to "lookahead".
The parser takes the whitespace-stripped tokens and uses them to enforce the syntax rules and create nodes corresponding to a "grammar". The nodes are essentially what you are used to seeing in JSON parsing libraries that let you read strings into JSON objects.
Now, before we get into the algorithm, we need to understand the core building block behind writing a parsing algorithm for a language such as ours. A language, in the context of parsing, stems from its grammar. We use a context-free grammar to represent the rules of our language formally. Each rule is called a production rule, and it is composed of a non-terminal symbol that is described by a sequence of terminal and non-terminal symbols. Terminals are the basic elements of the language, while non-terminals represent syntactic structures composed of terminals and other non-terminals.
The grammar we will be using to implement our parser describes JSON using production rules!
- Value -> Object | Array | String | Number | True | False | Null
- Object -> '{' Members? '}'
- Members -> Pair | Pair ',' Members
- Pair -> String ':' Value
- Array -> '[' Elements? ']'
- Elements -> Value | Value ',' Elements
- String -> '"' characters '"'
- Number -> '-'? digit+ ('.' digit+)? (('E'|'e') ('+'|'-')? digit+)?
- True -> 'true'
- False -> 'false'
- Null -> 'null'
- characters -> any valid characters except '"'
- digit -> '0' | '1' | ... | '9'
In the grammar above, each left-hand value (a non-terminal) is described by a combination of terminals and other non-terminals such as '{', '}', and ','. We can read our input left to right, and we can figure out which "option" to pick from a rule's disjoint union of alternatives by looking just one token ahead!
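To make this concrete, here is one derivation of `{"a": [1, true]}` using the production rules above; at every step, a single token of lookahead is enough to decide which alternative to expand:

```
Value
→ Object                                       lookahead is '{'
→ '{' Members '}'
→ '{' Pair '}'                                 after the pair, lookahead is '}' (not ',')
→ '{' String ':' Value '}'
→ '{' "a" ':' Array '}'                        lookahead is '['
→ '{' "a" ':' '[' Elements ']' '}'
→ '{' "a" ':' '[' Value ',' Elements ']' '}'   after 1, lookahead is ','
→ '{' "a" ':' '[' Number ',' Value ']' '}'
→ '{' "a" ':' '[' 1 ',' true ']' '}'
```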
In terms of our tokenizer, we need something that strips away everything else from the input stream and gives us a sequence of terminals (the tokens) that we can also do lookaheads on.
Recursive descent parsing is a top-down parsing technique (it starts from a start symbol, in our case "Value") that uses the token scanner's ability to lookahead at the next token to decide which production rule to apply. Since the grammar can be recursive in nature, we rely on recursion to expand certain nodes as we parse through the stream of tokens left to right. The recursion is what allows the parser to handle nested structures such as an object that holds an array of objects...
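To isolate the idea before the full JSON version, here is a minimal sketch for a toy grammar of my own (`Value -> '[' Value ']' | digit`, not the JSON grammar above): one character of lookahead picks the production, and the recursive call is what handles the nesting.

```cpp
#include <cctype>
#include <iostream>
#include <string>

// Value -> '[' Value ']' | digit
// Lookahead on s[pos] decides which alternative to expand.
bool parseValue(const std::string& s, size_t& pos) {
    if (pos >= s.size()) return false;
    if (s[pos] == '[') {                        // lookahead says "nested value"
        ++pos;                                  // consume '['
        if (!parseValue(s, pos)) return false;  // recurse on the inner Value
        if (pos >= s.size() || s[pos] != ']') return false;
        ++pos;                                  // consume ']'
        return true;
    }
    if (std::isdigit(static_cast<unsigned char>(s[pos]))) {
        ++pos;                                  // consume the digit
        return true;
    }
    return false;                               // neither production applies
}

int main() {
    std::string input = "[[7]]";
    size_t pos = 0;
    bool ok = parseValue(input, pos) && pos == input.size();
    std::cout << (ok ? "parsed" : "syntax error") << '\n';
}
```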
In my code you will see the idea of peeking to make a decision or to check syntax, and consuming a token by calling Get on the tokenizer to make progress in the token stream. Overall, I think the trick is in understanding your grammar and defining the right data structures. Once I had that, the code was a breeze.
We can start with the tokenizer, which works by accumulating characters from a string stream based on an understanding of what type of token we are building. At every point we look for violations of the rules and for places where we can safely terminate a token. The accumulation is only performed when someone asks us to read in a token.
Peeking can be implemented by caching the last read token and invalidating the "cache" when Get is called, i.e., when moving the cursor.
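Reduced to single characters, the caching pattern looks like this (a minimal sketch of the idea; the real JsonTokenStream below applies the same pattern to whole tokens):

```cpp
#include <iostream>
#include <optional>
#include <sstream>
#include <string>
#include <utility>

// Peek() reads once and stages the result; repeated Peek() calls return the
// staged value. Get() hands out the staged value if present and invalidates
// the cache -- that invalidation is the "cursor move".
class CharStream {
public:
    explicit CharStream(std::string s) : m_stream(std::move(s)) {}

    int Peek() {
        if (!m_cached) m_cached = m_stream.get(); // read and stage
        return *m_cached;                         // caller sees no cursor movement
    }

    int Get() {
        if (m_cached) {
            int c = *m_cached;
            m_cached.reset();                     // invalidate: value is consumed
            return c;
        }
        return m_stream.get();
    }

private:
    std::istringstream m_stream;
    std::optional<int> m_cached;
};

int main() {
    CharStream cs("ab");
    int p1 = cs.Peek(), p2 = cs.Peek(), g = cs.Get(), n = cs.Get();
    std::cout << char(p1) << char(p2) << char(g) << char(n) << '\n'; // prints "aaab"
}
```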
Since tokens arrive in a stream-like manner, we don't need to materialize all tokens before parsing; we can use the tokenizer directly in the parser to do lookaheads and make the appropriate decisions. As we make decisions we keep progressing the tokenizer until we eventually reach the end of the token stream. Each non-terminal gets its own method in the parser, which uses the grammar to check rules and expectations. We connect each non-terminal node to the others based on the relationship we want it to support; for example, key-value pairs and properties fit very well in a hashmap, whereas the idea of a node owning many nodes can be modeled with something like an array or a vector. This becomes more obvious in the non-terminal parsing methods where we build these relationships.
```cpp
#include <iostream>
#include <memory>
#include <optional>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// -------------------- Tokenizer and corresponding data types --------

enum TokenType {
    String,
    Number,
    LeftParenthesis,  // '{'
    RightParenthesis, // '}'
    LeftBracket,      // '['
    RightBracket,     // ']'
    Colon,
    Comma,
    Null,
    True,
    False,
};

struct Token {
    TokenType type;
    std::string lexeme;
};

// Lots of copying here, need to move this to using smart pointers
class JsonTokenStream {
public:
    JsonTokenStream(std::stringstream stream)
        : m_stream(std::move(stream)), m_peeked(std::nullopt) {}

    bool HasTokens() {
        return (m_stream.good() && m_stream.peek() != EOF) || m_peeked.has_value();
    }

    std::pair<Token, bool> Peek() {
        // if nothing is staged already, read in a token and stage it for later
        if (m_peeked == std::nullopt) {
            m_peeked = _readInToken();
        }
        return m_peeked.value();
    }

    // result, valid or invalid
    std::pair<Token, bool> Get() {
        // if something was staged by a previous call to Peek, use that and
        // reset the staging area; no need to read from the string stream
        if (m_peeked != std::nullopt) {
            std::pair<Token, bool> tkn = m_peeked.value();
            m_peeked = std::nullopt;
            return tkn;
        }
        return _readInToken();
    }

private:
    std::stringstream m_stream;
    std::optional<std::pair<Token, bool>> m_peeked;

    std::pair<Token, bool> _readInKnownString(TokenType type, const std::string& knownStr) {
        Token res;
        res.type = type;
        // we know the first char already matched
        for (size_t i = 1; i < knownStr.size(); i++) {
            if (m_stream.eof() || m_stream.get() != knownStr[i]) {
                return { res, false };
            }
        }
        res.lexeme = knownStr;
        return { res, true };
    }

    std::pair<Token, bool> _readInToken() {
        Token res;
        // skip whitespace and other ignored chars; read into an int so EOF
        // stays distinguishable from real characters
        int c;
        while ((c = m_stream.get()) != EOF
               && (c == ' ' || c == '\n' || c == '\t' || c == '\r'));
        if (c == EOF) {
            return { res, false };
        }
        char currChar = static_cast<char>(c);

        // at this point we are at a valid non-EOF char a token can start from
        if (currChar == '{') { res.type = LeftParenthesis;  res.lexeme = "{"; return { res, true }; }
        if (currChar == '}') { res.type = RightParenthesis; res.lexeme = "}"; return { res, true }; }
        if (currChar == '[') { res.type = LeftBracket;      res.lexeme = "["; return { res, true }; }
        if (currChar == ']') { res.type = RightBracket;     res.lexeme = "]"; return { res, true }; }
        if (currChar == ',') { res.type = Comma;            res.lexeme = ","; return { res, true }; }
        if (currChar == ':') { res.type = Colon;            res.lexeme = ":"; return { res, true }; }

        if (currChar == '\"') {
            // start of a string
            std::string jsonStr;
            res.type = String;
            while (true) {
                if (m_stream.eof()) {
                    return { res, false }; // input ended before the closing quote
                }
                char next = m_stream.get();
                // a quote terminates the string unless the previous char escaped it
                if (next == '\"' && (jsonStr.empty() || jsonStr.back() != '\\')) {
                    res.lexeme = jsonStr;
                    return { res, true };
                }
                jsonStr += next;
            }
        }

        // check for the keywords true, false, and null
        if (currChar == 't') { return _readInKnownString(True, "true"); }
        if (currChar == 'f') { return _readInKnownString(False, "false"); }
        if (currChar == 'n') { return _readInKnownString(Null, "null"); }

        // check if start of a number
        if (currChar == '-' || (currChar >= '0' && currChar <= '9')) {
            return tokenizeNumber(currChar);
        }

        // invalid token
        return { res, false };
    }

    std::pair<Token, bool> tokenizeNumber(char currChar) {
        Token res;
        res.type = Number;
        bool hadMyE = false;
        bool hadMyDecimal = false;
        std::string numero(1, currChar);
        while (HasTokens()) {
            char newChar = static_cast<char>(m_stream.peek());
            // based on the last character we know if this is valid or not:
            // a sign or a decimal point must be followed by a digit
            if ((numero.back() == '-' || numero.back() == '+' || numero.back() == '.')
                && (newChar < '0' || newChar > '9')) {
                return { res, false };
            }
            // a sign inside the number is only valid right after an exponent marker
            if ((newChar == '-' || newChar == '+')
                && (numero.back() != 'e' && numero.back() != 'E')) {
                return { res, false };
            }
            // valid continuations: sign, e, E, digit, or decimal point
            if (newChar == '-' || newChar == '+' || newChar == 'e' || newChar == 'E'
                || (newChar >= '0' && newChar <= '9') || newChar == '.') {
                if (newChar == 'e' || newChar == 'E') {
                    if (hadMyE) return { res, false };
                    hadMyE = true;
                    hadMyDecimal = true; // no decimal point allowed in the exponent
                }
                if (newChar == '.') {
                    if (hadMyDecimal) return { res, false };
                    hadMyDecimal = true;
                }
                m_stream.get(); // consume
                numero += newChar;
                continue;
            }
            // anything else ends the ongoing number; the char stays in the
            // stream and becomes the start of the next token
            break;
        }
        // a number must end on a digit ("1e", "1.", or a lone "-" are invalid)
        if (numero.back() < '0' || numero.back() > '9') {
            return { res, false };
        }
        res.lexeme = numero;
        return { res, true };
    }
};

// ------------ End of tokenizer -----------

// ------------ Parser -------------------

enum JsonNodeType {
    // ErrorNodeType is used to represent an error found while parsing
    ErrorNodeType,
    ObjectNodeType,
    ArrayNodeType,
    StringNodeType,
    NumberNodeType,
    BooleanNodeType,
    NullNodeType,
};

// forward decl
struct JsonNode;

struct ObjectNode {
    std::unordered_map<std::string, std::unique_ptr<JsonNode>> properties;
};

struct ArrayNode {
    std::vector<std::unique_ptr<JsonNode>> vals;
};

using FlatVal = std::string;

// Would have used a union or std::variant, but both kept causing
// default-constructor headaches; there is wasted space in this struct that a
// union type would avoid
struct JsonNodeValue {
    FlatVal val;
    ObjectNode objNode;
    ArrayNode arrNode;
};

struct JsonNode {
    JsonNodeType type;
    JsonNodeValue data;

    // recursive tree repr print to screen
    void PrintTreeRepr(int indent = 0) {
        std::string output = "";
        switch (type) {
            case ErrorNodeType:
            case StringNodeType:
            case NumberNodeType:
            case BooleanNodeType:
            case NullNodeType:
                for (int i = 0; i < indent; i++) output += ' ';
                output += data.val;
                std::cout << output << std::endl;
                break;
            case ArrayNodeType:
                for (int i = 0; i < indent; i++) output += ' ';
                std::cout << output << '[' << std::endl;
                for (size_t i = 0; i < data.arrNode.vals.size(); i++) {
                    data.arrNode.vals[i]->PrintTreeRepr(indent + 1);
                    if (i != data.arrNode.vals.size() - 1) std::cout << output << ",";
                }
                std::cout << output << ']' << std::endl;
                break;
            case ObjectNodeType:
                for (int i = 0; i < indent; i++) output += ' ';
                std::cout << output << "{ \n";
                for (const auto& pair : data.objNode.properties) {
                    std::cout << output << " " << pair.first << ":\n";
                    pair.second->PrintTreeRepr(indent + 1);
                }
                std::cout << output << "} \n";
                break;
        }
    }
};

class Parser {
public:
    Parser(std::unique_ptr<JsonTokenStream> scanner) : m_scanner(std::move(scanner)) {}

    std::unique_ptr<JsonNode> MakeJsonNode() {
        std::pair<Token, bool> token = m_scanner->Peek();
        if (!token.second) {
            std::unique_ptr<JsonNode> node = std::make_unique<JsonNode>();
            node->type = ErrorNodeType;
            node->data.val = "End of tokens from parser before Node formed";
            return node;
        }
        switch (token.first.type) {
            case LeftParenthesis: return makeObject();
            case LeftBracket:     return makeArray();
            case String:          return makeString();
            case Number:          return makeNumber();
            case True:
            case False:           return makeBoolean();
            case Null:            return makeNull();
            default: {
                std::unique_ptr<JsonNode> node = std::make_unique<JsonNode>();
                node->type = ErrorNodeType;
                node->data.val = "Unexpected token";
                return node;
            }
        }
    }

private:
    std::unique_ptr<JsonTokenStream> m_scanner;

    // Object -> '{' Members? '}'
    std::unique_ptr<JsonNode> makeObject() {
        // consume the '{' that we know exists
        m_scanner->Get();
        std::unique_ptr<JsonNode> node = std::make_unique<JsonNode>();
        node->type = ObjectNodeType;
        // store Pair | Pair ',' Members in the node
        std::pair<Token, bool> lookahead = m_scanner->Peek();
        while (lookahead.second && lookahead.first.type != RightParenthesis) {
            // should be a parseable pair, aka String ':' Value
            std::pair<Token, bool> shouldBeStr = m_scanner->Peek();
            if (!shouldBeStr.second || shouldBeStr.first.type != String) {
                node->type = ErrorNodeType;
                node->data.val = "Failed to build object, keys can only be strings";
                return node;
            }
            // consume key string
            m_scanner->Get();
            // should have a colon next
            std::pair<Token, bool> shouldBeColon = m_scanner->Peek();
            if (!shouldBeColon.second || shouldBeColon.first.type != Colon) {
                node->type = ErrorNodeType;
                node->data.val = "A string key in a Json Pair in an object should be followed by a ':'";
                return node;
            }
            // consume colon
            m_scanner->Get();
            std::unique_ptr<JsonNode> pairValueNode = MakeJsonNode();
            if (pairValueNode->type == ErrorNodeType) {
                return pairValueNode;
            }
            node->data.objNode.properties[shouldBeStr.first.lexeme] = std::move(pairValueNode);
            // next we expect either a comma or the closing brace
            lookahead = m_scanner->Peek();
            if (!lookahead.second) {
                node->type = ErrorNodeType;
                node->data.val = "Unexpected end of token stream";
                return node;
            }
            if (lookahead.first.type == Comma) {
                // consume the comma! if it turns out to be a trailing comma,
                // the next iteration's key check fails, which handles JSON's
                // ban on trailing commas
                m_scanner->Get();
            } else if (lookahead.first.type != RightParenthesis) {
                node->type = ErrorNodeType;
                node->data.val = "Expected ',' or '}' between object members";
                return node;
            }
        }
        if (!lookahead.second || lookahead.first.type != RightParenthesis) {
            node->type = ErrorNodeType;
            node->data.val = "Unexpected end of object while parsing the tokens, valid tokens finished before object ended";
            return node;
        }
        // consume closing '}'
        m_scanner->Get();
        return node;
    }

    // Array -> '[' Elements? ']'
    std::unique_ptr<JsonNode> makeArray() {
        std::unique_ptr<JsonNode> node = std::make_unique<JsonNode>();
        node->type = ArrayNodeType;
        // consume the '['
        m_scanner->Get();
        std::pair<Token, bool> lookahead = m_scanner->Peek();
        while (lookahead.second && lookahead.first.type != RightBracket) {
            std::unique_ptr<JsonNode> jsonNode = MakeJsonNode();
            if (jsonNode->type == ErrorNodeType) {
                return jsonNode;
            }
            node->data.arrNode.vals.push_back(std::move(jsonNode));
            lookahead = m_scanner->Peek();
            if (!lookahead.second) {
                node->type = ErrorNodeType;
                node->data.val = "Unexpected end of token stream while building array";
                return node;
            }
            if (lookahead.first.type == Comma) {
                // consume the comma! a trailing comma makes the next
                // iteration's MakeJsonNode fail on ']', which handles JSON's
                // ban on trailing commas
                m_scanner->Get();
            } else if (lookahead.first.type != RightBracket) {
                node->type = ErrorNodeType;
                node->data.val = "Expected ',' or ']' between array elements";
                return node;
            }
        }
        if (!lookahead.second || lookahead.first.type != RightBracket) {
            node->type = ErrorNodeType;
            node->data.val = "Array ended without valid Right Bracket Token";
            return node;
        }
        // consume closing ']'
        m_scanner->Get();
        return node;
    }

    std::unique_ptr<JsonNode> makeString() {
        std::pair<Token, bool> shouldBeString = m_scanner->Get();
        std::unique_ptr<JsonNode> node = std::make_unique<JsonNode>();
        node->type = StringNodeType;
        if (!shouldBeString.second || shouldBeString.first.type != String) {
            node->type = ErrorNodeType;
            node->data.val = "Expected String Node";
            return node;
        }
        node->data.val = shouldBeString.first.lexeme;
        return node;
    }

    std::unique_ptr<JsonNode> makeBoolean() {
        std::pair<Token, bool> shouldBeBool = m_scanner->Get();
        std::unique_ptr<JsonNode> node = std::make_unique<JsonNode>();
        node->type = BooleanNodeType;
        if (!shouldBeBool.second || (shouldBeBool.first.type != True && shouldBeBool.first.type != False)) {
            node->type = ErrorNodeType;
            node->data.val = "Expected Boolean Node";
            return node;
        }
        node->data.val = shouldBeBool.first.type == True ? "true" : "false";
        return node;
    }

    std::unique_ptr<JsonNode> makeNumber() {
        std::pair<Token, bool> shouldBeNumber = m_scanner->Get();
        std::unique_ptr<JsonNode> node = std::make_unique<JsonNode>();
        node->type = NumberNodeType;
        if (!shouldBeNumber.second || shouldBeNumber.first.type != Number) {
            node->type = ErrorNodeType;
            node->data.val = "Expected Number";
            return node;
        }
        node->data.val = shouldBeNumber.first.lexeme;
        return node;
    }

    std::unique_ptr<JsonNode> makeNull() {
        std::pair<Token, bool> shouldBeNull = m_scanner->Get();
        std::unique_ptr<JsonNode> node = std::make_unique<JsonNode>();
        node->type = NullNodeType;
        node->data.val = "null";
        if (!shouldBeNull.second || shouldBeNull.first.type != Null) {
            node->type = ErrorNodeType;
            node->data.val = "Expected NULL";
            return node;
        }
        return node;
    }
};

// ------------ End of parser ------------
```
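For completeness, here is one possible way to drive the pieces above end to end (this `main` is my addition, not part of the listing itself):

```cpp
int main() {
    // any JSON document works here; objects, arrays, strings, numbers,
    // booleans, and null all route through MakeJsonNode
    std::stringstream input(R"({"name": "json", "tags": ["parser", "c++"], "stars": 42})");
    Parser parser(std::make_unique<JsonTokenStream>(std::move(input)));
    std::unique_ptr<JsonNode> root = parser.MakeJsonNode();
    root->PrintTreeRepr();
    return 0;
}
```

If the input is malformed, the root comes back as an ErrorNodeType node, and PrintTreeRepr prints its error message through the same flat-value path as strings and numbers.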