Mathematical Expression Parser

I haven’t been blogging for long now, and it’s been quite a while since my last blog. I’ve been working on the same problem for a while. I decided to move away from the Project Euler exercises in favour of doing an exercise that was put to me in an interview a few years ago. It was quite an interesting problem, and I was only asked to work through one aspect of the problem in the interview. It occurred to me that it would make a good kata and subsequent blog, so here goes…

Define an object graph to represent a mathematical expression.

I’ve elaborated on this original problem to also

create a parser that can take a mathematical expression inputted by a user as a string, and evaluate it.

For example, given the string “8 + 3 * 4”, evaluate it to 20. Sounds simple!

We need to take operator precedence into account. So in the above example, the multiplication between 3 and 4 happens first even though it’s later in the expression. 3 * 4 evaluates to 12, and that 12 is then added to 8, i.e. in the absence of parenthesis, multiplications are done before additions.

Also, the parser needs to be able to deal with parenthesis. So “(8 + 3) * 4” evaluates to (8 + 3 = 11) * 4 = 44.

I think it will be easier to explain how we want the final parsed expression to look. Then we can look at how the string is parsed to create this structure.

How the expression will ultimately be represented

I decided to represent the expression in a tree data structure.

If we take the simple case of “3 * 4”, that would be represented graphically as follows:

Taking the slightly more complicated example above of “8 + 3 * 4”, this would be represented in tree form as:

Hopefully these 2 examples above clarify how the tree representation works – you start at the root node, which will be an operator node (+ in this case), and evaluate it by first evaluating the left (8) and right (*) child nodes, and then applying the operator in that root node to the result of each of it’s child nodes. If one of those child nodes in turn is an operator (as in the * node in this case), it’s result must first be evaluated before the root node above it can be evaluated. The * node in this case evaluates to 12 (3 * 4), and then the root node addition of 8 and 12 yields 20. Every node except the leaf nodes are operator nodes, and every leaf node is a number. Which child nodes are left and which are right is important where the parent operator node is minus or divide by.

So let’s have a look at the most general interface first. I’ll hide parts of the following classes until ready to discuss those parts.

public interface INode
{
	decimal Evaluate();
	...
}

So each node has the ability to evaluate its own subtree. This goes for operator and number nodes. We have a number of different types of node. For operators, there is a base implementation for all operator nodes:

public abstract class OperatorNode : INode
{
	...
	public abstract decimal Evaluate();

	public abstract char Symbol { get; }

	public INode LeftChild
	{
		get { return left; }
		set
		{
			...
			left = value;
		}
	}

	public INode RightChild
	{
		get { return right; }
		set
		{
			...
			right = value;
		}
	}
	...
}

So we’re not dealing with unary operators here – I’ve limited the scope of this problem to just deal with operators that have 2 operands. The AdditionNode illustrates how each operator is implemented:

public class AdditionNode : OperatorNode
{
...
	public override decimal Evaluate()
	{
		return LeftChild.Evaluate() + RightChild.Evaluate();
	}

	public override char Symbol
	{
		get
		{
			return '+';
		}
	}
}

And the leaf node, representing a simple decimal value, is represented as follows:

public class LeafNode : INode
{
	private decimal value;
...
	public decimal Evaluate()
	{
		return value;
	}
...
}

Parsing the mathematical expression

The top level parser looks as follows:

public class MathsExpressionParser
{
	private readonly INodeParser nodeParser;
	private INodeInserter nodeInserter;

	public MathsExpressionParser(INodeParser nodeParser, INodeInserter nodeInserter)
	{
		this.nodeParser = nodeParser;
		this.nodeInserter = nodeInserter;
	}

	public INode Parse(string expression)
	{
		expression = expression.Replace(" ", string.Empty);
			
		Trace.WriteLine(string.Format("Evaluating {0}", expression));
		int expressionIndex = 0;
		OperatorNode operatorNode = null;
		int openParenthesisCount = 0;

		// Find the leftmost node first...
		var result = nodeParser.FindNextNode(expression, expressionIndex, openParenthesisCount);
		var left = result.Node;
		openParenthesisCount = left.OpenParenthesisCount;

		expressionIndex = result.NextIndex;
		var firstPass = true;
		INode right = null;

		// Now iterate, finding operator and right nodes until done...
		while (expressionIndex < expression.Length)
		{
			result = nodeParser.FindNextNode(expression, expressionIndex, openParenthesisCount);
			if (result == null) // could be null if last characters were closing brackets.
			{
				break;
			}

			operatorNode = (OperatorNode)result.Node;
			openParenthesisCount = operatorNode.OpenParenthesisCount;
			expressionIndex = result.NextIndex;

			if (firstPass)
			{
				operatorNode.LeftChild = left;
				left.Parent = operatorNode;
				firstPass = false;
			}
			else
			{
				nodeInserter.Insert(operatorNode, right); // i.e. insert operator node in the existing tree. Use the right node from the previous iteration as the starting point to try to find where to insert it.
			}

			result = nodeParser.FindNextNode(expression, expressionIndex, openParenthesisCount);
			right = result.Node;
			openParenthesisCount = right.OpenParenthesisCount;
			expressionIndex = result.NextIndex;

			Trace.WriteLine(string.Format("Setting {0} as right child of {1}", right, operatorNode));
			operatorNode.RightChild = right;
			right.Parent = operatorNode;
		}

		var topNode = GetTopNode(operatorNode);
		Trace.WriteLine(string.Format("Top node is {0}", topNode));
		return topNode;
	}

	private OperatorNode GetTopNode(OperatorNode operatorNode)
	{
		while (operatorNode.Parent != null)
		{
			Trace.WriteLine(string.Format("Parent of {0}({1}{2}) is {3}({4}{5})", operatorNode, operatorNode.LeftChild, operatorNode.RightChild, operatorNode.Parent, operatorNode.Parent.LeftChild, operatorNode.Parent.RightChild));
			operatorNode = operatorNode.Parent;
		}

		return operatorNode;
	}
}

So we first get rid of the spaces in the expression, just to simplify it. Then get the first number node. Then iterate until we’ve reached the end of the expression, each time getting the next operator node and then the next number node. On the first pass, we know that the first parsed decimal number should be the left child of the first operator found. That’s the trivial case! From then on, as each operator is found, we need to decide where it fits in the existing tree parsed so far. This is where operator precedence and parenthesis comes in. More on this in the explanation of the NodeInserter class below. After each operator node has been placed at the correct point in the part of the tree parsed so far, the following decimal node parsed is added as the right child of this most recently inserted operator node.

Once the whole string has been parsed, we navigate up the tree to the top root node, and that’s the node that gets passed back from the Parse method.

Parsing the next node

The DecimalNodeParser’s responsibility is to find the next node past the current expression index, advance the current expression index to beyond the characters represented in the node just found, and to note how many parenthesis have been opened and not closed yet. This openParenthesisCount is then held against each operator node, and helps in determining an operator’s position in the tree. Extension methods are used for readability.

public class DecimalNodeParser : INodeParser
{
	private readonly IOperatorNodeFactory operatorNodeFactory;

	public DecimalNodeParser(IOperatorNodeFactory operatorNodeFactory)
	{
		this.operatorNodeFactory = operatorNodeFactory;
	}

	public FindNextNodeResult FindNextNode(string expression, int startIndex, int openParenthesisCount)
	{
		if(startIndex >= expression.Length)
		{
			return null;
		}

		var index = startIndex;
		var nextCharacter = expression[index];
		while(nextCharacter.IsASpaceOrBracket())
		{
			if(nextCharacter.IsAnOpenBracket())
			{
				openParenthesisCount++;
			}
			else if (nextCharacter.IsACloseBracket())
			{
				openParenthesisCount--;
			}

			index++;
			if (index < expression.Length)
			{
				nextCharacter = expression[index];
			}
			else
			{
				return null;
			}
		}

		if(nextCharacter.IsAnOperator())
		{
			var node = operatorNodeFactory.CreateOperator(nextCharacter, openParenthesisCount);
			return new FindNextNodeResult(node, index + 1);
		}

		var decimalNumberMatch = new Regex("(\\d+(\\.\\d+)?)");
		var match = decimalNumberMatch.Match(expression.Substring(index));
		if(match.Success)
		{
			var decimalNumber = Convert.ToDecimal(match.Value);
			return new FindNextNodeResult(new LeafNode(decimalNumber, openParenthesisCount), index + match.Value.Length);
		}
		return new FindNextNodeResult(null, 0);
	}
}

The OperatorNodeFactory takes the symbol just parsed from the expression to decide what kind of OperatorNode to create. This is done by holding an internal mapping of symbols to an appropriate lambda expression creating the right kind of node. The openParenthesisCount is just passed straight through the lambda expression to whichever operator constructor is invoked.

public class OperatorNodeFactory : IOperatorNodeFactory
{
	private Dictionary<char, Func<int, OperatorNode>> operators = new Dictionary<char, Func<int, OperatorNode>>
	{
		{'+', openParenthesisCount => new AdditionNode(openParenthesisCount)},
		{'-', openParenthesisCount => new SubtractionNode(openParenthesisCount)},
		{'*', openParenthesisCount => new MultiplicationNode(openParenthesisCount)},
		{'/', openParenthesisCount => new DivisionNode(openParenthesisCount)},
		{'%', openParenthesisCount => new ModulusNode(openParenthesisCount)}
	};

	public OperatorNode CreateOperator(char operatorCharacter, int parenthesisCount)
	{
		return operators[operatorCharacter](parenthesisCount);
	}
}

Having seen how each node is parsed, we’ll now look at how each node is then placed in the tree.

Inserting the next parsed operator node at the correct position in the tree

The NodeInserter’s responsibility is to figure out where a newly-parsed operator node should be placed in the partial tree parsed so far. This needs to take into account operator precedence, as well as considering parenthesis. The NodeInserter is given the node to insert, and a handle on the partial tree parsed so far, in the form of the rightmost node found so far. The operator node to insert is first compared against the operator node immediately above this rightmost node, i.e. the rightmost operator already in the tree. The new operator node either gets:

  • inserted below the existing operator node in the tree – in this case the existing operator’s right node is replaced with the new operator node, and the old right node becomes the new operator’s left node
  • compared against the node above the existing operator node’s parent operator node
  • made the new root node
public class NodeInserter : INodeInserter
{
	public void Insert(OperatorNode operatorNodeToInsert, INode rightNode)
	{
		var operatorNodeAlreadyInTree = rightNode.Parent;
		var comparer = new OperatorNodeComparer();
		bool newNodeInsertedInTree = false;
		while (!newNodeInsertedInTree)
		{
			if (comparer.Compare(operatorNodeToInsert, operatorNodeAlreadyInTree) <= 0)
			{
				var existingRightChild = operatorNodeAlreadyInTree.RightChild;
				Trace.WriteLine(string.Format("Existing right child: {0}", existingRightChild));
				Trace.WriteLine(string.Format("Operator Node Already in tree: {0}", operatorNodeAlreadyInTree));
				Trace.WriteLine(string.Format("Operator Node to insert: {0}", operatorNodeToInsert));

				Trace.WriteLine(string.Format("Setting {0} as right child of {1}", operatorNodeToInsert, operatorNodeAlreadyInTree));
				operatorNodeAlreadyInTree.RightChild = operatorNodeToInsert;
				operatorNodeToInsert.Parent = operatorNodeAlreadyInTree;

				Trace.WriteLine(string.Format("Setting {0} as left child of {1}", existingRightChild, operatorNodeToInsert));
				operatorNodeToInsert.LeftChild = existingRightChild;
				existingRightChild.Parent = operatorNodeToInsert;
				newNodeInsertedInTree = true;
			}
			else
			{
				if (operatorNodeAlreadyInTree.Parent != null)
				{
					Trace.WriteLine(string.Format("{0} should be higher up tree than {1}. Moving up tree to compare {0} with next node.", operatorNodeToInsert, operatorNodeAlreadyInTree));
					operatorNodeAlreadyInTree = operatorNodeAlreadyInTree.Parent;
				}
				else
				{
					Trace.WriteLine(string.Format("At top of the tree: making {0} the parent of {1}.", operatorNodeToInsert, operatorNodeAlreadyInTree));
					operatorNodeToInsert.LeftChild = operatorNodeAlreadyInTree;
					operatorNodeAlreadyInTree.Parent = operatorNodeToInsert;
					newNodeInsertedInTree = true;
				}
			}
		}
	}
}

The operator nodes are compared on operator precedence and relative number of open parenthesis. This is done in the OperatorNodeComparer. Parenthesis is considered first, then if they’re equal on both operators, operator precedence is considered.

public class OperatorNodeComparer : IComparer<OperatorNode>
{
	private Dictionary<char, int> operators = new Dictionary<char, int> { { '*', 1 }, { '/', 1 }, { '+', 2 }, { '-', 2 } };
		
	public int Compare(OperatorNode x, OperatorNode y)
	{
		var leftOpenParenthesisCount = x.OpenParenthesisCount;
		var rightOpenParenthesisCount = y.OpenParenthesisCount;
		if (leftOpenParenthesisCount != rightOpenParenthesisCount)
		{
			return rightOpenParenthesisCount.CompareTo(leftOpenParenthesisCount);
		}

		var left = x.Symbol;
		var right = y.Symbol;
		var leftInt = operators[left];
		var rightInt = operators[right];
		return leftInt.CompareTo(rightInt);
	}
}

I’ve done my best to explain it here, but to best understand how node’s are inserted in the tree, or just to see the full code, have a look at the code at the Miscellaneous project on my github account (yes – I was quite tired when I came up with the name of that project). Running the tests in MathsExpressionParserTests outputs trace statements which help to understand how nodes are inserted – these statements proved very useful in debugging!

So I hope you found this interesting! A few caveats worth mentioning – like all my posts so far, this isn’t intended to be production code – there’s no exception handling, and expressions are not being validated. I was more interested in performing the main functionality of the parser than catching all those edge conditions.

I’ll try not to leave it so long until my next blog post! I think I’ll go back to smaller tasks for the next few posts at least. As usual, comments / thoughts / criticisms are always welcome!

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s