Handwritten Equation Solver

Anil Parshad
12 min read · Jul 31, 2021

Project Definition

Project Overview

Optical Character Recognition (OCR) is a growing field in today’s market. The need for OCR spans multiple industries, including healthcare, insurance, and banking. Each industry presents a unique challenge, such as digitizing and processing checks, claims, or prescriptions (Anil Matcha). With so much variance in handwriting, it can be difficult to collect enough data and properly train an OCR model that performs well.

Identifying and interpreting handwriting into a digital format presents a number of challenges, including:

  • variability in how characters are drawn from person to person
  • differences or mixes of writing style (e.g. print vs. cursive)
  • differences in the quality of the source document
  • collecting an adequate amount of labeled data to train on

This is my attempt at creating an OCR model whose eventual goal is to evaluate a handwritten equation and print the result. Given the challenges listed above, how do we accurately predict these handwritten characters?

Problem Statement

With the increasing demand for OCR and the unique challenges discussed above, this blog walks through an attempt to train a model on a set of handwritten images, digitized as jpg files, containing single numerical digits and the plus and minus operators.

The data is processed and put into an array that is fed to a Convolutional Neural Network (CNN). The best model is chosen based on the best (minimum) loss value on our validation images over 100 epochs. The CNN is constructed with three convolutional layers, two dense layers, and a max pool size of 2. Given how well CNNs work with images, the validation data reached an accuracy of 0.9841 with a loss value of 0.0636.

The model is tested with 20 handwritten jpg images containing mathematical expressions that add and subtract two or three numbers of 1 to 3 digits each. The outcome was less than stellar, with only 25 percent accuracy. In a significant number of test cases, the model did not receive the correct number of characters as input. With this error, the model cannot accurately predict the characters, much less evaluate the expression to give a correct answer. Although accuracy also seems to be an issue, it is not clear how much of one until the input data for our test images can be corrected.

Metrics

Although our model tracks accuracy as a metric, the validation loss (val_loss) is what is taken into account when choosing the best weights. Using loss instead of accuracy to choose the best weights helps us keep the cost of the model’s predictions as low as possible while still maintaining high accuracy. The reason we use accuracy as a metric here instead of something like an F1 score is that the importance of each class is the same; we care more about maximizing accuracy and minimizing loss.

For further reading on loss and accuracy vs f1-score, these two blog posts were enlightening.

Analysis

Data Exploration and Visualization

First, I’ll talk about the dataset. The dataset I used comes from Kaggle. It contains thousands of images for multiple characters, including digits, operators, the Greek and English alphabets, and other mathematical symbols. For this particular project, due to time constraints (the whole dataset can take up to 3 hours to train), I only focused on the numerical digits as well as the plus and minus operators. Below is a small example of how the data looks: imperfect handwriting with some variation. Each image is in jpg format with 45 x 45 dimensions. All images are written in black on a white background. Please note that each zero shown below is in its own jpg file; there is only one character per image file.

Handwritten zeros in three different image files

There are a total of 153,366 images in our dataset. Half of these are used for training while the other half is used for validation. The number of images per class is not equal; some classes have many more data points than others. Class 7 has only 2,909 images, while the minus sign has the most at 33,997. There are 4 classes with more than 25,000 images, while the next lowest starts at 10,909. There is a rather big disparity here, and some classes will be trained more, arguably better, than others.
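As a quick sanity check on the class balance, a short sketch like the one below can count the files in each class folder. It assumes one folder per class under a hypothetical dataset directory; the exact paths will differ on your machine.

import os

dataset_dir = 'dataset'  # hypothetical path to the extracted class folders
for class_name in sorted(os.listdir(dataset_dir)):
    class_path = os.path.join(dataset_dir, class_name)
    if os.path.isdir(class_path):
        # count the jpg files belonging to this class
        count = len([f for f in os.listdir(class_path) if f.endswith('.jpg')])
        print(class_name, count)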

Methodology — Tackling the Problem

In order to start identifying our characters we first have to go through a number of steps: prepare the data, train the model, and then test it.

Data Preprocessing — Preparing the Data

The data for each digit and operator is kept in its own folder, and half of it is moved into a validation folder with the same folder structure. Each image is converted to grayscale, and then the pixels are inverted so that the background is black and the character outline is white; this makes contours easier to find. Contours are found using cv2 and then sorted from left to right. The inverted image is cropped according to the location and dimensions of the bounding rectangle and then resized to a 28 x 28 image. This is done for each bounding box created by cv2.boundingRect, and the result is placed into an array. A labels array is also created that corresponds to the class output we are expecting. For example, 0 maps to class 0, 1 to class 1, and so forth. Plus (+) and minus (-) map to classes 10 and 11 respectively.

import os
import cv2

# read each character image in grayscale
img = cv2.imread(os.path.join(folder_path, filename), cv2.IMREAD_GRAYSCALE)

# invert image pixels
invert = cv2.bitwise_not(img)
# make image binary (black or white) based on threshold value
ret, thresh = cv2.threshold(invert, 127, 255, cv2.THRESH_BINARY)
# find contours in the binary image
contours, hierarchy = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
# sort contours from left to right
sorted_bounding_boxes = sort_contours(contours)

# track the largest bounding box among the sorted contours
maxi = 0
for box in sorted_bounding_boxes:
    # get the top-left corner point as well as the width and height of the box
    x, y, width, height = cv2.boundingRect(box)
    maxi = max(width * height, maxi)
    if maxi == width * height:
        x_max = x
        y_max = y
        w_max = width
        h_max = height

# crop the bounding box (the contour) from the original inverted image
crop_img = invert[y_max:y_max + h_max + 10, x_max:x_max + w_max + 10]
# resize the cropped image to a 28 x 28 image
resize_img = cv2.resize(crop_img, (28, 28))
train_data.append(resize_img)
label_array.append(label_dict[folder])
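The sort_contours helper used above isn’t shown here; a minimal sketch of the idea, ordering contours left to right by the x coordinate of their bounding rectangles, could look like this (the version in the repository may differ):

def sort_contours(contours):
    # pair each contour with its bounding rectangle and sort by the x coordinate
    paired = sorted(zip(contours, [cv2.boundingRect(c) for c in contours]),
                    key=lambda pair: pair[1][0])
    # return the contours in left-to-right order
    return [contour for contour, box in paired]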

Implementation

Metric Implementation

The ModelCheckpoint class from the keras.callbacks package is used to keep track of the best outcome and save the associated weights for future use. Training just the digits and the two mathematical operators took 3,247 seconds (about 54 minutes) to complete.

# compile the model
model.compile(loss='categorical_crossentropy', optimizer='rmsprop',
              metrics=['accuracy'])

# train the model and save the best weights
checkpointer = ModelCheckpoint(filepath='best.hdf5', verbose=1,
                               save_best_only=True)
model.fit(x_train, y_train, batch_size=32, epochs=100,
          validation_data=(x_validation, y_validation), verbose=2,
          callbacks=[checkpointer], shuffle=True)

Training our Data

From this point we are ready to train our model: a Convolutional Neural Network (CNN) built with Keras. CNNs have become quite popular in the deep learning/machine learning space because they have a number of advantages when working with image data.

The input arrays must be in the correct shape to be valid input for the CNN. Let’s focus on the shape of our training data. The training data array as a whole has 4 dimensions: the number of training samples, the width and height of the image, and the depth of the image. In this particular case, the shape of the training data is (99600, 28, 28, 1). Note that for an RGB image the depth would be 3, but since this is a single-channel (black/white) image, the depth is 1.

x_train = np.array(train_data).reshape(len(train_data), len(train_data[0]), len(train_data[0][0]), 1)
y_train = keras.utils.to_categorical(train_labels, num_classes)

Each training sample enters the CNN as a 28 x 28 x 1 matrix. The height and width (the spatial dimensions) gradually decrease with each layer, while the depth increases. We are trying to learn the shape of the contour and use that to determine which character the image might be.

The model is defined in code as follows.

model = Sequential()
model.add(Conv2D(filters=16, kernel_size=2, padding='same', activation='relu',
                 input_shape=(28, 28, 1)))
model.add(MaxPooling2D(pool_size=2))
model.add(Conv2D(filters=32, kernel_size=2, padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Conv2D(filters=64, kernel_size=2, padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.3))
model.add(Flatten())
model.add(Dense(500, activation='relu'))
model.add(Dropout(0.4))
model.add(Dense(12, activation='softmax'))

Here I’ll lay out a short explanation of each parameter. The filters parameter increases the depth of the output and is typically set to powers of 2.

Our pool_size of 2 reduces the spatial dimensions by a factor of 2. The size of each filter is set by kernel_size, which is 2 here (meaning a 2 x 2 matrix); the default stride in Conv2D is (1, 1).

Padding refers to what we do at the edges of the input layer, where the kernel would fall outside it. Setting padding to ‘same’ pads the input layer with 0’s instead of losing the information from those edge nodes, and is generally considered to give the best results.

We add dropout layers to minimize overfitting. We typically start with smaller values and increase as needed. You can read more about dropout here: https://machinelearningmastery.com/how-to-reduce-overfitting-with-dropout-regularization-in-keras/.

We flatten our last max pooling layer into a single vector and then use the Dense function. Each dense layer takes an input and returns an output of the size specified by its argument. This reduces the dimensions of our output, and the number of parameters, until we reach the end, where the output is the number of classes (in our case 12).

For activations, the last output layer is typically set to ‘softmax’, which returns probabilities for our classes. The hidden layers use ‘relu’ as their activation. ReLU stands for rectified linear unit, and this activation is usually used when training deep neural networks. The reasons behind using it are too broad to discuss here, but I’ll leave this link here if you are interested.
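A quick way to see how the spatial dimensions shrink and the depth grows from layer to layer is to print the Keras summary of the model defined above. The shapes in the comments below follow directly from the layer definitions (28 x 28 input, ‘same’ padding, pool size 2):

model.summary()
# Conv2D(16)   -> (None, 28, 28, 16)
# MaxPooling2D -> (None, 14, 14, 16)
# Conv2D(32)   -> (None, 14, 14, 32)
# MaxPooling2D -> (None, 7, 7, 32)
# Conv2D(64)   -> (None, 7, 7, 64)
# MaxPooling2D -> (None, 3, 3, 64)
# Flatten      -> (None, 576)
# Dense(500)   -> (None, 500)
# Dense(12)    -> (None, 12)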

Refinement

Much of the time was spent getting the input data into the correct shape to be accepted by the Conv2D method. Learning how to properly manipulate the image data had a bit of a learning curve to it. Because the data already appeared to be black and white, and because it is hard to inspect the entire image array, I did not notice that there were pixels that were not 0 or 255 (the values that correspond to black and white respectively). Using OpenCV’s bitwise_not method seemed to do the job, but the resulting image was not a true binary image. This problem didn’t become apparent until testing, when the shape of our testing data was not being created correctly. OpenCV’s threshold method needed to be applied to make sure any pixels that didn’t fall within a certain range were changed to 0 or 255.
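One quick way to confirm the problem described above, and that the threshold fixed it, is to inspect the unique pixel values of the inverted and thresholded arrays from the preprocessing snippet; a small sketch:

import numpy as np

# before thresholding, anti-aliased edges leave intermediate gray values
print(np.unique(invert))   # many values between 0 and 255
# after cv2.threshold with THRESH_BINARY, only pure black and white remain
print(np.unique(thresh))   # [  0 255]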

There wasn’t really any tuning needed for the CNN itself; the results were quite good from the get-go. Increasing the dropout did increase our val_loss and decrease our validation accuracy. The values of 0.3 and 0.4 were used because the validation results were more favorable, giving an accuracy of 0.9841 with a loss value of 0.0636. For example, increasing the dropout values to 0.4 and 0.5 decreased accuracy to 0.9799 and increased loss to 0.0852. Further increasing the dropout to 0.5 and 0.6 drops the accuracy only a little more, but the increase in loss is more significant: these dropouts change the accuracy to 0.9709 and the loss to 0.1133. These values did not seem to have a significant impact on our testing data, however, resulting in one less correct output at the higher dropout rates. Increasing the dropout rate did have the benefit of reducing the training time: with the 0.3 and 0.4 dropout rates, training took 3,247 seconds, while with the higher 0.5 and 0.6 rates it took about 2,743 seconds.

Results

Model Evaluation and Validation

Out of 100 epochs, evaluating our model with the validation data yields a loss of 0.0636 and an accuracy of 0.9841 when using the best saved weights. The evaluation of our model happens during training, with ModelCheckpoint using val_loss to determine which weights are best and to save them. Below we can see the values for the validation accuracy and loss over these 100 epochs. Generally, the accuracy and val_loss are inversely related here.
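Reloading the best checkpoint and re-scoring the validation set is a short sketch, assuming the same model object and validation arrays used during training:

# load the weights saved by ModelCheckpoint and evaluate on the validation set
model.load_weights('best.hdf5')
val_loss, val_accuracy = model.evaluate(x_validation, y_validation, verbose=0)
print(val_loss, val_accuracy)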

Justification

Testing

Once the training is complete we’re ready to test our model. The test data looks a bit different from the training and validation data: I created my own test data simply by writing a few equations in Paint. Because the test data contains multiple characters, it is the job of the feature extraction to draw the bounding boxes around the correct characters. In the case that multiple boxes are drawn around the same character, we need to check for overlap; if there is an overlap, the box with the smaller area is removed from our array. We send the resulting array to our model and try to predict our classes.
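The overlap check isn’t part of the prediction snippet below; a minimal sketch of the rule described above, assuming each box is an (x, y, w, h) tuple, might look like this:

def remove_overlapping_boxes(boxes):
    # keep a box only if it does not overlap a larger box
    keep = []
    for i, (x1, y1, w1, h1) in enumerate(boxes):
        overlapped_by_larger = False
        for j, (x2, y2, w2, h2) in enumerate(boxes):
            if i == j:
                continue
            # do the two rectangles intersect?
            intersects = (x1 < x2 + w2 and x2 < x1 + w1 and
                          y1 < y2 + h2 and y2 < y1 + h1)
            if intersects and w1 * h1 < w2 * h2:
                overlapped_by_larger = True
                break
        if not overlapped_by_larger:
            keep.append((x1, y1, w1, h1))
    return keep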

Test Data Example
result_string = ''
for data in test_data:
    eval_bool = False
    # reshape the 28 x 28 character image into the (1, 28, 28, 1) shape the model expects
    evaluate_data = np.array(data)
    evaluate_data = evaluate_data.reshape(1, evaluate_data.shape[0], evaluate_data.shape[1], 1)
    result = loaded_model.predict_classes(evaluate_data)
    # map the predicted class index back to its character
    value_position = val_list.index(str(result[0]))
    key = key_list[value_position]
    result_string += str(key)
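The final step implied by the project, actually evaluating the reconstructed expression, could be handled in several ways; one hedged sketch, assuming result_string ends up containing only digits, ‘+’, and ‘-’, is:

# evaluate the reconstructed expression, e.g. '12+7-3' -> 16
# (a sketch; the repository may handle this step differently)
if result_string and all(ch in '0123456789+-' for ch in result_string):
    print(result_string, '=', eval(result_string))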

Test Results

For this particular test run we predicted 5 equations out of 20 correctly, so only 25% accuracy. Our biggest issue is accurately creating the bounding boxes around the characters: 9 out of 20 of our test cases had an incorrect number of characters identified when the boxes were created around our contours. Often, the rectangles drawn around the image get interpreted as more characters than are actually there. Below we can see an example of the algorithm seeing an extra character, but it also seems we have an accuracy problem, receiving 580 instead of 589.

Conclusion

Reflection

By far, the most difficult aspect of this project was prepping the data in a way that would be palatable for training a CNN. The model itself gave quite good results during the training/validation portion, but quite terrible ones in the testing phase. The largest factor here, however, is the interpreted input that our model receives as test data.

Improvement

Better boxes need to be drawn around consecutive handwritten characters when grabbing the contours. Updating and trying new values for our CNN should also be considered, possibly adding more layers and reducing the spatial dimensions even further. That said, it is difficult to advocate for changing the current model given that the input in these cases is less than ideal; making sure the image data reaches our input in an accurate way would allow better insights into this model. Other than creating a new way to determine the bounding boxes, presenting test data with thicker characters may help the testing portion perform better. This shouldn’t be an issue if the handwritten input comes from a computer, since writing with a particular thickness in a drawing program is trivial. However, for hard-copy mediums (handwriting on notebook paper, for example), this is out of our control in a real environment.

Final Thoughts

Although strides have been made in OCR, it is still a difficult problem, requiring a good amount of data and time to get right. OCR is a growing sector whose usefulness would be a boon to many industries, and more progress is sure to be made. My code can be found here on GitHub: https://github.com/prsn670/Handwritten-Equation-Solver
