Signature & Date Detector on Documents using Faster-RCNN Algorithm

Sitong Ye
Mar 24, 2021

Figure: test data with visualized detections and bounding boxes

In my first-ever data science blog post, we will walk through a less typical use case of multi-object detection using Faster R-CNN: finding signatures and dates on scanned documents.

My code is available here:

https://github.com/sitongye/Document_MultiObject_Detection_FasterRCNN

Use Case Description

The goal is to detect two classes of objects on documents, handwritten signatures and dates, together with their locations in the form of bounding boxes. Since we only need to classify a region as handwriting, regardless of what the name or date actually says, the use case maps to multi-object detection; the bounding-box location is expressed as the predicted relative x and y position of the detected region's center.

Algorithms

After defining the case type, the candidate algorithms narrow down to those suited for multi-object detection. Popular algorithms for this class of problem include, but are not limited to:

  • Fast R-CNN
  • Faster R-CNN
  • Histogram of Oriented Gradients (HOG)
  • Region-based Convolutional Neural Networks (R-CNN)
  • Region-based Fully Convolutional Network (R-FCN)
  • Single Shot Detector (SSD)
  • Spatial Pyramid Pooling (SPP-net)
  • YOLO (You Only Look Once)

It would be reasonable to benchmark each algorithm on this same use case; however, this article focuses on the region-based algorithm family, and the implementation uses Faster R-CNN, which evolved along the line R-CNN → Fast R-CNN → Faster R-CNN.

A very brief introduction to R-CNNs

  1. R-CNN

original paper: link

resource: link

The basic idea of a region-based convolutional network is to use selective search (a greedy algorithm that clusters similar regions of an image based on texture, size, color, and shape) to extract regions of interest (ROIs) as candidates for a CNN. Each candidate region is passed through the CNN, which is further connected to a binary SVM. Among all regions classified as positive for a defined class, the regions are filtered according to their overlap with the ground truth, measured by IoU (intersection over union); only those with an IoU at or above a defined threshold (e.g., 0.3) are considered.
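For concreteness, here is a minimal, generic IoU computation for two axis-aligned boxes in (x1, y1, x2, y2) format; this is a sketch for illustration, not the exact code used in any particular implementation:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    # Coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: a proposal vs. a ground-truth box
print(iou((10, 10, 50, 50), (30, 30, 70, 70)))  # ~0.14, below a 0.3 threshold
```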

A bounding-box regression model is then trained to correct the bounding box locations.

Non-Maximum Suppression

Another problem that arises is that multiple regions can refer to the same target object, which introduces noise into the final output. This issue is handled with “non-maximum suppression” (NMS): among all detected regions that overlap and belong to the same class, only the one with the highest confidence score is kept.
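A minimal sketch of greedy NMS in NumPy, assuming per-class boxes in (x1, y1, x2, y2) format with confidence scores:

```python
import numpy as np

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop overlapping ones.

    boxes:  (N, 4) array in (x1, y1, x2, y2) format
    scores: (N,) confidence scores for one class
    """
    order = np.argsort(scores)[::-1]          # highest confidence first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the top box with all remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = ((boxes[order[1:], 2] - boxes[order[1:], 0])
                 * (boxes[order[1:], 3] - boxes[order[1:], 1]))
        overlap = inter / (area_i + areas - inter)
        # Keep only boxes that do not overlap the selected box too much
        order = order[1:][overlap <= iou_threshold]
    return keep

# Example: two overlapping detections of the same signature, one elsewhere
boxes = np.array([[10, 10, 60, 60], [12, 12, 62, 62], [200, 200, 250, 250]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(non_max_suppression(boxes, scores))  # [0, 2] -> the duplicate is suppressed
```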

Pitfall of RCNN

The main drawback of R-CNN is that selective search is very time-consuming, and passing regions through two models without weight sharing is also costly and slow: selective search generates around 2,000 regions of interest per image, and each of them must run through the CNN independently to produce its own feature vector.

2. Fast R-CNN

original paper source: link

resource: link

Fast R-CNN unifies the separate models into a single framework to speed up training and increase weight sharing between the networks. Instead of feeding individual image crops selected by selective search into the CNN, Fast R-CNN feeds the original image through the CNN once and then works on the resulting feature map instead of per-region feature vectors.

ROI Pooling

Region proposals are projected onto the feature map derived from the CNN, and each proposal is converted via max pooling into a fixed-size input for the fully connected layers. These layers connect further to one regressor and one classifier, whose losses are minimized jointly at training time, making Fast R-CNN an end-to-end system instead of a complex pipeline. And this speeds up training on the PASCAL VOC 2007 dataset by roughly 8 times!
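To make the mechanism concrete, here is a toy NumPy sketch of ROI max pooling for a single region (the real layer operates on batched tensors inside the network and maps ROI coordinates from image space to feature-map space):

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=(7, 7)):
    """Max-pool one region of a feature map down to a fixed output size.

    feature_map: (H, W, C) array from the backbone CNN
    roi:         (x1, y1, x2, y2) in feature-map coordinates
    """
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2, :]
    h, w = region.shape[:2]
    out_h, out_w = output_size
    pooled = np.zeros((out_h, out_w, feature_map.shape[2]))
    # Split the region into an out_h x out_w grid and take the max per cell
    ys = np.linspace(0, h, out_h + 1).astype(int)
    xs = np.linspace(0, w, out_w + 1).astype(int)
    for i in range(out_h):
        for j in range(out_w):
            cell = region[ys[i]:max(ys[i + 1], ys[i] + 1),
                          xs[j]:max(xs[j + 1], xs[j] + 1), :]
            pooled[i, j, :] = cell.max(axis=(0, 1))
    return pooled

fmap = np.random.rand(38, 50, 512)            # e.g. a VGG-16 conv feature map
print(roi_pool(fmap, (5, 3, 30, 20)).shape)   # (7, 7, 512), regardless of ROI size
```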

3. Faster R-CNN

original paper: link

One weakness of Fast R-CNN is that the region proposal step does not share weights with the CNN layers. Faster R-CNN therefore inserts a Region Proposal Network (RPN) after the last convolutional layer; the RPN uses the CNN feature map to propose regions directly.

resource: link
Figure: Faster R-CNN overview (diagram by Sitong Ye)

3.1 Convolutional layers

In the original paper, the CNN backbone is a VGG-16 network used via transfer learning; later implementations also use other large networks pre-trained on ImageNet.

resource: link

3.2 Region Proposal Network

To propose regions from a feature map, a fixed set of k anchor boxes of different scales and aspect ratios slides over the feature map, generating candidate regions to be fed further into a classifier and a regressor within the RPN. The procedure is described as follows:

  • slide a window over the feature map
  • at each window position, generate k anchor boxes of different shapes and sizes
  • for each anchor, predict:

→ whether the anchor contains an object (binary classification: foreground/background)

→ the bounding-box adjustments that shift the anchor to better fit the object

  • output the predictions for all k anchors
Figure: RPN anchor boxes (diagram by Sitong Ye)

E.g., if the feature map has 38 × 50 = 1,900 positions and k = 9, there are 1,900 × 9 = 17,100 potential anchors, as the sketch below illustrates.
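A minimal sketch of anchor generation, assuming the common configuration of 3 scales × 3 aspect ratios and a feature stride of 16 (the exact values vary by implementation):

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate k = len(scales) * len(ratios) anchors per feature-map cell.

    Returns a (feat_h * feat_w * k, 4) array of (x1, y1, x2, y2) boxes
    in input-image coordinates.
    """
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            # Anchor center in image coordinates
            cx, cy = x * stride + stride / 2, y * stride + stride / 2
            for scale in scales:
                for ratio in ratios:
                    # Width/height chosen so the box area stays ~scale^2
                    w = scale * np.sqrt(ratio)
                    h = scale / np.sqrt(ratio)
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return np.array(anchors)

print(generate_anchors(38, 50).shape)  # (17100, 4) -> 1900 positions x 9 anchors
```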

Not all anchors are passed on as training samples for the RPN classifier and regressor; the following criteria apply (a labeling sketch follows this list):

  • only sample a certain number of positive and negative anchor boxes for training (e.g., select 256 anchors as a mini-batch)
  • filter out anchors that cross the image border
  • for every ground-truth box, the anchor with the largest IoU is marked as positive; in addition, anchors with IoU above 0.7 count as positive, those below 0.3 as negative, and those in between are ignored
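A simplified sketch of this labeling rule (border filtering and the 256-anchor mini-batch sampling are omitted for brevity):

```python
import numpy as np

def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Assign RPN training labels: 1 = positive, 0 = negative, -1 = ignored.

    anchors: (N, 4) and gt_boxes: (M, 4), both in (x1, y1, x2, y2) format.
    """
    # Vectorized IoU matrix of shape (N, M)
    x1 = np.maximum(anchors[:, None, 0], gt_boxes[None, :, 0])
    y1 = np.maximum(anchors[:, None, 1], gt_boxes[None, :, 1])
    x2 = np.minimum(anchors[:, None, 2], gt_boxes[None, :, 2])
    y2 = np.minimum(anchors[:, None, 3], gt_boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = ((anchors[:, 2] - anchors[:, 0])
              * (anchors[:, 3] - anchors[:, 1]))[:, None]
    area_g = ((gt_boxes[:, 2] - gt_boxes[:, 0])
              * (gt_boxes[:, 3] - gt_boxes[:, 1]))[None, :]
    iou = inter / (area_a + area_g - inter)

    labels = -np.ones(len(anchors), dtype=int)   # default: ignored
    max_iou = iou.max(axis=1)
    labels[max_iou < neg_thresh] = 0             # clear negatives
    labels[max_iou >= pos_thresh] = 1            # clear positives
    labels[iou.argmax(axis=0)] = 1               # best anchor per ground truth
    return labels
```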

3.3 ROI Pooling

Figure: ROI pooling in Faster R-CNN (diagram by Sitong Ye)

As introduced above, ROI pooling regularizes proposed regions of different sizes into fixed-size inputs for the stacked fully connected layers. In Faster R-CNN, these regions are received from the Region Proposal Network.

3.4 Classification of objects and regression of bounding boxes

Likewise, the fully connected layers are further connected to a classifier and a regressor for the classification and localization of objects.
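As an illustration, here is a minimal tf.keras sketch of these two heads on top of ROI-pooled features; the 4096-unit layers follow the VGG-16-style heads from the paper and are an assumption, not the exact architecture of my implementation:

```python
from tensorflow.keras import layers

def detector_heads(pooled_features, num_classes):
    """Classifier + box regressor on top of ROI-pooled features (a sketch).

    pooled_features: tensor of shape (num_rois, 7, 7, channels)
    num_classes includes the background class for the softmax head.
    """
    x = layers.Flatten()(pooled_features)
    x = layers.Dense(4096, activation='relu')(x)
    x = layers.Dense(4096, activation='relu')(x)
    # One softmax score per class (including background) ...
    cls = layers.Dense(num_classes, activation='softmax', name='cls')(x)
    # ... and 4 box-refinement values per foreground class
    reg = layers.Dense(4 * (num_classes - 1), activation='linear', name='reg')(x)
    return cls, reg
```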

Implementation

dataset: Tobacco 800

Depending on how the data is labeled, data preparation aims at generating, for both training and test data, ground-truth information containing the following features:

Figure: ground-truth annotation features (table by Sitong Ye)
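The exact columns depend on the labeling tool; purely for illustration, a simple comma-separated annotation format used by some Keras Faster R-CNN ports looks like this (the file names and coordinates below are hypothetical):

```python
# One line per object: image path, box corners, class name (hypothetical example)
annotations = [
    "images/doc_001.png,104,388,307,441,signature",
    "images/doc_001.png,330,392,420,415,date",
]

for line in annotations:
    filepath, x1, y1, x2, y2, class_name = line.split(',')
    print(filepath, class_name, (int(x1), int(y1), int(x2), int(y2)))
```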

A complete Faster R-CNN is not shipped as a single ready-made function in core TensorFlow or PyTorch, but the community has already provided several implementations for both frameworks.

My implementation, https://github.com/sitongye/Document_MultiObject_Detection_FasterRCNN, was originally based on this GitHub repository.

My trained model and the base model can be downloaded from the repository's Releases page.

By cloning the repo and placing the trained models in the Model folder (alternatively, you can pass the model path to the training, testing, and prediction scripts), you are ready to test the results by running “predict_rcnn.py” as described in the documentation!

Evaluation Metrics:

mAP: mean Average Precision

  1. IoU:

IoU (Intersection over Union), defined as the intersection over the union of the detected bounding box and the ground-truth bounding box, was used during training for non-maximum suppression and region proposal, and it is also an important criterion in the evaluation phase. In object detection, the familiar metrics are defined in terms of IoU as follows:

True Positives [TP]

Number of detections with IoU>0.5

False Positives [FP]

Number of detections with IoU<=0.5, or duplicate detections of an already-detected object

False Negatives [FN]

Number of objects that are not detected, or detected only with IoU<=0.5

Precision

Precision measures how accurate your predictions are, i.e., the percentage of your predictions that are correct.

Precision = True positive / (True positive + False positive)

Recall

Recall measures how well you find all the positives.

Recall = True positive / (True positive + False negative)

F1 Score

The F1 score is the harmonic mean (HM) of precision and recall: F1 = 2 × Precision × Recall / (Precision + Recall).
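These three metrics in code, computed from the detection counts defined above:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# e.g. 8 correct detections, 2 false alarms, 1 missed object
print(precision_recall_f1(8, 2, 1))  # (0.8, 0.888..., 0.842...)
```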


AP

Average Precision (AP) is the area under the precision-recall curve.

“The AP is averaged over all categories. Traditionally, this is called ‘mean average precision’ (mAP). We make no distinction between AP and mAP (and likewise AR and mAR) and assume the difference is clear from context.” (COCO Evaluation)

As an example, let’s calculate the mean AP for two of the output images from the test set.

Figure: evaluation table for the example predictions (by Sitong Ye)

As seen in the table above, we evaluate each prediction made by the model according to its IoU with the ground truth and rank the predictions by confidence in descending order. With this ordering, we accumulate the true positives and false positives and plot the result as shown below:

Figure: accumulated precision-recall plot (by Sitong Ye)

AP refers to the area under this plotted precision-recall curve; averaging it over the classes gives the mAP.
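A minimal sketch of the per-class AP computation from ranked detections (using the step-wise area under the raw curve, without the interpolation some benchmarks apply):

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """AP as the area under the precision-recall curve for one class.

    scores: (N,) confidences; is_tp: (N,) booleans (IoU > 0.5 and not a
    duplicate); num_gt: number of ground-truth objects for this class.
    """
    order = np.argsort(scores)[::-1]          # rank by confidence, descending
    tp = np.cumsum(np.asarray(is_tp)[order])  # accumulated true positives
    fp = np.cumsum(~np.asarray(is_tp)[order])
    recall = tp / num_gt
    precision = tp / (tp + fp)
    # Step-wise area under the precision-recall curve
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recall, precision):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

# Two detections for a class with two ground-truth boxes
print(average_precision([0.9, 0.6], [True, False], num_gt=2))  # 0.5
```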

The whole code structure can be depicted as below:

Figure: code structure (diagram by Sitong Ye)
Figure: training result snippet

The experiment reached a fairly satisfying result, although there are some missed detections, and occasionally a non-handwriting region is misclassified.

Hope you have fun, and leave a comment if you want to discuss or if anything is unclear :)

Cheers and code on!
