R-CNN

From "Rich feature hierarchies for accurate object detection and semantic segmentation":

CNNs saw heavy use in the 1990s, but then fell out of fashion with the rise of support vector machines. In 2012, Krizhevsky et al. rekindled interest in CNNs by showing substantially higher image classification accuracy on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). Their success resulted from training a large CNN on 1.2 million labeled images, together with a few twists on LeCun's CNN (e.g., max(x, 0) rectifying non-linearities and "dropout" regularization).


Instead, we solve the CNN localization problem by operating within the "recognition using regions" paradigm, which has been successful for both object detection and semantic segmentation. At test time, our method generates around 2000 category-independent region proposals for the input image, extracts a fixed-length feature vector from each proposal using a CNN, and then classifies each region with category-specific linear SVMs. We use a simple technique (affine image warping) to compute a fixed-size CNN input from each region proposal, regardless of the region's shape. Since our system combines region proposals with CNNs, we dub the method R-CNN: Regions with CNN features.

Our object detection system consists of three modules. The first generates category-independent region proposals. These proposals define the set of candidate detections available to our detector. The second module is a large convolutional neural network that extracts a fixed-length feature vector from each region. The third module is a set of class-specific linear SVMs.
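The three-module pipeline can be sketched in Python; every helper here (`propose_regions`, `cnn_features`, the `svms` dict) is a hypothetical placeholder standing in for the actual proposal generator, CNN, and trained SVMs:

```python
def rcnn_detect(image, propose_regions, cnn_features, svms):
    """Sketch of R-CNN at test time (all helpers are hypothetical).

    propose_regions: returns ~2000 category-independent boxes.
    cnn_features:    warps each crop to a fixed size and returns a
                     fixed-length feature vector.
    svms:            dict mapping class name -> linear SVM scorer.
    """
    detections = []
    for box in propose_regions(image):      # module 1: region proposals
        feat = cnn_features(image, box)     # module 2: CNN feature extraction
        for cls, svm in svms.items():       # module 3: per-class linear SVMs
            score = svm(feat)
            if score > 0:                   # positive decision value
                detections.append((box, cls, score))
    return detections
```

In the full system these raw detections would still go through per-class non-maximum suppression before being reported.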


Selective Search (Region Proposals), main idea:
  1. Over-segment the image into small initial regions (roughly 1k–2k of them).
  2. Among the current regions, repeatedly merge the adjacent pair most likely to belong together, until the whole image has been merged into a single region.
  3. Output every region that ever existed during this process; these are the candidate region proposals.
The merge rules preferentially merge pairs of regions that are:
  • similar in color (color histograms)
  • similar in texture (gradient histograms)
  • small in combined area: this keeps the merging scale roughly uniform and prevents one large region from gradually "swallowing" the smaller ones. (Example: for regions a-b-c-d-e-f-g-h, a good merge order is ab-cd-ef-gh -> abcd-efgh -> abcdefgh; a bad one is ab-c-d-e-f-g-h -> abcd-e-f-g-h -> abcdef-gh -> abcdefgh.)
  • such that the merged region fills a large fraction of its bounding box: this keeps the merged shapes regular.
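The greedy merging loop above can be sketched as a toy, using a single caller-supplied similarity score in place of the real color/texture/size/fill measures, and ignoring adjacency for brevity:

```python
def selective_search(regions, similarity):
    """Toy sketch of Selective Search's greedy merging.

    regions:    list of initial over-segmented regions (sets of pixel ids here).
    similarity: function(a, b) -> score; stands in for the combined
                color + texture + size + fill measure (higher = merge first).
    Returns every region that ever existed, i.e. the proposal set.
    """
    proposals = list(regions)
    while len(regions) > 1:
        # pick the most similar pair (real Selective Search only considers
        # adjacent pairs; we skip the adjacency check in this sketch)
        pairs = [(similarity(a, b), i, j)
                 for i, a in enumerate(regions)
                 for j, b in enumerate(regions) if i < j]
        _, i, j = max(pairs)
        merged = regions[i] | regions[j]            # union of pixel sets
        regions = [r for k, r in enumerate(regions) if k not in (i, j)]
        regions.append(merged)
        proposals.append(merged)                    # keep intermediate regions
    return proposals
```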

Fine-tuning

Network architecture
The same network as above is used, with the last layer replaced by a 4096 -> 21 fully connected layer.
Learning rate 0.001; each batch contains 32 positive samples (drawn from the 20 classes) and 96 background samples.
Training data
The PASCAL VOC 2007 training set: the input is an image, and the output is a 21-way class label covering the 20 object classes plus background.
For each candidate box, consider the ground-truth box on the image with which it has the largest overlap. If the overlap ratio exceeds 0.5, the candidate is assigned that ground truth's class; otherwise it is labeled background.
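The labeling rule above can be sketched as follows; `iou` is a caller-supplied overlap function, and the box/label formats are illustrative assumptions:

```python
def finetune_label(proposal, gt_boxes, gt_labels, iou, background=0):
    """Assign a fine-tuning label to one region proposal.

    iou: function(box_a, box_b) -> overlap ratio in [0, 1].
    The proposal takes the class of the ground-truth box it overlaps
    most, if that overlap exceeds 0.5; otherwise it is background.
    """
    if not gt_boxes:
        return background
    overlaps = [iou(proposal, gt) for gt in gt_boxes]
    best = max(range(len(overlaps)), key=overlaps.__getitem__)
    return gt_labels[best] if overlaps[best] > 0.5 else background
```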

Classification

Classifier
For each object class, a binary linear SVM decides whether a region belongs to that class. Its input is the 4096-dimensional feature vector produced by the network; its output is a yes/no decision for that class.
Because negatives vastly outnumber positives, hard negative mining is used.
Positive samples
The ground-truth boxes of the class.
Negative samples
A candidate box is a negative sample if its overlap with every ground-truth box of the class is below 0.3.

Bounding-box refinement


Object detection is evaluated by overlap area: many seemingly accurate detections are rejected because the candidate box is not tight enough and the overlap is too small, so a position-refinement step is needed.
Regressor
For each object class, a linear ridge regressor (regularization weight λ = 10000) refines the box. The input is the feature vector from the network's pool5 layer; the output is a scale and a translation in the x and y directions.
Training samples
Among the candidate boxes assigned to the class, those whose overlap with the ground truth exceeds 0.6.
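A minimal sketch of the per-class regressor, assuming (x, y, w, h) box parameterization and the regression targets from the R-CNN paper (translations normalized by box size, log-space scales), trained with the closed-form ridge solution:

```python
import numpy as np

def train_bbox_regressor(features, proposals, gt_boxes, lam=10000.0):
    """Sketch of per-class bounding-box regression via ridge regression.

    features:  (N, D) pool5 features of proposals assigned to this class.
    proposals: (N, 4) boxes as (center x, center y, width, height).
    gt_boxes:  (N, 4) matched ground-truth boxes, same format.
    Returns a (D, 4) weight matrix predicting the four targets.
    """
    P = np.asarray(proposals, dtype=float)
    G = np.asarray(gt_boxes, dtype=float)
    t = np.stack([(G[:, 0] - P[:, 0]) / P[:, 2],       # x translation
                  (G[:, 1] - P[:, 1]) / P[:, 3],       # y translation
                  np.log(G[:, 2] / P[:, 2]),           # width scale (log)
                  np.log(G[:, 3] / P[:, 3])], axis=1)  # height scale (log)
    X = np.asarray(features, dtype=float)
    # closed-form ridge solution: (X^T X + lam * I)^-1 X^T t
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ t)
```

With λ as large as 10000, the predicted corrections are heavily shrunk toward zero, so the regressor only nudges boxes that are already close to the ground truth.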



"Negative" means a background sample; "hard" means it is a difficult one: a negative on which the classifier's loss is large (the prediction is far from the label), i.e., a negative that is easy to mistake for a positive. For example, if an RoI contains no object at all, the classifier easily labels it background; that is an easy negative. If an RoI contains half of an object but its label is still negative, the classifier is likely to call it positive; that is a hard negative.
Hard negative mining means finding more such hard negatives and adding them to the negative training set; training on these works better than training on a set made up of easy negatives, mainly by lowering the false-alarm rate (fewer false positives).
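The mining loop described above can be sketched schematically; `train_svm` and `score` are hypothetical stand-ins for the real SVM trainer and decision function:

```python
def hard_negative_mining(train_svm, score, positives, negatives,
                         rounds=3, batch=128):
    """Schematic hard negative mining loop.

    train_svm: function(pos_list, neg_list) -> classifier
    score:     function(clf, sample) -> decision value (>0 reads "positive")
    Start from a small negative set, then repeatedly add the negatives
    the current model gets most wrong (highest scores) and retrain.
    """
    neg_pool = list(negatives)
    neg_set = neg_pool[:batch]                    # initial negative subset
    clf = train_svm(positives, neg_set)
    for _ in range(rounds):
        # hard negatives: background samples the model scores as positive
        hard = sorted((s for s in neg_pool if s not in neg_set),
                      key=lambda s: score(clf, s), reverse=True)[:batch]
        if not hard:
            break
        neg_set += hard
        clf = train_svm(positives, neg_set)       # retrain on enlarged set
    return clf
```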


Author: 光明
Link: https://www.zhihu.com/question/46292829/answer/284236956
Source: Zhihu

For a binary classification problem, model accuracy is easy to compute, since each output is simply true or false.
For object detection, assessing the model is not that straightforward, so we need IoU (intersection over union), the standard measure of accuracy for object detection predictions.

If the predicted bounding box exactly matches the ground-truth bounding box (which is hand-labeled), the prediction gets full marks.

In the real world, an exactly accurate match is rare, so we define a new metric as a ratio: the numerator is the area of overlap between the predicted bounding box and the ground-truth bounding box, and the denominator is the area of their union, i.e., the total area occupied by the predicted and ground-truth boxes together.
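That ratio can be written directly in code for axis-aligned (x1, y1, x2, y2) boxes:

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)  # zero when the boxes are disjoint
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)
```

For example, two 10×10 boxes shifted by half a width overlap in a 5×10 strip, giving IoU = 50 / 150 = 1/3.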

You Only Look Once (YOLO): this object detection algorithm is currently the state of the art, outperforming R-CNN and its variants.
