Faster RCNN: Detailed Explanation and Implementation

2018-02-26

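I have been studying computer vision algorithms recently. My programming skills are modest and Faster RCNN is a fairly complex algorithm, so although I understood the general principle long ago, reading the source code was still painful. I have finally worked out roughly what the code does and, borrowing from other people's implementations, written my own version in PyTorch. In this post I walk through the code in detail to help readers understand the algorithm better. My code is here: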

Faster-RCNN-pytorch

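Readers who want an introduction to Faster RCNN first can follow the link below; it is the most comprehensive and detailed Faster RCNN tutorial I have come across.

Faster RCNN 详解 (Faster RCNN Explained in Detail)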

Model

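This part corresponds to model_easy.py. Faster RCNN really consists of four modules: the CNN, the RPN (Region Proposal Network), the ROI Pooling layer, and the classification layer (which can be called the Faster RCNN layer). The CNN uses a pretrained vgg16 to extract features and produce the feature map, which is then fed to the RPN. The RPN first applies a convolution with kernel_size 3 and then splits into two branches that output 2*9 and 4*9 channels at every feature-map location (9 anchors * 2 class scores, and 9 anchors * 4 coordinates). Training teaches these branches to correct the position of the proposals, and proposal_layer.py in the rpn folder filters them and reduces their number. The corrected proposals go through the ROI Pooling layer, which resizes each one to 7*7, and finally the FasterRCNN layer outputs the final bbox_pred (bbox_delta) and scores.

The CNN part is straightforward, so I will not go over it here. Let's look at the source code of the RPN.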
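To make the shapes in this pipeline concrete, here is a small bookkeeping sketch. It assumes a total stride of 16 (four 2x2 poolings in vgg16) and 9 anchors per location; rpn_output_shapes is a hypothetical helper of mine, not a function from the repository.

```python
def rpn_output_shapes(height, width, stride=16, num_anchors=9):
    """Return the tensor shapes produced for one input image of size height x width."""
    # Four rounds of floor-halving are equivalent to one floor division by 16
    feat_h, feat_w = height // stride, width // stride
    return {
        "feature_map": (1, 512, feat_h, feat_w),              # vgg16 conv features
        "rpn_cls": (1, 2 * num_anchors, feat_h, feat_w),      # object / non-object scores
        "rpn_bbox": (1, 4 * num_anchors, feat_h, feat_w),     # box deltas per anchor
        "num_anchors_total": feat_h * feat_w * num_anchors,   # candidate boxes before filtering
    }


shapes = rpn_output_shapes(600, 800)
print(shapes)
```

For a 600x800 input this gives a 37x50 feature map, so the RPN scores 37 * 50 * 9 candidate boxes before any filtering.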

RPN

import torch.nn as nn

# the simple model of RPN
class RPN(nn.Module):

    def __init__(self):
        super(RPN, self).__init__()

        self.conv = nn.Sequential(nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=(1, 1)),
                                            nn.ReLU())

        # 9 anchors * 2 classifier outputs (object or non-object) per grid cell
        self.conv1 = nn.Conv2d(512, 2 * 9, kernel_size=1, stride=1)

        # 9 anchors * 4 coordinate regressors per grid cell
        self.conv2 = nn.Conv2d(512, 4 * 9, kernel_size=1, stride=1)
        # logits are reshaped to (N, 2) before the softmax, so normalize over dim 1
        self.softmax = nn.Softmax(dim=1)

    def forward(self, features):

        features = self.conv(features)

        logits, rpn_bbox_pred = self.conv1(features), self.conv2(features)

        height, width = features.size()[-2:]
        logits = logits.squeeze(0).permute(1, 2, 0).contiguous()  # (1, 18, H/16, W/16) => (H/16 ,W/16, 18)
        logits = logits.view(-1, 2)  # (H/16 ,W/16, 18) => (H/16 * W/16 * 9, 2)

        rpn_cls_prob = self.softmax(logits)
        rpn_cls_prob = rpn_cls_prob.view(height, width, 18)  # (H/16 * W/16 * 9, 2)  => (H/16 ,W/16, 18)
        rpn_cls_prob = rpn_cls_prob.permute(2, 0, 1).contiguous().unsqueeze(0) # (H/16 ,W/16, 18) => (1, 18, H/16, W/16)

        return rpn_bbox_pred, rpn_cls_prob, logits

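Note the return values of forward. rpn_bbox_pred is the predicted [dx, dy, dw, dh] from the tutorial mentioned above, and it is the bbox_delta used later. rpn_cls_prob is simply logits passed through a softmax. The overall architecture is simple; next, let's look at the function associated with the RPN.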
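The reshaping around the softmax is easy to get wrong, so here is a NumPy sketch of the same idea: move the 18 channels last, view them as 9 (background, foreground) pairs, apply a softmax to each pair, and restore the layout. rpn_softmax is a hypothetical stand-in for the forward code above, and the assumption that index 1 of each pair is the foreground follows from the labels used in the loss.

```python
import numpy as np


def rpn_softmax(logits):
    """logits: (1, 2*A, H, W) -> probabilities with the same shape."""
    _, channels, height, width = logits.shape
    # (1, 2*A, H, W) -> (H, W, 2*A) -> (H*W*A, 2): one row per anchor
    pairs = logits[0].transpose(1, 2, 0).reshape(-1, 2)
    pairs = pairs - pairs.max(axis=1, keepdims=True)   # for numerical stability
    exp = np.exp(pairs)
    prob = exp / exp.sum(axis=1, keepdims=True)
    # (H*W*A, 2) -> (H, W, 2*A) -> (1, 2*A, H, W)
    return prob.reshape(height, width, channels).transpose(2, 0, 1)[None]


logits = np.random.RandomState(0).randn(1, 18, 3, 4)
prob = rpn_softmax(logits)
# Channels 2k and 2k+1 belong to the same anchor, so each pair sums to 1
pair_sums = prob[0, 0::2] + prob[0, 1::2]
```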

def proposal(self, rpn_bbox_pred, rpn_cls_prob, im_info, test, args):
       """
       Arguments:
           rpn_bbox_pred (Tensor) : (1, 4*9, H/16, W/16)
           rpn_cls_prob (Tensor) : (1, 2*9, H/16, W/16)
           im_info (Tuple) : (Height, Width, Channel, scale_ratios)
           test (Bool) : True or False
           args (argparse.Namespace) : global arguments
       Return:
           # in each minibatch number of proposal boxes is variable
           proposals_boxes (Ndarray) : ( # proposal boxes, 4)
           scores (Ndarray) :  ( # proposal boxes, )
       """
       """
       # Algorithm:
       #
       # for each (H, W) location i
       #   generate A anchor boxes centered on cell i
       #   apply predicted bbox deltas at cell i to each of the A anchors
       # clip predicted boxes to image
       # remove predicted boxes with either height or width < threshold
       # sort all (proposal, score) pairs by score from highest to lowest
       # take top pre_nms_topN proposals before NMS
       # apply NMS with threshold 0.7 to remaining proposals
       # take after_nms_topN proposals after NMS
       # return the top proposals (-> RoIs top, scores top)
       #layer_params = yaml.load(self.param_str_)
       """


       anchors = generate_anchors()
       _num_anchors = anchors.shape[0]

       all_anchors = get_anchor(rpn_cls_prob, anchors)   # [H * W * 9, 4]

       pre_nms_topn = args.test_pre_nms_topn if test else args.pre_nms_topn
       nms_thresh = args.test_nms_thresh if test else args.nms_thresh
       post_nms_topn = args.test_post_nms_topn if test else args.post_nms_topn

       """It's directly from anchor_target_layer, essentially from training the RPN"""
       bbox_deltas = self._get_bbox_deltas(rpn_bbox_pred).data.cpu().numpy()

       # 1. Convert anchors into proposal via bbox transformation
       """Here we need to generate the precise proposal location for the later operation"""
       proposals_boxes = bbox_transform_inv(all_anchors, bbox_deltas)  # (H/16 * W/16 * 9, 4) all proposal boxes
       scores = self._get_pos_score(rpn_cls_prob).data.cpu().numpy()


       # 2. clip predicted boxes to image
       proposals = clip_boxes(proposals_boxes, im_info[:2])

       # 3. remove predicted boxes with either height or width < threshold
       # (NOTE: convert min_size to input image scale stored in im_info[3])
       # filter the clipped boxes, and use the args passed to this function
       keep = filter_boxes(proposals, args.min_size * max(im_info[3]))
       proposals = proposals[keep, :]
       scores = scores[keep]

       # 4. sort all (proposal, score) pairs by score from highest to lowest
       # 5. take top pre_nms_topn (e.g. 6000)
       order = scores.ravel().argsort()[::-1]
       if pre_nms_topn > 0:
           order = order[:pre_nms_topn]
       proposals = proposals[order, :]
       scores = scores[order]

       # 6. apply nms (e.g. threshold = 0.7)

       keep = py_cpu_nms(np.hstack((proposals, scores)), nms_thresh)

       # 7. take after_nms_topN (e.g. 300)
       if post_nms_topn > 0:
           keep = keep[:post_nms_topn]
       
       # 8. return the top proposals (-> RoIs top)
       proposals = proposals[keep, :]
       scores = scores[keep]
       batch_inds = np.zeros((proposals.shape[0], 1), dtype=np.float32)
       blob = np.hstack((batch_inds, proposals.astype(np.float32, copy = False)))
       return blob
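The call get_anchor(rpn_cls_prob, anchors) in the function above is not shown in this post. A minimal sketch of what such a helper typically does, assuming a stride of 16 and the same (H, W, anchor) ordering as the reshaped logits, is given here; shift_anchors and the two base anchors are hypothetical.

```python
import numpy as np


def shift_anchors(base_anchors, feat_h, feat_w, stride=16):
    """Copy the A base anchors to every feature-map cell -> (feat_h * feat_w * A, 4)."""
    shift_x = np.arange(feat_w) * stride
    shift_y = np.arange(feat_h) * stride
    sx, sy = np.meshgrid(shift_x, shift_y)            # both have shape (feat_h, feat_w)
    # One (dx, dy, dx, dy) offset per cell, in row-major (y, x) order
    shifts = np.stack([sx.ravel(), sy.ravel(), sx.ravel(), sy.ravel()], axis=1)
    # Broadcast: (H*W, 1, 4) + (1, A, 4) -> (H*W, A, 4)
    all_anchors = shifts[:, None, :] + base_anchors[None, :, :]
    return all_anchors.reshape(-1, 4)


# Two made-up base anchors in [x1, y1, x2, y2] form, for illustration only
base = np.array([[-8, -8, 23, 23], [-24, -24, 39, 39]], dtype=np.float32)
anchors = shift_anchors(base, feat_h=2, feat_w=3)
```

With 2 base anchors on a 2x3 grid this yields 12 boxes; row 2 is the first base anchor moved one cell (16 pixels) to the right.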

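What does the proposal function do? In the words of the tutorial:

Generate the anchors and apply bbox regression to all of them using [dx(A), dy(A), dw(A), dh(A)] (the anchors are generated exactly as in training).

Sort the anchors by the input foreground softmax scores in descending order and keep the top pre_nms_topN (e.g. 6000), i.e. take the position-corrected foreground anchors.

Use im_info to map the fg anchors from the MxN scale back to the PxQ original image, check whether they extend far beyond the image boundary, and discard the ones that do.

Apply NMS (non-maximum suppression).

Sort the fg anchors that survive NMS by foreground softmax score again, in descending order, and output the top post_nms_topN (e.g. 300) as proposals.

The output is proposal=[x1, y1, x2, y2]. Note that because the third step maps the anchors back to the original image to check the boundaries, the proposals output here are at the scale of the MxN input image.

In short, the function corrects and filters all generated anchors. Note that the code follows the order in its own docstring: it clips the boxes to the image and removes boxes that are too small before sorting by score. Its main inputs are the rpn_bbox_pred and rpn_cls_prob output by the RPN. A few lines need some explanation: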
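The NMS step is the least obvious part of this filtering. The sketch below is a plain NumPy version of greedy non-maximum suppression, assuming boxes in [x1, y1, x2, y2] form with inclusive pixel coordinates; it is my own illustration and may differ in detail from the py_cpu_nms used in the repository.

```python
import numpy as np


def nms(boxes, scores, thresh):
    """Greedy NMS: keep the best box, drop boxes whose IoU with it exceeds thresh, repeat."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    order = scores.argsort()[::-1]          # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of box i with every remaining box
        w = np.maximum(0.0, np.minimum(x2[i], x2[rest]) - np.maximum(x1[i], x1[rest]) + 1)
        h = np.maximum(0.0, np.minimum(y2[i], y2[rest]) - np.maximum(y1[i], y1[rest]) + 1)
        inter = w * h
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= thresh]         # survivors go to the next round
    return keep


boxes = np.array([[0, 0, 99, 99], [10, 10, 109, 109], [200, 200, 299, 299]], dtype=np.float32)
scores = np.array([0.9, 0.8, 0.7], dtype=np.float32)
kept_strict = nms(boxes, scores, 0.5)
kept_loose = nms(boxes, scores, 0.7)
```

The first two boxes overlap with an IoU of about 0.68, so the second one is suppressed at a threshold of 0.5 but survives at 0.7.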

bbox_deltas = self._get_bbox_deltas(rpn_bbox_pred).data.cpu().numpy()

proposals_boxes = bbox_transform_inv(all_anchors, bbox_deltas)  # (H/16 * W/16 * 9, 4) all proposal boxes

scores = self._get_pos_score(rpn_cls_prob).data.cpu().numpy()

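_get_bbox_deltas simply reshapes rpn_bbox_pred to (H/16 * W/16 * 9, 4). bbox_transform_inv then combines bbox_deltas with the anchors to produce the corresponding predicted boxes (note that rpn_bbox_pred itself is only the [dx, dy, dw, dh] offsets, not the final proposals). _get_pos_score likewise just converts rpn_cls_prob into the foreground scores with shape (H/16 * W/16 * 9, 1).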
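How the deltas turn an anchor into a predicted box can be sketched as follows. This follows the convention commonly used in py-faster-rcnn style code (dx and dy shift the centre in units of the anchor size, dw and dh scale the size exponentially); apply_deltas is a hypothetical name, and the repository's bbox_transform_inv may differ in detail.

```python
import numpy as np


def apply_deltas(anchors, deltas):
    """anchors, deltas: (N, 4) -> predicted boxes (N, 4) in [x1, y1, x2, y2] form."""
    widths = anchors[:, 2] - anchors[:, 0] + 1.0
    heights = anchors[:, 3] - anchors[:, 1] + 1.0
    ctr_x = anchors[:, 0] + 0.5 * widths
    ctr_y = anchors[:, 1] + 0.5 * heights

    dx, dy, dw, dh = deltas[:, 0], deltas[:, 1], deltas[:, 2], deltas[:, 3]
    pred_ctr_x = dx * widths + ctr_x        # shift the centre
    pred_ctr_y = dy * heights + ctr_y
    pred_w = np.exp(dw) * widths            # rescale the size
    pred_h = np.exp(dh) * heights

    return np.stack([pred_ctr_x - 0.5 * pred_w, pred_ctr_y - 0.5 * pred_h,
                     pred_ctr_x + 0.5 * pred_w, pred_ctr_y + 0.5 * pred_h], axis=1)


anchor = np.array([[0.0, 0.0, 15.0, 15.0]])
unchanged = apply_deltas(anchor, np.zeros((1, 4)))
shifted = apply_deltas(anchor, np.array([[0.5, 0.0, 0.0, 0.0]]))
```

A dx of 0.5 moves the 16-pixel-wide anchor half its width (8 pixels) to the right while leaving its size untouched.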

ROI Pooling

class ROIpooling(nn.Module):

    def __init__(self, size=(7, 7), spatial_scale=1.0 / 16.0):
        super(ROIpooling, self).__init__()
        self.adapmax2d = nn.AdaptiveMaxPool2d(size)
        self.spatial_scale = spatial_scale

    def forward(self, features, rois_boxes):

        # rois_boxes : [x, y, x`, y`]

        if isinstance(rois_boxes, np.ndarray):
            rois_boxes = to_var(torch.from_numpy(rois_boxes))

        rois_boxes = rois_boxes.data.float().clone()
        rois_boxes.mul_(self.spatial_scale)
        rois_boxes = rois_boxes.long()

        output = []

        for i in range(rois_boxes.size(0)):
            roi = rois_boxes[i]

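            try:
                # crop rows y1..y2 and columns x1..x2 (inclusive) from the feature map
                roi_feature = features[:, :, roi[1]:(roi[3] + 1), roi[0]:(roi[2] + 1)]
            except Exception as e:
                # report the offending roi, then re-raise so that a stale or
                # undefined roi_feature is never pooled below
                print(e, roi)
                raise
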

            pool_feature = self.adapmax2d(roi_feature)
            output.append(pool_feature)

        return torch.cat(output, 0)

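ROI Pooling looks simple here, but not because the algorithm itself is simple. The layer has to turn a proposal of arbitrary size into a 7*7 output, which sounds like a fair amount of work, and in earlier implementations the ROI Pooling code was indeed long. PyTorch now ships this operation as nn.AdaptiveMaxPool2d, so here it is enough to call self.adapmax2d on each cropped region.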
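To show what the adaptive pooling does for a single proposal, here is a NumPy sketch for one image and one roi. It assumes the usual adaptive-pooling bin rule (bin i covers floor(i * n / out) up to ceil((i + 1) * n / out)); roi_max_pool is a hypothetical helper, not part of the repository.

```python
import numpy as np


def roi_max_pool(features, roi, out_size=(7, 7), spatial_scale=1.0 / 16.0):
    """features: (C, H, W); roi: (x1, y1, x2, y2) in image pixels -> (C, out_h, out_w)."""
    # Project the roi onto the feature map, truncating like .long() does
    x1, y1, x2, y2 = [int(v * spatial_scale) for v in roi]
    region = features[:, y1:y2 + 1, x1:x2 + 1]
    channels, height, width = region.shape
    out_h, out_w = out_size
    out = np.empty((channels, out_h, out_w), dtype=features.dtype)
    for i in range(out_h):
        r0 = (i * height) // out_h
        r1 = ((i + 1) * height + out_h - 1) // out_h      # integer ceil
        for j in range(out_w):
            c0 = (j * width) // out_w
            c1 = ((j + 1) * width + out_w - 1) // out_w
            out[:, i, j] = region[:, r0:r1, c0:c1].max(axis=(1, 2))
    return out


features = np.arange(2 * 40 * 40, dtype=np.float32).reshape(2, 40, 40)
pooled = roi_max_pool(features, (0, 0, 223, 223))
```

The 224-pixel roi maps to a 14x14 region of the feature map, so each of the 7x7 output cells is the maximum over a 2x2 bin.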

FasterRCNN

This code is simple, so there is not much to explain.

class FasterRcnn(nn.Module):

    def __init__(self):
        super(FasterRcnn, self).__init__()
        self.fc1 = nn.Sequential(nn.Linear(512 * 7 * 7, 4096),
                                 nn.ReLU(),
                                 nn.Dropout())

        self.fc2 = nn.Sequential(nn.Linear(4096, 4096),
                                 nn.ReLU(),
                                 nn.Dropout())

        # 20 classes + 1 background classifier per roi
        self.classfier = nn.Linear(4096, 21)
        # logits have shape (num_rois, 21), so normalize over dim 1
        self.softmax = nn.Softmax(dim=1)

        # 21 classes * 4 coordinate regressors per roi
        self.regressor = nn.Linear(4096, 21 * 4)

    def forward(self, features):

        features = features.view(-1, 512 * 7 * 7)
        features = self.fc1(features)
        features = self.fc2(features)

        # no try/except here: the old handler printed logits, which may be
        # undefined, and then fell through to return undefined values
        logits = self.classfier(features)
        scores = self.softmax(logits)
        bbox_delta = self.regressor(features)

        return bbox_delta, scores, logits
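The regressor outputs 21 * 4 values per roi, one box per class, so downstream code has to pick the four deltas that belong to the predicted class. A minimal sketch follows, assuming the 84 values are laid out as 21 consecutive groups of 4 (the common convention; the repository may order them differently). select_class_deltas is a hypothetical helper.

```python
import numpy as np


def select_class_deltas(scores, bbox_delta):
    """scores: (R, 21), bbox_delta: (R, 84) -> labels (R,), deltas (R, 4)."""
    num_rois, num_classes = scores.shape
    labels = scores.argmax(axis=1)                        # most likely class per roi
    per_class = bbox_delta.reshape(num_rois, num_classes, 4)
    return labels, per_class[np.arange(num_rois), labels]


scores = np.zeros((2, 21))
scores[0, 3] = 1.0      # roi 0 is predicted as class 3
scores[1, 0] = 1.0      # roi 1 is predicted as background (class 0)
bbox_delta = np.arange(2 * 84, dtype=np.float32).reshape(2, 84)
labels, deltas = select_class_deltas(scores, bbox_delta)
```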

Loss

The loss has two parts: rpn_loss and fasterRCNN_loss.

def rpn_loss(rpn_cls_prob, rpn_logits, rpn_bbox_pred, rpn_labels, rpn_bbox_targets, rpn_bbox_inside_weights):
    
    """
    Arguments:
        rpn_cls_prob (Tensor): (1, 2*9, H/16, W/16)
        rpn_logits (Tensor): (H/16 * W/16 * 9 , 2) object or non-object rpn_logits
        rpn_bbox_pred (Tensor): (1, 4*9, H/16, W/16) predicted boxes
        rpn_labels (Ndarray) : (H/16 * W/16 * 9 ,)
        rpn_bbox_targets (Ndarray) : (H/16 * W/16 * 9, 4)
        rpn_bbox_inside_weights (Ndarray) : (H/16 * W/16 * 9, 4) masking for only positive box loss
    Return:
        cls_loss (Scalar) : classification loss
        reg_loss * 10 (Scalar) : regression loss
    """

    height, width = rpn_cls_prob.size()[-2:]  # (H/16, W/16)
    rpn_cls_prob = rpn_cls_prob.squeeze(0).permute(1, 2, 0).contiguous()  # (1, 18, H/16, W/16) => (H/16 ,W/16, 18)
    rpn_cls_prob = rpn_cls_prob.view(-1, 2)  # (H/16 ,W/16, 18) => (H/16 * W/16 * 9, 2)

    rpn_labels = to_tensor(rpn_labels).long() # convert properly # (H/16 * W/16 * 9)

    #index where not -1
    idx = rpn_labels.ge(0).nonzero()[:, 0]
    rpn_cls_prob = rpn_cls_prob.index_select(0, to_var(idx))
    rpn_labels = rpn_labels.index_select(0, idx)
    rpn_logits = rpn_logits.squeeze().index_select(0, to_var(idx))

    positive_cnt = torch.sum(rpn_labels.eq(1))
    negative_cnt = torch.sum(rpn_labels.eq(0))

    rpn_labels = to_var(rpn_labels)

    cls_crit = nn.CrossEntropyLoss()
    cls_loss = cls_crit(rpn_logits, rpn_labels)

    rpn_bbox_targets = torch.from_numpy(rpn_bbox_targets)
    rpn_bbox_targets = rpn_bbox_targets.view(height, width, 36)  # (H/16 * W/16 * 9, 4)  => (H/16 ,W/16, 36)
    rpn_bbox_targets = rpn_bbox_targets.permute(2, 0, 1).contiguous().unsqueeze(0) # (H/16 ,W/16, 36) => (1, 36, H/16, W/16)
    rpn_bbox_targets = to_var(rpn_bbox_targets)

    rpn_bbox_inside_weights = torch.from_numpy(rpn_bbox_inside_weights)
    rpn_bbox_inside_weights = rpn_bbox_inside_weights.view(height, width, 36)  # (H/16 * W/16 * 9, 4)  => (H/16 ,W/16, 36)
    rpn_bbox_inside_weights = rpn_bbox_inside_weights.permute(2, 0, 1).contiguous().unsqueeze(0) # (H/16 ,W/16, 36) => (1, 36, H/16, W/16)

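    if torch.cuda.is_available():
        rpn_bbox_inside_weights = rpn_bbox_inside_weights.cuda()
    rpn_bbox_inside_weights = to_var(rpn_bbox_inside_weights)

    # Mask both tensors so only positive anchors contribute. Multiply the
    # prediction itself (not its .data) so gradients still reach the RPN.
    rpn_bbox_pred = torch.mul(rpn_bbox_pred, rpn_bbox_inside_weights)
    rpn_bbox_targets = torch.mul(rpn_bbox_targets, rpn_bbox_inside_weights)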

    reg_loss = F.smooth_l1_loss(rpn_bbox_pred, rpn_bbox_targets, size_average = False) / (positive_cnt + 1e-4)

    return cls_loss, reg_loss * 10

First, note that the two arguments rpn_bbox_targets and rpn_bbox_inside_weights come from the anchor_target function.
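The anchor_target function is not shown in this post. As a rough idea of where the labels 1, 0 and -1 come from, here is a sketch of the labelling rule from the Faster RCNN paper, assuming the IoU between every anchor and every ground-truth box is already computed. The 0.7 and 0.3 thresholds are the paper's defaults; label_anchors is a hypothetical helper, and the real function also subsamples anchors and computes the regression targets.

```python
import numpy as np


def label_anchors(ious, pos_thresh=0.7, neg_thresh=0.3):
    """ious: (num_anchors, num_gt) -> labels (num_anchors,) with 1 = fg, 0 = bg, -1 = ignore."""
    max_iou = ious.max(axis=1)
    labels = np.full(ious.shape[0], -1, dtype=np.int64)   # don't care by default
    labels[max_iou < neg_thresh] = 0                      # clearly background
    labels[ious.argmax(axis=0)] = 1                       # best anchor for each gt box
    labels[max_iou >= pos_thresh] = 1                     # high-overlap anchors
    return labels


ious = np.array([[0.8, 0.1],
                 [0.5, 0.4],
                 [0.1, 0.2],
                 [0.2, 0.6]])
labels = label_anchors(ious)
```

Anchor 1 overlaps both boxes moderately, so it stays at -1 and is skipped by the loss; anchor 3 becomes positive only because it is the best match for the second ground-truth box.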

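rpn_loss works as follows:

Keep only the anchors whose label is not -1 (-1 marks a don't care area).
Count the foreground and background anchors.
Classification uses CrossEntropyLoss (note that CrossEntropyLoss already includes the log softmax, so it takes the logits as its argument).
Regression uses smooth_l1_loss.
Return both losses, with the regression loss scaled by 10.

frcnn_loss is similar, so I will not repeat it here.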
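The regression term of rpn_loss can be checked by hand with a small NumPy sketch. It assumes the standard smooth L1 definition with a transition point of 1 (0.5 * d^2 for |d| < 1, |d| - 0.5 otherwise) and applies the inside weights as a mask; smooth_l1_sum is a hypothetical helper, not the repository's code.

```python
import numpy as np


def smooth_l1_sum(pred, target, inside_weights):
    """Summed smooth L1 loss over the entries selected by inside_weights."""
    diff = np.abs(inside_weights * (pred - target))       # masked absolute error
    loss = np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5)
    return loss.sum()


pred = np.array([[0.5, 0.0, 2.0, 0.0],      # a positive anchor
                 [3.0, 3.0, 3.0, 3.0]])     # a background anchor, masked out
target = np.zeros((2, 4))
inside_weights = np.array([[1.0, 1.0, 1.0, 1.0],
                           [0.0, 0.0, 0.0, 0.0]])
positive_cnt = 1

total = smooth_l1_sum(pred, target, inside_weights)
reg_loss = total / (positive_cnt + 1e-4)    # normalize by the number of positives
```

Only the first row contributes: 0.5 * 0.5^2 + (2.0 - 0.5) = 1.625, and the background row adds nothing because its weights are zero.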