Anchor Frame Calibration and Spatial Position Information Compensation for Street Scene Video Instance Segmentation

ZHANG Yin-hui; ZHAO Chong-ren; HE Zi-fen; YANG Hong-kuan; HUANG Ying

doi:10.12263/DZXB.20220885

PAPERS | Updated: 2026-04-10

- ACTA ELECTRONICA SINICA Vol. 52, Issue 1, Pages: 94-106(2024)
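- Affiliation: Faculty of Mechanical and Electrical Engineering, Kunming University of Science and Technology, Kunming, Yunnan 650500, China
- Funding: National Natural Science Foundation of China (62061022; 62171206)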
- DOI: 10.12263/DZXB.20220885
  CLC: TP391
- Received: 25 July 2022; Revised: 10 March 2023; Published: 25 January 2024
ZHANG Yin-hui, ZHAO Chong-ren, HE Zi-fen, et al. Anchor Frame Calibration and Spatial Position Information Compensation for Street Scene Video Instance Segmentation[J]. Acta Electronica Sinica, 2024, 52(01): 94-106. DOI: 10.12263/DZXB.20220885

Abstract

Street scene video instance segmentation is one of the key problems in self-driving research because it supports decision-making for vehicle environment perception and path planning in street scenes. Existing methods have two shortcomings: anchor frames with multiple aspect ratios are sampled with a single receptive field, which leaves edge features insufficiently extracted, and the high levels of the feature pyramid lack detailed spatial position information. To address these problems, we propose AS-VIS, a network with anchor frame calibration and spatial position information compensation for video instance segmentation. First, an anchor frame calibration module is added to the three branches of the prediction head so that multi-type receptive field sampling matches the aspect ratio of each anchor frame, which addresses the insufficient extraction of object edges. Second, a multi-receptive-field subsampling module is designed to fuse the features sampled with various receptive fields, which reduces the information lost during down-sampling. Finally, the multi-receptive-field subsampling module is used to embed the activated feature maps of target regions from the low levels of the feature pyramid into the high levels, which compensates for the lack of detailed spatial position information in high-level features. A street scene video dataset is extracted from the Youtube-VIS benchmark, with 329 videos in the training set and 53 videos in the validation set. Quantitative comparison with YolactEdge on detection and segmentation accuracy shows that anchor frame calibration improves average precision by 8.63% and 5.09%, respectively; the spatial position information compensation feature pyramid improves it by 7.76% and 4.75%; and AS-VIS as a whole improves it by 9.26% and 6.46%. The proposed method performs instance-level detection, tracking, and segmentation simultaneously on street scene video sequences and provides an effective theoretical basis for environment perception in self-driving vehicles.
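
The idea of fusing features sampled with several receptive fields during down-sampling can be illustrated with a minimal pure-Python sketch. It is not the authors' module: the abstract does not specify the window sizes, the sampling operator, or the fusion rule, so the window sizes (1, 3, 5), stride-2 mean pooling, and element-wise averaging used here are all assumptions, as are the function names `pooled_downsample` and `multi_rf_downsample`.

```python
from typing import List, Sequence

Grid = List[List[float]]


def pooled_downsample(feature: Grid, window: int, stride: int = 2) -> Grid:
    """Mean-pool a square window centred on every `stride`-th position.

    Cells that fall outside the map are ignored, so border outputs
    average only the valid cells. `window` is assumed to be odd.
    """
    height, width = len(feature), len(feature[0])
    half = window // 2
    out: Grid = []
    for r in range(0, height, stride):
        row = []
        for c in range(0, width, stride):
            cells = [
                feature[i][j]
                for i in range(max(0, r - half), min(height, r + half + 1))
                for j in range(max(0, c - half), min(width, c + half + 1))
            ]
            row.append(sum(cells) / len(cells))
        out.append(row)
    return out


def multi_rf_downsample(feature: Grid, windows: Sequence[int] = (1, 3, 5)) -> Grid:
    """Down-sample with several receptive fields and average the results.

    Window 1 is plain strided subsampling; larger windows see more context.
    Averaging is an assumed fusion rule chosen for simplicity.
    """
    branches = [pooled_downsample(feature, w) for w in windows]
    rows, cols = len(branches[0]), len(branches[0][0])
    return [
        [sum(b[r][c] for b in branches) / len(branches) for c in range(cols)]
        for r in range(rows)
    ]


if __name__ == "__main__":
    # Toy 4x4 feature map; the fused output is 2x2.
    fmap = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
    for row in multi_rf_downsample(fmap):
        print(row)
```

With a single window of size 1 the function reduces to ordinary strided subsampling, which keeps only one cell in four; the larger windows let each output cell also depend on its neighbours, which is the intuition behind losing less information than traditional down-sampling.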


