Binocular Vision Localization and Measurement Based on Star-RTMPose

ZHANG Meng-quan; XU Si-xiang; YANG Yu; WU Duan-zheng

doi:10.12263/DZXB.20250422


Research article | Updated: 2026-04-24

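- Acta Electronica Sinica, 2025, Vol. 53, No. 12, pp. 4317-4329
- Author affiliation:
  School of Mechanical Engineering, Anhui University of Technology, Ma'anshan, Anhui 243032, China
- About the authors:
  ZHANG Meng-quan, male, born in April 2001 in Suzhou, Anhui Province. He is a master's student at the School of Mechanical Engineering, Anhui University of Technology. His main research interests are robotics and machine vision. E-mail: 2992466836@qq.com
  XU Si-xiang, male, born in June 1974 in Hanchuan, Hubei Province. He is a professor and master's supervisor at the School of Mechanical Engineering, Anhui University of Technology. His main research interests are robotics and machine vision. E-mail: xsxhust@ahut.edu.cn
  YANG Yu, male, born in November 2001 in Anqing, Anhui Province. He is a master's student at the School of Mechanical Engineering, Anhui University of Technology. His main research interests are robotics and machine vision. E-mail: 1308889562@qq.com
  WU Duan-zheng, male, born in October 2000 in Bengbu, Anhui Province. He is a master's student at the School of Mechanical Engineering, Anhui University of Technology. His main research interests are robotics and machine vision. E-mail: 3251387273@qq.com
- Funding:
  National Natural Science Foundation of China (51374007); Key Natural Science Research Project of Anhui Higher Education Institutions (KJ2020A0259)
- DOI: 10.12263/DZXB.20250422
  CLC number: TP391.41
- Received: 2025-05-26; accepted: 2025-12-01; published in print: 2025-12-25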
ZHANG Meng-quan, XU Si-xiang, YANG Yu, et al. Binocular Vision Localization and Measurement Based on Star-RTMPose[J]. Acta Electronica Sinica, 2025, 53(12): 4317-4329. DOI: 10.12263/DZXB.20250422

Abstract

To address the low efficiency, insufficient matching accuracy, sensitivity to illumination changes, and complex parameter tuning of traditional binocular-vision feature point detection algorithms, which limit the accuracy of binocular localization and measurement, this paper proposes a binocular vision localization and measurement method based on Star-RTMPose (Star-enhanced Real-Time Multi-person Pose estimation). Taking continuous casting billets in the iron and steel industry as the research object, the method targets the precise localization and dimension measurement required for burr removal after flame cutting, and a corresponding technical implementation path is given. First, images of continuous casting billets are acquired with a calibrated binocular camera; target regions and keypoints are annotated with LabelMe, and the annotations are converted to the MS COCO (Microsoft Common Objects in Context) format for model training. Next, a two-stage "object detection, then keypoint extraction" framework is adopted: RTMDet (Real-Time Models for object Detection) first locates the main body of the billet quickly, and Star-RTMPose, an improved model based on RTMPose (Real-Time Multi-person Pose estimation), then extracts the keypoint coordinates. Three improvements are made. (1) A StarTriBlock (Star Triple Block) module is introduced into the RTMPose backbone; its multi-branch dynamic fusion mechanism strengthens the network's representation of high-level semantic features and makes full use of the largest receptive field and the global spatial correlation information available at that stage. (2) The 7×7 large-kernel convolution in the network head is replaced with a MaxDSC2 (Maximum Depthwise Separable Convolution 2) module built on depthwise separable convolution, with the module's intermediate channel count set to 0.45 times the input channel count, which improves sensitivity to semantic information while reducing the parameter count. (3) The conventional channel attention module is replaced with the parameter-free SimAM (Simple parameter-free Attention Module), which generates joint channel-spatial three-dimensional weights from the closed-form solution of an energy function, strengthening the capture of spatial features and avoiding parameter redundancy. Finally, the keypoints are reconstructed in three dimensions and the billet dimensions are measured by combining the binocular camera calibration parameters with the triangulation principle. Experimental results show that, in the keypoint detection task, Star-RTMPose takes only 9.86 ms to infer a single image; compared with the baseline RTMPose-T, its AP (Average Precision) improves by 1.09 percentage points, its PCK (Percentage of Correct Keypoints) improves by 0.40 percentage points, and its NME (Normalized Mean Error) decreases by 42.86%. With fewer parameters, the improved model's overall performance is significantly better than that of mainstream models such as HRNet-W32 and SwinTransformer-T. In terms of three-dimensional measurement accuracy, the relative error of the proposed method in measuring the long-side dimension of the Type 1 continuous casting billet is 1.715 and 0.365 percentage points lower than that of the traditional ORB (Oriented FAST and Rotated BRIEF) algorithm and the improved FAST (Features from Accelerated Segment Test) algorithm, respectively. The proposed method effectively overcomes the poor robustness of traditional algorithms, improves both detection accuracy and measurement accuracy, and meets the demand for high-precision detection in industrial scenarios.
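The annotation step in the abstract (LabelMe labels converted to MS COCO format) can be illustrated with a minimal sketch. It assumes each LabelMe record holds one `rectangle` shape for the billet and one `point` shape per keypoint; the label names `billet`, `corner_a` and `corner_b` and the function name are hypothetical and not taken from the paper.

```python
def labelme_to_coco_annotation(record, keypoint_names, image_id=1, ann_id=1, category_id=1):
    """Convert one parsed LabelMe record into a COCO keypoint annotation dict."""
    bbox = None
    points = {}
    for shape in record["shapes"]:
        if shape["shape_type"] == "rectangle":
            (x1, y1), (x2, y2) = shape["points"]
            # COCO boxes are [left, top, width, height]
            bbox = [min(x1, x2), min(y1, y2), abs(x2 - x1), abs(y2 - y1)]
        elif shape["shape_type"] == "point":
            points[shape["label"]] = shape["points"][0]
    if bbox is None:
        raise ValueError("record has no rectangle shape for the target region")

    keypoints = []
    for name in keypoint_names:
        if name in points:
            x, y = points[name]
            keypoints += [x, y, 2]  # visibility 2: labelled and visible
        else:
            keypoints += [0, 0, 0]  # visibility 0: not labelled
    return {
        "id": ann_id,
        "image_id": image_id,
        "category_id": category_id,
        "bbox": bbox,
        "area": bbox[2] * bbox[3],
        "iscrowd": 0,
        "keypoints": keypoints,
        "num_keypoints": sum(1 for v in keypoints[2::3] if v > 0),
    }


# Hypothetical record, shaped like the "shapes" list in a LabelMe JSON file
sample = {
    "shapes": [
        {"label": "billet", "shape_type": "rectangle", "points": [[10, 20], [110, 70]]},
        {"label": "corner_a", "shape_type": "point", "points": [[12, 22]]},
    ]
}
annotation = labelme_to_coco_annotation(sample, ["corner_a", "corner_b"])
```

A full COCO file would also need the `images` and `categories` sections; only the per-object annotation is sketched here.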
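The parameter saving from replacing a 7×7 convolution with depthwise separable convolution can be checked with simple arithmetic. The sketch below counts weights only (biases ignored) for generic layers; the internal layout of MaxDSC2 is not described in the abstract, and the channel count of 256 is an assumed value chosen for illustration.

```python
def standard_conv_params(c_in, c_out, kernel):
    """Weights in a standard kernel x kernel convolution."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(c_in, c_out, kernel):
    """Weights in a depthwise kernel x kernel conv followed by a 1x1 pointwise conv."""
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


channels = 256  # assumed head width, not a value from the paper
standard = standard_conv_params(channels, channels, 7)
separable = depthwise_separable_params(channels, channels, 7)
ratio = separable / standard

# Intermediate width implied by the 0.45 ratio stated in the abstract
mid_channels = int(0.45 * channels)
```

For equal input and output widths C, the ratio is 1/49 + 1/C, so most of the saving comes from removing the dense 7×7 kernel.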
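The closed-form weighting that SimAM uses can be written in a few lines. The sketch below follows the published SimAM formulation (inverse minimal energy passed through a sigmoid) on a feature map stored as nested lists `[C][H][W]`; the value `e_lambda=1e-4` is the default in the public SimAM reference code, not a setting reported in this abstract, and how the module is wired into Star-RTMPose is not shown.

```python
import math


def simam(feature_map, e_lambda=1e-4):
    """Apply SimAM-style attention to a [C][H][W] nested-list feature map."""
    output = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        n = len(values) - 1  # number of other neurons in the same channel
        if n < 1:
            raise ValueError("each channel needs at least two spatial positions")
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / n

        out_channel = []
        for row in channel:
            out_row = []
            for v in row:
                # Inverse of the minimal energy: larger for more distinctive neurons
                inv_energy = (v - mean) ** 2 / (4.0 * (variance + e_lambda)) + 0.5
                weight = 1.0 / (1.0 + math.exp(-inv_energy))  # sigmoid
                out_row.append(v * weight)
            out_channel.append(out_row)
        output.append(out_channel)
    return output


# One channel with a single outlier at the bottom-right position
features = [[[1.0, 1.0], [1.0, 5.0]]]
attended = simam(features)
```

No learnable weights appear anywhere in the function, which is what "parameter-free" refers to: every neuron gets its own weight from channel statistics alone.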
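The final step, recovering 3D keypoints by triangulation and measuring a dimension, can be sketched for the simplest case. The code assumes an ideal rectified stereo pair (shared focal length and principal point, horizontal baseline), which is a simplification and not necessarily the camera model used in the paper; all numbers are illustrative and are not measurements from the paper.

```python
import math


def triangulate_rectified(u_left, v_left, u_right, focal_px, baseline_mm, cx, cy):
    """Back-project one matched keypoint to (X, Y, Z) in mm, left-camera frame."""
    disparity = u_left - u_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    z = focal_px * baseline_mm / disparity
    x = (u_left - cx) * z / focal_px
    y = (v_left - cy) * z / focal_px
    return (x, y, z)


def relative_error_percent(measured, reference):
    """Relative measurement error as a percentage of the reference value."""
    return abs(measured - reference) / reference * 100.0


# Illustrative calibration values (assumed)
focal_px, baseline_mm, cx, cy = 1000.0, 100.0, 640.0, 360.0

# Two keypoints at the ends of one edge, each matched in the left and right images
p1 = triangulate_rectified(540.0, 360.0, 440.0, focal_px, baseline_mm, cx, cy)
p2 = triangulate_rectified(740.0, 360.0, 640.0, focal_px, baseline_mm, cx, cy)

edge_length_mm = math.dist(p1, p2)
error_percent = relative_error_percent(edge_length_mm, 198.0)  # 198 mm is a made-up reference
```

Because the keypoint detector supplies the same semantic point in both images, no descriptor matching is needed before triangulation.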
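The reported metrics mix two kinds of change: AP and PCK improve in percentage points, while NME falls by a relative percentage. A minimal sketch of PCK and NME under common definitions is given below; the normalization length, the 0.05 threshold, and all coordinates are assumptions for illustration, since the abstract does not state the exact evaluation protocol.

```python
import math


def normalized_errors(pred, gt, norm):
    """Per-keypoint Euclidean error divided by a normalization length."""
    return [math.dist(p, g) / norm for p, g in zip(pred, gt)]


def nme(pred, gt, norm):
    """Normalized mean error: average of the normalized per-keypoint errors."""
    errors = normalized_errors(pred, gt, norm)
    return sum(errors) / len(errors)


def pck(pred, gt, norm, threshold=0.05):
    """Fraction of keypoints whose normalized error is within the threshold."""
    errors = normalized_errors(pred, gt, norm)
    return sum(1 for e in errors if e <= threshold) / len(errors)


def relative_reduction_percent(old, new):
    """Relative decrease from old to new, in percent (not percentage points)."""
    return (old - new) / old * 100.0


# Made-up predictions and ground truth, in pixels
gt = [(0, 0), (10, 0), (0, 10), (10, 10)]
pred = [(0, 0), (10, 0), (0, 10), (10, 13)]
norm = 100.0  # e.g. a bounding-box size, assumed

nme_value = nme(pred, gt, norm)
pck_value = pck(pred, gt, norm, threshold=0.02)
drop = relative_reduction_percent(0.10, 0.08)
```

With these definitions, a fall in NME from 0.10 to 0.08 is a 20% relative reduction, whereas a rise in PCK from 0.75 to 0.76 would be 1 percentage point.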


