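1. School of Information Science and Technology, Beijing University of Technology, Beijing 100124, China
2. Beijing Key Laboratory of Computational Intelligence and Intelligent System, Beijing University of Technology, Beijing 100124, China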
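LI Wensheng, male, born in February 1994 in Zibo, Shandong. He is a Ph.D. candidate at Beijing University of Technology. His main research interest is video understanding. E-mail: liwensheng@emails.bjut.edu.cn
ZHANG Jing, female, born in February 1975 in Meixian, Guangdong. She holds a Ph.D. and is a professor and doctoral supervisor at Beijing University of Technology. Her main research interests include artificial intelligence and computer vision. E-mail: zhj@bjut.edu.cn
WANG Yixiao, female, born in June 2002 in Anyang, Henan. She is a master's student at Beijing University of Technology. Her main research interest is video understanding. E-mail: wyx0504@emails.bjut.edu.cn
ZHUO Li, female, born in October 1971 in Xuzhou, Jiangsu. She holds a Ph.D. and is a professor and doctoral supervisor at Beijing University of Technology. Her main research interests include artificial intelligence and computer vision. E-mail: zhuoli@bjut.edu.cn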
Received: 2025-07-02
Accepted: 2026-02-24
Published in print: 2026-02-25
LI Wensheng, ZHANG Jing, WANG Yixiao, et al. Scene Graph Generation of Livestreaming Video via VLM Convex Optimization[J]. Acta Electronica Sinica, 2026, 54(02): 544-561. DOI:10.12263/DZXB.20250586
Livestreaming video platforms, with their large streamer communities, massive content supply, and extremely high daily active user counts, have become a core medium for digital content dissemination, social interaction, and commercial conversion. However, the real-time and unpredictable nature of livestreaming content poses serious challenges for online content regulation. A video scene graph is a structured representation that describes the objects, attributes, and behavioral relationships in a video; by constructing an "object-relation-action" semantic network over the spatiotemporal dimensions, it enables structured modeling of video content. In recent years, vision-language models (VLMs) have shown marked advantages in cross-modal semantic understanding and complex scene reasoning, providing new technical support for livestreaming video scene graph generation. Although VLMs can improve semantic parsing accuracy in complex livestreaming scenes, they must still overcome a bottleneck: the feature distribution patterns of livestreaming videos are difficult to capture. Because convex optimization is crucial for driving a model to converge to a global optimum during VLM training, this paper proposes VLM-based Convex Optimization for Scene Graph Generation (VCO-SGG). The method constructs an approximately convex optimization architecture for the VLM that optimizes the geometric structure of the feature space of object semantics and their relationships, reducing feature distribution discrepancies and mitigating convergence oscillations during training. A dynamic prototype memory module is also built, which uses a parametric memory mechanism to strengthen the memory of the continuity and correlation of key semantic elements across video frames. In addition, a feature association and relation filtering strategy is proposed to identify and filter, online, the redundant object indices that dynamic changes introduce into the scene graph, enabling dynamic generation and updating of the scene graph. Experimental results show that, on the self-built livestreaming video dataset BJUT-LGSD, the method raises R@10 and mR@10 to 55.41% and 34.82%, respectively. On the public datasets Mini Charades and Mini Action Genome, R@10/mR@10 reach 48.19%/28.02% and 43.42%/26.02%, respectively. Inference speed is maintained at 22.36 FPS. The method is more competitive than the compared existing methods, indicating that it is well suited to scene graph generation for livestreaming videos.
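The abstract reports results as R@10 and mR@10. As a reading aid, the sketch below shows how Recall@K and mean Recall@K are commonly computed for scene graph triplets. It is a simplified single-frame illustration and not the authors' evaluation code: the function name `recall_at_k`, the triplet encoding, and the example labels are assumptions, and benchmark protocols typically accumulate per-predicate recall over the whole dataset rather than over one frame.

```python
from collections import defaultdict


def recall_at_k(gt_triplets, ranked_predictions, k=10):
    """Return (R@K, mR@K) for a single frame (hypothetical helper).

    gt_triplets: set of (subject, predicate, object) ground-truth triplets.
    ranked_predictions: list of triplets sorted by descending confidence.
    """
    top_k = set(ranked_predictions[:k])
    hits = gt_triplets & top_k

    # R@K: fraction of ground-truth triplets recovered in the top K predictions.
    r_at_k = len(hits) / len(gt_triplets) if gt_triplets else 0.0

    # mR@K: recall computed per predicate class, then averaged, so that
    # rare predicates weigh as much as frequent ones.
    total = defaultdict(int)
    found = defaultdict(int)
    for triplet in gt_triplets:
        predicate = triplet[1]
        total[predicate] += 1
        if triplet in hits:
            found[predicate] += 1
    per_predicate = [found[p] / total[p] for p in total]
    mr_at_k = sum(per_predicate) / len(per_predicate) if per_predicate else 0.0
    return r_at_k, mr_at_k


# Illustrative labels only; they are not taken from BJUT-LGSD.
gt = {
    ("person", "holding", "phone"),
    ("person", "holding", "cup"),
    ("person", "holding", "book"),
    ("person", "sitting_on", "chair"),
}
preds = [
    ("person", "holding", "phone"),
    ("person", "holding", "cup"),
    ("person", "touching", "table"),
]
r, mr = recall_at_k(gt, preds, k=3)
```

In this example two of the four ground-truth triplets are recovered, so R@3 is 0.5, while mR@3 is lower (about 0.33) because the `sitting_on` predicate is missed entirely. This gap is why mR@K is usually reported alongside R@K: it exposes weak performance on infrequent relations.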