Scene Graph Generation of Livestreaming Video via VLM Convex Optimization

LI Wensheng; ZHANG Jing; WANG Yixiao; ZHUO Li

doi:10.12263/DZXB.20250586

Research paper | Updated: 2026-05-28

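- Acta Electronica Sinica, 2026, Vol. 54, No. 2, pp. 544-561
- Author affiliations:
  
  1. School of Information Science and Technology, Beijing University of Technology, Beijing 100124
  2. Beijing Key Laboratory of Computational Intelligence and Intelligent System, Beijing University of Technology, Beijing 100124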
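- About the authors:
  
  LI Wensheng, male, born in February 1994 in Zibo, Shandong. He is currently a Ph.D. candidate at Beijing University of Technology. His main research interest is video understanding. E-mail: liwensheng@emails.bjut.edu.cn
  ZHANG Jing, female, born in February 1975 in Meixian, Guangdong. Ph.D., currently a professor and doctoral supervisor at Beijing University of Technology. Her main research interests include artificial intelligence and computer vision. E-mail: zhj@bjut.edu.cn
  WANG Yixiao, female, born in June 2002 in Anyang, Henan. She is currently a master's student at Beijing University of Technology. Her main research interest is video understanding. E-mail: wyx0504@emails.bjut.edu.cn
  ZHUO Li, female, born in October 1971 in Xuzhou, Jiangsu. Ph.D., currently a professor and doctoral supervisor at Beijing University of Technology. Her main research interests include artificial intelligence and computer vision. E-mail: zhuoli@bjut.edu.cn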
- Funding:
  
  National Natural Science Foundation of China (61971016; 62471013); Beijing Natural Science Foundation (KZ201910005007)
- DOI: 10.12263/DZXB.20250586
  CLC number: TP391
- Received: 2025-07-02; Accepted: 2026-02-24; Published in print: 2026-02-25

LI Wensheng, ZHANG Jing, WANG Yixiao, et al. Scene Graph Generation of Livestreaming Video via VLM Convex Optimization[J]. Acta Electronica Sinica, 2026, 54(02): 544-561. DOI: 10.12263/DZXB.20250586

Abstract

Livestreaming video platforms have become a core medium for digital content dissemination, social interaction, and commercial activities, owing to their large number of streamers, massive content supply, and extremely high daily active user base. However, the real-time and unpredictable nature of livestreaming content poses serious challenges for online content supervision and regulation. Video scene graphs provide a structured representation for video understanding: they describe the objects, attributes, and behavioral relationships within videos. By constructing an "object-relation-action" semantic network in the spatiotemporal domain, video scene graphs enable structured modeling of video content. In recent years, vision-language models (VLMs) have shown strong capabilities in cross-modal semantic understanding and complex scene reasoning, which provides new technical support for livestreaming video scene graph generation. Although VLMs can improve semantic parsing accuracy in complex livestreaming scenarios, they still face a bottleneck: the feature distribution patterns of livestreaming videos are difficult to capture. Because convex optimization plays an important role in driving a model to converge toward a global optimum during VLM training, this paper proposes a VLM-based convex optimization method for scene graph generation (VCO-SGG). The method constructs a VLM-based approximately convex optimization framework that optimizes the geometric structure of the feature space for object semantics and their relationships, reducing feature distribution discrepancies and mitigating convergence oscillations during VLM training. A dynamic prototypical memory module is also introduced, which employs a parametric memory mechanism to strengthen the memory of the continuity and correlations of key semantic elements across video frames. Furthermore, a feature association and relation filtering strategy is proposed to identify and filter, online, the redundant object indices that dynamic changes introduce into the scene graph, thereby enabling dynamic generation and updating of the scene graph. Experimental results show that, on the self-built livestreaming video dataset BJUT-LGSD, the method raises R@10 and mR@10 to 55.41% and 34.82%, respectively. On the public datasets Mini Charades and Mini Action Genome, R@10/mR@10 reach 48.19%/28.02% and 43.42%/26.02%, respectively, and the inference speed is maintained at 22.36 FPS. Overall, the method is more competitive than the compared methods, indicating that it can handle scene graph generation for livestreaming videos.
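The abstract reports R@10 and mR@10 without restating their definitions. The following is a minimal sketch of how these metrics are commonly computed in the scene graph generation literature, not the authors' evaluation code. The function names, the single-frame setting, and the toy triplets are assumptions made for illustration; the paper's exact protocol (for example, how results are aggregated over frames and videos) is not given in this abstract.

```python
from collections import defaultdict

def top_k_triplets(preds, k):
    """Return the k highest-scoring (subject, predicate, object) triplets."""
    ranked = sorted(preds, key=lambda t: t[3], reverse=True)[:k]
    return {t[:3] for t in ranked}

def recall_at_k(preds, gt, k=10):
    """R@K: fraction of ground-truth triplets found among the top-K predictions."""
    if not gt:
        return 0.0
    return len(top_k_triplets(preds, k) & set(gt)) / len(gt)

def mean_recall_at_k(preds, gt, k=10):
    """mR@K: recall computed separately per predicate class, then averaged."""
    top_k = top_k_triplets(preds, k)
    total, hit = defaultdict(int), defaultdict(int)
    for triplet in gt:
        predicate = triplet[1]
        total[predicate] += 1
        if triplet in top_k:
            hit[predicate] += 1
    recalls = [hit[p] / n for p, n in total.items()]
    return sum(recalls) / len(recalls) if recalls else 0.0

# Hypothetical single-frame example (not data from the paper).
gt = {
    ("person", "holding", "phone"),
    ("person", "holding", "cup"),
    ("person", "sitting_on", "chair"),
    ("person", "looking_at", "phone"),
}
preds = [
    ("person", "holding", "phone", 0.90),
    ("person", "holding", "cup", 0.85),
    ("person", "looking_at", "chair", 0.80),
    ("person", "sitting_on", "chair", 0.30),
    ("person", "looking_at", "phone", 0.20),
]

print(recall_at_k(preds, gt, k=3))       # 2 of 4 ground-truth triplets recovered
print(mean_recall_at_k(preds, gt, k=3))  # only the "holding" class is recovered
```

In this toy case R@3 is 0.5 while mR@3 is 1/3, because the frequent predicate "holding" dominates the top-ranked predictions. This gap is why mR@K is usually reported alongside R@K: it exposes weak performance on less frequent relation classes.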
