Acta Electronica Sinica (电子学报), 2020, Vol. 48, Issue (5): 878-890. DOI: 10.3969/j.issn.0372-2112.2020.05.007

• Research Paper •

一种基于动态拓扑的流计算性能优化方法及其在Storm中的实现

陆佳炜1, 吴涵1, 陈烘2, 张元鸣1, 梁倩卉3, 肖刚1   

  1. 浙江工业大学计算机科学与技术学院, 浙江杭州 310023;
  2. 阿里巴巴基础架构事业部大数据计算与服务团队, 浙江杭州 310011;
  3. 南洋理工大学计算机科学与工程学院, 新加坡 637457
  • 收稿日期:2019-07-02 修回日期:2019-11-13 出版日期:2020-05-25 发布日期:2020-05-25
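  • Corresponding author: XIAO Gang
  • About the authors: LU Jia-wei, male, born in September 1981 in Huzhou, Zhejiang; lecturer; research interests include big data, cloud computing, and service computing. E-mail: viivan@zjut.edu.cn. WU Han, male, born in February 1995 in Shunchang, Fujian; master's student; research interests include big data and cloud computing. E-mail: wuhan@zjut.edu.cn
  • Supported by:
    National Natural Science Foundation of China (No.61976193); Natural Science Foundation of Zhejiang Province (No.LY19F020034); Key Research and Development Program of Zhejiang Province (No.2018C01064)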

A Performance Optimization Method Based on Dynamic Topology for Stream Computing and Its Implementation in Storm

LU Jia-wei1, WU Han1, CHEN Hong2, ZHANG Yuan-ming1, LIANG Qian-hui3, XIAO Gang1   

  1. Department of Computer Science and Technology, Zhejiang University of Technology, Hangzhou, Zhejiang 310023, China;
  2. Team of Big Data Computing and Service, Department of Infrastructure Business, Alibaba, Hangzhou, Zhejiang 310011, China;
  3. School of Computer Science and Engineering, Nanyang Technological University, Singapore 637457, Singapore
  • Received: 2019-07-02; Revised: 2019-11-13; Online: 2020-05-25; Published: 2020-05-25

摘要: 响应性和稳定性一直是流式计算中两个至关重要的问题,而流计算系统在过载时常常表现出数据计算延迟增加和拓扑不稳定的现象,无法适应数据负载的动态变化.针对这一问题本文研究提出了一种基于动态拓扑的流计算性能优化方法,主要包括:(1)动态逐级反压:拓扑中的任务可以根据当前自身负载情况,动态调整上游向其发送数据的速率.(2)无状态拓扑数据重放:拓扑不维持数据的计算状态,尽可能地实现数据容错.(3)自适应拓扑替换:在拓扑不暂停的情况下对任务并发度进行自发调整.(4)延迟持久化队列:拓扑中对磁盘的IO读写被延迟到数据处理之外,减缓IO高频阻塞对流计算系统的影响.本文在Apache Storm中实现了以上四种方案,性能测试结果表明优化后的流计算系统与Storm默认实现相比,不仅增强了大数据动态匹配能力,而且在最优情况下改善了17%的吞吐量,并提升了约20%的数据处理速度.

关键词: 数据流拓扑, 流计算, 大数据, 流计算系统, 性能优化

Abstract: Responsiveness and stability have always been two critical concerns in stream computing. When overloaded, however, stream computing systems often exhibit increased processing latency and topology instability, and cannot adapt to dynamic changes in data load. To address this problem, we propose a performance optimization method for stream computing based on dynamic topology, comprising four techniques: (1) Dynamic step-by-step backpressure: a task in the topology dynamically adjusts the rate at which its upstream tasks send data to it, according to its own current load. (2) Stateless topology data replay: the topology achieves data fault tolerance as far as possible without maintaining the computation state of the data. (3) Adaptive topology replacement: the system adjusts task concurrency on its own without suspending the topology. (4) Delayed persistent queue: disk I/O reads and writes are deferred outside the data processing path, which mitigates the impact of frequent I/O blocking on the stream computing system. We implement the four techniques in Apache Storm. Performance tests show that, compared with the default Storm implementation, the optimized system not only enhances its ability to dynamically match big data loads, but also improves throughput by 17% and data processing speed by about 20% in the best case.

Key words: data stream topology, stream computing, big data, stream computing system, performance optimization
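
As a rough illustration of the first technique in the abstract, dynamic step-by-step backpressure, the following minimal Python sketch models a linear topology in which each task throttles its direct upstream according to its own queue load. It is not the implementation described in the paper and does not use Storm's API; the `Task` class, the `apply_backpressure` function, the thresholds `high` and `low`, and the multiplicative `step` are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """One stage of a linear topology (hypothetical model, not Storm's API)."""
    name: str
    capacity: int           # max tuples the input queue can hold
    queued: int = 0         # tuples currently waiting in the input queue
    send_rate: float = 1.0  # fraction of full speed at which this task emits

    def load(self) -> float:
        """Return the fill ratio of this task's input queue."""
        return self.queued / self.capacity


def apply_backpressure(tasks, high=0.8, low=0.3, step=0.5):
    """Walk the chain from sink to source and adjust each upstream send rate.

    A task whose queue load exceeds `high` makes its direct upstream slow
    down; a task whose load is below `low` lets the upstream speed up again.
    The thresholds and the multiplicative step are assumed values.
    """
    pairs = zip(reversed(tasks[:-1]), reversed(tasks[1:]))
    for upstream, downstream in pairs:
        if downstream.load() > high:
            upstream.send_rate = max(0.1, upstream.send_rate * step)
        elif downstream.load() < low:
            upstream.send_rate = min(1.0, upstream.send_rate / step)
    return [t.send_rate for t in tasks]
```

In this sketch only the task directly upstream of an overloaded task is slowed. Pressure reaches earlier stages one level at a time, once the throttled task's own queue fills up, which is the sense in which the adjustment is step-by-step.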

CLC Number: