Spatial-Channel Feature Representation and Self-attention Pooling for Weakly-Labeled Sound Event Detection

YANG Li-ping; HOU Zhen-wei; GU Xiao-hua; HAO Jun-yong

doi:10.12263/DZXB.20210035

Research article | Updated: 2025-12-08

- Spatial-Channel Feature Representation and Self-attention Pooling for Weakly-Labeled Sound Event Detection
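- Acta Electronica Sinica, 2023, Vol. 51, No. 2, pp. 297-306
- Author affiliations:
  1. Key Laboratory of Optoelectronic Technology and Systems of the Ministry of Education, Chongqing University, Chongqing 400044
  2. School of Electrical Engineering, Chongqing University of Science and Technology, Chongqing 401331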
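- About the authors:
  YANG Li-ping, male, born in 1981 in Ordos, Inner Mongolia. Associate professor at Chongqing University. His main research interests are machine learning, pattern recognition, and image and sound signal processing. E-mail: yanglp@cqu.edu.cn
  HOU Zhen-wei, male, born in 1996 in Xingtai, Hebei. Master's student at Chongqing University. His main research interest is sound signal processing.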
- Funding:
  National Natural Science Foundation of China (61903054)
- DOI: 10.12263/DZXB.20210035
  CLC number: TP391.4; TP37
- Received: 2020-12-29; Revised: 2021-04-07; Published in print: 2023-02-25

YANG Li-ping, HOU Zhen-wei, GU Xiao-hua, et al. Spatial-Channel Feature Representation and Self-attention Pooling for Weakly-Labeled Sound Event Detection[J]. Acta Electronica Sinica, 2023, 51(02): 297-306. DOI: 10.12263/DZXB.20210035.

Abstract

Deep neural network sound event detection (SED) models require a large number of strongly labeled audio samples annotated with sound event categories and onset and offset times. However, obtaining strong labels is difficult and time-consuming, and weakly-labeled SED is an effective way around this problem. This paper treats weakly-labeled SED as a multiple instance learning (MIL) problem and, based on the convolutional recurrent neural network (CRNN), proposes a spatial-channel feature representation and self-attention pooling method. The method addresses two aspects of multi-instance weakly-labeled SED: feature representation and the pooling of frame-level predictions. For feature representation, to strengthen the representational ability of the convolutional neural network, a gated attention structure that combines context gating with a channel attention mechanism is built and embedded into the CRNN, which realizes spatial and channel selection of audio sample features. For prediction pooling, a self-attention pooling (SAP) function for frame-level predictions is designed based on the idea of self-attention; it strengthens the correlation among event frames in an audio sample so that event frames receive larger weights. By revising both the feature representation and the prediction pooling of the CRNN, the method improves detection performance. It achieves F1 scores of 52.47% and 31.00% on the evaluation sets of the DCASE 2017 task 4 and DCASE 2018 task 4 datasets, respectively, outperforming most current weakly-labeled SED methods. The experimental results show that the proposed spatial-channel feature representation and self-attention pooling method significantly improves the overall performance of weakly-labeled SED.
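The abstract describes pooling frame-level predictions into a clip-level prediction with weights derived from self-attention, but it does not give the formula. The following is only a minimal NumPy sketch of one way such a pooling could look, not the authors' SAP function. The function names, the scaled dot-product affinity, and the toy shapes (6 frames, 4 feature dimensions, 2 classes) are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention_pool(frame_probs, frame_feats):
    """Pool frame-level probabilities (T, K) into clip-level probabilities (K,).

    Hypothetical formulation, assumed for illustration only:
    1. frame-to-frame affinity from scaled dot products of frame features,
    2. each frame's prediction is smoothed by the predictions of similar frames,
    3. a softmax over time turns the smoothed scores into per-class weights.
    """
    _, feat_dim = frame_feats.shape
    scores = frame_feats @ frame_feats.T / np.sqrt(feat_dim)  # (T, T)
    attn = softmax(scores, axis=1)                            # each row sums to 1
    refined = attn @ frame_probs                              # (T, K)
    weights = softmax(refined, axis=0)                        # each column sums to 1
    clip_probs = (weights * frame_probs).sum(axis=0)          # (K,)
    return clip_probs, weights


# Toy clip with made-up values: T = 6 frames, D = 4 features, K = 2 classes.
rng = np.random.default_rng(0)
frame_feats = rng.normal(size=(6, 4))
frame_probs = rng.uniform(size=(6, 2))

clip_probs, weights = self_attention_pool(frame_probs, frame_feats)
max_pooled = frame_probs.max(axis=0)    # MIL max-pooling baseline
mean_pooled = frame_probs.mean(axis=0)  # MIL average-pooling baseline
print("attention:", clip_probs, "max:", max_pooled, "mean:", mean_pooled)
```

Because the weights for each class sum to one over time, the pooled value is a convex combination of the frame probabilities and always lies between the minimum and maximum frame probability for that class. Max pooling and average pooling are the two fixed-weight extremes that a learned weighting of this kind sits between.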
