The Semi-Automatic Classification Data Labeling Method Based on Dispute About Weak Label

LI Zi-qiang; YANG Wei; YANG Xian-feng; LUO Lin

doi:10.12263/DZXB.20230648


PAPERS | Updated: 2025-12-08

- ACTA ELECTRONICA SINICA Vol. 52, Issue 8, Pages: 2891-2899 (2024)
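- Author affiliations:
  
  1. School of Film, Television and Media, Sichuan Normal University, Chengdu, Sichuan 610066
  2. School of Computer Science and Software, Southwest Petroleum University, Chengdu, Sichuan 610500
  3. Chengdu R&D Center, Tellhow Software Co., Ltd., Chengdu, Sichuan 610041
- Funding:
  
  National Natural Science Foundation of China (61802321); Key Research and Development Program of the Science and Technology Department of Sichuan Province (2020YFN0019)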
- DOI: 10.12263/DZXB.20230648
  CLC: TP391.1
- Received: 11 July 2023
  
  Revised: 2 January 2024
  
  Published: 25 August 2024

LI Zi-qiang, YANG Wei, YANG Xian-feng, et al. The Semi-Automatic Classification Data Labeling Method Based on Dispute About Weak Label[J]. Acta Electronica Sinica, 2024, 52(08): 2891-2899. DOI: 10.12263/DZXB.20230648

Abstract

Deep active learning (DAL) has been applied successfully to labeling data for classification, but selecting the samples that most improve model performance remains a difficult problem. This paper proposes a semi-automatic classification data labeling method based on dispute about weak labels (Dispute about Weak Label based Deep Active Learning, DWLDAL). The method iteratively selects the samples that the model finds hard to distinguish and hands them to human annotators for accurate labeling. It comprises a pseudo-label generator and weak-label generators: the pseudo-label generator is trained on the accurately annotated dataset and produces pseudo labels for the unlabeled data, while each weak-label generator is trained on a random subset of the pseudo-labeled data. A committee of weak-label generators determines which unlabeled samples are the most disputed, and those samples are sent for manual annotation. For the text classification problem, experiments were conducted on the public datasets IMDB (Internet Movie Database), 20NEWS (20NEWSgroup), and chnsenticorp (chnsenticorp_htl_all). Three voting decision schemes are evaluated from two perspectives: the accuracy of data labeling and the accuracy of the classification task. The F1 score of data labeling with DWLDAL is 30.22%, 14.07%, and 2.57% higher than that of the existing method Snuba, respectively. The F1 score of the classification task with DWLDAL is 1.01%, 22.72%, and 4.83% higher than that of Snuba, respectively.
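The selection round described in the abstract can be illustrated with a small, hedged sketch. This is not the authors' implementation: the abstract does not define its three voting schemes, so vote entropy (a common query-by-committee disagreement measure) is used here as an assumption, and a toy nearest-centroid classifier on 1-D features stands in for the deep models. All function names and parameters (`train_centroids`, `select_disputed`, `n_members`, `subset_frac`, `n_queries`) are hypothetical.

```python
import math
import random
from collections import Counter


def train_centroids(points, labels):
    """Toy stand-in classifier: one mean value per class (1-D features)."""
    sums, counts = Counter(), Counter()
    for x, y in zip(points, labels):
        sums[y] += x
        counts[y] += 1
    return {y: sums[y] / counts[y] for y in counts}


def predict(centroids, x):
    """Assign x to the class whose centroid is nearest."""
    return min(centroids, key=lambda y: abs(centroids[y] - x))


def vote_entropy(votes):
    """Entropy of the committee's label votes; 0 means full agreement."""
    total = len(votes)
    probs = [c / total for c in Counter(votes).values()]
    return sum(-p * math.log(p) for p in probs)


def select_disputed(labeled_x, labeled_y, unlabeled_x,
                    n_members=5, subset_frac=0.5, n_queries=2, seed=0):
    """Return (indices of the most disputed unlabeled samples, all scores)."""
    rng = random.Random(seed)
    # 1. Pseudo-label generator: trained on the accurately labeled data,
    #    then used to pseudo-label every unlabeled sample.
    pseudo_model = train_centroids(labeled_x, labeled_y)
    pseudo_y = [predict(pseudo_model, x) for x in unlabeled_x]
    # 2. Weak-label generators: each trained on a random pseudo-labeled subset.
    size = max(1, int(subset_frac * len(unlabeled_x)))
    committee = []
    for _ in range(n_members):
        idx = rng.sample(range(len(unlabeled_x)), size)
        committee.append(train_centroids([unlabeled_x[i] for i in idx],
                                         [pseudo_y[i] for i in idx]))
    # 3. Score each sample by committee disagreement (assumed: vote entropy).
    scores = [vote_entropy([predict(m, x) for m in committee])
              for x in unlabeled_x]
    ranked = sorted(range(len(unlabeled_x)),
                    key=lambda i: scores[i], reverse=True)
    # 4. The top-ranked samples would be handed to a human annotator.
    return ranked[:n_queries], scores
```

In an iterative setting, the returned samples would be labeled by a person, moved into the labeled set, and the round repeated; the stopping rule and batch size are left open here because the abstract does not state them.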


