A Fast Approach of Extracting Repeated String from Chinese Text

MA Ying-hua; WANG Yong-cheng; SU Gui-yang


Research Communication | Updated: 2025-07-16

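- Original title: 一种在汉语文本中抽取重复字串的快速算法
- English title: A Fast Approach of Extracting Repeated String from Chinese Text
- Acta Electronica Sinica (电子学报), 2002, Vol. 30, No. S1, pp. 2177-2180
- Author affiliation:
  
  Department of Computer Science, Shanghai Jiao Tong University, Shanghai 200030
- Funding:
  
  National Natural Science Foundation of China (No. 60082003)
- CLC number: TP391.1
- Print publication: 2002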
MA Ying-hua, WANG Yong-cheng, SU Gui-yang. A Fast Approach of Extracting Repeated String from Chinese Text[J]. Acta Electronica Sinica, 2002, 30(S1): 2177-2180.

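Abstract (translated from the Chinese)

Handling words that are not registered in the dictionary is an indispensable research direction in natural language processing. Extracting strings that occur repeatedly in a text is the most direct and simple way to find such unregistered words. Earlier algorithms run slowly and cannot meet the need for fast processing of massive text collections. Following the left-first and longest-match principles, this paper proposes a fast algorithm, Positional Remembering and Jump Matching. Its worst-case time complexity is o(t ), where t is the number of times a repeated string occurs. Comparative experiments show that the method is markedly faster, uses simple data structures, and completes its processing in a single scan.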

Abstract

The processing of words not listed in dictionaries is necessary in natural language processing. Extracting repeated strings from text is the most direct and convenient method, and it is rather effective. Existing algorithms cannot meet the speed requirements of large-scale text processing systems. Following the principles of left first and longest first, a fast approach named Positional Remembering and Jump Matching is put forward; it runs in o(t ) time in the worst case, where t is the number of times a substring repeats. Experimental results show that, compared with previous methods, this method offers high speed, simple data structures, and simultaneous scanning and matching.
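
To make the task concrete, the following is a minimal sketch of repeated-string extraction with a longest-match filter. It is a naive baseline written for illustration only, not the Positional Remembering and Jump Matching algorithm, whose details are not given in this abstract. The function name `repeated_strings` and the parameters `min_len`, `max_len`, and `min_count` are assumptions introduced here.

```python
from collections import defaultdict


def repeated_strings(text, min_len=2, max_len=8, min_count=2):
    """Naive baseline (hypothetical helper, not the paper's method).

    Counts every substring of length min_len..max_len, keeps those that
    occur at least min_count times, then applies a longest-match filter.
    Overlapping occurrences are counted separately.
    """
    positions = defaultdict(list)  # substring -> list of start offsets
    for n in range(min_len, max_len + 1):
        for i in range(len(text) - n + 1):
            positions[text[i:i + n]].append(i)

    repeated = {s: p for s, p in positions.items() if len(p) >= min_count}

    # Longest-match filter: drop a string when a longer repeated string
    # contains it and occurs exactly as often, since the shorter string
    # then carries no extra information.
    result = {}
    for s, p in repeated.items():
        subsumed = any(
            len(t) > len(s) and s in t and len(q) == len(p)
            for t, q in repeated.items()
        )
        if not subsumed:
            result[s] = len(p)
    return result


print(repeated_strings("自然语言处理和自然语言理解"))  # {'自然语言': 2}
```

This baseline enumerates every candidate substring and compares the repeated strings pairwise, so it scans the text many times over. That is the kind of cost a single-scan method such as the one described above aims to avoid.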
