1. School of Computer Engineering, Huaihai Institute of Technology, Lianyungang, Jiangsu 222000
2. School of Computer Science, Shanghai University, Shanghai 200072
Print publication: 2014
ZHONG Zhao-man, LI Cun-hua, LIU Zong-tian, et al. A Method of Multi-Topic Crawling Based on Search Strategy[J]. Acta Electronica Sinica, 2014, 42(12): 2352-2358. DOI: 10.3969/j.issn.0372-2112.2014.12.003.
To address the low efficiency of multi-topic information crawling, this paper investigates how the search results returned for topic rules differ between built-in search engines (BSEs) and general search engines (GSEs). It proposes splitting topic rules into atomic rules and analyzes three relations between atomic rules: identity, interchange, and containment. Based on these relations, separate atomic-rule allocation strategies are designed for BSEs and GSEs, which both improve the precision of topic-specific crawling and reduce the number of search requests. Because searching directly with atomic rules yields low precision, a sentence-group-based method for filtering results by topic-document relevance is also proposed. The experiments use 138 topic rules (8223 atomic rules after splitting), 14 BSEs, and 4 GSEs, and compare the total number of items crawled per unit time and the number of relevant items crawled per unit time. The results show that the proposed method performs well on both measures.
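The abstract does not give formal definitions of atomic rules or of the three relations, so the following Python sketch is only an illustration under stated assumptions. It assumes a topic rule is a conjunction of keyword groups (each group a list of alternatives), that an atomic rule picks one keyword from each group, and that the three relations can be read as: same keywords in the same order, same keywords in a different order, and one keyword set strictly inside another. The function names and the sample keywords are hypothetical and are not taken from the paper.

```python
from itertools import product


def split_into_atomic_rules(topic_rule):
    """Expand a topic rule into atomic rules.

    Assumption: topic_rule is a list of keyword groups, read as
    (g1[0] OR g1[1] ...) AND (g2[0] OR g2[1] ...) AND ...
    Each atomic rule is a tuple holding one keyword from every group.
    """
    return [tuple(choice) for choice in product(*topic_rule)]


def relation(rule_a, rule_b):
    """Classify two atomic rules under the assumed reading of the relations."""
    set_a, set_b = frozenset(rule_a), frozenset(rule_b)
    if tuple(rule_a) == tuple(rule_b):
        return "identical"        # same keywords, same order
    if set_a == set_b:
        return "interchangeable"  # same keywords, different order
    if set_a < set_b or set_b < set_a:
        return "containment"      # one keyword set strictly inside the other
    return "unrelated"


def drop_redundant_rules(atomic_rules):
    """Keep one representative per keyword set, preserving first-seen order.

    This removes identical and interchangeable duplicates, which is one
    plausible way to cut the number of search requests.
    """
    seen = set()
    kept = []
    for rule in atomic_rules:
        key = frozenset(rule)
        if key not in seen:
            seen.add(key)
            kept.append(rule)
    return kept


if __name__ == "__main__":
    # Hypothetical topic rule: (earthquake OR quake) AND (Sichuan OR Yunnan)
    rule = [["earthquake", "quake"], ["Sichuan", "Yunnan"]]
    for atomic in split_into_atomic_rules(rule):
        print(" ".join(atomic))
```

Under this reading, a containment pair could be handled differently per engine type (for example, sending only the shorter rule to an engine with weak query support), but the abstract does not say how the allocation strategies actually use the relations.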