Acta Electronica Sinica ›› 2015, Vol. 43 ›› Issue (5): 1007-1013. DOI: 10.3969/j.issn.0372-2112.2015.05.026

• Research Communications •

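High-Throughput DNA Sequence Data Compression Method Based on Codebook Index Transformation

TAN Li, SUN Ji-feng   

  1. School of Electronic and Information Engineering, South China University of Technology, Guangzhou, Guangdong 510641, China
  • Received: 2014-01-07 Revised: 2014-04-17 Online: 2015-05-25 Published: 2015-05-25
    • About the authors:
    • TAN Li, female, born in 1984 in Changde, Hunan. She is a Ph.D. candidate in Information and Communication Engineering at South China University of Technology. Her research interests are bioinformatics, computational intelligence, and image and video processing. E-mail: t.li07@mail.scut.edu.cn; SUN Ji-feng, male, born in 1962 in Jieyang, Guangdong. He is a professor and doctoral supervisor at the School of Electronic and Information Engineering, South China University of Technology. His research interests include intelligent signal processing, image and video processing, and self-organizing communication networks.
    • Funding:
    • Young Scientists Fund of the National Natural Science Foundation of China (No.61202292); Natural Science Foundation of Guangdong Province (No.9151064101000037)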

Abstract:

A novel high-throughput DNA sequence compression method based on codebook index transformation (CITD) is proposed. CITD first applies the codebook index transformation (CIT) model, which replaces the traditional representation of codebook indexes with quaternary values written in the four standard base characters, and adopts a simple encoding scheme to mark the boundary between replaced and non-replaced substrings. It then decides, according to the information entropy of the data, whether to apply the Burrows-Wheeler transform (BWT), and finally applies the move-to-front (MTF) transform and Huffman entropy coding. Experimental results on several sequencing data sets show that, in most cases, CITD achieves better compression performance than the high-throughput DNA-specific compression methods compared in this paper.

Key words: high-throughput DNA sequence, codebook index transformation (CIT) model, Burrows-Wheeler transform (BWT), move-to-front (MTF), information entropy, data compression algorithm
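
To make the pipeline named in the abstract more concrete, the following is a minimal Python sketch of three of its stages: writing a codebook index as a quaternary number over the base characters, an entropy-gated BWT, and the MTF transform. It is an illustration under stated assumptions, not the authors' implementation. The digit order `A=0, C=1, G=2, T=3`, the fixed index width, the `$` sentinel, the threshold value `1.9`, and the direction of the entropy test are all assumptions that the abstract does not specify; the scheme for delimiting replaced and non-replaced substrings and the final Huffman stage are omitted.

```python
import math
from collections import Counter

BASES = "ACGT"  # assumed digit order: A=0, C=1, G=2, T=3


def index_to_bases(index: int, width: int) -> str:
    """Write a codebook index as a fixed-width base-4 number with A/C/G/T as digits."""
    if index < 0 or index >= 4 ** width:
        raise ValueError("index does not fit in the given width")
    digits = []
    for _ in range(width):
        index, digit = divmod(index, 4)
        digits.append(BASES[digit])
    return "".join(reversed(digits))  # most significant digit first


def bases_to_index(text: str) -> int:
    """Inverse of index_to_bases."""
    value = 0
    for ch in text:
        value = value * 4 + BASES.index(ch)
    return value


def shannon_entropy(text: str) -> float:
    """Zero-order Shannon entropy of the text, in bits per symbol."""
    if not text:
        return 0.0
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in Counter(text).values())


def bwt(text: str, sentinel: str = "$") -> str:
    """Naive Burrows-Wheeler transform by sorting all rotations (for illustration only)."""
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def mtf(text: str, alphabet: str) -> list[int]:
    """Move-to-front transform: emit each symbol's current position, then move it to the front."""
    table = list(alphabet)
    out = []
    for ch in text:
        pos = table.index(ch)
        out.append(pos)
        table.insert(0, table.pop(pos))
    return out


def transform(text: str, entropy_threshold: float = 1.9) -> tuple[bool, list[int]]:
    """Apply the BWT only when the entropy test passes, then apply MTF.

    Both the threshold and the direction of the comparison are assumptions.
    """
    use_bwt = shannon_entropy(text) < entropy_threshold
    staged = bwt(text) if use_bwt else text
    return use_bwt, mtf(staged, "$" + BASES)
```

For example, `index_to_bases(27, 4)` yields `"ACGT"` because 27 is `0123` in base 4. The MTF output would then be passed to an entropy coder such as Huffman coding, as the abstract describes.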
