n-Gram-Based Text Compression

. 2016 ; 2016 () : 9483646. [epub] 20161114

Language: English. Country: United States. Medium: print-electronic

Document type: journal articles

Persistent link   https://www.medvik.cz/link/pmid27965708

We propose an efficient method for compressing Vietnamese text using n-gram dictionaries. It achieves a high compression ratio compared with state-of-the-art methods on the same dataset. Given a text, the method first splits it into n-grams and then encodes them using the n-gram dictionaries. In the encoding phase, a sliding window whose size ranges from bigrams to five-grams is used to obtain the best encoding stream. Each n-gram is encoded in two to four bytes, depending on its corresponding n-gram dictionary. We collected a 2.5 GB text corpus from several Vietnamese news agencies to build n-gram dictionaries from unigrams to five-grams, obtaining dictionaries with a total size of 12 GB. To evaluate the method, we collected a test set of 10 text files of different sizes. The experimental results indicate that the method achieves a compression ratio of around 90% and outperforms state-of-the-art methods.
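The encoding idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a greedy longest-match strategy (try a five-gram first, then shrink the window down to a unigram fallback), uses toy dictionaries built from a two-line corpus, and emits `(n, index)` pairs instead of the two-to-four-byte codes, whose exact layout the abstract does not specify. The names `build_dictionaries`, `encode`, and `decode` are hypothetical.

```python
from typing import Dict, List, Tuple

Gram = Tuple[str, ...]
MAX_N = 5  # largest window size named in the abstract (five-grams)


def build_dictionaries(corpus: List[str], max_n: int = MAX_N) -> Dict[int, Dict[Gram, int]]:
    """Map every n-gram (n = 1..max_n) seen in the corpus to an integer index."""
    dicts: Dict[int, Dict[Gram, int]] = {n: {} for n in range(1, max_n + 1)}
    for sentence in corpus:
        words = sentence.split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                gram = tuple(words[i:i + n])
                # Assign the next free index the first time a gram is seen.
                dicts[n].setdefault(gram, len(dicts[n]))
    return dicts


def encode(text: str, dicts: Dict[int, Dict[Gram, int]], max_n: int = MAX_N) -> List[Tuple[int, int]]:
    """Greedy longest-match encoding (an assumption, not the paper's exact search)."""
    words = text.split()
    codes: List[Tuple[int, int]] = []
    i = 0
    while i < len(words):
        # Try the widest window first, then shrink it one word at a time.
        for n in range(min(max_n, len(words) - i), 0, -1):
            index = dicts[n].get(tuple(words[i:i + n]))
            if index is not None:
                codes.append((n, index))
                i += n
                break
        else:
            raise KeyError(f"word not in unigram dictionary: {words[i]!r}")
    return codes


def decode(codes: List[Tuple[int, int]], dicts: Dict[int, Dict[Gram, int]]) -> str:
    """Invert the dictionaries and expand each (n, index) pair back into words."""
    inverse = {n: {idx: gram for gram, idx in d.items()} for n, d in dicts.items()}
    words: List[str] = []
    for n, index in codes:
        words.extend(inverse[n][index])
    return " ".join(words)


# Toy corpus with placeholder tokens; real dictionaries would come from a large corpus.
dicts = build_dictionaries(["a b c d e f", "c d x"])
codes = encode("a b c d e f", dicts)
print(codes)                 # [(5, 0), (1, 5)]
print(decode(codes, dicts))  # a b c d e f
```

In this sketch six words collapse into two codes, which is where the savings would come from once each code is packed into a few bytes. Decoding joins words with single spaces, so the round trip is exact only for text whose whitespace is already normalized.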
