n-Gram-Based Text Compression
VH. Nguyen, HT. Nguyen, HN. Duong, V. Snasel
Language: English; Country: United States
Document type: journal articles
NLK
Free Medical Journals
from 2007
Hindawi Publishing Open Access
from 2007-06-25
PubMed Central
from 2007
Europe PubMed Central
from 2007
ProQuest Central
from 2008-01-01
Open Access Digital Library
from 2007-01-01
Open Access Digital Library
from 2007-06-25
Medline Complete (EBSCOhost)
from 2007-01-01
Health & Medicine (ProQuest)
from 2008-01-01
PubMed
27965708
DOI
10.1155/2016/9483646
Knihovny.cz E-resources
- MeSH
- Algorithms * MeSH
- Asian People * MeSH
- Data Compression * MeSH
- Humans MeSH
- Vocabulary * MeSH
- Dictionaries as Topic * MeSH
- Check Tag
- Humans MeSH
- Publication type
- Journal Article MeSH
We propose an efficient method for compressing Vietnamese text using n-gram dictionaries. It has a significant compression ratio in comparison with those of state-of-the-art methods on the same dataset. Given a text, first, the proposed method splits it into n-grams and then encodes them based on n-gram dictionaries. In the encoding phase, we use a sliding window with a size that ranges from bigram to five grams to obtain the best encoding stream. Each n-gram is encoded by two to four bytes accordingly based on its corresponding n-gram dictionary. We collected 2.5 GB text corpus from some Vietnamese news agencies to build n-gram dictionaries from unigram to five grams and achieve dictionaries with a size of 12 GB in total. In order to evaluate our method, we collected a testing set of 10 different text files with different sizes. The experimental results indicate that our method achieves compression ratio around 90% and outperforms state-of-the-art methods.
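The encoding step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy dictionaries, the `(n, index)` code layout, the literal fallback for unknown words, and the greedy longest-match strategy are all assumptions made for illustration. The abstract only states that a sliding window of two to five words is used and that each n-gram is encoded in two to four bytes.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical toy dictionaries: n -> list of n-grams; the list position is the code.
# The paper builds its dictionaries from a 2.5 GB corpus; these entries are invented.
NGRAMS: Dict[int, List[Tuple[str, ...]]] = {
    1: [("ha",), ("noi",), ("viet",), ("nam",)],
    2: [("viet", "nam"), ("ha", "noi")],
}
# Reverse lookup: n -> {n-gram: index}
INDEX = {n: {g: i for i, g in enumerate(grams)} for n, grams in NGRAMS.items()}

# A code is (n, dictionary index), or (0, literal word) for out-of-dictionary words.
Code = Tuple[int, Union[int, str]]


def encode(text: str, max_n: int = 5) -> List[Code]:
    """Greedy longest-match encoding (an assumed strategy, not the paper's)."""
    words = text.split()
    out: List[Code] = []
    pos = 0
    while pos < len(words):
        # Try the widest window first, then shrink it down to a single word.
        for n in range(min(max_n, len(words) - pos), 0, -1):
            gram = tuple(words[pos:pos + n])
            idx = INDEX.get(n, {}).get(gram)
            if idx is not None:
                out.append((n, idx))
                pos += n
                break
        else:
            # No dictionary contains this word: keep it as a literal.
            out.append((0, words[pos]))
            pos += 1
    return out


def decode(codes: List[Code]) -> str:
    """Invert encode() by looking each code up in its n-gram dictionary."""
    words: List[str] = []
    for n, value in codes:
        if n == 0:
            words.append(str(value))
        else:
            words.extend(NGRAMS[n][int(value)])
    return " ".join(words)


codes = encode("ha noi viet nam xyz")
restored = decode(codes)
```

In this sketch each code names the dictionary it came from, so the decoder needs only the same dictionaries to restore the text. A real encoder would additionally pack each code into a fixed number of bytes per dictionary.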
- 000
- 00000naa a2200000 a 4500
- 001
- bmc17013342
- 003
- CZ-PrNML
- 005
- 20170428110940.0
- 007
- ta
- 008
- 170413s2016 xxu f 000 0|eng||
- 009
- AR
- 024 7_
- $a 10.1155/2016/9483646 $2 doi
- 035 __
- $a (PubMed)27965708
- 040 __
- $a ABA008 $b cze $d ABA008 $e AACR2
- 041 0_
- $a eng
- 044 __
- $a xxu
- 100 1_
- $a Nguyen, Vu H $u Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam.
- 245 10
- $a n-Gram-Based Text Compression / $c VH. Nguyen, HT. Nguyen, HN. Duong, V. Snasel,
- 520 9_
- $a We propose an efficient method for compressing Vietnamese text using n-gram dictionaries. It has a significant compression ratio in comparison with those of state-of-the-art methods on the same dataset. Given a text, first, the proposed method splits it into n-grams and then encodes them based on n-gram dictionaries. In the encoding phase, we use a sliding window with a size that ranges from bigram to five grams to obtain the best encoding stream. Each n-gram is encoded by two to four bytes accordingly based on its corresponding n-gram dictionary. We collected 2.5 GB text corpus from some Vietnamese news agencies to build n-gram dictionaries from unigram to five grams and achieve dictionaries with a size of 12 GB in total. In order to evaluate our method, we collected a testing set of 10 different text files with different sizes. The experimental results indicate that our method achieves compression ratio around 90% and outperforms state-of-the-art methods.
- 650 12
- $a algoritmy $7 D000465
- 650 12
- $a Asijci $7 D044466
- 650 12
- $a komprese dat $7 D044962
- 650 12
- $a slovníky jako téma $7 D004014
- 650 _2
- $a lidé $7 D006801
- 650 12
- $a slovní zásoba $7 D014825
- 655 _2
- $a časopisecké články $7 D016428
- 700 1_
- $a Nguyen, Hien T $u Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam.
- 700 1_
- $a Duong, Hieu N $u Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology, Ho Chi Minh City, Vietnam.
- 700 1_
- $a Snasel, Vaclav $u Faculty of Electrical Engineering and Computer Science, VSB-Technical University of Ostrava, Ostrava, Czech Republic.
- 773 0_
- $w MED00163305 $t Computational intelligence and neuroscience $x 1687-5273 $g Roč. 2016, č. - (2016), s. 9483646
- 856 41
- $u https://pubmed.ncbi.nlm.nih.gov/27965708 $y Pubmed
- 910 __
- $a ABA008 $b sig $c sign $y a $z 0
- 990 __
- $a 20170413 $b ABA008
- 991 __
- $a 20170428111301 $b ABA008
- 999 __
- $a ok $b bmc $g 1199807 $s 974120
- BAS __
- $a 3
- BAS __
- $a PreBMC
- BMC __
- $a 2016 $b 2016 $c - $d 9483646 $e 20161114 $i 1687-5273 $m Computational intelligence and neuroscience $n Comput Intell Neurosci $x MED00163305
- LZP __
- $a Pubmed-20170413