n-Gram-Based Text Compression

VH. Nguyen, HT. Nguyen, HN. Duong, V. Snasel

Computational Intelligence and Neuroscience. 2016 ; 2016 (-) : 9483646. [pub] 20161114

Language: English. Country: United States of America

Document type: journal articles

Persistent link   https://www.medvik.cz/link/bmc17013342

We propose an efficient method for compressing Vietnamese text using n-gram dictionaries. It achieves a significantly higher compression ratio than state-of-the-art methods on the same dataset. Given a text, the proposed method first splits it into n-grams and then encodes them using n-gram dictionaries. In the encoding phase, we use a sliding window whose size ranges from bigrams to five-grams to obtain the best encoding stream. Each n-gram is encoded in two to four bytes, depending on its corresponding n-gram dictionary. We collected a 2.5 GB text corpus from Vietnamese news agencies to build n-gram dictionaries from unigrams to five-grams, yielding dictionaries with a total size of 12 GB. To evaluate our method, we collected a test set of 10 text files of different sizes. The experimental results indicate that our method achieves a compression ratio of around 90% and outperforms state-of-the-art methods.
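The encoding idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes whitespace tokenization, a greedy longest-match window that also falls back to unigrams, and abstract `(n, index)` tokens in place of the paper's two-to-four-byte codes, whose layout the abstract does not specify. The names `build_dictionaries`, `encode`, `decode`, and `space_savings` are hypothetical, and the last one assumes that the reported "compression ratio" is read as the percentage of space saved.

```python
from typing import Dict, List, Tuple, Union

Gram = Tuple[str, ...]
# (n, index) points into the n-gram dictionary; (0, word) is a literal fallback.
Token = Tuple[int, Union[int, str]]


def build_dictionaries(sentences: List[str], max_n: int = 5) -> Dict[int, List[Gram]]:
    """Collect the distinct n-grams (n = 1..max_n) of a toy corpus, in first-seen order."""
    dicts: Dict[int, List[Gram]] = {n: [] for n in range(1, max_n + 1)}
    seen = set()
    for sentence in sentences:
        words = sentence.split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                gram = tuple(words[i:i + n])
                if gram not in seen:
                    seen.add(gram)
                    dicts[n].append(gram)
    return dicts


def encode(text: str, dicts: Dict[int, List[Gram]], max_n: int = 5) -> List[Token]:
    """Greedy longest-match encoding: try the widest window first, then shrink it."""
    index = {n: {g: i for i, g in enumerate(grams)} for n, grams in dicts.items()}
    words = text.split()
    tokens: List[Token] = []
    pos = 0
    while pos < len(words):
        for n in range(min(max_n, len(words) - pos), 0, -1):
            gram = tuple(words[pos:pos + n])
            if gram in index.get(n, {}):
                tokens.append((n, index[n][gram]))
                pos += n
                break
        else:
            # Assumption: words missing from every dictionary are kept as literals.
            tokens.append((0, words[pos]))
            pos += 1
    return tokens


def decode(tokens: List[Token], dicts: Dict[int, List[Gram]]) -> str:
    """Invert encode() by looking each token up in its dictionary."""
    words: List[str] = []
    for n, payload in tokens:
        if n == 0:
            words.append(payload)
        else:
            words.extend(dicts[n][payload])
    return " ".join(words)


def space_savings(original_bytes: int, compressed_bytes: int) -> float:
    """Percentage of space saved (an assumed reading of 'compression ratio')."""
    return 100.0 * (original_bytes - compressed_bytes) / original_bytes


dicts = build_dictionaries(["the quick brown fox jumps over the lazy dog"])
text = "the quick brown fox jumps over unknown dog"
tokens = encode(text, dicts)
print(tokens)
print(decode(tokens, dicts) == text)
```

In this toy run the first five words collapse into a single five-gram token, while the unseen word is carried through as a literal. A real encoder would additionally need a byte-level code layout and a strategy for choosing among overlapping windows, neither of which is described in this record.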

000      
00000naa a2200000 a 4500
001      
bmc17013342
003      
CZ-PrNML
005      
20170428110940.0
007      
ta
008      
170413s2016 xxu f 000 0|eng||
009      
AR
024    7_
$a 10.1155/2016/9483646 $2 doi
035    __
$a (PubMed)27965708
040    __
$a ABA008 $b cze $d ABA008 $e AACR2
041    0_
$a eng
044    __
$a xxu
100    1_
$a Nguyen, Vu H $u Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam.
245    10
$a n-Gram-Based Text Compression / $c VH. Nguyen, HT. Nguyen, HN. Duong, V. Snasel,
520    9_
$a We propose an efficient method for compressing Vietnamese text using n-gram dictionaries. It achieves a significantly higher compression ratio than state-of-the-art methods on the same dataset. Given a text, the proposed method first splits it into n-grams and then encodes them using n-gram dictionaries. In the encoding phase, we use a sliding window whose size ranges from bigrams to five-grams to obtain the best encoding stream. Each n-gram is encoded in two to four bytes, depending on its corresponding n-gram dictionary. We collected a 2.5 GB text corpus from Vietnamese news agencies to build n-gram dictionaries from unigrams to five-grams, yielding dictionaries with a total size of 12 GB. To evaluate our method, we collected a test set of 10 text files of different sizes. The experimental results indicate that our method achieves a compression ratio of around 90% and outperforms state-of-the-art methods.
650    12
$a algorithms $7 D000465
650    12
$a Asians $7 D044466
650    12
$a data compression $7 D044962
650    12
$a dictionaries as topic $7 D004014
650    _2
$a humans $7 D006801
650    12
$a vocabulary $7 D014825
655    _2
$a journal articles $7 D016428
700    1_
$a Nguyen, Hien T $u Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam.
700    1_
$a Duong, Hieu N $u Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology, Ho Chi Minh City, Vietnam.
700    1_
$a Snasel, Vaclav $u Faculty of Electrical Engineering and Computer Science, VSB-Technical University of Ostrava, Ostrava, Czech Republic.
773    0_
$w MED00163305 $t Computational intelligence and neuroscience $x 1687-5273 $g Vol. 2016, No. - (2016), p. 9483646
856    41
$u https://pubmed.ncbi.nlm.nih.gov/27965708 $y Pubmed
910    __
$a ABA008 $b sig $c sign $y a $z 0
990    __
$a 20170413 $b ABA008
991    __
$a 20170428111301 $b ABA008
999    __
$a ok $b bmc $g 1199807 $s 974120
BAS    __
$a 3
BAS    __
$a PreBMC
BMC    __
$a 2016 $b 2016 $c - $d 9483646 $e 20161114 $i 1687-5273 $m Computational intelligence and neuroscience $n Comput Intell Neurosci $x MED00163305
LZP    __
$a Pubmed-20170413
