×

Text compression methods. (English. Russian original) Zbl 0728.68095

J. Sov. Math. 56, No. 1, 2249-2262 (1991); translation from Itogi Nauki Tekh., Ser. Teor. Veroyatn., Mat. Stat., Teor. Kibern. 28, 85-109 (1988).
Review article.

MSC:

68T50 Natural language processing
94A24 Coding theorems (Shannon theory)
68U15 Computing methodologies for text processing; mathematical typography
Full Text: DOI

References:

[1] R. N. Ainon, ?Storing text using integer codes,? Proc. 11th Int. Conf. on Comput. Ling. COLING ’86, 25?29 Aug. 1986, Bonn (1986), pp. 418?420.
[2] P. A. Alsberg, ?Space and time savings through large data base compression and dynamic restructuring,? Proc. IEEE,63, No. 8, 1114?1112 (1975). · doi:10.1109/PROC.1975.9903
[3] I. J. Barton, S. E. Creasy, M. F. Lynch, and M. J. Snell, ?An information-theoretic approach to text searching in direct access systems,? Commun. ACM,17, 345?350 (1974). · doi:10.1145/355616.364034
[4] I. J. Barton, M. F. Lynch, J. H. Petrie, and M. J. Snell, ?Variable-length character string analysis of three databases and their application for file compression,? Informatics I, Proc. 1st Informatics Conf., Durham Univ., 1973, ASLIB, London (1974), pp. 154?162.
[5] M. A. Bassiouni, ?Data compression in scientific and statistical databases,? IEEE Trans. Software Eng.,11, No. 10, 1047?1058 (1985). · doi:10.1109/TSE.1985.231852
[6] M. A. Bassiouni and B. Ok, ?Double encoding ? a technique for reducing storage requirements of text,? Inform. Syst.,11, No. 2, 117?184 (1986). · doi:10.1016/0306-4379(86)90006-2
[7] M. A. Bassiouni and K. A. Hazboun, ?A multi-group technique for data compression,? Proc. ASM SIGMOD Int. Conf. on Management of Data (1982), pp. 284?292.
[8] M. A. Bassiouni and K. A. Hazboun, ?Utilization of character reference locality for efficient storage of databases,? Proc. 2nd Int. Workshop on Statistical Database Management, Los Altos, Ca. (1983), pp. 338?334.
[9] D. Batory, ?Index coding: a compression technique for large statistical databases,? Proc. 2nd Int. Workshop on Statistical Database Management (1983), pp. 306?314.
[10] R. W. Bemer, ?Do it by the numbers ? digital shorthand,? Commun. ACM,3, No. 10, 530?536 (1960). · doi:10.1145/367415.367423
[11] L. S. Bobrov and S. L. Hakimi, ?Graph theoretic prefix codes and their synchronizing properties,? Inform. Contr.,16, No. 1, 70?94 (1969). · Zbl 0187.41701 · doi:10.1016/S0019-9958(69)90641-X
[12] A. Bookstein and G. Faulty, ?A mathematical model for estimating the effectiveness of bigram coding,? Inform. Process. Manag.,12, No. 2, 111?116 (1976). · doi:10.1016/0306-4573(76)90041-8
[13] A. D. Booth, ?A ?law? of occurrences of words of low frequency,? Inform. Contr.,10, 386?393 (1967). · Zbl 0148.40604 · doi:10.1016/S0019-9958(67)90201-X
[14] E. V. Brack, D. Cooper, and M. F. Lynch, ?The stability of symbol sets produced by variety generation from bibliographic data,? Program,12, No. 2, 64?77 (1978). · doi:10.1108/eb046772
[15] S. D. Bradley, ?Optimizing a scheme for run-length encoding,? Proc. IEEE (letters),57, No. 1, 108?109 (1969). · doi:10.1109/PROC.1969.6899
[16] J. M. Bray, V. P. Nelson, B. A. deMain, and J. D. Irwin, ?Data compression techniques ease storage problems,? Comput. Design,24, No. 14, 102?106 (1985).
[17] T. Ch. Chen and I. T. Ho, ?Storage efficient representation of decimal data,? Commun. ACM,18, No. 1, 49?52 (1975). · Zbl 0291.68021 · doi:10.1145/360569.360660
[18] K. Choros, P. Majewska, and A. Sieminski, ?Bibliographic data base comperssion,? Int. Forum on Information and Documentation,12, No. 1, 28?32 (1987).
[19] Y. Choueka, A. S. Fraenkel, and Y. Perl, ?Polynomial construction of optimal prefix tables for text compression,? Proc. 19th Annual Allerton Conf. on Communication, Control, and Computing (1981), pp. 762?768.
[20] A. C. Clare, E. M. Cook, and M. F. Lynch, ?The identification of varible-length equifrequent character strings in natural language data base,? Computer J.,15, No. 3, 259?262 (1972). · doi:10.1093/comjnl/15.3.259
[21] D. Colombo and J. E. Rush, ?Use of word fragments in computer based retrieval systems,? J. Chem. Document.,9, No. 1, 47?50 (1969). · doi:10.1021/c160032a011
[22] J. B. Conuel, ?A Huffman-Shannon-Fano code,? Proc. IEEE, 1046?1047 (1973).
[23] D. Cooper, M. A. Emily, M. F. Lynch, and A. R. Yeates, ?Compression of continuous prose texts using variety generation,? J. ASIS,31, No. 3, 201?207 (1980).
[24] D. Cooper and M. F. Lynch, ?Compression of Wiswester Line Notation using variety generation,? J. Chem. Inform. Comput. Sci.,19, No. 3, 165?169 (1979). · doi:10.1021/ci60019a011
[25] D. Cooper and M. F. Lynch, ?Text compression using variable-to-fixed length encodings,? J. ASIS,33, No. 1, 18?31 (1982).
[26] D. Cooper, ?The storage problem,? Mech. Transl.,5, No. 2, 74?83 (1958).
[27] G. V. Cormack, ?Data compression on a database system,? Commun. ACM,28, No. 12, 1336?1342 (1985). · doi:10.1145/214956.214963
[28] D. Cortesi, ?An effective text compression algorithm,? BYTE,7, No. 1, 397?403 (1982).
[29] T. M. Cover, ?Enumerative source coding,? IEEE Trans. Inform. Theory,19, 73 (1973). · Zbl 0247.94009 · doi:10.1109/TIT.1973.1054929
[30] S. E. Creasy, M. F. Lynch, and J. H. Petrie, ?Compression of biblioraphic database using a variable-to-fixed length bit string transformations,? Program,8, No. 4, 191?195 (1974). · doi:10.1108/eb046708
[31] N. Faller, ?An adaptive system for data compression,? 7th Assilomar Conf. on Circuits, Systems, and Computers (1973), pp. 593?597.
[32] R. M. Fano, Transmission of Information, MIT Press (1961). · Zbl 0116.23405
[33] A. S. Fraenkel and M. Mor, ?Combinatorial compression and partitioning of large dictionaries,? Comput. J.,26, No. 4, 336?343 (1983). · Zbl 0523.68090 · doi:10.1093/comjnl/26.4.336
[34] A. S. Fraenkel, M. Mor, and Y. Perl, ?Is the compression by prefixes and suffixes practical?? Acta Inform.,20, No. 4, 371?379 (1983). · doi:10.1007/BF00264280
[35] R. G. Gallager, ?Variations on a theme by Huffman,? IEEE Trans. Inform. Theory,24, No. 6, 668?674 (1978). · Zbl 0399.94012 · doi:10.1109/TIT.1978.1055959
[36] F. Gey, F. McCarthy, D. Merril, and H. Holmes, ?Computer independent data compression for large statistical databases,? Proc. 2nd Int. Workshop on Statistical Database Management (1983), pp. 296?305.
[37] E. N. Gilbert and E. F. Moore, ?Variable length binary encoding,? Bell Syst. Tech. J.,38, 913?967 (1959). · doi:10.1002/j.1538-7305.1959.tb01583.x
[38] S. W. Golomb, ?Run-length encoding,? IEEE Trans. Inform. Theory,12, 399?401 (1966). · Zbl 0141.14202 · doi:10.1109/TIT.1966.1053907
[39] D. Gotlieb, S. A. Hagerth, P. G. H. Lehot, and H. S. Rabinowitz, ?A classification of compression methods and their usefulness for a large data processing center,? Proc. Nat. Comput. Conf., 1975, Vol. 44, 453?458, AFIPS Press, Montville, N.J. (1975).
[40] P. Goyal, ?Coding methods for text string search on compressed databases,? Inform. Syst.,8, No. 3, 231?233 (1983). · doi:10.1016/0306-4379(83)90009-1
[41] D. W. Grooms, Data Compression, NTIS, Springfield, Va. (1977).
[42] M. Guazzo, ?A general minimum redundancy source coding algorithm,? IEEE Trans. Inform. Theory,26, No. 1, 15?25 (1980). · Zbl 0429.94013 · doi:10.1109/TIT.1980.1056143
[43] W. D. Hagamen et al., ?Encoding verbal information as unique numbers,? IBM Syst. J.,11, No. 4, 278?315 (1972). · doi:10.1147/sj.114.0278
[44] B. Hahn, ?A new technique for compression and storage of data,? Commun. ACM,17, No. 8, 434?436 (1974). · Zbl 0281.68016 · doi:10.1145/361082.361086
[45] G. Hallman, ?How to squeeze data into smaller spaces,? Can. Datasystems,10, No. 9, 27?29 (1978).
[46] H. S. Heaps, ?Storage analysis of compression coding for document data bases,? INFOR J.,10, No. 1, 47?61 (1972).
[47] H. S. Heaps, ?Data compression for large document data base,? Conf. on Large Data Bases Sponsored by NAS/NRC, Committee on Chemical Information, National Academy of Sciences (1974).
[48] H. S. Heaps, ?Data compression for large document data bases,? J. Chem. Inform. Comput. Sci.,15, No. 1, 32?39 (1975). · doi:10.1021/ci60001a011
[49] H. S. Heaps and T. Radhakrishnan, ?Compaction of diagnosis messages for compilers,? Software Pract. Exper.,7, 139 (1977). · Zbl 0341.68020 · doi:10.1002/spe.4380070109
[50] D. Huffman, ?A method for the construction of minimum redundancy codes,? Proc. IRE,40, 1098?1101 (1952). · Zbl 0137.13605 · doi:10.1109/JRPROC.1952.273898
[51] M. Jakobsson, ?Huffman coding in bit-vector compression,? Inform. Process. Lett.,7, No. 6, 304?307 (1978). · Zbl 0385.94018 · doi:10.1016/0020-0190(78)90023-6
[52] J. Jelinek and K. S. Schneider, ?On variable-length-to-block coding,? IEEE Trans. Inform. Theory,18, No. 6, 765?774 (1972). · Zbl 0247.94014 · doi:10.1109/TIT.1972.1054899
[53] G. C. Jewell, ?Text compaction for information retrieval systems,? IEEE Systems, Man, and Cybernetics Society News-letters,6, 4?7 (1976).
[54] C. B. Johnes, ?An efficient coding system for long source sequences,? IEEE Trans. Inform. Theory,27, 280?291 (1981). · Zbl 0458.94029 · doi:10.1109/TIT.1981.1056356
[55] J. Katajainen, M. Penttonen, and J. Teuhola, ?Syntax-directed compression of program files,? Software Pract. Exper.,16, No. 3, 269?276 (1986). · doi:10.1002/spe.4380160307
[56] G. G. Langdon and J. Rissanen, ?A double adaptive file compression algorithm,? IEEE Trans. Commun.,31, 1253?1255 (1983). · Zbl 0517.94005 · doi:10.1109/TCOM.1983.1095765
[57] C. A. Lynch and E. B. Brownrigg, ?Application of data compression technique to large bibliographic databases,? Proc. 7th Int. Conf. on Very Large Data Bases (1981), pp. 435?447.
[58] M. F. Lynch, J. H. Pétrie, and M. J. Snell, ?Analysis of the microstructure of titles in the INSPEC database,? Inform. Storage and Retrieval,9, No. 6, 331?337 (1973). · doi:10.1016/0020-0271(73)90072-7
[59] M. F. Lynch, ?Variety generation ? a reinterpretation of Shannon’s mathematical theory of communication and its implications for information science,? J. ASIS,28, No. 1, 19?25 (1977).
[60] M. F. Lynch and S. D. Rawson, ?Equifrequent character strings ? a novel text characterization method,? Proc. 3rd Int. Symp. on the Use of Computers in Linguistics and Linguistic Research, Cardiff 1974, Univ. of Wales Press (1976), pp. 47?58.
[61] P. B. Maggs, ?Compression of legal texts for more economical storage,? Jurimetrics J.,14, 254?261 (1974).
[62] B. M. Mandelbrot, ?On the theory of word frequencies and on related Markovian models of discourse,? AMS Symp. on Applied Math.,72, 190?219 (1960).
[63] B. A. Marron and P. A. D. de Maine, ?Automatic data compression,? Commun. ACM,10, No. 11, 711?715 (1967). · doi:10.1145/363790.363813
[64] J. Martin, Data Compaction in Computer Data Base Organization, Prentice Hall, Englewood Cliffs, N.J. (1977), pp. 572?587.
[65] A. Mayne and E. B. James, ?Information compression by factorizing common strings,? Comput. J.,17, No. 2, 157?160 (1975). · Zbl 0299.68021 · doi:10.1093/comjnl/18.2.157
[66] J. P. McCarthy, ?Automatic file compression,? Int. Computing Symp., Davos 1973, North-Holland, Amsterdam (1974), pp. 511?516.
[67] D. R. McIntyre and M. A. Pechura, ?Data compression using static Huffman code-decode tables,? Commun. ACM,28, No. 6, 612?616 (1985). · doi:10.1145/3812.3815
[68] V. S. Miller, ?Data compression algorithms,? Proc. Symp. Appl. Math.,34, 107?118 (1986). · Zbl 0595.94004 · doi:10.1090/psapm/034/846854
[69] J. E. Mulford and R. K. Riddall, ?Data compression technique for economic processing of large commercial files,? ACM Symp. on Inform. Storage and Retrieval, 1971, College Park, Md. (1971), pp. 207?215.
[70] O. Nevelainen, M. Jacobson, and R. Berg, ?Compression of clustered inverted files,? in: Math. Foundations of Computer Sci., Springer, Berlin (1978), pp. 393?402.
[71] W. R. Nugent, ?Compression word coding techniques for information retrieval,? J. Lib. Autom.,1, No. 4, 250?260 (1968).
[72] M. Pechura, ?File archival techiques using data compression,? Commun. ACM,25, No. 9, 605?609 (1982). · doi:10.1145/358628.358635
[73] T. Radhakrishnan, ?Selection of prefix and postfix word fragments for data compression,? Inform. Process. Manag.,14, No. 2, 97?106 (1978). · doi:10.1016/0306-4573(78)90067-5
[74] C. Ramamoorthy, ?Document compaction by variable length encoding,? Proc. ASIS,1, 507?513 (1964).
[75] H. K. Reghbati, ?An overview of data compression techniques,? Computer,14, 71?75 (1981). · doi:10.1109/C-M.1981.220416
[76] J. Rissanen, ?Generalized Kraft inequality and arithmetic coding,? IBM J. Res. Dev.,20, 198?203 (1976). · Zbl 0339.94009 · doi:10.1147/rd.203.0198
[77] J. Rissanen, ?Arithmetic coding as number representations,? Acta Pol. Scand. Math.,31, 44?51 (1979). · Zbl 0421.94006
[78] J. Rissanen and G. G. Langdon, ?Arithmetic coding,? IBM J. Res. Dev.,23, 149?162 (1979). · Zbl 0404.94005 · doi:10.1147/rd.232.0149
[79] J. Rissanen and G. G. Langdon, ?Universal modeling and coding,? IEEE Trans. Inform. Theory,27, 12?23 (1981). · Zbl 0456.94009 · doi:10.1109/TIT.1981.1056282
[80] M. Rodeh, V. R. Pratt, and S. Even, ?Linear algorithm for data compression via matching,? J. ACM,28, No. 1, 16?24 (1981). · Zbl 0462.94015 · doi:10.1145/322234.322237
[81] F. Rubin, ?Experiments in text file compression,? Commun. ACM,19, No. 11, 617?623 (1976). · doi:10.1145/360363.360368
[82] S. S. Ruth and P. J. Kreutzer, ?Data compression for large business files,? Datamation,18, No. 8, 62?66 (1972).
[83] J. Schalwijk, ?An algorithm for source coding,? IEEE Trans. Inform. Theory,18, 395 (1972). · Zbl 0233.94005 · doi:10.1109/TIT.1972.1054832
[84] W. D. Schieber and G. W. Thomas, ?An algorithm for compaction of alphanumeric data,? J. Lib. Autom.,4, No. 4, 198?206 (1971).
[85] E. J. Schuegraf, ?A survey of data compression methods for nonnumeric records,? Can. J. Inform. Sci.,2, No. 1, 93?105 (1976). · Zbl 0362.68026
[86] E. J. Schuegraf and H. S. Heaps, ?Selection of equifrequent word fragments for information retrieval,? Inform. Storage and Retrieval,9, No. 12, 697?711 (1973). · doi:10.1016/0020-0271(73)90011-9
[87] E. J. Schuegraf and H. S. Heaps, ?A comparison of algorithms for database compression by use of fragments as language elements,? Inform. Storage and Retrieval,10, No. 9, 309?319 (1974). · Zbl 0298.68030 · doi:10.1016/0020-0271(74)90069-2
[88] E. S. Schwanz, ?Dictionary for minimum redundancy coding,? J. ACM, No. 10, 413?439 (1963).
[89] E. S. Schwartz and A. J. Kleinbomer, ?A language element for compression coding,? Inform. Contr.,10, No. 3, 315?333 (1967). · doi:10.1016/S0019-9958(67)90314-2
[90] D. G. Severance, ?A practitioner’s guide to data base compression,? Inform. Syst.,8, No. 1, 51?62 (1983). · doi:10.1016/0306-4379(83)90030-3
[91] C. E. Shannon, ?A mathematical theory of communication,? Bell. Syst. Tech. J.,27, 379?423 (1948). · Zbl 1154.94303 · doi:10.1002/j.1538-7305.1948.tb01338.x
[92] C. E. Shannon, ?Prediction and entropy of printed English,? Bell. Syst. Tech. J.,30, 50?64 (1951). · Zbl 1165.94313 · doi:10.1002/j.1538-7305.1951.tb01366.x
[93] M. Snyderman and B. Hunt, ?The myriad virtues of text compaction,? Datamation, No. 16, 36?40 (1970).
[94] J. Storer and T. Szymanski, ?Data compression via textual substitution,? J. ACM,29, No. 4, 928?951 (1982). · Zbl 0489.68041 · doi:10.1145/322344.322346
[95] H. Tanaka and S. Kaneku, ?Data compression approach to cryptography,? Proc. 1979 Carnahan Conf. on Crime Countermeasures, Lexington, Kentucky (1979), pp. 159?163.
[96] T. C. Ting, ?Compacting homogeneous text for minimzing storage space,? Int. J. Comput. Inform. Sci.,6, No. 3, 211?221 (1977). · Zbl 0403.68032 · doi:10.1007/BF01002332
[97] R. Tropper, ?Binary-coded text, a text compression method,? BYTE,7, No. 4, 398?413 (1982).
[98] B. F. Varn, ?Optimal variable length codes,? Inform. Contr.,19, 289?301 (1971). · Zbl 0222.94012 · doi:10.1016/S0019-9958(71)90155-0
[99] R. A. Wagner, ?Common phrases and minimum text storage,? Commun. ACM,16, No. 3, 148?153 (1973). · doi:10.1145/361972.361982
[100] R. A. Wagner, ?An algorithm for extracting phrases in a space optimal fashion,? Commun. ACM,16, No. 4, 183?185 (1973). · doi:10.1145/361972.361998
[101] V. R. Walker, ?Compaction of names by X-grams,? Proc. ASIS,6, 129?135 (1969).
[102] S. F. Weiss and R. L. Vernor, ?A word-based compression technique for text files,? J. Lib. Autom.,11, No. 2, 97?105 (1978).
[103] T. Welch, ?A technique for high-performance data compression,? Computer,17, No. 6, 8?19 (1984). · doi:10.1109/MC.1984.1659158
[104] M. Wells, ?File compression using variable length encodings,? Comput. J.,15, No. 4, 308?313 (1972). · doi:10.1093/comjnl/15.4.308
[105] H. E. White, ?Printed English compression by dictionary encodings,? Proc. IEEE,55, No. 3, 390?396 (1967). · doi:10.1109/PROC.1967.5496
[106] P. W. Williams, ?Criteria for choosing subsets to obtain maximum relative entropy,? Comput. J.,21, No. 1, 57?65 (1978). · Zbl 0366.68070 · doi:10.1093/comjnl/21.1.57
[107] J. L. Wisniewski, ?Effective text compression with simultaneous bigram and trigram encoding,? J. Inform. Sci., No. 13, 159?164 (1987). · doi:10.1177/016555158701300306
[108] J. L. Wisniewski, ?Compression of index-term dictionary in an inverted-file-oriented data base: some effective algorithms,? Inform. Process. Manag., No. 6, 493?501 (1986). · doi:10.1016/0306-4573(86)90100-7
[109] J. G. Wolf, ?Recording of natural language for economy of transmission and storage,? Comput. J.,21, No. 1, 42?44 (1978). · doi:10.1093/comjnl/21.1.42
[110] J. G. Wolf, ?An algorithm for the segmentation of an artificial language dialogue,? British J. Psychol.,66, 79?90 (1975). · doi:10.1111/j.2044-8295.1975.tb01442.x
[111] D. Yavuz, ?Zipf’s law and entropy,? IEEE Trans. Inform. Theory,20, No. 5, 650?657 (1974). · Zbl 0295.94048 · doi:10.1109/TIT.1974.1055269
[112] T. Y. Young and P. S. Liu, ?Overhead storage considerations and a multilinear method for data file compression,? IEEE Trans. Software Eng.,6, No. 4, 340?347 (1980). · Zbl 0432.68015 · doi:10.1109/TSE.1980.234490
[113] G. K. Zipf, Human Behaviour and the Principle of Least Effort, Addison Wesley, Cambridge, Mass. (1949).
[114] J. Ziv and A. Lempel, ?A universal algorithm for sequential data compression,? IEEE Trans. Inform. Theory,23, No. 3, 337?343 (1977). · Zbl 0379.94010 · doi:10.1109/TIT.1977.1055714
[115] J. Ziv and A. Lempel, ?Compression of individual sequences via variable rate coding,? IEEE Trans. Inform. Theory,24, No. 5, 530?536 (1978). · Zbl 0392.94004 · doi:10.1109/TIT.1978.1055934
[116] V. A. Amel’kin, Numeration Coding Methods [in Russian], Nauka, Novosibirsk (1986).
[117] G. G. Belonogov and R. G. Kotov, Computerized Information-Retrieval Systems [in Russian], Sovet-skoe Radio, Moscow (1968).
[118] E. V. Bondar’, ?Data compression methods for production MIS database organization,? in: Methods of MIS Design and Organization [in Russian], Kiev (1985), pp. 66?70.
[119] E. E. Brizgal and L. N. Meierovich, ?Compression of patent databases,? in: Methods of MIS Design and Organization [in Russian], Kiev (1985), pp. 58?61.
[120] B. Ya. Kavalerchik and V. I. Grishkan, ?On improving computer system productivity,? USiM, No. 4, 23?27 (1986).
[121] G. S. Kolesnikov, V. I. Pakhomov, V. T. Perevertkin, and A. V. Shambazov, ?One algorithm for text compression,? in: Computer Software [in Russian], Moscow (1984), pp. 107?117.
[122] A. N. Kolmogorov, ?Three approaches to the definition of the concept of ?amount of information?,? Prob. Peredachi Inform.,1, No. 1, 11?14 (1965).
[123] E. I. Korunets, ?One approach to compression of binary data files,? in: Methods of MIS Design and Organization [in Russian], Kiev (1985), pp. 101?107.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.