Published in

Proceedings of the ACM on Management of Data, 4(1), p. 1-28, 2023

DOI: 10.1145/3626765

Links

Tools

Export citation

Search in Google Scholar

Homomorphic Compression: Making Text Processing on Compression Unlimited

This paper was not found in any repository, but could be made available legally by the author.
This paper was not found in any repository, but could be made available legally by the author.

Full text: Unavailable

Green circle
Preprint: archiving allowed
Green circle
Postprint: archiving allowed
Red circle
Published version: archiving forbidden
Data provided by SHERPA/RoMEO

Abstract

Lossless data compression is an effective way to handle the huge transmission and storage overhead of massive text data. Its utility is even more significant today when data volumes are skyrocketing. The concept of operating on compressed data infuses new blood into efficient text management by enabling mainly access-oriented text processing tasks to be done directly on compressed data without decompression. Facing limitations of the existing compressed text processing schemes such as limited types of operations supported, low efficiency, and high space occupation, we address these problems by proposing a homomorphic compression theory. It enables the generalization and characterization of algorithms with compression processing capabilities. On this basis, we develop HOCO, an efficient text data management engine that supports a variety of processing tasks on compressed text. We select three representative compression schemes and implement them combined with homomorphism in HOCO. HOCO supports the extension of homomorphic compression schemes through a modular and object-oriented design and has convenient interfaces for text processing tasks. We evaluate HOCO on six real-world datasets. The three schemes implemented in HOCO show trade-offs in terms of compression ratio, supported operation types, and efficiency. Experiments also show that HOCO can achieve higher throughput in random access and modification operations (averagely 9.18× than the state-of-the-art) and lower latency in text analytic tasks (averagely 7.16× than processing on uncompressed text) without compromising compression efficacy.