-
Exploiting New Properties of String Net Frequency for Efficient Computation
Authors:
Peaker Guo,
Patrick Eades,
Anthony Wirth,
Justin Zobel
Abstract:
Knowing which strings in a massive text are significant -- that is, which strings are common and distinct from other strings -- is valuable for several applications, including text compression and tokenization. Frequency in itself is not helpful for significance, because the commonest strings are the shortest strings. A compelling alternative is net frequency, which has the property that strings w…
▽ More
Knowing which strings in a massive text are significant -- that is, which strings are common and distinct from other strings -- is valuable for several applications, including text compression and tokenization. Frequency in itself is not helpful for significance, because the commonest strings are the shortest strings. A compelling alternative is net frequency, which has the property that strings with positive net frequency are of maximal length. However, net frequency remains relatively unexplored, and there is no prior art showing how to compute it efficiently. We first introduce a characteristic of net frequency that simplifies the original definition. With this, we study strings with positive net frequency in Fibonacci words. We then use our characteristic and solve two key problems related to net frequency. First, \textsc{single-nf}, how to compute the net frequency of a given string of length $m$, in an input text of length $n$ over an alphabet size $σ$. Second, \textsc{all-nf}, given length-$n$ input text, how to report every string of positive net frequency. Our methods leverage suffix arrays, components of the Burrows-Wheeler transform, and solution to the coloured range listing problem. We show that, for both problems, our data structure has $O(n)$ construction cost: with this structure, we solve \textsc{single-nf} in $O(m + σ)$ time and \textsc{all-nf} in $O(n)$ time. Experimentally, we find our method to be around 100 times faster than reasonable baselines for \textsc{single-nf}. For \textsc{all-nf}, our results show that, even with prior knowledge of the set of strings with positive net frequency, simply confirming that their net frequency is positive takes longer than with our purpose-designed method.
△ Less
Submitted 23 April, 2024; v1 submitted 19 April, 2024;
originally announced April 2024.
-
Scalable Methods for Calculating Term Co-Occurrence Frequencies
Authors:
Bodo Billerbeck,
Justin Zobel,
Nicholas Lester,
Nick Craswell
Abstract:
Search techniques make use of elementary information such as term frequencies and document lengths in computation of similarity weighting. They can also exploit richer statistics, in particular the number of documents in which any two terms co-occur. In this paper we propose alternative methods for computing this statistic, a challenging task because the number of distinct pairs of terms is vast -…
▽ More
Search techniques make use of elementary information such as term frequencies and document lengths in computation of similarity weighting. They can also exploit richer statistics, in particular the number of documents in which any two terms co-occur. In this paper we propose alternative methods for computing this statistic, a challenging task because the number of distinct pairs of terms is vast -- around 100,000 in a typical 1000-word news article, for example. In contrast, we do not employ approximation algorithms, as we want to be able to find exact counts. We explore their efficiency, finding that a naïve approach based on a dictionary is indeed very slow, while methods based on a combination of inverted indexes and linear scanning provide both massive speed-ups and better observed asymptotic behaviour. Our careful implementation shows that, with our novel list-pairs approach it is possible to process over several hundred thousand documents per hour.
△ Less
Submitted 16 July, 2020;
originally announced July 2020.
-
Reference Sequence Construction for Relative Compression of Genomes
Authors:
Shanika Kuruppu,
Simon Puglisi,
Justin Zobel
Abstract:
Relative compression, where a set of similar strings are compressed with respect to a reference string, is a very effective method of compressing DNA datasets containing multiple similar sequences. Relative compression is fast to perform and also supports rapid random access to the underlying data. The main difficulty of relative compression is in selecting an appropriate reference sequence. In th…
▽ More
Relative compression, where a set of similar strings are compressed with respect to a reference string, is a very effective method of compressing DNA datasets containing multiple similar sequences. Relative compression is fast to perform and also supports rapid random access to the underlying data. The main difficulty of relative compression is in selecting an appropriate reference sequence. In this paper, we explore using the dictionary of repeats generated by Comrad, Re-pair and Dna-x algorithms as reference sequences for relative compression. We show this technique allows better compression and supports random access just as well. The technique also allows more general repetitive datasets to be compressed using relative compression.
△ Less
Submitted 19 June, 2011;
originally announced June 2011.
-
Relative Lempel-Ziv Factorization for Efficient Storage and Retrieval of Web Collections
Authors:
Christopher Hoobin,
Simon J. Puglisi,
Justin Zobel
Abstract:
Compression techniques that support fast random access are a core component of any information system. Current state-of-the-art methods group documents into fixed-sized blocks and compress each block with a general-purpose adaptive algorithm such as GZIP. Random access to a specific document then requires decompression of a block. The choice of block size is critical: it trades between compression…
▽ More
Compression techniques that support fast random access are a core component of any information system. Current state-of-the-art methods group documents into fixed-sized blocks and compress each block with a general-purpose adaptive algorithm such as GZIP. Random access to a specific document then requires decompression of a block. The choice of block size is critical: it trades between compression effectiveness and document retrieval times. In this paper we present a scalable compression method for large document collections that allows fast random access. We build a representative sample of the collection and use it as a dictionary in a LZ77-like encoding of the rest of the collection, relative to the dictionary. We demonstrate on large collections, that using a dictionary as small as 0.1% of the collection size, our algorithm is dramatically faster than previous methods, and in general gives much better compression.
△ Less
Submitted 8 December, 2011; v1 submitted 13 June, 2011;
originally announced June 2011.