subscribe to arXiv mailings

Exploiting New Properties of String Net Frequency for Efficient Computation

Authors: Peaker Guo, Patrick Eades, Anthony Wirth, Justin Zobel

Abstract: Knowing which strings in a massive text are significant -- that is, which strings are common and distinct from other strings -- is valuable for several applications, including text compression and tokenization. Frequency in itself is not helpful for significance, because the commonest strings are the shortest strings. A compelling alternative is net frequency, which has the property that strings w… ▽ More Knowing which strings in a massive text are significant -- that is, which strings are common and distinct from other strings -- is valuable for several applications, including text compression and tokenization. Frequency in itself is not helpful for significance, because the commonest strings are the shortest strings. A compelling alternative is net frequency, which has the property that strings with positive net frequency are of maximal length. However, net frequency remains relatively unexplored, and there is no prior art showing how to compute it efficiently. We first introduce a characteristic of net frequency that simplifies the original definition. With this, we study strings with positive net frequency in Fibonacci words. We then use our characteristic and solve two key problems related to net frequency. First, \textsc{single-nf}, how to compute the net frequency of a given string of length $m$, in an input text of length $n$ over an alphabet size $σ$. Second, \textsc{all-nf}, given length-$n$ input text, how to report every string of positive net frequency. Our methods leverage suffix arrays, components of the Burrows-Wheeler transform, and solution to the coloured range listing problem. We show that, for both problems, our data structure has $O(n)$ construction cost: with this structure, we solve \textsc{single-nf} in $O(m + σ)$ time and \textsc{all-nf} in $O(n)$ time. Experimentally, we find our method to be around 100 times faster than reasonable baselines for \textsc{single-nf}. For \textsc{all-nf}, our results show that, even with prior knowledge of the set of strings with positive net frequency, simply confirming that their net frequency is positive takes longer than with our purpose-designed method. △ Less

Submitted 23 April, 2024; v1 submitted 19 April, 2024; originally announced April 2024.

Comments: Full version of a paper to be published at the 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024)

arXiv:2007.08709 [pdf, other]

Scalable Methods for Calculating Term Co-Occurrence Frequencies

Authors: Bodo Billerbeck, Justin Zobel, Nicholas Lester, Nick Craswell

Abstract: Search techniques make use of elementary information such as term frequencies and document lengths in computation of similarity weighting. They can also exploit richer statistics, in particular the number of documents in which any two terms co-occur. In this paper we propose alternative methods for computing this statistic, a challenging task because the number of distinct pairs of terms is vast -… ▽ More Search techniques make use of elementary information such as term frequencies and document lengths in computation of similarity weighting. They can also exploit richer statistics, in particular the number of documents in which any two terms co-occur. In this paper we propose alternative methods for computing this statistic, a challenging task because the number of distinct pairs of terms is vast -- around 100,000 in a typical 1000-word news article, for example. In contrast, we do not employ approximation algorithms, as we want to be able to find exact counts. We explore their efficiency, finding that a naïve approach based on a dictionary is indeed very slow, while methods based on a combination of inverted indexes and linear scanning provide both massive speed-ups and better observed asymptotic behaviour. Our careful implementation shows that, with our novel list-pairs approach it is possible to process over several hundred thousand documents per hour. △ Less

Submitted 16 July, 2020; originally announced July 2020.

Comments: 5 pages, 1 table, 2 figures

arXiv:1106.3791 [pdf, other]

Reference Sequence Construction for Relative Compression of Genomes

Authors: Shanika Kuruppu, Simon Puglisi, Justin Zobel

Abstract: Relative compression, where a set of similar strings are compressed with respect to a reference string, is a very effective method of compressing DNA datasets containing multiple similar sequences. Relative compression is fast to perform and also supports rapid random access to the underlying data. The main difficulty of relative compression is in selecting an appropriate reference sequence. In th… ▽ More Relative compression, where a set of similar strings are compressed with respect to a reference string, is a very effective method of compressing DNA datasets containing multiple similar sequences. Relative compression is fast to perform and also supports rapid random access to the underlying data. The main difficulty of relative compression is in selecting an appropriate reference sequence. In this paper, we explore using the dictionary of repeats generated by Comrad, Re-pair and Dna-x algorithms as reference sequences for relative compression. We show this technique allows better compression and supports random access just as well. The technique also allows more general repetitive datasets to be compressed using relative compression. △ Less

Submitted 19 June, 2011; originally announced June 2011.

Comments: 12 pages, 2 figures, to appear in the Proceedings of SPIRE2011 as a short paper

arXiv:1106.2587 [pdf, ps, other]

Relative Lempel-Ziv Factorization for Efficient Storage and Retrieval of Web Collections

Authors: Christopher Hoobin, Simon J. Puglisi, Justin Zobel

Abstract: Compression techniques that support fast random access are a core component of any information system. Current state-of-the-art methods group documents into fixed-sized blocks and compress each block with a general-purpose adaptive algorithm such as GZIP. Random access to a specific document then requires decompression of a block. The choice of block size is critical: it trades between compression… ▽ More Compression techniques that support fast random access are a core component of any information system. Current state-of-the-art methods group documents into fixed-sized blocks and compress each block with a general-purpose adaptive algorithm such as GZIP. Random access to a specific document then requires decompression of a block. The choice of block size is critical: it trades between compression effectiveness and document retrieval times. In this paper we present a scalable compression method for large document collections that allows fast random access. We build a representative sample of the collection and use it as a dictionary in a LZ77-like encoding of the rest of the collection, relative to the dictionary. We demonstrate on large collections, that using a dictionary as small as 0.1% of the collection size, our algorithm is dramatically faster than previous methods, and in general gives much better compression. △ Less

Submitted 8 December, 2011; v1 submitted 13 June, 2011; originally announced June 2011.

Comments: VLDB2012

Report number: vol5no3/p265_christopherhoobin_vldb2012

Journal ref: Proceedings of the VLDB Endowment (PVLDB), Vol. 5, No. 3, pp. 265-273 (2011)

Showing 1–4 of 4 results for author: Zobel, J