Showing 1–4 of 4 results for author: Zobel, J

  1. arXiv:2404.12701  [pdf, other]

    cs.DS

    Exploiting New Properties of String Net Frequency for Efficient Computation

    Authors: Peaker Guo, Patrick Eades, Anthony Wirth, Justin Zobel

    Abstract: Knowing which strings in a massive text are significant -- that is, which strings are common and distinct from other strings -- is valuable for several applications, including text compression and tokenization. Frequency in itself is not helpful for significance, because the commonest strings are the shortest strings. A compelling alternative is net frequency, which has the property that strings w…

    Submitted 23 April, 2024; v1 submitted 19 April, 2024; originally announced April 2024.

    Comments: Full version of a paper to be published at the 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024)

  2. arXiv:2007.08709  [pdf, other]

    cs.IR

    Scalable Methods for Calculating Term Co-Occurrence Frequencies

    Authors: Bodo Billerbeck, Justin Zobel, Nicholas Lester, Nick Craswell

    Abstract: Search techniques make use of elementary information such as term frequencies and document lengths in computation of similarity weighting. They can also exploit richer statistics, in particular the number of documents in which any two terms co-occur. In this paper we propose alternative methods for computing this statistic, a challenging task because the number of distinct pairs of terms is vast -…

    Submitted 16 July, 2020; originally announced July 2020.

    Comments: 5 pages, 1 table, 2 figures

  3. arXiv:1106.3791  [pdf, other]

    q-bio.QM cs.CE cs.IT

    Reference Sequence Construction for Relative Compression of Genomes

    Authors: Shanika Kuruppu, Simon Puglisi, Justin Zobel

    Abstract: Relative compression, where a set of similar strings are compressed with respect to a reference string, is a very effective method of compressing DNA datasets containing multiple similar sequences. Relative compression is fast to perform and also supports rapid random access to the underlying data. The main difficulty of relative compression is in selecting an appropriate reference sequence. In th…

    Submitted 19 June, 2011; originally announced June 2011.

    Comments: 12 pages, 2 figures, to appear in the Proceedings of SPIRE2011 as a short paper

  4. arXiv:1106.2587  [pdf, ps, other]

    cs.DS cs.DB cs.IR

    Relative Lempel-Ziv Factorization for Efficient Storage and Retrieval of Web Collections

    Authors: Christopher Hoobin, Simon J. Puglisi, Justin Zobel

    Abstract: Compression techniques that support fast random access are a core component of any information system. Current state-of-the-art methods group documents into fixed-sized blocks and compress each block with a general-purpose adaptive algorithm such as GZIP. Random access to a specific document then requires decompression of a block. The choice of block size is critical: it trades between compression…

    Submitted 8 December, 2011; v1 submitted 13 June, 2011; originally announced June 2011.

    Comments: VLDB2012

    Report number: vol5no3/p265_christopherhoobin_vldb2012

    Journal ref: Proceedings of the VLDB Endowment (PVLDB), Vol. 5, No. 3, pp. 265-273 (2011)