subscribe to arXiv mailings

Repairing Reed-Solomon Codes with Side Information

Authors: Thi Xinh Dinh, Ba Thong Le, Son Hoang Dau, Serdar Boztas, Stanislav Kruglik, Han Mao Kiah, Emanuele Viterbo, Tuvi Etzion, Yeow Meng Chee

Abstract: We generalize the problem of recovering a lost/erased symbol in a Reed-Solomon code to the scenario in which some side information about the lost symbol is known. The side information is represented as a set $S$ of linearly independent combinations of the sub-symbols of the lost symbol. When $S = \varnothing$, this reduces to the standard problem of repairing a single codeword symbol. When $S$ is… ▽ More We generalize the problem of recovering a lost/erased symbol in a Reed-Solomon code to the scenario in which some side information about the lost symbol is known. The side information is represented as a set $S$ of linearly independent combinations of the sub-symbols of the lost symbol. When $S = \varnothing$, this reduces to the standard problem of repairing a single codeword symbol. When $S$ is a set of sub-symbols of the erased one, this becomes the repair problem with partially lost/erased symbol. We first establish that the minimum repair bandwidth depends on $|S|$ and not the content of $S$ and construct a lower bound on the repair bandwidth of a linear repair scheme with side information $S$. We then consider the well-known subspace-polynomial repair schemes and show that their repair bandwidths can be optimized by choosing the right subspaces. Finally, we demonstrate several parameter regimes where the optimal bandwidths can be achieved for full-length Reed-Solomon codes. △ Less

Submitted 12 May, 2024; originally announced May 2024.

MSC Class: 94B05; 94B60 ACM Class: E.4

arXiv:2405.06870 [pdf, other]

Noise-Tolerant Codebooks for Semi-Quantitative Group Testing: Application to Spatial Genomics

Authors: Kok Hao Chen, Duc Tu Dao, Han Mao Kiah, Van Long Phuoc Pham, Eitan Yaakobi

Abstract: Motivated by applications in spatial genomics, we revisit group testing (Dorfman~1943) and propose the class of $λ$-{\sf ADD}-codes, studying such codes with certain distance $d$ and codelength $n$. When $d$ is constant, we provide explicit code constructions with rates close to $1/2$. When $d$ is proportional to $n$, we provide a GV-type lower bound whose rates are efficiently computable. Upper b… ▽ More Motivated by applications in spatial genomics, we revisit group testing (Dorfman~1943) and propose the class of $λ$-{\sf ADD}-codes, studying such codes with certain distance $d$ and codelength $n$. When $d$ is constant, we provide explicit code constructions with rates close to $1/2$. When $d$ is proportional to $n$, we provide a GV-type lower bound whose rates are efficiently computable. Upper bounds for such codes are also studied. △ Less

Submitted 10 May, 2024; originally announced May 2024.

Comments: To appear in ISIT 2024 Proceedings

arXiv:2405.06583 [pdf, other]

Private Repair of a Single Erasure in Reed-Solomon Codes

Authors: Stanislav Kruglik, Han Mao Kiah, Son Hoang Dau, Eitan Yaakobi

Abstract: We investigate the problem of privately recovering a single erasure for Reed-Solomon codes with low communication bandwidths. For an $[n,k]_{q^\ell}$ code with $n-k\geq q^{m}+t-1$, we construct a repair scheme that allows a client to recover an arbitrary codeword symbol without leaking its index to any set of $t$ colluding helper nodes at a repair bandwidth of $(n-1)(\ell-m)$ sub-symbols in… ▽ More We investigate the problem of privately recovering a single erasure for Reed-Solomon codes with low communication bandwidths. For an $[n,k]_{q^\ell}$ code with $n-k\geq q^{m}+t-1$, we construct a repair scheme that allows a client to recover an arbitrary codeword symbol without leaking its index to any set of $t$ colluding helper nodes at a repair bandwidth of $(n-1)(\ell-m)$ sub-symbols in $\mathbb{F}_q$. When $t=1$, this reduces to the bandwidth of existing repair schemes based on subspace polynomials. We prove the optimality of the proposed scheme when $n=q^\ell$ under a reasonable assumption about the schemes being used. Our private repair scheme can also be transformed into a private retrieval scheme for data encoded by Reed-Solomon codes. △ Less

Submitted 10 May, 2024; originally announced May 2024.

Comments: Full version of the paper accepted for the 2024 IEEE International Symposium on Information Theory (ISIT)

arXiv:2405.03614 [pdf, ps, other]

Repairing with Zero Skip Cost

Authors: Wenqin Zhang, Yeow Meng Chee, Son Hoang Dau, Tuvi Etzion, Han Mao Kiah, Yuan Luo

Abstract: To measure repair latency at helper nodes, we introduce a new metric called skip cost that quantifies the number of contiguous sections accessed on a disk. We provide explicit constructions of zigzag codes and fractional repetition codes that incur zero skip cost To measure repair latency at helper nodes, we introduce a new metric called skip cost that quantifies the number of contiguous sections accessed on a disk. We provide explicit constructions of zigzag codes and fractional repetition codes that incur zero skip cost △ Less

Submitted 6 May, 2024; originally announced May 2024.

arXiv:2405.02080 [pdf, ps, other]

Coding for Synthesis Defects

Authors: Ziyang Lu, Han Mao Kiah, Yiwei Zhang, Robert N. Grass, Eitan Yaakobi

Abstract: Motivated by DNA based data storage system, we investigate the errors that occur when synthesizing DNA strands in parallel, where each strand is appended one nucleotide at a time by the machine according to a template supersequence. If there is a cycle such that the machine fails, then the strands meant to be appended at this cycle will not be appended, and we refer to this as a synthesis defect.… ▽ More Motivated by DNA based data storage system, we investigate the errors that occur when synthesizing DNA strands in parallel, where each strand is appended one nucleotide at a time by the machine according to a template supersequence. If there is a cycle such that the machine fails, then the strands meant to be appended at this cycle will not be appended, and we refer to this as a synthesis defect. In this paper, we present two families of codes correcting synthesis defects, which are t-known-synthesis-defect correcting codes and t-synthesis-defect correcting codes. For the first one, it is assumed that the defective cycles are known, and each of the codeword is a quaternary sequence. We provide constructions for this family of codes for t = 1, 2, with redundancy log 4 and log n+18 log 3, respectively. For the second one, the codeword is a set of M ordered sequences, and we give constructions for t = 1, 2 to show a strategy for constructing this family of codes. Finally, we derive a lower bound on the redundancy for single-known-synthesis-defect correcting codes, which assures that our construction is almost optimal. △ Less

Submitted 3 May, 2024; originally announced May 2024.

arXiv:2403.20170 [pdf, ps, other]

Recovery Sets of Subspaces from a Simplex Code

Authors: Yeow Meng Chee, Tuvi Etzion, Han Mao Kiah, Hui Zhang

Abstract: Recovery sets for vectors and subspaces are important in the construction of distributed storage system codes. These concepts are also interesting in their own right. In this paper, we consider the following very basic recovery question: what is the maximum number of possible pairwise disjoint recovery sets for each recovered element? The recovered elements in this work are d-dimensional subspaces… ▽ More Recovery sets for vectors and subspaces are important in the construction of distributed storage system codes. These concepts are also interesting in their own right. In this paper, we consider the following very basic recovery question: what is the maximum number of possible pairwise disjoint recovery sets for each recovered element? The recovered elements in this work are d-dimensional subspaces of a $k$-dimensional vector space over GF(q). Each server stores one representative for each distinct one-dimensional subspace of the k-dimensional vector space, or equivalently a distinct point of PG(k-1,q). As column vectors, the associated vectors of the stored one-dimensional subspaces form the generator matrix of the $[(q^k -1)/(q-1),k,q^{k-1}]$ simplex code over GF(q). Lower bounds and upper bounds on the maximum number of such recovery sets are provided. It is shown that generally, these bounds are either tight or very close to being tight. △ Less

Submitted 29 March, 2024; originally announced March 2024.

arXiv:2403.15827 [pdf, ps, other]

Permutation Recovery Problem against Deletion Errors for DNA Data Storage

Authors: Shubhransh Singhvi, Charchit Gupta, Avital Boruchovsky, Yuval Goldberg, Han Mao Kiah, Eitan Yaakobi

Abstract: Owing to its immense storage density and durability, DNA has emerged as a promising storage medium. However, due to technological constraints, data can only be written onto many short DNA molecules called data blocks that are stored in an unordered way. To handle the unordered nature of DNA data storage systems, a unique address is typically prepended to each data block to form a DNA strand. Howev… ▽ More Owing to its immense storage density and durability, DNA has emerged as a promising storage medium. However, due to technological constraints, data can only be written onto many short DNA molecules called data blocks that are stored in an unordered way. To handle the unordered nature of DNA data storage systems, a unique address is typically prepended to each data block to form a DNA strand. However, DNA storage systems are prone to errors and generate multiple noisy copies of each strand called DNA reads. Thus, we study the permutation recovery problem against deletions errors for DNA data storage. The permutation recovery problem for DNA data storage requires one to reconstruct the addresses or in other words to uniquely identify the noisy reads. By successfully reconstructing the addresses, one can essentially determine the correct order of the data blocks, effectively solving the clustering problem. We first show that we can almost surely identify all the noisy reads under certain mild assumptions. We then propose a permutation recovery procedure and analyze its complexity. △ Less

Submitted 23 March, 2024; originally announced March 2024.

Comments: arXiv admin note: substantial text overlap with arXiv:2305.04597

arXiv:2403.07754 [pdf, other]

An Optimal Sequence Reconstruction Algorithm for Reed-Solomon Codes

Authors: Shubhransh Singhvi, Roni Con, Han Mao Kiah, Eitan Yaakobi

Abstract: The sequence reconstruction problem, introduced by Levenshtein in 2001, considers a scenario where the sender transmits a codeword from some codebook, and the receiver obtains $N$ noisy outputs of the codeword. We study the problem of efficient reconstruction using $N$ outputs that are each corrupted by at most $t$ substitutions. Specifically, for the ubiquitous Reed-Solomon codes, we adapt the Ko… ▽ More The sequence reconstruction problem, introduced by Levenshtein in 2001, considers a scenario where the sender transmits a codeword from some codebook, and the receiver obtains $N$ noisy outputs of the codeword. We study the problem of efficient reconstruction using $N$ outputs that are each corrupted by at most $t$ substitutions. Specifically, for the ubiquitous Reed-Solomon codes, we adapt the Koetter-Vardy soft-decoding algorithm, presenting a reconstruction algorithm capable of correcting beyond Johnson radius. Furthermore, the algorithm uses $\mathcal{O}(nN)$ field operations, where $n$ is the codeword length. △ Less

Submitted 12 March, 2024; originally announced March 2024.

Comments: Submitted to IEEE ISIT 2024

arXiv:2402.18869 [pdf, other]

Evaluating the Gilbert-Varshamov Bound for Constrained Systems

Authors: Keshav Goyal, Han Mao Kiah

Abstract: We revisit the well-known Gilbert-Varshamov (GV) bound for constrained systems. In 1991, Kolesnik and Krachkovsky showed that GV bound can be determined via the solution of some optimization problem. Later, Marcus and Roth (1992) modified the optimization problem and improved the GV bound in many instances. In this work, we provide explicit numerical procedures to solve these two optimization prob… ▽ More We revisit the well-known Gilbert-Varshamov (GV) bound for constrained systems. In 1991, Kolesnik and Krachkovsky showed that GV bound can be determined via the solution of some optimization problem. Later, Marcus and Roth (1992) modified the optimization problem and improved the GV bound in many instances. In this work, we provide explicit numerical procedures to solve these two optimization problems and hence, compute the bounds. We then show the procedures can be further simplified when we plot the respective curves. In the case where the graph presentation comprise a single state, we provide explicit formulas for both bounds. △ Less

Submitted 29 February, 2024; originally announced February 2024.

Comments: 27 Pages, 5 figures, submitted to Entropy

arXiv:2402.14712 [pdf, other]

Gilbert-Varshamov Bound for Codes in $L_1$ Metric using Multivariate Analytic Combinatorics

Authors: Keshav Goyal, Duc Tu Dao, Mladen Kovačević, Han Mao Kiah

Abstract: Analytic combinatorics in several variables refers to a suite of tools that provide sharp asymptotic estimates for certain combinatorial quantities. In this paper, we apply these tools to determine the Gilbert--Varshamov lower bound on the rate of optimal codes in $L_1$ metric. Several different code spaces are analyzed, including the simplex and the hypercube in $\mathbb{Z^n}$, all of which are i… ▽ More Analytic combinatorics in several variables refers to a suite of tools that provide sharp asymptotic estimates for certain combinatorial quantities. In this paper, we apply these tools to determine the Gilbert--Varshamov lower bound on the rate of optimal codes in $L_1$ metric. Several different code spaces are analyzed, including the simplex and the hypercube in $\mathbb{Z^n}$, all of which are inspired by concrete data storage and transmission models such as the sticky insertion channel, the permutation channel, the adjacent transposition (bit-shift) channel, the multilevel flash memory channel, etc. △ Less

Submitted 22 February, 2024; originally announced February 2024.

Comments: 33 pages, 3 figures, submitted to IEEE Transactions on Information Theory

arXiv:2305.07274 [pdf, other]

Deletion Correcting Codes for Efficient DNA Synthesis

Authors: Johan Chrisnata, Han Mao Kiah, Van Long Phuoc Pham

Abstract: The synthesis of DNA strands remains the most costly part of the DNA storage system. Thus, to make DNA storage system more practical, the time and materials used in the synthesis process have to be optimized. We consider the most common type of synthesis process where multiple DNA strands are synthesized in parallel from a common alternating supersequence, one nucleotide at a time. The synthesis t… ▽ More The synthesis of DNA strands remains the most costly part of the DNA storage system. Thus, to make DNA storage system more practical, the time and materials used in the synthesis process have to be optimized. We consider the most common type of synthesis process where multiple DNA strands are synthesized in parallel from a common alternating supersequence, one nucleotide at a time. The synthesis time or the number of synthesis cycles is then determined by the length of this common supersequence. In this model, we design quaternary codes that minimizes synthesis time that can correct deletions or insertions, which are the most prevalent types of error in array-based synthesis. We also propose polynomial-time algorithms that encode binary strings into these codes and show that the rate is close to capacity. △ Less

Submitted 12 May, 2023; originally announced May 2023.

Comments: A shorter version of this paper will be presented in in ISIT 2023

arXiv:2305.04597 [pdf, other]

Data-Driven Bee Identification for DNA Strands

Authors: Shubhransh Singhvi, Avital Boruchovsky, Han Mao Kiah, Eitan Yaakobi

Abstract: We study a data-driven approach to the bee identification problem for DNA strands. The bee-identification problem, introduced by Tandon et al. (2019), requires one to identify $M$ bees, each tagged by a unique barcode, via a set of $M$ noisy measurements. Later, Chrisnata et al. (2022) extended the model to case where one observes $N$ noisy measurements of each bee, and applied the model to addres… ▽ More We study a data-driven approach to the bee identification problem for DNA strands. The bee-identification problem, introduced by Tandon et al. (2019), requires one to identify $M$ bees, each tagged by a unique barcode, via a set of $M$ noisy measurements. Later, Chrisnata et al. (2022) extended the model to case where one observes $N$ noisy measurements of each bee, and applied the model to address the unordered nature of DNA storage systems. In such systems, a unique address is typically prepended to each DNA data block to form a DNA strand, but the address may possibly be corrupted. While clustering is usually used to identify the address of a DNA strand, this requires $\mathcal{M}^2$ data comparisons (when $\mathcal{M}$ is the number of reads). In contrast, the approach of Chrisnata et al. (2022) avoids data comparisons completely. In this work, we study an intermediate, data-driven approach to this identification task. For the binary erasure channel, we first show that we can almost surely correctly identify all DNA strands under certain mild assumptions. Then we propose a data-driven pruning procedure and demonstrate that on average the procedure uses only a fraction of $\mathcal{M}^2$ data comparisons. Specifically, for $\mathcal{M}= 2^n$ and erasure probability $p$, the expected number of data comparisons performed by the procedure is $κ\mathcal{M}^2$, where $\left(\frac{1+2p-p^2}{2}\right)^n \leq κ\leq \left(\frac{1+p}{2}\right)^n $. △ Less

Submitted 8 May, 2023; originally announced May 2023.

Comments: Conference paper accepted at ISIT 2023

arXiv:2305.03442 [pdf, other]

Repair of Reed-Solomon Codes in the Presence of Erroneous Nodes

Authors: Stanislav Kruglik, Gaojun Luo, Wilton Kim, Shubhransh Singhvi, Han Mao Kiah, San Ling, Huaxiong Wang

Abstract: We consider the repair scheme of Guruswami-Wootters for the Reed-Solomon code and ask: can we correctly repair a failed node in the presence of erroneous nodes? Equivalently, we consider the collection of downloaded traces as a code and investigate its code-distance properties. We propose three lower bounds on its minimum distance and study methods to efficiently correct errors close to these boun… ▽ More We consider the repair scheme of Guruswami-Wootters for the Reed-Solomon code and ask: can we correctly repair a failed node in the presence of erroneous nodes? Equivalently, we consider the collection of downloaded traces as a code and investigate its code-distance properties. We propose three lower bounds on its minimum distance and study methods to efficiently correct errors close to these bounds. △ Less

Submitted 5 May, 2023; originally announced May 2023.

Comments: Accepted to IEEE International Symposium on Information Theory 2023

arXiv:2302.13714 [pdf, other]

On the Design of Codes for DNA Computing: Secondary Structure Avoidance Codes

Authors: Tuan Thanh Nguyen, Kui Cai, Han Mao Kiah, Duc Tu Dao, Kees A. Schouhamer Immink

Abstract: In this work, we investigate a challenging problem, which has been considered to be an important criterion in designing codewords for DNA computing purposes, namely secondary structure avoidance in single-stranded DNA molecules. In short, secondary structure refers to the tendency of a single-stranded DNA sequence to fold back upon itself, thus becoming inactive in the computation process. While s… ▽ More In this work, we investigate a challenging problem, which has been considered to be an important criterion in designing codewords for DNA computing purposes, namely secondary structure avoidance in single-stranded DNA molecules. In short, secondary structure refers to the tendency of a single-stranded DNA sequence to fold back upon itself, thus becoming inactive in the computation process. While some design criteria that reduces the possibility of secondary structure formation has been proposed by Milenkovic and Kashyap (2006), the main contribution of this work is to provide an explicit construction of DNA codes that completely avoid secondary structure of arbitrary stem length. Formally, given codeword length n and arbitrary integer m>=2, we provide efficient methods to construct DNA codes of length n that avoid secondary structure of any stem length more than or equal to m. Particularly, when m = 3, our constructions yield a family of DNA codes of rate 1.3031 bits/nt, while the highest rate found in the prior art was 1.1609 bits/nt. In addition, for m>=3log n + 4, we provide an efficient encoder that incurs only one redundant symbol. △ Less

Submitted 27 February, 2023; originally announced February 2023.

arXiv:2302.02230 [pdf, ps, other]

$k$-server Byzantine-Resistant PIR Scheme with Optimal Download Rate and Optimal File Size

Authors: Stanislav Kruglik, Son Hoang Dau, Han Mao Kiah, Huaxiong Wang

Abstract: We consider the problem of designing a Private Information Retrieval (PIR) scheme on $m$ files replicated on $k$ servers that can collude or, even worse, can return incorrect answers. Our goal is to correctly retrieve a specific message while keeping its identity private from the database servers. We consider the asymptotic information-theoretic capacity of this problem defined as the maximum rati… ▽ More We consider the problem of designing a Private Information Retrieval (PIR) scheme on $m$ files replicated on $k$ servers that can collude or, even worse, can return incorrect answers. Our goal is to correctly retrieve a specific message while keeping its identity private from the database servers. We consider the asymptotic information-theoretic capacity of this problem defined as the maximum ratio of the number of correctly retrieved symbols to the downloaded one for a large enough number of stored files. We propose an achievable scheme with a small file size and prove that such a file size is minimal for the fixed number of retrieved symbols, solving the problem pointed out by Banawan and Ulukus. △ Less

Submitted 5 May, 2023; v1 submitted 4 February, 2023; originally announced February 2023.

Comments: Accepted to IEEE International Symposium on Information Theory 2023

arXiv:2302.01733 [pdf, other]

Committed Private Information Retrieval

Authors: Quang Cao, Hong Yen Tran, Son Hoang Dau, Xun Yi, Emanuele Viterbo, Chen Feng, Yu-Chih Huang, Jingge Zhu, Stanislav Kruglik, Han Mao Kiah

Abstract: A private information retrieval (PIR) scheme allows a client to retrieve a data item $x_i$ among $n$ items $x_1,x_2,\ldots,x_n$ from $k$ servers, without revealing what $i$ is even when $t < k$ servers collude and try to learn $i$. Such a PIR scheme is said to be $t$-private. A PIR scheme is $v$-verifiable if the client can verify the correctness of the retrieved $x_i$ even when $v \leq k$ servers… ▽ More A private information retrieval (PIR) scheme allows a client to retrieve a data item $x_i$ among $n$ items $x_1,x_2,\ldots,x_n$ from $k$ servers, without revealing what $i$ is even when $t < k$ servers collude and try to learn $i$. Such a PIR scheme is said to be $t$-private. A PIR scheme is $v$-verifiable if the client can verify the correctness of the retrieved $x_i$ even when $v \leq k$ servers collude and try to fool the client by sending manipulated data. Most of the previous works in the literature on PIR assumed that $v < k$, leaving the case of all-colluding servers open. We propose a generic construction that combines a linear map commitment (LMC) and an arbitrary linear PIR scheme to produce a $k$-verifiable PIR scheme, termed a committed PIR scheme. Such a scheme guarantees that even in the worst scenario, when all servers are under the control of an attacker, although the privacy is unavoidably lost, the client won't be fooled into accepting an incorrect $x_i$. We demonstrate the practicality of our proposal by implementing the committed PIR schemes based on the Lai-Malavolta LMC and three well-known PIR schemes using the GMP library and blst, the current fastest C library for elliptic curve pairings. △ Less

Submitted 25 September, 2023; v1 submitted 3 February, 2023; originally announced February 2023.

Comments: Accepted at ESORICS 2023

arXiv:2301.11730 [pdf, ps, other]

Two-Server Private Information Retrieval with Optimized Download Rate and Result Verification

Authors: Stanislav Kruglik, Son Hoang Dau, Han Mao Kiah, Huaxiong Wang

Abstract: Private Information Retrieval (PIR) schemes allow a client to retrieve any file of interest, while hiding the file identity from the database servers. In contrast to most existing PIR schemes that assume honest-but-curious servers, we study the case of dishonest servers. The latter provide incorrect answers and try to persuade the client to output the wrong result. We introduce several PIR schemes… ▽ More Private Information Retrieval (PIR) schemes allow a client to retrieve any file of interest, while hiding the file identity from the database servers. In contrast to most existing PIR schemes that assume honest-but-curious servers, we study the case of dishonest servers. The latter provide incorrect answers and try to persuade the client to output the wrong result. We introduce several PIR schemes with information-theoretic privacy and result verification for the case of two servers. Security guarantees can be information-theoretical or computational, and the verification keys can be public or private. In this work, our main performance metric is the download rate. △ Less

Submitted 8 June, 2023; v1 submitted 27 January, 2023; originally announced January 2023.

Comments: Accepted to IEEE International Symposium on Information Theory 2023

arXiv:2212.09952 [pdf, other]

Efficient Algorithms for the Bee-Identification Problem

Authors: Han Mao Kiah, Alexander Vardy, Hanwen Yao

Abstract: The bee-identification problem, formally defined by Tandon, Tan and Varshney (2019), requires the receiver to identify "bees" using a set of unordered noisy measurements. In this previous work, Tandon, Tan, and Varshney studied error exponents and showed that decoding the measurements jointly results in a significantly smaller error exponent. In this work, we study algorithms related to this joi… ▽ More The bee-identification problem, formally defined by Tandon, Tan and Varshney (2019), requires the receiver to identify "bees" using a set of unordered noisy measurements. In this previous work, Tandon, Tan, and Varshney studied error exponents and showed that decoding the measurements jointly results in a significantly smaller error exponent. In this work, we study algorithms related to this joint decoder. First, we demonstrate how to perform joint decoding efficiently. By reducing to the problem of finding perfect matching and minimum-cost matchings, we obtain joint decoders that run in time quadratic and cubic in the number of "bees" for the binary erasure (BEC) and binary symmetric channels (BSC), respectively. Next, by studying the matching algorithms in the context of channel coding, we further reduce the running times by using classical tools like peeling decoders and list-decoders. In particular, we show that our identifier algorithms when used with Reed-Muller codes terminate in almost linear and quadratic time for BEC and BSC, respectively. Finally, for explicit codebooks, we study when these joint decoders fail to identify the "bees" correctly. Specifically, we provide practical methods of estimating the probability of erroneous identification for given codebooks. △ Less

Submitted 19 December, 2022; originally announced December 2022.

MSC Class: 94B99

arXiv:2212.07091 [pdf, other]

Verifiable Coded Computation of Multiple Functions

Authors: Wilton Kim, Stanislav Kruglik, Han Mao Kiah

Abstract: We consider the problem of evaluating distinct multivariate polynomials over several massive datasets in a distributed computing system with a single master node and multiple worker nodes. We focus on the general case when each multivariate polynomial is evaluated over its corresponding dataset and propose a generalization of the Lagrange Coded Computing framework (Yu et al. 2019) to perform all c… ▽ More We consider the problem of evaluating distinct multivariate polynomials over several massive datasets in a distributed computing system with a single master node and multiple worker nodes. We focus on the general case when each multivariate polynomial is evaluated over its corresponding dataset and propose a generalization of the Lagrange Coded Computing framework (Yu et al. 2019) to perform all computations simultaneously while providing robustness against stragglers who do not respond in time, adversarial workers who respond with wrong computation and information-theoretic security of dataset against colluding workers. Our scheme introduces a small computation overhead which results in a reduction in download cost and also offers comparable resistance to stragglers over existing solutions. On top of it, we also propose two verification schemes to detect the presence of adversaries, which leads to incorrect results, without involving additional nodes. △ Less

Submitted 22 August, 2023; v1 submitted 14 December, 2022; originally announced December 2022.

Comments: 13 pages, 1 figure, 2 tables

arXiv:2209.03251 [pdf, other]

Explicit Low-Bandwidth Evaluation Schemes for Weighted Sums of Reed-Solomon-Coded Symbols

Authors: Han Mao Kiah, Wilton Kim, Stanislav Kruglik, San Ling, Huaxiong Wang

Abstract: Motivated by applications in distributed storage, distributed computing, and homomorphic secret sharing, we study communication-efficient schemes for computing linear combinations of coded symbols. Specifically, we design low-bandwidth schemes that evaluate the weighted sum of $\ell$ coded symbols in a codeword $\pmb{c}\in\mathbb{F}^n$, when we are given access to $d$ of the remaining components i… ▽ More Motivated by applications in distributed storage, distributed computing, and homomorphic secret sharing, we study communication-efficient schemes for computing linear combinations of coded symbols. Specifically, we design low-bandwidth schemes that evaluate the weighted sum of $\ell$ coded symbols in a codeword $\pmb{c}\in\mathbb{F}^n$, when we are given access to $d$ of the remaining components in $\pmb{c}$. Formally, suppose that $\mathbb{F}$ is a field extension of $\mathbb{B}$ of degree $t$. Let $\pmb{c}$ be a codeword in a Reed-Solomon code of dimension $k$ and our task is to compute the weighted sum of $\ell$ coded symbols. In this paper, for some $s<t$, we provide an explicit scheme that performs this task by downloading $d(t-s)$ sub-symbols in $\mathbb{B}$ from $d$ available nodes, whenever $d\geq \ell|\mathbb{B}|^s-\ell+k$. In many cases, our scheme outperforms previous schemes in the literature. Furthermore, we provide a characterization of evaluation schemes for general linear codes. Then in the special case of Reed-Solomon codes, we use this characterization to derive a lower bound for the evaluation bandwidth. △ Less

Submitted 7 May, 2023; v1 submitted 7 September, 2022; originally announced September 2022.

Comments: Accepted to 2023 IEEE International Symposium on Information Theory

arXiv:2208.09138 [pdf, ps, other]

Two dimensional RC/Subarray Constrained Codes: Bounded Weight and Almost Balanced Weight

Authors: Tuan Thanh Nguyen, Kui Cai, Han Mao Kiah, Kees A. Schouhamer Immink, Yeow Meng Chee

Abstract: In this work, we study two types of constraints on two-dimensional binary arrays. In particular, given $p,ε>0$, we study (i) The $p$-bounded constraint: a binary vector of size $m$ is said to be $p$-bounded if its weight is at most $pm$, and (ii) The $ε$-balanced constraint: a binary vector of size $m$ is said to be $ε$-balanced if its weight is within $[(0.5-ε)*m,(0.5+ε)*m]$. Such constraints are… ▽ More In this work, we study two types of constraints on two-dimensional binary arrays. In particular, given $p,ε>0$, we study (i) The $p$-bounded constraint: a binary vector of size $m$ is said to be $p$-bounded if its weight is at most $pm$, and (ii) The $ε$-balanced constraint: a binary vector of size $m$ is said to be $ε$-balanced if its weight is within $[(0.5-ε)*m,(0.5+ε)*m]$. Such constraints are crucial in several data storage systems, those regard the information data as two-dimensional (2D) instead of one-dimensional (1D), such as the crossbar resistive memory arrays and the holographic data storage. In this work, efficient encoding/decoding algorithms are presented for binary arrays so that the weight constraint (either $p$-bounded constraint or $ε$-balanced constraint) is enforced over every row and every column, regarded as 2D row-column (RC) constrained codes; or over every subarray, regarded as 2D subarray constrained codes. While low-complexity designs have been proposed in the literature, mostly focusing on 2D RC constrained codes where $p = 1/2$ and $ε= 0$, this work provides efficient coding methods that work for both 2D RC constrained codes and 2D subarray constrained codes, and more importantly, the methods are applicable for arbitrary values of $p$ and $ε$. Furthermore, for certain values of $p$ and $ε$, we show that, for sufficiently large array size, there exists linear-time encoding/decoding algorithm that incurs at most one redundant bit. △ Less

Submitted 18 August, 2022; originally announced August 2022.

arXiv:2204.13831 [pdf, other]

Average Redundancy of Variable-Length Balancing Schemes à la Knuth

Authors: Duc Tu Dao, Han Mao Kiah, Tuan Thanh Nguyen

Abstract: We study and propose schemes that map messages onto constant-weight codewords using variable-length prefixes. We provide polynomial-time computable formulas that estimate the average number of redundant bits incurred by our schemes. In addition to the exact formulas, we also perform an asymptotic analysis and demonstrate that our scheme uses $\frac12 \log n+O(1)$ redundant bits to encode messages… ▽ More We study and propose schemes that map messages onto constant-weight codewords using variable-length prefixes. We provide polynomial-time computable formulas that estimate the average number of redundant bits incurred by our schemes. In addition to the exact formulas, we also perform an asymptotic analysis and demonstrate that our scheme uses $\frac12 \log n+O(1)$ redundant bits to encode messages into length-$n$ words with weight $(n/2)+{\sf q}$ for constant ${\sf q}$. We also propose schemes that map messages into balanced codebooks with error-correcting capabilities. For such schemes, we provide methods to enumerate the average number of redundant bits. △ Less

Submitted 3 July, 2023; v1 submitted 28 April, 2022; originally announced April 2022.

Comments: Extended version with new results

arXiv:2111.04255 [pdf, ps, other]

Sequence Reconstruction Problem for Deletion Channels: A Complete Asymptotic Solution

Authors: Van Long Phuoc Pham, Keshav Goyal, Han Mao Kiah

Abstract: Transmit a codeword $x$, that belongs to an $(\ell-1)$-deletion-correcting code of length $n$, over a $t$-deletion channel for some $1\le \ell\le t<n$. Levenshtein, in 2001, proposed the problem of determining $N(n,\ell,t)+1$, the minimum number of distinct channel outputs required to uniquely reconstruct $x$. Prior to this work, $N(n,\ell,t)$ is known only when $\ell\in\{1,2\}$. Here, we provide… ▽ More Transmit a codeword $x$, that belongs to an $(\ell-1)$-deletion-correcting code of length $n$, over a $t$-deletion channel for some $1\le \ell\le t<n$. Levenshtein, in 2001, proposed the problem of determining $N(n,\ell,t)+1$, the minimum number of distinct channel outputs required to uniquely reconstruct $x$. Prior to this work, $N(n,\ell,t)$ is known only when $\ell\in\{1,2\}$. Here, we provide an asymptotically exact solution for all values of $\ell$ and $t$. Specifically, we show that $N(n,\ell,t)=\binom{2\ell}{\ell}/(t-\ell)! n^{t-\ell} - O(n^{t-\ell-1})$ and in the special instance where $\ell=t$, we show that $N(n,\ell,\ell)=\binom{2\ell}{\ell}$. We also provide a conjecture on the exact value of $N(n,\ell,t)$ for all values of $n$, $\ell$, and $t$. △ Less

Submitted 7 November, 2021; originally announced November 2021.

MSC Class: 94B99

arXiv:2107.07377 [pdf, other]

Computing Permanents on a Trellis

Authors: Han Mao Kiah, Alexander Vardy, Hanwen Yao

Abstract: The problem of computing the permanent of a matrix has attracted interest since the work of Ryser(1963) and Valiant(1979). On the other hand, trellises were extensively studied in coding theory since the 1960s. In this work, we establish a connection between the two domains. We introduce the canonical trellis $T_n$ that represents all permutations, and show that the permanent of a $n$ by $n$ matri… ▽ More The problem of computing the permanent of a matrix has attracted interest since the work of Ryser(1963) and Valiant(1979). On the other hand, trellises were extensively studied in coding theory since the 1960s. In this work, we establish a connection between the two domains. We introduce the canonical trellis $T_n$ that represents all permutations, and show that the permanent of a $n$ by $n$ matrix $A$ can be computed as a flow on this trellis. Under certain normalization, the trellis-based method invokes slightly less operations than best known exact methods. Moreover, if $A$ has structure, then $T_n$ becomes amenable to vertex merging, thereby significantly reducing its complexity. - Repeated rows: Suppose $A$ has only $t<n$ distinct rows. The best known method to compute $per(A)$, due to Clifford and Clifford (2020), has complexity $O(n^{t+1})$. Merging vertices in $T_n$, we obtain a reduced trellis that has complexity $O(n^t)$. - Order statistics: Using trellises, we compute the joint distribution of $t$ order statistics of $n$ independent, but not identically distributed, random variables in time $O(n^{t+1})$. Previously, polynomial-time methods were known only when the variables are drawn from two non-identical distributions. - Sparse matrices: Suppose each entry in $A$ is nonzero with probability $d/n$ with $d$ is constant. We show that $T_n$ can be pruned to exponentially fewer vertices, resulting in complexity $O(φ^n)$ with $φ<2$. - TSP: Intersecting $T_n$ with another trellis that represents walks, we obtain a trellis that represents circular permutations. Using the latter trellis to solve the traveling salesperson problem recovers the well-known Held-Karp algorithm. Notably, in all cases, the reduced trellis are obtained using known techniques in trellis theory. We expect other trellis-theoretic results to apply to other structured matrices. △ Less

Submitted 15 July, 2021; originally announced July 2021.

arXiv:2007.15253 [pdf, other]

Repairing Reed-Solomon Codes via Subspace Polynomials

Authors: Hoang Dau, Dinh Thi Xinh, Han Mao Kiah, Tran Thi Luong, Olgica Milenkovic

Abstract: We propose new repair schemes for Reed-Solomon codes that use subspace polynomials and hence generalize previous works in the literature that employ trace polynomials. The Reed-Solomon codes are over $\mathbb{F}_{q^\ell}$ and have redundancy $r = n-k \geq q^m$, $1\leq m\leq \ell$, where $n$ and $k$ are the code length and dimension, respectively. In particular, for one erasure, we show that our sc… ▽ More We propose new repair schemes for Reed-Solomon codes that use subspace polynomials and hence generalize previous works in the literature that employ trace polynomials. The Reed-Solomon codes are over $\mathbb{F}_{q^\ell}$ and have redundancy $r = n-k \geq q^m$, $1\leq m\leq \ell$, where $n$ and $k$ are the code length and dimension, respectively. In particular, for one erasure, we show that our schemes can achieve optimal repair bandwidths whenever $n=q^\ell$ and $r = q^m,$ for all $1 \leq m \leq \ell$. For two erasures, our schemes use the same bandwidth per erasure as the single erasure schemes, for $\ell/m$ is a power of $q$, and for $\ell=q^a$, $m=q^b-1>1$ ($a \geq b \geq 1$), and for $m\geq \ell/2$ when $\ell$ is even and $q$ is a power of two. △ Less

Submitted 30 July, 2020; originally announced July 2020.

arXiv:2005.03102 [pdf, ps, other]

Constrained de Bruijn Codes: Properties, Enumeration, Constructions, and Applications

Authors: Yeow Meng Chee, Tuvi Etzion, Han Mao Kiah, Alexander Vardy, Van Khu Vu, Eitan yaakobi

Abstract: The de Bruijn graph, its sequences, and their various generalizations, have found many applications in information theory, including many new ones in the last decade. In this paper, motivated by a coding problem for emerging memory technologies, a set of sequences which generalize sequences in the de Bruijn graph are defined. These sequences can be also defined and viewed as constrained sequences.… ▽ More The de Bruijn graph, its sequences, and their various generalizations, have found many applications in information theory, including many new ones in the last decade. In this paper, motivated by a coding problem for emerging memory technologies, a set of sequences which generalize sequences in the de Bruijn graph are defined. These sequences can be also defined and viewed as constrained sequences. Hence, they will be called constrained de Bruijn sequences and a set of such sequences will be called a constrained de Bruijn code. Several properties and alternative definitions for such codes are examined and they are analyzed as generalized sequences in the de Bruijn graph (and its generalization) and as constrained sequences. Various enumeration techniques are used to compute the total number of sequences for any given set of parameters. A construction method of such codes from the theory of shift-register sequences is proposed. Finally, we show how these constrained de Bruijn sequences and codes can be applied in constructions of codes for correcting synchronization errors in the $\ell$-symbol read channel and in the racetrack memory channel. For this purpose, these codes are superior in their size on previously known codes. △ Less

Submitted 6 May, 2020; originally announced May 2020.

arXiv:2004.06032 [pdf, ps, other]

Optimal Reconstruction Codes for Deletion Channels

Authors: Johan Chrisnata, Han Mao Kiah, Eitan Yaakobi

Abstract: The sequence reconstruction problem, introduced by Levenshtein in 2001, considers a communication scenario where the sender transmits a codeword from some codebook and the receiver obtains multiple noisy reads of the codeword. Motivated by modern storage devices, we introduced a variant of the problem where the number of noisy reads $N$ is fixed (Kiah et al. 2020). Of significance, for the single-… ▽ More The sequence reconstruction problem, introduced by Levenshtein in 2001, considers a communication scenario where the sender transmits a codeword from some codebook and the receiver obtains multiple noisy reads of the codeword. Motivated by modern storage devices, we introduced a variant of the problem where the number of noisy reads $N$ is fixed (Kiah et al. 2020). Of significance, for the single-deletion channel, using $\log_2\log_2 n +O(1)$ redundant bits, we designed a reconstruction code of length $n$ that reconstructs codewords from two distinct noisy reads. In this work, we show that $\log_2\log_2 n -O(1)$ redundant bits are necessary for such reconstruction codes, thereby, demonstrating the optimality of our previous construction. Furthermore, we show that these reconstruction codes can be used in $t$-deletion channels (with $t\ge 2$) to uniquely reconstruct codewords from $n^{t-1}+O\left(n^{t-2}\right)$ distinct noisy reads. △ Less

Submitted 13 April, 2020; originally announced April 2020.

arXiv:2001.02839 [pdf, other]

Capacity-Approaching Constrained Codes with Error Correction for DNA-Based Data Storage

Authors: Tuan Thanh Nguyen, Kui Cai, Kees A. Schouhamer Immink, Han Mao Kiah

Abstract: We propose coding techniques that limit the length of homopolymers runs, ensure the GC-content constraint, and are capable of correcting a single edit error in strands of nucleotides in DNA-based data storage systems. In particular, for given $\ell, ε > 0$, we propose simple and efficient encoders/decoders that transform binary sequences into DNA base sequences (codewords), namely sequences of the… ▽ More We propose coding techniques that limit the length of homopolymers runs, ensure the GC-content constraint, and are capable of correcting a single edit error in strands of nucleotides in DNA-based data storage systems. In particular, for given $\ell, ε > 0$, we propose simple and efficient encoders/decoders that transform binary sequences into DNA base sequences (codewords), namely sequences of the symbols A, T, C and G, that satisfy the following properties: (i) Runlength constraint: the maximum homopolymer run in each codeword is at most $\ell$, (ii) GC-content constraint: the GC-content of each codeword is within $[0.5-ε, 0.5+ε]$, (iii) Error-correction: each codeword is capable of correcting a single deletion, or single insertion, or single substitution error. For practical values of $\ell$ and $ε$, we show that our encoders achieve much higher rates than existing results in the literature and approach the capacity. Our methods have low encoding/decoding complexity and limited error propagation. △ Less

Submitted 8 January, 2020; originally announced January 2020.

arXiv:2001.01376 [pdf, other]

Coding for Sequence Reconstruction for Single Edits

Authors: Kui Cai, Han Mao Kiah, Tuan Thanh Nguyen, Eitan Yaakobi

Abstract: The sequence reconstruction problem, introduced by Levenshtein in 2001, considers a communication scenario where the sender transmits a codeword from some codebook and the receiver obtains multiple noisy reads of the codeword. The common setup assumes the codebook to be the entire space and the problem is to determine the minimum number of distinct reads that is required to reconstruct the transmi… ▽ More The sequence reconstruction problem, introduced by Levenshtein in 2001, considers a communication scenario where the sender transmits a codeword from some codebook and the receiver obtains multiple noisy reads of the codeword. The common setup assumes the codebook to be the entire space and the problem is to determine the minimum number of distinct reads that is required to reconstruct the transmitted codeword. Motivated by modern storage devices, we study a variant of the problem where the number of noisy reads $N$ is fixed. Specifically, we design reconstruction codes that reconstruct a codeword from $N$ distinct noisy reads. We focus on channels that introduce single edit error (i.e. a single substitution, insertion, or deletion) and their variants, and design reconstruction codes for all values of $N$. In particular, for the case of a single edit, we show that as the number of noisy reads increases, the number of redundant bits required can be gracefully reduced from $\log n+O(1)$ to $\log \log n+O(1)$, and then to $O(1)$, where $n$ denotes the length of a codeword. We also show that the redundancy of certain reconstruction codes is within one bit of optimality. △ Less

Submitted 14 June, 2020; v1 submitted 5 January, 2020; originally announced January 2020.

arXiv:1912.13301 [pdf, other]

Robust Positioning Patterns with Low Redundancy

Authors: Yeow Meng Chee, Duc Tu Dao, Han Mao Kiah, San Ling, Hengjia Wei

Abstract: A robust positioning pattern is a large array that allows a mobile device to locate its position by reading a possibly corrupted small window around it. In this paper, we provide constructions of binary positioning patterns, equipped with efficient locating algorithms, that are robust to a constant number of errors and have redundancy within a constant factor of optimality. Furthermore, we modify… ▽ More A robust positioning pattern is a large array that allows a mobile device to locate its position by reading a possibly corrupted small window around it. In this paper, we provide constructions of binary positioning patterns, equipped with efficient locating algorithms, that are robust to a constant number of errors and have redundancy within a constant factor of optimality. Furthermore, we modify our constructions to correct rank errors and obtain binary positioning patterns robust to any errors of rank less than a constant number. Additionally, we construct $q$-ary robust positioning sequences robust to a large number of errors, some of which have length attaining the upper bound. Our construction of binary positioning sequences that are robust to a constant number of errors has the least known redundancy amongst those explicit constructions with efficient locating algorithms. On the other hand, for binary robust positioning arrays, our construction is the first explicit construction whose redundancy is within a constant factor of optimality. The locating algorithms accompanying both constructions run in time cubic in sequence length or array dimension. △ Less

Submitted 31 December, 2019; originally announced December 2019.

Comments: Extended Version of SODA 2019 Paper

MSC Class: 05B30; 94C30

arXiv:1912.11617 [pdf, ps, other]

Efficient Algorithm for the Linear Complexity of Sequences and Some Related Consequences

Authors: Yeow Meng Chee, Johan Chrisnata, Tuvi Etzion, Han Mao Kiah

Abstract: The linear complexity of a sequence $s$ is one of the measures of its predictability. It represents the smallest degree of a linear recursion which the sequence satisfies. There are several algorithms to find the linear complexity of a periodic sequence $s$ of length $N$ (where $N$ is of some given form) over a finite field $F_q$ in $O(N)$ symbol field operations. The first such algorithm is The G… ▽ More The linear complexity of a sequence $s$ is one of the measures of its predictability. It represents the smallest degree of a linear recursion which the sequence satisfies. There are several algorithms to find the linear complexity of a periodic sequence $s$ of length $N$ (where $N$ is of some given form) over a finite field $F_q$ in $O(N)$ symbol field operations. The first such algorithm is The Games-Chan Algorithm which considers binary sequences of period $2^n$, and is known for its extreme simplicity. We generalize this algorithm and apply it efficiently for several families of binary sequences. Our algorithm is very simple, it requires $βN$ bit operations for a small constant $β$, where $N$ is the period of the sequence. We make an analysis on the number of bit operations required by the algorithm and compare it with previous algorithms. In the process, the algorithm also finds the recursion for the shortest linear feedback shift-register which generates the sequence. Some other interesting properties related to shift-register sequences, which might not be too surprising but generally unnoted, are also consequences of our exposition. △ Less

Submitted 25 December, 2019; originally announced December 2019.

arXiv:1910.06501 [pdf, ps, other]

Optimal Codes Correcting a Single Indel / Edit for DNA-Based Data Storage

Authors: Kui Cai, Yeow Meng Chee, Ryan Gabrys, Han Mao Kiah, Tuan Thanh Nguyen

Abstract: An indel refers to a single insertion or deletion, while an edit refers to a single insertion, deletion or substitution. In this paper, we investigate codes that combat either a single indel or a single edit and provide linear-time algorithms that encode binary messages into these codes of length n. Over the quaternary alphabet, we provide two linear-time encoders. One corrects a single edit with… ▽ More An indel refers to a single insertion or deletion, while an edit refers to a single insertion, deletion or substitution. In this paper, we investigate codes that combat either a single indel or a single edit and provide linear-time algorithms that encode binary messages into these codes of length n. Over the quaternary alphabet, we provide two linear-time encoders. One corrects a single edit with log n + O(log log n) redundancy bits, while the other corrects a single indel with log n + 2 redundant bits. These two encoders are order-optimal. The former encoder is the first known order-optimal encoder that corrects a single edit, while the latter encoder (that corrects a single indel) reduces the redundancy of the best known encoder of Tenengolts (1984) by at least four bits. Over the DNA alphabet, we impose an additional constraint: the GC-balanced constraint and require that exactly half of the symbols of any DNA codeword to be either C or G. In particular, via a modification of Knuth's balancing technique, we provide a linear-time map that translates binary messages into GC-balanced codewords and the resulting codebook is able to correct a single indel or a single edit. These are the first known constructions of GC-balanced codes that correct a single indel or a single edit. △ Less

Submitted 14 October, 2019; originally announced October 2019.

Comments: 15 pages

arXiv:1907.02944 [pdf]

Proceedings of the 11th Asia-Europe Workshop on Concepts in Information Theory

Authors: A. J. Han Vinck, Kees A. Schouhamer Immink, Tadashi Wadayama, Van Khu Vu, Akiko Manada, Kui Cai, Shunsuke Horii, Yoshiki Abe, Mitsugu Iwamoto, Kazuo Ohta, Xingwei Zhong, Zhen Mei, Renfei Bu, J. H. Weber, Vitaly Skachek, Hiroyoshi Morita, N. Hovhannisyan, Hiroshi Kamabe, Shan Lu, Hirosuke Yamamoto, Kengo Hasimoto, O. Ytrehus, Shigeaki Kuzuoaka, Mikihiko Nishiara, Han Mao Kiah , et al. (2 additional authors not shown)

Abstract: This year, 2019 we celebrate 30 years of our friendship between Asian and European scientists at the AEW11 in Rotterdam, the Netherlands. Many of the 1989 participants are also present at the 2019 event. This year we have many participants from different parts of Asia and Europe. It shows the importance of this event. It is a good tradition to pay a tribute to a special lecturer in our community.… ▽ More This year, 2019 we celebrate 30 years of our friendship between Asian and European scientists at the AEW11 in Rotterdam, the Netherlands. Many of the 1989 participants are also present at the 2019 event. This year we have many participants from different parts of Asia and Europe. It shows the importance of this event. It is a good tradition to pay a tribute to a special lecturer in our community. This year we selected Hiroyoshi Morita, who is a well known information theorist with many original contributions. △ Less

Submitted 26 June, 2019; originally announced July 2019.

arXiv:1901.01023 [pdf, ps, other]

Efficient and Explicit Balanced Primer Codes

Authors: Yeow Meng Chee, Han Mao Kiah, Hengjia Wei

Abstract: To equip DNA-based data storage with random-access capabilities, Yazdi et al. (2018) prepended DNA strands with specially chosen address sequences called primers and provided certain design criteria for these primers. We provide explicit constructions of error-correcting codes that are suitable as primer addresses and equip these constructions with efficient encoding algorithms. Specifically, ou… ▽ More To equip DNA-based data storage with random-access capabilities, Yazdi et al. (2018) prepended DNA strands with specially chosen address sequences called primers and provided certain design criteria for these primers. We provide explicit constructions of error-correcting codes that are suitable as primer addresses and equip these constructions with efficient encoding algorithms. Specifically, our constructions take cyclic or linear codes as inputs and produce sets of primers with similar error-correcting capabilities. Using certain classes of BCH codes, we obtain infinite families of primer sets of length $n$, minimum distance $d$ with $(d + 1) \log_4 n + O(1)$ redundant symbols. Our techniques involve reversible cyclic codes (1964), an encoding method of Tavares et al. (1971) and Knuth's balancing technique (1986). In our investigation, we also construct efficient and explicit binary balanced error-correcting codes and codes for DNA computing. △ Less

Submitted 4 January, 2019; originally announced January 2019.

arXiv:1901.00387 [pdf, ps, other]

Generalized Sphere-Packing Bound for Subblock-Constrained Codes

Authors: Han Mao Kiah, Anshoo Tandon, Mehul Motani

Abstract: We apply the generalized sphere-packing bound to two classes of subblock-constrained codes. A la Fazeli et al. (2015), we made use of automorphism to significantly reduce the number of variables in the associated linear programming problem. In particular, we study binary constant subblock-composition codes (CSCCs), characterized by the property that the number of ones in each subblock is constant,… ▽ More We apply the generalized sphere-packing bound to two classes of subblock-constrained codes. A la Fazeli et al. (2015), we made use of automorphism to significantly reduce the number of variables in the associated linear programming problem. In particular, we study binary constant subblock-composition codes (CSCCs), characterized by the property that the number of ones in each subblock is constant, and binary subblock energy-constrained codes (SECCs), characterized by the property that the number of ones in each subblock exceeds a certain threshold. For CSCCs, we show that the optimization problem is equivalent to finding the minimum of $N$ variables, where $N$ is independent of the number of subblocks. We then provide closed-form solutions for the generalized sphere-packing bounds for single- and double-error correcting CSCCs. For SECCs, we provide closed-form solutions for the generalized sphere-packing bounds for single errors in certain special cases. We also obtain improved bounds on the optimal asymptotic rate for CSCCs and SECCs, and provide numerical examples to highlight the improvement. △ Less

Submitted 2 January, 2019; originally announced January 2019.

arXiv:1808.09535 [pdf, ps, other]

Low-Power Cooling Codes with Efficient Encoding and Decoding

Authors: Yeow Meng Chee, Tuvi Etzion, Han Mao Kiah, Alexander Vardy, Hengjia Wei

Abstract: A class of low-power cooling (LPC) codes, to control simultaneously both the peak temperature and the average power consumption of interconnects, was introduced recently. An $(n,t,w)$-LPC code is a coding scheme over $n$ wires that (A) avoids state transitions on the $t$ hottest wires (cooling), and (B) limits the number of transitions to $w$ in each transmission (low-power). A few constructions… ▽ More A class of low-power cooling (LPC) codes, to control simultaneously both the peak temperature and the average power consumption of interconnects, was introduced recently. An $(n,t,w)$-LPC code is a coding scheme over $n$ wires that (A) avoids state transitions on the $t$ hottest wires (cooling), and (B) limits the number of transitions to $w$ in each transmission (low-power). A few constructions for large LPC codes that have efficient encoding and decoding schemes, are given. In particular, when $w$ is fixed, we construct LPC codes of size $(n/w)^{w-1}$ and show that these LPC codes can be modified to correct errors efficiently. We further present a construction for large LPC codes based on a mapping from cooling codes to LPC codes. The efficiency of the encoding/decoding for the constructed LPC codes depends on the efficiency of the decoding/encoding for the related cooling codes and the ones for the mapping. △ Less

Submitted 19 March, 2019; v1 submitted 16 August, 2018; originally announced August 2018.

arXiv:1801.02310 [pdf, ps, other]

Efficient Encoding/Decoding of Irreducible Words for Codes Correcting Tandem Duplications

Authors: Yeow Meng Chee, Johan Chrisnata, Han Mao Kiah, Tuan Thanh Nguyen

Abstract: Tandem duplication is the process of inserting a copy of a segment of DNA adjacent to the original position. Motivated by applications that store data in living organisms, Jain et al. (2017) proposed the study of codes that correct tandem duplications. Known code constructions are based on {\em irreducible words}. We study efficient encoding/decoding methods for irreducible words. First, we desc… ▽ More Tandem duplication is the process of inserting a copy of a segment of DNA adjacent to the original position. Motivated by applications that store data in living organisms, Jain et al. (2017) proposed the study of codes that correct tandem duplications. Known code constructions are based on {\em irreducible words}. We study efficient encoding/decoding methods for irreducible words. First, we describe an $(\ell,m)$-finite state encoder and show that when $m=Θ(1/ε)$ and $\ell=Θ(1/ε)$, the encoder achieves rate that is $ε$ away from the optimal. Next, we provide ranking/unranking algorithms for irreducible words and modify the algorithms to reduce the space requirements for the finite state encoder. △ Less

Submitted 8 January, 2018; originally announced January 2018.

arXiv:1709.05214 [pdf, other]

Mutually Uncorrelated Primers for DNA-Based Data Storage

Authors: S. M. Hossein Tabatabaei Yazdi, Han Mao Kiah, Ryan Gabrys, Olgica Milenkovic

Abstract: We introduce the notion of weakly mutually uncorrelated (WMU) sequences, motivated by applications in DNA-based data storage systems and for synchronization of communication devices. WMU sequences are characterized by the property that no sufficiently long suffix of one sequence is the prefix of the same or another sequence. WMU sequences used for primer design in DNA-based data storage systems ar… ▽ More We introduce the notion of weakly mutually uncorrelated (WMU) sequences, motivated by applications in DNA-based data storage systems and for synchronization of communication devices. WMU sequences are characterized by the property that no sufficiently long suffix of one sequence is the prefix of the same or another sequence. WMU sequences used for primer design in DNA-based data storage systems are also required to be at large mutual Hamming distance from each other, have balanced compositions of symbols, and avoid primer-dimer byproducts. We derive bounds on the size of WMU and various constrained WMU codes and present a number of constructions for balanced, error-correcting, primer-dimer free WMU codes using Dyck paths, prefix-synchronized and cyclic codes. △ Less

Submitted 13 September, 2017; originally announced September 2017.

Comments: 14 pages, 3 figures, 1 Table. arXiv admin note: text overlap with arXiv:1601.08176

arXiv:1701.07872 [pdf, other]

Cooling Codes: Thermal-Management Coding for High-Performance Interconnects

Authors: Yeow Meng Chee, Tuvi Etzion, Han Mao Kiah, Alexander Vardy

Abstract: High temperatures have dramatic negative effects on interconnect performance and, hence, numerous techniques have been proposed to reduce the power consumption of on-chip buses. However, existing methods fall short of fully addressing the thermal challenges posed by high-performance interconnects. In this paper, we introduce new efficient coding schemes that make it possible to directly control th… ▽ More High temperatures have dramatic negative effects on interconnect performance and, hence, numerous techniques have been proposed to reduce the power consumption of on-chip buses. However, existing methods fall short of fully addressing the thermal challenges posed by high-performance interconnects. In this paper, we introduce new efficient coding schemes that make it possible to directly control the peak temperature of a bus by effectively cooling its hottest wires. This is achieved by avoiding state transitions on the hottest wires for as long as necessary until their temperature drops off. We also reduce the average power consumption by making sure that the total number of state transitions on all the wires is below a prescribed threshold. We show how each of these two features can be coded for separately or, alternatively, how both can be achieved at the same time. In addition, error-correction for the transmitted information can be provided while controlling the peak temperature and/or the average power consumption. In general, our cooling codes use $n > k$ wires to encode a given $k$-bit bus. One of our goals herein is to determine the minimum possible number of wires $n$ needed to encode $k$ bits while satisfying any combination of the three desired properties. We provide full theoretical analysis in each case. In particular, we show that $n = k+t+1$ suffices to cool the $t$ hottest wires, and this is the best possible. Moreover, although the proposed coding schemes make use of sophisticated tools from combinatorics, discrete geometry, linear algebra, and coding theory, the resulting encoders and decoders are fully practical. They do not require significant computational overhead and can be implemented without sacrificing a large circuit area. △ Less

Submitted 23 October, 2017; v1 submitted 26 January, 2017; originally announced January 2017.

arXiv:1701.07118 [pdf, other]

Repairing Reed-Solomon Codes With Two Erasures

Authors: Hoang Dau, Iwan Duursma, Han Mao Kiah, Olgica Milenkovic

Abstract: Despite their exceptional error-correcting properties, Reed-Solomon (RS) codes have been overlooked in distributed storage applications due to the common belief that they have poor repair bandwidth: A naive repair approach would require the whole file to be reconstructed in order to recover a single erased codeword symbol. In a recent work, Guruswami and Wootters (STOC'16) proposed a single-erasur… ▽ More Despite their exceptional error-correcting properties, Reed-Solomon (RS) codes have been overlooked in distributed storage applications due to the common belief that they have poor repair bandwidth: A naive repair approach would require the whole file to be reconstructed in order to recover a single erased codeword symbol. In a recent work, Guruswami and Wootters (STOC'16) proposed a single-erasure repair method for RS codes that achieves the optimal repair bandwidth amongst all linear encoding schemes. We extend their trace collection technique to cope with two erasures. △ Less

Submitted 14 May, 2017; v1 submitted 24 January, 2017; originally announced January 2017.

Comments: ISIT'17 (accepted), 5 pages. arXiv admin note: substantial text overlap with arXiv:1612.01361

arXiv:1701.06874 [pdf, ps, other]

Coding for Racetrack Memories

Authors: Yeow Meng Chee, Han Mao Kiah, Alexander Vardy, Van Khu Vu, Eitan Yaakobi

Abstract: Racetrack memory is a new technology which utilizes magnetic domains along a nanoscopic wire in order to obtain extremely high storage density. In racetrack memory, each magnetic domain can store a single bit of information, which can be sensed by a reading port (head). The memory has a tape-like structure which supports a shift operation that moves the domains to be read sequentially by the head.… ▽ More Racetrack memory is a new technology which utilizes magnetic domains along a nanoscopic wire in order to obtain extremely high storage density. In racetrack memory, each magnetic domain can store a single bit of information, which can be sensed by a reading port (head). The memory has a tape-like structure which supports a shift operation that moves the domains to be read sequentially by the head. In order to increase the memory's speed, prior work studied how to minimize the latency of the shift operation, while the no less important reliability of this operation has received only a little attention. In this work we design codes which combat shift errors in racetrack memory, called position errors. Namely, shifting the domains is not an error-free operation and the domains may be over-shifted or are not shifted, which can be modeled as deletions and sticky insertions. While it is possible to use conventional deletion and insertion-correcting codes, we tackle this problem with the special structure of racetrack memory, where the domains can be read by multiple heads. Each head outputs a noisy version of the stored data and the multiple outputs are combined in order to reconstruct the data. Under this paradigm, we will show that it is possible to correct, with at most a single bit of redundancy, $d$ deletions with $d+1$ heads if the heads are well-separated. Similar results are provided for burst of deletions, sticky insertions and combinations of both deletions and sticky insertions. △ Less

Submitted 24 January, 2017; originally announced January 2017.

arXiv:1701.04954 [pdf, ps, other]

Bounds on the Size and Asymptotic Rate of Subblock-Constrained Codes

Authors: Anshoo Tandon, Han Mao Kiah, Mehul Motani

Abstract: The study of subblock-constrained codes has recently gained attention due to their application in diverse fields. We present bounds on the size and asymptotic rate for two classes of subblock-constrained codes. The first class is binary constant subblock-composition codes (CSCCs), where each codeword is partitioned into equal sized subblocks, and every subblock has the same fixed weight. The secon… ▽ More The study of subblock-constrained codes has recently gained attention due to their application in diverse fields. We present bounds on the size and asymptotic rate for two classes of subblock-constrained codes. The first class is binary constant subblock-composition codes (CSCCs), where each codeword is partitioned into equal sized subblocks, and every subblock has the same fixed weight. The second class is binary subblock energy-constrained codes (SECCs), where the weight of every subblock exceeds a given threshold. We present novel upper and lower bounds on the code sizes and asymptotic rates for binary CSCCs and SECCs. For a fixed subblock length and small relative distance, we show that the asymptotic rate for CSCCs (resp. SECCs) is strictly lower than the corresponding rate for constant weight codes (CWCs) (resp. heavy weight codes (HWCs)). Further, for codes with high weight and low relative distance, we show that the asymptotic rates for CSCCs is strictly lower than that of SECCs, which contrasts that the asymptotic rate for CWCs is equal to that of HWCs. We also provide a correction to an earlier result by Chee et al. (2014) on the asymptotic CSCC rate. Additionally, we present several numerical examples comparing the rates for CSCCs and SECCs with those for constant weight codes and heavy weight codes. △ Less

Submitted 18 January, 2017; originally announced January 2017.

arXiv:1612.01361 [pdf, other]

doi 10.1109/TIT.2018.2827942

Repairing Reed-Solomon Codes With Multiple Erasures

Authors: Hoang Dau, Iwan Duursma, Han Mao Kiah, Olgica Milenkovic

Abstract: Despite their exceptional error-correcting properties, Reed-Solomon codes have been overlooked in distributed storage applications due to the common belief that they have poor repair bandwidth: A naive repair approach would require the whole file to be reconstructed in order to recover a single erased codeword symbol. In a recent work, Guruswami and Wootters (STOC'16) proposed a single-erasure rep… ▽ More Despite their exceptional error-correcting properties, Reed-Solomon codes have been overlooked in distributed storage applications due to the common belief that they have poor repair bandwidth: A naive repair approach would require the whole file to be reconstructed in order to recover a single erased codeword symbol. In a recent work, Guruswami and Wootters (STOC'16) proposed a single-erasure repair method for Reed-Solomon codes that achieves the optimal repair bandwidth amongst all linear encoding schemes. Their key idea is to recover the erased symbol by collecting a sufficiently large number of its traces, each of which can be constructed from a number of traces of other symbols. We extend the trace collection technique to cope with two and three erasures. △ Less

Submitted 2 May, 2020; v1 submitted 28 November, 2016; originally announced December 2016.

Comments: 15 pages

Journal ref: IEEE Transactions on Information Theory (2018) 64(10) 6567-6582

arXiv:1607.02279 [pdf, other]

Rates of DNA Sequence Profiles for Practical Values of Read Lengths

Authors: Zuling Chang, Johan Chrisnata, Martianus Frederic Ezerman, Han Mao Kiah

Abstract: A recent study by one of the authors has demonstrated the importance of profile vectors in DNA-based data storage. We provide exact values and lower bounds on the number of profile vectors for finite values of alphabet size $q$, read length $\ell$, and word length $n$.Consequently, we demonstrate that for $q\ge 2$ and $n\le q^{\ell/2-1}$, the number of profile vectors is at least $q^{κn}$ with… ▽ More A recent study by one of the authors has demonstrated the importance of profile vectors in DNA-based data storage. We provide exact values and lower bounds on the number of profile vectors for finite values of alphabet size $q$, read length $\ell$, and word length $n$.Consequently, we demonstrate that for $q\ge 2$ and $n\le q^{\ell/2-1}$, the number of profile vectors is at least $q^{κn}$ with $κ$ very close to one.In addition to enumeration results, we provide a set of efficient encoding and decoding algorithms for each of two particular families of profile vectors. △ Less

Submitted 8 July, 2016; originally announced July 2016.

arXiv:1601.08176 [pdf, ps, other]

Weakly Mutually Uncorrelated Codes

Authors: S. M. Hossein Tabatabaei Yazdi, Han Mao Kiah, Olgica Milenkovic

Abstract: We introduce the notion of weakly mutually uncorrelated (WMU) sequences, motivated by applications in DNA-based storage systems and synchronization protocols. WMU sequences are characterized by the property that no sufficiently long suffix of one sequence is the prefix of the same or another sequence. In addition, WMU sequences used in DNA-based storage systems are required to have balanced compos… ▽ More We introduce the notion of weakly mutually uncorrelated (WMU) sequences, motivated by applications in DNA-based storage systems and synchronization protocols. WMU sequences are characterized by the property that no sufficiently long suffix of one sequence is the prefix of the same or another sequence. In addition, WMU sequences used in DNA-based storage systems are required to have balanced compositions of symbols and to be at large mutual Hamming distance from each other. We present a number of constructions for balanced, error-correcting WMU codes using Dyck paths, Knuth's balancing principle, prefix synchronized and cyclic codes. △ Less

Submitted 29 January, 2016; originally announced January 2016.

Comments: 1 table

arXiv:1507.01611 [pdf, other]

DNA-Based Storage: Trends and Methods

Authors: S. M. Hossein Tabatabaei Yazdi, Han Mao Kiah, Eva Ruiz Garcia, Jian Ma, Huimin Zhao, Olgica Milenkovic

Abstract: We provide an overview of current approaches to DNA-based storage system design and accompanying synthesis, sequencing and editing methods. We also introduce and analyze a suite of new constrained coding schemes for both archival and random access DNA storage channels. The mathematical basis of our work is the construction and design of sequences over discrete alphabets that avoid pre-specified ad… ▽ More We provide an overview of current approaches to DNA-based storage system design and accompanying synthesis, sequencing and editing methods. We also introduce and analyze a suite of new constrained coding schemes for both archival and random access DNA storage channels. The mathematical basis of our work is the construction and design of sequences over discrete alphabets that avoid pre-specified address patterns, have balanced base content, and exhibit other relevant substring constraints. These schemes adapt the stored signals to the DNA medium and thereby reduce the inherent error-rate of the system. △ Less

Submitted 6 July, 2015; originally announced July 2015.

arXiv:1506.03724 [pdf, ps, other]

Product Construction of Affine Codes

Authors: Yeow Meng Chee, Han Mao Kiah, Punarbasu Purkayastha, Patrick Solé

Abstract: Binary matrix codes with restricted row and column weights are a desirable method of coded modulation for power line communication. In this work, we construct such matrix codes that are obtained as products of affine codes - cosets of binary linear codes. Additionally, the constructions have the property that they are systematic. Subsequently, we generalize our construction to irregular product of… ▽ More Binary matrix codes with restricted row and column weights are a desirable method of coded modulation for power line communication. In this work, we construct such matrix codes that are obtained as products of affine codes - cosets of binary linear codes. Additionally, the constructions have the property that they are systematic. Subsequently, we generalize our construction to irregular product of affine codes, where the component codes are affine codes of different rates. △ Less

Submitted 11 June, 2015; originally announced June 2015.

Comments: 13 pages, to appear in SIAM Journal on Discrete Mathematics

MSC Class: 94B05; 94B60 (Primary); 94B20 (Secondary)

arXiv:1506.02349 [pdf, ps, other]

Local Codes with Addition Based Repair

Authors: Han Mao Kiah, Son Hoang Dau, Wentu Song, Chau Yuen

Abstract: We consider the complexities of repair algorithms for locally repairable codes and propose a class of codes that repair single node failures using addition operations only, or codes with addition based repair. We construct two families of codes with addition based repair. The first family attains distance one less than the Singleton-like upper bound, while the second family attains the Singleton-l… ▽ More We consider the complexities of repair algorithms for locally repairable codes and propose a class of codes that repair single node failures using addition operations only, or codes with addition based repair. We construct two families of codes with addition based repair. The first family attains distance one less than the Singleton-like upper bound, while the second family attains the Singleton-like upper bound. △ Less

Submitted 8 June, 2015; originally announced June 2015.

arXiv:1506.00740 [pdf, other]

Asymmetric Lee Distance Codes for DNA-Based Storage

Authors: Ryan Gabrys, Han Mao Kiah, Olgica Milenkovic

Abstract: We consider a new family of codes, termed asymmetric Lee distance codes, that arise in the design and implementation of DNA-based storage systems and systems with parallel string transmission protocols. The codewords are defined over a quaternary alphabet, although the results carry over to other alphabet sizes; furthermore, symbol confusability is dictated by their underlying binary representatio… ▽ More We consider a new family of codes, termed asymmetric Lee distance codes, that arise in the design and implementation of DNA-based storage systems and systems with parallel string transmission protocols. The codewords are defined over a quaternary alphabet, although the results carry over to other alphabet sizes; furthermore, symbol confusability is dictated by their underlying binary representation. Our contributions are two-fold. First, we demonstrate that the new distance represents a linear combination of the Lee and Hamming distance and derive upper bounds on the size of the codes under this metric based on linear programming techniques. Second, we propose a number of code constructions which imply lower bounds. △ Less

Submitted 14 December, 2016; v1 submitted 1 June, 2015; originally announced June 2015.

arXiv:1504.04926 [pdf, other]

Locally Encodable and Decodable Codes for Distributed Storage Systems

Authors: Son Hoang Dau, Han Mao Kiah, Wentu Song, Chau Yuen

Abstract: We consider the locality of encoding and decoding operations in distributed storage systems (DSS), and propose a new class of codes, called locally encodable and decodable codes (LEDC), that provides a higher degree of operational locality compared to currently known codes. For a given locality structure, we derive an upper bound on the global distance and demonstrate the existence of an optimal L… ▽ More We consider the locality of encoding and decoding operations in distributed storage systems (DSS), and propose a new class of codes, called locally encodable and decodable codes (LEDC), that provides a higher degree of operational locality compared to currently known codes. For a given locality structure, we derive an upper bound on the global distance and demonstrate the existence of an optimal LEDC for sufficiently large field size. In addition, we also construct two families of optimal LEDC for fields with size linear in code length. △ Less

Submitted 19 April, 2015; originally announced April 2015.

Comments: 7 pages

Showing 1–50 of 60 results for author: Kiah, H M