-
Coding for the unsourced B-channel with erasures: enhancing the linked loop code
Authors:
William W. Zheng,
Jamison R. Ebert,
Stefano Rini,
Jean-Francois Chamberland
Abstract:
In [1], the linked loop code (LLC) is presented as a promising code for the unsourced A-channel with erasures (UACE). The UACE is an unsourced multiple access channel in which active users' transmitted symbols are erased with a given probability and the channel output is obtained as the union of the non-erased symbols. In this paper, we extend the UACE channel model to the unsourced B-channel with erasures (UBCE). The UBCE differs from the UACE in that the channel output is the multiset union, or bag union, of the non-erased input symbols. In other words, the UBCE preserves the symbol multiplicity of the channel output while the UACE does not. Both the UACE and UBCE find applications in modeling aspects of unsourced random access. The LLC from [1] is enhanced and shown to outperform the tree code over the UBCE. Findings are supported by numerical simulations.
Submitted 20 May, 2024;
originally announced June 2024.
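The difference between the two channel outputs described in this abstract can be illustrated with a minimal sketch. It assumes each transmitted symbol is erased independently with probability `p_erase`. The function names and the integer symbol alphabet are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def erase(symbols, p_erase, rng):
    """Keep each transmitted symbol independently with probability 1 - p_erase."""
    return [s for s in symbols if rng.random() >= p_erase]


def uace_output(symbols, p_erase, rng):
    """A-channel with erasures: plain set union, multiplicities are lost."""
    return set(erase(symbols, p_erase, rng))


def ubce_output(symbols, p_erase, rng):
    """B-channel with erasures: multiset (bag) union, multiplicities are kept."""
    return Counter(erase(symbols, p_erase, rng))


# Four active users; two of them happen to send the same symbol 7.
sent = [3, 7, 7, 12]
print(uace_output(sent, 0.0, random.Random(0)))
print(ubce_output(sent, 0.0, random.Random(0)))
```

With no erasures, the A-channel output contains the symbol 7 once, whereas the B-channel output records that it was sent twice.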
-
Investigating Multilingual Instruction-Tuning: Do Polyglot Models Demand for Multilingual Instructions?
Authors:
Alexander Arno Weber,
Klaudia Thellmann,
Jan Ebert,
Nicolas Flores-Herr,
Jens Lehmann,
Michael Fromm,
Mehdi Ali
Abstract:
The adaption of multilingual pre-trained Large Language Models (LLMs) into eloquent and helpful assistants is essential to facilitate their use across different language regions. In that spirit, we are the first to conduct an extensive study of the performance of multilingual models on parallel, multi-turn instruction-tuning benchmarks across a selection of the most-spoken Indo-European languages. We systematically examine the effects of language and instruction dataset size on a mid-sized, multilingual LLM by instruction-tuning it on parallel instruction-tuning datasets. Our results demonstrate that instruction-tuning on parallel instead of monolingual corpora benefits cross-lingual instruction following capabilities by up to 4.6%. Furthermore, we show that the Superficial Alignment Hypothesis does not hold in general, as the investigated multilingual 7B parameter model presents a counter-example requiring large-scale instruction-tuning datasets. Finally, we conduct a human annotation study to understand the alignment between human-based and GPT-4-based evaluation within multilingual chat scenarios.
Submitted 21 February, 2024;
originally announced February 2024.
-
Multi-User SR-LDPC Codes via Coded Demixing with Applications to Cell-Free Systems
Authors:
Jamison R. Ebert,
Jean-Francois Chamberland,
Krishna R. Narayanan
Abstract:
Novel sparse regression LDPC (SR-LDPC) codes exhibit excellent performance over additive white Gaussian noise (AWGN) channels in part due to their natural provision of shaping gains. Though SR-LDPC-like codes have been considered within the context of single-user error correction and massive random access, they are yet to be examined as candidates for coordinated multi-user communication scenarios. This article explores this gap in the literature and demonstrates that SR-LDPC codes, when combined with coded demixing techniques, offer a new framework for efficient non-orthogonal multiple access (NOMA) in the context of coordinated multi-user communication channels. The ensuing communication scheme is referred to as MU-SR-LDPC coding. Empirical evidence suggests that, for a fixed SNR, MU-SR-LDPC coding can achieve a target bit error rate (BER) at a higher sum rate than orthogonal multiple access (OMA) techniques such as time division multiple access (TDMA) and frequency division multiple access (FDMA). Importantly, MU-SR-LDPC codes enable a pragmatic solution path for user-centric cell-free communication systems with (local) joint decoding. Results are supported by numerical simulations.
Submitted 9 February, 2024;
originally announced February 2024.
-
Coding for the unsourced A-channel with erasures: the linked loop code
Authors:
William W. Zheng,
Jamison R. Ebert,
Stefano Rini,
Jean-Francois Chamberland
Abstract:
The A-channel is a noiseless multiple access channel in which users simultaneously transmit Q-ary symbols and the receiver observes the set of transmitted symbols, but not their multiplicities. An A-channel is said to be unsourced if, additionally, users' transmissions are encoded across time using a common codebook and decoding of the transmitted messages is done without regard to the identities of the active users. An interesting variant of the unsourced A-channel is the unsourced A-channel with erasures (UACE), in which transmitted symbols are erased with a given independent and identically distributed probability. In this paper, we focus on designing a code that enables a list of transmitted codewords to be recovered despite the erasures of some of the transmitted symbols. To this end, we propose the linked-loop code (LLC), which uses parity bits to link each symbol to the previous M symbols in a tail-biting manner, i.e., the first symbols of the transmission are linked to the last ones. The decoding process occurs in two phases: the first phase decodes the codewords that do not suffer from any erasures, and the second phase attempts to recover the erased symbols using the available parities. We compare the performance of the LLC over the UACE with other codes in the literature and argue for the effectiveness of the construction. Our motivation for studying the UACE comes from its relevance in machine-type communication and coded compressed sensing.
Submitted 19 September, 2023;
originally announced December 2023.
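The tail-biting linkage mentioned in this abstract can be pictured with a small index sketch. It shows only which earlier symbol positions a given position is linked to, with wrap-around from the start of the transmission to its end. The function name, the number of positions, and the value of M are hypothetical, and the actual parity construction of the paper is not reproduced here.

```python
def linked_positions(i, num_positions, memory):
    """Return the `memory` positions preceding position i, wrapping around
    so that the first positions are linked to the last ones (tail-biting)."""
    return [(i - k) % num_positions for k in range(1, memory + 1)]


# Hypothetical example: 8 symbol positions, each linked to the previous M = 2.
num_positions, memory = 8, 2
for i in range(num_positions):
    print(i, linked_positions(i, num_positions, memory))
```

Position 0 is linked to positions 7 and 6, which is the loop closure that the term tail-biting refers to.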
-
Sparse Regression LDPC Codes
Authors:
Jamison R. Ebert,
Jean-Francois Chamberland,
Krishna R. Narayanan
Abstract:
This article introduces a novel concatenated coding scheme called sparse regression LDPC (SR-LDPC) codes. An SR-LDPC code consists of an outer non-binary LDPC code and an inner sparse regression code (SPARC) whose respective field size and section sizes are equal. For such codes, an efficient decoding algorithm is proposed based on approximate message passing (AMP) that dynamically shares soft information between inner and outer decoders. This dynamic exchange of information is facilitated by a denoiser that runs belief propagation (BP) on the factor graph of the outer LDPC code within each AMP iteration. It is shown that this denoiser falls within the class of non-separable pseudo-Lipschitz denoising functions and thus that state evolution holds for the proposed AMP-BP algorithm. Leveraging the rich structure of SR-LDPC codes, this article proposes an efficient low-dimensional approximate state evolution recursion that can be used for efficient hyperparameter tuning, thus paving the way for future work on optimal code design. Finally, numerical simulations demonstrate that SR-LDPC codes outperform contemporary codes over the AWGN channel for parameters of practical interest. SR-LDPC codes are shown to be viable means to obtain shaping gains over the AWGN channel.
Submitted 13 November, 2023;
originally announced November 2023.
-
Tokenizer Choice For LLM Training: Negligible or Crucial?
Authors:
Mehdi Ali,
Michael Fromm,
Klaudia Thellmann,
Richard Rutmann,
Max Lübbering,
Johannes Leveling,
Katrin Klug,
Jan Ebert,
Niclas Doll,
Jasper Schulze Buschhoff,
Charvi Jain,
Alexander Arno Weber,
Lena Jurkschat,
Hammam Abdelwahab,
Chelsea John,
Pedro Ortiz Suarez,
Malte Ostendorff,
Samuel Weinbach,
Rafet Sifa,
Stefan Kesselheim,
Nicolas Flores-Herr
Abstract:
The recent success of Large Language Models (LLMs) has been predominantly driven by curating the training dataset composition, scaling of model architectures and dataset sizes and advancements in pretraining objectives, leaving tokenizer influence as a blind spot. Shedding light on this underexplored area, we conduct a comprehensive study on the influence of tokenizer choice on LLM downstream performance by training 24 mono- and multilingual LLMs at a 2.6B parameter scale, ablating different tokenizer algorithms and parameterizations. Our studies highlight that the tokenizer choice can significantly impact the model's downstream performance and training costs. In particular, we find that the common tokenizer evaluation metrics fertility and parity are not always predictive of model downstream performance, rendering these metrics a questionable proxy for the model's downstream performance. Furthermore, we show that multilingual tokenizers trained on the five most frequent European languages require vocabulary size increases by a factor of three in comparison to English. While English-centric tokenizers have been applied to the training of multilingual LLMs in the past, we find that this approach results in a severe downstream performance degradation and additional training costs of up to 68%, due to an inefficient tokenization vocabulary.
Submitted 17 March, 2024; v1 submitted 12 October, 2023;
originally announced October 2023.
-
Physics informed Neural Networks applied to the description of wave-particle resonance in kinetic simulations of fusion plasmas
Authors:
Jai Kumar,
David Zarzoso,
Virginie Grandgirard,
Jan Ebert,
Stefan Kesselheim
Abstract:
The Vlasov-Poisson system is employed in its reduced form (1D1V) as a test bed for the applicability of Physics Informed Neural Networks (PINNs) to wave-particle resonance. Two examples are explored: Landau damping and the bump-on-tail instability. PINN is first tested as a compression method for the solution of the Vlasov-Poisson system and compared to standard neural networks. Second, the application of PINN to solving the Vlasov-Poisson system is presented with special emphasis on the integral part, which motivates the implementation of a PINN variant, called Integrable PINN (I-PINN), based on automatic differentiation to solve the partial differential equation and on automatic integration to solve the integral equation.
Submitted 23 August, 2023;
originally announced August 2023.
-
StarCoder: may the source be with you!
Authors:
Raymond Li,
Loubna Ben Allal,
Yangtian Zi,
Niklas Muennighoff,
Denis Kocetkov,
Chenghao Mou,
Marc Marone,
Christopher Akiki,
Jia Li,
Jenny Chim,
Qian Liu,
Evgenii Zheltonozhskii,
Terry Yue Zhuo,
Thomas Wang,
Olivier Dehaene,
Mishig Davaadorj,
Joel Lamy-Poirier,
João Monteiro,
Oleh Shliazhko,
Nicolas Gontier,
Nicholas Meade,
Armel Zebaze,
Ming-Ho Yee,
Logesh Kumar Umapathi,
Jian Zhu,
et al. (42 additional authors not shown)
Abstract:
The BigCode community, an open-scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder and StarCoderBase: 15.5B parameter models with 8K context length, infilling capabilities and fast large-batch inference enabled by multi-query attention. StarCoderBase is trained on 1 trillion tokens sourced from The Stack, a large collection of permissively licensed GitHub repositories with inspection tools and an opt-out process. We fine-tuned StarCoderBase on 35B Python tokens, resulting in the creation of StarCoder. We perform the most comprehensive evaluation of Code LLMs to date and show that StarCoderBase outperforms every open Code LLM that supports multiple programming languages and matches or outperforms the OpenAI code-cushman-001 model. Furthermore, StarCoder outperforms every model that is fine-tuned on Python, can be prompted to achieve 40% pass@1 on HumanEval, and still retains its performance on other programming languages. We take several important steps towards a safe open-access model release, including an improved PII redaction pipeline and a novel attribution tracing tool, and make the StarCoder models publicly available under a more commercially viable version of the Open Responsible AI Model license.
Submitted 13 December, 2023; v1 submitted 9 May, 2023;
originally announced May 2023.
-
On Sparse Regression LDPC Codes
Authors:
Jamison R. Ebert,
Jean-Francois Chamberland,
Krishna R. Narayanan
Abstract:
Belief propagation applied to iterative decoding and sparse recovery through approximate message passing (AMP) are two research areas that have seen monumental progress in recent decades. Inspired by these advances, this article introduces sparse regression LDPC codes and their decoding. Sparse regression codes (SPARCs) are a class of error correcting codes that build on ideas from compressed sensing and can be decoded using AMP. In certain settings, SPARCs are known to achieve capacity; yet, their performance suffers at finite block lengths. Likewise, LDPC codes can be decoded efficiently using belief propagation and can also be capacity achieving. This article introduces a novel concatenated coding structure that combines an LDPC outer code with a SPARC-inspired inner code. Efficient decoding for such a code can be achieved using AMP with a denoiser that performs belief propagation on the factor graph of the outer LDPC code. The proposed framework exhibits performance improvements over SPARCs and standard LDPC codes for finite block lengths and results in a steep waterfall in error performance, a phenomenon not observed in uncoded SPARCs. Findings are supported by numerical results.
Submitted 4 January, 2023;
originally announced January 2023.
-
Hearts Gym: Learning Reinforcement Learning as a Team Event
Authors:
Jan Ebert,
Danimir T. Doncevic,
Ramona Kloß,
Stefan Kesselheim
Abstract:
Amidst the COVID-19 pandemic, the authors of this paper organized a Reinforcement Learning (RL) course for a graduate school in the field of data science. We describe the strategy and materials for creating an exciting learning experience despite the ubiquitous Zoom fatigue and evaluate the course qualitatively. The key organizational features are a focus on a competitive hands-on setting in teams, supported by a minimum of lectures providing the essential background on RL. The practical part of the course revolved around Hearts Gym, an RL environment for the card game Hearts that we developed as an entry-level tutorial to RL. Participants were tasked with training agents to explore reward shaping and other RL hyperparameters. For a final evaluation, the agents of the participants competed against each other.
Submitted 7 September, 2022;
originally announced September 2022.
-
HashBeam: Enabling Feedback Through Downlink Beamforming in Unsourced Random Access
Authors:
Jamison R. Ebert,
Krishna R. Narayanan,
Jean-Francois Chamberland
Abstract:
Unsourced random access (URA) has emerged as a candidate paradigm for massive machine-type communication (MTC) in next-generation wireless networks. While many excellent uplink schemes have been developed for URA, these schemes do not specify a mechanism for providing feedback regarding whether a user's message was successfully decoded. While this may be acceptable in some MTC scenarios, the lack of feedback is inadequate for applications that demand a high level of reliability. However, the problem of providing feedback to active users is complicated by the fact that the base station does not know the identities of the active users. In this paper, a novel downlink beamforming scheme called HashBeam is presented that enables the base station to provide feedback to the active users within URA, despite not knowing their identities. The key idea of this scheme is that the users' channels and hashes of their messages may be used as proxies for their true identities. The proposed scheme may be adapted to any number of antennas at the base station and it is shown that the required number of channel uses is linear in the number of users to acknowledge. The idea of using channel gains in conjunction with user hashes as discriminating attributes of active users is novel and expands the design space of URA schemes.
Submitted 3 June, 2022;
originally announced June 2022.
-
Coded Demixing for Unsourced Random Access
Authors:
Jamison R. Ebert,
Vamsi K. Amalladinne,
Stefano Rini,
Jean-Francois Chamberland,
Krishna R. Narayanan
Abstract:
Unsourced random access (URA) is a recently proposed multiple access paradigm tailored to the uplink channel of machine-type communication networks. By exploiting a strong connection between URA and compressed sensing, the massive multiple access problem may be cast as a compressed sensing (CS) problem, albeit one in exceedingly large dimensions. To efficiently handle the dimensionality of the problem, coded compressed sensing (CCS) has emerged as a pragmatic signal processing tool that, when applied to URA, offers good performance at low complexity. While CCS is effective at recovering a signal that is sparse with respect to a single basis, it is unable to jointly recover signals that are sparse with respect to separate bases. In this article, the CCS framework is extended to the demixing setting, yielding a novel technique called coded demixing. A generalized framework for coded demixing is presented and a low-complexity recovery algorithm based on approximate message passing (AMP) is developed. Coded demixing is applied to heterogeneous multi-class URA networks and traditional single-class networks. Its performance is analyzed and numerical simulations are presented to highlight the benefits of coded demixing.
Submitted 27 June, 2022; v1 submitted 1 March, 2022;
originally announced March 2022.
-
An Enhanced Decoding Algorithm for Coded Compressed Sensing with Applications to Unsourced Random Access
Authors:
Vamsi K. Amalladinne,
Jamison R. Ebert,
Jean-Francois Chamberland,
Krishna R. Narayanan
Abstract:
Unsourced random access (URA) has emerged as a pragmatic framework for next-generation distributed sensor networks. Within URA, concatenated coding structures are often employed to ensure that the central base station can accurately recover the set of sent codewords during a given transmission period. Many URA algorithms employ independent inner and outer decoders, which can help reduce computational complexity at the expense of a decay in performance. In this article, an enhanced decoding algorithm is presented for a concatenated coding structure consisting of a wide range of inner codes and an outer tree-based code. It is shown that this algorithmic enhancement has the potential to simultaneously improve error performance and decrease the computational complexity of the decoder. This enhanced decoding algorithm is applied to two existing URA algorithms and the performance benefits of the algorithm are characterized. Findings are supported by numerical simulations.
Submitted 30 November, 2021;
originally announced December 2021.
-
JUWELS Booster -- A Supercomputer for Large-Scale AI Research
Authors:
Stefan Kesselheim,
Andreas Herten,
Kai Krajsek,
Jan Ebert,
Jenia Jitsev,
Mehdi Cherti,
Michael Langguth,
Bing Gong,
Scarlet Stadtler,
Amirpasha Mozaffari,
Gabriele Cavallaro,
Rocco Sedona,
Alexander Schug,
Alexandre Strube,
Roshni Kamath,
Martin G. Schultz,
Morris Riedel,
Thomas Lippert
Abstract:
In this article, we present JUWELS Booster, a recently commissioned high-performance computing system at the Jülich Supercomputing Center. With its system architecture, most importantly its large number of powerful Graphics Processing Units (GPUs) and its fast interconnect via InfiniBand, it is an ideal machine for large-scale Artificial Intelligence (AI) research and applications. We detail its system architecture, parallel, distributed model training, and benchmarks indicating its outstanding performance. We exemplify its potential for research application by presenting large-scale AI research highlights from various scientific fields that require such a facility.
Submitted 30 June, 2021;
originally announced August 2021.
-
Stochastic Binning and Coded Demixing for Unsourced Random Access
Authors:
Jamison R. Ebert,
Vamsi K. Amalladinne,
Stefano Rini,
Jean-Francois Chamberland,
Krishna R. Narayanan
Abstract:
Unsourced random access is a novel communication paradigm designed for handling a large number of uncoordinated users that sporadically transmit very short messages. Under this model, coded compressed sensing (CCS) has emerged as a low-complexity scheme that exhibits good error performance. Yet, one of the challenges faced by CCS pertains to disentangling a large number of codewords present on a single factor graph. To mitigate this issue, this article introduces a modified CCS scheme whereby active devices stochastically partition themselves into groups that utilize separate sampling matrices with low cross-coherence for message transmission. At the receiver, ideas from the field of compressed demixing are employed for support recovery, and separate factor graphs are created for message disambiguation in each cluster. This reduces the number of active users on a factor graph, which improves performance significantly in typical scenarios. Indeed, coded demixing reduces the probability of error as the number of groups increases, up to a point. Findings are supported with numerical simulations.
Submitted 21 July, 2021; v1 submitted 12 April, 2021;
originally announced April 2021.
-
A Hybrid Approach to Coded Compressed Sensing where Coupling Takes Place via the Outer Code
Authors:
Jamison R. Ebert,
Vamsi K. Amalladinne,
Jean-Francois Chamberland,
Krishna R. Narayanan
Abstract:
This article seeks to advance coded compressed sensing (CCS) as a practical scheme for unsourced random access. The original CCS algorithm features a concatenated structure where an inner code is tasked with support recovery, and an outer tree code conducts message disambiguation. Recently, a link between CCS and sparse regression codes was established, leading to the application of approximate message passing (AMP) to CCS. This connection was subsequently strengthened by integrating AMP and belief propagation on the outer code through a dynamic denoiser. Along these lines, this work shows how block diagonal sensing matrices akin to those used in traditional CCS, together with the aforementioned dynamic denoiser, form an effective means to get good performance at low complexity. This novel architecture can be used to scale this scheme to dimensions that were previously impractical. Findings are supported by numerical simulations.
Submitted 21 October, 2020;
originally announced October 2020.