subscribe to arXiv mailings

arXiv:2405.20551 [pdf, other]

doi 10.1145/3663529.3663803

EM-Assist: Safe Automated ExtractMethod Refactoring with LLMs

Authors: Dorin Pomian, Abhiram Bellur, Malinda Dilhara, Zarina Kurbatova, Egor Bogomolov, Andrey Sokolov, Timofey Bryksin, Danny Dig

Abstract: Excessively long methods, loaded with multiple responsibilities, are challenging to understand, debug, reuse, and maintain. The solution lies in the widely recognized Extract Method refactoring. While the application of this refactoring is supported in modern IDEs, recommending which code fragments to extract has been the topic of many research tools. However, they often struggle to replicate real… ▽ More Excessively long methods, loaded with multiple responsibilities, are challenging to understand, debug, reuse, and maintain. The solution lies in the widely recognized Extract Method refactoring. While the application of this refactoring is supported in modern IDEs, recommending which code fragments to extract has been the topic of many research tools. However, they often struggle to replicate real-world developer practices, resulting in recommendations that do not align with what a human developer would do in real life. To address this issue, we introduce EM-Assist, an IntelliJ IDEA plugin that uses LLMs to generate refactoring suggestions and subsequently validates, enhances, and ranks them. Finally, EM-Assist uses the IntelliJ IDE to apply the user-selected recommendation. In our extensive evaluation of 1,752 real-world refactorings that actually took place in open-source projects, EM-Assist's recall rate was 53.4% among its top-5 recommendations, compared to 39.4% for the previous best-in-class tool that relies solely on static analysis. Moreover, we conducted a usability survey with 18 industrial developers and 94.4% gave a positive rating. △ Less

Submitted 30 May, 2024; originally announced May 2024.

Comments: This paper is accepted to the tool demonstration track of the 32nd ACM Symposium on the Foundations of Software Engineering (FSE 2024). This is an author copy

arXiv:2405.18450 [pdf, other]

Distance based prefetching algorithms for mining of the sporadic requests associations

Authors: Vadim Voevodkin, Andrey Sokolov

Abstract: Modern storage systems intensively utilize data prefetching algorithms while processing sequences of the read requests. Performance of the prefetching algorithm (for instance increase of the cache hit ratio of the cache system - CHR) directly affects overall performance characteristics of the storage system (read latency, IOPS, etc.). There are widely known prefetching algorithms that are focused… ▽ More Modern storage systems intensively utilize data prefetching algorithms while processing sequences of the read requests. Performance of the prefetching algorithm (for instance increase of the cache hit ratio of the cache system - CHR) directly affects overall performance characteristics of the storage system (read latency, IOPS, etc.). There are widely known prefetching algorithms that are focused on the discovery of the sequential patterns in the stream of requests. This study examines a family of prefetching algorithms that is focused on mining of the pseudo random (sporadic) patterns between read requests - sporadic prefetching algorithms. The key contribution of this paper is that it discovers a new, lightweight family of distance-based sporadic prefetching algorithms (DBSP) that outperforms the best previously known results on MSR traces collection.Another important contribution of this paper is a thorough description of the procedure for comparing the performance of sporadic prefetchers. △ Less

Submitted 13 June, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

arXiv:2307.08426 [pdf, other]

Improving End-to-End Speech Translation by Imitation-Based Knowledge Distillation with Synthetic Transcripts

Authors: Rebekka Hubert, Artem Sokolov, Stefan Riezler

Abstract: End-to-end automatic speech translation (AST) relies on data that combines audio inputs with text translation outputs. Previous work used existing large parallel corpora of transcriptions and translations in a knowledge distillation (KD) setup to distill a neural machine translation (NMT) into an AST student model. While KD allows using larger pretrained models, the reliance of previous KD approac… ▽ More End-to-end automatic speech translation (AST) relies on data that combines audio inputs with text translation outputs. Previous work used existing large parallel corpora of transcriptions and translations in a knowledge distillation (KD) setup to distill a neural machine translation (NMT) into an AST student model. While KD allows using larger pretrained models, the reliance of previous KD approaches on manual audio transcripts in the data pipeline restricts the applicability of this framework to AST. We present an imitation learning approach where a teacher NMT system corrects the errors of an AST student without relying on manual transcripts. We show that the NMT teacher can recover from errors in automatic transcriptions and is able to correct erroneous translations of the AST student, leading to improvements of about 4 BLEU points over the standard AST end-to-end baseline on the English-German CoVoST-2 and MuST-C datasets, respectively. Code and data are publicly available.\footnote{\url{https://github.com/HubReb/imitkd_ast/releases/tag/v1.1}} △ Less

Submitted 17 July, 2023; originally announced July 2023.

Comments: IWSLT 2023, corrected version

Journal ref: In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), pages 89-101

arXiv:2304.10611 [pdf, other]

Joint Repetition Suppression and Content Moderation of Large Language Models

Authors: Minghui Zhang, Alex Sokolov, Weixin Cai, Si-Qing Chen

Abstract: Natural language generation (NLG) is one of the most impactful fields in NLP, and recent years have witnessed its evolution brought about by large language models (LLMs). As the key instrument for writing assistance applications, they are generally prone to replicating or extending offensive content provided in the input. In low-resource data regime, they can also lead to repetitive outputs. Usual… ▽ More Natural language generation (NLG) is one of the most impactful fields in NLP, and recent years have witnessed its evolution brought about by large language models (LLMs). As the key instrument for writing assistance applications, they are generally prone to replicating or extending offensive content provided in the input. In low-resource data regime, they can also lead to repetitive outputs. Usually, offensive content and repetitions are mitigated with post-hoc methods, including n-gram level blocklists, top-k and nucleus sampling. In this paper, we apply non-exact repetition suppression using token and sequence level unlikelihood loss, and further explore the framework of unlikelihood training objective in order to jointly endow the model with abilities to avoid generating offensive words and phrases from the beginning. Finally, with comprehensive experiments, we demonstrate that our proposed methods work exceptionally in controlling the repetition and content quality of LLM outputs. △ Less

Submitted 5 June, 2023; v1 submitted 20 April, 2023; originally announced April 2023.

arXiv:2304.08637 [pdf, other]

doi 10.1016/j.nlp.2023.100024

An Evaluation on Large Language Model Outputs: Discourse and Memorization

Authors: Adrian de Wynter, Xun Wang, Alex Sokolov, Qilong Gu, Si-Qing Chen

Abstract: We present an empirical evaluation of various outputs generated by nine of the most widely-available large language models (LLMs). Our analysis is done with off-the-shelf, readily-available tools. We find a correlation between percentage of memorized text, percentage of unique text, and overall output quality, when measured with respect to output pathologies such as counterfactual and logically-fl… ▽ More We present an empirical evaluation of various outputs generated by nine of the most widely-available large language models (LLMs). Our analysis is done with off-the-shelf, readily-available tools. We find a correlation between percentage of memorized text, percentage of unique text, and overall output quality, when measured with respect to output pathologies such as counterfactual and logically-flawed statements, and general failures like not staying on topic. Overall, 80.0% of the outputs evaluated contained memorized data, but outputs containing the most memorized content were also more likely to be considered of high quality. We discuss and evaluate mitigation strategies, showing that, in the models evaluated, the rate of memorized text being output is reduced. We conclude with a discussion on potential implications around what it means to learn, to memorize, and to evaluate quality text. △ Less

Submitted 17 April, 2023; originally announced April 2023.

Comments: Preprint. Under review

arXiv:2302.13959 [pdf, other]

Make Every Example Count: On the Stability and Utility of Self-Influence for Learning from Noisy NLP Datasets

Authors: Irina Bejan, Artem Sokolov, Katja Filippova

Abstract: Increasingly larger datasets have become a standard ingredient to advancing the state-of-the-art in NLP. However, data quality might have already become the bottleneck to unlock further gains. Given the diversity and the sizes of modern datasets, standard data filtering is not straight-forward to apply, because of the multifacetedness of the harmful data and elusiveness of filtering rules that wou… ▽ More Increasingly larger datasets have become a standard ingredient to advancing the state-of-the-art in NLP. However, data quality might have already become the bottleneck to unlock further gains. Given the diversity and the sizes of modern datasets, standard data filtering is not straight-forward to apply, because of the multifacetedness of the harmful data and elusiveness of filtering rules that would generalize across multiple tasks. We study the fitness of task-agnostic self-influence scores of training examples for data cleaning, analyze their efficacy in capturing naturally occurring outliers, and investigate to what extent self-influence based data cleaning can improve downstream performance in machine translation, question answering and text classification, building up on recent approaches to self-influence calculation and automated curriculum learning. △ Less

Submitted 17 October, 2023; v1 submitted 27 February, 2023; originally announced February 2023.

Comments: Published at EMNLP 2023

arXiv:2207.09821 [pdf]

Journal Impact Factor and Peer Review Thoroughness and Helpfulness: A Supervised Machine Learning Study

Authors: Anna Severin, Michaela Strinzel, Matthias Egger, Tiago Barros, Alexander Sokolov, Julia Vilstrup Mouatt, Stefan Müller

Abstract: The journal impact factor (JIF) is often equated with journal quality and the quality of the peer review of the papers submitted to the journal. We examined the association between the content of peer review and JIF by analysing 10,000 peer review reports submitted to 1,644 medical and life sciences journals. Two researchers hand-coded a random sample of 2,000 sentences. We then trained machine le… ▽ More The journal impact factor (JIF) is often equated with journal quality and the quality of the peer review of the papers submitted to the journal. We examined the association between the content of peer review and JIF by analysing 10,000 peer review reports submitted to 1,644 medical and life sciences journals. Two researchers hand-coded a random sample of 2,000 sentences. We then trained machine learning models to classify all 187,240 sentences as contributing or not contributing to content categories. We examined the association between ten groups of journals defined by JIF deciles and the content of peer reviews using linear mixed-effects models, adjusting for the length of the review. The JIF ranged from 0.21 to 74.70. The length of peer reviews increased from the lowest (median number of words 185) to the JIF group (387 words). The proportion of sentences allocated to different content categories varied widely, even within JIF groups. For thoroughness, sentences on 'Materials and Methods' were more common in the highest JIF journals than in the lowest JIF group (difference of 7.8 percentage points; 95% CI 4.9 to 10.7%). The trend for 'Presentation and Reporting' went in the opposite direction, with the highest JIF journals giving less emphasis to such content (difference -8.9%; 95% CI -11.3 to -6.5%). For helpfulness, reviews for higher JIF journals devoted less attention to 'Suggestion and Solution' and provided fewer Examples than lower impact factor journals. No, or only small differences were evident for other content categories. In conclusion, peer review in journals with higher JIF tends to be more thorough in discussing the methods used but less helpful in terms of suggesting solutions and providing examples. Differences were modest and variability high, indicating that the JIF is a bad predictor for the quality of peer review of an individual manuscript. △ Less

Submitted 19 August, 2022; v1 submitted 20 July, 2022; originally announced July 2022.

Comments: 44 pages

arXiv:2204.09716 [pdf, ps, other]

Domain Specific Fine-tuning of Denoising Sequence-to-Sequence Models for Natural Language Summarization

Authors: Brydon Parker, Alik Sokolov, Mahtab Ahmed, Matt Kalebic, Sedef Akinli Kocak, Ofer Shai

Abstract: Summarization of long-form text data is a problem especially pertinent in knowledge economy jobs such as medicine and finance, that require continuously remaining informed on a sophisticated and evolving body of knowledge. As such, isolating and summarizing key content automatically using Natural Language Processing (NLP) techniques holds the potential for extensive time savings in these industrie… ▽ More Summarization of long-form text data is a problem especially pertinent in knowledge economy jobs such as medicine and finance, that require continuously remaining informed on a sophisticated and evolving body of knowledge. As such, isolating and summarizing key content automatically using Natural Language Processing (NLP) techniques holds the potential for extensive time savings in these industries. We explore applications of a state-of-the-art NLP model (BART), and explore strategies for tuning it to optimal performance using data augmentation and various fine-tuning strategies. We show that our end-to-end fine-tuning approach can result in a 5-6\% absolute ROUGE-1 improvement over an out-of-the-box pre-trained BART summarizer when tested on domain specific data, and make available our end-to-end pipeline to achieve these results on finance, medical, or other user-specified domains. △ Less

Submitted 6 April, 2022; originally announced April 2022.

Comments: 8 pages, 6 figures

ACM Class: I.2.7

arXiv:2202.06045 [pdf, other]

USTED: Improving ASR with a Unified Speech and Text Encoder-Decoder

Authors: Bolaji Yusuf, Ankur Gandhe, Alex Sokolov

Abstract: Improving end-to-end speech recognition by incorporating external text data has been a longstanding research topic. There has been a recent focus on training E2E ASR models that get the performance benefits of external text data without incurring the extra cost of evaluating an external language model at inference time. In this work, we propose training ASR model jointly with a set of text-to-text… ▽ More Improving end-to-end speech recognition by incorporating external text data has been a longstanding research topic. There has been a recent focus on training E2E ASR models that get the performance benefits of external text data without incurring the extra cost of evaluating an external language model at inference time. In this work, we propose training ASR model jointly with a set of text-to-text auxiliary tasks with which it shares a decoder and parts of the encoder. When we jointly train ASR and masked language model with the 960-hour Librispeech and Opensubtitles data respectively, we observe WER reductions of 16% and 20% on test-other and test-clean respectively over an ASR-only baseline without any extra cost at inference time, and reductions of 6% and 8% compared to a stronger MUTE-L baseline which trains the decoder with the same text data as our model. We achieve further improvements when we train masked language model on Librispeech data or when we use machine translation as the auxiliary task, without significantly sacrificing performance on the task itself. △ Less

Submitted 12 February, 2022; originally announced February 2022.

Comments: 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2022)

arXiv:2112.03052 [pdf, other]

Scaling Up Influence Functions

Authors: Andrea Schioppa, Polina Zablotskaia, David Vilar, Artem Sokolov

Abstract: We address efficient calculation of influence functions for tracking predictions back to the training data. We propose and analyze a new approach to speeding up the inverse Hessian calculation based on Arnoldi iteration. With this improvement, we achieve, to the best of our knowledge, the first successful implementation of influence functions that scales to full-size (language and vision) Transfor… ▽ More We address efficient calculation of influence functions for tracking predictions back to the training data. We propose and analyze a new approach to speeding up the inverse Hessian calculation based on Arnoldi iteration. With this improvement, we achieve, to the best of our knowledge, the first successful implementation of influence functions that scales to full-size (language and vision) Transformer models with several hundreds of millions of parameters. We evaluate our approach on image classification and sequence-to-sequence tasks with tens to a hundred of millions of training examples. Our code will be available at https://github.com/google-research/jax-influence. △ Less

Submitted 6 December, 2021; originally announced December 2021.

Comments: Published at AAAI-22

arXiv:2110.06997 [pdf, other]

Bandits Don't Follow Rules: Balancing Multi-Facet Machine Translation with Multi-Armed Bandits

Authors: Julia Kreutzer, David Vilar, Artem Sokolov

Abstract: Training data for machine translation (MT) is often sourced from a multitude of large corpora that are multi-faceted in nature, e.g. containing contents from multiple domains or different levels of quality or complexity. Naturally, these facets do not occur with equal frequency, nor are they equally important for the test scenario at hand. In this work, we propose to optimize this balance jointly… ▽ More Training data for machine translation (MT) is often sourced from a multitude of large corpora that are multi-faceted in nature, e.g. containing contents from multiple domains or different levels of quality or complexity. Naturally, these facets do not occur with equal frequency, nor are they equally important for the test scenario at hand. In this work, we propose to optimize this balance jointly with MT model parameters to relieve system developers from manual schedule design. A multi-armed bandit is trained to dynamically choose between facets in a way that is most beneficial for the MT system. We evaluate it on three different multi-facet applications: balancing translationese and natural training data, or data from multiple domains or multiple language pairs. We find that bandit learning leads to competitive MT systems across tasks, and our analysis provides insights into its learned strategies and the underlying data sets. △ Less

Submitted 13 October, 2021; originally announced October 2021.

Comments: EMNLP Findings 2021

arXiv:2109.07926 [pdf, other]

Don't Search for a Search Method -- Simple Heuristics Suffice for Adversarial Text Attacks

Authors: Nathaniel Berger, Stefan Riezler, Artem Sokolov, Sebastian Ebert

Abstract: Recently more attention has been given to adversarial attacks on neural networks for natural language processing (NLP). A central research topic has been the investigation of search algorithms and search constraints, accompanied by benchmark algorithms and tasks. We implement an algorithm inspired by zeroth order optimization-based attacks and compare with the benchmark results in the TextAttack f… ▽ More Recently more attention has been given to adversarial attacks on neural networks for natural language processing (NLP). A central research topic has been the investigation of search algorithms and search constraints, accompanied by benchmark algorithms and tasks. We implement an algorithm inspired by zeroth order optimization-based attacks and compare with the benchmark results in the TextAttack framework. Surprisingly, we find that optimization-based methods do not yield any improvement in a constrained setup and slightly benefit from approximate gradient information only in unconstrained setups where search spaces are larger. In contrast, simple heuristics exploiting nearest neighbors without querying the target function yield substantial success rates in constrained setups, and nearly full success rate in unconstrained setups, at an order of magnitude fewer queries. We conclude from these results that current TextAttack benchmark tasks are too easy and constraints are too strict, preventing meaningful research on black-box adversarial text attacks. △ Less

Submitted 4 October, 2021; v1 submitted 16 September, 2021; originally announced September 2021.

Comments: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP Main Conference)

arXiv:2109.04114 [pdf, other]

Fixing exposure bias with imitation learning needs powerful oracles

Authors: Luca Hormann, Artem Sokolov

Abstract: We apply imitation learning (IL) to tackle the NMT exposure bias problem with error-correcting oracles, and evaluate an SMT lattice-based oracle which, despite its excellent performance in an unconstrained oracle translation task, turned out to be too pruned and idiosyncratic to serve as the oracle for IL. We apply imitation learning (IL) to tackle the NMT exposure bias problem with error-correcting oracles, and evaluate an SMT lattice-based oracle which, despite its excellent performance in an unconstrained oracle translation task, turned out to be too pruned and idiosyncratic to serve as the oracle for IL. △ Less

Submitted 17 September, 2021; v1 submitted 9 September, 2021; originally announced September 2021.

arXiv:2104.10121 [pdf, other]

On the Impact of Word Error Rate on Acoustic-Linguistic Speech Emotion Recognition: An Update for the Deep Learning Era

Authors: Shahin Amiriparian, Artem Sokolov, Ilhan Aslan, Lukas Christ, Maurice Gerczuk, Tobias Hübner, Dmitry Lamanov, Manuel Milling, Sandra Ottl, Ilya Poduremennykh, Evgeniy Shuranov, Björn W. Schuller

Abstract: Text encodings from automatic speech recognition (ASR) transcripts and audio representations have shown promise in speech emotion recognition (SER) ever since. Yet, it is challenging to explain the effect of each information stream on the SER systems. Further, more clarification is required for analysing the impact of ASR's word error rate (WER) on linguistic emotion recognition per se and in the… ▽ More Text encodings from automatic speech recognition (ASR) transcripts and audio representations have shown promise in speech emotion recognition (SER) ever since. Yet, it is challenging to explain the effect of each information stream on the SER systems. Further, more clarification is required for analysing the impact of ASR's word error rate (WER) on linguistic emotion recognition per se and in the context of fusion with acoustic information exploitation in the age of deep ASR systems. In order to tackle the above issues, we create transcripts from the original speech by applying three modern ASR systems, including an end-to-end model trained with recurrent neural network-transducer loss, a model with connectionist temporal classification loss, and a wav2vec framework for self-supervised learning. Afterwards, we use pre-trained textual models to extract text representations from the ASR outputs and the gold standard. For extraction and learning of acoustic speech features, we utilise openSMILE, openXBoW, DeepSpectrum, and auDeep. Finally, we conduct decision-level fusion on both information streams -- acoustics and linguistics. Using the best development configuration, we achieve state-of-the-art unweighted average recall values of $73.6\,\%$ and $73.8\,\%$ on the speaker-independent development and test partitions of IEMOCAP, respectively. △ Less

Submitted 20 April, 2021; originally announced April 2021.

Comments: 5 pages, 1 figure

ACM Class: I.2.7; I.5.0

arXiv:2104.01923 [pdf, other]

Real-time Streaming Wave-U-Net with Temporal Convolutions for Multichannel Speech Enhancement

Authors: Vasiliy Kuzmin, Fyodor Kravchenko, Artem Sokolov, Jie Geng

Abstract: In this paper, we describe the work that we have done to participate in Task1 of the ConferencingSpeech2021 challenge. This task set a goal to develop the solution for multi-channel speech enhancement in a real-time manner. We propose a novel system for streaming speech enhancement. We employ Wave-U-Net architecture with temporal convolutions in encoder and decoder. We incorporate self-attention i… ▽ More In this paper, we describe the work that we have done to participate in Task1 of the ConferencingSpeech2021 challenge. This task set a goal to develop the solution for multi-channel speech enhancement in a real-time manner. We propose a novel system for streaming speech enhancement. We employ Wave-U-Net architecture with temporal convolutions in encoder and decoder. We incorporate self-attention in the decoder to apply attention mask retrieved from skip-connection on features from down-blocks. We explore history cache mechanisms that work like hidden states in recurrent networks and implemented them in proposal solution. It helps us to run an inference with chunks length 40ms and Real-Time Factor 0.4 with the same precision. △ Less

Submitted 5 April, 2021; originally announced April 2021.

Comments: Draft paper for InterSpeech 2021 processing

arXiv:2103.12028 [pdf, other]

doi 10.1162/tacl_a_00447

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

Authors: Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Ortiz Suarez, Iroro Orife, Kelechi Ogueji, Andre Niyongabo Rubungo, Toan Q. Nguyen, Mathias Müller, André Müller , et al. (27 additional authors not shown)

Abstract: With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have system… ▽ More With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases. △ Less

Submitted 21 February, 2022; v1 submitted 22 March, 2021; originally announced March 2021.

Comments: Accepted at TACL; pre-MIT Press publication version

Journal ref: Transactions of the Association for Computational Linguistics (2022) 10: 50-72

arXiv:2006.14223 [pdf, other]

Neural Machine Translation For Paraphrase Generation

Authors: Alex Sokolov, Denis Filimonov

Abstract: Training a spoken language understanding system, as the one in Alexa, typically requires a large human-annotated corpus of data. Manual annotations are expensive and time consuming. In Alexa Skill Kit (ASK) user experience with the skill greatly depends on the amount of data provided by skill developer. In this work, we present an automatic natural language generation system, capable of generating… ▽ More Training a spoken language understanding system, as the one in Alexa, typically requires a large human-annotated corpus of data. Manual annotations are expensive and time consuming. In Alexa Skill Kit (ASK) user experience with the skill greatly depends on the amount of data provided by skill developer. In this work, we present an automatic natural language generation system, capable of generating both human-like interactions and annotations by the means of paraphrasing. Our approach consists of machine translation (MT) inspired encoder-decoder deep recurrent neural network. We evaluate our model on the impact it has on ASK skill, intent, named entity classification accuracy and sentence level coverage, all of which demonstrate significant improvements for unseen skills on natural language understanding (NLU) models, trained on the data augmented with paraphrases. △ Less

Submitted 25 June, 2020; originally announced June 2020.

Comments: Published in NIPS 2018: 2nd Conversational AI workshop

arXiv:2006.14194 [pdf, other]

Neural Machine Translation for Multilingual Grapheme-to-Phoneme Conversion

Authors: Alex Sokolov, Tracy Rohlin, Ariya Rastrow

Abstract: Grapheme-to-phoneme (G2P) models are a key component in Automatic Speech Recognition (ASR) systems, such as the ASR system in Alexa, as they are used to generate pronunciations for out-of-vocabulary words that do not exist in the pronunciation lexicons (mappings like "e c h o" to "E k oU"). Most G2P systems are monolingual and based on traditional joint-sequence based n-gram models [1,2]. As an al… ▽ More Grapheme-to-phoneme (G2P) models are a key component in Automatic Speech Recognition (ASR) systems, such as the ASR system in Alexa, as they are used to generate pronunciations for out-of-vocabulary words that do not exist in the pronunciation lexicons (mappings like "e c h o" to "E k oU"). Most G2P systems are monolingual and based on traditional joint-sequence based n-gram models [1,2]. As an alternative, we present a single end-to-end trained neural G2P model that shares same encoder and decoder across multiple languages. This allows the model to utilize a combination of universal symbol inventories of Latin-like alphabets and cross-linguistically shared feature representations. Such model is especially useful in the scenarios of low resource languages and code switching/foreign words, where the pronunciations in one language need to be adapted to other locales or accents. We further experiment with word language distribution vector as an additional training target in order to improve system performance by helping the model decouple pronunciations across a variety of languages in the parameter space. We show 7.2% average improvement in phoneme error rate over low resource languages and no degradation over high resource ones compared to monolingual baselines. △ Less

Submitted 28 June, 2020; v1 submitted 25 June, 2020; originally announced June 2020.

Comments: Published in INTERSPEECH (2019)

arXiv:2006.01759 [pdf, other]

Sparse Perturbations for Improved Convergence in Stochastic Zeroth-Order Optimization

Authors: Mayumi Ohta, Nathaniel Berger, Artem Sokolov, Stefan Riezler

Abstract: Interest in stochastic zeroth-order (SZO) methods has recently been revived in black-box optimization scenarios such as adversarial black-box attacks to deep neural networks. SZO methods only require the ability to evaluate the objective function at random input points, however, their weakness is the dependency of their convergence speed on the dimensionality of the function to be evaluated. We pr… ▽ More Interest in stochastic zeroth-order (SZO) methods has recently been revived in black-box optimization scenarios such as adversarial black-box attacks to deep neural networks. SZO methods only require the ability to evaluate the objective function at random input points, however, their weakness is the dependency of their convergence speed on the dimensionality of the function to be evaluated. We present a sparse SZO optimization method that reduces this factor to the expected dimensionality of the random perturbation during learning. We give a proof that justifies this reduction for sparse SZO optimization for non-convex functions without making any assumptions on sparsity of objective function or gradient. Furthermore, we present experimental results for neural networks on MNIST and CIFAR that show faster convergence in training loss and test accuracy, and a smaller distance of the gradient approximation to the true gradient in sparse SZO compared to dense SZO. △ Less

Submitted 29 June, 2020; v1 submitted 2 June, 2020; originally announced June 2020.

Comments: International Conference on Machine Learning, Optimization, and Data Science (LOD), Siena, Italy

Journal ref: LOD 2020

arXiv:1907.13444 [pdf, other]

Balanced Identification as an Intersection of Optimization and Distributed Computing

Authors: Alexander Sokolov, Vladimir Voloshinov

Abstract: Technology of formal quantitative estimation of the conformity of the mathematical models to the available dataset is presented. Main purpose of the technology is to make easier the model selection decision-making process for the researcher. The technology is a combination of approaches from the areas of data analysis, optimization and distributed computing including: cross-validation and regulari… ▽ More Technology of formal quantitative estimation of the conformity of the mathematical models to the available dataset is presented. Main purpose of the technology is to make easier the model selection decision-making process for the researcher. The technology is a combination of approaches from the areas of data analysis, optimization and distributed computing including: cross-validation and regularization methods, algebraic modelling in optimization and methods of optimization, automatic discretization of differential and integral equation, optimization REST-services. The technology is illustrated by a demo case study. General mathematical formulation of the method is presented. It is followed by description of the main aspects of algorithmic and software implementation. Success story list of the presented approach already is rather long. Nevertheless, domain of applicability and important unresolved issues are discussed. △ Less

Submitted 20 April, 2020; v1 submitted 31 July, 2019; originally announced July 2019.

Comments: 19 pages, 8 figures, 1 table, 32 references. Due to delay in publication (through no our fault) we uploaded revised text for reviewers of subsequent papers on the subject

arXiv:1810.01480 [pdf, other]

Learning to Segment Inputs for NMT Favors Character-Level Processing

Authors: Julia Kreutzer, Artem Sokolov

Abstract: Most modern neural machine translation (NMT) systems rely on presegmented inputs. Segmentation granularity importantly determines the input and output sequence lengths, hence the modeling depth, and source and target vocabularies, which in turn determine model size, computational costs of softmax normalization, and handling of out-of-vocabulary words. However, the current practice is to use static… ▽ More Most modern neural machine translation (NMT) systems rely on presegmented inputs. Segmentation granularity importantly determines the input and output sequence lengths, hence the modeling depth, and source and target vocabularies, which in turn determine model size, computational costs of softmax normalization, and handling of out-of-vocabulary words. However, the current practice is to use static, heuristic-based segmentations that are fixed before NMT training. This begs the question whether the chosen segmentation is optimal for the translation task. To overcome suboptimal segmentation choices, we present an algorithm for dynamic segmentation based on the Adaptative Computation Time algorithm (Graves 2016), that is trainable end-to-end and driven by the NMT objective. In an evaluation on four translation tasks we found that, given the freedom to navigate between different segmentation levels, the model prefers to operate on (almost) character level, providing support for purely character-level NMT models from a novel angle. △ Less

Submitted 5 November, 2018; v1 submitted 2 October, 2018; originally announced October 2018.

Comments: Technical report for IWSLT 2018 paper

arXiv:1806.04458 [pdf, other]

Sparse Stochastic Zeroth-Order Optimization with an Application to Bandit Structured Prediction

Authors: Artem Sokolov, Julian Hitschler, Mayumi Ohta, Stefan Riezler

Abstract: Stochastic zeroth-order (SZO), or gradient-free, optimization allows to optimize arbitrary functions by relying only on function evaluations under parameter perturbations, however, the iteration complexity of SZO methods suffers a factor proportional to the dimensionality of the perturbed function. We show that in scenarios with natural sparsity patterns as in structured prediction applications, t… ▽ More Stochastic zeroth-order (SZO), or gradient-free, optimization allows to optimize arbitrary functions by relying only on function evaluations under parameter perturbations, however, the iteration complexity of SZO methods suffers a factor proportional to the dimensionality of the perturbed function. We show that in scenarios with natural sparsity patterns as in structured prediction applications, this factor can be reduced to the expected number of active features over input-output pairs. We give a general proof that applies sparse SZO optimization to Lipschitz-continuous, nonconvex, stochastic objectives, and present an experimental evaluation on linear bandit structured prediction tasks with sparse word-based feature representations that confirm our theoretical results. △ Less

Submitted 10 November, 2020; v1 submitted 12 June, 2018; originally announced June 2018.

arXiv:1712.05690 [pdf, other]

Sockeye: A Toolkit for Neural Machine Translation

Authors: Felix Hieber, Tobias Domhan, Michael Denkowski, David Vilar, Artem Sokolov, Ann Clifton, Matt Post

Abstract: We describe Sockeye (version 1.12), an open-source sequence-to-sequence toolkit for Neural Machine Translation (NMT). Sockeye is a production-ready framework for training and applying models as well as an experimental platform for researchers. Written in Python and built on MXNet, the toolkit offers scalable training and inference for the three most prominent encoder-decoder architectures: attenti… ▽ More We describe Sockeye (version 1.12), an open-source sequence-to-sequence toolkit for Neural Machine Translation (NMT). Sockeye is a production-ready framework for training and applying models as well as an experimental platform for researchers. Written in Python and built on MXNet, the toolkit offers scalable training and inference for the three most prominent encoder-decoder architectures: attentional recurrent neural networks, self-attentional transformers, and fully convolutional networks. Sockeye also supports a wide range of optimizers, normalization and regularization techniques, and inference improvements from current NMT literature. Users can easily run standard training recipes, explore different model settings, and incorporate new ideas. In this paper, we highlight Sockeye's features and benchmark it against other NMT toolkits on two language arcs from the 2017 Conference on Machine Translation (WMT): English-German and Latvian-English. We report competitive BLEU scores across all three architectures, including an overall best score for Sockeye's transformer implementation. To facilitate further comparison, we release all system outputs and training scripts used in our experiments. The Sockeye toolkit is free software released under the Apache 2.0 license. △ Less

Submitted 1 June, 2018; v1 submitted 15 December, 2017; originally announced December 2017.

arXiv:1707.09118 [pdf, other]

Counterfactual Learning from Bandit Feedback under Deterministic Logging: A Case Study in Statistical Machine Translation

Authors: Carolin Lawrence, Artem Sokolov, Stefan Riezler

Abstract: The goal of counterfactual learning for statistical machine translation (SMT) is to optimize a target SMT system from logged data that consist of user feedback to translations that were predicted by another, historic SMT system. A challenge arises by the fact that risk-averse commercial SMT systems deterministically log the most probable translation. The lack of sufficient exploration of the SMT o… ▽ More The goal of counterfactual learning for statistical machine translation (SMT) is to optimize a target SMT system from logged data that consist of user feedback to translations that were predicted by another, historic SMT system. A challenge arises by the fact that risk-averse commercial SMT systems deterministically log the most probable translation. The lack of sufficient exploration of the SMT output space seemingly contradicts the theoretical requirements for counterfactual learning. We show that counterfactual learning from deterministic bandit logs is possible nevertheless by smoothing out deterministic components in learning. This can be achieved by additive and multiplicative control variates that avoid degenerate behavior in empirical risk minimization. Our simulation experiments show improvements of up to 2 BLEU points by counterfactual learning from deterministic bandit feedback. △ Less

Submitted 14 December, 2017; v1 submitted 28 July, 2017; originally announced July 2017.

Comments: Conference on Empirical Methods in Natural Language Processing (EMNLP), 2017, Copenhagen, Denmark

arXiv:1707.09050 [pdf, other]

A Shared Task on Bandit Learning for Machine Translation

Authors: Artem Sokolov, Julia Kreutzer, Kellen Sunderland, Pavel Danchenko, Witold Szymaniak, Hagen Fürstenau, Stefan Riezler

Abstract: We introduce and describe the results of a novel shared task on bandit learning for machine translation. The task was organized jointly by Amazon and Heidelberg University for the first time at the Second Conference on Machine Translation (WMT 2017). The goal of the task is to encourage research on learning machine translation from weak user feedback instead of human references or post-edits. On e… ▽ More We introduce and describe the results of a novel shared task on bandit learning for machine translation. The task was organized jointly by Amazon and Heidelberg University for the first time at the Second Conference on Machine Translation (WMT 2017). The goal of the task is to encourage research on learning machine translation from weak user feedback instead of human references or post-edits. On each of a sequence of rounds, a machine translation system is required to propose a translation for an input, and receives a real-valued estimate of the quality of the proposed translation for learning. This paper describes the shared task's learning and evaluation setup, using services hosted on Amazon Web Services (AWS), the data and evaluation metrics, and the results of various machine translation architectures and learning protocols. △ Less

Submitted 27 July, 2017; originally announced July 2017.

Comments: Conference on Machine Translation (WMT) 2017

arXiv:1705.07154 [pdf, other]

doi 10.1070/QEL16469

Demonstration of a quantum key distribution network in urban fibre-optic communication lines

Authors: E. O. Kiktenko, N. O. Pozhar, A. V. Duplinskiy, A. A. Kanapin, A. S. Sokolov, S. S. Vorobey, A. V. Miller, V. E. Ustimchik, M. N. Anufriev, A. S. Trushechkin, R. R. Yunusov, V. L. Kurochkin, Y. V. Kurochkin, A. K. Fedorov

Abstract: We report the results of the implementation of a quantum key distribution (QKD) network using standard fibre communication lines in Moscow. The developed QKD network is based on the paradigm of trusted repeaters and allows a common secret key to be generated between users via an intermediate trusted node. The main feature of the network is the integration of the setups using two types of encoding,… ▽ More We report the results of the implementation of a quantum key distribution (QKD) network using standard fibre communication lines in Moscow. The developed QKD network is based on the paradigm of trusted repeaters and allows a common secret key to be generated between users via an intermediate trusted node. The main feature of the network is the integration of the setups using two types of encoding, i.e. polarisation encoding and phase encoding. One of the possible applications of the developed QKD network is the continuous key renewal in existing symmetric encryption devices with a key refresh time of up to 14 s. △ Less

Submitted 2 October, 2017; v1 submitted 19 May, 2017; originally announced May 2017.

Comments: 5 pages, 3 figures

Journal ref: Quantum Electron. 47, 798 (2017)

arXiv:1704.06497 [pdf, other]

Bandit Structured Prediction for Neural Sequence-to-Sequence Learning

Authors: Julia Kreutzer, Artem Sokolov, Stefan Riezler

Abstract: Bandit structured prediction describes a stochastic optimization framework where learning is performed from partial feedback. This feedback is received in the form of a task loss evaluation to a predicted output structure, without having access to gold standard structures. We advance this framework by lifting linear bandit learning to neural sequence-to-sequence learning problems using attention-b… ▽ More Bandit structured prediction describes a stochastic optimization framework where learning is performed from partial feedback. This feedback is received in the form of a task loss evaluation to a predicted output structure, without having access to gold standard structures. We advance this framework by lifting linear bandit learning to neural sequence-to-sequence learning problems using attention-based recurrent neural networks. Furthermore, we show how to incorporate control variates into our learning algorithms for variance reduction and improved generalization. We present an evaluation on a neural machine translation task that shows improvements of up to 5.89 BLEU points for domain adaptation from simulated bandit feedback. △ Less

Submitted 13 December, 2018; v1 submitted 21 April, 2017; originally announced April 2017.

Comments: ACL 2017

arXiv:1606.00739 [pdf, ps, other]

Stochastic Structured Prediction under Bandit Feedback

Authors: Artem Sokolov, Julia Kreutzer, Christopher Lo, Stefan Riezler

Abstract: Stochastic structured prediction under bandit feedback follows a learning protocol where on each of a sequence of iterations, the learner receives an input, predicts an output structure, and receives partial feedback in form of a task loss evaluation of the predicted structure. We present applications of this learning scenario to convex and non-convex objectives for structured prediction and analy… ▽ More Stochastic structured prediction under bandit feedback follows a learning protocol where on each of a sequence of iterations, the learner receives an input, predicts an output structure, and receives partial feedback in form of a task loss evaluation of the predicted structure. We present applications of this learning scenario to convex and non-convex objectives for structured prediction and analyze them as stochastic first-order methods. We present an experimental evaluation on problems of natural language processing over exponential output spaces, and compare convergence speed across different objectives under the practical criterion of optimal task performance on development data and the optimization-theoretic criterion of minimal squared gradient norm. Best results under both criteria are obtained for a non-convex objective for pairwise preference learning under bandit feedback. △ Less

Submitted 2 November, 2016; v1 submitted 2 June, 2016; originally announced June 2016.

Comments: 30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain

arXiv:1601.04468 [pdf, ps, other]

Bandit Structured Prediction for Learning from Partial Feedback in Statistical Machine Translation

Authors: Artem Sokolov, Stefan Riezler, Tanguy Urvoy

Abstract: We present an approach to structured prediction from bandit feedback, called Bandit Structured Prediction, where only the value of a task loss function at a single predicted point, instead of a correct structure, is observed in learning. We present an application to discriminative reranking in Statistical Machine Translation (SMT) where the learning algorithm only has access to a 1-BLEU loss evalu… ▽ More We present an approach to structured prediction from bandit feedback, called Bandit Structured Prediction, where only the value of a task loss function at a single predicted point, instead of a correct structure, is observed in learning. We present an application to discriminative reranking in Statistical Machine Translation (SMT) where the learning algorithm only has access to a 1-BLEU loss evaluation of a predicted translation instead of obtaining a gold standard reference translation. In our experiment bandit feedback is obtained by evaluating BLEU on reference translations without revealing them to the algorithm. This can be thought of as a simulation of interactive machine translation where an SMT system is personalized by a user who provides single point feedback to predicted translations. Our experiments show that our approach improves translation quality and is comparable to approaches that employ more informative feedback in learning. △ Less

Submitted 18 January, 2016; originally announced January 2016.

Comments: In Proceedings of MT Summit XV, 2015. Miami, FL

Showing 1–29 of 29 results for author: Sokolov, A