-
Yin Yang Convolutional Nets: Image Manifold Extraction by the Analysis of Opposites
Authors:
Augusto Seben da Rosa,
Frederico Santos de Oliveira,
Anderson da Silva Soares,
Arnaldo Candido Junior
Abstract:
Computer vision has seen several advances, such as training optimizations and new architectures (pure attention, efficient blocks, vision-language models, and generative models, among others). These have improved performance in several tasks, such as classification. However, the majority of these models focus on modifications that move away from realistic neuroscientific approaches related to the brain. In this work, we adopt a more bio-inspired approach and present the Yin Yang Convolutional Network, an architecture that extracts the visual manifold; its blocks are intended to separate the analysis of colors and forms at its initial layers, simulating the occipital lobe's operations. Our results show that our architecture provides state-of-the-art efficiency among low-parameter architectures on the CIFAR-10 dataset. Our first model reached 93.32% test accuracy, 0.8% more than the previous SOTA in this category, while having 150k fewer parameters (726k in total). Our second model uses 52k parameters, losing only 3.86% test accuracy. We also performed an analysis on ImageNet, where we reached 66.49% validation accuracy with 1.6M parameters. We make the code publicly available at: https://github.com/NoSavedDATA/YinYang_CNN.
Submitted 24 October, 2023;
originally announced October 2023.
-
CML-TTS A Multilingual Dataset for Speech Synthesis in Low-Resource Languages
Authors:
Frederico S. Oliveira,
Edresson Casanova,
Arnaldo Cândido Júnior,
Anderson S. Soares,
Arlindo R. Galvão Filho
Abstract:
In this paper, we present CML-TTS, a recursive acronym for CML-Multi-Lingual-TTS, a new Text-to-Speech (TTS) dataset developed at the Center of Excellence in Artificial Intelligence (CEIA) of the Federal University of Goias (UFG). CML-TTS is based on Multilingual LibriSpeech (MLS) and adapted for training TTS models, consisting of audiobooks in seven languages: Dutch, French, German, Italian, Portuguese, Polish, and Spanish. Additionally, we provide the YourTTS model, a multi-lingual TTS model, trained using 3,176.13 hours from CML-TTS and also with 245.07 hours from LibriTTS, in English. Our purpose in creating this dataset is to open up new research possibilities in the TTS area for multi-lingual models. The dataset is publicly available under the CC-BY 4.0 license.
Submitted 16 June, 2023;
originally announced June 2023.
-
Evaluation of Speech Representations for MOS prediction
Authors:
Frederico S. Oliveira,
Edresson Casanova,
Arnaldo Cândido Júnior,
Lucas R. S. Gris,
Anderson S. Soares,
Arlindo R. Galvão Filho
Abstract:
In this paper, we evaluate feature extraction models for predicting speech quality. We also propose a model architecture to compare embeddings of supervised learning and self-supervised learning models with embeddings of speaker verification models to predict the metric MOS. Our experiments were performed on the VCC2018 dataset and a Brazilian-Portuguese dataset called BRSpeechMOS, which was created for this work. The results show that the Whisper model is appropriate in all scenarios: with both the VCC2018 and BRSpeechMOS datasets. Among the supervised and self-supervised learning models using BRSpeechMOS, Whisper-Small achieved the best linear correlation of 0.6980, and the speaker verification model, SpeakerNet, had a linear correlation of 0.6963. Using VCC2018, the best supervised and self-supervised learning model, Whisper-Large, achieved a linear correlation of 0.7274, and the best speaker verification model, TitaNet, achieved a linear correlation of 0.6933. Although the results of the speaker verification models are slightly lower, the SpeakerNet model has only 5M parameters, making it suitable for real-time applications, and the TitaNet model produces an embedding of size 192, the smallest among all the evaluated models. The experiment results are reproducible with publicly available source code.
Submitted 16 June, 2023;
originally announced June 2023.
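The linear correlation figures quoted in this abstract compare predicted MOS values against human ratings. As a minimal sketch of that metric, assuming it is the Pearson correlation coefficient (the abstract only says "linear correlation"), the function name `pearson` and any scores passed to it are illustrative and not taken from the paper:

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson linear correlation between two equal-length sequences.

    Hypothetical helper: xs could hold predicted MOS values and ys the
    human-rated MOS values for the same utterances.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

A value near 1 means predictions rise and fall with the human ratings; it says nothing about whether the two are on the same absolute scale.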
-
Thinness and its variations on some graph families and coloring graphs of bounded thinness
Authors:
Flavia Bonomo-Braberman,
Eric Brandwein,
Fabiano S. Oliveira,
Moysés S. Sampaio Jr.,
Agustin Sansone,
Jayme L. Szwarcfiter
Abstract:
Interval graphs and proper interval graphs are well-known graph classes, for which several generalizations have been proposed in the literature. In this work, we study the (proper) thinness, and several variations, for the classes of cographs, crown graphs and grid graphs.
We provide the exact values for several variants of thinness (proper, independent, complete, precedence, and combinations of them) for the crown graphs $CR_n$. For cographs, we prove that the precedence thinness can be determined in polynomial time. We also improve known bounds for the thinness of $n \times n$ grids $GR_n$ and $m \times n$ grids $GR_{m,n}$, proving that $\left \lceil \frac{n-1}{3} \right \rceil \leq \mbox{thin}(GR_n) \leq \left \lceil \frac{n+1}{2} \right \rceil$. Regarding the precedence thinness, we prove that $\mbox{prec-thin}(GR_{n,2}) = \left \lceil \frac{n+1}{2} \right \rceil$ and that $\left \lceil \frac{n-1}{3} \right \rceil \left \lceil\frac{n-1}{2} \right \rceil + 1 \leq \mbox{prec-thin}(GR_n) \leq \left \lceil\frac{n-1}{2} \right \rceil^2+1$. As applications, we show that the $k$-coloring problem is NP-complete for precedence $2$-thin graphs and for proper $2$-thin graphs, when $k$ is part of the input. On the positive side, it is polynomially solvable for precedence proper $2$-thin graphs, given the order and partition.
Submitted 2 February, 2024; v1 submitted 10 March, 2023;
originally announced March 2023.
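The bounds on square grids stated in this abstract are easy to evaluate for a concrete $n$. The sketch below only transcribes the two inequalities for $\mbox{thin}(GR_n)$ and $\mbox{prec-thin}(GR_n)$; the function names are hypothetical, and nothing here computes thinness itself:

```python
def ceil_div(a: int, b: int) -> int:
    """Ceiling of a / b for integers with b > 0, without floating point."""
    return -(-a // b)


def grid_thinness_bounds(n: int) -> tuple[int, int]:
    """(lower, upper) bounds on thin(GR_n) for the n x n grid."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return ceil_div(n - 1, 3), ceil_div(n + 1, 2)


def grid_prec_thinness_bounds(n: int) -> tuple[int, int]:
    """(lower, upper) bounds on prec-thin(GR_n) for the n x n grid."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    half = ceil_div(n - 1, 2)
    return ceil_div(n - 1, 3) * half + 1, half * half + 1


print(grid_thinness_bounds(10))       # (3, 6)
print(grid_prec_thinness_bounds(10))  # (16, 26)
```

For $n = 10$ this gives $3 \leq \mbox{thin}(GR_{10}) \leq 6$ and $16 \leq \mbox{prec-thin}(GR_{10}) \leq 26$.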
-
Analysis of account behaviors in Ethereum during an economic impact event
Authors:
Pedro Henrique F. S. Oliveira,
Daniel Muller Rezende,
Heder Soares Bernardino,
Saulo Moraes Villela,
Alex Borges Vieira
Abstract:
One of the main events affecting the world economy in 2022 is the conflict between Russia and Ukraine. This event offers a rare opportunity to analyze how events of this magnitude are reflected in the use of cryptocurrencies. This work aims to investigate the behavior of accounts and their transactions on the Ethereum cryptocurrency during this event. To this end, we collected all transactions that occurred two weeks before and two weeks after the beginning of the conflict, organized into two groups: the collection of the accounts involved in these transactions and the subset of those accounts that interacted with a service in Ethereum called Flashbots Auction. We modeled temporal graphs where each node represents an account, and each edge represents a transaction between two accounts. Then, we analyzed the behavior of these accounts with graph metrics for both groups during each observed week. The results showed changes in the behavior and activity of users and their accounts, as well as variations in the daily volume of transactions.
Submitted 22 June, 2022;
originally announced June 2022.
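The temporal-graph modeling described in this abstract (accounts as nodes, transactions as edges, one graph per observed week) can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the `(day, sender, receiver)` tuple layout, the function names, and the three metrics are all hypothetical, and the abstract does not say which graph metrics were used.

```python
from collections import defaultdict


def weekly_snapshots(transactions, start_day, days_per_window=7):
    """Group (day, sender, receiver) tuples into per-week directed graphs.

    Returns {week_index: {sender: set_of_receivers}}. Repeated transactions
    between the same ordered pair collapse into a single edge.
    """
    snapshots = defaultdict(lambda: defaultdict(set))
    for day, sender, receiver in transactions:
        week = (day - start_day) // days_per_window
        snapshots[week][sender].add(receiver)
    return snapshots


def snapshot_metrics(adjacency):
    """Node count, edge count and mean out-degree of one weekly graph."""
    nodes = set(adjacency)
    for receivers in adjacency.values():
        nodes |= receivers
    edges = sum(len(receivers) for receivers in adjacency.values())
    mean_out_degree = edges / len(nodes) if nodes else 0.0
    return {"nodes": len(nodes), "edges": edges, "mean_out_degree": mean_out_degree}


# Made-up transactions: day offset, sender account, receiver account.
example = [(0, "a", "b"), (1, "a", "c"), (3, "b", "c"), (8, "c", "a"), (9, "c", "a")]
for week, graph in sorted(weekly_snapshots(example, start_day=0).items()):
    print(week, snapshot_metrics(graph))
```

Comparing such per-week metrics before and after a reference date is one simple way to look for changes in account activity.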
-
Edge Intersection Graphs of Paths on a Triangular Grid
Authors:
Vitor T. F. de Luca,
María Pía Mazzoleni,
Fabiano S. Oliveira,
Tanilson D. Santos,
Jayme L. Szwarcfiter
Abstract:
We introduce a new class of intersection graphs, the edge intersection graphs of paths on a triangular grid, called EPGt graphs. We show similarities and differences from this new class to the well-known class of EPG graphs. A turn of a path at a grid point is called a bend. An EPGt representation in which every path has at most $k$ bends is called a B$_k$-EPGt representation and the corresponding graphs are called B$_k$-EPGt graphs. We provide examples of B$_{2}$-EPG graphs that are B$_{1}$-EPGt. We characterize the representation of cliques with three vertices and chordless 4-cycles in B$_{1}$-EPGt representations. We also prove that B$_{1}$-EPGt graphs have Strong Helly number $3$. Furthermore, we prove that B$_{1}$-EPGt graphs are $7$-clique colorable.
Submitted 8 March, 2022;
originally announced March 2022.
-
MaxCut on Permutation Graphs is NP-complete
Authors:
Celina M. H. de Figueiredo,
Alexsander A. de Melo,
Fabiano S. Oliveira,
Ana Silva
Abstract:
In this paper, we prove that the MaxCut problem is NP-complete on permutation graphs, settling a long-standing open problem that appeared in the 1985 column of the "Ongoing Guide to NP-completeness" by David S. Johnson.
Submitted 28 February, 2022;
originally announced February 2022.
-
CORAA: a large corpus of spontaneous and prepared speech manually validated for speech recognition in Brazilian Portuguese
Authors:
Arnaldo Candido Junior,
Edresson Casanova,
Anderson Soares,
Frederico Santos de Oliveira,
Lucas Oliveira,
Ricardo Corso Fernandes Junior,
Daniel Peixoto Pinto da Silva,
Fernando Gorgulho Fayet,
Bruno Baldissera Carlotto,
Lucas Rafael Stefanel Gris,
Sandra Maria Aluísio
Abstract:
Automatic Speech Recognition (ASR) is a complex and challenging task. In recent years, there have been significant advances in the area. In particular, for the Brazilian Portuguese (BP) language, there were about 376 hours publicly available for the ASR task until the second half of 2020. With the release of new datasets in early 2021, this number increased to 574 hours. The existing resources, however, are composed of audios containing only read and prepared speech. There is a lack of datasets including spontaneous speech, which are essential in different ASR applications. This paper presents CORAA (Corpus of Annotated Audios) v1, with 290.77 hours, a publicly available dataset for ASR in BP containing validated pairs (audio-transcription). CORAA also contains European Portuguese audios (4.69 hours). We also present a public ASR model based on Wav2Vec 2.0 XLSR-53 and fine-tuned over CORAA. Our model achieved a Word Error Rate of 24.18% on the CORAA test set and 20.08% on the Common Voice test set. When measuring the Character Error Rate, we obtained 11.02% and 6.34% for CORAA and Common Voice, respectively. The CORAA corpora were assembled to both improve ASR models in BP with phenomena from spontaneous speech and motivate young researchers to start their studies on ASR for Portuguese. All the corpora are publicly available at https://github.com/nilc-nlp/CORAA under the CC BY-NC-ND 4.0 license.
Submitted 18 November, 2021; v1 submitted 14 October, 2021;
originally announced October 2021.
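The Word Error Rate and Character Error Rate reported in this abstract are both edit-distance ratios. A minimal sketch of the standard definitions follows, assuming whitespace tokenization and no text normalization; the abstract does not describe the authors' evaluation script, so the function names and these choices are assumptions:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word errors divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character errors divided by the number of reference characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


print(wer("o gato preto dorme", "o gato dorme"))  # 0.25
print(cer("casa", "cada"))                        # 0.25
```

Because the denominator is the reference length, both rates can exceed 100% when the hypothesis contains many insertions.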
-
Minimum Number of Bends of Paths of Trees in a Grid Embedding
Authors:
V. T. F. Luca,
F. S. Oliveira,
J. L. Szwarcfiter
Abstract:
We are interested in embedding trees T with maximum degree at most four in a rectangular grid, such that the vertices of T correspond to grid points, while edges of T correspond to non-intersecting straight segments of the grid lines. Such embeddings are called straight models. While each edge is represented by a straight segment, a path of T is represented in the model by the union of the segments corresponding to its edges, which may consist of a path in the model having several bends. The aim is to determine a straight model of a given tree T minimizing the maximum number of bends over all paths of T. We provide a quadratic-time algorithm for this problem. We also show how to construct straight models that have k as its minimum number of bends and with the least number of vertices possible. As an application of our algorithm, we provide an upper bound on the number of bends of EPG models of graphs that are both VPT and EPT.
Submitted 6 September, 2021;
originally announced September 2021.
-
Brazilian Portuguese Speech Recognition Using Wav2vec 2.0
Authors:
Lucas Rafael Stefanel Gris,
Edresson Casanova,
Frederico Santos de Oliveira,
Anderson da Silva Soares,
Arnaldo Candido Junior
Abstract:
Deep learning techniques have been shown to be efficient in various tasks, especially in the development of speech recognition systems, that is, systems that aim to transcribe an audio sentence into a sequence of written words. Despite the progress in the area, speech recognition can still be considered difficult, especially for languages lacking available data, such as Brazilian Portuguese (BP). In this sense, this work presents the development of a public Automatic Speech Recognition (ASR) system using only openly available audio data, obtained by fine-tuning the Wav2vec 2.0 XLSR-53 model, pre-trained in many languages, over BP data. The final model presents an average word error rate of 12.4% over 7 different datasets (10.5% when applying a language model). To our knowledge, the obtained error is the lowest among open end-to-end (E2E) ASR models for BP.
Submitted 22 December, 2021; v1 submitted 23 July, 2021;
originally announced July 2021.
-
B1-EPG representations using block-cutpoint trees
Authors:
V. T. F. Luca,
F. S. Oliveira,
J. L. Szwarcfiter
Abstract:
In this paper, we are interested in the edge intersection graphs of paths of a grid where each path has at most one bend, called B1-EPG graphs and first introduced by Golumbic et al. (2009). We also consider a proper subclass of B1-EPG, the L-EPG graphs, which allows paths only in "L" shape. We show that two superclasses of trees are B1-EPG (one of them being the cactus graphs). On the other hand, we show that the block graphs are L-EPG and provide a linear time algorithm to produce L-EPG representations of generalization of trees. These proofs employed a new technique from previous results in the area based on block-cutpoint trees of the respective graphs.
Submitted 9 June, 2021;
originally announced June 2021.
-
SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker Text-To-Speech Model
Authors:
Edresson Casanova,
Christopher Shulby,
Eren Gölge,
Nicolas Michael Müller,
Frederico Santos de Oliveira,
Arnaldo Candido Junior,
Anderson da Silva Soares,
Sandra Maria Aluisio,
Moacir Antonelli Ponti
Abstract:
In this paper, we propose SC-GlowTTS: an efficient zero-shot multi-speaker text-to-speech model that improves similarity for speakers unseen during training. We propose a speaker-conditional architecture that explores a flow-based decoder that works in a zero-shot scenario. As text encoders, we explore a dilated residual convolutional-based encoder, gated convolutional-based encoder, and transformer-based encoder. Additionally, we have shown that adjusting a GAN-based vocoder for the spectrograms predicted by the TTS model on the training dataset can significantly improve the similarity and speech quality for new speakers. Our model converges using only 11 speakers, reaching state-of-the-art results for similarity with new speakers, as well as high speech quality.
Submitted 15 June, 2021; v1 submitted 2 April, 2021;
originally announced April 2021.
-
Maximum cut on interval graphs of interval count four is NP-complete
Authors:
Celina M. H. de Figueiredo,
Alexsander A. de Melo,
Fabiano S. Oliveira,
Ana Silva
Abstract:
The computational complexity of the MaxCut problem restricted to interval graphs has been open since the 80's, being one of the problems proposed by Johnson in his Ongoing Guide to NP-completeness, and has been settled as NP-complete only recently by Adhikary, Bose, Mukherjee and Roy. On the other hand, many flawed proofs of polynomiality for MaxCut on the more restrictive class of unit/proper interval graphs (or graphs with interval count 1) have been presented over the years, and the classification of the problem is still unknown. In this paper, we present the first NP-completeness proof for MaxCut when restricted to interval graphs with bounded interval count, namely graphs with interval count 4.
Submitted 29 November, 2022; v1 submitted 17 December, 2020;
originally announced December 2020.
-
Precedence thinness in graphs
Authors:
Flavia Bonomo-Braberman,
Fabiano S. Oliveira,
Moysés S. Sampaio Jr.,
Jayme L. Szwarcfiter
Abstract:
Interval and proper interval graphs are very well-known graph classes, for which there is a wide literature. As a consequence, some generalizations of interval graphs have been proposed, in which graphs in general are expressed in terms of $k$ interval graphs, by splitting the graph in some special way.
As a recent example of such an approach, the classes of $k$-thin and proper $k$-thin graphs have been introduced generalizing interval and proper interval graphs, respectively. The complexity of the recognition of each of these classes is still open, even for fixed $k \geq 2$.
In this work, we introduce a subclass of $k$-thin graphs (resp. proper $k$-thin graphs), called precedence $k$-thin graphs (resp. precedence proper $k$-thin graphs). Concerning partitioned precedence $k$-thin graphs, we present a polynomial time recognition algorithm based on $PQ$-trees. With respect to partitioned precedence proper $k$-thin graphs, we prove that the related recognition problem is NP-complete for an arbitrary $k$ and polynomial-time solvable when $k$ is fixed. Moreover, we present a characterization for these classes based on threshold graphs.
Submitted 30 June, 2020;
originally announced June 2020.
-
Thinness of product graphs
Authors:
Flavia Bonomo-Braberman,
Carolina L. Gonzalez,
Fabiano S. Oliveira,
Moysés S. Sampaio Jr.,
Jayme L. Szwarcfiter
Abstract:
The thinness of a graph is a width parameter that generalizes some properties of interval graphs, which are exactly the graphs of thinness one. Many NP-complete problems can be solved in polynomial time for graphs with bounded thinness, given a suitable representation of the graph. In this paper we study the thinness and its variations of graph products. We show that the thinness behaves "well" in general for products, in the sense that for most of the graph products defined in the literature, the thinness of the product of two graphs is bounded by a function (typically product or sum) of their thinness, or of the thinness of one of them and the size of the other. We also show for some cases the non-existence of such a function.
Submitted 16 April, 2021; v1 submitted 30 June, 2020;
originally announced June 2020.
-
TTS-Portuguese Corpus: a corpus for speech synthesis in Brazilian Portuguese
Authors:
Edresson Casanova,
Arnaldo Candido Junior,
Christopher Shulby,
Frederico Santos de Oliveira,
João Paulo Teixeira,
Moacir Antonelli Ponti,
Sandra Maria Aluisio
Abstract:
Speech provides a natural way for human-computer interaction. In particular, speech synthesis systems are popular in different applications, such as personal assistants, GPS applications, screen readers and accessibility tools. However, not all languages are on the same level in terms of resources and systems for speech synthesis. This work consists of creating publicly available resources for Brazilian Portuguese in the form of a novel dataset along with deep learning models for end-to-end speech synthesis. The dataset has 10.5 hours from a single speaker, from which a Tacotron 2 model with the RTISI-LA vocoder presented the best performance, achieving a 4.03 MOS value. The obtained results are comparable to related works covering the English language and the state-of-the-art in Portuguese.
Submitted 29 January, 2022; v1 submitted 11 May, 2020;
originally announced May 2020.
-
Linear-time Algorithms for Eliminating Claws in Graphs
Authors:
Flavia Bonomo-Braberman,
Julliano R. Nascimento,
Fabiano S. Oliveira,
Uéverton S. Souza,
Jayme L. Szwarcfiter
Abstract:
Since many NP-complete graph problems have been shown polynomial-time solvable when restricted to claw-free graphs, we study the problem of determining the distance of a given graph to a claw-free graph, considering vertex elimination as measure. CLAW-FREE VERTEX DELETION (CFVD) consists of determining the minimum number of vertices to be removed from a graph such that the resulting graph is claw-free. Although CFVD is NP-complete in general and recognizing claw-free graphs is still a challenge, where the current best algorithm for a graph $G$ has the same running time of the best algorithm for matrix multiplication, we present linear-time algorithms for CFVD on weighted block graphs and weighted graphs with bounded treewidth. Furthermore, we show that this problem can be solved in linear time by a simpler algorithm on forests, and we determine the exact values for full $k$-ary trees. On the other hand, we show that CLAW-FREE VERTEX DELETION is NP-complete even when the input graph is a split graph. We also show that the problem is hard to approximate within any constant factor better than $2$, assuming the Unique Games Conjecture.
Submitted 12 April, 2020;
originally announced April 2020.
-
Speech2Phone: A Novel and Efficient Method for Training Speaker Recognition Models
Authors:
Edresson Casanova,
Arnaldo Candido Junior,
Christopher Shulby,
Frederico Santos de Oliveira,
Lucas Rafael Stefanel Gris,
Hamilton Pereira da Silva,
Sandra Maria Aluisio,
Moacir Antonelli Ponti
Abstract:
In this paper we present an efficient method for training models for speaker recognition using small or under-resourced datasets. This method requires less data than other SOTA (State-Of-The-Art) methods, e.g. the Angular Prototypical and GE2E loss functions, while achieving similar results to those methods. This is done using the knowledge of the reconstruction of a phoneme in the speaker's voice. For this purpose, a new dataset was built, composed of 40 male speakers, who read sentences in Portuguese, totaling approximately 3h. We compare the three best architectures trained using our method to select the best one, which is the one with a shallow architecture. Then, we compared this model with the SOTA method for the speaker recognition task: the Fast ResNet-34 trained with approximately 2,000 hours, using the loss functions Angular Prototypical and GE2E. Three experiments were carried out with datasets in different languages. Among these three experiments, our model achieved the second best result in two experiments and the best result in one of them. This highlights the importance of our method, which proved to be a great competitor to SOTA speaker recognition models, with 500x less data and a simpler approach.
Submitted 18 June, 2021; v1 submitted 25 February, 2020;
originally announced February 2020.