Showing 1–13 of 13 results for author: Balikas, G

  1. arXiv:2111.15278  [pdf, other]

    cs.CL

    Bilingual Topic Models for Comparable Corpora

    Authors: Georgios Balikas, Massih-Reza Amini, Marianne Clausel

    Abstract: Probabilistic topic models like Latent Dirichlet Allocation (LDA) have been previously extended to the bilingual setting. A fundamental modeling assumption in several of these extensions is that the input corpora are in the form of document pairs whose constituent documents share a single topic distribution. However, this assumption is strong for comparable corpora that consist of documents themat…

    Submitted 30 November, 2021; originally announced November 2021.

    Comments: 32 pages, 2 figures

  2. arXiv:2012.06238  [pdf, other]

    cs.LG cs.IR

    Query Understanding for Natural Language Enterprise Search

    Authors: Francisco Borges, Georgios Balikas, Marc Brette, Guillaume Kempf, Arvind Srikantan, Matthieu Landos, Darya Brazouskaya, Qianqian Shi

    Abstract: Natural Language Search (NLS) extends the capabilities of search engines that perform keyword search allowing users to issue queries in a more "natural" language. The engine tries to understand the meaning of the queries and to map the query words to the symbols it supports like Persons, Organizations, Time Expressions etc. It then retrieves the information that satisfies the user's need in dif…

    Submitted 11 December, 2020; originally announced December 2020.

    Comments: accepted at DeepNLP @ SIGIR 2020

  3. arXiv:1910.11005  [pdf, ps, other]

    cs.CL

    Wasserstein distances for evaluating cross-lingual embeddings

    Authors: Georgios Balikas, Ioannis Partalas

    Abstract: Word embeddings are high dimensional vector representations of words that capture their semantic similarity in the vector space. There exist several algorithms for learning such embeddings both for a single language as well as for several languages jointly. In this work we propose to evaluate collections of embeddings by adapting downstream natural language tasks to the optimal transport framework…

    Submitted 11 November, 2019; v1 submitted 24 October, 2019; originally announced October 2019.

  4. arXiv:1809.08935  [pdf, other]

    cs.CL cs.AI

    Lexical Bias In Essay Level Prediction

    Authors: Georgios Balikas

    Abstract: Automatically predicting the level of non-native English speakers given their written essays is an interesting machine learning problem. In this work I present the system "balikasg" that achieved the state-of-the-art performance in the CAp 2018 data science challenge among 14 systems. I detail the feature extraction, feature engineering and model selection steps and I evaluate how these decisions…

    Submitted 21 September, 2018; originally announced September 2018.

    Comments: CAp 2018

  5. arXiv:1807.10076  [pdf, other]

    cs.CL

    Concurrent Learning of Semantic Relations

    Authors: Georgios Balikas, Gaël Dias, Rumen Moraliyski, Massih-Reza Amini

    Abstract: Discovering whether words are semantically related and identifying the specific semantic relation that holds between them is of crucial importance for NLP as it is essential for tasks like query expansion in IR. Within this context, different methodologies have been proposed that either exclusively focus on a single lexical relation (e.g. hypernymy vs. random) or learn specific classifiers capable…

    Submitted 30 July, 2018; v1 submitted 26 July, 2018; originally announced July 2018.

    Comments: 10 pages

  6. arXiv:1805.04437  [pdf, other]

    cs.CL stat.ML

    Cross-lingual Document Retrieval using Regularized Wasserstein Distance

    Authors: Georgios Balikas, Charlotte Laclau, Ievgen Redko, Massih-Reza Amini

    Abstract: Many information retrieval algorithms rely on the notion of a good distance that allows to efficiently compare objects of different nature. Recently, a new promising metric called Word Mover's Distance was proposed to measure the divergence between text passages. In this paper, we demonstrate that this metric can be extended to incorporate term-weighting schemes and provide more accurate and compu…

    Submitted 11 May, 2018; originally announced May 2018.

    Comments: ECIR 2018

  7. arXiv:1707.07568  [pdf, other]

    cs.CL

    CAp 2017 challenge: Twitter Named Entity Recognition

    Authors: Cédric Lopez, Ioannis Partalas, Georgios Balikas, Nadia Derbas, Amélie Martin, Coralie Reutenauer, Frédérique Segond, Massih-Reza Amini

    Abstract: The paper describes the CAp 2017 challenge. The challenge concerns the problem of Named Entity Recognition (NER) for tweets written in French. We first present the data preparation steps we followed for constructing the dataset released in the framework of the challenge. We begin by demonstrating why NER for tweets is a challenging problem especially when the number of entities increases. We detai…

    Submitted 24 July, 2017; originally announced July 2017.

    Comments: Presented at CAp 2017 (French Conference on Machine Learning)

  8. arXiv:1707.03569  [pdf, other]

    cs.IR cs.CL cs.LG

    Multitask Learning for Fine-Grained Twitter Sentiment Analysis

    Authors: Georgios Balikas, Simon Moura, Massih-Reza Amini

    Abstract: Traditional sentiment analysis approaches tackle problems like ternary (3-category) and fine-grained (5-category) classification by learning the tasks separately. We argue that such classification tasks are correlated and we propose a multitask approach based on a recurrent neural network that benefits by jointly learning them. Our study demonstrates the potential of multitask models on this type…

    Submitted 12 July, 2017; originally announced July 2017.

    Comments: International ACM SIGIR Conference on Research and Development in Information Retrieval 2017

  9. arXiv:1705.01265  [pdf, ps, other]

    cs.CL

    On the effectiveness of feature set augmentation using clusters of word embeddings

    Authors: Georgios Balikas, Ioannis Partalas

    Abstract: Word clusters have been empirically shown to offer important performance improvements on various tasks. Despite their importance, their incorporation in the standard pipeline of feature engineering relies more on a trial-and-error procedure where one evaluates several hyper-parameters, like the number of clusters to be used. In order to better understand the role of such features we systematically…

    Submitted 30 July, 2018; v1 submitted 3 May, 2017; originally announced May 2017.

    Comments: SwissText 2018; oral presentations

  10. arXiv:1606.06623  [pdf, other]

    cs.CL cs.IR

    An empirical study on large scale text classification with skip-gram embeddings

    Authors: Georgios Balikas, Massih-Reza Amini

    Abstract: We investigate the integration of word embeddings as classification features in the setting of large scale text classification. Such representations have been used in a plethora of tasks, however their application in classification scenarios with thousands of classes has not been extensively researched, partially due to hardware limitations. In this work, we examine efficient composition functions…

    Submitted 21 June, 2016; originally announced June 2016.

  11. arXiv:1606.04351  [pdf, other]

    cs.CL cs.IR cs.LG

    TwiSE at SemEval-2016 Task 4: Twitter Sentiment Classification

    Authors: Georgios Balikas, Massih-Reza Amini

    Abstract: This paper describes the participation of the team "TwiSE" in the SemEval 2016 challenge. Specifically, we participated in Task 4, namely "Sentiment Analysis in Twitter" for which we implemented sentiment classification systems for subtasks A, B, C and D. Our approach consists of two steps. In the first step, we generate and validate diverse feature sets for twitter sentiment evaluation, inspired…

    Submitted 14 June, 2016; originally announced June 2016.

  12. arXiv:1606.02854  [pdf, ps, other]

    cs.LG cs.AI

    e-Commerce product classification: our participation at cDiscount 2015 challenge

    Authors: Ioannis Partalas, Georgios Balikas

    Abstract: This report describes our participation in the cDiscount 2015 challenge where the goal was to classify product items in a predefined taxonomy of products. Our best submission yielded an accuracy score of 64.20% in the private part of the leaderboard and we were ranked 10th out of 175 participating teams. We followed a text classification approach employing mainly linear models. The final solution…

    Submitted 9 June, 2016; originally announced June 2016.

    Comments: Technical report

  13. arXiv:1606.00253  [pdf, other]

    cs.CL cs.IR cs.LG

    On a Topic Model for Sentences

    Authors: Georgios Balikas, Massih-Reza Amini, Marianne Clausel

    Abstract: Probabilistic topic models are generative models that describe the content of documents by discovering the latent topics underlying them. However, the structure of the textual input, and for instance the grouping of words in coherent text spans such as sentences, contains much information which is generally lost with these models. In this paper, we propose sentenceLDA, an extension of LDA whose go…

    Submitted 1 June, 2016; originally announced June 2016.