Showing 1–41 of 41 results for author: Petroni, F

  1. arXiv:2307.03172  [pdf, other]

    cs.CL

    Lost in the Middle: How Language Models Use Long Contexts

    Authors: Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, Percy Liang

    Abstract: While recent language models have the ability to take long contexts as input, relatively little is known about how well they use longer context. We analyze the performance of language models on two tasks that require identifying relevant information in their input contexts: multi-document question answering and key-value retrieval. We find that performance can degrade significantly when changing t…

    Submitted 20 November, 2023; v1 submitted 6 July, 2023; originally announced July 2023.

    Comments: 18 pages, 16 figures. Accepted for publication in Transactions of the Association for Computational Linguistics (TACL), 2023

  2. arXiv:2302.09865  [pdf, other]

    cs.CL cs.AI cs.LG

    Can discrete information extraction prompts generalize across language models?

    Authors: Nathanaël Carraz Rakotonirina, Roberto Dessì, Fabio Petroni, Sebastian Riedel, Marco Baroni

    Abstract: We study whether automatically-induced prompts that effectively extract information from a language model can also be used, out-of-the-box, to probe other language models for the same information. After confirming that discrete prompts induced with the AutoPrompt algorithm outperform manual and semi-manual prompts on the slot-filling task, we demonstrate a drop in performance for AutoPrompt prompt…

    Submitted 7 March, 2023; v1 submitted 20 February, 2023; originally announced February 2023.

    Comments: Published as conference paper at ICLR 2023

  3. arXiv:2209.13331  [pdf, other]

    cs.CL cs.LG

    EditEval: An Instruction-Based Benchmark for Text Improvements

    Authors: Jane Dwivedi-Yu, Timo Schick, Zhengbao Jiang, Maria Lomeli, Patrick Lewis, Gautier Izacard, Edouard Grave, Sebastian Riedel, Fabio Petroni

    Abstract: Evaluation of text generation to date has primarily focused on content created sequentially, rather than improvements on a piece of text. Writing, however, is naturally an iterative and incremental process that requires expertise in different modular skills such as fixing outdated information or making the style more consistent. Even so, comprehensive evaluation of a model's capacity to perform th…

    Submitted 27 September, 2022; originally announced September 2022.

  4. arXiv:2209.06148  [pdf, other]

    cs.IR

    Entity Tagging: Extracting Entities in Text Without Mention Supervision

    Authors: Christina Du, Kashyap Popat, Louis Martin, Fabio Petroni

    Abstract: Detection and disambiguation of all entities in text is a crucial task for a wide range of applications. The typical formulation of the problem involves two stages: detect mention boundaries and link all mentions to a knowledge base. For a long time, mention detection has been considered as a necessary step for extracting all entities in a piece of text, even if the information about mention spans…

    Submitted 13 September, 2022; originally announced September 2022.

  5. arXiv:2208.11663  [pdf, other]

    cs.CL

    PEER: A Collaborative Language Model

    Authors: Timo Schick, Jane Dwivedi-Yu, Zhengbao Jiang, Fabio Petroni, Patrick Lewis, Gautier Izacard, Qingfei You, Christoforos Nalmpantis, Edouard Grave, Sebastian Riedel

    Abstract: Textual content is often the output of a collaborative writing process: We start with an initial draft, ask for suggestions, and repeatedly make changes. Agnostic of this process, today's language models are trained to generate only the final result. As a consequence, they lack several abilities crucial for collaborative writing: They are unable to update existing texts, difficult to control and i…

    Submitted 24 August, 2022; originally announced August 2022.

  6. arXiv:2208.03299  [pdf, other]

    cs.CL

    Atlas: Few-shot Learning with Retrieval Augmented Language Models

    Authors: Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, Edouard Grave

    Abstract: Large language models have shown impressive few-shot results on a wide range of tasks. However, when knowledge is key for such results, as is the case for tasks such as question answering and fact checking, massive parameter counts to store knowledge seem to be needed. Retrieval augmented models are known to excel at knowledge intensive tasks without the need for as many parameters, but it is uncl…

    Submitted 16 November, 2022; v1 submitted 5 August, 2022; originally announced August 2022.

  7. arXiv:2207.06220  [pdf, other]

    cs.IR cs.AI

    Improving Wikipedia Verifiability with AI

    Authors: Fabio Petroni, Samuel Broscheit, Aleksandra Piktus, Patrick Lewis, Gautier Izacard, Lucas Hosseini, Jane Dwivedi-Yu, Maria Lomeli, Timo Schick, Pierre-Emmanuel Mazaré, Armand Joulin, Edouard Grave, Sebastian Riedel

    Abstract: Verifiability is a core content policy of Wikipedia: claims that are likely to be challenged need to be backed by citations. There are millions of articles available online and thousands of new articles are released each month. For this reason, finding relevant sources is a difficult task: many claims do not have any references that support them. Furthermore, even existing citations might not supp…

    Submitted 8 July, 2022; originally announced July 2022.

  8. arXiv:2205.12570  [pdf, other]

    cs.CL

    EDIN: An End-to-end Benchmark and Pipeline for Unknown Entity Discovery and Indexing

    Authors: Nora Kassner, Fabio Petroni, Mikhail Plekhanov, Sebastian Riedel, Nicola Cancedda

    Abstract: Existing work on Entity Linking mostly assumes that the reference knowledge base is complete, and therefore all mentions can be linked. In practice this is hardly ever the case, as knowledge bases are incomplete and because novel concepts arise constantly. This paper created the Unknown Entity Discovery and Indexing (EDIN) benchmark where unknown entities, that is entities without a description in…

    Submitted 25 May, 2022; originally announced May 2022.

  9. arXiv:2205.05812  [pdf, other]

    cs.CL cs.LG

    Open Vocabulary Extreme Classification Using Generative Models

    Authors: Daniel Simig, Fabio Petroni, Pouya Yanki, Kashyap Popat, Christina Du, Sebastian Riedel, Majid Yazdani

    Abstract: The extreme multi-label classification (XMC) task aims at tagging content with a subset of labels from an extremely large label set. The label vocabulary is typically defined in advance by domain experts and assumed to capture all necessary tags. However in real world scenarios this label set, although large, is often incomplete and experts frequently need to refine it. To develop systems that sim…

    Submitted 11 May, 2022; originally announced May 2022.

  10. arXiv:2204.10628  [pdf, other]

    cs.CL cs.IR

    Autoregressive Search Engines: Generating Substrings as Document Identifiers

    Authors: Michele Bevilacqua, Giuseppe Ottaviano, Patrick Lewis, Wen-tau Yih, Sebastian Riedel, Fabio Petroni

    Abstract: Knowledge-intensive language tasks require NLP systems to both provide the correct answer and retrieve supporting evidence for it in a given corpus. Autoregressive language models are emerging as the de-facto standard for generating answers, with newer and more powerful systems emerging at an astonishing pace. In this paper we argue that all this (and future) progress can be directly applied to th…

    Submitted 22 April, 2022; originally announced April 2022.

    Comments: 9 pages

  11. arXiv:2202.13844  [pdf, other]

    cs.DM math.CO

    All Graphs with at most 8 nodes are 2-interval-PCGs

    Authors: Tiziana Calamoneri, Angelo Monti, Fabrizio Petroni

    Abstract: A graph G is a multi-interval PCG if there exist an edge weighted tree T with non-negative real values and disjoint intervals of the non-negative real half-line such that each node of G is uniquely associated to a leaf of T and there is an edge between two nodes in G if and only if the weighted distance between their corresponding leaves in T lies within any such intervals. If the number of interv…

    Submitted 22 May, 2024; v1 submitted 28 February, 2022; originally announced February 2022.

    Comments: 7 pages, 3 figures, never published

  12. arXiv:2201.10990  [pdf, other]

    cs.CV

    Learning To Recognize Procedural Activities with Distant Supervision

    Authors: Xudong Lin, Fabio Petroni, Gedas Bertasius, Marcus Rohrbach, Shih-Fu Chang, Lorenzo Torresani

    Abstract: In this paper we consider the problem of classifying fine-grained, multi-step activities (e.g., cooking different recipes, making disparate home improvements, creating various forms of arts and crafts) from long videos spanning up to several minutes. Accurately categorizing these activities requires not only recognizing the individual steps that compose the task but also capturing their temporal d…

    Submitted 16 June, 2022; v1 submitted 26 January, 2022; originally announced January 2022.

    Comments: CVPR 2022. Code will be released here https://github.com/facebookresearch/video-distant-supervision

  13. arXiv:2112.09924  [pdf, other]

    cs.CL cs.AI cs.IR cs.LG

    The Web Is Your Oyster - Knowledge-Intensive NLP against a Very Large Web Corpus

    Authors: Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Dmytro Okhonko, Samuel Broscheit, Gautier Izacard, Patrick Lewis, Barlas Oğuz, Edouard Grave, Wen-tau Yih, Sebastian Riedel

    Abstract: In order to address increasing demands of real-world applications, the research for knowledge-intensive NLP (KI-NLP) should advance by capturing the challenges of a truly open-domain environment: web-scale knowledge, lack of structure, inconsistent quality and noise. To this end, we propose a new setup for evaluating existing knowledge intensive tasks in which we generalize the background corpus t…

    Submitted 24 May, 2022; v1 submitted 18 December, 2021; originally announced December 2021.

  14. arXiv:2112.08340  [pdf, other]

    cs.CL cs.LG stat.ML

    GenIE: Generative Information Extraction

    Authors: Martin Josifoski, Nicola De Cao, Maxime Peyrard, Fabio Petroni, Robert West

    Abstract: Structured and grounded representation of text is typically formalized by closed information extraction, the problem of extracting an exhaustive set of (subject, relation, object) triplets that are consistent with a predefined set of entities and relations from a knowledge base schema. Most existing works are pipelines prone to error accumulation, and all approaches are only applicable to unrealis…

    Submitted 13 April, 2022; v1 submitted 15 December, 2021; originally announced December 2021.

    Comments: Accepted at NAACL 2022

  15. arXiv:2112.07771  [pdf, other]

    cs.CL cs.IR

    Boosted Dense Retriever

    Authors: Patrick Lewis, Barlas Oğuz, Wenhan Xiong, Fabio Petroni, Wen-tau Yih, Sebastian Riedel

    Abstract: We propose DrBoost, a dense retrieval ensemble inspired by boosting. DrBoost is trained in stages: each component model is learned sequentially and specialized by focusing only on retrieval mistakes made by the current ensemble. The final representation is the concatenation of the output vectors of all the component models, making it a drop-in replacement for standard dense retrievers at test time…

    Submitted 14 December, 2021; originally announced December 2021.

  16. arXiv:2109.13202  [pdf, other]

    cs.LG stat.ML

    MiniHack the Planet: A Sandbox for Open-Ended Reinforcement Learning Research

    Authors: Mikayel Samvelyan, Robert Kirk, Vitaly Kurin, Jack Parker-Holder, Minqi Jiang, Eric Hambro, Fabio Petroni, Heinrich Küttler, Edward Grefenstette, Tim Rocktäschel

    Abstract: Progress in deep reinforcement learning (RL) is heavily driven by the availability of challenging benchmarks used for training agents. However, benchmarks that are widely adopted by the community are not explicitly designed for evaluating specific capabilities of RL methods. While there exist environments for assessing particular open problems in RL (such as exploration, transfer learning, unsuper…

    Submitted 16 November, 2021; v1 submitted 27 September, 2021; originally announced September 2021.

    Comments: NeurIPS 2021: Datasets and Benchmarks Track

  17. arXiv:2106.13353  [pdf, other]

    cs.CL cs.LG

    Cutting Down on Prompts and Parameters: Simple Few-Shot Learning with Language Models

    Authors: Robert L. Logan IV, Ivana Balažević, Eric Wallace, Fabio Petroni, Sameer Singh, Sebastian Riedel

    Abstract: Prompting language models (LMs) with training examples and task descriptions has been seen as critical to recent successes in few-shot learning. In this work, we show that finetuning LMs in the few-shot setting can considerably reduce the need for prompt engineering. In fact, one can use null prompts, prompts that contain neither task-specific templates nor training examples, and achieve competiti…

    Submitted 1 July, 2021; v1 submitted 24 June, 2021; originally announced June 2021.

  18. arXiv:2104.00353  [pdf, other]

    eess.AS cs.LG

    CycleDRUMS: Automatic Drum Arrangement For Bass Lines Using CycleGAN

    Authors: Giorgio Barnabò, Giovanni Trappolini, Lorenzo Lastilla, Cesare Campagnano, Angela Fan, Fabio Petroni, Fabrizio Silvestri

    Abstract: The two main research threads in computer-based music generation are: the construction of autonomous music-making systems, and the design of computer-based environments to assist musicians. In the symbolic domain, the key problem of automatically arranging a piece of music was extensively studied, while relatively fewer systems tackled this challenge in the audio domain. In this contribution, we prop…

    Submitted 9 April, 2021; v1 submitted 1 April, 2021; originally announced April 2021.

    Comments: 9 pages, 5 figures, submitted to IEEE Transactions on Multimedia, the authors contributed equally to this work

  19. arXiv:2103.12528  [pdf, other]

    cs.CL cs.AI stat.ML

    Multilingual Autoregressive Entity Linking

    Authors: Nicola De Cao, Ledell Wu, Kashyap Popat, Mikel Artetxe, Naman Goyal, Mikhail Plekhanov, Luke Zettlemoyer, Nicola Cancedda, Sebastian Riedel, Fabio Petroni

    Abstract: We present mGENRE, a sequence-to-sequence system for the Multilingual Entity Linking (MEL) problem -- the task of resolving language-specific mentions to a multilingual Knowledge Base (KB). For a mention in a given language, mGENRE predicts the name of the target entity left-to-right, token-by-token in an autoregressive fashion. The autoregressive formulation allows us to effectively cross-encode…

    Submitted 23 March, 2021; originally announced March 2021.

    Comments: 20 pages, 8 figures, and 11 tables

  20. arXiv:2101.00133  [pdf, other]

    cs.CL cs.AI

    NeurIPS 2020 EfficientQA Competition: Systems, Analyses and Lessons Learned

    Authors: Sewon Min, Jordan Boyd-Graber, Chris Alberti, Danqi Chen, Eunsol Choi, Michael Collins, Kelvin Guu, Hannaneh Hajishirzi, Kenton Lee, Jennimaria Palomaki, Colin Raffel, Adam Roberts, Tom Kwiatkowski, Patrick Lewis, Yuxiang Wu, Heinrich Küttler, Linqing Liu, Pasquale Minervini, Pontus Stenetorp, Sebastian Riedel, Sohee Yang, Minjoon Seo, Gautier Izacard, Fabio Petroni, Lucas Hosseini, et al. (28 additional authors not shown)

    Abstract: We review the EfficientQA competition from NeurIPS 2020. The competition focused on open-domain question answering (QA), where systems take natural language questions as input and return natural language answers. The aim of the competition was to build systems that can predict correct answers while also satisfying strict on-disk memory budgets. These memory budgets were designed to encourage conte…

    Submitted 19 September, 2021; v1 submitted 31 December, 2020; originally announced January 2021.

    Comments: 26 pages; Published in Proceedings of Machine Learning Research (PMLR), NeurIPS 2020 Competition and Demonstration Track

  21. arXiv:2101.00117  [pdf, other]

    cs.CL

    Multi-task Retrieval for Knowledge-Intensive Tasks

    Authors: Jean Maillard, Vladimir Karpukhin, Fabio Petroni, Wen-tau Yih, Barlas Oğuz, Veselin Stoyanov, Gargi Ghosh

    Abstract: Retrieving relevant contexts from a large corpus is a crucial step for tasks such as open-domain question answering and fact checking. Although neural retrieval outperforms traditional methods like tf-idf and BM25, its performance degrades considerably when applied to out-of-domain data. Driven by the question of whether a neural retrieval model can be universal and perform robustly on a wide va…

    Submitted 31 December, 2020; originally announced January 2021.

  22. arXiv:2012.15156  [pdf, other]

    cs.CL

    A Memory Efficient Baseline for Open Domain Question Answering

    Authors: Gautier Izacard, Fabio Petroni, Lucas Hosseini, Nicola De Cao, Sebastian Riedel, Edouard Grave

    Abstract: Recently, retrieval systems based on dense representations have led to important improvements in open-domain question answering, and related tasks. While very effective, this approach is also memory intensive, as the dense vectors for the whole knowledge source need to be kept in memory. In this paper, we study how the memory footprint of dense retriever-reader systems can be reduced. We consider…

    Submitted 30 December, 2020; originally announced December 2020.

  23. arXiv:2011.05448  [pdf, other]

    cs.CL

    Generating Fact Checking Briefs

    Authors: Angela Fan, Aleksandra Piktus, Fabio Petroni, Guillaume Wenzek, Marzieh Saeidi, Andreas Vlachos, Antoine Bordes, Sebastian Riedel

    Abstract: Fact checking at scale is difficult -- while the number of active fact checking websites is growing, it remains too small for the needs of the contemporary media ecosystem. However, despite good intentions, contributions from volunteers are often error-prone, and thus in practice restricted to claim detection. We investigate how to increase the accuracy and efficiency of fact checking by providing…

    Submitted 10 November, 2020; originally announced November 2020.

  24. arXiv:2010.00904  [pdf, other]

    cs.CL cs.IR cs.LG stat.ML

    Autoregressive Entity Retrieval

    Authors: Nicola De Cao, Gautier Izacard, Sebastian Riedel, Fabio Petroni

    Abstract: Entities are at the center of how we represent and aggregate knowledge. For instance, Encyclopedias such as Wikipedia are structured by entities (e.g., one per Wikipedia article). The ability to retrieve such entities given a query is fundamental for knowledge-intensive tasks such as entity linking and open-domain question answering. Current approaches can be understood as classifiers among atomic…

    Submitted 24 March, 2021; v1 submitted 2 October, 2020; originally announced October 2020.

    Comments: Accepted (spotlight) at International Conference on Learning Representations (ICLR) 2021. Code at https://github.com/facebookresearch/GENRE. 20 pages, 9 figures, 8 tables

  25. arXiv:2009.02252  [pdf, other]

    cs.CL cs.AI cs.IR cs.LG

    KILT: a Benchmark for Knowledge Intensive Language Tasks

    Authors: Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, Vassilis Plachouras, Tim Rocktäschel, Sebastian Riedel

    Abstract: Challenging problems such as open-domain question answering, fact checking, slot filling and entity linking require access to large, external knowledge sources. While some models do well on individual tasks, developing general models is difficult as each task might require computationally expensive indexing of custom knowledge sources, in addition to dedicated infrastructure. To catalyze research…

    Submitted 27 May, 2021; v1 submitted 4 September, 2020; originally announced September 2020.

    Comments: accepted at NAACL 2021

  26. arXiv:2006.07203   

    cs.CV

    Video Understanding as Machine Translation

    Authors: Bruno Korbar, Fabio Petroni, Rohit Girdhar, Lorenzo Torresani

    Abstract: With the advent of large-scale multimodal video datasets, especially sequences with audio or transcribed speech, there has been a growing interest in self-supervised learning of video representations. Most prior work formulates the objective as a contrastive metric learning problem between the modalities. To enable effective learning, however, these strategies require a careful selection of positi…

    Submitted 17 September, 2020; v1 submitted 12 June, 2020; originally announced June 2020.

    Comments: The authors have temporarily withdrawn this paper to reassess some of the experimental results

  27. arXiv:2006.00937  [pdf, ps, other]

    cs.LG cs.IR stat.ML

    Concept Matching for Low-Resource Classification

    Authors: Federico Errica, Ludovic Denoyer, Bora Edizel, Fabio Petroni, Vassilis Plachouras, Fabrizio Silvestri, Sebastian Riedel

    Abstract: We propose a model to tackle classification tasks in the presence of very little training data. To this aim, we approximate the notion of exact match with a theoretically sound mechanism that computes a probability of matching in the input space. Importantly, the model learns to focus on elements of the input that are relevant for the task at hand; by leveraging highlighted portions of the trainin…

    Submitted 1 June, 2020; originally announced June 2020.

  28. arXiv:2005.11401  [pdf, other]

    cs.CL cs.LG

    Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

    Authors: Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela

    Abstract: Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-specific architectures. Additionally, providing provenance for…

    Submitted 12 April, 2021; v1 submitted 22 May, 2020; originally announced May 2020.

    Comments: Accepted at NeurIPS 2020

  29. arXiv:2005.04611  [pdf, other]

    cs.CL

    How Context Affects Language Models' Factual Predictions

    Authors: Fabio Petroni, Patrick Lewis, Aleksandra Piktus, Tim Rocktäschel, Yuxiang Wu, Alexander H. Miller, Sebastian Riedel

    Abstract: When pre-trained on large unsupervised textual corpora, language models are able to store and retrieve factual knowledge to some extent, making it possible to use them directly for zero-shot cloze-style question answering. However, storing factual knowledge in a fixed number of weights of a language model clearly has limitations. Previous approaches have successfully provided access to information…

    Submitted 10 May, 2020; originally announced May 2020.

    Comments: accepted at AKBC 2020

  30. arXiv:1911.03814  [pdf, other]

    cs.CL

    Scalable Zero-shot Entity Linking with Dense Entity Retrieval

    Authors: Ledell Wu, Fabio Petroni, Martin Josifoski, Sebastian Riedel, Luke Zettlemoyer

    Abstract: This paper introduces a conceptually simple, scalable, and highly effective BERT-based entity linking model, along with an extensive evaluation of its accuracy-speed trade-off. We present a two-stage zero-shot linking algorithm, where each entity is defined only by a short textual description. The first stage does retrieval in a dense space defined by a bi-encoder that independently embeds the men…

    Submitted 29 September, 2020; v1 submitted 9 November, 2019; originally announced November 2019.

    Comments: accepted at EMNLP 2020

  31. arXiv:1911.03587  [pdf, other]

    cs.CL

    How Decoding Strategies Affect the Verifiability of Generated Text

    Authors: Luca Massarelli, Fabio Petroni, Aleksandra Piktus, Myle Ott, Tim Rocktäschel, Vassilis Plachouras, Fabrizio Silvestri, Sebastian Riedel

    Abstract: Recent progress in pre-trained language models led to systems that are able to generate text of an increasingly high quality. While several works have investigated the fluency and grammatical correctness of such models, it is still unclear to which extent the generated text is consistent with factual world knowledge. Here, we go beyond fluency and also investigate the verifiability of text generat…

    Submitted 29 September, 2020; v1 submitted 8 November, 2019; originally announced November 2019.

    Comments: accepted at Findings of EMNLP 2020

  32. arXiv:1909.01066  [pdf, other]

    cs.CL

    Language Models as Knowledge Bases?

    Authors: Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H. Miller, Sebastian Riedel

    Abstract: Recent progress in pretraining language models on large textual corpora led to a surge of improvements for downstream NLP tasks. Whilst learning linguistic knowledge, these models may also be storing relational knowledge present in the training data, and may be able to answer queries structured as "fill-in-the-blank" cloze statements. Language models have many advantages over structured knowledge…

    Submitted 4 September, 2019; v1 submitted 3 September, 2019; originally announced September 2019.

    Comments: accepted at EMNLP 2019

  33. arXiv:1811.05296  [pdf, other]

    cs.CR cs.LG

    SAFE: Self-Attentive Function Embeddings for Binary Similarity

    Authors: Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, Roberto Baldoni

    Abstract: The binary similarity problem consists in determining if two functions are similar by only considering their compiled form. Advanced techniques for binary similarity recently gained momentum as they can be applied in several fields, such as copyright disputes, malware analysis, vulnerability detection, etc., and thus have an immediate practical impact. Current solutions compare functions by first…

    Submitted 19 December, 2019; v1 submitted 13 November, 2018; originally announced November 2018.

    Comments: Published in International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment (DIMVA) 2019

  34. arXiv:1810.09683  [pdf, other]

    cs.LG cs.DC

    Unsupervised Features Extraction for Binary Similarity Using Graph Embedding Neural Networks

    Authors: Roberto Baldoni, Giuseppe Antonio Di Luna, Luca Massarelli, Fabio Petroni, Leonardo Querzoni

    Abstract: In this paper we consider the binary similarity problem that consists in determining if two binary functions are similar only considering their compiled form. This problem is known to be crucial in several application scenarios, such as copyright disputes, malware analysis, vulnerability detection, etc. The current state-of-the-art solutions in this field work by creating an embedding model that ma…

    Submitted 13 November, 2018; v1 submitted 23 October, 2018; originally announced October 2018.

  35. arXiv:1509.00189  [pdf, other]

    cs.CY cs.HC cs.SI physics.soc-ph

    Echo chambers in the age of misinformation

    Authors: Michela Del Vicario, Alessandro Bessi, Fabiana Zollo, Fabio Petroni, Antonio Scala, Guido Caldarelli, H. Eugene Stanley, Walter Quattrociocchi

    Abstract: The wide availability of user-provided content in online social media facilitates the aggregation of people around common interests, worldviews, and narratives. Despite the enthusiastic rhetoric on the part of some that this process generates "collective intelligence", the WWW also allows the rapid dissemination of unsubstantiated conspiracy theories that often elicit rapid, large, but naive soci…

    Submitted 21 December, 2015; v1 submitted 1 September, 2015; originally announced September 2015.

  36. arXiv:1501.07201  [pdf, other]

    cs.SI cs.HC physics.data-an physics.soc-ph

    Everyday the Same Picture: Popularity and Content Diversity

    Authors: Alessandro Bessi, Fabiana Zollo, Michela Del Vicario, Antonio Scala, Fabio Petroni, Bruno Gonçalves, Walter Quattrociocchi

    Abstract: Facebook is flooded by diverse and heterogeneous content, from kittens up to music and news, passing through satirical and funny stories. Each piece of that corpus reflects the heterogeneity of the underlying social background. In the Italian Facebook we have found an interesting case: a page having more than 40K followers that every day posts the same picture of a popular Italian singer. In thi…

    Submitted 2 February, 2015; v1 submitted 28 January, 2015; originally announced January 2015.

  37. arXiv:1411.2893  [pdf, other]

    cs.SI cs.CY physics.soc-ph

    Viral Misinformation: The Role of Homophily and Polarization

    Authors: Aris Anagnostopoulos, Alessandro Bessi, Guido Caldarelli, Michela Del Vicario, Fabio Petroni, Antonio Scala, Fabiana Zollo, Walter Quattrociocchi

    Abstract: The spreading of unsubstantiated rumors on online social networks (OSN) either unintentionally or intentionally (e.g., for political reasons or even trolling) can have serious consequences such as in the recent case of rumors about Ebola causing disruption to health-care workers. Here we show that indicators aimed at quantifying information consumption patterns might provide important insights abo…

    Submitted 11 November, 2014; originally announced November 2014.

    Comments: Misinformation, Virality, Attention Patterns

  38. arXiv:1102.2180  [pdf, ps, other]

    cs.CL physics.soc-ph

    Malagasy Dialects and the Peopling of Madagascar

    Authors: M. Serva, F. Petroni, D. Volchenkov, S. Wichmann

    Abstract: The origin of Malagasy DNA is half African and half Indonesian, nevertheless the Malagasy language, spoken by the entire population, belongs to the Austronesian family. The language most closely related to Malagasy is Maanyan (Greater Barito East group of the Austronesian family), but related languages are also in Sulawesi, Malaysia and Sumatra. For this reason, and because Maanyan is spoken by a…

    Submitted 13 February, 2011; v1 submitted 10 February, 2011; originally announced February 2011.

  39. arXiv:0912.0884  [pdf, ps, other]

    cs.CL physics.soc-ph

    Measures of lexical distance between languages

    Authors: Filippo Petroni, Maurizio Serva

    Abstract: The idea of measuring distance between languages seems to have its roots in the work of the French explorer Dumont D'Urville. He collected comparative words lists of various languages during his voyages aboard the Astrolabe from 1826 to 1829 and, in his work about the geographical division of the Pacific, he proposed a method to measure the degree of relation among languages. The meth…

    Submitted 9 December, 2009; v1 submitted 4 December, 2009; originally announced December 2009.

  40. Lexical evolution rates by automated stability measure

    Authors: Filippo Petroni, Maurizio Serva

    Abstract: Phylogenetic trees can be reconstructed from the matrix which contains the distances between all pairs of languages in a family. Recently, we proposed a new method which uses normalized Levenshtein distances among words with same meaning and averages on all the items of a given list. Decisions about the number of items in the input lists for language comparison have been debated since the beginn…

    Submitted 9 December, 2009; v1 submitted 4 December, 2009; originally announced December 2009.

  41. arXiv:0911.3292  [pdf]

    cs.CL physics.soc-ph q-bio.PE

    Automated words stability and languages phylogeny

    Authors: Filippo Petroni, Maurizio Serva

    Abstract: The idea of measuring distance between languages seems to have its roots in the work of the French explorer Dumont D'Urville (D'Urville 1832). He collected comparative words lists of various languages during his voyages aboard the Astrolabe from 1826 to 1829 and, in his work about the geographical division of the Pacific, he proposed a method to measure the degree of relation among languages. The…

    Submitted 5 December, 2009; v1 submitted 17 November, 2009; originally announced November 2009.

    Comments: XI International Conference "Cognitive Modeling in Linguistics-2009" Constanca, Romania, September, 7-14, 2009