Showing 1–26 of 26 results for author: Najork, M

  1. arXiv:2311.17650  [pdf, other]

    cs.IR

    Creator Context for Tweet Recommendation

    Authors: Spurthi Amba Hombaiah, Tao Chen, Mingyang Zhang, Michael Bendersky, Marc Najork, Matt Colen, Sergey Levi, Vladimir Ofitserov, Tanvir Amin

    Abstract: When discussing a tweet, people usually not only refer to the content it delivers, but also to the person behind the tweet. In other words, grounding the interpretation of the tweet in the context of its creator plays an important role in deciphering the true intent and the importance of the tweet. In this paper, we attempt to answer the question of how creator context should be used to advance…

    Submitted 29 November, 2023; originally announced November 2023.

  2. arXiv:2305.11944  [pdf, other]

    cs.IR cs.CL

    Exploring the Viability of Synthetic Query Generation for Relevance Prediction

    Authors: Aditi Chaudhary, Karthik Raman, Krishna Srinivasan, Kazuma Hashimoto, Mike Bendersky, Marc Najork

    Abstract: Query-document relevance prediction is a critical problem in Information Retrieval systems. This problem has increasingly been tackled using (pretrained) transformer-based models which are finetuned using large collections of labeled data. However, in specialized domains such as e-commerce and healthcare, the viability of this approach is limited by the dearth of large in-domain data. To address t…

    Submitted 16 June, 2023; v1 submitted 19 May, 2023; originally announced May 2023.

    Comments: In Proceedings of ACM SIGIR Workshop on eCommerce (SIGIR eCom 23)

  3. arXiv:2305.05010  [pdf, other]

    cs.LG cs.CL

    Do Not Blindly Imitate the Teacher: Using Perturbed Loss for Knowledge Distillation

    Authors: Rongzhi Zhang, Jiaming Shen, Tianqi Liu, Jialu Liu, Michael Bendersky, Marc Najork, Chao Zhang

    Abstract: Knowledge distillation is a popular technique to transfer knowledge from large teacher models to a small student model. Typically, the student learns to imitate the teacher by minimizing the KL divergence of its output distribution with the teacher's output distribution. In this work, we argue that such a learning objective is sub-optimal because there exists a discrepancy between the teacher's ou…

    Submitted 8 May, 2023; originally announced May 2023.

    Comments: 16 pages

  4. arXiv:2302.05852  [pdf, other]

    cs.CL cs.AI cs.IR

    "Why is this misleading?": Detecting News Headline Hallucinations with Explanations

    Authors: Jiaming Shen, Jialu Liu, Dan Finnie, Negar Rahmati, Michael Bendersky, Marc Najork

    Abstract: Automatic headline generation enables users to comprehend ongoing news events promptly and has recently become an important task in web mining and natural language processing. With the growing need for news headline generation, we argue that the hallucination issue, namely the generated headlines being not supported by the original news stories, is a critical challenge for the deployment of this f…

    Submitted 11 February, 2023; originally announced February 2023.

    Comments: WWW 2023, 12 pages

  5. arXiv:2212.13937  [pdf, other]

    cs.IR cs.AI

    Towards Disentangling Relevance and Bias in Unbiased Learning to Rank

    Authors: Yunan Zhang, Le Yan, Zhen Qin, Honglei Zhuang, Jiaming Shen, Xuanhui Wang, Michael Bendersky, Marc Najork

    Abstract: Unbiased learning to rank (ULTR) studies the problem of mitigating various biases from implicit user feedback data such as clicks, and has been receiving considerable attention recently. A popular ULTR approach for real-world applications uses a two-tower architecture, where click modeling is factorized into a relevance tower with regular input features, and a bias tower with bias-relevant inputs…

    Submitted 4 June, 2023; v1 submitted 28 December, 2022; originally announced December 2022.

    Comments: Proceedings of the 29th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

  6. arXiv:2212.09744  [pdf, other]

    cs.CL cs.AI cs.IR cs.LG

    DSI++: Updating Transformer Memory with New Documents

    Authors: Sanket Vaibhav Mehta, Jai Gupta, Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Jinfeng Rao, Marc Najork, Emma Strubell, Donald Metzler

    Abstract: Differentiable Search Indices (DSIs) encode a corpus of documents in model parameters and use the same model to answer user queries directly. Despite the strong performance of DSI models, deploying them in situations where the corpus changes over time is computationally expensive because reindexing the corpus requires re-training the model. In this work, we introduce DSI++, a continual learning ch…

    Submitted 8 December, 2023; v1 submitted 19 December, 2022; originally announced December 2022.

    Comments: Accepted at EMNLP 2023 main conference

  7. arXiv:2211.01494  [pdf, other]

    cs.IR

    Regression Compatible Listwise Objectives for Calibrated Ranking with Binary Relevance

    Authors: Aijun Bai, Rolf Jagerman, Zhen Qin, Le Yan, Pratyush Kar, Bing-Rong Lin, Xuanhui Wang, Michael Bendersky, Marc Najork

    Abstract: As Learning-to-Rank (LTR) approaches primarily seek to improve ranking quality, their output scores are not scale-calibrated by design. This fundamentally limits LTR usage in score-sensitive applications. Though a simple multi-objective approach that combines a regression and a ranking objective can effectively learn scale-calibrated scores, we argue that the two objectives are not necessarily com…

    Submitted 21 August, 2023; v1 submitted 2 November, 2022; originally announced November 2022.

  8. Out-of-Domain Semantics to the Rescue! Zero-Shot Hybrid Retrieval Models

    Authors: Tao Chen, Mingyang Zhang, Jing Lu, Michael Bendersky, Marc Najork

    Abstract: The pre-trained language model (e.g., BERT) based deep retrieval models achieved superior performance over lexical retrieval models (e.g., BM25) in many passage retrieval tasks. However, limited work has been done to generalize a deep retrieval model to other tasks and domains. In this work, we carefully select five datasets, including two in-domain datasets and three out-of-domain datasets with diffe…

    Submitted 25 January, 2022; originally announced January 2022.

    Comments: Accepted at ECIR 2022 (full paper)

  9. arXiv:2201.02647  [pdf, other]

    cs.LG cs.IR

    Data-Efficient Information Extraction from Form-Like Documents

    Authors: Beliz Gunel, Navneet Potti, Sandeep Tata, James B. Wendt, Marc Najork, Jing Xie

    Abstract: Automating information extraction from form-like documents at scale is a pressing need due to its potential impact on automating business workflows across many industries like financial services, insurance, and healthcare. The key challenge is that form-like documents in these business workflows can be laid out in virtually infinitely many ways; hence, a good solution to this problem should genera…

    Submitted 7 January, 2022; originally announced January 2022.

    Comments: Published at the 2nd Document Intelligence Workshop @ KDD 2021 (https://document-intelligence.github.io/DI-2021/)

  10. arXiv:2112.09727  [pdf, other]

    cs.LG cs.AI cs.IR

    Rank4Class: A Ranking Formulation for Multiclass Classification

    Authors: Nan Wang, Zhen Qin, Le Yan, Honglei Zhuang, Xuanhui Wang, Michael Bendersky, Marc Najork

    Abstract: Multiclass classification (MCC) is a fundamental machine learning problem of classifying each instance into one of a predefined set of classes. In the deep learning era, extensive efforts have been spent on developing more powerful neural embedding models to better represent the instance for improving MCC performance. In this paper, we do not aim to propose new neural models for instance represent…

    Submitted 21 December, 2022; v1 submitted 17 December, 2021; originally announced December 2021.

  11. arXiv:2109.15285  [pdf, other]

    cs.IR

    Improving Neural Ranking via Lossless Knowledge Distillation

    Authors: Zhen Qin, Le Yan, Yi Tay, Honglei Zhuang, Xuanhui Wang, Michael Bendersky, Marc Najork

    Abstract: We explore a novel perspective of knowledge distillation (KD) for learning to rank (LTR), and introduce Self-Distilled neural Rankers (SDR), where student rankers are parameterized identically to their teachers. Unlike the existing ranking distillation work which pursues a good trade-off between performance and efficiency, SDR is able to significantly improve ranking performance of students over t…

    Submitted 6 April, 2022; v1 submitted 30 September, 2021; originally announced September 2021.

    Comments: 15 pages

  12. Dynamic Language Models for Continuously Evolving Content

    Authors: Spurthi Amba Hombaiah, Tao Chen, Mingyang Zhang, Michael Bendersky, Marc Najork

    Abstract: The content on the web is in a constant state of flux. New entities, issues, and ideas continuously emerge, while the semantics of the existing conversation topics gradually shift. In recent years, pre-trained language models like BERT greatly improved the state-of-the-art for a large spectrum of content understanding tasks. Therefore, in this paper, we aim to study how these language models can b…

    Submitted 11 June, 2021; originally announced June 2021.

    Journal ref: KDD 2021

  13. Rethinking Search: Making Domain Experts out of Dilettantes

    Authors: Donald Metzler, Yi Tay, Dara Bahri, Marc Najork

    Abstract: When experiencing an information need, users want to engage with a domain expert, but often turn to an information retrieval system, such as a search engine, instead. Classical information retrieval systems do not answer information needs directly, but instead provide references to (hopefully authoritative) answers. Successful question answering systems offer a limited corpus created on-demand by…

    Submitted 21 July, 2021; v1 submitted 5 May, 2021; originally announced May 2021.

    Journal ref: SIGIR Forum 55, 1, Article 13 (June 2021), 27 pages

  14. Natural Language Understanding with Privacy-Preserving BERT

    Authors: Chen Qu, Weize Kong, Liu Yang, Mingyang Zhang, Michael Bendersky, Marc Najork

    Abstract: Privacy preservation remains a key challenge in data mining and Natural Language Understanding (NLU). Previous research shows that the input text or even text embeddings can leak private information. This concern motivates our research on effective privacy preservation approaches for pretrained Language Models (LMs). We investigate the privacy and utility implications of applying dx-privacy, a var…

    Submitted 19 August, 2021; v1 submitted 15 April, 2021; originally announced April 2021.

    Comments: Accepted to CIKM 2021

  15. arXiv:2103.01913  [pdf, other]

    cs.CV cs.CL cs.IR

    WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning

    Authors: Krishna Srinivasan, Karthik Raman, Jiecao Chen, Michael Bendersky, Marc Najork

    Abstract: The milestone improvements brought about by deep representation learning and pre-training techniques have led to large performance gains across downstream NLP, IR and Vision tasks. Multimodal modeling techniques aim to leverage large high-quality visio-linguistic datasets for learning complementary information (across image and text modalities). In this paper, we introduce the Wikipedia-based Imag…

    Submitted 3 March, 2021; v1 submitted 2 March, 2021; originally announced March 2021.

  16. Scalable Hierarchical Agglomerative Clustering

    Authors: Nicholas Monath, Avinava Dubey, Guru Guruganesh, Manzil Zaheer, Amr Ahmed, Andrew McCallum, Gokhan Mergen, Marc Najork, Mert Terzihan, Bryon Tjanaka, Yuan Wang, Yuchen Wu

    Abstract: The applicability of agglomerative clustering, for inferring both hierarchical and flat clustering, is limited by its scalability. Existing scalable hierarchical clustering methods sacrifice quality for speed and often lead to over-merging of clusters. In this paper, we present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of da…

    Submitted 30 September, 2021; v1 submitted 22 October, 2020; originally announced October 2020.

    Comments: Appeared in KDD '21: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining

  17. DiPair: Fast and Accurate Distillation for Trillion-Scale Text Matching and Pair Modeling

    Authors: Jiecao Chen, Liu Yang, Karthik Raman, Michael Bendersky, Jung-Jung Yeh, Yun Zhou, Marc Najork, Danyang Cai, Ehsan Emadzadeh

    Abstract: Pre-trained models like BERT (Devlin et al., 2018) have dominated NLP / IR applications such as single sentence classification, text pair classification, and question answering. However, deploying these models in real systems is highly non-trivial due to their exorbitant computational costs. A common remedy to this is knowledge distillation (Hinton et al., 2015), leading to faster inference. Howev…

    Submitted 6 October, 2020; originally announced October 2020.

    Comments: 13 pages. Accepted to Findings of EMNLP 2020

  18. arXiv:2010.01195  [pdf, other]

    cs.IR

    Leveraging Semantic and Lexical Matching to Improve the Recall of Document Retrieval Systems: A Hybrid Approach

    Authors: Saar Kuzi, Mingyang Zhang, Cheng Li, Michael Bendersky, Marc Najork

    Abstract: Search engines often follow a two-phase paradigm where in the first stage (the retrieval stage) an initial set of documents is retrieved and in the second stage (the re-ranking stage) the documents are re-ranked to obtain the final result list. While deep neural networks were shown to improve the performance of the re-ranking stage in previous works, there is little literature about using deep neu…

    Submitted 2 October, 2020; originally announced October 2020.

  19. arXiv:2005.11442  [pdf, other]

    cs.LG stat.ML

    Active Learning for Skewed Data Sets

    Authors: Abbas Kazerouni, Qi Zhao, Jing Xie, Sandeep Tata, Marc Najork

    Abstract: Consider a sequential active learning problem where, at each round, an agent selects a batch of unlabeled data points, queries their labels and updates a binary classifier. While there exists a rich body of work on active learning in this general form, in this paper, we focus on problems with two distinguishing characteristics: severe class imbalance (skew) and small amounts of initial training da…

    Submitted 22 May, 2020; originally announced May 2020.

  20. Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching

    Authors: Liu Yang, Mingyang Zhang, Cheng Li, Michael Bendersky, Marc Najork

    Abstract: Many natural language processing and information retrieval problems can be formalized as the task of semantic matching. Existing work in this area has been largely focused on matching between short texts (e.g., question answering), or between a short and a long text (e.g., ad-hoc retrieval). Semantic matching between long-form documents, which has many important applications like news recommendati…

    Submitted 12 October, 2020; v1 submitted 26 April, 2020; originally announced April 2020.

    Comments: Accepted as a full paper in CIKM 2020

  21. arXiv:2004.08476  [pdf, other]

    cs.IR

    Learning-to-Rank with BERT in TF-Ranking

    Authors: Shuguang Han, Xuanhui Wang, Mike Bendersky, Marc Najork

    Abstract: This paper describes a machine learning algorithm for document (re)ranking, in which queries and documents are firstly encoded using BERT [1], and on top of that a learning-to-rank (LTR) model constructed with TF-Ranking (TFR) [2] is applied to further optimize the ranking performance. This approach is proved to be effective in a public MS MARCO benchmark [3]. Our first two submissions achieve the…

    Submitted 8 June, 2020; v1 submitted 17 April, 2020; originally announced April 2020.

    Comments: 6 pages, 1 figure, 2 tables

  22. arXiv:1910.09676  [pdf, other]

    cs.IR cs.LG

    Self-Attentive Document Interaction Networks for Permutation Equivariant Ranking

    Authors: Rama Kumar Pasumarthi, Xuanhui Wang, Michael Bendersky, Marc Najork

    Abstract: How to leverage cross-document interactions to improve ranking performance is an important topic in information retrieval (IR) research. However, this topic has not been well-studied in the learning-to-rank setting and most of the existing work still treats each document independently while scoring. The recent development of deep learning shows strength in modeling complex relationships across seq…

    Submitted 23 October, 2019; v1 submitted 21 October, 2019; originally announced October 2019.

    Comments: 8 pages

  23. Estimating Position Bias without Intrusive Interventions

    Authors: Aman Agarwal, Ivan Zaitsev, Xuanhui Wang, Cheng Li, Marc Najork, Thorsten Joachims

    Abstract: Presentation bias is one of the key challenges when learning from implicit feedback in search engines, as it confounds the relevance signal. While it was recently shown how counterfactual learning-to-rank (LTR) approaches (Joachims et al., 2017) can provably overcome presentation bias when observation propensities are known, it remains to show how to effectively estimate these propensities. In th…

    Submitted 12 December, 2018; originally announced December 2018.

  24. TF-Ranking: Scalable TensorFlow Library for Learning-to-Rank

    Authors: Rama Kumar Pasumarthi, Sebastian Bruch, Xuanhui Wang, Cheng Li, Michael Bendersky, Marc Najork, Jan Pfeifer, Nadav Golbandi, Rohan Anil, Stephan Wolf

    Abstract: Learning-to-Rank deals with maximizing the utility of a list of examples presented to the user, with items of higher relevance being prioritized. It has several practical applications such as large-scale search, recommender systems, document summarization and question answering. While there is widespread support for classification and regression based learning, support for learning-to-rank in deep…

    Submitted 17 May, 2019; v1 submitted 30 November, 2018; originally announced December 2018.

    Comments: KDD 2019

  25. Learning Groupwise Multivariate Scoring Functions Using Deep Neural Networks

    Authors: Qingyao Ai, Xuanhui Wang, Sebastian Bruch, Nadav Golbandi, Michael Bendersky, Marc Najork

    Abstract: While in a classification or a regression setting a label or a value is assigned to each individual document, in a ranking setting we determine the relevance ordering of the entire input document list. This difference leads to the notion of relative relevance between documents in ranking. The majority of the existing learning-to-rank algorithms model such relativity at the loss level using pairwis…

    Submitted 4 August, 2019; v1 submitted 11 November, 2018; originally announced November 2018.

  26. arXiv:1810.05252  [pdf, ps, other]

    cs.IR

    Offline Comparison of Ranking Functions using Randomized Data

    Authors: Aman Agarwal, Xuanhui Wang, Cheng Li, Michael Bendersky, Marc Najork

    Abstract: Ranking functions return ranked lists of items, and users often interact with these items. How to evaluate ranking functions using historical interaction logs, also known as off-policy evaluation, is an important but challenging problem. The commonly used Inverse Propensity Scores (IPS) approaches work better for the single item case, but suffer from extremely low data efficiency for the ranked li…

    Submitted 11 October, 2018; originally announced October 2018.

    Comments: Published at REVEAL Workshop in ACM Recommender Systems (RecSys) (2018)