Showing 1–39 of 39 results for author: Sarawagi, S

  1. arXiv:2407.02819  [pdf, other]

    cs.CL cs.LG

    Efficient Training of Language Models with Compact and Consistent Next Token Distributions

    Authors: Ashutosh Sathe, Sunita Sarawagi

    Abstract: Maximizing the likelihood of the next token is an established, statistically sound objective for pre-training language models. In this paper we show that we can train better models faster by pre-aggregating the corpus with a collapsed $n$-gram distribution. Previous studies have proposed corpus-level $n$-gram statistics as a regularizer; however, the construction and querying of such $n$-grams, if…

    Submitted 3 July, 2024; originally announced July 2024.

    Comments: ACL 2024

  2. arXiv:2406.03864  [pdf, other]

    cs.LG

    PairNet: Training with Observed Pairs to Estimate Individual Treatment Effect

    Authors: Lokesh Nagalapatti, Pranava Singhal, Avishek Ghosh, Sunita Sarawagi

    Abstract: Given a dataset of individuals each described by a covariate vector, a treatment, and an observed outcome on the treatment, the goal of the individual treatment effect (ITE) estimation task is to predict outcome changes resulting from a change in treatment. A fundamental challenge is that in the observational data, a covariate's outcome is observed only under one treatment, whereas we need to infe…

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: Lokesh and Pranava contributed equally. Accepted at ICML-24

  3. arXiv:2401.15447  [pdf, other]

    cs.LG stat.ML

    Continuous Treatment Effect Estimation Using Gradient Interpolation and Kernel Smoothing

    Authors: Lokesh Nagalapatti, Akshay Iyer, Abir De, Sunita Sarawagi

    Abstract: We address the Individualized continuous treatment effect (ICTE) estimation problem where we predict the effect of any continuous-valued treatment on an individual using observational data. The main challenge in this estimation task is the potential confounding of treatment assignment with an individual's covariates in the training data, whereas during inference ICTE requires prediction on indepen…

    Submitted 27 January, 2024; originally announced January 2024.

    Comments: Accepted at AAAI 24

  4. arXiv:2311.01173  [pdf, other]

    cs.CL

    CRUSH4SQL: Collective Retrieval Using Schema Hallucination For Text2SQL

    Authors: Mayank Kothyari, Dhruva Dhingra, Sunita Sarawagi, Soumen Chakrabarti

    Abstract: Existing Text-to-SQL generators require the entire schema to be encoded with the user text. This is expensive or impractical for large databases with tens of thousands of columns. Standard dense retrieval techniques are inadequate for schema subsetting of a large structured database, where the correct semantics of retrieval demands that we rank sets of schema elements rather than individual elemen…

    Submitted 2 November, 2023; originally announced November 2023.

    Comments: To appear at EMNLP 2023 (Main)

  5. arXiv:2310.13659  [pdf, other]

    cs.CL

    Benchmarking and Improving Text-to-SQL Generation under Ambiguity

    Authors: Adithya Bhaskar, Tushar Tomar, Ashutosh Sathe, Sunita Sarawagi

    Abstract: Research in Text-to-SQL conversion has been largely benchmarked against datasets where each text query corresponds to one correct SQL. However, natural language queries over real-life databases frequently involve significant ambiguity about the intended SQL due to overlapping schema names and multiple confusing relationship paths. To bridge this gap, we develop a novel benchmark called AmbiQT with…

    Submitted 20 October, 2023; originally announced October 2023.

    Comments: To appear at EMNLP 2023 (Main)

  6. arXiv:2307.05006  [pdf, ps, other]

    cs.CL cs.LG eess.AS

    Improving RNN-Transducers with Acoustic LookAhead

    Authors: Vinit S. Unni, Ashish Mittal, Preethi Jyothi, Sunita Sarawagi

    Abstract: RNN-Transducers (RNN-Ts) have gained widespread acceptance as an end-to-end model for speech to text conversion because of their high accuracy and streaming capabilities. A typical RNN-T independently encodes the input audio and the text context, and combines the two encodings by a thin joint network. While this architecture provides SOTA streaming accuracy, it also makes the model vulnerable to s…

    Submitted 10 July, 2023; originally announced July 2023.

    Comments: 5 pages, 1 fig, 7 tables, Proceedings of Interspeech 2023

  7. arXiv:2301.04110  [pdf, other]

    cs.CL cs.AI

    Structured Case-based Reasoning for Inference-time Adaptation of Text-to-SQL parsers

    Authors: Abhijeet Awasthi, Soumen Chakrabarti, Sunita Sarawagi

    Abstract: Inference-time adaptation methods for semantic parsing are useful for leveraging examples from newly-observed domains without repeated fine-tuning. Existing approaches typically bias the decoder by simply concatenating input-output example pairs (cases) from the new domain at the encoder's input in a Seq-to-Seq model. Such methods cannot adequately leverage the structure of logical forms in the ca…

    Submitted 10 January, 2023; originally announced January 2023.

    Comments: AAAI 2023

  8. arXiv:2210.16613  [pdf, other]

    cs.CL cs.AI cs.LG

    Diverse Parallel Data Synthesis for Cross-Database Adaptation of Text-to-SQL Parsers

    Authors: Abhijeet Awasthi, Ashutosh Sathe, Sunita Sarawagi

    Abstract: Text-to-SQL parsers typically struggle with databases unseen at training time. Adapting parsers to new databases is a challenging problem due to the lack of natural language queries in the new schemas. We present ReFill, a framework for synthesizing high-quality and textually diverse parallel datasets for adapting a Text-to-SQL parser to a target schema. ReFill learns to retrieve-and-edit tex…

    Submitted 29 October, 2022; originally announced October 2022.

    Comments: EMNLP 2022

  9. arXiv:2210.07313  [pdf, other]

    cs.CL cs.LG

    Bootstrapping Multilingual Semantic Parsers using Large Language Models

    Authors: Abhijeet Awasthi, Nitish Gupta, Bidisha Samanta, Shachi Dave, Sunita Sarawagi, Partha Talukdar

    Abstract: Despite cross-lingual generalization demonstrated by pre-trained multilingual models, the translate-train paradigm of transferring English datasets across multiple languages remains a key mechanism for training task-specific multilingual models. However, for many low-resource languages, the availability of a reliable translation service entails significant amounts of costly human-annotated t…

    Submitted 11 February, 2023; v1 submitted 13 October, 2022; originally announced October 2022.

    Comments: EACL-23

  10. arXiv:2204.00871  [pdf, other]

    cs.CL cs.LG

    Accurate Online Posterior Alignments for Principled Lexically-Constrained Decoding

    Authors: Soumya Chatterjee, Sunita Sarawagi, Preethi Jyothi

    Abstract: Online alignment in machine translation refers to the task of aligning a target word to a source word when the target sequence has only been partially decoded. Good online alignments facilitate important applications such as lexically constrained translation where user-defined dictionaries are used to inject lexical constraints into the translation model. We propose a novel posterior alignment tec…

    Submitted 2 April, 2022; originally announced April 2022.

    Comments: 15 pages, 2 figures. ACL 2022

  11. arXiv:2203.02317  [pdf, other]

    cs.CL cs.LG

    Adaptive Discounting of Implicit Language Models in RNN-Transducers

    Authors: Vinit Unni, Shreya Khare, Ashish Mittal, Preethi Jyothi, Sunita Sarawagi, Samarth Bharadwaj

    Abstract: RNN-Transducer (RNN-T) models have become synonymous with streaming end-to-end ASR systems. While they perform competitively on a number of evaluation categories, rare words pose a serious challenge to RNN-T models. One main reason for the degradation in performance on rare words is that the language model (LM) internal to RNN-Ts can become overconfident and lead to hallucinated predictions that a…

    Submitted 21 February, 2022; originally announced March 2022.

    Comments: Proceedings for ICASSP 2022

  12. arXiv:2203.01976  [pdf, other]

    cs.CL

    Overlap-based Vocabulary Generation Improves Cross-lingual Transfer Among Related Languages

    Authors: Vaidehi Patil, Partha Talukdar, Sunita Sarawagi

    Abstract: Pre-trained multilingual language models such as mBERT and XLM-R have demonstrated great potential for zero-shot cross-lingual transfer to low web-resource languages (LRL). However, due to limited model capacity, the large difference in the sizes of available monolingual corpora between high web-resource languages (HRL) and LRLs does not provide enough scope of co-embedding the LRL with the HRL, t…

    Submitted 23 March, 2022; v1 submitted 3 March, 2022; originally announced March 2022.

    Comments: Accepted to appear at the ACL 2022 Main conference

  13. arXiv:2111.03394  [pdf, ps, other]

    cs.LG stat.ML

    Coherent Probabilistic Aggregate Queries on Long-horizon Forecasts

    Authors: Prathamesh Deshpande, Sunita Sarawagi

    Abstract: Long range forecasts are the starting point of many decision support systems that need to draw inference from high-level aggregate patterns on forecasted values. State of the art time-series forecasting methods are either subject to concept drift on long-horizon forecasts, or fail to accurately predict coherent and accurate high-level aggregates. In this work, we present a novel probabilistic fo…

    Submitted 25 May, 2022; v1 submitted 5 November, 2021; originally announced November 2021.

    Comments: 7 pages, 1 figure, 1 table, 1 algorithm

  14. arXiv:2110.02619  [pdf, other]

    cs.LG cs.CV

    Focus on the Common Good: Group Distributional Robustness Follows

    Authors: Vihari Piratla, Praneeth Netrapalli, Sunita Sarawagi

    Abstract: We consider the problem of training a classification model with group annotated training data. Recent work has established that, if there is distribution shift across different groups, models trained using the standard empirical risk minimization (ERM) objective suffer from poor performance on minority groups and that group distributionally robust optimization (Group-DRO) objective is a better alt…

    Submitted 20 April, 2022; v1 submitted 6 October, 2021; originally announced October 2021.

    Comments: Presented at ICLR 2022; Code can be found at: https://github.com/vihari/cgd

  15. arXiv:2108.06721  [pdf, other]

    cs.LG stat.ML

    Training for the Future: A Simple Gradient Interpolation Loss to Generalize Along Time

    Authors: Anshul Nasery, Soumyadeep Thakur, Vihari Piratla, Abir De, Sunita Sarawagi

    Abstract: In several real world applications, machine learning models are deployed to make predictions on data whose distribution changes gradually along time, leading to a drift between the train and test distributions. Such models are often re-trained on new data periodically, and they hence need to generalize to data not too far into the future. In this context, there is much prior work on enhancing temp…

    Submitted 19 November, 2021; v1 submitted 15 August, 2021; originally announced August 2021.

  16. arXiv:2108.06514  [pdf, other]

    cs.LG

    Active Assessment of Prediction Services as Accuracy Surface Over Attribute Combinations

    Authors: Vihari Piratla, Soumen Chakrabarty, Sunita Sarawagi

    Abstract: Our goal is to evaluate the accuracy of a black-box classification model, not as a single aggregate on a given test data distribution, but as a surface over a large number of combinations of attributes characterizing multiple test data distributions. Such attributed accuracy measures become important as machine learning models get deployed as a service, where the training data distribution is hidd…

    Submitted 26 October, 2021; v1 submitted 14 August, 2021; originally announced August 2021.

    Comments: NeurIPS 2021; Code and dataset at: https://github.com/vihari/AAA; 19 pages

  17. arXiv:2106.03958  [pdf, other]

    cs.CL

    Exploiting Language Relatedness for Low Web-Resource Language Model Adaptation: An Indic Languages Study

    Authors: Yash Khemchandani, Sarvesh Mehtani, Vaidehi Patil, Abhijeet Awasthi, Partha Talukdar, Sunita Sarawagi

    Abstract: Recent research in multilingual language models (LM) has demonstrated their ability to effectively handle multiple languages in a single model. This holds promise for low web-resource languages (LRL) as multilingual models can enable transfer of supervision from high resource languages to LRLs. However, incorporating a new language in an LM still remains a challenge, particularly for languages wit…

    Submitted 9 June, 2021; v1 submitted 7 June, 2021; originally announced June 2021.

    Comments: Accepted to ACL-IJCNLP 2021

  18. arXiv:2104.03986  [pdf, other]

    cs.DB cs.AI cs.LG stat.ML

    Deep Indexed Active Learning for Matching Heterogeneous Entity Representations

    Authors: Arjit Jain, Sunita Sarawagi, Prithviraj Sen

    Abstract: Given two large lists of records, the task in entity resolution (ER) is to find the pairs from the Cartesian product of the lists that correspond to the same real world entity. Typically, passive learning methods on such tasks require large amounts of labeled data to yield useful models. Active Learning is a promising approach for ER in low resource settings. However, the search space, to find inf…

    Submitted 17 January, 2022; v1 submitted 8 April, 2021; originally announced April 2021.

    Comments: VLDB 2022

  19. arXiv:2103.03142  [pdf, other]

    cs.SD cs.CL eess.AS

    Error-driven Fixed-Budget ASR Personalization for Accented Speakers

    Authors: Abhijeet Awasthi, Aman Kansal, Sunita Sarawagi, Preethi Jyothi

    Abstract: We consider the task of personalizing ASR models while being constrained by a fixed budget on recording speaker-specific utterances. Given a speaker and an ASR model, we propose a method of identifying sentences for which the speaker's utterances are likely to be harder for the given ASR model to recognize. We assume a tiny amount of speaker-specific data to learn phoneme-level error models which…

    Submitted 2 June, 2021; v1 submitted 4 March, 2021; originally announced March 2021.

    Comments: In ICASSP 2021

  20. arXiv:2103.01600  [pdf, other]

    cs.LG cs.AI

    Missing Value Imputation on Multidimensional Time Series

    Authors: Parikshit Bansal, Prathamesh Deshpande, Sunita Sarawagi

    Abstract: We present DeepMVI, a deep learning method for missing value imputation in multidimensional time-series datasets. Missing values are commonplace in decision support platforms that aggregate data over long time stretches from disparate sources, and reliable data analytics calls for careful handling of missing data. One strategy is imputing the missing values, and a wide variety of algorithms exist…

    Submitted 21 June, 2023; v1 submitted 2 March, 2021; originally announced March 2021.

    Comments: Accepted to VLDB 2021

  21. Long Horizon Forecasting With Temporal Point Processes

    Authors: Prathamesh Deshpande, Kamlesh Marathe, Abir De, Sunita Sarawagi

    Abstract: In recent years, marked temporal point processes (MTPPs) have emerged as a powerful modeling machinery to characterize asynchronous events in a wide variety of applications. MTPPs have demonstrated significant potential in predicting event-timings, especially for events arriving in near future. However, due to current design choices, MTPPs often show poor predictive performance at forecasting even…

    Submitted 7 March, 2021; v1 submitted 7 January, 2021; originally announced January 2021.

    Comments: 9 pages, 4 figures

  22. arXiv:2010.01526  [pdf, other]

    cs.LG cs.CL

    NLP Service APIs and Models for Efficient Registration of New Clients

    Authors: Sahil Shah, Vihari Piratla, Soumen Chakrabarti, Sunita Sarawagi

    Abstract: State-of-the-art NLP inference uses enormous neural architectures and models trained for GPU-months, well beyond the reach of most consumers of NLP. This has led to one-size-fits-all public API-based NLP service models by major AI companies, serving large numbers of clients. Neither (hardware deficient) clients nor (heavily subscribed) servers can afford traditional fine tuning. Many clients own l…

    Submitted 4 October, 2020; originally announced October 2020.

    Comments: Accepted to Findings of EMNLP, 2020

  23. arXiv:2007.06897  [pdf, other]

    cs.CL cs.LG

    What's in a Name? Are BERT Named Entity Representations just as Good for any other Name?

    Authors: Sriram Balasubramanian, Naman Jain, Gaurav Jindal, Abhijeet Awasthi, Sunita Sarawagi

    Abstract: We evaluate named entity representations of BERT-based NLP models by investigating their robustness to replacements from the same typed class in the input. We highlight that on several tasks while such perturbations are natural, state of the art trained models are surprisingly brittle. The brittleness continues even with the recent entity-aware BERT models. We also try to discern the cause of this…

    Submitted 14 July, 2020; originally announced July 2020.

    Comments: Accepted at RepL4NLP, ACL2020

  24. arXiv:2006.13519  [pdf, other]

    eess.AS cs.CL cs.SD

    Black-box Adaptation of ASR for Accented Speech

    Authors: Kartik Khandelwal, Preethi Jyothi, Abhijeet Awasthi, Sunita Sarawagi

    Abstract: We introduce the problem of adapting a black-box, cloud-based ASR system to speech from a target accent. While leading online ASR services obtain impressive performance on mainstream accents, they perform poorly on sub-populations: we observed that the word error rate (WER) achieved by Google's ASR API on Indian accents is almost twice the WER on US accents. Existing adaptation methods either re…

    Submitted 24 June, 2020; originally announced June 2020.

    Comments: A slightly different version submitted to INTERSPEECH 2020 (currently under review)

  25. arXiv:2004.06025  [pdf, other]

    cs.LG cs.CL stat.ML

    Learning from Rules Generalizing Labeled Exemplars

    Authors: Abhijeet Awasthi, Sabyasachi Ghosh, Rasna Goyal, Sunita Sarawagi

    Abstract: In many applications labeled data is not readily available, and needs to be collected via painstaking human supervision. We propose a rule-exemplar method for collecting human supervision to combine the efficiency of rules with the quality of instance labels. The supervision is coupled such that it is both natural for humans and synergistic for learning. We propose a training algorithm that joint…

    Submitted 15 May, 2020; v1 submitted 13 April, 2020; originally announced April 2020.

    Comments: ICLR 2020 (Spotlight)

  26. arXiv:2003.12815  [pdf, other]

    cs.LG stat.ML

    Efficient Domain Generalization via Common-Specific Low-Rank Decomposition

    Authors: Vihari Piratla, Praneeth Netrapalli, Sunita Sarawagi

    Abstract: Domain generalization refers to the task of training a model which generalizes to new domains that are not seen during training. We present CSD (Common Specific Decomposition), for this setting, which jointly learns a common component (which generalizes to new domains) and a domain specific component (which overfits on training domains). The domain specific components are discarded after training a…

    Submitted 7 April, 2020; v1 submitted 28 March, 2020; originally announced March 2020.

  27. arXiv:1911.09860  [pdf, other]

    cs.LG cs.CL stat.ML

    Data Programming using Continuous and Quality-Guided Labeling Functions

    Authors: Oishik Chatterjee, Ganesh Ramakrishnan, Sunita Sarawagi

    Abstract: Scarcity of labeled data is a bottleneck for supervised learning models. A paradigm that has evolved for dealing with this problem is data programming. An existing data programming paradigm allows human supervision to be provided as a set of discrete labeling functions (LF) that output possibly noisy labels to input instances and a generative model for consolidating the weak labels. We enhance and…

    Submitted 22 November, 2019; originally announced November 2019.

    Comments: Accepted paper at the 34th AAAI Conference on Artificial Intelligence (AAAI-20), New York, USA

  28. arXiv:1910.02893  [pdf, other]

    cs.CL cs.LG

    Parallel Iterative Edit Models for Local Sequence Transduction

    Authors: Abhijeet Awasthi, Sunita Sarawagi, Rasna Goyal, Sabyasachi Ghosh, Vihari Piratla

    Abstract: We present a Parallel Iterative Edit (PIE) model for the problem of local sequence transduction arising in tasks like Grammatical error correction (GEC). Recent approaches are based on the popular encoder-decoder (ED) model for sequence to sequence learning. The ED model auto-regressively captures full dependency among output tokens but is slow due to sequential decoding. The PIE model does parall…

    Submitted 15 May, 2020; v1 submitted 7 October, 2019; originally announced October 2019.

    Comments: Accepted at EMNLP-IJCNLP 2019

  29. arXiv:1906.09926  [pdf, ps, other]

    cs.LG cs.AI stat.ML

    Streaming Adaptation of Deep Forecasting Models using Adaptive Recurrent Units

    Authors: Prathamesh Deshpande, Sunita Sarawagi

    Abstract: We present ARU, an Adaptive Recurrent Unit for streaming adaptation of deep globally trained time-series forecasting models. The ARU combines the advantages of learning complex data transformations across multiple time series from deep global models, with per-series localization offered by closed-form linear models. Unlike existing methods of adaptation that are either memory-intensive or non-resp…

    Submitted 4 July, 2019; v1 submitted 24 June, 2019; originally announced June 2019.

    Comments: 9 pages, 4 figures

  30. arXiv:1906.02688  [pdf, other]

    cs.CL cs.LG stat.ML

    Topic Sensitive Attention on Generic Corpora Corrects Sense Bias in Pretrained Embeddings

    Authors: Vihari Piratla, Sunita Sarawagi, Soumen Chakrabarti

    Abstract: Given a small corpus $\mathcal D_T$ pertaining to a limited set of focused topics, our goal is to train embeddings that accurately capture the sense of words in the topic in spite of the limited size of $\mathcal D_T$. These embeddings may be used in various tasks involving $\mathcal D_T$. A popular strategy in limited data settings is to adapt pre-trained embeddings $\mathcal E$ trained on a larg…

    Submitted 24 July, 2019; v1 submitted 5 June, 2019; originally announced June 2019.

    Comments: Accepted at ACL 2019

  31. arXiv:1903.00802  [pdf, other]

    cs.LG cs.CL stat.ML

    Calibration of Encoder Decoder Models for Neural Machine Translation

    Authors: Aviral Kumar, Sunita Sarawagi

    Abstract: We study the calibration of several state of the art neural machine translation (NMT) systems built on attention-based encoder-decoder models. For structured outputs like in NMT, calibration is important not just for reliable confidence with predictions, but also for proper functioning of beam-search inference. We show that most modern NMT models are surprisingly miscalibrated even when conditioned…

    Submitted 2 March, 2019; originally announced March 2019.

    Comments: 12 Pages

  32. arXiv:1804.10745  [pdf, other]

    cs.LG stat.ML

    Generalizing Across Domains via Cross-Gradient Training

    Authors: Shiv Shankar, Vihari Piratla, Soumen Chakrabarti, Siddhartha Chaudhuri, Preethi Jyothi, Sunita Sarawagi

    Abstract: We present CROSSGRAD, a method to use multi-domain training data to learn a classifier that generalizes to new domains. CROSSGRAD does not need an adaptation phase via labeled or unlabeled data, or domain features in the new domain. Most existing domain adaptation methods attempt to erase domain signals using techniques like domain adversarial training. In contrast, CROSSGRAD is free to use domain…

    Submitted 1 May, 2018; v1 submitted 28 April, 2018; originally announced April 2018.

    Comments: The first two authors contributed equally; Accepted at ICLR 2018

  33. arXiv:1803.03800  [pdf, other]

    cs.LG cs.AI stat.ML

    ARMDN: Associative and Recurrent Mixture Density Networks for eRetail Demand Forecasting

    Authors: Srayanta Mukherjee, Devashish Shankar, Atin Ghosh, Nilam Tathawadekar, Pramod Kompalli, Sunita Sarawagi, Krishnendu Chaudhury

    Abstract: Accurate demand forecasts can help on-line retail organizations better plan their supply-chain processes. The challenge, however, is the large number of associative factors that result in large, non-stationary shifts in demand, which traditional time series and regression approaches fail to model. In this paper, we propose a Neural Network architecture called AR-MDN, that simultaneously models ass…

    Submitted 16 March, 2018; v1 submitted 10 March, 2018; originally announced March 2018.

  34. arXiv:1707.01461  [pdf, other]

    cs.LG stat.ML

    Labeled Memory Networks for Online Model Adaptation

    Authors: Shiv Shankar, Sunita Sarawagi

    Abstract: Augmenting a neural network with memory that can grow without growing the number of trained parameters is a recent powerful concept with many exciting applications. We propose a design of memory augmented neural networks (MANNs) called Labeled Memory Networks (LMNs) suited for tasks requiring online adaptation in classification models. LMNs organize the memory with classes as the primary key. The m…

    Submitted 2 December, 2017; v1 submitted 5 July, 2017; originally announced July 2017.

    Comments: Accepted at AAAI 2018, 8 pages

  35. arXiv:1606.03402  [pdf, other]

    cs.AI cs.CL

    Length bias in Encoder Decoder Models and a Case for Global Conditioning

    Authors: Pavel Sountsov, Sunita Sarawagi

    Abstract: Encoder-decoder networks are popular for modeling sequences probabilistically in many applications. These models use the power of the Long Short-Term Memory (LSTM) architecture to capture the full dependence among variables, unlike earlier models like CRFs that typically assumed conditional independence among non-adjacent variables. However, in practice encoder-decoder models exhibit a bias towards…

    Submitted 21 September, 2016; v1 submitted 10 June, 2016; originally announced June 2016.

  36. arXiv:1605.04359  [pdf, other]

    cs.CL

    Occurrence Statistics of Entities, Relations and Types on the Web

    Authors: Aman Madaan, Sunita Sarawagi

    Abstract: The problem of collecting reliable estimates of occurrence of entities on the open web forms the premise for this report. The models learned for tagging entities cannot be expected to perform well when deployed on the web. This is owing to the severe mismatch in the distributions of such entities on the web and in the relatively diminutive training data. In this report, we build up the case for ma…

    Submitted 13 May, 2016; originally announced May 2016.

  37. Answering Table Queries on the Web using Column Keywords

    Authors: Rakesh Pimplikar, Sunita Sarawagi

    Abstract: We present the design of a structured search engine which returns a multi-column table in response to a query consisting of keywords describing each of its columns. We answer such queries by exploiting the millions of tables on the Web because these are much richer sources of structured knowledge than free-format text. However, a corpus of tables harvested from arbitrary HTML web pages presents hu…

    Submitted 30 June, 2012; originally announced July 2012.

    Comments: VLDB2012

    Journal ref: Proceedings of the VLDB Endowment (PVLDB), Vol. 5, No. 10, pp. 908-919 (2012)

  38. Joint Structured Models for Extraction from Overlapping Sources

    Authors: Rahul Gupta, Sunita Sarawagi

    Abstract: We consider the problem of jointly training structured models for extraction from sources whose instances enjoy partial overlap. This has important applications like user-driven ad-hoc information extraction on the web. Such applications present new challenges in terms of the number of sources and their arbitrary pattern of overlap not seen by earlier collective training schemes applied on two so…

    Submitted 1 May, 2010; originally announced May 2010.

  39. arXiv:0907.0589  [pdf, ps, other]

    cs.AI

    Generalized Collective Inference with Symmetric Clique Potentials

    Authors: Rahul Gupta, Sunita Sarawagi, Ajit A. Diwan

    Abstract: Collective graphical models exploit inter-instance associative dependence to output more accurate labelings. However, existing models support only very limited kinds of associativity, which restricts accuracy gains. This paper makes two major contributions. First, we propose a general collective inference framework that biases data instances to agree on a set of properties of their labelings. Agre…

    Submitted 7 July, 2009; v1 submitted 3 July, 2009; originally announced July 2009.

    Comments: 30 pages