CIMRL: Combining IMitation and Reinforcement Learning for Safe Autonomous Driving

Authors: Jonathan Booher, Khashayar Rohanimanesh, Junhong Xu, Vladislav Isenbaev, Ashwin Balakrishna, Ishan Gupta, Wei Liu, Aleksandr Petiushko

Abstract: Modern approaches to autonomous driving rely heavily on learned components trained with large amounts of human driving data via imitation learning. However, these methods require large amounts of expensive data collection and even then face challenges with safely handling long-tail scenarios and compounding errors over time. At the same time, pure Reinforcement Learning (RL) methods can fail to learn performant policies in sparse, constrained, and challenging-to-define reward settings like driving. Both of these challenges make deploying purely cloned policies in safety critical applications like autonomous vehicles challenging. In this paper we propose Combining IMitation and Reinforcement Learning (CIMRL) approach - a framework that enables training driving policies in simulation through leveraging imitative motion priors and safety constraints. CIMRL does not require extensive reward specification and improves on the closed loop behavior of pure cloning methods. By combining RL and imitation, we demonstrate that our method achieves state-of-the-art results in closed loop simulation driving benchmarks.

Submitted 26 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

arXiv:2405.04373 [pdf, other]

Leveraging LSTM and GAN for Modern Malware Detection

Authors: Ishita Gupta, Sneha Kumari, Priya Jha, Mohona Ghosh

Abstract: The malware booming is a cyberspace equal to the effect of climate change to ecosystems in terms of danger. In the case of significant investments in cybersecurity technologies and staff training, the global community has become locked up in the eternal war with cyber security threats. The multi-form and changing faces of malware are continuously pushing the boundaries of the cybersecurity practitioners employ various approaches like detection and mitigate in coping with this issue. Some old mannerisms like signature-based detection and behavioral analysis are slow to adapt to the speedy evolution of malware types. Consequently, this paper proposes the utilization of the Deep Learning Model, LSTM networks, and GANs to amplify malware detection accuracy and speed. A fast-growing, state-of-the-art technology that leverages raw bytestream-based data and deep learning architectures, the AI technology provides better accuracy and performance than the traditional methods. Integration of LSTM and GAN model is the technique that is used for the synthetic generation of data, leading to the expansion of the training datasets, and as a result, the detection accuracy is improved. The paper uses the VirusShare dataset which has more than one million unique samples of the malware as the training and evaluation set for the presented models. Through thorough data preparation including tokenization, augmentation, as well as model training, the LSTM and GAN models convey the better performance in the tasks compared to straight classifiers.
The research outcomes come out with 98% accuracy that shows the efficiency of deep learning plays a decisive role in proactive cybersecurity defense. Aside from that, the paper studies the output of ensemble learning and model fusion methods as a way to reduce biases and lift model complexity.

Submitted 7 May, 2024; originally announced May 2024.

Comments: 11 pages

Report number: Paper ID: IST-BDE-MNNR-170524-5719

arXiv:2312.02254 [pdf]

Innovations in Agricultural Forecasting: A Multivariate Regression Study on Global Crop Yield Prediction

Authors: Ishaan Gupta, Samyutha Ayalasomayajula, Yashas Shashidhara, Anish Kataria, Shreyas Shashidhara, Krishita Kataria, Aditya Undurti

Abstract: The prediction of crop yields internationally is a crucial objective in agricultural research. Thus, this study implements 6 regression models (Linear, Tree, Gradient Descent, Gradient Boosting, K Nearest Neighbors, and Random Forest) to predict crop yields in 37 developing countries over 27 years. Given 4 key training parameters, insecticides (tonnes), rainfall (mm), temperature (Celsius), and yield (hg/ha), it was found that our Random Forest Regression model achieved a determination coefficient (r2) of 0.94, with a margin of error (ME) of .03. The models were trained and tested using the Food and Agricultural Organization of the United Nations data, along with the World Bank Climate Change Data Catalog. Furthermore, each parameter was analyzed to understand how varying factors could impact overall yield. We used unconventional models, contrary to generally used Deep Learning (DL) and Machine Learning (ML) models, combined with recently collected data to implement a unique approach in our research. Existing scholarship would benefit from understanding the most optimal model for agricultural research, specifically using the United Nations data.

Submitted 14 March, 2024; v1 submitted 4 December, 2023; originally announced December 2023.

Comments: 12 pages, 8 figures, 1 table, Guided by Dr. Aditya Undurti

MSC Class: 68W03

arXiv:2311.09862 [pdf, other]

Which Modality should I use -- Text, Motif, or Image? : Understanding Graphs with Large Language Models

Authors: Debarati Das, Ishaan Gupta, Jaideep Srivastava, Dongyeop Kang

Abstract: Our research integrates graph data with Large Language Models (LLMs), which, despite their advancements in various fields using large text corpora, face limitations in encoding entire graphs due to context size constraints. This paper introduces a new approach to encoding a graph with diverse modalities, such as text, image, and motif, coupled with prompts to approximate a graph's global connectivity, thereby enhancing LLMs' efficiency in processing complex graph structures. The study also presents GraphTMI, a novel benchmark for evaluating LLMs in graph structure analysis, focusing on homophily, motif presence, and graph difficulty. Key findings indicate that the image modality, especially with vision-language models like GPT-4V, is superior to text in balancing token limits and preserving essential information and outperforms prior graph neural net (GNN) encoders. Furthermore, the research assesses how various factors affect the performance of each encoding modality and outlines the existing challenges and potential future developments for LLMs in graph understanding and reasoning tasks. All data will be publicly available upon acceptance.

Submitted 13 March, 2024; v1 submitted 16 November, 2023; originally announced November 2023.

arXiv:2308.09578 [pdf, other]

doi 10.1109/TSMC.2023.3288081

An AI-Driven VM Threat Prediction Model for Multi-Risks Analysis-Based Cloud Cybersecurity

Authors: Deepika Saxena, Ishu Gupta, Rishabh Gupta, Ashutosh Kumar Singh, Xiaoqing Wen

Abstract: Cloud virtualization technology, ingrained with physical resource sharing, prompts cybersecurity threats on users' virtual machines (VM)s due to the presence of inevitable vulnerabilities on the offsite servers. Contrary to the existing works which concentrated on reducing resource sharing and encryption and decryption of data before transfer for improving cybersecurity which raises computational cost overhead, the proposed model operates diversely for efficiently serving the same purpose. This paper proposes a novel Multiple Risks Analysis based VM Threat Prediction Model (MR-TPM) to secure computational data and minimize adversary breaches by proactively estimating the VMs threats. It considers multiple cybersecurity risk factors associated with the configuration and management of VMs, along with analysis of users' behaviour. All these threat factors are quantified for the generation of respective risk score values and fed as input into a machine learning based classifier to estimate the probability of threat for each VM. The performance of MR-TPM is evaluated using benchmark Google Cluster and OpenNebula VM threat traces. The experimental results demonstrate that the proposed model efficiently computes the cybersecurity risks and learns the VM threat patterns from historical and live data samples. The deployment of MR-TPM with existing VM allocation policies reduces cybersecurity threats up to 88.9%.

Submitted 18 August, 2023; originally announced August 2023.

Journal ref: IEEE Transactions on Systems, Man, and Cybernetics: Systems Journal, 2023

arXiv:2308.03615 [pdf, other]

Dirigo: Self-scaling Stateful Actors For Serverless Real-time Data Processing

Authors: Le Xu, Divyanshu Saxena, Neeraja J. Yadwadkar, Aditya Akella, Indranil Gupta

Abstract: We propose Dirigo, a distributed stream processing service built atop virtual actors. Dirigo achieves both a high level of resource efficiency and performance isolation driven by user intent (SLO). To improve resource efficiency, Dirigo adopts a serverless architecture that enables time-sharing of compute resources among streaming operators, both within and across applications. Meanwhile, Dirigo improves performance isolation by inheriting the property of function autoscaling from serverless architecture. Specifically, Dirigo proposes (i) dual-mode actor, an actor abstraction that dynamically provides orderliness guarantee for streaming operator during autoscaling and (ii) a data plane scheduling mechanism, along with its API, that allows scheduling and scaling at the message-level granularity.

Submitted 7 August, 2023; originally announced August 2023.

arXiv:2303.00207 [pdf, other]

CoMesh: Fully-Decentralized Control for Sense-Trigger-Actuate Routines in Edge Meshes

Authors: Anna Karanika, Rui Yang, Xiaojuan Ma, Jiangran Wang, Shalni Sundram, Indranil Gupta

Abstract: While mesh networking for edge settings (e.g., smart buildings, farms, battlefields, etc.) has received much attention, the layer of control over such meshes remains largely centralized and cloud-based. This paper focuses on applications with sense-trigger-actuate (STA) workloads -- these are similar to the abstraction of routines popular in smart homes, but applied to larger-scale edge IoT deployments. We present CoMesh, which tackles the challenge of building local, non-cloud, and decentralized solutions for control of sense-trigger-actuate applications. At its core CoMesh uses an abstraction called k-groups to spread in a fine-grained way, the load of STA actions. Coordination within the k-group uses selective fast and cheap mechanisms rather than expensive off-the-shelf solutions. k-group selection is proactively dynamic, and occurs by using a combination of zero-message-exchange mechanisms (to reduce load) and locality sensitive hashing (to be aware of physical layout of devices). We analyze and theoretically prove the safety of CoMesh's mechanisms. Our evaluations using both simulation and Raspberry Pi lab deployments show that CoMesh is load-balanced, fast, and fault-tolerant.

Submitted 28 February, 2023; originally announced March 2023.

Comments: 12 pages, 12 figures

arXiv:2302.06227 [pdf, other]

Fast and small footprint Hybrid HMM-HiFiGAN based system for speech synthesis in Indian languages

Authors: Sudhanshu Srivastava, Ishika Gupta, Anusha Prakash, Jom Kuriakose, Hema A. Murthy

Abstract: Hidden-Markov-model (HMM) based text-to-speech (HTS) offers flexibility in speaking styles along with fast training and synthesis while being computationally less intense. HTS performs well even in low-resource scenarios. The primary drawback is that the voice quality is poor compared to that of E2E systems. A hybrid approach combining HMM-based feature generation and neural-network-based HiFi-GAN vocoder to improve HTS synthesis quality is proposed. HTS is trained on high-resolution mel-spectrograms instead of conventional mel generalized coefficients (MGC), and the output mel-spectrogram corresponding to the input text is used in a HiFi-GAN vocoder trained on Indic languages, to produce naturalness that is equivalent to that of E2E systems, as evidenced from the DMOS and PC tests.

Submitted 13 February, 2023; originally announced February 2023.

Comments: 5 pages, 5 figures

arXiv:2302.05476 [pdf, other]

Transactional Panorama: A Conceptual Framework for User Perception in Analytical Visual Interfaces

Authors: Dixin Tang, Alan Fekete, Indranil Gupta, Aditya G. Parameswaran

Abstract: Many tools empower analysts and data scientists to consume analysis results in a visual interface, such as a dashboard. When the underlying data changes, these results need to be updated, but this update can take a long time -- all while the user continues to explore the results. In this context, tools can either (i) hide away results that haven't been updated, hindering exploration; (ii) make the updated results immediately available to the user (on the same screen as old results), leading to confusion and incorrect insights; or (iii) present old -- and therefore stale -- results to the user during the update. To help users reason about these options and others, and make appropriate trade-offs, we introduce Transactional Panorama, a formal framework that adopts transactions to jointly model the system refreshing the analysis results and the user interacting with them. We introduce three key properties that are important for user perception in this context, visibility (allowing users to continuously explore results), consistency (ensuring that results presented are from the same version of the data), and monotonicity (making sure that results don't "go back in time"). Within transactional panorama, we characterize all of the feasible property combinations, design new mechanisms (that we call lenses) for presenting analysis results to the user while preserving a given property combination, formally prove their relative orderings for various performance criteria and discuss their use cases.
We propose novel algorithms to preserve each property combination and efficiently present fresh analysis results. We implement our transactional panorama framework in a popular, open-source BI tool, illustrate the relative performance implications of different lenses, demonstrate the benefits of the novel lenses, and outline the performance improvement by our optimizations.

Submitted 10 February, 2023; originally announced February 2023.

arXiv:2301.08695 [pdf, other]

Baechi: Fast Device Placement of Machine Learning Graphs

Authors: Beomyeol Jeon, Linda Cai, Chirag Shetty, Pallavi Srivastava, Jintao Jiang, Xiaolan Ke, Yitao Meng, Cong Xie, Indranil Gupta

Abstract: Machine Learning graphs (or models) can be challenging or impossible to train when either devices have limited memory, or models are large. To split the model across devices, learning-based approaches are still popular. While these result in model placements that train fast on data (i.e., low step times), learning-based model-parallelism is time-consuming, taking many hours or days to create a placement plan of operators on devices. We present the Baechi system, the first to adopt an algorithmic approach to the placement problem for running machine learning training graphs on small clusters of memory-constrained devices. We integrate our implementation of Baechi into two popular open-source learning frameworks: TensorFlow and PyTorch. Our experimental results using GPUs show that: (i) Baechi generates placement plans 654 X - 206K X faster than state-of-the-art learning-based approaches, and (ii) Baechi-placed model's step (training) time is comparable to expert placements in PyTorch, and only up to 6.2% worse than expert placements in TensorFlow. We prove mathematically that our two algorithms are within a constant factor of the optimal. Our work shows that compared to learning-based approaches, algorithmic approaches can face different challenges for adaptation to Machine learning systems, but also they offer proven bounds, and significant performance benefits.

Submitted 20 January, 2023; originally announced January 2023.

Comments: Extended version of SoCC 2020 paper: https://dl.acm.org/doi/10.1145/3419111.3421302

arXiv:2212.03547 [pdf, other]

doi 10.1109/TNSM.2022.3170379

A Fault Tolerant Elastic Resource Management Framework Towards High Availability of Cloud Services

Authors: Deepika Saxena, Ishu Gupta, Ashutosh Kumar Singh, Chung-Nan Lee

Abstract: Cloud computing has become inevitable for every digital service which has exponentially increased its usage. However, a tremendous surge in cloud resource demand stave off service availability resulting into outages, performance degradation, load imbalance, and excessive power-consumption. The existing approaches mainly attempt to address the problem by using multi-cloud and running multiple replicas of a virtual machine (VM) which accounts for high operational-cost. This paper proposes a Fault Tolerant Elastic Resource Management (FT-ERM) framework that addresses aforementioned problem from a different perspective by inducing high-availability in servers and VMs. Specifically, (1) an online failure predictor is developed to anticipate failure-prone VMs based on predicted resource contention; (2) the operational status of server is monitored with the help of power analyser, resource estimator and thermal analyser to identify any failure due to overloading and overheating of servers proactively; and (3) failure-prone VMs are assigned to proposed fault-tolerance unit composed of decision matrix and safe box to trigger VM migration and handle any outage beforehand while maintaining desired level of availability for cloud users. The proposed framework is evaluated and compared against state-of-the-arts by executing experiments using two real-world datasets. FT-ERM improved the availability of the services up to 34.47% and scales down VM-migration and power-consumption up to 88.6% and 62.4%, respectively over without FT-ERM approach.

Submitted 7 December, 2022; originally announced December 2022.

Comments: IEEE Transactions of Network and Service Management, 2022

arXiv:2211.01338 [pdf, other]

Technology Pipeline for Large Scale Cross-Lingual Dubbing of Lecture Videos into Multiple Indian Languages

Authors: Anusha Prakash, Arun Kumar, Ashish Seth, Bhagyashree Mukherjee, Ishika Gupta, Jom Kuriakose, Jordan Fernandes, K V Vikram, Mano Ranjith Kumar M, Metilda Sagaya Mary, Mohammad Wajahat, Mohana N, Mudit Batra, Navina K, Nihal John George, Nithya Ravi, Pruthwik Mishra, Sudhanshu Srivastava, Vasista Sai Lodagala, Vandan Mujadia, Kada Sai Venkata Vineeth, Vrunda Sukhadia, Dipti Sharma, Hema Murthy, Pushpak Bhattacharya, et al. (2 additional authors not shown)

Abstract: Cross-lingual dubbing of lecture videos requires the transcription of the original audio, correction and removal of disfluencies, domain term discovery, text-to-text translation into the target language, chunking of text using target language rhythm, text-to-speech synthesis followed by isochronous lipsyncing to the original video. This task becomes challenging when the source and target languages belong to different language families, resulting in differences in generated audio duration. This is further compounded by the original speaker's rhythm, especially for extempore speech. This paper describes the challenges in regenerating English lecture videos in Indian languages semi-automatically. A prototype is developed for dubbing lectures into 9 Indian languages. A mean-opinion-score (MOS) is obtained for two languages, Hindi and Tamil, on two different courses. The output video is compared with the original video in terms of MOS (1-5) and lip synchronisation with scores of 4.09 and 3.74, respectively. The human effort also reduces by 75%.

Submitted 1 November, 2022; originally announced November 2022.

arXiv:2203.11287 [pdf, other]

PCA-RF: An Efficient Parkinson's Disease Prediction Model based on Random Forest Classification

Authors: Ishu Gupta, Vartika Sharma, Sizman Kaur, Ashutosh Kumar Singh

Abstract: In this modern era of overpopulation disease prediction is a crucial step in diagnosing various diseases at an early stage. With the advancement of various machine learning algorithms, the prediction has become quite easy. However, the complex and the selection of an optimal machine learning technique for the given dataset greatly affects the accuracy of the model. A large amount of datasets exists globally but there is no effective use of it due to its unstructured format. Hence, a lot of different techniques are available to extract something useful for the real world to implement. Therefore, accuracy becomes a major metric in evaluating the model. In this paper, a disease prediction approach is proposed that implements a random forest classifier on Parkinson's disease. We compared the accuracy of this model with the Principal Component Analysis (PCA) applied Artificial Neural Network (ANN) model and captured a visible difference. The model secured a significant accuracy of up to 90%.

Submitted 21 March, 2022; originally announced March 2022.

Comments: 10 pages, 3 figures

arXiv:2203.08143 [pdf, other]

HiSA-SMFM: Historical and Sentiment Analysis based Stock Market Forecasting Model

Authors: Ishu Gupta, Tarun Kumar Madan, Sukhman Singh, Ashutosh Kumar Singh

Abstract: One of the pillars to build a country's economy is the stock market. Over the years, people are investing in stock markets to earn as much profit as possible from the amount of money that they possess. Hence, it is vital to have a prediction model which can accurately predict future stock prices. With the help of machine learning, it is not an impossible task as the various machine learning techniques if modeled properly may be able to provide the best prediction values. This would enable the investors to decide whether to buy, sell or hold the share. The aim of this paper is to predict the future of the financial stocks of a company with improved accuracy. In this paper, we have proposed the use of historical as well as sentiment data to efficiently predict stock prices by applying LSTM. It has been found by analyzing the existing research in the area of sentiment analysis that there is a strong correlation between the movement of stock prices and the publication of news articles. Therefore, in this paper, we have integrated these factors to predict the stock prices more accurately.

Submitted 10 March, 2022; originally announced March 2022.

arXiv:2203.05835 [pdf, other]

MLRM: A Multiple Linear Regression based Model for Average Temperature Prediction of A Day

Authors: Ishu Gupta, Harsh Mittal, Deepak Rikhari, Ashutosh Kumar Singh

Abstract: Weather is a phenomenon that affects everything and everyone around us on a daily basis. Weather prediction has been an important point of study for decades as researchers have tried to predict the weather and climatic changes using traditional meteorological techniques. With the advent of modern technologies and computing power, we can do so with the help of machine learning techniques. We aim to predict the weather of an area using past meteorological data and features using the Multiple Linear Regression Model. The performance of the model is evaluated and a conclusion is drawn. The model is successfully able to predict the average temperature of a day with an error of 2.8 degrees Celsius.

Submitted 11 March, 2022; originally announced March 2022.

arXiv:2203.05367 [pdf, other]

TIDF-DLPM: Term and Inverse Document Frequency based Data Leakage Prevention Model

Authors: Ishu Gupta, Sloni Mittal, Ankit Tiwari, Priya Agarwal, Ashutosh Kumar Singh

Abstract: Confidentiality of the data is being endangered as it has been categorized into false categories which might get leaked to an unauthorized party. For this reason, various organizations are mainly implementing data leakage prevention systems (DLPs). Firewalls and intrusion detection systems are being outdated versions of security mechanisms. The data which are being used, in sending state or are rest are being monitored by DLPs. The confidential data is prevented with the help of neighboring contexts and contents of DLPs. In this paper, a semantic-based approach is used to classify data based on the statistical data leakage prevention model. To detect involved private data, statistical analysis is being used to contribute secure mechanisms in the environment of data leakage. The favored Term Frequency-Inverse Document Frequency (TF-IDF) is the facts and details recapture function to arrange documents under particular topics. The results showcase that a similar statistical DLP approach could appropriately classify documents in case of extent alteration as well as interchanged documents.

Submitted 10 March, 2022; originally announced March 2022.

arXiv:2202.12530 [pdf, other]

Banyan: A Scoped Dataflow Engine for Graph Query Service

Authors: Li Su, Xiaoming Qin, Zichao Zhang, Rui Yang, Le Xu, Indranil Gupta, Wenyuan Yu, Kai Zeng, Jingren Zhou

Abstract: Graph query services (GQS) are widely used today to interactively answer graph traversal queries on large-scale graph data. Existing graph query engines focus largely on optimizing the latency of a single query. This ignores significant challenges posed by GQS, including fine-grained control and scheduling during query execution, as well as performance isolation and load balancing in various levels from across user to intra-query. To tackle these control and scheduling challenges, we propose a novel scoped dataflow for modeling graph traversal queries, which explicitly exposes concurrent execution and control of any subquery to the finest granularity. We implemented Banyan, an engine based on the scoped dataflow model for GQS. Banyan focuses on scaling up the performance on a single machine, and provides the ability to easily scale out. Extensive experiments on multiple benchmarks show that Banyan improves performance by up to three orders of magnitude over state-of-the-art graph query engines, while providing performance isolation and load balancing.

Submitted 25 February, 2022; originally announced February 2022.

arXiv:2202.11965 [pdf, other]

A Holistic View on Data Protection for Sharing, Communicating, and Computing Environments: Taxonomy and Future Directions

Authors: Ishu Gupta, Ashutosh Kumar Singh

Abstract: The data is an important asset of an organization and it is essential to keep this asset secure. It requires security in whatever state is it i.e. data at rest, data in use, and data in transit. There is a need to pay more attention to it when the third party is included i.e. when the data is stored in the cloud then it requires more security. Since confidential data can reside on a variety of computing devices (physical servers, virtual servers, databases, file servers, PCs, point-of-sale devices, flash drives, and mobile devices) and move through a variety of network access points (wireline, wireless, VPNs, etc.), there is a need of solutions or mechanism that can tackle the problem of data loss, data recovery and data leaks. In this context, the paper presents a holistic view of data protection for sharing and communicating environments for any type of organization. A taxonomy of data leakage protection systems and major challenges faced while protecting confidential data are discussed. Data protection solutions, Data Leakage Protection System's analysis techniques, and, a thorough analysis of existing state-of-the-art contributions empowering machine learning-based approaches are entailed. Finally, the paper explores and concludes various critical emerging challenges and future research directions concerning data protection.

Submitted 24 February, 2022; originally announced February 2022.

arXiv:2109.02485 [pdf]

Severity and Mortality Prediction Models to Triage Indian COVID-19 Patients

Authors: Samarth Bhatia, Yukti Makhija, Sneha Jayaswal, Shalendra Singh, Ishaan Gupta

Abstract: As the second wave in India mitigates, COVID-19 has now infected about 29 million patients countrywide, leading to more than 350 thousand people dead. As the infections surged, the strain on the medical infrastructure in the country became apparent. While the country vaccinates its population, opening up the economy may lead to an increase in infection rates. In this scenario, it is essential to effectively utilize the limited hospital resources by an informed patient triaging system based on clinical parameters. Here, we present two interpretable machine learning models predicting the clinical outcomes, severity, and mortality, of the patients based on routine non-invasive surveillance of blood parameters from one of the largest cohorts of Indian patients at the day of admission. Patient severity and mortality prediction models achieved 86.3% and 88.06% accuracy, respectively, with an AUC-ROC of 0.91 and 0.92. We have integrated both the models in a user-friendly web app calculator, https://triage-COVID-19.herokuapp.com/, to showcase the potential deployment of such efforts at scale.

Submitted 23 October, 2021; v1 submitted 2 September, 2021; originally announced September 2021.

Comments: 31 pages, 6 figures, 8 tables. The first two authors (SB and YM) have equal contribution. IG is the corresponding author (ishaan@iitd.ac.in). Changes: Author List updated

arXiv:2107.13502 [pdf, other]

doi 10.1109/JSYST.2021.3092521

A Secure and Multi-objective Virtual Machine Placement Framework for Cloud Data Centre

Authors: Deepika Saxena, Ishu Gupta, Jitendra Kumar, Ashutosh Kumar Singh, Xiaoqing Wen

Abstract: To facilitate cost-effective and elastic computing benefits to the cloud users, the energy-efficient and secure allocation of virtual machines (VMs) plays a significant role at the data centre. The inefficient VM Placement (VMP) and sharing of common physical machines among multiple users leads to resource wastage, excessive power consumption, increased inter-communication cost and security breaches. To address the aforementioned challenges, a novel secure and multi-objective virtual machine placement (SM-VMP) framework is proposed with an efficient VM migration. The proposed framework ensures an energy-efficient distribution of physical resources among VMs that emphasizes secure and timely execution of user application by reducing inter-communication delay. The VMP is carried out by applying the proposed Whale Optimization Genetic Algorithm (WOGA), inspired by whale evolutionary optimization and non-dominated sorting based genetic algorithms. The performance evaluation for static and dynamic VMP and comparison with recent state-of-the-arts observed a notable reduction in shared servers, inter-communication cost, power consumption and execution time up to 28.81%, 25.7%, 35.9% and 82.21%, respectively and increased resource utilization up to 30.21%.

Submitted 28 July, 2021; originally announced July 2021.

Comments: This article has been accepted for inclusion in a future issue of IEEE Systems Journal (2021)

arXiv:2101.07215 [pdf]

Challenges in the application of a mortality prediction model for COVID-19 patients on an Indian cohort

Authors: Yukti Makhija, Samarth Bhatia, Shalendra Singh, Sneha Kumar Jayaswal, Prabhat Singh Malik, Pallavi Gupta, Shreyas N. Samaga, Shreya Johri, Sri Krishna Venigalla, Rabi Narayan Hota, Surinder Singh Bhatia, Ishaan Gupta

Abstract: Many countries are now experiencing the third wave of the COVID-19 pandemic straining the healthcare resources with an acute shortage of hospital beds and ventilators for the critically ill patients. This situation is especially worse in India with the second largest load of COVID-19 cases and a relatively resource-scarce medical infrastructure. Therefore, it becomes essential to triage the patients based on the severity of their disease and devote resources towards critically ill patients. Yan et al. 1 have published a very pertinent research that uses Machine learning (ML) methods to predict the outcome of COVID-19 patients based on their clinical parameters at the day of admission. They used the XGBoost algorithm, a type of ensemble model, to build the mortality prediction model. The final classifier is built through the sequential addition of multiple weak classifiers. The clinically operable decision rule was obtained from a 'single-tree XGBoost' and used lactic dehydrogenase (LDH), lymphocyte and high-sensitivity C-reactive protein (hs-CRP) values. This decision tree achieved a 100% survival prediction and 81% mortality prediction. However, these models have several technical challenges and do not provide an out of the box solution that can be deployed for other populations as has been reported in the "Matters Arising" section of Yan et al. Here, we show the limitations of this model by deploying it on one of the largest datasets of COVID-19 patients containing detailed clinical parameters collected from India.

Submitted 15 January, 2021; originally announced January 2021.

Comments: 8 pages, 1 figure, 1 table. Study designed by: IG, SB, YM, SJ. Data collected and curated by: SKJ, PG, SNS, RNH, SSB, PSM, SKV and SS. Data analysis performed by: SB, YM. Manuscript was written by: IG, SS, SB, YM. All authors read and approved the final manuscript. The first two authors have contributed equally.

arXiv:2010.03035 [pdf, other]

Move Fast and Meet Deadlines: Fine-grained Real-time Stream Processing with Cameo

Authors: Le Xu, Shivaram Venkataraman, Indranil Gupta, Luo Mai, Rahul Potharaju

Abstract: Resource provisioning in multi-tenant stream processing systems faces the dual challenges of keeping resource utilization high (without over-provisioning), and ensuring performance isolation. In our common production use cases, where streaming workloads have to meet latency targets and avoid breaching service-level agreements, existing solutions are incapable of handling the wide variability of user needs. Our framework called Cameo uses fine-grained stream processing (inspired by actor computation models), and is able to provide high resource utilization while meeting latency targets. Cameo dynamically calculates and propagates priorities of events based on user latency targets and query semantics. Experiments on Microsoft Azure show that compared to state-of-the-art, the Cameo framework: i) reduces query latency by 2.7X in single tenant settings, ii) reduces query latency by 4.6X in multi-tenant scenarios, and iii) weathers transient spikes of workload.

Submitted 6 October, 2020; originally announced October 2020.

arXiv:2007.13221 [pdf, other]

CSER: Communication-efficient SGD with Error Reset

Authors: Cong Xie, Shuai Zheng, Oluwasanmi Koyejo, Indranil Gupta, Mu Li, Haibin Lin

Abstract: The scalability of Distributed Stochastic Gradient Descent (SGD) is today limited by communication bottlenecks. We propose a novel SGD variant: Communication-efficient SGD with Error Reset, or CSER. The key idea in CSER is first a new technique called "error reset" that adapts arbitrary compressors for SGD, producing bifurcated local models with periodic reset of resulting local residual errors. Second we introduce partial synchronization for both the gradients and the models, leveraging advantages from them. We prove the convergence of CSER for smooth non-convex problems. Empirical results show that when combined with highly aggressive compressors, the CSER algorithms accelerate the distributed training by nearly 10x for CIFAR-100, and by 4.5x for ImageNet.

Submitted 4 December, 2020; v1 submitted 26 July, 2020; originally announced July 2020.

arXiv:2007.12359 [pdf, other]

doi 10.1145/3447786.3456261

Home, SafeHome: Smart Home Reliability with Visibility and Atomicity

Authors: Shegufta Bakht Ahsan, Rui Yang, Shadi A. Noghabi, Indranil Gupta

Abstract: Smart environments (homes, factories, hospitals, buildings) contain an increasing number of IoT devices, making them complex to manage. Today, in smart homes where users or triggers initiate routines (i.e., a sequence of commands), concurrent routines and device failures can cause incongruent outcomes. We describe SafeHome, a system that provides notions of atomicity and serial equivalence for smart homes. Due to the human-facing nature of smart homes, SafeHome offers a spectrum of visibility models which trade off between responsiveness vs. incongruence of the smart home state. We implemented SafeHome and performed workload-driven experiments. We find that a weak visibility model, called eventual visibility, is almost as fast as today's status quo (up to 23% slower) and yet guarantees serially-equivalent end states.

Submitted 24 July, 2020; originally announced July 2020.

Comments: 12 pages

arXiv:1911.09030 [pdf, other]

Local AdaAlter: Communication-Efficient Stochastic Gradient Descent with Adaptive Learning Rates

Authors: Cong Xie, Oluwasanmi Koyejo, Indranil Gupta, Haibin Lin

Abstract: When scaling distributed training, the communication overhead is often the bottleneck. In this paper, we propose a novel SGD variant with reduced communication and adaptive learning rates. We prove the convergence of the proposed algorithm for smooth but non-convex problems. Empirical results show that the proposed algorithm significantly reduces the communication overhead, which, in turn, reduces the training time by up to 30% for the 1B word dataset.

Submitted 4 December, 2020; v1 submitted 20 November, 2019; originally announced November 2019.

arXiv:1905.09945 [pdf, other]

Multifaceted Privacy: How to Express Your Online Persona without Revealing Your Sensitive Attributes

Authors: Victor Zakhary, Ishani Gupta, Rey Tang, Amr El Abbadi

Abstract: Recent works in social network stream analysis show that a user's online persona attributes (e.g., gender, ethnicity, political interest, location, etc.) can be accurately inferred from the topics the user writes about or engages with. Attribute and preference inferences have been widely used to serve personalized recommendations, directed ads, and to enhance the user experience in social networks. However, revealing a user's sensitive attributes could represent a privacy threat to some individuals. Microtargeting (e.g., Cambridge Analytica scandal), surveillance, and discriminating ads are examples of threats to user privacy caused by sensitive attribute inference. In this paper, we propose Multifaceted privacy, a novel privacy model that aims to obfuscate a user's sensitive attributes while publicly preserving the user's public persona. To achieve multifaceted privacy, we build Aegis, a prototype client-centric social network stream processing system that helps preserve multifaceted privacy, and thus allowing social network users to freely express their online personas without revealing their sensitive attributes of choice. Aegis allows social network users to control which persona attributes should be publicly revealed and which ones should be kept private. For this, Aegis continuously suggests topics and hashtags to social network users to post in order to obfuscate their sensitive attributes and hence confuse content-based sensitive attribute inferences. The suggested topics are carefully chosen to preserve the user's publicly revealed persona attributes while hiding their private sensitive persona attributes. Our experiments show that adding as few as 0 to 4 obfuscation posts (depending on how revealing the original post is) successfully hides the user specified sensitive attributes without changing the user's public persona attributes.

Submitted 23 May, 2019; originally announced May 2019.

arXiv:1903.07020 [pdf, other]

Zeno++: Robust Fully Asynchronous SGD

Authors: Cong Xie, Sanmi Koyejo, Indranil Gupta

Abstract: We propose Zeno++, a new robust asynchronous Stochastic Gradient Descent (SGD) procedure which tolerates Byzantine failures of the workers. In contrast to previous work, Zeno++ removes some unrealistic restrictions on worker-server communications, allowing for fully asynchronous updates from anonymous workers, arbitrarily stale worker updates, and the possibility of an unbounded number of Byzantine workers. The key idea is to estimate the descent of the loss value after the candidate gradient is applied, where large descent values indicate that the update results in optimization progress. We prove the convergence of Zeno++ for non-convex problems under Byzantine failures. Experimental results show that Zeno++ outperforms existing approaches.

Submitted 9 May, 2021; v1 submitted 16 March, 2019; originally announced March 2019.

Comments: ICML version with some additional remarks related to the acceptance rate of Byzantine validation, and also with the full version of error bounds in the theorems

arXiv:1903.06996 [pdf, other]

SLSGD: Secure and Efficient Distributed On-device Machine Learning

Authors: Cong Xie, Sanmi Koyejo, Indranil Gupta

Abstract: We consider distributed on-device learning with limited communication and security requirements. We propose a new robust distributed optimization algorithm with efficient communication and attack tolerance. The proposed algorithm has provable convergence and robustness under non-IID settings. Empirical results show that the proposed algorithm stabilizes the convergence and tolerates data poisoning on a small number of workers.

Submitted 1 October, 2019; v1 submitted 16 March, 2019; originally announced March 2019.

arXiv:1903.03936 [pdf, other]

Fall of Empires: Breaking Byzantine-tolerant SGD by Inner Product Manipulation

Authors: Cong Xie, Sanmi Koyejo, Indranil Gupta

Abstract: Recently, new defense techniques have been developed to tolerate Byzantine failures for distributed machine learning. The Byzantine model captures workers that behave arbitrarily, including malicious and compromised workers. In this paper, we break two prevailing Byzantine-tolerant techniques. Specifically we show robust aggregation methods for synchronous SGD -- coordinate-wise median and Krum -- can be broken using new attack strategies based on inner product manipulation. We prove our results theoretically, as well as show empirical validation.

Submitted 10 March, 2019; originally announced March 2019.

arXiv:1903.03934 [pdf, other]

Asynchronous Federated Optimization

Authors: Cong Xie, Sanmi Koyejo, Indranil Gupta

Abstract: Federated learning enables training on a massive number of edge devices. To improve flexibility and scalability, we propose a new asynchronous federated optimization algorithm. We prove that the proposed approach has near-linear convergence to a global optimum, for both strongly convex and a restricted family of non-convex problems. Empirical results show that the proposed algorithm converges quickly and tolerates staleness in various applications.

Submitted 4 December, 2020; v1 submitted 10 March, 2019; originally announced March 2019.

arXiv:1805.10032 [pdf, other]

Zeno: Distributed Stochastic Gradient Descent with Suspicion-based Fault-tolerance

Authors: Cong Xie, Oluwasanmi Koyejo, Indranil Gupta

Abstract: We present Zeno, a technique to make distributed machine learning, particularly Stochastic Gradient Descent (SGD), tolerant to an arbitrary number of faulty workers. Zeno generalizes previous results that assumed a majority of non-faulty nodes; we need assume only one non-faulty worker. Our key idea is to suspect workers that are potentially defective. Since this is likely to lead to false positives, we use a ranking-based preference mechanism. We prove the convergence of SGD for non-convex problems under these scenarios. Experimental results show that Zeno outperforms existing approaches.

Submitted 17 May, 2019; v1 submitted 25 May, 2018; originally announced May 2018.

Comments: ICML 2019

arXiv:1805.09682 [pdf, other]

Phocas: dimensional Byzantine-resilient stochastic gradient descent

Authors: Cong Xie, Oluwasanmi Koyejo, Indranil Gupta

Abstract: We propose a novel robust aggregation rule for distributed synchronous Stochastic Gradient Descent (SGD) under a general Byzantine failure model. The attackers can arbitrarily manipulate the data transferred between the servers and the workers in the parameter server (PS) architecture. We prove the Byzantine resilience of the proposed aggregation rules. Empirical analysis shows that the proposed techniques outperform current approaches for realistic use cases and Byzantine attack scenarios.

Submitted 23 May, 2018; originally announced May 2018.

Comments: Submitted to NIPS 2018. arXiv admin note: substantial text overlap with arXiv:1802.10116

arXiv:1802.10116 [pdf, other]

Generalized Byzantine-tolerant SGD

Authors: Cong Xie, Oluwasanmi Koyejo, Indranil Gupta

Abstract: We propose three new robust aggregation rules for distributed synchronous Stochastic Gradient Descent (SGD) under a general Byzantine failure model. The attackers can arbitrarily manipulate the data transferred between the servers and the workers in the parameter server (PS) architecture. We prove the Byzantine resilience properties of these aggregation rules. Empirical analysis shows that the proposed techniques outperform current approaches for realistic use cases and Byzantine attack scenarios.

Submitted 23 March, 2018; v1 submitted 27 February, 2018; originally announced February 2018.

arXiv:1802.00082 [pdf, other]

Henge: Intent-driven Multi-Tenant Stream Processing

Authors: Faria Kalim, Le Xu, Sharanya Bathey, Richa Meherwal, Indranil Gupta

Abstract: We present Henge, a system to support intent-based multi-tenancy in modern stream processing applications. Henge supports multi-tenancy as a first-class citizen: everyone inside an organization can now submit their stream processing jobs to a single, shared, consolidated cluster. Additionally, Henge allows each tenant (job) to specify its own intents (i.e., requirements) as a Service Level Objective (SLO) that captures latency and/or throughput. In a multi-tenant cluster, the Henge scheduler adapts continually to meet jobs' SLOs in spite of limited cluster resources, and under dynamic input workloads. SLOs are soft and are based on utility functions. Henge continually tracks SLO satisfaction, and when jobs miss their SLOs, it wisely navigates the state space to perform resource allocations in real time, maximizing total system utility achieved by all jobs in the system. Henge is integrated in Apache Storm and we present experimental results using both production topologies and real datasets.

Submitted 31 January, 2018; originally announced February 2018.

arXiv:1712.10056 [pdf, other]

Inferring Formal Properties of Production Key-Value Stores

Authors: Edgar Pek, Pranav Garg, Muntasir Raihan Rahman, Karl Palmskog, Indranil Gupta, P. Madhusudan

Abstract: Production distributed systems are challenging to formally verify, in particular when they are based on distributed protocols that are not rigorously described or fully understood. In this paper, we derive models and properties for two core distributed protocols used in eventually consistent production key-value stores such as Riak and Cassandra. We propose a novel modeling called certified program models, where complete distributed systems are captured as programs written in traditional systems languages such as concurrent C. Specifically, we model the read-repair and hinted-handoff recovery protocols as concurrent C programs, test them for conformance with real systems, and then verify that they guarantee eventual consistency, modeling precisely the specification as well as the failure assumptions under which the results hold.

Submitted 28 December, 2017; originally announced December 2017.

Comments: 15 pages, 2 figures

arXiv:1707.08772 [pdf]

Spike sorting using non-volatile metal-oxide memristors

Authors: Isha Gupta, Alexantrou Serb, Ali Khiat, Maria Trapatseli, Themistoklis Prodromakis

Abstract: Electrophysiological techniques have improved substantially over the past years to the point that neuroprosthetics applications are becoming viable. This evolution has been fuelled by the advancement of implantable microelectrode technologies that have followed their own version of Moore's scaling law. Similarly to electronics, however, excessive data-rates and strained power budgets require the development of more efficient computation paradigms for handling neural data in-situ, in particular the computationally heavy task of events classification. Here, we demonstrate how the intrinsic analogue programmability of memristive devices can be exploited to perform spike-sorting. We then show how combining memristors with standard logic enables efficient in-silico template matching. Leveraging the physical properties of nanoscale memristors allows us to implement ultra-compact analogue circuits for neural signal processing at the power cost of digital.

Submitted 27 July, 2017; originally announced July 2017.

Comments: 7 pages, 3 figures

arXiv:1611.09671 [pdf]

Sub 100nW volatile nano-metal-oxide memristor as synaptic-like encoder of neuronal spikes

Authors: Isha Gupta, Alexantrou Serb, Ali Khiat, Ralf Zeitler, Stefano Vassanelli, Themistoklis Prodromakis

Abstract: Advanced neural interfaces mediate a bio-electronic link between the nervous system and microelectronic devices, bearing great potential as innovative therapy for various diseases. Spikes from a large number of neurons are recorded leading to creation of big data that require on-line processing under most stringent conditions, such as minimal power dissipation and on-chip space occupancy. Here, we present a new concept where the inherent volatile properties of a nano-scale memristive device are used to detect and compress information on neural spikes as recorded by a multi-electrode array. Simultaneously, and similarly to a biological synapse, information on spike amplitude and frequency is transduced in metastable resistive state transitions of the device, which is inherently capable of self-resetting and of continuous encoding of spiking activity. Furthermore, operating the memristor in a very high resistive state range reduces its average in-operando power dissipation to less than 100 nW, demonstrating the potential to build highly scalable, yet energy-efficient on-node processors for advanced neural interfaces.

Submitted 29 November, 2016; originally announced November 2016.

Comments: 15 pages main article, 15 pages supplementary information, 2 pages supplementary notes

arXiv:1509.02464 [pdf, other]

Characterizing and Adapting the Consistency-Latency Tradeoff in Distributed Key-value Stores

Authors: Muntasir Raihan Rahman, Lewis Tseng, Son Nguyen, Indranil Gupta, Nitin Vaidya

Abstract: The CAP theorem is a fundamental result that applies to distributed storage systems. In this paper, we first present and prove two CAP-like impossibility theorems. To state these theorems, we present probabilistic models to characterize the three important elements of the CAP theorem: consistency (C), availability or latency (A), and partition tolerance (P). The theorems show the un-achievable envelope, i.e., which combinations of the parameters of the three models make them impossible to achieve together. Next, we present the design of a class of systems called PCAP that perform close to the envelope described by our theorems. In addition, these systems allow applications running on a single data-center to specify either a latency SLA or a consistency SLA. The PCAP systems automatically adapt, in real-time and under changing network conditions, to meet the SLA while optimizing the other C/A metric. We incorporate PCAP into two popular key-value stores -- Apache Cassandra and Riak. Our experiments with these two deployments, under realistic workloads, reveal that the PCAP system satisfactorily meets SLAs, and performs close to the achievable envelope. We also extend PCAP from a single data-center to multiple geo-distributed data-centers.

Submitted 23 January, 2016; v1 submitted 8 September, 2015; originally announced September 2015.

arXiv:1507.06832 [pdf]

Memristive integrative sensors for neuronal activity

Authors: Isha Gupta, Alexantrou Serb, Ali Khiat, Ralf Zeitler, Stefano Vassanelli, Themistoklis Prodromakis

Abstract: The advent of advanced neuronal interfaces offers great promise for linking brain functions to electronics. A major bottleneck in achieving this is real-time processing of big data that imposes excessive requirements on bandwidth, energy and computation capacity; limiting the overall number of bio-electronic links. Here, we present a novel monitoring system concept that exploits the intrinsic properties of memristors for processing neural information in real time. We demonstrate that the inherent voltage thresholds of solid-state TiOx memristors can be useful for discriminating significant neural activity, i.e. spiking events, from noise. When compared with a multi-dimensional, principal component feature space threshold detector, our system is capable of recording the majority of significant events, without resorting to computationally heavy off-line processing. We also show a memristive integrating sensing array that discriminates neuronal activity recorded in-vitro. We prove that information on spiking event amplitude is simultaneously transduced and stored as non-volatile resistive state transitions, allowing for more efficient data compression, demonstrating the memristors' potential for building scalable, yet energy efficient on-node processors for big data.

Submitted 24 July, 2015; originally announced July 2015.

arXiv:1207.2991 [pdf]

BIGP- a new single protocol that can work as an igp (interior gateway protocol) as well as egp (exterior gateway protocol)

Authors: Isha Gupta

Abstract: EGP and IGP are the key components of the present internet infrastructure. Routers in a domain forward IP packet within and between domains. Each domain uses an intra-domain routing protocol known as Interior Gateway Protocol (IGP) like IS-IS, OSPF, RIP etc to populate the routing tables of its routers. Routing information must also be exchanged between domains to ensure that a host in one domain can reach another host in remote domain. This role is performed by inter-domain routing protocol called Exterior Gateway Protocol (EGP). Basically EGP used these days is Border Gateway Protocol (BGP). Basic difference between the both is that BGP has smaller convergence as compared to the IGP's. And IGP's on the other hand have lesser scalability as compared to the BGP. So in this paper a proposal to create a new protocol is given which can act as an IGP when we consider inter-domain transfer of traffic and acts as BGP when we consider intra-domain transfer of traffic.

Submitted 12 July, 2012; originally announced July 2012.

Comments: 5 Pages, 6 Figures

arXiv:cs/0509095 [pdf]

Leveraging Social-Network Infrastructure to Improve Peer-to-Peer Overlay Performance: Results from Orkut

Authors: Zahid Anwar, William Yurcik, Vivek Pandey, Asim Shankar, Indranil Gupta, Roy H. Campbell

Abstract: Application-level peer-to-peer (P2P) network overlays are an emerging paradigm that facilitates decentralization and flexibility in the scalable deployment of applications such as group communication, content delivery, and data sharing. However, the construction of the overlay graph topology optimized for low latency, low link and node stress and lookup performance is still an open problem. We present a design of an overlay constructed on top of a social network and show that it gives a sizable improvement in lookups, average round-trip delay and scalability as opposed to other overlay topologies. We build our overlay on top of the topology of a popular real-world social network, namely Orkut. We show Orkut's suitability for our purposes by evaluating the clustering behavior of its graph structure and the socializing pattern of its members.

Submitted 28 September, 2005; originally announced September 2005.

Comments: 9 pages 8 figures

ACM Class: C.2.2

Showing 1–41 of 41 results for author: Gupta, I