subscribe to arXiv mailings

Matching Algorithms in the Sparse Stochastic Block Model

Authors: Anna Brandenberger, Byron Chin, Nathan S. Sheffield, Divya Shyamal

Abstract: The stochastic block model (SBM) is a generalization of the Erdős--Rényi model of random graphs that describes the interaction of a finite number of distinct communities. In sparse Erdős--Rényi graphs, it is known that a linear-time algorithm of Karp and Sipser achieves near-optimal matching sizes asymptotically almost surely, giving a law-of-large numbers for the matching sizes of such graphs in… ▽ More The stochastic block model (SBM) is a generalization of the Erdős--Rényi model of random graphs that describes the interaction of a finite number of distinct communities. In sparse Erdős--Rényi graphs, it is known that a linear-time algorithm of Karp and Sipser achieves near-optimal matching sizes asymptotically almost surely, giving a law-of-large numbers for the matching sizes of such graphs in terms of solutions to an ODE. We provide an extension of this analysis, identifying broad ranges of stochastic block model parameters for which the Karp--Sipser algorithm achieves near-optimal matching sizes, but demonstrating that it cannot perform optimally on general SBM instances. We also consider the problem of constructing a matching online, in which the vertices of one half of a bipartite stochastic block model arrive one-at-a-time, and must be matched as they arrive. We show that the competitive ratio lower bound of 0.837 found by Mastin and Jaillet for the Erdős--Rényi case is tight whenever the expected degrees in all communities are equal. We propose several linear-time algorithms for online matching in the general stochastic block model, but prove that despite very good experimental performance, none of these achieve online asymptotic optimality. △ Less

Submitted 4 March, 2024; originally announced March 2024.

Comments: 28 pages

arXiv:2310.04237 [pdf]

Written and spoken corpus of real and fake social media postings about COVID-19

Authors: Ng Bee Chin, Ng Zhi Ee Nicole, Kyla Kwan, Lee Yong Han Dylann, Liu Fang, Xu Hong

Abstract: This study investigates the linguistic traits of fake news and real news. There are two parts to this study: text data and speech data. The text data for this study consisted of 6420 COVID-19 related tweets re-filtered from Patwa et al. (2021). After cleaning, the dataset contained 3049 tweets, with 2161 labeled as 'real' and 888 as 'fake'. The speech data for this study was collected from TikTok,… ▽ More This study investigates the linguistic traits of fake news and real news. There are two parts to this study: text data and speech data. The text data for this study consisted of 6420 COVID-19 related tweets re-filtered from Patwa et al. (2021). After cleaning, the dataset contained 3049 tweets, with 2161 labeled as 'real' and 888 as 'fake'. The speech data for this study was collected from TikTok, focusing on COVID-19 related videos. Research assistants fact-checked each video's content using credible sources and labeled them as 'Real', 'Fake', or 'Questionable', resulting in a dataset of 91 real entries and 109 fake entries from 200 TikTok videos with a total word count of 53,710 words. The data was analysed using the Linguistic Inquiry and Word Count (LIWC) software to detect patterns in linguistic data. The results indicate a set of linguistic features that distinguish fake news from real news in both written and speech data. This offers valuable insights into the role of language in shaping trust, social media interactions, and the propagation of fake news. △ Less

Submitted 6 October, 2023; originally announced October 2023.

Comments: 9 pages, 3 tables

arXiv:2106.04831 [pdf, other]

MICE: A Crosslinguistic Emotion Corpus in Malay, Indonesian, Chinese and English

Authors: Ng Bee Chin, Yosephine Susanto, Erik Cambria

Abstract: MICE is a corpus of emotion words in four languages which is currently working progress. There are two sections to this study, Part I: Emotion word corpus and Part II: Emotion word survey. In Part 1, the method of how the emotion data is culled for each of the four languages will be described and very preliminary data will be presented. In total, we identified 3,750 emotion expressions in Malay, 6… ▽ More MICE is a corpus of emotion words in four languages which is currently working progress. There are two sections to this study, Part I: Emotion word corpus and Part II: Emotion word survey. In Part 1, the method of how the emotion data is culled for each of the four languages will be described and very preliminary data will be presented. In total, we identified 3,750 emotion expressions in Malay, 6,657 in Indonesian, 3,347 in Mandarin Chinese and 8,683 in English. We are currently evaluating and double checking the corpus and doing further analysis on the distribution of these emotion expressions. Part II Emotion word survey involved an online language survey which collected information on how speakers assigned the emotion words into basic emotion categories, the rating for valence and intensity as well as biographical information of all the respondents. △ Less

Submitted 9 June, 2021; originally announced June 2021.

arXiv:2012.06209 [pdf, other]

KOSMOS: Knowledge-graph Oriented Social media and Mainstream media Overview System

Authors: Chua Hao Yang, Yong Shan Jie, Boon Kok Chin, Lander Chin, Lynnette Hui Xian Ng

Abstract: We introduce KOSMOS, a knowledge retrieval system based on the constructed knowledge graph of social media and mainstream media documents. The system first identifies key events from the documents at each time frame through clustering, extracting a document to represent each cluster, then describing the document in terms of 5W1H (Who, What, When, Where, Why, How). The event centric knowledge graph… ▽ More We introduce KOSMOS, a knowledge retrieval system based on the constructed knowledge graph of social media and mainstream media documents. The system first identifies key events from the documents at each time frame through clustering, extracting a document to represent each cluster, then describing the document in terms of 5W1H (Who, What, When, Where, Why, How). The event centric knowledge graph is enhanced by relation triplets and entity disambiguation from the representative document. This knowledge retrieval is supported by a web interface that presents a graph visualisation of related nodes and relevant articles based on a user query. The interface facilitates understanding relationships between events reported in mainstream and social media journalism through the KOSMOS information extraction pipeline, which is valuable to understand media slant and public opinions. Finally, we explore a use case in extracting events and relations from documents to understand the media and community's view to the 2020 COVID19 pandemic. △ Less

Submitted 17 December, 2020; v1 submitted 11 December, 2020; originally announced December 2020.

arXiv:2010.10672 [pdf, ps, other]

Optimal Recovery of Block Models with $q$ Communities

Authors: Byron Chin, Allan Sly

Abstract: This paper is motivated by the reconstruction problem on the sparse stochastic block model. The paper "Belief Propagation, robust reconstruction and optimal recovery of block models" by Mossel, Neeman, and Sly provided and proved a reconstruction algorithm that recovers an optimal fraction of the communities in the 2 community case. The main step in their proof was to show that when the signal to… ▽ More This paper is motivated by the reconstruction problem on the sparse stochastic block model. The paper "Belief Propagation, robust reconstruction and optimal recovery of block models" by Mossel, Neeman, and Sly provided and proved a reconstruction algorithm that recovers an optimal fraction of the communities in the 2 community case. The main step in their proof was to show that when the signal to noise ratio is sufficiently large, in particular $θ^2d > C$, the reconstruction accuracy on a regular tree with or without noise on the leaves is the same. This paper will generalize their results, including the main step, to any number of communities, providing an algorithm related to Belief Propagation that recovers a provably optimal fraction of community labels. △ Less

Submitted 20 October, 2020; originally announced October 2020.

Comments: 49 pages

arXiv:2005.05063 [pdf, other]

doi 10.1038/s41598-020-75697-z

Spatial super-spreaders and super-susceptibles in human movement networks

Authors: Wei Chien Benny Chin, Roland Bouffanais

Abstract: As lockdowns and stay-at-home orders start to be lifted across the globe, governments are struggling to establish effective and practical guidelines to reopen their economies. In dense urban environments with people returning to work and public transportation resuming full capacity, enforcing strict social distancing measures will be extremely challenging, if not practically impossible. Government… ▽ More As lockdowns and stay-at-home orders start to be lifted across the globe, governments are struggling to establish effective and practical guidelines to reopen their economies. In dense urban environments with people returning to work and public transportation resuming full capacity, enforcing strict social distancing measures will be extremely challenging, if not practically impossible. Governments are thus paying close attention to particular locations that may become the next cluster of disease spreading. Indeed, certain places, like some people, can be "super-spreaders." Is a bustling train station in a central business district more or less susceptible and vulnerable as compared to teeming bus interchanges in the suburbs? Here, we propose a quantitative and systematic framework to identify spatial super-spreaders and the novel concept of super-susceptibles, i.e. respectively, places most likely to contribute to disease spread or to people contracting it. Our proposed data-analytic framework is based on the daily-aggregated ridership data of public transport in Singapore. By constructing the directed and weighted human movement networks and integrating human flow intensity with two neighborhood diversity metrics, we are able to pinpoint super-spreader and super-susceptible locations. Our results reveal that most super-spreaders are also super-susceptibles and that counterintuitively, busy peripheral bus interchanges are riskier places than crowded central train stations. Our analysis is based on data from Singapore, but can be readily adapted and extended for any other major urban center. It therefore serves as a useful framework for devising targeted and cost-effective preventive measures for urban planning and epidemiological preparedness. △ Less

Submitted 11 May, 2020; originally announced May 2020.

Comments: 19 pages, 10 figures

Journal ref: Scientific reports 10 (2020) 18642

Showing 1–6 of 6 results for author: Chin, B