Computer Science > Computation and Language

arXiv:1911.12579 (cs)

[Submitted on 28 Nov 2019 (v1), last revised 30 Dec 2020 (this version, v3)]

Title: Word Embedding based New Corpus for Low-resourced Language: Sindhi

Authors: Wazir Ali, Jay Kumar, Junyu Lu, Zenglin Xu


Abstract: Representing words and phrases as dense vectors of real numbers that encode semantic and syntactic properties is a vital constituent of natural language processing (NLP). The success of neural network (NN) models in NLP relies largely on such dense word representations learned from large unlabeled corpora. Sindhi, a morphologically rich language spoken by a large population in Pakistan and India, lacks corpora, which play an essential role as a test-bed for generating word embeddings and developing language-independent NLP systems. In this paper, a large corpus of more than 61 million words is developed for the low-resourced Sindhi language for training neural word embeddings. The corpus is acquired from multiple web resources using web-scrappy. Because no open-source preprocessing tools are available for Sindhi, preprocessing such a large corpus is a challenging problem, especially the cleaning of noisy data extracted from web resources. Therefore, a preprocessing pipeline is employed to filter out noisy text. Afterwards, the cleaned vocabulary is used to train Sindhi word embeddings with the state-of-the-art GloVe, Skip-Gram (SG), and Continuous Bag of Words (CBoW) word2vec algorithms. Intrinsic evaluation using a cosine similarity matrix and WordSim-353 is employed to assess the generated Sindhi word embeddings. Moreover, we compare the proposed word embeddings with the recently released Sindhi fastText (SdfastText) word representations. Our intrinsic evaluation results demonstrate the high quality of the Sindhi word embeddings generated with SG, CBoW, and GloVe compared to the SdfastText word representations.
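As an illustration of the cosine similarity matrix mentioned in the abstract, the following is a minimal sketch, not the authors' evaluation code. The function names (`cosine_similarity`, `similarity_matrix`), the placeholder words, and the 3-dimensional toy vectors are all assumptions made for this example; real embeddings would be loaded from a trained model and would have far more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def similarity_matrix(embeddings, words):
    """Build the pairwise cosine similarity matrix for the given words."""
    return [
        [cosine_similarity(embeddings[a], embeddings[b]) for b in words]
        for a in words
    ]


# Hypothetical toy vectors standing in for trained word embeddings.
toy_embeddings = {
    "word_a": [1.0, 0.0, 0.0],
    "word_b": [1.0, 1.0, 0.0],
    "word_c": [0.0, 0.0, 2.0],
}
words = list(toy_embeddings)
matrix = similarity_matrix(toy_embeddings, words)

for word, row in zip(words, matrix):
    print(word, [round(value, 3) for value in row])
```

Each entry of the matrix lies in [-1, 1]; values near 1 indicate vectors pointing in nearly the same direction. In a WordSim-353 style evaluation, such scores for word pairs are typically compared against human similarity judgments.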

Comments:	Body 21 pages, Tables 9, Figures 7
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:1911.12579 [cs.CL]
	(or arXiv:1911.12579v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.1911.12579

Submission history

From: Wazir Ali
[v1] Thu, 28 Nov 2019 08:11:44 UTC (4,368 KB)
[v2] Mon, 2 Dec 2019 19:35:19 UTC (4,367 KB)
[v3] Wed, 30 Dec 2020 03:50:16 UTC (4,364 KB)
