Showing 1–1 of 1 results for author: Bhabesh, S

  1. arXiv:2404.08690  [pdf, other]

    cs.CL cs.AI cs.CR cs.LG

    Towards Building a Robust Toxicity Predictor

    Authors: Dmitriy Bespalov, Sourav Bhabesh, Yi Xiang, Liutong Zhou, Yanjun Qi

    Abstract: Recent NLP literature pays little attention to the robustness of toxicity language predictors, while these systems are most likely to be used in adversarial contexts. This paper presents a novel adversarial attack, ToxicTrap, introducing small word-level perturbations to fool SOTA text classifiers into predicting toxic text samples as benign. ToxicTrap exploits greedy-based search strategies t…

    Submitted 9 April, 2024; originally announced April 2024.

    Comments: ACL 2023