NAP^2: A Benchmark for Naturalness and Privacy-Preserving Text Rewriting by Learning from Human
Authors:
Shuo Huang,
William MacLean,
Xiaoxi Kang,
Anqi Wu,
Lizhen Qu,
Qiongkai Xu,
Zhuang Li,
Xingliang Yuan,
Gholamreza Haffari
Abstract:
Concerns about privacy leakage are growing in both academia and industry when NLP models from third-party providers are used to process sensitive texts. To protect privacy before sending sensitive data to those models, we suggest sanitizing sensitive text using two strategies commonly used by humans: i) deleting sensitive expressions, and ii) obscuring sensitive details by abstracting them. To explore these issues and develop a tool for text rewriting, we curate the first such corpus, coined NAP^2, through both crowdsourcing and the use of large language models (LLMs). Compared to prior work based on differential privacy, which leads to a sharp drop in information utility and to unnatural texts, the human-inspired approaches result in more natural rewrites and offer an improved balance between privacy protection and data utility, as demonstrated by our extensive experiments.
Submitted 6 June, 2024;
originally announced June 2024.