Compressing LLMs: The Truth is Rarely Pure and Never Simple

A Jaiswal, Z Gan, X Du, B Zhang, Z Wang… - arXiv preprint arXiv:2310.01382, 2023 - arxiv.org
Despite their remarkable achievements, modern Large Language Models (LLMs) incur exorbitant computational and memory footprints. Recently, several works have shown significant success in training-free and data-free compression (pruning and quantization) of LLMs, achieving 50-60% sparsity and reducing the bit-width down to 3 or 4 bits per weight with negligible perplexity degradation over the uncompressed baseline. As recent research efforts focus on developing increasingly sophisticated compression methods, our work takes a step back and re-evaluates the effectiveness of existing SoTA compression methods, which rely on a fairly simple and widely questioned metric, perplexity (even for dense LLMs). We introduce the Knowledge-Intensive Compressed LLM BenchmarK (LLM-KICK), a collection of carefully curated tasks to redefine the evaluation protocol for compressed LLMs, which appear closely aligned with their dense counterparts under perplexity, a metric that fails to capture subtle changes in their true capabilities. LLM-KICK unveils many favorable merits and unfortunate plights of current SoTA compression methods: all pruning methods suffer significant performance degradation, sometimes at trivial sparsity ratios (e.g., 25-30%), and fail for N:M sparsity on knowledge-intensive tasks; current quantization methods are more successful than pruning; yet pruned LLMs, even at ≥50% sparsity, remain robust in-context retrieval and summarization systems; among other findings. LLM-KICK is designed to holistically assess compressed LLMs' abilities in language understanding, reasoning, generation, in-context retrieval, in-context summarization, etc. We hope our study can foster the development of better LLM compression methods. All our related code is planned to be open-sourced.
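To make the pruning setups mentioned in the abstract concrete, the sketch below applies one-shot magnitude pruning to a weight matrix, both at an unstructured 50% sparsity ratio and in the structured N:M (here 2:4) pattern the benchmark evaluates. This is a minimal illustration under simplifying assumptions, not the specific pruning algorithms the paper studies; the tensor shapes, helper names, and thresholding logic are choices made for the example.

```python
import numpy as np

def magnitude_prune(weight: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude entries until at least `sparsity` fraction is zero."""
    flat = np.abs(weight).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weight.copy()
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    # Entries with magnitude <= threshold are removed (ties may remove slightly more than k).
    return weight * (np.abs(weight) > threshold)

def nm_prune(weight: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """Keep the n largest-magnitude entries in every group of m along the last axis."""
    rows, cols = weight.shape
    assert cols % m == 0, "column count must be divisible by m"
    groups = weight.reshape(rows, cols // m, m)
    order = np.argsort(np.abs(groups), axis=-1)       # ascending by magnitude
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, order[..., : m - n], False, axis=-1)  # drop the m-n smallest
    return (groups * mask).reshape(rows, cols)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(8, 16)).astype(np.float32)
    w_50 = magnitude_prune(w, sparsity=0.5)   # unstructured 50% sparsity
    w_24 = nm_prune(w, n=2, m=4)              # structured 2:4 sparsity
    print("unstructured zero fraction:", np.mean(w_50 == 0))
    print("2:4 zero fraction:", np.mean(w_24 == 0))
```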
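The 3-4 bit quantization results referred to above can likewise be pictured with a basic round-to-nearest, per-output-channel uniform quantizer. Again, this is a hedged sketch rather than the calibration-based quantization methods the paper actually benchmarks; the symmetric per-row scheme and function names below are assumptions chosen for brevity.

```python
import numpy as np

def quantize_rtn(weight: np.ndarray, bits: int = 4):
    """Symmetric per-row round-to-nearest quantization (bits <= 8).

    Returns integer codes and per-row scales; reconstruct with codes * scales.
    """
    qmax = 2 ** (bits - 1) - 1                         # e.g. 7 for 4-bit signed
    scales = np.abs(weight).max(axis=1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales)        # avoid division by zero
    codes = np.clip(np.round(weight / scales), -qmax - 1, qmax).astype(np.int8)
    return codes, scales

def dequantize(codes: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Map integer codes back to floating point using the stored scales."""
    return codes.astype(np.float32) * scales

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(8, 16)).astype(np.float32)
    codes, scales = quantize_rtn(w, bits=4)
    w_hat = dequantize(codes, scales)
    print("max abs reconstruction error:", np.max(np.abs(w - w_hat)))
```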