Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging
Episode 295

Daily Paper Cast

December 31, 2024 | 19m 3s

Show Notes

🤗 Upvotes: 6 | cs.CL

Authors:
Hua Farn, Hsuan Su, Shachi H Kumar, Saurav Sahay, Shang-Tse Chen, Hung-yi Lee

Title:
Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging

Arxiv:
http://arxiv.org/abs/2412.19512v1

Abstract:
Fine-tuning large language models (LLMs) for downstream tasks is a widely adopted approach, but it often leads to safety degradation in safety-aligned LLMs. Currently, many solutions address this issue by incorporating additional safety data, which can be impractical in many cases. In this paper, we address the question: How can we improve downstream task performance while preserving safety in LLMs without relying on additional safety data? We propose a simple and effective method that maintains the inherent safety of LLMs while enhancing their downstream task performance: merging the weights of pre- and post-fine-tuned safety-aligned models. Experimental results across various downstream tasks, models, and merging methods demonstrate that this approach effectively mitigates safety degradation while improving downstream task performance, offering a practical solution for adapting safety-aligned LLMs.
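
The core idea in the abstract, merging the weights of the pre- and post-fine-tuned models, can be illustrated with a minimal sketch. This is not the authors' code: the abstract says several merging methods were evaluated, and the example below shows only plain linear interpolation as one simple possibility. The names `StateDict`, `linear_merge`, and `lam` are hypothetical, and plain Python lists stand in for real weight tensors.

```python
from typing import Dict, List

# Hypothetical stand-in for a model's named weight tensors.
StateDict = Dict[str, List[float]]


def linear_merge(base: StateDict, tuned: StateDict, lam: float) -> StateDict:
    """Interpolate parameter by parameter: (1 - lam) * base + lam * tuned.

    base  -- weights of the safety-aligned model before fine-tuning
    tuned -- weights of the same model after fine-tuning
    lam   -- assumed mixing coefficient; 0 keeps the base, 1 keeps the tuned model
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    if base.keys() != tuned.keys():
        raise ValueError("models must share the same parameter names")

    merged: StateDict = {}
    for name in base:
        if len(base[name]) != len(tuned[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            (1.0 - lam) * b + lam * t
            for b, t in zip(base[name], tuned[name])
        ]
    return merged


# Toy usage with made-up numbers, only to show the call shape.
pre = {"layer.weight": [0.0, 2.0]}
post = {"layer.weight": [1.0, 4.0]}
merged = linear_merge(pre, post, lam=0.5)
```

With real checkpoints the same loop would run over framework tensors rather than lists, and the mixing coefficient would need to be chosen by evaluating both safety and downstream task performance.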