Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces

Hong, Yihuai; Yu, Lei; Yang, Haiqin; Ravfogel, Shauli; Geva, Mor

Abstract:The task of "unlearning" certain concepts in large language models (LLMs) has attracted immense attention recently, due to its importance in mitigating undesirable model behaviours, such as the generation of harmful, private, or incorrect information. Current protocols to evaluate unlearning methods largely rely on behavioral tests, without monitoring the presence of unlearned knowledge within the model's parameters. This residual knowledge can be adversarially exploited to recover the erased information post-unlearning. We argue that unlearning should also be evaluated internally, by considering changes in the parametric knowledge traces of the unlearned concepts. To this end, we propose a general evaluation methodology that leverages vocabulary projections to inspect concepts encoded in model parameters. We use this approach to localize "concept vectors" - parameter vectors that encode concrete concepts - and construct ConceptVectors, a benchmark dataset containing hundreds of common concepts and their parametric knowledge traces within two open-source LLMs. Evaluation on ConceptVectors shows that existing unlearning methods minimally impact concept vectors and mostly suppress them during inference, while directly ablating these vectors demonstrably removes the associated knowledge and significantly reduces the model's susceptibility to adversarial manipulation. Our results highlight limitations in behavioral-based unlearning evaluations and call for future work to include parameter-based evaluations. To support this, we release our code and benchmark at this https URL.

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2406.11614 [cs.CL]
	(or arXiv:2406.11614v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2406.11614

Computer Science > Computation and Language

Title:Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators