LoRA-Pro: Are Low-Rank Adapters Properly Optimized?

Zhengbo Wang1,2, Jian Liang2,3
1 University of Science and Technology of China
2 NLPR & MAIS, Institute of Automation, Chinese Academy of Sciences
3 School of Artificial Intelligence, University of Chinese Academy of Sciences
zhengbowang@mail.ustc.edu.cn, liangjian92@gmail.com
Correspondence to: Jian Liang (liangjian92@gmail.com)
Abstract

Low-Rank Adaptation, also known as LoRA, has emerged as a prominent method for parameter-efficient fine-tuning of foundation models by re-parameterizing the original matrix into the product of two low-rank matrices. Despite its efficiency, LoRA often yields inferior performance compared to full fine-tuning. In this paper, we propose LoRA-Pro to bridge this performance gap.

Firstly, we delve into the optimization processes in LoRA and full fine-tuning. We reveal that while LoRA employs low-rank approximation, it neglects to approximate the optimization process of full fine-tuning. To address this, we introduce a novel concept called the "equivalent gradient." This virtual gradient makes the optimization process on the re-parameterized matrix equivalent to LoRA, which can be used to quantify the differences between LoRA and full fine-tuning. The equivalent gradient is derived from the gradients of matrices $A$ and $B$. To narrow the performance gap, our approach minimizes the differences between the equivalent gradient and the gradient obtained from full fine-tuning during the optimization process. By solving this objective, we derive optimal closed-form solutions for updating matrices $A$ and $B$. Our method constrains the optimization process, shrinking the performance gap between LoRA and full fine-tuning.

Extensive experiments on natural language processing tasks validate the effectiveness of our method.

1 Introduction

Foundational models [Radford et al., 2021, Brown et al., 2020, Achiam et al., 2023, Kirillov et al., 2023, Rombach et al., 2022] have become the cornerstone of modern deep learning. By undergoing pre-training on massive datasets, these models typically exhibit excellent generalization and versatility. Remarkably, some foundation models even demonstrate emergent properties [Hoffmann et al., 2022, Kaplan et al., 2020]. As a result, foundation models have been widely applied to various downstream applications.

Despite these advantages, the huge number of parameters in foundational models hinders their broader application. The substantial parameter count results in high fine-tuning costs for these tasks. To address this issue, recent research has focused on parameter-efficient fine-tuning (PEFT) methods [Hu et al., 2022, Houlsby et al., 2019, Lester et al., 2021, Zhou et al., 2022]. PEFT methods reduce the fine-tuning cost by keeping the foundation models frozen and only fine-tuning small, additional lightweight adapters. With the majority of parameters frozen, PEFT enables faster fine-tuning and requires fewer computational resources.

Low-rank adaptation [Hu et al., 2022], also known as LoRA, is one of the most famous PEFT methods, which has been widely adopted across various domains. Inspired by previous works [Aghajanyan et al., 2021, Li et al., 2018], LoRA hypothesizes that the changes in weights during model adaptation exhibit a low-rank structure. To capture this, LoRA re-parameterizes these changes by expressing them as the product of two low-rank matrices: $W = W_0 + \Delta W \approx W_0 + sBA$, where $s$ is a scaling factor, and $A \in \mathbb{R}^{r \times n}$ and $B \in \mathbb{R}^{m \times r}$ are low-rank matrices with rank $r \ll \min(m, n)$. LoRA reduces the number of trainable parameters from $m \times n$ to $r \times (m + n)$, thereby decreasing the cost of fine-tuning. However, despite its efficiency, LoRA’s fine-tuning performance often falls short compared to full fine-tuning [Hu et al., 2022, Liu et al., 2024, Ding et al., 2023].

In this paper, we propose a novel PEFT method, LoRA-Pro, aimed at bridging the gap between LoRA and full fine-tuning. While LoRA employs low-rank approximation by re-parameterizing weight changes as the product of two low-rank matrices, it falls short in approximating the optimization process of full fine-tuning. To measure their discrepancy in the optimization process, we propose a novel concept, "Equivalent Gradient", for LoRA optimization. The equivalent gradient characterizes the gradient of the original matrix after low-rank approximation (despite it not being directly trainable) and is composed of the gradients of matrices $A$ and $B$. Thus, during LoRA fine-tuning, our goal is not only to approximate the matrix with low-rank matrices but also to minimize the difference between the equivalent gradient and the gradient from full fine-tuning during the gradient descent process. This is achieved by selecting appropriate gradients for matrices $A$ and $B$. We formulate this selection as an optimization problem and derive theoretical solutions for it, presenting optimal gradients for updating matrices $A$ and $B$. These solutions ensure that the equivalent gradient closely matches the optimization dynamics of full fine-tuning. By doing so, we enhance the effectiveness of LoRA, bridging the gap between LoRA and full fine-tuning.

Our main contributions are summarized as follows:

  • We identify that LoRA approximates low-rank matrices but neglects to approximate the optimization process of full parameter fine-tuning. This shortcoming is one of the reasons for the performance gap between LoRA and full fine-tuning.

  • We introduce the concept of Equivalent Gradient, which allows us to quantify the discrepancy in the optimization process between LoRA and full fine-tuning. By minimizing this discrepancy, we derive optimal closed-form solutions for updating the LoRA matrices.

  • Extensive experiments on natural language processing tasks validate the effectiveness of our method.

2 Related Work

Parameter-Efficient Fine-Tuning. Given the huge size of foundation models, recent research has focused on developing parameter-efficient fine-tuning methods [Hu et al., 2022, Liu et al., 2024, Ding et al., 2023, Houlsby et al., 2019, Liu et al., 2023, Lester et al., 2021]. These methods aim to reduce the cost of fine-tuning by adjusting only a small portion of the model’s parameters. Generally, these methods fall into two main categories. The first category is adapter tuning [Houlsby et al., 2019, Sung et al., 2022, He et al., 2021, Zhang et al., 2024, Bapna and Firat, 2019, Hu et al., 2022], which involves inserting small neural network modules, called adapters, into specific layers of the model. During fine-tuning, we keep the model frozen and only fine-tune the lightweight adapter modules, significantly reducing the memory footprint for fine-tuning. The second category is prompt tuning [Lester et al., 2021, Zhou et al., 2022, Li and Liang, 2021, Liu et al., 2022]. Prompt tuning adapts the models to specific tasks by adding specially designed prompts or learnable tokens to the input data, rather than directly modifying the internal parameters of foundation models. In this paper, we focus on LoRA [Hu et al., 2022], a prominent method within the realm of adapter tuning.

Low Rank Adaptation. Low-rank adaptation, initially referred to as LoRA [Hu et al., 2022], has evolved into a broad category encompassing parameter-efficient fine-tuning methods based on low-rank approximations [Hu et al., 2022, Liu et al., 2024, Hayou et al., 2024, Kalajdzievski, 2023, Zhang et al., 2023, Kopiczko et al., 2024, Hyeon-Woo et al., 2022, Zhang and Pilanci, 2024, Wang et al., 2024, Zhao et al., 2024]. LoRA [Hu et al., 2022] assumes that the changes in the weights of pre-trained models exhibit a low-rank structure. Consequently, it re-parameterizes these changes as the product of low-rank matrices, thereby reducing the cost associated with fine-tuning.

Several variants of LoRA have been proposed to address different aspects of this approach. For example, DoRA [Liu et al., 2024] improves LoRA [Hu et al., 2022] by incorporating a learnable magnitude vector to re-scale the normalized product of low-rank matrices. Another variant, rsLoRA [Kalajdzievski, 2023], introduces a new scaling factor to stabilize training in high-rank scenarios. LoRA+ [Hayou et al., 2024] improves upon LoRA by applying different learning rates to the two low-rank matrices. Additionally, GaLore [Zhao et al., 2024] employs SVD to project the gradients of full parameter training into a low-rank space, thereby reducing the memory footprint during pre-training and fine-tuning.

3 Method

In this section, we begin by revisiting LoRA [Hu et al., 2022] in Section 3.1. Following this, we conduct a comparison between LoRA and full fine-tuning from an optimization perspective in Section 3.2. Finally, in Section 3.3, we point out that LoRA falls short in approximating full fine-tuning during the optimization process, and we introduce LoRA-Pro as a solution to bridge this performance gap.

3.1 Revisit Low Rank Adaptation

First of all, let’s dive back into Low-Rank Adaptation (LoRA) [Hu et al., 2022]. LoRA’s core idea revolves around recognizing the low-rank structure of the change matrix $\Delta W$ in the standard fine-tuning process. This insight allows LoRA [Hu et al., 2022] to re-parameterize the change matrix into the product of two low-rank matrices,

$$W = W_0 + \Delta W = W_0 + sBA. \tag{1}$$

Here, $W_0 \in \mathbb{R}^{m \times n}$ represents the pre-trained weight matrix, $B \in \mathbb{R}^{m \times r}$ and $A \in \mathbb{R}^{r \times n}$ are the low-rank matrices, and $s$ is a scaling factor. For LoRA [Hu et al., 2022], $s = \frac{\alpha}{r}$, while for rsLoRA [Kalajdzievski, 2023], $s = \frac{\alpha}{\sqrt{r}}$. Here, $\alpha$ is the hyper-parameter and $r \ll \min(m, n)$ denotes the rank. Consequently, LoRA significantly reduces the number of fine-tuning parameters from $m \times n$ to $r \times (m + n)$.
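To make the parameter reduction and the two scaling rules concrete, the following minimal sketch computes both. The helper names and the sizes (m = n = 768, r = 8, α = 16) are illustrative assumptions, not values taken from the paper.

```python
import math

def lora_param_counts(m, n, r):
    """Return (full, lora) trainable-parameter counts for an m x n weight."""
    full = m * n          # full fine-tuning updates every entry of W
    lora = r * (m + n)    # LoRA trains B (m x r) and A (r x n)
    return full, lora

def lora_scale(alpha, r, rank_stabilized=False):
    """Scaling factor s: alpha / r for LoRA, alpha / sqrt(r) for rsLoRA."""
    return alpha / math.sqrt(r) if rank_stabilized else alpha / r

# Hypothetical layer sizes, chosen only for illustration.
m, n, r, alpha = 768, 768, 8, 16
full, lora = lora_param_counts(m, n, r)
print(full, lora, lora / full)
print(lora_scale(alpha, r), lora_scale(alpha, r, rank_stabilized=True))
```

For these assumed sizes, LoRA trains roughly 2% of the parameters that full fine-tuning would update for the same matrix.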

3.2 LoRA vs. Full Fine-tuning

Despite widespread applications across various domains, LoRA’s performance still falls short when compared to full fine-tuning. In this part, we review and compare LoRA and full fine-tuning in the optimization process. In full fine-tuning, we use the differential to analyze the relationship between changes in the loss and changes in the weights:

$$\mathrm{d}L = \left\langle \frac{\partial L}{\partial W}, \mathrm{d}W \right\rangle_F, \tag{2}$$

where $\mathrm{d}L$ and $\mathrm{d}W$ denote the changes in the loss $L$ and the parameter $W$, respectively, $\langle \cdot, \cdot \rangle_F$ is the Frobenius inner product, and $\|\cdot\|_F$ is the Frobenius norm. To minimize the loss function, we typically set $\mathrm{d}W = -\frac{\partial L}{\partial W} \triangleq -g$ (omitting the learning rate for simplicity), which results in $\mathrm{d}L = -\left\|\frac{\partial L}{\partial W}\right\|_F^2 \leq 0$.

In LoRA optimization, given that $W = W_0 + sBA$, we compute the differential using the chain rule:

$$\begin{aligned}
\mathrm{d}L &= \left\langle \frac{\partial L}{\partial W}, \mathrm{d}W \right\rangle_F \\
&= \left\langle \frac{\partial L}{\partial W}, \frac{\partial W}{\partial A}^{T} \mathrm{d}A + \frac{\partial W}{\partial B}^{T} \mathrm{d}B \right\rangle_F \\
&= \left\langle \frac{\partial L}{\partial W} \frac{\partial W}{\partial A}, \mathrm{d}A \right\rangle_F + \left\langle \frac{\partial L}{\partial W} \frac{\partial W}{\partial B}, \mathrm{d}B \right\rangle_F \\
&= \left\langle \frac{\partial L}{\partial A}, \mathrm{d}A \right\rangle_F + \left\langle \frac{\partial L}{\partial B}, \mathrm{d}B \right\rangle_F.
\end{aligned} \tag{3}$$

Similarly, LoRA sets $\mathrm{d}A = -\frac{\partial L}{\partial A} \triangleq -g^A_{lora}$ and $\mathrm{d}B = -\frac{\partial L}{\partial B} \triangleq -g^B_{lora}$, and thus $\mathrm{d}L = -\left\|\frac{\partial L}{\partial A}\right\|_F^2 - \left\|\frac{\partial L}{\partial B}\right\|_F^2 \leq 0$. Moreover, employing the chain rule, we derive:

$$g^A_{lora} = \frac{\partial L}{\partial W} \frac{\partial W}{\partial A} = sB^T g, \qquad g^B_{lora} = \frac{\partial L}{\partial W} \frac{\partial W}{\partial B} = sgA^T. \tag{4}$$
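The chain-rule expressions in Equation (4) can be checked numerically. The sketch below assumes a toy quadratic loss $L(W) = \frac{1}{2}\|W - T\|_F^2$ (so that $g = W - T$) with random matrices; the loss, the target `T`, and all sizes are illustrative assumptions rather than anything used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r, s = 6, 5, 2, 0.5
W0 = rng.normal(size=(m, n))
T = rng.normal(size=(m, n))   # hypothetical target defining the toy loss
B = rng.normal(size=(m, r))
A = rng.normal(size=(r, n))

def loss(B_, A_):
    """Toy loss L = 0.5 * ||W - T||_F^2 with W = W0 + s * B * A."""
    W = W0 + s * B_ @ A_
    return 0.5 * np.sum((W - T) ** 2)

# Analytic gradients from Equation (4), with g = dL/dW = W - T for this loss.
g = (W0 + s * B @ A) - T
gA_lora = s * B.T @ g
gB_lora = s * g @ A.T

def numerical_grad(f, X, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at X."""
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        Xp, Xm = X.copy(), X.copy()
        Xp[idx] += eps
        Xm[idx] -= eps
        G[idx] = (f(Xp) - f(Xm)) / (2 * eps)
    return G

err_A = np.max(np.abs(numerical_grad(lambda A_: loss(B, A_), A) - gA_lora))
err_B = np.max(np.abs(numerical_grad(lambda B_: loss(B_, A), B) - gB_lora))
print(err_A, err_B)  # both should be close to zero
```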

3.3 Low-Rank Adaptation with Equivalent Gradient

Definition 3.1 (Equivalent Gradient) In the context of LoRA optimization, we define the equivalent gradient as

$$\tilde{g} \triangleq \frac{\partial W}{\partial A}^{T} g^A + \frac{\partial W}{\partial B}^{T} g^B = sBg^A + sg^BA, \tag{5}$$

where $s$ is the scaling factor, and $g^A$ and $g^B$ are gradients with respect to $A$ and $B$, respectively.

Equivalent Gradient. From Equation (3), we can see that changes in matrices $A$ and $B$ are inherently linked to changes in matrix $W$ through the chain rule:

$$\mathrm{d}W = \frac{\partial W}{\partial A}^{T} \mathrm{d}A + \frac{\partial W}{\partial B}^{T} \mathrm{d}B = -\left(sBg^A_{lora} + sg^B_{lora}A\right). \tag{6}$$

In comparison to full fine-tuning, this is equivalent to updating $W$ using the gradient $\tilde{g} = sBg^A_{lora} + sg^B_{lora}A$. This critical relationship has been neglected in the LoRA optimization process. Hence, we hypothesize that by carefully adjusting the gradients of matrices $A$ and $B$ in such a way that $\tilde{g}$ under LoRA closely approximates the gradient $g$ from full fine-tuning, we can effectively bridge the gap between LoRA and full fine-tuning.
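The relationship in Equation (6) can be illustrated with a small numerical sketch. It assumes a random matrix as a stand-in for the full fine-tuning gradient $g$ and arbitrary small sizes; it checks that one plain gradient step on $A$ and $B$ moves $sBA$ by $-\gamma\tilde{g}$ up to second-order terms in the step size, and reports how far $\tilde{g}$ is from $g$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r, s, lr = 6, 5, 2, 0.5, 1e-4
g = rng.normal(size=(m, n))   # stand-in for the full fine-tuning gradient
B = rng.normal(size=(m, r))
A = rng.normal(size=(r, n))

# Standard LoRA gradients (Equation (4)) and their equivalent gradient.
gA = s * B.T @ g
gB = s * g @ A.T
g_equiv = s * B @ gA + s * gB @ A

# One plain gradient step on A and B, then the induced change in s * B * A.
A_new = A - lr * gA
B_new = B - lr * gB
dW = s * B_new @ A_new - s * B @ A

# dW equals -lr * g_equiv up to a term of order lr^2.
first_order_err = np.max(np.abs(dW + lr * g_equiv))

# The equivalent gradient has rank at most 2r and generally differs from g.
rank_equiv = np.linalg.matrix_rank(g_equiv)
rel_gap = np.linalg.norm(g_equiv - g) / np.linalg.norm(g)
print(first_order_err, rank_equiv, rel_gap)
```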

Based on this relationship, we define the concept of equivalent gradient in Definition 3.1. The equivalent gradient describes the gradient of the matrix $W$ following low-rank adaptation, despite $W$ not being a trainable parameter. To narrow the performance gap, our goal is to carefully select suitable $g^A$ and $g^B$ to minimize the distance between the equivalent gradient $\tilde{g}$ and the gradient under full fine-tuning $g$. Hence, our objective is:

$$\begin{aligned}
\min_{g^A, g^B} \quad & \|\tilde{g} - g\|_F^2 \\
\text{s.t.} \quad & \tilde{g} = sBg^A + sg^BA, \\
& \mathrm{d}L \leq 0.
\end{aligned} \tag{7}$$

Theorem 4.1 Assume matrices $B \in \mathbb{R}^{m \times r}$, $A \in \mathbb{R}^{r \times n}$ are both full rank. For the objective $\min_{g^A, g^B} \|\tilde{g} - g\|_F^2$, the solutions are given by:

$$g^A = \frac{1}{s}(B^TB)^{-1}B^Tg + XA = \frac{1}{s^2}(B^TB)^{-1}g^A_{lora} + XA, \tag{8}$$

$$g^B = \frac{1}{s}\left[I - B(B^TB)^{-1}B^T\right]gA^T(AA^T)^{-1} - BX = \frac{1}{s^2}\left[I - B(B^TB)^{-1}B^T\right]g^B_{lora}(AA^T)^{-1} - BX. \tag{9}$$

Here, $X \in \mathbb{R}^{r \times r}$ represents an arbitrary matrix.
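The closed-form expressions in Equations (8) and (9) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `lora_pro_grads`, the random stand-in for $g$, and the matrix sizes are all hypothetical. The script checks that the resulting equivalent gradient does not depend on $X$ and that its distance to $g$ is no larger than that of standard LoRA.

```python
import numpy as np

def lora_pro_grads(A, B, gA_lora, gB_lora, s, X=None):
    """Adjusted gradients from Equations (8) and (9); X defaults to zero."""
    m, r = B.shape
    X = np.zeros((r, r)) if X is None else X
    BtB_inv = np.linalg.inv(B.T @ B)   # needs B with full column rank
    AAt_inv = np.linalg.inv(A @ A.T)   # needs A with full row rank
    P_B = B @ BtB_inv @ B.T            # projector onto the column space of B
    gA = BtB_inv @ gA_lora / s**2 + X @ A
    gB = (np.eye(m) - P_B) @ gB_lora @ AAt_inv / s**2 - B @ X
    return gA, gB

rng = np.random.default_rng(2)
m, n, r, s = 8, 7, 3, 2.0
g = rng.normal(size=(m, n))            # stand-in for the full fine-tuning gradient
B = rng.normal(size=(m, r))
A = rng.normal(size=(r, n))
gA_lora, gB_lora = s * B.T @ g, s * g @ A.T   # Equation (4)

def equivalent_gradient(gA, gB):
    """Equation (5): s * B * g^A + s * g^B * A."""
    return s * B @ gA + s * gB @ A

g_pro_0 = equivalent_gradient(*lora_pro_grads(A, B, gA_lora, gB_lora, s))
X = rng.normal(size=(r, r))
g_pro_X = equivalent_gradient(*lora_pro_grads(A, B, gA_lora, gB_lora, s, X))
g_lora = equivalent_gradient(gA_lora, gB_lora)

err_pro = np.linalg.norm(g_pro_0 - g)
err_lora = np.linalg.norm(g_lora - g)
print(np.allclose(g_pro_0, g_pro_X), err_pro, err_lora)
```

Substituting Equations (8) and (9) into Equation (5) gives $\tilde{g} = P_B g + (I - P_B) g P_A$ with $P_B = B(B^TB)^{-1}B^T$ and $P_A = A^T(AA^T)^{-1}A$, which is why the $X$ terms cancel in the check above.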

Closed-form Solution. Fortunately, Equation (7) admits a closed-form solution. According to Theorem 4.1, we obtain the optimal gradients for matrices $A$ and $B$, ensuring that the equivalent gradient achieves the best approximation to the full fine-tuning gradient. Moreover, we observe that $g^A$ and $g^B$ can be expressed in terms of $g^A_{lora}$ and $g^B_{lora}$, respectively, which means that we do not need to explicitly compute the full fine-tuning gradient $g$. Therefore, our approach involves back-propagating in standard LoRA and adjusting the gradients of matrices $A$ and $B$ using the closed-form solution outlined in Theorem 4.1.

Theorem 4.2 When updating matrices $A$ and $B$ using the closed-form solution from Theorem 4.1, we proceed as follows:

$$A \leftarrow A - \gamma g^A, \tag{10}$$

$$B \leftarrow B - \gamma g^B, \tag{11}$$

where $\gamma \geq 0$ denotes the learning rate. Our method ensures a decrease in the loss, akin to the standard gradient descent algorithm, expressed by:

$$\mathrm{d}L = -\gamma \left\{ \left\langle g^A_{lora}, \frac{1}{s^2}(B^TB)^{-1}g^A_{lora} \right\rangle_F + \left\langle g^B_{lora}, \frac{1}{s^2}\left[I - B(B^TB)^{-1}B^T\right]g^B_{lora}(AA^T)^{-1} \right\rangle_F \right\} \leq 0. \tag{12}$$

Although Theorem 4.1 provides a closed-form solution to the optimization problem $\min_{g^A, g^B} \|\tilde{g} - g\|_F^2$, this does not necessarily mean that updating matrices $A$ and $B$ with this solution will decrease the loss. To address this, we have Theorem 4.2, which guarantees a decrease in the loss during the optimization process. This theorem indicates that the change in loss, $\mathrm{d}L$, can be expressed as a negative scalar multiplied by the sum of two positive definite quadratic forms. This relationship ensures that $\mathrm{d}L \leq 0$ during the update process, thus consistently driving the optimization process towards a lower loss.
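The sign of the two quadratic forms in Equation (12) can be checked numerically. The sketch below uses random matrices as stand-ins (an assumption for illustration only) and also verifies that the terms involving an arbitrary $X$ contribute nothing to the first-order change in loss, since $\langle g^A_{lora}, XA \rangle_F = \langle g^B_{lora}, BX \rangle_F$.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, r, s, lr = 7, 6, 3, 2.0, 0.1
g = rng.normal(size=(m, n))   # stand-in for the full fine-tuning gradient
B = rng.normal(size=(m, r))
A = rng.normal(size=(r, n))
gA_lora, gB_lora = s * B.T @ g, s * g @ A.T   # Equation (4)

BtB_inv = np.linalg.inv(B.T @ B)
AAt_inv = np.linalg.inv(A @ A.T)
P_B = B @ BtB_inv @ B.T

# The two Frobenius inner products inside the braces of Equation (12).
term_A = np.sum(gA_lora * (BtB_inv @ gA_lora)) / s**2
term_B = np.sum(gB_lora * ((np.eye(m) - P_B) @ gB_lora @ AAt_inv)) / s**2
dL = -lr * (term_A + term_B)

# Contribution of an arbitrary X to dL: <gA_lora, X A> - <gB_lora, B X>.
X = rng.normal(size=(r, r))
x_contrib = np.sum(gA_lora * (X @ A)) - np.sum(gB_lora * (B @ X))
print(term_A, term_B, dL, x_contrib)
```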

Theorem 4.3 Consider the optimization problem

$$\min_X \|g^A - g^A_{lora}\|_F^2 + \|g^B - g^B_{lora}\|_F^2, \tag{13}$$

where $g^A$ and $g^B$ are the optimal solutions as stated in Theorem 4.1. The optimal $X$ can be determined by solving the Sylvester equation

$$B^TBX + XAA^T = -\frac{1}{s^2}(B^TB)^{-1}g^A_{lora}A^T, \tag{14}$$

which has a unique solution $X$ provided that $B^TB$ and $-AA^T$ do not have any shared eigenvalues.

Selection of X. Although the equivalent gradient itself does not depend on the matrix $X$, the choice of $X$ plays a significant role in the updates of matrices $A$ and $B$. We select $X$ such that $g^{A}$ and $g^{B}$ remain close to $g^{A}_{lora}$ and $g^{B}_{lora}$, respectively. Consequently, we minimize the Frobenius norm of their differences, as demonstrated in Equation (41). In practical terms, $B^{T}B$ and $-AA^{T}$ do not share common eigenvalues. Therefore, according to Theorem 3.3, we can determine a unique optimal $X$ for updating matrices $A$ and $B$.
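The selection of $X$ can be illustrated numerically. The following is a minimal NumPy sketch, not the authors' implementation: it solves the Sylvester equation by Kronecker vectorization on small random matrices and checks that the resulting $X$ does minimize the objective. The helper names (`solve_sylvester_kron`, `adjusted_grads`, `objective`), the matrix sizes, and the random stand-in for the full gradient $g$ are all assumptions made for illustration.

```python
import numpy as np

def solve_sylvester_kron(P, Q, C):
    """Solve P X + X Q = C for a square X by Kronecker vectorization."""
    r = P.shape[0]
    eye = np.eye(r)
    # With row-major flattening: vec(P X) = kron(P, I) vec(X) and
    # vec(X Q) = kron(I, Q.T) vec(X).
    K = np.kron(P, eye) + np.kron(eye, Q.T)
    return np.linalg.solve(K, C.reshape(-1)).reshape(r, r)

def adjusted_grads(A, B, g_lora_A, g_lora_B, s, X):
    """Closed-form g^A and g^B for a given X (hypothetical helper)."""
    BtB_inv = np.linalg.inv(B.T @ B)
    AAt_inv = np.linalg.inv(A @ A.T)
    residual_proj = np.eye(B.shape[0]) - B @ BtB_inv @ B.T
    g_A = BtB_inv @ g_lora_A / s**2 + X @ A
    g_B = residual_proj @ g_lora_B @ AAt_inv / s**2 - B @ X
    return g_A, g_B

def objective(A, B, g_lora_A, g_lora_B, s, X):
    """Squared Frobenius distance to the plain LoRA gradients."""
    g_A, g_B = adjusted_grads(A, B, g_lora_A, g_lora_B, s, X)
    return np.sum((g_A - g_lora_A) ** 2) + np.sum((g_B - g_lora_B) ** 2)

# Toy problem: sizes, scaling factor, and gradient are illustrative only.
rng = np.random.default_rng(0)
m, n, r, s = 8, 6, 3, 2.0
B = rng.standard_normal((m, r))
A = rng.standard_normal((r, n))
g = rng.standard_normal((m, n))      # stand-in for the full gradient dL/dW
g_lora_A = s * B.T @ g
g_lora_B = s * g @ A.T

# Full-rank A and B make B^T B and A A^T positive definite, so B^T B and
# -A A^T cannot share an eigenvalue and the linear system below is solvable.
P = B.T @ B
Q = A @ A.T
C = -np.linalg.inv(P) @ g_lora_A @ A.T / s**2
X_star = solve_sylvester_kron(P, Q, C)

sylvester_residual = np.linalg.norm(P @ X_star + X_star @ Q - C)
f_star = objective(A, B, g_lora_A, g_lora_B, s, X_star)
f_perturbed = [
    objective(A, B, g_lora_A, g_lora_B, s,
              X_star + 0.1 * rng.standard_normal((r, r)))
    for _ in range(5)
]
```

The Kronecker system has size $r^{2}\times r^{2}$, which is only reasonable for small ranks; a dedicated solver such as `scipy.linalg.solve_sylvester(P, Q, C)` would be the more usual choice in practice.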

5 Experimental Results

In this section, we evaluate our LoRA-Pro method across various natural language understanding datasets. To provide a comprehensive comparison, we include several baseline methods: 1) full fine-tuning and the standard LoRA [Hu et al., 2022]; 2) LoRA variants maintaining the original structure, such as rsLoRA [Kalajdzievski, 2023], LoRA+ [Hayou et al., 2024], and PiSSA [Meng et al., 2024]; and 3) LoRA variants with modified structures, including DoRA [Liu et al., 2024] and AdaLoRA [Zhang et al., 2023].

The results are shown in Table 1. We fine-tune the T5-base model [Raffel et al., 2020] with the baseline methods on a subset of GLUE datasets. From Table 1, we observe that LoRA-Pro achieves the highest scores on 3 out of 5 datasets and the highest average score across all 5 datasets. Moreover, averaged over the 5 datasets, LoRA-Pro surpasses standard LoRA [Hu et al., 2022] by a margin of 6.72. These results validate the effectiveness of our method.

Table 1: Results on fine-tuning T5-base with Full Fine-tuning and LoRA variants on a subset of GLUE datasets.
Method MNLI SST2 CoLA QNLI MRPC Avg.
Full FT 86.33±0.00 94.75±0.21 80.70±0.24 93.19±0.22 84.56±0.73 87.91
LoRA 85.30±0.04 94.04±0.11 69.35±0.05 92.96±0.09 68.38±0.01 82.08
PiSSA 85.75±0.07 94.07±0.06 74.27±0.39 93.15±0.14 76.31±0.51 84.71
rsLoRA 85.73±0.10 94.19±0.23 72.32±1.12 93.12±0.09 52.86±2.27 79.64
LoRA+ 85.81±0.09 93.85±0.24 77.53±0.20 93.14±0.03 74.43±1.39 84.95
DoRA 85.67±0.09 94.04±0.53 72.04±0.94 93.04±0.06 68.08±0.51 82.57
AdaLoRA 85.45±0.11 93.69±0.20 69.16±0.24 91.66±0.05 68.14±0.28 81.62
LoRA-GA 85.70±0.09 94.11±0.18 80.57±0.20 93.18±0.06 85.29±0.24 87.77
LoRA-Pro 86.92±0.08 94.46±0.24 82.25±1.01 92.89±0.12 87.50±0.65 88.80

6 Conclusion

In this paper, we introduce LoRA-Pro, a novel approach designed to bridge the performance gap between LoRA and full fine-tuning. To this end, we introduce the concept of the Equivalent Gradient, which allows us to quantify the difference in the optimization process between LoRA and full fine-tuning. By minimizing this discrepancy, we derive the optimal closed-form update solutions for LoRA. Moreover, we prove that these solutions guarantee a decrease in the loss during optimization. The solutions not only apply a low-rank approximation to the fine-tuning matrix but also maintain consistency with the optimization of full fine-tuning, enabling more effective fine-tuning. Finally, we validate the effectiveness of our method through extensive experiments on natural language processing tasks.

References

  • Achiam et al. [2023] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • Aghajanyan et al. [2021] A. Aghajanyan, S. Gupta, and L. Zettlemoyer. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. In ACL-IJCNLP, 2021.
  • Bapna and Firat [2019] A. Bapna and O. Firat. Simple, scalable adaptation for neural machine translation. In EMNLP-IJCNLP, 2019.
  • Brown et al. [2020] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. In NeurIPS, 2020.
  • Ding et al. [2023] N. Ding, Y. Qin, G. Yang, F. Wei, Z. Yang, Y. Su, S. Hu, Y. Chen, C.-M. Chan, W. Chen, et al. Parameter-efficient fine-tuning of large-scale pre-trained language models. Nature Machine Intelligence, 5(3):220–235, 2023.
  • Hayou et al. [2024] S. Hayou, N. Ghosh, and B. Yu. Lora+: Efficient low rank adaptation of large models. arXiv preprint arXiv:2402.12354, 2024.
  • He et al. [2021] R. He, L. Liu, H. Ye, Q. Tan, B. Ding, L. Cheng, J. Low, L. Bing, and L. Si. On the effectiveness of adapter-based tuning for pretrained language model adaptation. In ACL-IJCNLP, 2021.
  • Hoffmann et al. [2022] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, et al. Training compute-optimal large language models. In NeurIPS, 2022.
  • Houlsby et al. [2019] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly. Parameter-efficient transfer learning for nlp. In ICML, 2019.
  • Hu et al. [2022] E. J. Hu, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al. Lora: Low-rank adaptation of large language models. In ICLR, 2022.
  • Hyeon-Woo et al. [2022] N. Hyeon-Woo, M. Ye-Bin, and T.-H. Oh. Fedpara: Low-rank hadamard product for communication-efficient federated learning. In ICLR, 2022.
  • Kalajdzievski [2023] D. Kalajdzievski. A rank stabilization scaling factor for fine-tuning with lora. arXiv preprint arXiv:2312.03732, 2023.
  • Kaplan et al. [2020] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
  • Kirillov et al. [2023] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al. Segment anything. In ICCV, 2023.
  • Kopiczko et al. [2024] D. J. Kopiczko, T. Blankevoort, and Y. M. Asano. Vera: Vector-based random matrix adaptation. In ICLR, 2024.
  • Lester et al. [2021] B. Lester, R. Al-Rfou, and N. Constant. The power of scale for parameter-efficient prompt tuning. In EMNLP, 2021.
  • Li et al. [2018] C. Li, H. Farkhoor, R. Liu, and J. Yosinski. Measuring the intrinsic dimension of objective landscapes. In ICLR, 2018.
  • Li and Liang [2021] X. L. Li and P. Liang. Prefix-tuning: Optimizing continuous prompts for generation. In ACL-IJCNLP, 2021.
  • Liu et al. [2023] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1–35, 2023.
  • Liu et al. [2024] S.-y. Liu, C.-Y. Wang, H. Yin, P. Molchanov, Y.-C. F. Wang, K.-T. Cheng, and M.-H. Chen. Dora: Weight-decomposed low-rank adaptation. In ICML, 2024.
  • Liu et al. [2022] X. Liu, K. Ji, Y. Fu, W. Tam, Z. Du, Z. Yang, and J. Tang. P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks. In ACL, 2022.
  • Loshchilov and Hutter [2019] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In ICLR, 2019.
  • Meng et al. [2024] F. Meng, Z. Wang, and M. Zhang. Pissa: Principal singular values and singular vectors adaptation of large language models. arXiv preprint arXiv:2404.02948, 2024.
  • Radford et al. [2021] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
  • Raffel et al. [2020] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020.
  • Rombach et al. [2022] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
  • Sung et al. [2022] Y.-L. Sung, J. Cho, and M. Bansal. Vl-adapter: Parameter-efficient transfer learning for vision-and-language tasks. In CVPR, 2022.
  • Sutskever et al. [2013] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In ICML, 2013.
  • Wang et al. [2024] S. Wang, L. Yu, and J. Li. Lora-ga: Low-rank adaptation with gradient approximation. arXiv preprint arXiv:2407.05000, 2024.
  • Zhang and Pilanci [2024] F. Zhang and M. Pilanci. Riemannian preconditioned lora for fine-tuning foundation models. In ICML, 2024.
  • Zhang et al. [2023] Q. Zhang, M. Chen, A. Bukharin, P. He, Y. Cheng, W. Chen, and T. Zhao. Adaptive budget allocation for parameter-efficient fine-tuning. In ICLR, 2023.
  • Zhang et al. [2024] R. Zhang, J. Han, C. Liu, A. Zhou, P. Lu, Y. Qiao, H. Li, and P. Gao. Llama-adapter: Efficient fine-tuning of large language models with zero-initialized attention. In ICLR, 2024.
  • Zhao et al. [2024] J. Zhao, Z. Zhang, B. Chen, Z. Wang, A. Anandkumar, and Y. Tian. Galore: Memory-efficient llm training by gradient low-rank projection. In ICML, 2024.
  • Zhou et al. [2022] K. Zhou, J. Yang, C. C. Loy, and Z. Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348, 2022.


————Appendix————

The appendix is structured as follows:

  • Appendix A lists the notation used in our paper.

  • Appendix B contains the proofs of the theorems in the main manuscript.

  • Appendix C details the optimization algorithm of the proposed methods.

Appendix A Notations

In Table 2, we detail the notations utilized in our paper.

Table 2: Description of notations used in the paper.
Notation Description
$s$ scaling factor in LoRA
$B\in\mathbb{R}^{m\times r}$, $A\in\mathbb{R}^{r\times n}$ low-rank matrices in LoRA
$g=\frac{\partial L}{\partial W}\in\mathbb{R}^{m\times n}$ gradient of full-rank fine-tuning
$g^{A}_{lora}=\frac{\partial L}{\partial A}=sB^{T}g\in\mathbb{R}^{r\times n}$ gradient of matrix $A$ in LoRA
$g^{B}_{lora}=\frac{\partial L}{\partial B}=sgA^{T}\in\mathbb{R}^{m\times r}$ gradient of matrix $B$ in LoRA
$\mathrm{d}L$ differential of the loss function
$\mathrm{d}A$ differential of the matrix $A$
$\mathrm{d}B$ differential of the matrix $B$
$\|\cdot\|_{F}$ Frobenius norm
$\langle\cdot,\cdot\rangle_{F}$ Frobenius inner product
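The two LoRA gradient identities in Table 2 follow from the chain rule and can be checked by finite differences. The sketch below is illustrative only: it assumes the usual re-parameterization $W = W_0 + sBA$ (consistent with the formulas above) and uses a made-up quadratic loss; `W0`, `T`, `loss`, and `numerical_grad` are hypothetical names that do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r, s = 4, 3, 2, 0.5             # illustrative sizes and scaling factor
W0 = rng.standard_normal((m, n))      # frozen pre-trained weight (assumed)
T = rng.standard_normal((m, n))       # target of the toy loss (assumed)
B = rng.standard_normal((m, r))
A = rng.standard_normal((r, n))

def loss(A_mat, B_mat):
    """Toy loss L(W) = 0.5 * ||W - T||_F^2 with W = W0 + s * B @ A."""
    W = W0 + s * B_mat @ A_mat
    return 0.5 * np.sum((W - T) ** 2)

# Analytic gradients from Table 2.
g = (W0 + s * B @ A) - T              # dL/dW for the toy loss
g_lora_A = s * B.T @ g                # shape (r, n)
g_lora_B = s * g @ A.T                # shape (m, r)

def numerical_grad(f, M, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at M."""
    G = np.zeros_like(M)
    for idx in np.ndindex(*M.shape):
        M_plus = M.copy()
        M_plus[idx] += eps
        M_minus = M.copy()
        M_minus[idx] -= eps
        G[idx] = (f(M_plus) - f(M_minus)) / (2 * eps)
    return G

num_A = numerical_grad(lambda M: loss(M, B), A)
num_B = numerical_grad(lambda M: loss(A, M), B)
```

Both `num_A` and `num_B` should agree with `g_lora_A` and `g_lora_B` up to finite-difference error.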

Appendix B Proof of Theoretical Results

B.1 Proof of Theorem 3.3

Theorem B.1 Assume matrices $B\in\mathbb{R}^{m\times r}$ and $A\in\mathbb{R}^{r\times n}$ are both full rank. For the objective $\min_{g^{A},g^{B}}\|\tilde{g}-g\|^{2}_{F}$, the solutions are given by:
$$g^{A}=\frac{1}{s}(B^{T}B)^{-1}B^{T}g+XA=\frac{1}{s^{2}}(B^{T}B)^{-1}g^{A}_{lora}+XA, \tag{15}$$
$$g^{B}=\frac{1}{s}[I-B(B^{T}B)^{-1}B^{T}]gA^{T}(AA^{T})^{-1}-BX=\frac{1}{s^{2}}[I-B(B^{T}B)^{-1}B^{T}]g^{B}_{lora}(AA^{T})^{-1}-BX. \tag{16}$$
Here, $X\in\mathbb{R}^{r\times r}$ represents an arbitrary matrix.
Proof For simplicity, we denote $L=\|sBg^{A}+sg^{B}A-g\|_{F}^{2}$. To solve the optimization problem, the following first-order conditions with respect to $g^{A}$ and $g^{B}$ must hold:
$$\frac{\partial L}{\partial g^{A}}=2sB^{T}(sBg^{A}+sg^{B}A-g)=0, \tag{17}$$
$$\frac{\partial L}{\partial g^{B}}=2(sBg^{A}+sg^{B}A-g)sA^{T}=0. \tag{18}$$
Given that matrices $A$ and $B$ are full-rank, $AA^{T}$ and $B^{T}B$ are invertible.
From Equation (18), we derive:
$$g^{B}=\frac{1}{s}gA^{T}(AA^{T})^{-1}-Bg^{A}A^{T}(AA^{T})^{-1}. \tag{19}$$
Substituting this into Equation (17), we obtain the following linear equation:
$$g^{A}[I-A^{T}(AA^{T})^{-1}A]=\frac{1}{s}(B^{T}B)^{-1}B^{T}g[I-A^{T}(AA^{T})^{-1}A]. \tag{20}$$
Here, we notice that the matrix $P=I-A^{T}(AA^{T})^{-1}A$ is a projection matrix with rank $n-r$.
The solution to the linear equation (20) is:
$$g^{A}=\frac{1}{s}(B^{T}B)^{-1}B^{T}g+XA, \tag{21}$$
where $X\in\mathbb{R}^{r\times r}$ represents an arbitrary matrix; the term $XA$ is free because $A[I-A^{T}(AA^{T})^{-1}A]=0$. Substituting the solution (21) into Equation (19), we derive:
$$g^{B}=\frac{1}{s}[I-B(B^{T}B)^{-1}B^{T}]gA^{T}(AA^{T})^{-1}-BX. \tag{22}$$
While we have obtained closed-form solutions for $g^{A}$ and $g^{B}$, these solutions explicitly depend on the gradient of the matrix $W$, i.e., $g$, which is undesirable since $g$ is unknown during LoRA optimization.
Fortunately, the solutions can be expressed in terms of the gradients of standard LoRA, which are:
$$g^{A}_{lora}=sB^{T}g,\quad g^{B}_{lora}=sgA^{T}. \tag{23}$$
Therefore, the solutions to the optimization problem can be written as:
$$g^{A}=\frac{1}{s^{2}}(B^{T}B)^{-1}g^{A}_{lora}+XA, \tag{24}$$
$$g^{B}=\frac{1}{s^{2}}[I-B(B^{T}B)^{-1}B^{T}]g^{B}_{lora}(AA^{T})^{-1}-BX. \tag{25}$$
In our method, we perform the standard forward and backward passes of LoRA, then adjust the gradients of $A$ and $B$ using Solutions (24) and (25), and subsequently update them.
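The closed-form solutions can be sanity-checked on a toy problem. The sketch below is an illustration under stated assumptions rather than the authors' code: it builds the adjusted gradients from the LoRA gradients alone, forms the equivalent gradient $\tilde{g}=sBg^{A}+sg^{B}A$, and checks that it does not depend on $X$, that it satisfies the first-order conditions, and that it is at least as close to $g$ as the equivalent gradient of plain LoRA. The function names, sizes, and the random stand-in for $g$ are hypothetical.

```python
import numpy as np

def lora_pro_grads(A, B, g_lora_A, g_lora_B, s, X):
    """Adjusted gradients g^A, g^B computed from LoRA gradients only."""
    BtB_inv = np.linalg.inv(B.T @ B)
    AAt_inv = np.linalg.inv(A @ A.T)
    residual_proj = np.eye(B.shape[0]) - B @ BtB_inv @ B.T
    g_A = BtB_inv @ g_lora_A / s**2 + X @ A
    g_B = residual_proj @ g_lora_B @ AAt_inv / s**2 - B @ X
    return g_A, g_B

def equivalent_gradient(A, B, g_A, g_B, s):
    """Equivalent gradient on W induced by updates g_A and g_B."""
    return s * B @ g_A + s * g_B @ A

rng = np.random.default_rng(0)
m, n, r, s = 6, 5, 2, 2.0            # illustrative sizes and scaling factor
B = rng.standard_normal((m, r))
A = rng.standard_normal((r, n))
g = rng.standard_normal((m, n))      # stand-in for dL/dW, used only for checking
g_lora_A = s * B.T @ g
g_lora_B = s * g @ A.T

# Two different choices of X give the same equivalent gradient.
X_zero = np.zeros((r, r))
X_rand = rng.standard_normal((r, r))
g_tilde_0 = equivalent_gradient(
    A, B, *lora_pro_grads(A, B, g_lora_A, g_lora_B, s, X_zero), s)
g_tilde_1 = equivalent_gradient(
    A, B, *lora_pro_grads(A, B, g_lora_A, g_lora_B, s, X_rand), s)

# Equivalent gradient of plain LoRA, which uses g_lora_A and g_lora_B directly.
g_tilde_lora = equivalent_gradient(A, B, g_lora_A, g_lora_B, s)

err_pro = np.linalg.norm(g_tilde_0 - g)
err_lora = np.linalg.norm(g_tilde_lora - g)
```

Here `g` enters only through `g_lora_A` and `g_lora_B` when computing the update; it is kept around solely to measure the approximation error.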

B.2 Proof of Theorem 3.3

Theorem B.2 When updating matrices $A$ and $B$ using the closed-form solution from Theorem 3.3, we proceed as follows:
$$A\leftarrow A-\gamma g^{A}, \tag{26}$$
$$B\leftarrow B-\gamma g^{B}, \tag{27}$$
where $\gamma\geq 0$ denotes the learning rate. Our method ensures a decrease in the loss, akin to the standard gradient descent algorithm, expressed by:
$$\mathrm{d}L=-\gamma\left\{\left\langle g^{A}_{lora},\frac{1}{s^{2}}(B^{T}B)^{-1}g^{A}_{lora}\right\rangle_{F}+\left\langle g^{B}_{lora},\frac{1}{s^{2}}[I-B(B^{T}B)^{-1}B^{T}]g^{B}_{lora}(AA^{T})^{-1}\right\rangle_{F}\right\}\leq 0. \tag{28}$$
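Before the proof, the claim in Equation (28) can be checked numerically on random matrices. This is a hedged sketch, not part of the paper: it evaluates the first-order change $\mathrm{d}L=\langle g^{A}_{lora},\mathrm{d}A\rangle_{F}+\langle g^{B}_{lora},\mathrm{d}B\rangle_{F}$ with $\mathrm{d}A=-\gamma g^{A}$ and $\mathrm{d}B=-\gamma g^{B}$, and confirms that the terms involving $X$ cancel and that the result is non-positive. The helper `frob`, the sizes, and the learning rate are assumptions for illustration.

```python
import numpy as np

def frob(U, V):
    """Frobenius inner product <U, V>_F."""
    return float(np.sum(U * V))

rng = np.random.default_rng(1)
m, n, r, s, gamma = 7, 5, 2, 1.5, 1e-2   # illustrative values
B = rng.standard_normal((m, r))
A = rng.standard_normal((r, n))
g = rng.standard_normal((m, n))          # stand-in for dL/dW
g_lora_A = s * B.T @ g
g_lora_B = s * g @ A.T

BtB_inv = np.linalg.inv(B.T @ B)
AAt_inv = np.linalg.inv(A @ A.T)
residual_proj = np.eye(m) - B @ BtB_inv @ B.T

# X-independent parts of the closed-form solutions.
base_A = BtB_inv @ g_lora_A / s**2
base_B = residual_proj @ g_lora_B @ AAt_inv / s**2

# Full adjusted gradients for an arbitrary X.
X = rng.standard_normal((r, r))
g_A = base_A + X @ A
g_B = base_B - B @ X

# Contribution of X to dL; the theorem's proof shows this is zero.
cross = frob(g_lora_A, X @ A) - frob(g_lora_B, B @ X)

# First-order loss change with and without the X terms.
dL_full = -gamma * (frob(g_lora_A, g_A) + frob(g_lora_B, g_B))
dL_closed = -gamma * (frob(g_lora_A, base_A) + frob(g_lora_B, base_B))
```

Up to rounding, `cross` vanishes, `dL_full` equals `dL_closed`, and both are non-positive for any choice of `X`.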
Proof (Part 1) In summary, the proof of Theorem 3.3 is divided into two distinct parts. To begin with, we demonstrate that $\mathrm{d}L$ can be expressed in the following form:
$$\mathrm{d}L=-\gamma\left\{\left\langle g^{A}_{lora},\frac{1}{s^{2}}(B^{T}B)^{-1}g^{A}_{lora}\right\rangle_{F}+\left\langle g^{B}_{lora},\frac{1}{s^{2}}[I-B(B^{T}B)^{-1}B^{T}]g^{B}_{lora}(AA^{T})^{-1}\right\rangle_{F}\right\}. \tag{29}$$
In the second part, we prove that this expression for $\mathrm{d}L$ is always less than or equal to zero: $\mathrm{d}L\leq 0$. In this part, we first prove Equation (29). During the optimization process, the differential change in the loss function, $\mathrm{d}L$, can be expressed in terms of the differentials $\mathrm{d}A$ and $\mathrm{d}B$ as follows:
$$\mathrm{d}L=\left\langle\frac{\partial L}{\partial A},\mathrm{d}A\right\rangle_{F}+\left\langle\frac{\partial L}{\partial B},\mathrm{d}B\right\rangle_{F}. \tag{30}$$
From Equations (26) and (27), we can derive that:
$$\mathrm{d}A=-\gamma g^{A},\quad\mathrm{d}B=-\gamma g^{B}. \tag{31}$$
Given that $\frac{\partial L}{\partial A}=g^{A}_{lora}$ and $\frac{\partial L}{\partial B}=g^{B}_{lora}$, it follows that:
$$\begin{aligned}\mathrm{d}L&=-\gamma\left(\langle g^{A}_{lora},g^{A}\rangle_{F}+\langle g^{B}_{lora},g^{B}\rangle_{F}\right)\\&=-\gamma\Big(\left\langle g^{A}_{lora},\frac{1}{s^{2}}(B^{T}B)^{-1}g^{A}_{lora}\right\rangle_{F}+\left\langle g^{B}_{lora},\frac{1}{s^{2}}[I-B(B^{T}B)^{-1}B^{T}]g^{B}_{lora}(AA^{T})^{-1}\right\rangle_{F}\\&\qquad+\langle g^{A}_{lora},XA\rangle_{F}-\langle g^{B}_{lora},BX\rangle_{F}\Big).\end{aligned} \tag{32}$$
And we have the following equation: gloraA,XAFgloraB,BXFsubscriptsubscriptsuperscript𝑔𝐴𝑙𝑜𝑟𝑎𝑋𝐴𝐹subscriptsubscriptsuperscript𝑔𝐵𝑙𝑜𝑟𝑎𝐵𝑋𝐹\displaystyle\langle g^{A}_{lora},XA\rangle_{F}-\langle g^{B}_{lora},BX\rangle% _{F}⟨ italic_g start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l italic_o italic_r italic_a end_POSTSUBSCRIPT , italic_X italic_A ⟩ start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT - ⟨ italic_g start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l italic_o italic_r italic_a end_POSTSUBSCRIPT , italic_B italic_X ⟩ start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT (33) =\displaystyle== gloraAAT,XFBTgloraB,XFsubscriptsubscriptsuperscript𝑔𝐴𝑙𝑜𝑟𝑎superscript𝐴𝑇𝑋𝐹subscriptsuperscript𝐵𝑇subscriptsuperscript𝑔𝐵𝑙𝑜𝑟𝑎𝑋𝐹\displaystyle\langle g^{A}_{lora}A^{T},X\rangle_{F}-\langle B^{T}g^{B}_{lora},% X\rangle_{F}⟨ italic_g start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l italic_o italic_r italic_a end_POSTSUBSCRIPT italic_A start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT , italic_X ⟩ start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT - ⟨ italic_B start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_g start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l italic_o italic_r italic_a end_POSTSUBSCRIPT , italic_X ⟩ start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT =\displaystyle== gloraAATBTgloraB,XFsubscriptsubscriptsuperscript𝑔𝐴𝑙𝑜𝑟𝑎superscript𝐴𝑇superscript𝐵𝑇subscriptsuperscript𝑔𝐵𝑙𝑜𝑟𝑎𝑋𝐹\displaystyle\langle g^{A}_{lora}A^{T}-B^{T}g^{B}_{lora},X\rangle_{F}⟨ italic_g start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l italic_o italic_r italic_a end_POSTSUBSCRIPT italic_A start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT - italic_B start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_g start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l italic_o italic_r italic_a end_POSTSUBSCRIPT , italic_X ⟩ start_POSTSUBSCRIPT 
italic_F end_POSTSUBSCRIPT =\displaystyle== (sBTg)ATBT(sgAT),XFsubscript𝑠superscript𝐵𝑇𝑔superscript𝐴𝑇superscript𝐵𝑇𝑠𝑔superscript𝐴𝑇𝑋𝐹\displaystyle\langle(sB^{T}g)A^{T}-B^{T}(sgA^{T}),X\rangle_{F}⟨ ( italic_s italic_B start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_g ) italic_A start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT - italic_B start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ( italic_s italic_g italic_A start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) , italic_X ⟩ start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT =\displaystyle== 0.0\displaystyle 0.0 . Therefore, we have: dL=γ{gloraA,1s2(BTB)1gloraAF+gloraB,1s2[IB(BTB)1BT]gloraB(AAT)1F}.d𝐿𝛾subscriptsubscriptsuperscript𝑔𝐴𝑙𝑜𝑟𝑎1superscript𝑠2superscriptsuperscript𝐵𝑇𝐵1subscriptsuperscript𝑔𝐴𝑙𝑜𝑟𝑎𝐹subscriptsubscriptsuperscript𝑔𝐵𝑙𝑜𝑟𝑎1superscript𝑠2delimited-[]𝐼𝐵superscriptsuperscript𝐵𝑇𝐵1superscript𝐵𝑇subscriptsuperscript𝑔𝐵𝑙𝑜𝑟𝑎superscript𝐴superscript𝐴𝑇1𝐹\mathrm{d}L=-\gamma\{\langle g^{A}_{lora},\frac{1}{s^{2}}(B^{T}B)^{-1}g^{A}_{% lora}\rangle_{F}+\langle g^{B}_{lora},\frac{1}{s^{2}}[I-B(B^{T}B)^{-1}B^{T}]g^% {B}_{lora}(AA^{T})^{-1}\rangle_{F}\}.roman_d italic_L = - italic_γ { ⟨ italic_g start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l italic_o italic_r italic_a end_POSTSUBSCRIPT , divide start_ARG 1 end_ARG start_ARG italic_s start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG ( italic_B start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_B ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT italic_g start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l italic_o italic_r italic_a end_POSTSUBSCRIPT ⟩ start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT + ⟨ italic_g start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l italic_o italic_r italic_a end_POSTSUBSCRIPT , divide start_ARG 1 end_ARG start_ARG italic_s start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG [ italic_I - italic_B ( italic_B start_POSTSUPERSCRIPT italic_T 
end_POSTSUPERSCRIPT italic_B ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT italic_B start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ] italic_g start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l italic_o italic_r italic_a end_POSTSUBSCRIPT ( italic_A italic_A start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ⟩ start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT } . (34)
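The cancellation of the two $X$-dependent terms can also be checked numerically. The following is a minimal sketch, assuming random stand-ins for $A$, $B$, the full gradient $g$, and an arbitrary $X$; the sizes, the seed, and the helper `frob` are illustrative choices and not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r, s = 6, 5, 2, 2.0          # illustrative sizes and scaling factor

B = rng.standard_normal((m, r))    # m x r
A = rng.standard_normal((r, n))    # r x n
g = rng.standard_normal((m, n))    # stand-in for the full gradient dL/dW
X = rng.standard_normal((r, r))    # arbitrary r x r matrix

g_A_lora = s * B.T @ g             # LoRA gradient w.r.t. A, shape r x n
g_B_lora = s * g @ A.T             # LoRA gradient w.r.t. B, shape m x r

def frob(P, Q):
    """Frobenius inner product <P, Q>_F (hypothetical helper)."""
    return float(np.sum(P * Q))

# <g_A_lora, X A>_F - <g_B_lora, B X>_F should vanish for any X.
cross_term = frob(g_A_lora, X @ A) - frob(g_B_lora, B @ X)
print(abs(cross_term))
```

Both inner products reduce to $s\,\mathrm{tr}(g^T B X A)$, so the difference is zero up to floating-point rounding regardless of the choice of $X$.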
Proof (Part 2) In this part, we aim to prove $\mathrm{d}L \le 0$. Given that the learning rate $\gamma > 0$, it suffices to show the following inequalities:

$$
\langle g^A_{lora}, \tfrac{1}{s^2}(B^TB)^{-1}g^A_{lora}\rangle_F \ge 0,
\tag{35}
$$

$$
\langle g^B_{lora}, \tfrac{1}{s^2}[I - B(B^TB)^{-1}B^T]\,g^B_{lora}(AA^T)^{-1}\rangle_F \ge 0.
\tag{36}
$$

By proving these inequalities, we can establish that $\mathrm{d}L \le 0$ as derived from Equation (29).

① Proof of $\langle g^A_{lora}, \frac{1}{s^2}(B^TB)^{-1}g^A_{lora}\rangle_F \ge 0$. To begin with, we show that $(B^TB)^{-1}$ is positive definite. It is sufficient to show that $B^TB$ is positive definite, as the inverse of a positive definite matrix is also positive definite. Consider any non-zero vector $x$. Since $B$ has full column rank, $Bx \neq 0$, and therefore

$$
\langle x, B^TBx\rangle = \langle Bx, Bx\rangle = \|Bx\|^2 > 0.
\tag{37}
$$

This shows that $B^TB$ is positive definite. Consequently, $(B^TB)^{-1}$ is positive definite as well, so it admits a Cholesky decomposition $(B^TB)^{-1} = UU^T$. With this, we have

$$
\begin{aligned}
\langle g^A_{lora}, \tfrac{1}{s^2}(B^TB)^{-1}g^A_{lora}\rangle_F
&= \tfrac{1}{s^2}\langle g^A_{lora}, UU^Tg^A_{lora}\rangle_F \\
&= \tfrac{1}{s^2}\langle U^Tg^A_{lora}, U^Tg^A_{lora}\rangle_F \\
&= \tfrac{1}{s^2}\|U^Tg^A_{lora}\|_F^2 \ge 0.
\end{aligned}
\tag{38}
$$

② Proof of $\langle g^B_{lora}, \frac{1}{s^2}[I - B(B^TB)^{-1}B^T]\,g^B_{lora}(AA^T)^{-1}\rangle_F \ge 0$. Similarly, we can prove that the matrix $(AA^T)^{-1}$ is positive definite. By employing the Cholesky decomposition, we express $(AA^T)^{-1} = UU^T$, where $U$ is a lower-triangular matrix. Subsequently, we define $P = I - B(B^TB)^{-1}B^T$. The matrix $P$ is symmetric, and it can be shown that $P^2 = P$, indicating that $P$ is a projection matrix. Consequently, the eigenvalues of $P$ are either 0 or 1, which implies that $P$ is positive semi-definite. Utilizing the Cholesky decomposition, we derive that $P = VV^T$, where $V$ is a lower-triangular matrix. Finally, we have:

$$
\begin{aligned}
\langle g^B_{lora}, \tfrac{1}{s^2}[I - B(B^TB)^{-1}B^T]\,g^B_{lora}(AA^T)^{-1}\rangle_F
&= \tfrac{1}{s^2}\langle g^B_{lora}, VV^Tg^B_{lora}UU^T\rangle_F \\
&= \tfrac{1}{s^2}\langle V^Tg^B_{lora}U, V^Tg^B_{lora}U\rangle_F \\
&= \tfrac{1}{s^2}\|V^Tg^B_{lora}U\|_F^2 \ge 0.
\end{aligned}
\tag{39}
$$

In summary, based on the above proofs, we have demonstrated that:

$$
\mathrm{d}L = -\gamma\left\{\langle g^A_{lora}, \tfrac{1}{s^2}(B^TB)^{-1}g^A_{lora}\rangle_F + \langle g^B_{lora}, \tfrac{1}{s^2}[I - B(B^TB)^{-1}B^T]\,g^B_{lora}(AA^T)^{-1}\rangle_F\right\} \le 0.
\tag{40}
$$
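As a sanity check on Part 2, the two inner products can be evaluated on random data. This is a sketch under stated assumptions: the matrices are random stand-ins, and the factor $V$ is taken to be $P$ itself (valid because a symmetric idempotent matrix satisfies $P = PP^T$) rather than a Cholesky factor, since `np.linalg.cholesky` expects a strictly positive definite input and $P$ is singular.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r, s = 7, 6, 3, 0.5            # illustrative sizes and scaling factor

B = rng.standard_normal((m, r))
A = rng.standard_normal((r, n))
g = rng.standard_normal((m, n))      # stand-in for the full gradient

g_A_lora = s * B.T @ g
g_B_lora = s * g @ A.T

BtB_inv = np.linalg.inv(B.T @ B)
AAt_inv = np.linalg.inv(A @ A.T)
P = np.eye(m) - B @ BtB_inv @ B.T    # projector onto the complement of col(B)

# The two terms inside the braces of the expression for dL.
term_A = np.sum(g_A_lora * (BtB_inv @ g_A_lora)) / s**2
term_B = np.sum(g_B_lora * (P @ g_B_lora @ AAt_inv)) / s**2

# Squared-norm form of the second term: (AA^T)^{-1} = U U^T and P = V V^T with V = P.
U = np.linalg.cholesky(AAt_inv)      # lower-triangular factor
norm_form = np.linalg.norm(P.T @ g_B_lora @ U, "fro") ** 2 / s**2

print(term_A, term_B, norm_form)
```

Both terms come out non-negative, and `term_B` agrees with its squared-norm form, which mirrors the argument used in the proof.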

B.3 Proof of Theorem 3.3

Theorem B.3 Consider the optimization problem,

$$
\min_X \|g^A - g^A_{lora}\|_F^2 + \|g^B - g^B_{lora}\|_F^2,
\tag{41}
$$

where $g^A$ and $g^B$ are the optimal solutions as stated in Theorem 3.3. The optimal $X$ can be determined by solving the Sylvester equation:

$$
B^TBX + XAA^T = -\frac{1}{s^2}(B^TB)^{-1}g^A_{lora}A^T,
\tag{42}
$$

which has a unique solution $X$ provided that $B^TB$ and $-AA^T$ do not have any shared eigenvalues.
Proof For simplicity, we denote $L = \|g^A - g^A_{lora}\|_F^2 + \|g^B - g^B_{lora}\|_F^2$. To solve the optimization problem, we need to satisfy the following condition:

$$
\frac{\partial L}{\partial X} = 0.
\tag{43}
$$

Since $g^A$ and $g^B$ are solutions in Theorem 3.3 and $g^A_{lora} = sB^Tg$ and $g^B_{lora} = sgA^T$, we obtain that:

$$
\begin{aligned}
2(g^A - g^A_{lora})A^T - 2B^T(g^B - g^B_{lora}) &= 0, \\
\Rightarrow\quad g^AA^T - B^Tg^B &= g^A_{lora}A^T - B^Tg^B_{lora}, \\
\Rightarrow\quad B^TBX + XAA^T &= -\frac{1}{s^2}(B^TB)^{-1}g^A_{lora}A^T,
\end{aligned}
\tag{44}
$$

where the last step uses $g^A_{lora}A^T - B^Tg^B_{lora} = 0$ (see Equation (33)) together with the expressions for $g^A$ and $g^B$. This is a Sylvester equation. It has a unique solution for $X$ if and only if $B^TB$ and $-AA^T$ have no shared eigenvalues.
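The Sylvester equation and the stationarity condition can be illustrated on a small random instance. The sketch below is not the authors' implementation: `solve_sylvester_kron` is a hypothetical helper that solves $MX + XN = C$ by Kronecker vectorization using NumPy only, and all matrices are random stand-ins.

```python
import numpy as np

def solve_sylvester_kron(M, N, C):
    """Solve M X + X N = C via Kronecker vectorization (row-major vec).

    Hypothetical helper for illustration; it builds an (r*r) x (r*r) system,
    so it only suits the small r x r problems that arise here.
    """
    r1, r2 = C.shape
    K = np.kron(M, np.eye(r2)) + np.kron(np.eye(r1), N.T)
    return np.linalg.solve(K, C.reshape(-1)).reshape(r1, r2)

rng = np.random.default_rng(2)
m, n, r, s = 7, 6, 3, 2.0
B = rng.standard_normal((m, r))
A = rng.standard_normal((r, n))
g = rng.standard_normal((m, n))      # stand-in for the full gradient

g_A_lora = s * B.T @ g
g_B_lora = s * g @ A.T
BtB, AAt = B.T @ B, A @ A.T
BtB_inv, AAt_inv = np.linalg.inv(BtB), np.linalg.inv(AAt)
P = np.eye(m) - B @ BtB_inv @ B.T

def adjusted_grads(X):
    """Closed-form g^A and g^B for a given X."""
    g_A = BtB_inv @ g_A_lora / s**2 + X @ A
    g_B = P @ g_B_lora @ AAt_inv / s**2 - B @ X
    return g_A, g_B

def objective(X):
    """The objective ||g^A - g^A_lora||_F^2 + ||g^B - g^B_lora||_F^2."""
    g_A, g_B = adjusted_grads(X)
    return np.sum((g_A - g_A_lora) ** 2) + np.sum((g_B - g_B_lora) ** 2)

# Right-hand side of the Sylvester equation, then its solution.
C = -(BtB_inv @ g_A_lora @ A.T) / s**2
X_star = solve_sylvester_kron(BtB, AAt, C)

# Gradient of the objective w.r.t. X, evaluated at X_star.
g_A, g_B = adjusted_grads(X_star)
stationarity = 2 * (g_A - g_A_lora) @ A.T - 2 * B.T @ (g_B - g_B_lora)
print(np.linalg.norm(stationarity))
```

When $A$ and $B$ are full rank, $B^TB$ has strictly positive eigenvalues and $-AA^T$ strictly negative ones, so the uniqueness condition holds in this setting. For larger problems, `scipy.linalg.solve_sylvester(a, b, q)`, which solves $aX + Xb = q$, would be a natural replacement for the Kronecker helper.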

Appendix C Optimization Algorithms

In this section, we present the pseudo-code for implementing our LoRA-Pro method using the SGD [Sutskever et al., 2013] and AdamW [Loshchilov and Hutter, 2019] optimizers. These are detailed in Algorithm 1 and Algorithm 2, respectively.

In the standard SGD algorithm, as illustrated in Algorithm 1, all we need to do is adjust the gradients of matrices $A$ and $B$ with the solutions in Theorem 3.3.

With the AdamW optimizer, the implementation becomes more complex, and several modifications are necessary. Firstly, in order to mimic full fine-tuning, after adjusting the gradients of matrices $A$ and $B$, we need to compute the equivalent gradient,

$$
\tilde{g} = sg^BA + sBg^A.
\tag{45}
$$

Subsequently, we calculate the first and second moments of this equivalent gradient to derive the corresponding AdamW gradient, $\tilde{g}^{AdamW}$. Secondly, we determine the gradients with respect to matrices $A$ and $B$ as follows:

$$
\tilde{g}^A = sB^T\tilde{g}^{AdamW}, \quad \tilde{g}^B = s\tilde{g}^{AdamW}A^T.
\tag{46}
$$
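The first two modifications can be sketched for a single step as follows. This is only an illustration under several assumptions: the adjusted gradients are formed with $X = 0$ to keep the example short, the moment update is the standard Adam rule with bias correction at step $t = 1$, and names such as `mom1`, `mom2`, and `g_adamw` are hypothetical. The full Algorithm 2 may contain further steps not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, r, s = 6, 5, 2, 2.0
beta1, beta2, eps = 0.9, 0.999, 1e-8     # values listed for Algorithm 2

B = rng.standard_normal((m, r))
A = rng.standard_normal((r, n))
g = rng.standard_normal((m, n))          # stand-in for the full gradient

# Adjusted gradients of A and B, with X = 0 (assumption for brevity).
BtB_inv = np.linalg.inv(B.T @ B)
AAt_inv = np.linalg.inv(A @ A.T)
P = np.eye(m) - B @ BtB_inv @ B.T
g_A = BtB_inv @ (s * B.T @ g) / s**2
g_B = P @ (s * g @ A.T) @ AAt_inv / s**2

# Equivalent gradient on the re-parameterized matrix, Equation (45).
g_tilde = s * g_B @ A + s * B @ g_A

# Standard Adam first and second moments with bias correction (step t = 1).
t = 1
mom1 = (1 - beta1) * g_tilde             # moments start from zero
mom2 = (1 - beta2) * g_tilde**2
m_hat = mom1 / (1 - beta1**t)
v_hat = mom2 / (1 - beta2**t)
g_adamw = m_hat / (np.sqrt(v_hat) + eps)

# Map the AdamW gradient back to A and B, Equation (46).
g_tilde_A = s * B.T @ g_adamw            # shape r x n, same as A
g_tilde_B = s * g_adamw @ A.T            # shape m x r, same as B
print(g_tilde_A.shape, g_tilde_B.shape)
```

The moments are tracked for the $m \times n$ equivalent gradient rather than for $A$ and $B$ separately, which is what lets the update mimic AdamW on the full matrix.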

Thirdly, the weight decay process must be adjusted. In line with full fine-tuning, the weight decay is given by:

$$
W \leftarrow (1 - \gamma\lambda)(W_0 + sBA).
\tag{47}
$$

This can be decomposed into:

$$
W_0 \leftarrow (1 - \gamma\lambda)W_0, \quad B \leftarrow \sqrt{1 - \gamma\lambda}\,B, \quad A \leftarrow \sqrt{1 - \gamma\lambda}\,A
\tag{48}
$$
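That this decomposition reproduces the full-matrix weight decay can be verified directly. The sketch below uses random stand-ins and illustrative values of $\gamma$ and $\lambda$; it assumes $\gamma\lambda < 1$ so that the square root is real.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, r, s = 6, 5, 2, 2.0
gamma, lam = 1e-3, 1e-2                  # illustrative learning rate and weight decay

W0 = rng.standard_normal((m, n))
B = rng.standard_normal((m, r))
A = rng.standard_normal((r, n))

# Weight decay applied to the merged matrix W = W0 + s B A.
decay = 1 - gamma * lam
W_full = decay * (W0 + s * B @ A)

# Decomposed form: scale W0 by the full factor, and B and A by its square root.
W0_new = decay * W0
B_new = np.sqrt(decay) * B
A_new = np.sqrt(decay) * A
W_decomposed = W0_new + s * B_new @ A_new

print(np.max(np.abs(W_full - W_decomposed)))
```

The two square-root factors multiply to $1 - \gamma\lambda$ in the product $sBA$, so both forms yield the same merged matrix up to rounding.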
Algorithm 1 LoRA-Pro with SGD optimizer
0:  Given initial learning rate γ𝛾\gammaitalic_γ, scaling factor s𝑠sitalic_s.
1:  Initialize time step t0𝑡0t\leftarrow 0italic_t ← 0, low-rank matrices A0r×nsubscript𝐴0superscript𝑟𝑛A_{0}\in\mathbb{R}^{r\times n}italic_A start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_r × italic_n end_POSTSUPERSCRIPT and B0m×rsubscript𝐵0superscript𝑚𝑟B_{0}\in\mathbb{R}^{m\times r}italic_B start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_m × italic_r end_POSTSUPERSCRIPT
2:  repeat
3:     tt+1𝑡𝑡1t\leftarrow t+1italic_t ← italic_t + 1
4:     $g^{A}_{lora}, g^{B}_{lora} \leftarrow$ SelectBatch$(A_{t-1}, B_{t-1})$ $\rhd$ Select batch and return the corresponding gradients
5:     $A, B \leftarrow A_{t-1}, B_{t-1}$ $\rhd$ Obtain the low-rank matrices $A$ and $B$
6:     $X \leftarrow$ SolveSylvester$(B^{T}BX + XAA^{T} = -\frac{1}{s^{2}}(B^{T}B)^{-1}g^{A}_{lora}A^{T})$ $\rhd$ Compute $X$ by solving the Sylvester equation
7:     $g^{A} = \frac{1}{s^{2}}(B^{T}B)^{-1}g^{A}_{lora} + XA$ $\rhd$ Adjust the gradients of LoRA with Theorem 3.3
8:     $g^{B} = \frac{1}{s^{2}}[I - B(B^{T}B)^{-1}B^{T}]g^{B}_{lora}(AA^{T})^{-1} - BX$
9:     $A_{t} \leftarrow A_{t-1} - \gamma g^{A}$
10:     $B_{t} \leftarrow B_{t-1} - \gamma g^{B}$
11:  until stopping criterion is met
12:  return  optimized parameters $A_{t}$ and $B_{t}$
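The inner loop of the SGD variant (steps 6 to 10) can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names `solve_sylvester_sym`, `lora_pro_adjust`, and `lora_pro_sgd_step` are hypothetical, and the Sylvester equation is solved through an eigendecomposition of the symmetric matrices $B^{T}B$ and $AA^{T}$, which is one possible choice since the algorithm only names a generic SolveSylvester routine. The sketch assumes $A$ and $B$ have full rank $r$, so that both inverses in the formulas exist.

```python
import numpy as np

def solve_sylvester_sym(P, Q, C):
    """Solve P X + X Q = C for symmetric P and Q.

    Assumption: an eigendecomposition route is used here for illustration;
    any Sylvester solver could take its place.
    """
    lam, U = np.linalg.eigh(P)   # P = U diag(lam) U^T
    mu, V = np.linalg.eigh(Q)    # Q = V diag(mu) V^T
    # In the rotated basis the equation decouples entry-wise:
    # (lam_i + mu_j) * Y_ij = (U^T C V)_ij
    Y = (U.T @ C @ V) / (lam[:, None] + mu[None, :])
    return U @ Y @ V.T

def lora_pro_adjust(A, B, gA_lora, gB_lora, s):
    """Steps 6-8: adjusted gradients for A (r x n) and B (m x r)."""
    BtB = B.T @ B                                   # r x r
    AAt = A @ A.T                                   # r x r
    base = np.linalg.solve(BtB, gA_lora) / s**2     # (1/s^2)(B^T B)^{-1} g^A_lora
    X = solve_sylvester_sym(BtB, AAt, -base @ A.T)  # step 6
    gA = base + X @ A                               # step 7
    # [I - B (B^T B)^{-1} B^T] g^B_lora, without forming the m x m projector
    resid = gB_lora - B @ np.linalg.solve(BtB, B.T @ gB_lora)
    # right-multiplication by (A A^T)^{-1}, using the symmetry of A A^T
    gB = np.linalg.solve(AAt, resid.T).T / s**2 - B @ X   # step 8
    return gA, gB

def lora_pro_sgd_step(A, B, gA_lora, gB_lora, s, lr):
    """Steps 9-10: plain gradient step with the adjusted gradients."""
    gA, gB = lora_pro_adjust(A, B, gA_lora, gB_lora, s)
    return A - lr * gA, B - lr * gB
```

A quick sanity check on such a sketch is to form $s g^{B} A + s B g^{A}$ from the returned gradients: the $X$ terms cancel in this combination, so its value does not depend on how the Sylvester equation is solved.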
Algorithm 2 LoRA-Pro with AdamW optimizer
0:  Given initial learning rate $\gamma$, scaling factor $s$, original weight matrix $W_{0} \in \mathbb{R}^{m \times n}$, and $\beta_{1} = 0.9, \beta_{2} = 0.999, \epsilon = 10^{-8}, \lambda \in \mathbb{R}$
1:  Initialize time step $t \leftarrow 0$, low-rank matrices $A_{0} \in \mathbb{R}^{r \times n}$ and $B_{0} \in \mathbb{R}^{m \times r}$, first moment $m_{0} \in \mathbb{R}^{m \times n}$, second moment $v_{0} \in \mathbb{R}^{m \times n}$
2:  repeat
3:     $t \leftarrow t + 1$
4:     $g^{A}_{lora}, g^{B}_{lora} \leftarrow$ SelectBatch$(A_{t-1}, B_{t-1})$ $\rhd$ Select batch and return the corresponding gradients
5:     $A, B \leftarrow A_{t-1}, B_{t-1}$ $\rhd$ Obtain the low-rank matrices $A$ and $B$
6:     $X \leftarrow$ SolveSylvester$(B^{T}BX + XAA^{T} = -\frac{1}{s^{2}}(B^{T}B)^{-1}g^{A}_{lora}A^{T})$ $\rhd$ Compute $X$ by solving the Sylvester equation
7:     $g^{A} = \frac{1}{s^{2}}(B^{T}B)^{-1}g^{A}_{lora} + XA$ $\rhd$ Adjust the gradients of LoRA with Theorem 3.3
8:     $g^{B} = \frac{1}{s^{2}}[I - B(B^{T}B)^{-1}B^{T}]g^{B}_{lora}(AA^{T})^{-1} - BX$
9:     $\tilde{g} \leftarrow s g^{B} A + s B g^{A}$ $\rhd$ Compute equivalent gradient
10:     $m_{t} \leftarrow \beta_{1} m_{t-1} + (1 - \beta_{1})\tilde{g}$
11:     $v_{t} \leftarrow \beta_{2} v_{t-1} + (1 - \beta_{2})\tilde{g}^{2}$
12:     $\hat{m}_{t} \leftarrow \frac{m_{t}}{1 - \beta_{1}^{t}}$
13:     $\hat{v}_{t} \leftarrow \frac{v_{t}}{1 - \beta_{2}^{t}}$
14:     $\tilde{g}^{AdamW} \leftarrow \frac{\hat{m}_{t}}{\sqrt{\hat{v}_{t}} + \epsilon}$
15:     $\tilde{g}^{A}_{lora} \leftarrow s B^{T} \tilde{g}^{AdamW}$
16:     $\tilde{g}^{B}_{lora} \leftarrow s \tilde{g}^{AdamW} A^{T}$
17:     $X \leftarrow$ SolveSylvester$(B^{T}BX + XAA^{T} = -\frac{1}{s^{2}}(B^{T}B)^{-1}\tilde{g}^{A}_{lora}A^{T})$
18:     $\tilde{g}^{A} = \frac{1}{s^{2}}(B^{T}B)^{-1}\tilde{g}^{A}_{lora} + XA$ $\rhd$ Adjust the gradients of LoRA with Theorem 3.3
19:     $\tilde{g}^{B} = \frac{1}{s^{2}}[I - B(B^{T}B)^{-1}B^{T}]\tilde{g}^{B}_{lora}(AA^{T})^{-1} - BX$
20:     $A \leftarrow \sqrt{1 - \gamma\lambda}\,A$ $\rhd$ Weight decay
21:     $B \leftarrow \sqrt{1 - \gamma\lambda}\,B$
22:     $W_{0} \leftarrow (1 - \gamma\lambda) W_{0}$
23:     $A_{t} \leftarrow A_{t-1} - \gamma \tilde{g}^{A}$
24:     $B_{t} \leftarrow B_{t-1} - \gamma \tilde{g}^{B}$
25:  until stopping criterion is met
26:  return  optimized parameters $A_{t}$ and $B_{t}$
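One iteration of the AdamW variant can be sketched in the same way. The code below is an illustrative reading of Algorithm 2 rather than a reference implementation, and it rests on several assumptions that the listing does not settle: the names `lora_pro_adamw_step` and `state` are hypothetical, the moments are expected to start at zero as in standard Adam, the Sylvester equation is again solved by eigendecomposition, and the decayed $A$ and $B$ from steps 20 and 21 are taken to be the matrices updated in steps 23 and 24.

```python
import numpy as np

def solve_sylvester_sym(P, Q, C):
    """Solve P X + X Q = C for symmetric P and Q (eigendecomposition route)."""
    lam, U = np.linalg.eigh(P)
    mu, V = np.linalg.eigh(Q)
    return U @ ((U.T @ C @ V) / (lam[:, None] + mu[None, :])) @ V.T

def lora_pro_adjust(A, B, gA_lora, gB_lora, s):
    """Steps 6-8 and 17-19: adjusted gradients for A (r x n) and B (m x r)."""
    BtB, AAt = B.T @ B, A @ A.T
    base = np.linalg.solve(BtB, gA_lora) / s**2          # (1/s^2)(B^T B)^{-1} g^A
    X = solve_sylvester_sym(BtB, AAt, -base @ A.T)
    resid = gB_lora - B @ np.linalg.solve(BtB, B.T @ gB_lora)
    gA = base + X @ A
    gB = np.linalg.solve(AAt, resid.T).T / s**2 - B @ X
    return gA, gB

def lora_pro_adamw_step(A, B, W0, gA_lora, gB_lora, state, s, lr,
                        beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.0):
    """One pass through steps 3-24; `state` is the tuple (m, v, t)."""
    mom1, mom2, t = state
    t += 1                                               # step 3
    gA, gB = lora_pro_adjust(A, B, gA_lora, gB_lora, s)  # steps 6-8
    g_eq = s * gB @ A + s * B @ gA                       # step 9, shape m x n
    mom1 = beta1 * mom1 + (1 - beta1) * g_eq             # step 10
    mom2 = beta2 * mom2 + (1 - beta2) * g_eq**2          # step 11 (element-wise)
    m_hat = mom1 / (1 - beta1**t)                        # step 12
    v_hat = mom2 / (1 - beta2**t)                        # step 13
    g_adam = m_hat / (np.sqrt(v_hat) + eps)              # step 14
    # steps 15-19: map back to LoRA gradients and adjust again
    gA_t, gB_t = lora_pro_adjust(A, B, s * B.T @ g_adam, s * g_adam @ A.T, s)
    # steps 20-22: decay A and B by sqrt(1 - lr*wd) each, and W0 by (1 - lr*wd)
    shrink = np.sqrt(1 - lr * weight_decay)
    W0_new = (1 - lr * weight_decay) * W0
    # steps 23-24, applied to the decayed matrices (assumption, see lead-in)
    A_new = shrink * A - lr * gA_t
    B_new = shrink * B - lr * gB_t
    return A_new, B_new, W0_new, (mom1, mom2, t)
```

In this sketch the moments are stored at the full $m \times n$ size, matching the shapes given in the initialization step. A caller would pass `state = (np.zeros((m, n)), np.zeros((m, n)), 0)` on the first iteration and feed the returned state back in on each subsequent one.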