# Parameter-efficient fine-tuning (LoRA, QLoRA, QDoRA)

Use this when you need strong results without huge infra.

### LoRA vs QLoRA vs QDoRA

LoRA fundamentally addresses the challenge of fine-tuning large language models by decomposing weight updates into low-rank matrices. The core insight is that the change in weights during fine-tuning, while appearing to require updating billions of parameters, actually lies in a much lower-dimensional subspace. Instead of updating the full weight matrix W in a layer, LoRA keeps W frozen and adds a trainable low-rank decomposition expressed as BA, where for W of shape d × k, B is d × r and A is r × k. The rank r controls the capacity of the adaptation, typically ranging from 4 to 64, which is orders of magnitude smaller than the original matrix dimensions.
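A quick sketch of the parameter arithmetic makes the savings concrete (the 4096 × 4096 projection size is illustrative, roughly what a 7B model's attention projections look like):

```python
# Trainable-parameter savings from a low-rank update: for a weight matrix W of
# shape (d_out, d_in), LoRA trains B (d_out x r) and A (r x d_in) instead of W.

def lora_trainable_params(d_out: int, d_in: int, r: int) -> int:
    """Number of trainable parameters in the B and A factors."""
    return r * (d_out + d_in)

# A single 4096 x 4096 projection at rank 16:
full = 4096 * 4096                             # 16,777,216 params if trained directly
lora = lora_trainable_params(4096, 4096, 16)   # 131,072 params
print(f"LoRA trains {lora / full:.2%} of the full matrix")  # 0.78%
```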

The mathematical elegance comes from the fact that during inference, the low-rank updates BA can be absorbed directly into the original weights by computing W' = W + BA, eliminating any inference overhead. During training, only B and A are updated through backpropagation, dramatically reducing the number of trainable parameters and thus memory requirements for optimizer states. This makes fine-tuning a 7B parameter model feasible on GPUs with 16-24GB of memory, whereas full fine-tuning would require substantially more.
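The zero-overhead merge can be verified on tiny matrices; this plain-Python sketch shows that x(W + BA) equals the frozen path plus the adapter path:

```python
# Minimal sketch: merging W' = W + BA gives the same output as running the
# frozen base path and the low-rank adapter path separately.

def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 base weight
B = [[0.5], [0.0]]             # 2x1 factor, rank r = 1
A = [[0.0, 1.0]]               # 1x2 factor
x = [[2.0, 3.0]]               # one input row vector

# Training-time path: base output plus adapter output
adapter_path = add(matmul(x, W), matmul(matmul(x, B), A))
# Deployment-time path: absorb BA into the base weight once, then one matmul
merged_path = matmul(x, add(W, matmul(B, A)))
print(adapter_path, merged_path)   # identical: [[2.0, 4.0]] [[2.0, 4.0]]
```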

QLoRA extends LoRA by aggressively quantizing the base model. The frozen pretrained weights are quantized to 4-bit precision using NormalFloat4 (NF4), a data type whose quantization levels are tailored to the roughly normal distribution of neural network weights. This achieves approximately 4x compression of weight storage compared to 16-bit precision, enabling fine-tuning of much larger models on consumer hardware. The critical design choice is that while the base model is stored in 4-bit, the LoRA adapters remain in bfloat16 precision, and during forward and backward passes the base weights are temporarily dequantized to bfloat16 for computation.

QLoRA introduces additional memory optimizations including double quantization, where even the quantization constants themselves get quantized, and paged optimizers that offload optimizer states to CPU memory when GPU memory pressure becomes high. These techniques combined enable fine-tuning of models like Llama-2 70B on a single 48GB GPU, which would be completely impossible with standard fine-tuning or even LoRA alone.
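A back-of-envelope calculator shows where the memory goes under this scheme. The numbers are illustrative only; real usage also depends on sequence length, batch size, gradients, and framework overhead:

```python
# Rough resident-memory estimate for QLoRA's weight and optimizer storage.

def qlora_weight_memory_gb(n_params: float, lora_params: float) -> dict:
    """Approximate weight/optimizer memory in GB (excludes activations)."""
    base_4bit = n_params * 0.5 / 1e9        # 4 bits = 0.5 bytes per frozen weight
    adapters_bf16 = lora_params * 2 / 1e9   # adapters stay in bfloat16
    adam_states = lora_params * 8 / 1e9     # two fp32 moments per adapter param
    return {"base": base_4bit, "adapters": adapters_bf16, "optimizer": adam_states}

mem = qlora_weight_memory_gb(7e9, 40e6)    # 7B base, ~40M adapter params
print(mem)  # base 3.5 GB, adapters 0.08 GB, optimizer 0.32 GB
```

The point the arithmetic makes: because only the adapters carry optimizer state, the Adam moments shrink from tens of gigabytes to a few hundred megabytes.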

QDoRA represents the latest evolution, combining quantization with DoRA, which itself improves on LoRA by decomposing weight updates into magnitude and direction components. The key insight from DoRA is that full fine-tuning adjusts both the magnitude and direction of weight vectors, but standard LoRA primarily adjusts direction while limiting magnitude changes. By explicitly separating these components and allowing both to be learned, DoRA achieves performance closer to full fine-tuning while maintaining parameter efficiency.

QDoRA quantizes the base model to 4-bit like QLoRA while implementing the DoRA decomposition for the adapters. This combination provides the memory efficiency of QLoRA with the performance benefits of DoRA, often approaching full fine-tuning accuracy while using dramatically less memory. The practical difference is that QDoRA requires careful initialization and slightly more complex training code compared to QLoRA, but the performance improvements often justify this complexity.
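The magnitude/direction split can be illustrated on a tiny matrix. This is a plain-Python sketch of the DoRA reparameterization, W' = m · V / ||V||, where V = W + BA and the norm is taken per column so magnitude (m) and direction (V/||V||) are learned separately:

```python
import math

def column_norms(M):
    """Euclidean norm of each column of a matrix stored as nested lists."""
    return [math.sqrt(sum(M[i][j] ** 2 for i in range(len(M))))
            for j in range(len(M[0]))]

def dora_weight(V, m):
    """Recompose the adapted weight from direction V/||V|| and magnitudes m."""
    norms = column_norms(V)
    return [[m[j] * V[i][j] / norms[j] for j in range(len(m))]
            for i in range(len(V))]

V = [[3.0, 0.0], [4.0, 1.0]]   # V = W + BA; columns have norms 5 and 1
m = [10.0, 2.0]                # learned per-column magnitudes
print(dora_weight(V, m))       # [[6.0, 0.0], [8.0, 2.0]]
```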

When choosing between these methods, I start with LoRA when I have sufficient GPU memory for the base model at bfloat16 precision and the task requires strong adaptation capability. I reach for QLoRA when memory constraints force more aggressive compression, typically when fine-tuning models larger than 13B on consumer GPUs or when I want to fine-tune very large models within limited hardware budgets. I choose QDoRA when absolute performance matters most and I am willing to accept slightly longer training times and more implementation complexity in exchange for results closer to full fine-tuning.

### Picking the right LoRA rank

Determining optimal rank requires systematic experimentation because the right value depends on task complexity, model architecture, and dataset characteristics. I begin with the understanding that rank controls the expressiveness of the adaptation. Too low a rank underfits the task, leaving the model unable to learn necessary adaptations. Too high a rank overfits, learning task-specific quirks rather than generalizable patterns, while also increasing training cost.

My experimental methodology starts with a rank sweep across exponentially spaced values, typically testing ranks of 4, 8, 16, 32, 64, and sometimes 128. I run training with these different ranks while keeping all other hyperparameters constant, including learning rate, batch size, and training steps. This controlled comparison isolates the effect of rank on model performance.

The validation loss curve provides the primary signal for rank selection. I plot validation loss throughout training for each rank value. Lower ranks often show faster initial learning but plateau at higher validation loss, indicating insufficient capacity. Higher ranks may learn more slowly but achieve lower final validation loss, though sometimes higher ranks overfit, showing good training loss but degrading validation loss. The optimal rank achieves the lowest validation loss without signs of overfitting.
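In code, the selection rule I apply to sweep results looks roughly like this (the tolerance and the loss values are made up for illustration):

```python
# Pick the rank with the lowest validation loss, breaking near-ties toward
# the smaller, cheaper rank.

def pick_rank(val_loss_by_rank: dict, tolerance: float = 0.005) -> int:
    best_loss = min(val_loss_by_rank.values())
    good = [r for r, loss in val_loss_by_rank.items()
            if loss <= best_loss + tolerance]
    return min(good)  # smallest rank within tolerance of the best loss

sweep = {4: 1.92, 8: 1.71, 16: 1.64, 32: 1.638, 64: 1.66}  # hypothetical sweep
print(pick_rank(sweep))  # 16: within 0.005 of rank 32 at half the parameters
```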

Task-specific evaluation metrics provide critical validation beyond just loss curves. For text generation tasks, I examine sample outputs at different ranks to assess quality qualitatively. For classification tasks, I look at accuracy and F1 scores. For retrieval or ranking tasks, I measure metrics like MRR or NDCG. Sometimes a lower rank achieves nearly equivalent performance on these downstream metrics despite slightly higher validation loss, suggesting the more efficient configuration is preferable.

Convergence speed matters for practical deployment. I measure wall-clock time to achieve target performance levels for different rank values. Higher ranks require more computation per training step, so sometimes a lower rank that converges in fewer steps ends up faster overall despite plateauing at slightly worse final performance. This time-performance tradeoff informs rank selection when training speed matters.

Model architecture considerations influence rank selection. Different components of transformer models have different sensitivities to rank. The query, key, value, and output projections in attention layers often benefit most from LoRA adaptation. I sometimes use different ranks for different layer types, applying higher rank to attention projections and lower rank to feed-forward layers. This hybrid approach optimizes parameter efficiency while maintaining adaptation capacity where it matters most.

Dataset size interacts with optimal rank selection. Smaller datasets, perhaps 1000-10000 examples, typically require lower ranks to avoid overfitting. Larger datasets with hundreds of thousands of examples can leverage higher ranks to capture nuanced patterns. I adjust my initial rank sweep based on dataset size, starting with lower ranks for smaller datasets and higher ranks for larger ones.

Computational budget constraints often determine practical rank limits. Doubling rank from 32 to 64 roughly doubles adapter compute and optimizer memory, though because the frozen base model's forward and backward passes dominate total cost, wall-clock time typically grows more modestly. When fine-tuning many models or working within tight time constraints, I might accept slightly worse performance from a lower rank in exchange for faster iteration. The optimal rank in production is not always the rank with the absolute best performance but rather the rank with the best performance per unit of computational cost.

### Fine-tuning a 7B model on one A10G

Fine-tuning a 7B parameter model on a single A10G with 24GB of memory requires careful optimization at every level of the stack. The first challenge is simply loading the model into memory. A 7B parameter model in bfloat16 precision requires approximately 14GB just for the weights. Add optimizer states for Adam, and memory requirements explode to 42GB or more, far exceeding available capacity.

QLoRA provides the foundation for fitting within memory constraints. I quantize the base model to 4-bit using NormalFloat4, reducing weight storage from 14GB to roughly 3.5GB. This dramatic compression leaves enough room for LoRA adapters, optimizer states for just the adapter parameters, gradients, and activation memory during training. The key is that only the small adapter matrices train, so optimizer states stay minimal even though the full model is quite large.
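A typical setup with Hugging Face Transformers, PEFT, and bitsandbytes looks roughly like the sketch below. The model name and hyperparameters are illustrative, not a recommendation:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat4 for the frozen base
    bnb_4bit_use_double_quant=True,      # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",          # illustrative 7B checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
model.gradient_checkpointing_enable()    # trade recompute for activation memory

lora_config = LoraConfig(
    r=16, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()       # only the adapters are trainable
```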

Gradient checkpointing trades computation for memory. The attention mechanism generates large activation tensors that must be stored during the forward pass for use in backpropagation. Gradient checkpointing discards these activations and recomputes them when needed during the backward pass. This typically increases training time by 20-30% but can reduce memory usage by 30-50%, making it essential for fitting large models in limited memory.

Batch size must balance memory constraints against training stability. Larger batches improve gradient estimates and training stability but consume more memory. With limited GPU memory, I often train with micro-batch sizes of 1-2 per GPU, using gradient accumulation to achieve larger effective batch sizes. This means processing multiple micro-batches sequentially before updating weights, which provides equivalent gradient estimates to larger batches while fitting in available memory.
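The equivalence that makes gradient accumulation work can be checked with a toy scalar model; averaging per-micro-batch mean gradients over k micro-batches matches the mean gradient of one k-times-larger batch:

```python
# Plain-Python sketch of gradient accumulation on a toy scalar objective.

def grad(example, w):
    # gradient of the squared error (w - example)^2 with respect to w
    return 2 * (w - example)

w = 0.0
micro_batches = [[1.0, 3.0], [5.0, 7.0]]            # 2 micro-batches of size 2
accum = 0.0
for mb in micro_batches:
    accum += sum(grad(x, w) for x in mb) / len(mb)  # mean grad per micro-batch
step = accum / len(micro_batches)                   # average over accumulation

big_batch = [1.0, 3.0, 5.0, 7.0]
full = sum(grad(x, w) for x in big_batch) / len(big_batch)
print(step == full)  # True: same effective gradient at a fraction of the memory
```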

Selecting which layers to apply LoRA adapters to affects both memory usage and model quality. Applying adapters to all linear layers in the model maximizes adaptation capacity but also maximizes memory consumption. I typically start by applying adapters only to query, key, value, and output projections in attention layers, as these often prove most important for fine-tuning. If memory permits, I expand to include feed-forward layers as well.

Mixed precision training using bfloat16 or float16 reduces memory requirements for activations and gradients while maintaining numerical stability. Modern GPUs like A10G have dedicated tensor cores that accelerate mixed precision computation, making this optimization essentially free. I enable automatic mixed precision in PyTorch using torch.cuda.amp, which handles precision conversion automatically while maintaining training stability.

Efficient data loading prevents GPU underutilization. With limited batch sizes, the GPU can process data faster than it arrives if the data pipeline is not optimized. I use multiple data loader workers, prefetch data to GPU memory while training, and ensure datasets are stored in efficient formats that minimize parsing overhead. These optimizations keep the GPU fully utilized, maximizing training throughput despite memory constraints.

Framework choice and optimization matter. I use Hugging Face Transformers with bitsandbytes for quantization and PEFT for LoRA implementation. These libraries provide highly optimized implementations that handle the intricate details of 4-bit quantization and efficient adapter weight updates. Attempting to implement these optimizations from scratch would likely result in slower training and higher memory usage.

### 4-bit quantization trade-offs and benchmarking

Four-bit quantization provides dramatic memory compression at the cost of numerical precision. Each weight parameter that requires 16 bits in bfloat16 requires only 4 bits in NF4, achieving 4x compression. For a 7B parameter model, this reduces weight storage from 14GB to 3.5GB, a reduction that makes training practical on consumer hardware where it would otherwise be infeasible.

The memory trade-offs extend beyond just weight storage. Quantized weights require dequantization during the forward pass, which means maintaining both the compressed 4-bit representation and temporary bfloat16 activations during computation. However, these activations are transient and only exist during the forward and backward passes for the current layer, so they do not accumulate as they would if keeping the full model in bfloat16.

Quantization constants represent another memory consideration. Each block of quantized weights carries a scaling constant (its absolute maximum) used for dequantization. NF4's normalization and binning strategies keep the overhead of these constants small, but they still consume some memory. Double quantization compresses even these constants, recovering additional memory at minimal performance cost.
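The constant overhead is easy to quantify. This sketch follows the QLoRA setup of 4-bit blocks of 64 weights, with double quantization re-quantizing the constants in outer blocks of 256; the exact savings depend on the implementation:

```python
# Back-of-envelope overhead of quantization constants, in bits per parameter.

def constant_bits_per_param(block: int = 64, double_quant: bool = False,
                            outer_block: int = 256) -> float:
    if not double_quant:
        return 32 / block                      # one fp32 absmax per block
    # 8-bit quantized constants plus one fp32 scale per outer block of constants
    return 8 / block + 32 / (block * outer_block)

single = constant_bits_per_param()                    # 0.5 bits per parameter
double = constant_bits_per_param(double_quant=True)   # ~0.127 bits per parameter
saved_gb = (single - double) * 7e9 / 8 / 1e9          # ~0.33 GB for a 7B model
print(round(single, 3), round(double, 3), round(saved_gb, 2))
```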

Performance degradation from quantization manifests in several ways. The quantization error introduces noise into the weight values, which theoretically degrades model capabilities. However, empirical results consistently show that the frozen quantized base model combined with full-precision LoRA adapters achieves performance nearly equivalent to full-precision fine-tuning. The adapters apparently compensate for quantization noise during training.

I benchmark quantization impact through direct comparison. I train models using three configurations: full fine-tuning in bfloat16, LoRA with bfloat16 base model, and QLoRA with 4-bit base model. Each uses the same dataset, hyperparameters, and training steps. I measure both validation loss and task-specific metrics across all three configurations. Typically, I observe that QLoRA achieves 98-99% of full fine-tuning performance, with the small degradation often falling within the noise of random initialization.

Inference latency provides another performance dimension affected by quantization. Dequantizing weights during inference adds computational overhead. However, for LoRA-based approaches, the adapters can be merged with the base weights after training by first dequantizing the base to full precision and computing W' = W + BA; the merged model can then be re-quantized for efficient inference. This merging eliminates any adapter overhead while maintaining quantization benefits.

Task difficulty interacts with quantization tolerance. Simple tasks like sentiment classification often show no measurable degradation from 4-bit quantization, as these tasks do not stress the model's full representational capacity. Complex tasks like long-form text generation or multi-step reasoning sometimes show slightly larger performance gaps from quantization, though still typically small. I adjust my expectations for acceptable performance loss based on task requirements.

Different quantization schemes provide different performance-memory tradeoffs. NF4 assumes normally distributed weights and allocates quantization bins accordingly, which works well for most neural network weights. Alternative schemes like integer quantization or logarithmic quantization might provide different characteristics. I have experimented with various quantization approaches and consistently find NF4 provides the best balance for language model fine-tuning.

### When QDoRA beats QLoRA

I encountered QDoRA's advantages while fine-tuning a 7B parameter model for domain-specific reasoning in quantum physics applications. The task required understanding complex mathematical relationships and generating detailed step-by-step solutions. This represents the type of task where model capacity and subtle weight adjustments matter significantly.

Using QLoRA with rank 32, I achieved reasonable performance but noticed the model struggled with multi-step reasoning chains. The validation loss plateaued and generated solutions sometimes skipped intermediate steps or introduced logical inconsistencies. Increasing LoRA rank to 64 improved results somewhat but still left a noticeable gap compared to the few full fine-tuning experiments I could run on larger infrastructure.

Switching to QDoRA with the same rank 32 configuration produced measurably better results. The validation loss decreased further and generated solutions showed more consistent multi-step reasoning. Quantitatively, accuracy on held-out problems improved from 67% with QLoRA to 74% with QDoRA, a substantial jump considering I changed only the fine-tuning method without touching hyperparameters.

The task characteristics that favored QDoRA involved substantial adaptation from the base model's pretrained knowledge. Quantum physics reasoning requires the model to apply domain-specific transformations that differ significantly from the general text generation patterns learned during pretraining. QDoRA's ability to adjust both magnitude and direction of weight updates apparently enabled more effective adaptation to these domain-specific patterns.

Analysis of the learned adapters revealed interesting differences. QDoRA adapters showed larger magnitude adjustments in specific layers, particularly in middle layers of the transformer that perform abstract reasoning. QLoRA adapters adjusted direction but maintained smaller magnitude changes, potentially limiting their capacity to significantly alter model behavior for this challenging domain adaptation task.

Training dynamics also differed. QDoRA training showed slower initial loss decrease but steadier long-term improvement, suggesting more stable learning of the complex task requirements. QLoRA training showed faster initial progress that plateaued earlier, consistent with learning easier surface patterns without capturing deeper task requirements.

The computational overhead of QDoRA was noticeable but acceptable. Training took approximately 20% longer than QLoRA due to the additional complexity of magnitude-direction decomposition. However, this overhead proved worthwhile given the substantial performance improvements. For less demanding tasks where QLoRA already achieves strong results, this overhead might not be justified.

This experience taught me that the choice between QLoRA and QDoRA depends significantly on task characteristics. For tasks requiring moderate adaptation where base model knowledge transfers well, QLoRA suffices and trains faster. For tasks requiring substantial model behavior changes or complex reasoning capabilities, QDoRA's additional expressiveness justifies its computational cost.

### Avoid catastrophic forgetting

Catastrophic forgetting occurs when fine-tuning overwrites the pretrained knowledge encoded in model weights, causing performance degradation on tasks the model could previously handle. This becomes particularly problematic when fine-tuning on narrow task distributions that differ substantially from pretraining data. The model learns the new task but loses general capabilities.

Parameter-efficient methods like LoRA provide inherent protection against catastrophic forgetting. By keeping most model weights frozen and only training small adapter matrices, the original pretrained knowledge remains largely intact. The adapters add task-specific adjustments without overwriting general capabilities. This architectural constraint proves remarkably effective at preserving base model capabilities.

Mixing general data into the fine-tuning dataset explicitly reinforces pretrained knowledge. I construct training batches that include both task-specific examples and samples from general domain data similar to pretraining. The ratio varies by task, but typically including 10-30% general data maintains broad capabilities while enabling effective task adaptation. This simple technique works well when appropriate general data is available.
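A minimal batch-construction sketch, assuming in-memory lists of examples (the 30% ratio and dataset names are illustrative):

```python
import random

def mixed_batch(task_data, general_data, batch_size=8, general_frac=0.2, rng=None):
    """Build one batch mixing task-specific and general-domain examples."""
    rng = rng or random.Random(0)
    n_general = int(batch_size * general_frac)
    batch = rng.sample(task_data, batch_size - n_general)
    batch += rng.sample(general_data, n_general)
    rng.shuffle(batch)          # avoid a fixed task/general ordering in the batch
    return batch

task = [f"task-{i}" for i in range(100)]          # stand-in task examples
general = [f"general-{i}" for i in range(100)]    # stand-in general examples
batch = mixed_batch(task, general, batch_size=10, general_frac=0.3)
print(sum(x.startswith("general") for x in batch))  # 3 of 10 from general data
```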

Learning rate selection critically affects forgetting. Lower learning rates reduce the magnitude of weight updates, which limits how drastically fine-tuning can alter model behavior. I typically use learning rates 5-10x lower for fine-tuning compared to pretraining. Combined with parameter-efficient methods, conservative learning rates maintain stability while still enabling effective adaptation.

Early stopping based on general capability evaluation prevents overfitting to narrow task distributions. I maintain evaluation sets that test general language understanding, reasoning, and generation capabilities in addition to task-specific metrics. If general capabilities begin degrading while task performance improves, I stop training early, sacrificing some task performance to maintain broader utility.

Regularization techniques like weight decay help maintain weights close to their pretrained values. I apply relatively strong weight decay, typically 0.01-0.1, which penalizes large deviations from initial weights. This regularization encourages the model to find solutions that minimize changes to pretrained knowledge while still adapting to the new task.

Layer-wise learning rate decay recognizes that different layers serve different purposes. Early layers learn general features while later layers specialize. I apply smaller learning rates to early layers and larger rates to later layers, allowing task-specific adaptation primarily in the specialized layers while protecting general feature representations in early layers.
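Concretely, a geometric decay from the last layer backward looks like this (the 0.9 decay factor is illustrative):

```python
# Layer-wise learning-rate decay: the last layer receives the full base rate,
# and each earlier layer receives a geometrically smaller rate.

def layer_lrs(base_lr: float, num_layers: int, decay: float = 0.9) -> list:
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]

lrs = layer_lrs(2e-4, 4)
print([f"{lr:.2e}" for lr in lrs])  # layer 0 smallest, last layer gets 2e-4
```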

Curriculum learning strategies can reduce forgetting by gradually increasing task difficulty. Starting with simpler examples from the target task mixed with general data, then progressively increasing difficulty, allows the model to adapt incrementally rather than being shocked by dramatically different data distribution. This gradual adaptation typically preserves more pretrained knowledge than abrupt distribution shift.

When catastrophic forgetting cannot be fully prevented through these techniques, I maintain multiple specialized adapters rather than a single fine-tuned model. Each adapter targets a specific task while sharing the same frozen base model. At inference time, the appropriate adapter loads based on the task, providing task-specific performance without the impossible goal of a single model that excels at everything.

### Hyperparameter tuning for LoRA

Hyperparameter tuning for LoRA fine-tuning involves several parameters that interact in complex ways. The key parameters include LoRA rank, alpha scaling factor, dropout rate, learning rate, and batch size. Rather than attempting to optimize all parameters simultaneously, I adopt a staged approach that tunes related parameters together while keeping others fixed.

I start with LoRA rank, as this fundamentally determines adaptation capacity. The rank sweep methodology I described earlier identifies a reasonable rank value for the task. This becomes the baseline around which other parameters are tuned. Starting with a fixed rank reduces the dimensionality of the search space substantially.

Alpha, the LoRA scaling factor, works in conjunction with learning rate. The effective learning rate for adapter weights is scaled by alpha/rank, so these parameters interact. I typically set alpha equal to rank as a starting point, which has emerged as a reasonable default from community experience. However, I experiment with alpha values ranging from 0.5x to 2x the rank, particularly when learning rate tuning suggests the effective learning rate needs adjustment.
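The interaction is simplest to see as the scalar applied to the adapter output, since the LoRA branch contributes (alpha / r) · BAx:

```python
# The adapter output is scaled by alpha / r, so doubling the rank at fixed
# alpha halves the effective strength of the adaptation.

def lora_scale(alpha: float, r: int) -> float:
    return alpha / r

print(lora_scale(16, 16))  # 1.0  (alpha = rank, a common default)
print(lora_scale(16, 32))  # 0.5  (same alpha, doubled rank: weaker scaling)
print(lora_scale(32, 32))  # 1.0  (raising alpha with rank restores the scale)
```

This is why a rank change without a matching alpha (or learning rate) change can make a sweep misleading.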

Learning rate requires careful tuning as it affects both convergence speed and final performance. I conduct learning rate finding runs using techniques like the learning rate finder from fastai, which gradually increases learning rate while monitoring loss. This identifies both the maximum stable learning rate and the optimal learning rate for fastest convergence. I typically find that LoRA fine-tuning works well with learning rates in the range of 1e-4 to 5e-4, substantially lower than typical pretraining rates.

The learning rate schedule interacts with total training steps. I use cosine annealing schedules that decay learning rate from the initial value to near zero over the course of training. The warmup phase at the beginning stabilizes training by gradually increasing learning rate from zero to the target value. Warmup steps typically represent 5-10% of total training steps. This schedule pattern consistently outperforms constant learning rates.
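The schedule itself is a few lines; this sketch uses a 5% linear warmup followed by cosine decay (fractions are illustrative):

```python
import math

def lr_at(step: int, total: int, base_lr: float, warmup_frac: float = 0.05) -> float:
    """Learning rate at a given step: linear warmup, then cosine decay."""
    warmup = max(1, int(total * warmup_frac))
    if step < warmup:
        return base_lr * (step + 1) / warmup                 # linear warmup
    progress = (step - warmup) / max(1, total - warmup)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay

total, base = 1000, 2e-4
print(lr_at(49, total, base))    # end of warmup: full rate 2e-4
print(lr_at(1000, total, base))  # end of training: decayed to ~0
```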

LoRA dropout provides regularization for the adapter weights. I experiment with dropout values from 0 to 0.1, though I find that dropout benefits diminish with LoRA compared to full fine-tuning. The frozen base model already provides substantial regularization, so adding dropout to adapters often provides minimal additional benefit. I typically use dropout of 0 or 0.05 unless I observe clear overfitting signs.

Batch size affects both convergence characteristics and memory usage. Larger batches provide more stable gradients but require more memory. With gradient accumulation, I can simulate larger effective batch sizes on limited hardware. I tune effective batch size by testing values like 8, 16, 32, and 64, measuring both convergence speed and final performance. The optimal batch size often depends on dataset size, with smaller datasets favoring smaller batches.

Training duration requires balancing convergence against overfitting. I train until validation loss plateaus, typically monitoring for several evaluations after the best validation loss to confirm the model has truly converged. For most tasks, convergence happens within 3-5 epochs for large datasets or 10-20 epochs for smaller datasets. Going beyond convergence risks overfitting to the training set.

I document the interaction between hyperparameters carefully. Sometimes a parameter that appeared optimal at one rank becomes suboptimal at different ranks. Similarly, learning rate and batch size interact, with larger batches typically requiring larger learning rates. I maintain records of hyperparameter configurations and their resulting performance across multiple experiments, building intuition about these interactions over time.

For critical applications, I use grid search or Bayesian optimization to explore the hyperparameter space systematically. However, for most projects, manually guided tuning based on understanding parameter interactions and initial experimental results proves more efficient than exhaustive search. The key is developing intuition through experience about which parameters matter most for which types of tasks.

### Multi-adapter inference

Multi-adapter inference allows serving many specialized models using shared base weights, dramatically reducing memory requirements compared to loading completely separate models. The architecture keeps one copy of the frozen base model in memory while dynamically loading different adapter weights depending on the inference request. This approach scales much more efficiently than maintaining separate model instances.

The implementation begins with model loading. I load the base model once into GPU memory, typically using 4-bit quantization to minimize memory footprint. This base model remains resident in memory for the entire inference session. The adapter weights for different tasks are stored separately, either on disk or in CPU memory, ready to be loaded on demand.

Dynamic adapter switching happens at request time. Each inference request includes metadata indicating which adapter to use. The inference server checks whether the required adapter is currently loaded. If so, the request proceeds immediately. If not, the server loads the adapter weights from storage, which typically takes 50-200ms depending on adapter size. Once loaded, the adapter remains in memory for subsequent requests, implementing an LRU cache that maintains the most recently used adapters.

Batching requests with the same adapter improves throughput. Rather than switching adapters for every request, I buffer incoming requests and batch together requests targeting the same adapter. This amortizes the adapter loading cost across multiple requests and maximizes GPU utilization by processing larger batches. The batching window typically ranges from 10-50ms, balancing latency against batch size.

Memory management of the adapter cache requires careful attention. Each adapter consumes additional GPU memory, so only a limited number can be cached simultaneously. I implement an LRU eviction policy that removes least recently used adapters when memory pressure increases. The cache size tuning depends on adapter sizes, GPU memory capacity, and traffic patterns. Typical configurations cache 4-16 adapters simultaneously.
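The cache logic reduces to a small class; this sketch stubs out the actual weight loading (a real server would read adapter weights from disk and move them to the GPU), and the names are hypothetical:

```python
from collections import OrderedDict

class AdapterCache:
    """LRU cache of loaded adapters keyed by adapter name."""
    def __init__(self, capacity: int = 4):
        self.capacity = capacity
        self.cache = OrderedDict()          # adapter name -> loaded weights

    def get(self, name: str):
        if name in self.cache:
            self.cache.move_to_end(name)    # mark as most recently used
            return self.cache[name]
        weights = f"<weights:{name}>"       # stand-in for loading from storage
        self.cache[name] = weights
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used
        return weights

cache = AdapterCache(capacity=2)
cache.get("summarize"); cache.get("classify")
cache.get("summarize")                      # touch: now most recently used
cache.get("translate")                      # over capacity: evicts "classify"
print(list(cache.cache))  # ['summarize', 'translate']
```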

For high-traffic scenarios with predictable adapter usage patterns, I implement adapter preloading. Based on traffic analysis, certain adapters are kept loaded proactively, ensuring zero latency for adapter switching on common requests. Less frequently used adapters still load on demand, providing a hybrid approach that optimizes for common cases while supporting the long tail of specialized adapters.

The inference serving infrastructure requires a serving layer that supports adapter management. I have built custom serving layers using FastAPI and PyTorch that handle adapter loading and caching logic, and frameworks like vLLM now support multi-adapter serving natively. The serving layer exposes a simple API where requests specify task type and the system handles adapter selection transparently.

Monitoring adapter cache performance provides insights for optimization. I track metrics including cache hit rates, adapter loading times, memory utilization, and per-adapter request rates. This data informs cache sizing decisions and identifies opportunities for preloading high-traffic adapters or pruning rarely used adapters.

The approach extends to serving adapters with different ranks or configurations. More important tasks might use higher-rank adapters for better performance, while less critical tasks use lower-rank adapters for efficiency. The serving system dynamically allocates resources based on adapter requirements, balancing quality against capacity.

For production deployments, I implement graceful degradation. If adapter loading fails or memory limits are exceeded, the system falls back to the base model without adapters rather than failing the request entirely. While quality degrades without task-specific adaptation, providing some response proves better than complete failure in most applications.

### Full fine-tuning vs PEFT

The decision between full fine-tuning and parameter-efficient methods starts with task characteristics analysis. Tasks requiring substantial deviation from the pretrained model's capabilities often benefit from full fine-tuning's higher capacity for adaptation. Tasks that mainly require applying existing knowledge in new contexts typically work well with parameter-efficient approaches.

Dataset size strongly influences the choice. Full fine-tuning with small datasets, perhaps under 10,000 examples, risks severe overfitting as the vast parameter space cannot be constrained effectively. Parameter-efficient methods provide inherent regularization through architectural constraints, making them safer for smaller datasets. Conversely, very large datasets with millions of examples can leverage full fine-tuning's capacity without overfitting.

Available computational resources constrain the practical options. Full fine-tuning a 7B parameter model requires expensive GPU infrastructure, often multiple high-end GPUs with substantial memory. Parameter-efficient methods enable fine-tuning on consumer hardware or cloud instances costing orders of magnitude less. When budget limits exist, the decision becomes straightforward regardless of other factors.

Domain shift magnitude between pretraining data and target task affects performance of different approaches. Small domain shifts, like adapting a general language model to a specific writing style, work excellently with parameter-efficient methods. Large domain shifts, like adapting a language model pretrained on natural language to generate code or structured data, might benefit from full fine-tuning's higher adaptation capacity.

I conduct empirical comparison when stakes are high and resources permit. I run small-scale experiments with both full fine-tuning and parameter-efficient methods, measuring validation performance after comparable training compute. If parameter-efficient methods achieve 95%+ of full fine-tuning performance, the cost and complexity savings justify choosing them. If a significant performance gap exists, full fine-tuning may be necessary despite its costs.

Iteration speed requirements matter in practice. Parameter-efficient methods train faster and require less infrastructure, enabling more rapid experimentation. For research projects where fast iteration is crucial, this advantage often outweighs modest performance improvements from full fine-tuning. Production systems with established requirements might prioritize absolute performance over iteration speed.

The need to maintain model generality influences the decision. Parameter-efficient methods naturally preserve most pretrained capabilities since most weights remain frozen. Full fine-tuning risks catastrophic forgetting, overwriting general capabilities during adaptation. When maintaining broad model capabilities alongside task-specific adaptation is important, parameter-efficient approaches provide significant advantages.

Deployment constraints affect the practical choice. Parameter-efficient methods produce lightweight adapters that can be swapped dynamically, enabling serving many specialized models efficiently. Full fine-tuning produces complete separate models, multiplying storage and serving costs when supporting multiple tasks. For multi-task deployments, parameter-efficient approaches often prove more practical.

Recent research increasingly shows parameter-efficient methods matching or exceeding full fine-tuning performance with proper hyperparameter tuning and adequate rank selection. This shifts the default choice toward parameter-efficient methods, with full fine-tuning reserved for cases where empirical evidence demonstrates clear benefits. The burden of proof has reversed from justifying parameter-efficient methods to justifying the costs of full fine-tuning.
