Technology & Research
Improving Word Embedding Models
This work presents an approach to improving text embedding models through contrastive fine-tuning on small datasets augmented with expert scores. It focuses on enhancing semantic textual similarity and text retrieval tasks. The proposed method fine-tunes embedding models with soft labels derived from expert-augmented scores, preserving their versatility while improving retrieval capability. The method is evaluated on a Q&A dataset from an online shopping website using eight expert models, and results show improved performance over a benchmark model across multiple metrics on retrieval tasks from the Massive Text Embedding Benchmark (MTEB). The approach is cost-effective and practical for real-world applications, especially when labeled data is scarce.

Tables 1 and 2 report nDCG@10 and mAP@10, respectively, for the different models across various MTEB retrieval datasets. The average nDCG@10 scores for the Benchmark, Soft-1, Soft-2, and Hard-label models are 39.675, 40.633, 40.334, and 37.574, with standard deviations of 29.963, 28.552, 28.167, and 27.081, respectively. The average mAP@10 scores are 34.419, 35.323, 35.04, and 32.243, with standard deviations of 29.693, 28.587, 28.221, and 26.585, respectively. The win rate of Soft-1 over the Benchmark is 50.37% in terms of nDCG@10 and 55.38% in terms of mAP@10, which again confirms that no single text embedding method dominates across all tasks (Muennighoff et al., 2022).

The Soft-1 and Soft-2 models show promising results, with higher scores and smaller standard deviations than the Benchmark model, suggesting that they perform well across various datasets and that their performance is consistently stable. The Hard-label model, on the other hand, has worse nDCG@10 and mAP@10 scores than the Benchmark, albeit with a smaller standard deviation. The improvement from fine-tuning with Soft-1 and Soft-2 labels may be attributed to reduced anisotropy in the fine-tuned models (i.e., the text embeddings occupy a larger cone in the vector space after fine-tuning). This property is further supported by the results on the held-out set: the Soft-1 and Soft-2 models achieve better area under the precision-recall (PR) curve (see Section 4.3), and the embeddings of irrelevant pairs are distributed across a wider range of the vector space.

Pdf: https://arxiv.org/pdf/2408.11868
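As a rough illustration of the fine-tuning recipe summarized above, the sketch below regresses the cosine similarity of each Q&A pair toward its soft label using the sentence-transformers library. The backbone name, the toy pairs, and the assumption that expert-augmented scores have been rescaled to [0, 1] are illustrative choices, not details taken from the paper.

```python
# Minimal sketch: contrastive-style fine-tuning with soft labels.
# Assumptions: "all-MiniLM-L6-v2" stands in for the benchmark embedding model,
# and the expert-augmented scores have already been mapped to [0, 1].
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical Q&A pairs from a shopping-site corpus, each with a soft label.
train_examples = [
    InputExample(texts=["How long does delivery take?",
                        "Orders usually arrive within 3-5 business days."],
                 label=0.92),
    InputExample(texts=["How long does delivery take?",
                        "Our loyalty program gives members 5% cashback."],
                 label=0.08),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# CosineSimilarityLoss pulls the cosine similarity of the two embeddings
# toward the float label, so the soft score acts as a soft contrastive target.
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1, warmup_steps=10)
```

With hard labels the same setup degenerates to 0/1 targets, which loosely corresponds to the Hard-label baseline the tables compare against.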
Large Language Model Compression
In this work, we tackle the critical challenge of compressing large language models (LLMs) to facilitate their practical deployment and broader adoption. We introduce a novel post-training compression paradigm centered on low-rank decomposition of LLM weights. Our analysis identifies two main challenges in this task: the variability of LLM activation distributions and handling unseen activations from different datasets and models. To address these challenges, we propose a nested activation-aware decomposition framework (NSVD) for LLMs, a training-free approach that improves the accuracy of low-rank decompositions by managing activation outliers: the weight matrix is transformed based on the activation distribution and the original weight matrix, allowing outliers to be absorbed into the transformed matrix and the decomposition accuracy to improve. A comprehensive evaluation across eight datasets and six models from three distinct LLM families demonstrates the superiority of NSVD over current state-of-the-art methods, especially at medium to large compression ratios and in multilingual and multitask settings.

We first evaluate LLaMA-7B compressed with NSVD (here, $k_1 = 0.95k$) and the baselines under compression ratios ranging from 10% to 50% across all eight datasets. Our results include comparisons with ASVD-II, NSVD-I, and NSVD-II; since no improvements were observed with the proposed ASVD-III method, its results are omitted for brevity. Table 1 summarizes these findings. We observe that ASVD-I and ASVD-II yield equivalent performance when ignoring numerical errors, and NSVD-I and NSVD-II likewise produce comparable outcomes. NSVD-I and NSVD-II consistently outperform standard SVD, ASVD-0, and ASVD-I across all compression ratios. More importantly, NSVD exhibits significant advantages over the baselines at medium to high compression ratios. Specifically, at a 30% compression ratio, compared to the best-performing baseline, NSVD-I reduces perplexity on PTB, C4, SNIPS, AlpacaEval, MCTest, CMRC (CN), and AlpacaEval (JP) by 7.1%, 5.4%, 12.1%, 6.3%, 1.3%, 16.1%, and 54.8%, respectively; when the compression ratio reaches 40%, NSVD can reduce perplexity by more than 60%.

Pdf: https://arxiv.org/pdf/2503.17101
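To make the activation-aware step concrete, the following NumPy sketch scales the input channels of a weight matrix by calibration-activation magnitudes before truncated SVD and folds the scaling back afterwards. This illustrates the generic activation-aware decomposition that ASVD-style methods build on, not the full nested NSVD scheme; the function name, the mean-absolute-activation scaling, and the toy data are assumptions.

```python
import numpy as np

def activation_aware_lowrank(W, X, rank):
    """Activation-aware low-rank factorization sketch.

    W    : (d_out, d_in) weight of a linear layer (y = x @ W.T)
    X    : (n_samples, d_in) calibration activations feeding the layer
    rank : target rank after truncation
    Returns A (d_out, rank) and B (rank, d_in) with W ~= A @ B.
    """
    # Per-input-channel scaling from activation magnitudes: channels carrying
    # large (outlier) activations get more weight in the decomposition.
    s = np.abs(X).mean(axis=0) + 1e-6
    S_inv = np.diag(1.0 / s)

    # Decompose the activation-scaled weight, then fold the scaling back.
    U, sigma, Vt = np.linalg.svd(W * s, full_matrices=False)  # W @ diag(s)
    A = U[:, :rank] * sigma[:rank]
    B = Vt[:rank, :] @ S_inv
    return A, B

# Toy usage: a weight matrix and calibration data with a few outlier channels.
rng = np.random.default_rng(0)
W = rng.normal(size=(256, 512))
X = rng.normal(size=(1024, 512)) * (1.0 + 5.0 * (rng.random(512) > 0.95))
A, B = activation_aware_lowrank(W, X, rank=64)
print("relative error:", np.linalg.norm(W - A @ B) / np.linalg.norm(W))
```

The compressed layer then stores A and B in place of W, so the parameter count drops from d_out * d_in to rank * (d_out + d_in).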