LLaMA beyond English // developing non-English LLMs

sbagency
4 min read · Jan 4, 2024


https://arxiv.org/pdf/2401.01055.pdf

In recent times, substantial advancements have been witnessed in large language models (LLMs), exemplified by ChatGPT, showcasing remarkable proficiency across a range of complex tasks. However, many mainstream LLMs (e.g. LLaMA) are pretrained on English-dominant corpora, which limits their performance in other, non-English languages. In this paper, we focus on how to effectively transfer the capabilities of language generation and instruction following to a non-English language. To answer this question, we conduct an extensive empirical investigation based on LLaMA, accumulating over 1440 GPU hours. We analyze the impact of key factors such as vocabulary extension, further pretraining, and instruction tuning on transfer. To accurately assess the model’s level of knowledge, we employ four widely used standardized testing benchmarks: C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench. Furthermore, a comprehensive evaluation of the model’s response quality is conducted, considering aspects such as accuracy, fluency, informativeness, logical coherence, and harmlessness, based on LLM-Eval, a benchmark consisting of instruction tasks from 17 diverse categories. Our evaluation results demonstrate that comparable performance to state-of-the-art transfer models can be achieved with less than 1% of the pretraining data, both in terms of knowledge alignment and response quality. Furthermore, the experimental outcomes across thirteen low-resource languages exhibit similar trends. We anticipate that the conclusions revealed by these experiments will aid the community in developing non-English LLMs.

Here is a summary of the key points from the paper:

- The paper investigates how to effectively transfer language generation and instruction-following capabilities of large language models (LLMs) like LLaMA to non-English languages.

- It analyzes the impact of vocabulary extension, further pretraining, and instruction tuning on transferability through comprehensive experiments.

- Vocabulary extension is found to be unnecessary for small-scale incremental pretraining: further pretraining on the original vocabulary with only 0.5 billion tokens outperforms an extended-vocabulary model pretrained on over 30 billion tokens.

- Further pretraining at scales below 100 billion tokens is insufficient to significantly improve LLaMA’s knowledge level, whereas improving response quality requires only hundreds of thousands of instruction-tuning examples.

- Exclusive reliance on target language corpora for transfer compromises original English capabilities. Multilingual joint training alleviates this concern.

- Experiments on 13 low-resource languages exhibit similar trends, with comparable performance to state-of-the-art transfer models achieved using less than 1% of their training data.

- The study provides practical guidance for developing non-English LLMs efficiently, analyzing the necessity of techniques such as vocabulary extension and the training scales required (a minimal code sketch of the resulting recipe follows this list).
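Concretely, the recipe these findings point to is ordinary continued pretraining of LLaMA on its original vocabulary, mixing target-language and English text so the English abilities are not overwritten, followed by instruction tuning. Below is a minimal sketch of the continued-pretraining step with Hugging Face Transformers; the checkpoint name, data files, mixing ratio, and hyperparameters are illustrative assumptions, not the paper’s exact setup.

```python
# Continued pretraining of LLaMA on a mixed target-language + English corpus.
# Checkpoint, data files, mixing ratio, and hyperparameters are placeholders.
from datasets import load_dataset, interleave_datasets
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "meta-llama/Llama-2-7b-hf"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Multilingual joint training: interleave target-language and English text
# so the original English capability is preserved.
target = load_dataset("text", data_files="target_lang.txt")["train"]   # hypothetical file
english = load_dataset("text", data_files="english.txt")["train"]      # hypothetical file
mixed = interleave_datasets([target, english], probabilities=[0.8, 0.2], seed=42)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = mixed.map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal-LM objective

args = TrainingArguments(
    output_dir="llama-continued-pretrain",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,
    learning_rate=2e-5,
    num_train_epochs=1,
    bf16=True,
    logging_steps=50,
)
Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()
```

The instruction-tuning stage would reuse the same training loop on a few hundred thousand instruction-response pairs, which is the scale the paper finds sufficient for response quality.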

In this paper, we focus on how to effectively transfer the capabilities of language generation and instruction following to a non-English language. Specifically, we conduct a comprehensive empirical study to analyze the necessity of vocabulary extension and the required training scale for effective transfer. We find that vocabulary extension is unnecessary and that comparable transfer performance to state-of-the-art models can be achieved with less than 1% of the further pretraining data. Additionally, we observe instances of code-switching during the transfer training, suggesting that cross-lingual alignment might have been internalized within the model. Similar results are observed in the extension experiments on the 13 low-resource languages. Our analysis and findings offer assistance and guidance to the community in developing non-English LLMs.
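To check knowledge alignment the way the paper does (MMLU, C-Eval, and similar exam-style benchmarks), an off-the-shelf harness such as EleutherAI’s lm-evaluation-harness is a practical option. The snippet below is an assumed setup, not the paper’s evaluation code; the checkpoint path is hypothetical and task names vary across harness versions.

```python
# Knowledge-level evaluation of a transferred checkpoint.
# Uses EleutherAI's lm-evaluation-harness; task names and args may vary by version.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=./llama-continued-pretrain,dtype=bfloat16",  # hypothetical path
    tasks=["mmlu", "ceval-valid"],  # C-Eval task name is version-dependent
    num_fewshot=5,
    batch_size=8,
)
print(results["results"])
```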

https://www.linkedin.com/posts/mohdamir_google-colaboratory-activity-7148578388511809536-Lef4

Finetuning LLMs with UnSloth.ai. The following are a few notebooks you can use for #RoPE scaling, #finetuning, or #quantization of #LLMs (a minimal usage sketch follows the list):

1. Zephyr DPO [Colab](https://lnkd.in/dWZ95rnF)
2. Mistral 7b [Colab](https://lnkd.in/dYg4xx7r)
3. Llama 7b [Colab](https://lnkd.in/dz4MWF6a)
4. CodeLlama 34b [A100 on Colab](https://lnkd.in/dfVqdEcy)
5. Llama 7b [Kaggle](https://lnkd.in/d7DDKbmK)
6. TinyLlama [Colab](https://lnkd.in/dKqRA-TS)
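A minimal Unsloth QLoRA fine-tuning sketch along the lines of those notebooks might look as follows; the base checkpoint, instruction dataset, and LoRA settings are assumptions, and the exact SFTTrainer arguments differ across trl/Unsloth versions.

```python
# QLoRA fine-tuning with Unsloth; checkpoint, dataset, and LoRA settings are placeholders.
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

# Load a 4-bit base model and its tokenizer.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters so only a small set of weights is trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing=True,
)

# Example instruction dataset, formatted into a single text field.
dataset = load_dataset("yahma/alpaca-cleaned", split="train")

def to_text(example):
    return {"text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"}

dataset = dataset.map(to_text)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="unsloth-finetune",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        max_steps=100,
        fp16=True,
        logging_steps=10,
    ),
)
trainer.train()
```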

https://happytransformer.com/
https://github.com/hiyouga/LLaMA-Factory


Written by sbagency

Tech/biz consulting, analytics, research for founders, startups, corps and govs.
