We show that Maximal Update Parametrization (Mup) itself provides a model sequence that fits a modified scaling law and enables accurate loss prediction.
Mu-scaling paper: https://arxiv.org/abs/2304.06875
This implementation is based on Hugging Face Transformers and MuTransformers, with modifications to improve stability and to support DeepSpeed.
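As background, the core idea of Mup that this repo builds on is width-aware learning-rate scaling. The sketch below is an assumption about the general setup (for Adam-style optimizers), not this repo's exact code: matrix-like hidden weights get their learning rate divided by the width multiplier relative to a tuned base width, while vector-like parameters (biases, layer norms) keep the base learning rate.

```python
def mup_lr(base_lr, base_width, width, is_matrix_like):
    """Hedged sketch of muP learning-rate scaling for Adam-style optimizers.

    Matrix-like (hidden weight) parameters scale their learning rate by
    1 / width_multiplier; vector-like parameters keep the base rate.
    `base_width` is the width at which `base_lr` was tuned (an assumption
    for illustration, not the repo's actual configuration).
    """
    width_mult = width / base_width
    return base_lr / width_mult if is_matrix_like else base_lr
```

Because the optimal learning rate found at the base width transfers to larger widths under this rule, hyperparameter search only has to be done once at small scale.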
You can use conda or other tools to manage your Python environment. For simplicity, we recommend conda.
conda create -n mu_scaling python=3.8
conda activate mu_scaling
pip install -r requirements.txt
If you are in China, you can instead run
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
to speed up installation.
Preprocess datasets for causal language modeling following the Hugging Face instructions. We also provide an example of processed data in res/final_data/test.
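For reference, the central step of the Hugging Face causal-LM preprocessing recipe is grouping tokenized documents into fixed-length blocks. The sketch below reproduces that step in isolation; the block size and field names are assumptions for illustration, not this repo's exact configuration.

```python
def group_texts(examples, block_size=1024):
    """Concatenate tokenized examples and split into fixed-length blocks,
    as in the Hugging Face causal-LM preprocessing recipe (sketch)."""
    # Concatenate all token lists in the batch, field by field.
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated["input_ids"])
    # Drop the remainder so every block has exactly block_size tokens.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal LM, labels are the inputs themselves; the one-token
    # shift happens inside the model's loss computation.
    result["labels"] = result["input_ids"].copy()
    return result
```

In the standard recipe this function is passed to `datasets.Dataset.map` with `batched=True` after tokenization.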
sh run_grid_search_pair_wise_mup.sh
If Mup works correctly, the loss basins for different widths should be aligned, i.e., share roughly the same optimal learning rate.
python visualize_lr_landscape.py
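The alignment check above can also be done numerically: under Mup, the learning rate that minimizes the loss should be (roughly) the same at every width. The sketch below uses synthetic stand-ins for the grid-search curves; the widths, grid, and tolerance are assumptions for illustration.

```python
import numpy as np

# log10 learning-rate grid (assumed range, for illustration).
log_lrs = np.linspace(-4, -1, 13)

# Synthetic loss curves per width: toy quadratic basins, all centered
# at log10(lr) = -2.5, with smaller loss at larger width.
curves = {
    128:  (log_lrs + 2.5) ** 2 + 3.0,
    512:  (log_lrs + 2.5) ** 2 + 2.4,
    2048: (log_lrs + 2.5) ** 2 + 1.9,
}

# Optimal log-lr per width; basins are "aligned" if these nearly coincide.
best = {w: log_lrs[np.argmin(losses)] for w, losses in curves.items()}
aligned = max(best.values()) - min(best.values()) < 0.25
```

With real grid-search results in place of the synthetic curves, a misaligned basin (optimum drifting with width) would indicate that Mup is not set up correctly.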
Record the training loss of each model on the same data at the same steps, then run
python fit_scale_loss_prediction.py
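To illustrate the kind of fit involved, the sketch below fits a simple power law L(N) = a * N**(-b) to losses recorded at the same step across widths, then extrapolates to a larger width. The functional form and the synthetic numbers are assumptions for illustration; see fit_scale_loss_prediction.py for the exact modified scaling law used in this repo.

```python
import numpy as np

# Widths and synthetic losses generated exactly from L(N) = 4.0 * N**(-0.3).
widths = np.array([256.0, 512.0, 1024.0, 2048.0])
losses = 4.0 * widths ** -0.3

# A pure power law is linear in log-log space: log L = log a - b * log N,
# so a degree-1 least-squares fit recovers the parameters.
slope, log_a = np.polyfit(np.log(widths), np.log(losses), 1)
a, b = np.exp(log_a), -slope

# Extrapolate the fitted law to a larger, unseen width.
predicted = a * 8192.0 ** -b
```

In practice one such fit per recorded training step yields a predicted loss curve for the large target model before it is ever trained.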
If you would like to evaluate on held-out data, we suggest training all the models for more steps and then running
sh run_eval_ppl_loss_pred.sh
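As a reminder of the metric involved, evaluation perplexity is the exponential of the mean per-token cross-entropy loss, so a predicted loss converts directly into a predicted perplexity. The numbers in the usage below are illustrative.

```python
import math

def perplexity(total_nll, num_tokens):
    """Perplexity from a summed negative log-likelihood (in nats)
    over `num_tokens` evaluation tokens."""
    return math.exp(total_nll / num_tokens)
```

For example, a mean per-token loss of ln(10) corresponds to a perplexity of 10.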
If this project helps you, please star the repo and cite our paper. Thanks!
@article{DBLP:journals/corr/abs-2304-06875,
  author  = {Yiqun Yao and Yequan Wang},
  title   = {Research without Re-search: Maximal Update Parametrization Yields Accurate Loss Prediction across Scales},
  journal = {CoRR},
  volume  = {abs/2304.06875},
  year    = {2023}
}