DeepMind: Scaling Laws for Training LLMs

1 Main idea

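The authors find that most publicly known LLMs are under-trained. To study the relationship between compute (FLOPs), model size, and the number of training tokens, they train over 400 language models of different sizes on 5B to 500B tokens for different durations, and use the results to derive a scaling law for LLMs. The experiments show that model size and training tokens should be scaled equally: when the model size doubles, the number of training tokens should double as well. Following this scaling law, the authors train Chinchilla, which achieves SOTA results on multiple tasks.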

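Before this work, scaling laws for training large models mainly followed OpenAI's "Scaling Laws for Neural Language Models".
A quick comparison of the two views, for a 10x increase in compute:
OpenAI recommends: scale model size by 5.5x and training tokens by 1.8x
DeepMind recommends: scale model size by 3.09x and training tokens by 3.32x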
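
To make the comparison easier to read, the multipliers above can be converted into power-law exponents. The sketch below assumes the usual form N ∝ C^a and D ∝ C^b, so a 10x increase in compute multiplies model size by 10^a and tokens by 10^b. The function name `implied_exponent` and the dictionary layout are illustrative choices, not taken from either paper; the multipliers are the ones quoted above.

```python
import math


def implied_exponent(multiplier, compute_factor=10.0):
    """Exponent e such that compute_factor ** e == multiplier."""
    return math.log(multiplier) / math.log(compute_factor)


# Multipliers quoted above for a 10x increase in compute.
recommendations = {
    "OpenAI": {"model_size": 5.5, "training_tokens": 1.8},
    "DeepMind": {"model_size": 3.09, "training_tokens": 3.32},
}

for name, rec in recommendations.items():
    a = implied_exponent(rec["model_size"])       # exponent for model size
    b = implied_exponent(rec["training_tokens"])  # exponent for training tokens
    print(f"{name}: a = {a:.2f}, b = {b:.2f}, a + b = {a + b:.2f}")
```

Exponents close to 0.5 for both quantities correspond to the "scale equally" recommendation, while a much larger exponent for model size means most of the extra compute goes into parameters. If compute is roughly proportional to the product of parameters and tokens, a + b should come out close to 1.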

2 Method

2.1 Problem formulation

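The authors frame the research question as follows: given a fixed FLOPs budget, find the combination of model parameter count and training token count that minimizes the final training loss.

$$N_{opt}(C),\ D_{opt}(C) = \underset{N,\, D \ \text{s.t.}\ \text{FLOPs}(N, D) = C}{\arg\min}\ L(N, D)$$

N: the number of model parameters

D: the number of training tokens

L: the final training loss

C: the compute budget (FLOPs)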
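
The constrained minimization described above can be sketched numerically. The code below is only an illustration under two assumptions that are not stated in this section: training compute is approximated as FLOPs ≈ 6 · N · D, and the loss follows a hypothetical parametric form E + A / N^alpha + B / D^beta. The constants are placeholder values chosen for illustration, and the names `loss` and `optimal_allocation` are made up for this sketch.

```python
def loss(n_params, n_tokens, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Hypothetical parametric loss; the constants are illustrative placeholders."""
    return E + A / n_params**alpha + B / n_tokens**beta


def optimal_allocation(flops, n_grid=2000):
    """Grid-search the model size that minimizes the loss under a fixed FLOPs budget.

    Assumes FLOPs ~= 6 * N * D, so picking N determines D = flops / (6 * N).
    Returns a tuple (n_params, n_tokens, loss_value).
    """
    best = None
    for i in range(n_grid):
        # Sweep model sizes log-uniformly from 1e7 to 1e12 parameters.
        log_n = 7.0 + 5.0 * i / (n_grid - 1)
        n_params = 10.0**log_n
        n_tokens = flops / (6.0 * n_params)
        value = loss(n_params, n_tokens)
        if best is None or value < best[2]:
            best = (n_params, n_tokens, value)
    return best


for budget in (1e20, 1e21, 1e22):
    n, d, l = optimal_allocation(budget)
    print(f"C = {budget:.0e}: N ~ {n:.2e}, D ~ {d:.2e}, loss ~ {l:.3f}")
```

With a loss of this shape, a model that is too small is limited by the parameter term and a model that is too large is starved of tokens, so each budget has an interior optimum, and that optimum moves toward larger models and more tokens as the budget grows.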

The experimental variables take the following values:

Variable	Value range
Model size	70M - 16B parameters (400+ language models)
Training tokens	5B - 500B
FLOPs	6e18, 1e19, 3e19, 6e19, 1e20, 3e20, 6e20, 1e21, 3e21
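
To get a feel for how these ranges fit together, the sketch below computes the token count implied by each FLOPs budget for a few model sizes and checks it against the 5B - 500B token range. It assumes the common approximation FLOPs ≈ 6 · N · D, which is not stated in the table above; the three example model sizes and the helper names are hypothetical picks for illustration, not the authors' actual grid.

```python
# FLOPs budgets and token range copied from the table above.
FLOP_BUDGETS = [6e18, 1e19, 3e19, 6e19, 1e20, 3e20, 6e20, 1e21, 3e21]
TOKEN_RANGE = (5e9, 500e9)


def tokens_for_budget(flops, n_params):
    """Training tokens implied by a budget, assuming FLOPs ~= 6 * N * D."""
    return flops / (6.0 * n_params)


def in_token_range(n_tokens):
    """True if the token count falls inside the 5B - 500B range."""
    return TOKEN_RANGE[0] <= n_tokens <= TOKEN_RANGE[1]


# Hypothetical example sizes spanning the 70M - 16B range.
example_sizes = [70e6, 1e9, 16e9]

for budget in FLOP_BUDGETS:
    for n_params in example_sizes:
        n_tokens = tokens_for_budget(budget, n_params)
        note = "in range" if in_token_range(n_tokens) else "outside 5B-500B"
        print(f"C = {budget:.0e}, N = {n_params:.1e}: D ~ {n_tokens:.2e} ({note})")
```

Under this approximation, not every pairing of budget and model size lands inside the token range, which suggests that each budget is only paired with a subset of the model sizes.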

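The authors use the smoothed training loss as the evaluation metric. In their experiments the number of training tokens is smaller than the number of tokens in the full corpus, so the smoothed training loss is an unbiased estimate of the test loss.

Original text: "For simplicity, we perform our analysis on the smoothed training loss which is an unbiased estimate of the test loss, as we are in the infinite data regime (the number of training tokens is less than the number of tokens in the entire corpus)."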