Google has developed and benchmarked Switch Transformers, a technique for training language models with over a trillion parameters. The research team said the 1.6-trillion-parameter model is the largest of its kind and is faster than T5-XXL, the Google model that previously held the title.
According to the researchers, Mixture of Experts (MoE) models can be more effective than other deep learning models, but their adoption has been held back by complexity, limited accessibility, and computational cost. Rather than applying the same parameters to every input, an MoE model selects different parameters for each input. The result is a sparsely activated model with a massive number of parameters, which gives rise to the drawbacks described above.
Google researchers developed Switch Transformers to increase the parameter count while keeping the floating-point operations (FLOPs) per input constant. The model does this by using only a subset of its weights, or parameters, for each input.
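To make the idea concrete, here is a minimal sketch of top-1 routing in plain Python. It is a toy illustration, not Google's implementation: the class `SwitchLayer`, the helper names, and the use of a single weight matrix per expert are all assumptions made for brevity. A small router scores the experts for each token, and only the highest-scoring expert is run.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def random_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class SwitchLayer:
    """Hypothetical toy layer: each token is sent to exactly one expert."""

    def __init__(self, d_model, num_experts, seed=0):
        rng = random.Random(seed)
        # Router: one row of weights per expert.
        self.router = random_matrix(rng, num_experts, d_model)
        # Assumption: each expert is a single d_model x d_model matrix.
        self.experts = [random_matrix(rng, d_model, d_model)
                        for _ in range(num_experts)]

    def forward(self, token):
        probs = softmax(matvec(self.router, token))
        # Top-1 routing: pick the single most probable expert.
        chosen = max(range(len(probs)), key=probs.__getitem__)
        output = matvec(self.experts[chosen], token)
        # Scale by the router probability of the chosen expert.
        return [probs[chosen] * value for value in output], chosen


def expert_parameter_count(layer):
    """Total weights stored across all experts."""
    return sum(len(row) for expert in layer.experts for row in expert)


def expert_multiplies_per_token(layer):
    """Multiplications in the one expert that actually runs for a token."""
    expert = layer.experts[0]
    return len(expert) * len(expert[0])
```

In this sketch, going from 2 experts to 16 multiplies the stored expert weights by eight, while the expert work done for any one token stays the same. The router itself does grow slightly with the number of experts, but it is small compared with the experts.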
The Switch Transformer is based on the T5-Base and T5-Large models. In T5 (introduced by Google in 2019), all NLP tasks are unified into a text-to-text format in which both the input and the output are always text strings.
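As a rough illustration of the text-to-text idea, the sketch below casts two different tasks into the same string-in, string-out shape. The function `to_text_to_text` is hypothetical, and the task prefixes and sentences are illustrative examples in the style used by T5, not a complete or authoritative list.

```python
def to_text_to_text(task_prefix, text, target):
    """Cast any task as a pair of plain strings (hypothetical helper)."""
    return {
        "input": f"{task_prefix}: {text}",
        # Even labels or numbers are written out as text.
        "target": str(target),
    }


examples = [
    # Translation: the target is the translated sentence.
    to_text_to_text("translate English to German", "That is good.", "Das ist gut."),
    # Classification: the target is the label itself, spelled out as text.
    to_text_to_text("cola sentence", "The course is jumping well.", "not acceptable"),
]

for example in examples:
    print(example["input"], "->", example["target"])
```

Because every task has the same shape, one model and one training objective can in principle cover all of them.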
In addition to building on the T5 models, Switch Transformers are designed to run on hardware that was originally built for dense matrix multiplication and is widely used to train language models, such as GPUs and TPUs.
The researchers set up distributed training for the experiment, with the models splitting unique weights across different devices. While the total number of weights increases in proportion to the number of devices, the memory and computational footprint of each device remains manageable.
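The arithmetic behind that claim can be sketched in a few lines. The round-robin placement below and the figure of 1,000,000 weights per expert are assumptions chosen for illustration; the actual placement scheme used in the experiments is not described here.

```python
def shard_experts(num_experts, num_devices):
    """Assign experts to devices round-robin (an assumed placement scheme)."""
    placement = {device: [] for device in range(num_devices)}
    for expert in range(num_experts):
        placement[expert % num_devices].append(expert)
    return placement


def weights_on_busiest_device(num_experts, num_devices, weights_per_expert):
    """Expert weights held by the most heavily loaded device."""
    placement = shard_experts(num_experts, num_devices)
    return max(len(experts) for experts in placement.values()) * weights_per_expert


WEIGHTS_PER_EXPERT = 1_000_000  # hypothetical size, for illustration only

for devices in (8, 16, 32):
    experts = devices  # one expert per device
    total = experts * WEIGHTS_PER_EXPERT
    per_device = weights_on_busiest_device(experts, devices, WEIGHTS_PER_EXPERT)
    print(f"{devices} devices: {total} expert weights in total, {per_device} per device")
```

With one expert per device, the total grows with the number of devices while each device still holds a single expert's weights.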
Using 32 TPUs, the Switch Transformer models were pre-trained on the Colossal Clean Crawled Corpus, a 750 GB dataset of text snippets drawn from sources including Reddit and Wikipedia. For the experiment, the models had to predict missing words in passages where 15% of the words had been masked. Other challenges included language translation and answering a series of tough questions.
Overall, this new model from Google looks extremely promising for the future of the technology.