
This month, Google forced out a top AI ethics researcher after she voiced frustration with the company for making her withdraw a research paper. The paper pointed out the risks of language-processing artificial intelligence, the type used in Google Search and other text analysis products.
Among the risks is the large carbon footprint of developing this kind of AI technology. By some estimates, training an AI model generates as much carbon as building and driving five cars over their lifetimes.
I am a researcher who studies and develops AI models, and I am all too familiar with the skyrocketing energy and financial costs of AI research. Why have AI models become so power hungry, and how are they different from traditional data center computing?
Today’s training is inefficient
Traditional data processing jobs done in data centers include video streaming, email and social media. AI is more compute-intensive because it needs to read through lots of data until it learns to understand it – that is, until it is trained.
This training is very inefficient compared to how people learn. Modern AI uses artificial neural networks, which are mathematical computations that mimic neurons in the human brain. The strength of the connection of each neuron to its neighbor is a parameter of the network called a weight. To learn to understand language, the network starts with random weights and adjusts them until the output agrees with the correct answer.
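The adjust-until-correct loop can be sketched with a toy example – a single "neuron" with one weight, nudged toward the right answer step by step. This is purely illustrative; real language models have billions of weights and use more sophisticated optimizers:

```python
# Toy sketch of training: one weight, repeatedly nudged so that
# predictions move closer to the correct answers (gradient descent).
def train_weight(inputs, targets, lr=0.1, steps=100):
    w = 0.5  # start from an arbitrary weight, standing in for "random"
    for _ in range(steps):
        for x, y in zip(inputs, targets):
            pred = w * x          # the network's current guess
            error = pred - y      # how far off the guess is
            w -= lr * error * x   # adjust the weight to shrink the error
    return w

# The hidden rule here is y = 2x; training should discover weight 2.0.
w = train_weight([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(round(w, 3))  # → 2.0
```

After enough rounds of adjustment, the weight settles on the value that makes the outputs agree with the answers – the same idea, at tiny scale, as training a language network.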
A common way of training a language network is by feeding it lots of text from websites like Wikipedia and news outlets with some of the words masked out, and asking it to guess the masked words. An example is "my dog is cute," with the word "cute" masked. Initially, the model gets them all wrong, but after many rounds of adjustment, the connection weights start to change and pick up patterns in the data. The network eventually becomes accurate.
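The masking step itself is simple. Here is a minimal sketch of how a training example like "my dog is cute" might be prepared – hide one word, and the hidden word becomes the answer the model must guess (the function name and mask token are illustrative, not BERT's actual implementation):

```python
import random

def mask_sentence(words, mask_token="[MASK]"):
    """Hide one randomly chosen word; return the masked sentence
    and the hidden word the model must learn to predict."""
    i = random.randrange(len(words))
    target = words[i]
    masked = words[:i] + [mask_token] + words[i + 1:]
    return masked, target

masked, target = mask_sentence(["my", "dog", "is", "cute"])
print(masked, "->", target)
```

Run on "my dog is cute," this might produce `['my', 'dog', 'is', '[MASK]'] -> cute`. A model like BERT sees billions of such examples, adjusting its weights every time it guesses wrong.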
One recent model, called Bidirectional Encoder Representations from Transformers (BERT), used 3.3 billion words from English books and Wikipedia articles. Moreover, during training BERT read this data set not once, but 40 times. To compare, an average child learning to talk might hear 45 million words by age five, 3,000 times fewer than BERT.
Looking for the right structure
What makes language models even more costly to build is that this training process happens many times during the course of development. This is because researchers want to find the best structure for the network – how many neurons, how many connections between neurons, how fast the parameters should change during learning, and so on. The more combinations they try, the better the chance the network achieves high accuracy. Human brains, by contrast, do not need to find an optimal structure – they come with a prebuilt structure that has been honed by evolution.
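To see why this search multiplies the cost, consider a tiny hypothetical search space (the values below are made up for illustration). Every combination of choices means one full training run from scratch:

```python
from itertools import product

# Hypothetical search space for network structure and training speed.
hidden_sizes = [128, 256, 512]      # how many neurons per layer
num_layers = [2, 4, 8]              # how deep the network is
learning_rates = [1e-3, 1e-4]       # how fast weights change

# Each combination is a separate, complete training run.
configs = list(product(hidden_sizes, num_layers, learning_rates))
print(len(configs))  # → 18
```

Even this toy grid requires 18 full training runs; real searches over larger spaces can run into the thousands, which is how a 1% accuracy gain ends up costing thousands of trainings.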
With companies and academics competing in the AI space, there is pressure to improve on the state of the art. Even achieving a 1% improvement in accuracy on difficult tasks like machine translation is considered significant and leads to good publicity and better products. But to get that 1% improvement, one researcher might train the model thousands of times, each time with a different structure, until the best one is found.
Researchers at the University of Massachusetts Amherst estimated the energy cost of developing AI language models by measuring the power consumption of common hardware used during training. They found that training BERT once has the carbon footprint of a passenger flying a round trip between New York and San Francisco. However, by searching over different structures – that is, training the algorithm multiple times on the data with slightly different numbers of neurons, connections and other parameters – the cost became the equivalent of 315 passengers, or an entire 747 jet.
Bigger and hotter
AI models are also much bigger than they need to be, and growing bigger each year. A more recent language model similar to BERT, called GPT-2, has 1.5 billion weights in its network. GPT-3, which created a stir this year because of its high accuracy, has 175 billion weights.
Researchers have found that having bigger networks leads to better accuracy, even if only a tiny fraction of the network ends up being useful. Something similar happens in children's brains when neuronal connections are first added and then pruned, but the biological brain is much more energy efficient than computers.
AI models are trained on specialized hardware, such as graphics processing units (GPUs), which draw more power than traditional CPUs. If you own a gaming laptop, it probably has one of these GPUs to create advanced graphics for, say, playing Minecraft RTX. You may also notice that such laptops generate a lot more heat than ordinary laptops.
All of this means that developing advanced AI models adds up to a large carbon footprint. Unless we switch to 100% renewable energy sources, AI progress may stand at odds with the goals of cutting greenhouse gas emissions and slowing climate change. The financial cost of development is also becoming so high that only a few select labs can afford it, and they will be the ones setting the agenda for what kinds of AI models get developed.
Doing more with less
What does this mean for the future of AI research? Things may not be as bleak as they look. The cost of training might come down as more efficient training methods are invented. Similarly, while data center energy use was predicted to explode in recent years, this has not happened, thanks to improvements in data center efficiency, more efficient hardware and cooling.
There is also a trade-off between the cost of training the models and the cost of using them, so spending more energy at training time to come up with a smaller model might actually make using it cheaper. Because a model will be used many times over its lifetime, that can add up to large energy savings.
In my lab's research, we have been looking at ways to make AI models smaller by sharing weights, or using the same weights in multiple parts of the network. We call these shapeshifter networks because a small set of weights can be reconfigured into a larger network of any shape or structure. Other researchers have shown that weight sharing achieves higher performance in the same amount of training time.
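The saving from weight sharing is easy to quantify with a back-of-the-envelope sketch. Assuming, for illustration, a network of 12 layers each holding a 768×768 weight matrix (a BERT-like size; the function below is a hypothetical counting helper, not the shapeshifter method itself):

```python
# Rough count of distinct stored weights, with and without sharing
# one weight matrix across all layers (illustrative arithmetic only).
def param_count(layers, width, shared):
    per_layer = width * width          # one square weight matrix per layer
    return per_layer if shared else per_layer * layers

layers, width = 12, 768
print(param_count(layers, width, shared=False))  # → 7077888 distinct weights
print(param_count(layers, width, shared=True))   # → 589824, the same matrix reused
```

Storing one matrix instead of twelve cuts the distinct weights twelvefold in this sketch, which is the basic intuition behind reusing weights across parts of a network.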
Looking forward, the AI community should invest more in developing energy-efficient training schemes. Otherwise, it risks having AI become dominated by a select few who can afford to set the agenda, including what kinds of models are developed, what kinds of data are used to train them and what the models are used for.
This story originally appeared in The Conversation.