
The Physics of AI: Compute, Power, and Reality

Author: Mohit · January 10, 2026

From GPUs to gigawatts, how intelligence became an energy problem.

We all know that frontier AI models like GPT, Grok, Claude, and Gemini run inside massive data centers around the world. And they all have one thing in common.

Power. A lot of power.

So the real question is not how smart the model is. The real question is: what does it actually cost to train a large language model in terms of electricity?

Most companies do not publish exact energy numbers. So the best approach is to estimate using a model we roughly understand. The most famous example is OpenAI's GPT-4.

What does it take to train one giant LLM?

GPT-4 is estimated to be roughly a 1.7 trillion parameter model. It is believed to have been trained on around 13 trillion tokens of data and required something close to 20 septillion floating point operations. That number is enormous — it simply means a ridiculous amount of math.

So how do you perform that much computation? The answer is GPUs. A lot of GPUs.

Estimates suggest OpenAI used up to 25,000 Nvidia A100 GPUs and training took roughly three months. Each A100 GPU can consume around 400 watts of power, and when you scale that across tens of thousands of GPUs, energy consumption rises extremely quickly.

25,000 GPUs × 400W = 10 MW
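That back-of-envelope estimate can be written out as a quick script. Both inputs are estimates from public reporting, not official figures:

```python
# Rough power draw of the estimated GPT-4 training cluster.
num_gpus = 25_000      # estimated cluster size (not confirmed by OpenAI)
watts_per_gpu = 400    # approximate A100 draw under load

total_watts = num_gpus * watts_per_gpu
print(f"{total_watts / 1e6:.1f} MW")  # 10.0 MW
```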

But stacking GPUs together is not enough. The real challenge is parallelization — ensuring all GPUs work efficiently instead of sitting idle.

What GPUs are really doing in LLM training

The core workload during training is matrix multiplication. Multiplying two dense matrices of size 100,000 by 100,000 costs about 2n³ operations, which works out to roughly 2 quadrillion floating point operations.

Running this on a single GPU would take an extremely long time, so the work must be distributed across many GPUs.
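To see why a single GPU is nowhere near enough, divide the estimated total training compute by one GPU's peak throughput. Both figures are estimates, and real utilization runs well below peak, so the true number would be even worse:

```python
# How long GPT-4's estimated training compute would take on one GPU.
total_flops = 2e25           # ~20 septillion FLOPs (estimate)
a100_flops_per_s = 312e12    # A100 peak dense BF16 tensor throughput

seconds = total_flops / a100_flops_per_s
years = seconds / (365 * 24 * 3600)
print(f"~{years:,.0f} years on a single A100 at peak")  # roughly two thousand years
```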

However, simply grouping 25,000 GPUs together in one cluster does not work efficiently because networking and communication quickly become the bottleneck.

To solve this, the industry relies on several parallelization techniques.

Tensor parallelization

Modern Nvidia GPUs are typically grouped in sets of eight. For example, eight A100 GPUs can be placed in a single system called an Nvidia HGX server.

You can think of an HGX system as a box containing eight GPUs working together.

An HGX A100 server can consume roughly three to six and a half kilowatts of power depending on load. Newer systems such as the HGX H100 or Blackwell-based HGX B200 can draw even more than 10 kW, but they are significantly faster and more efficient.

Within an HGX server, tensor operations can be parallelized across the eight GPUs using a high-speed interconnect called NVLink.

NVLink allows GPUs to share data extremely quickly, making distributed matrix operations possible.
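The idea can be sketched in plain Python with a toy column split: each "GPU" holds one shard of the weight matrix and computes its slice of the output independently, and the slices are then concatenated. This is an illustration only; real tensor parallelism shards the work across physical GPUs and gathers results over NVLink:

```python
# Toy tensor parallelism: split a weight matrix column-wise across "devices".

def matmul(A, B):
    """Naive dense matrix multiply (list-of-rows representation)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

x = [[1.0, 2.0], [3.0, 4.0]]                       # activations, 2x2
W = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 2.0]]   # weights, 2x4

# Split W into two column shards, one per "GPU".
W0 = [row[:2] for row in W]
W1 = [row[2:] for row in W]

# Each device computes its slice of the output independently...
y0 = matmul(x, W0)
y1 = matmul(x, W1)

# ...then the slices are concatenated (an all-gather in real hardware).
y = [r0 + r1 for r0, r1 in zip(y0, y1)]
assert y == matmul(x, W)  # matches the unsharded result
```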

Pipeline parallelization and the interconnect problem

When multiple HGX servers are connected together, communication between servers becomes slower than communication within a server. This creates an interconnect bottleneck.

To overcome this limitation, large models use pipeline parallelization.

Instead of splitting individual tensor operations, we divide the model architecture itself. GPT-4 is believed to contain around 120 neural network layers.

These layers can be split into pipeline stages. For example, the model could be divided into 15 stages, each running on one server with 8 GPUs.

15 stages × 8 GPUs = 120 GPUs

That means one full model training instance could run on around 120 GPUs.

Data parallelization

Once the model architecture is parallelized, the next step is data parallelism.

Multiple copies of the model are trained simultaneously, each processing different batches of training data. After each step, gradients are synchronized across all replicas.

25,000 GPUs ÷ 120 = ~208 parallel replicas

This is how companies fully utilize massive GPU clusters.
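The synchronization step boils down to averaging gradients across replicas after every batch. The sketch below uses made-up gradient values, and `all_reduce_mean` is an illustrative stand-in for the all-reduce collective that real systems run over the interconnect:

```python
# Toy data parallelism: each replica computes gradients on its own batch,
# then the gradients are averaged (an all-reduce) before each weight update.

def all_reduce_mean(grads_per_replica):
    """Average corresponding gradient entries across all replicas."""
    n = len(grads_per_replica)
    return [sum(vals) / n for vals in zip(*grads_per_replica)]

# Made-up gradients for a 3-parameter model on 3 replicas.
replica_grads = [
    [0.2, -0.4, 0.1],   # replica 0's batch
    [0.4, -0.2, 0.3],   # replica 1's batch
    [0.0, -0.6, 0.2],   # replica 2's batch
]

avg = all_reduce_mean(replica_grads)

# Every replica applies the same averaged update, keeping weights in sync.
lr = 0.1
weights = [1.0, 1.0, 1.0]
weights = [w - lr * g for w, g in zip(weights, avg)]
print([round(g, 3) for g in avg])
```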

GPT-4 training energy demand

25,000 GPUs divided into groups of eight per HGX server works out to 3,125 servers.

E = 6.5 kW × 3,125 servers × 90 days × 24 hrs/day
E = 6.5 × 3,125 × 2,160 hours ≈ 43.9 GWh
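The same arithmetic as a script. Server power, server count, and training duration are all the estimates used above:

```python
# Estimated GPT-4 training energy, using the article's assumptions.
kw_per_server = 6.5     # high-end HGX A100 server draw
servers = 25_000 // 8   # 3,125 HGX servers
hours = 90 * 24         # roughly three months of training

energy_kwh = kw_per_server * servers * hours
print(f"{energy_kwh / 1e6:.1f} GWh")  # 43.9 GWh
```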

That is roughly the same electricity consumption as a small city of about 50,000 people for an entire month.

Deployed model energy demand is even larger

Training happens once. Deployment happens continuously.

Once a model is deployed and millions of users interact with it daily, inference becomes the real energy challenge.

Estimates suggest OpenAI processes more than 2.5 billion prompts per day, with each query consuming roughly 0.3 watt-hours.

2.5 × 10⁹ prompts × 0.3 Wh = 750 × 10⁶ Wh
= 750 MWh per day

Over a 90-day period this equals about 67.5 gigawatt-hours — exceeding the energy used to train the model itself.

Training (90 days) = 44 GWh

Serving (90 days) = 750 MWh/day × 90 days = 67,500 MWh = 67.5 GWh
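The inference-side comparison works out the same way. Prompt volume and per-query energy are rough public estimates, not disclosed figures:

```python
# Estimated serving energy over the same 90-day window as training.
prompts_per_day = 2.5e9   # estimated daily prompt volume
wh_per_prompt = 0.3       # estimated energy per query

daily_mwh = prompts_per_day * wh_per_prompt / 1e6   # 750 MWh/day
serving_gwh_90d = daily_mwh * 90 / 1000             # 67.5 GWh
training_gwh = 43.9

print(f"serving: {serving_gwh_90d:.1f} GWh vs training: {training_gwh} GWh")
```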

Cooling and PUE

Data centers also require cooling systems to prevent hardware overheating.

Cooling overhead is measured using Power Usage Effectiveness, or PUE.

PUE = Total Facility Power / IT Equipment Power

If PUE is 1.2, then every watt of computing power actually requires 1.2 watts of total facility power.

67.5 GWh × 1.2 = 81 GWh (with cooling)
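Applied to the 90-day serving estimate, the overhead looks like this. A PUE of 1.2 is a typical modern hyperscale figure; many facilities run higher:

```python
# Total facility energy = IT equipment energy * PUE.
it_energy_gwh = 67.5   # 90-day serving estimate
pue = 1.2              # assumed hyperscale Power Usage Effectiveness

facility_gwh = it_energy_gwh * pue
print(f"{facility_gwh:.0f} GWh including cooling overhead")  # 81 GWh
```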

Future energy demand

In 2023, data centers consumed roughly 176 terawatt-hours of electricity in the United States, representing around 4.4 percent of total electricity usage.

By 2030 this number could reach between 8 and 10 percent, with artificial intelligence being a major driver.

Why AI companies are literally building power plants

Major AI companies are now investing directly in energy infrastructure.

xAI purchased natural gas turbines to power its Colossus facility in Memphis. OpenAI is exploring massive Stargate facilities capable of delivering multiple gigawatts of power. Meta and Google are expanding their own power generation and hyperscale data centers.

China's energy supply situation

China's energy infrastructure expansion is coordinated at a national level.

By the end of 2023, China had installed over 609 gigawatts of solar capacity and 441 gigawatts of wind capacity, with dozens of nuclear reactors under construction.

This centralized planning approach allows energy capacity to scale rapidly alongside computing infrastructure.

The bigger picture

Energy is becoming one of the most critical constraints in the development of artificial intelligence.

Newer hardware such as Nvidia's GB200, along with techniques like mixture-of-experts architectures, quantization, and speculative decoding, can reduce energy costs.

But ultimately, scaling AI requires scaling energy generation, power grids, and cooling technologies.

The countries and companies that can expand both compute infrastructure and energy capacity together will likely gain the biggest advantage in the global AI race.

Artificial intelligence may be powered by algorithms, but its real fuel is electricity.