Profits in AI


Edited by Dan

AI is all the rage these days. A short prompt synthesizes whole essays, images, and even code. It honestly feels a bit like magic.

Where there’s magic, there’s also money. Startups and venture capitalists are spending incredible sums to develop the next generation of machine learning models. It’s a rising tide in AI - companies that so much as touch AI see a big bump in their stock price and valuation.

Yet, amid all the excitement, there’s been little discussion of where the profits are in the AI industry. Or, put another way: who makes the most money in the AI wave?

The AI Layers

Broadly speaking, there are five different layers in the AI industry.

Semiconductors

Cloud Platforms

Data Labeling

Research

Applications

Cloud Platforms Win

Of these five layers, cloud platforms are best positioned to capture profits from the AI wave.

AWS alone generated $62.2 billion in revenue in 2021, with $18.5 billion in operating profit. To put that in perspective, Nvidia’s 2021 revenue was $16.7 billion.

Further, most cloud platforms are part of massive companies - AWS is a part of Amazon, Azure is part of Microsoft, and GCP is part of Google. With scale and capital like that, it’s easy to strongarm other layers in the AI industry.

Semiconductor companies are reliant on cloud platforms for business growth. Data center sales are Nvidia’s largest revenue segment, and Nvidia can’t afford to alienate cloud platforms as they continue to displace traditional on-premise servers.

Data labeling companies have low system power within the AI wave - anyone can collect, clean, and label data (to a certain extent). Most of the data labeling companies also use cloud platforms to store and run operations.

Research companies are also increasingly reliant on cloud platforms for the compute necessary to train models. This presents an opportunity for cloud platforms to leverage their capital and invest or acquire AI startups.

Microsoft invested $1 billion into OpenAI, so Microsoft and Azure get access to OpenAI products before anyone else. Google acquired DeepMind, the creators of the first computer program to defeat a professional human Go player.

Application companies primarily host their infrastructure on cloud platforms. Further, most application companies are beholden to the research layer for production models. As a result, they also possess relatively little system power.

Long-term, cloud platforms are simply too dominant, too big, and too essential to lose.

Semiconductors

Semiconductors sit at the foundation of the AI industry. Even so, I’m not confident that semiconductor companies will be the main beneficiaries of AI.

That’s because there’s no single dominant semiconductor company.

To understand the industry, let’s look at one company in particular, Intel.

Intel is a semiconductor powerhouse famous for vertical integration. Everything from chip design to production is done internally. This leads to incredible margins on core products, yielding roughly $20 billion per year in net income over the past five years.

However, these margins come with a cost: rigidity.

Despite being the leading chipmaker when smartphones first came out (it produced the first chips for BlackBerry), Intel never grew into a major mobile chip provider. Instead, iPhones today use Apple’s in-house A-series chips, while Android phones primarily run on a mix of Qualcomm Snapdragon and MediaTek Helio chips.

In the GPU market, Intel wreaked havoc with integrated graphics. Despite that, Nvidia held on and found new footing. Nvidia’s market cap surpassed Intel’s in 2020, and Nvidia is the largest semiconductor design company in the world today.

This new breed of semiconductor companies unseated Intel from its throne, at least in part, due to their focus on chip research and design.

Rather than invest billions in manufacturing capabilities, Nvidia and Qualcomm outsource chip production to semiconductor foundries like TSMC and GlobalFoundries (spun out from AMD).

The sheer scale of foundries has changed the semiconductor landscape. Today, even giants like Intel are forced to work with foundries to manufacture their cutting-edge chips. As a result, we can broadly divide the semiconductor industry into two distinct segments: chip designers like Nvidia and foundry companies like TSMC.

With context in place, let’s talk about semiconductors as they apply to AI.

It all started with Todd Martínez, a researcher at Stanford, going to Fry’s Electronics and buying PlayStation 2s (sorry, gamers) to run computational chemistry workloads. [0]

Because the computation involved in training machine learning models is fairly basic, virtually any CPU or GPU is capable of the task. GPUs, it turned out, were particularly well-adapted to parallelizing the computations required to train a machine learning model. They were so good, in fact, that it became more cost-effective to cluster a bunch of GPUs than to use supercomputers.
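
The intuition is easy to demonstrate. Here’s a minimal sketch (assuming PyTorch and a CUDA-capable GPU are available; the matrix size is arbitrary) that times the same matrix multiplication - the core operation in neural network training - on a CPU and a GPU:

```python
import time

import torch

def time_matmul(device: str, n: int = 4096) -> float:
    """Time one n x n matrix multiplication on the given device."""
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    if device == "cuda":
        torch.cuda.synchronize()  # finish setup before starting the clock
    start = time.perf_counter()
    _ = a @ b
    if device == "cuda":
        torch.cuda.synchronize()  # wait for the GPU kernel to actually complete
    return time.perf_counter() - start

print(f"CPU: {time_matmul('cpu'):.4f}s")
if torch.cuda.is_available():
    print(f"GPU: {time_matmul('cuda'):.4f}s")
```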

Catching onto the trend, companies like Nvidia started up dedicated divisions to build AI-focused chips that dramatically increase computational throughput.

The optimization race has been of particular benefit to cloud providers.

After all, AWS doesn’t really mind whether it buys 10,000 Nvidia chips or 10,000 Intel chips. All it cares about is the highest compute-to-cost ratio - especially since chips are generally one-time costs. Companies pay for the chip up-front and realize returns over time.
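
As a rough illustration of that ratio, a back-of-the-envelope amortization might look like the following (all figures are made-up placeholders, not real chip specs):

```python
def cost_per_pflop_hour(price: float, tflops: float, life_years: float, utilization: float) -> float:
    """Amortized dollar cost per petaFLOP-hour of compute actually delivered."""
    hours = life_years * 365 * 24 * utilization
    pflop_hours = (tflops / 1000) * hours
    return price / pflop_hours

# Two hypothetical accelerators: the slower, cheaper chip can still win on this metric.
print(cost_per_pflop_hour(price=10_000, tflops=300, life_years=4, utilization=0.6))  # ~1.59
print(cost_per_pflop_hour(price=6_000,  tflops=200, life_years=4, utilization=0.6))  # ~1.43
```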

This raises the question: how do chipmakers capture profit from AI?

One way is to develop a full-stack suite of software specifically for AI, a strategy championed by Nvidia.

In the words of Nvidia CEO, Jensen Huang:

So deep learning needed a brand new stack, it just so happened that the best processor for deep learning at the time, ten years ago, was one of Nvidia’s GPUs. Well, over the years, in the last ten years, we have reinvented the GPU with this thing called Tensor Core, where the Tensor Core GPU is a thousand times better at doing deep learning than our original GPU, and so it grew out of that. But in the process, we essentially built up the entire stack of computer science, the computing, again, new processor, new compiler, new engine and new framework — and the framework for AI, of course, PyTorch and TensorFlow.

By investing heavily into software development and customer education, Nvidia built a point of differentiation against other semiconductor design companies.

Their bet has paid off - Nvidia’s Data Center division has been growing at breakneck pace, largely thanks to AI chip sales.

But differentiation through software still isn’t enough to seize complete control of the semiconductor layer. Design companies are forced to rely on foundries to manufacture chips.

Foundries have near-monopolies on chip fabrication. For a company like Nvidia to build their own fab would require billions in low-margin capital investments.

Foundries are able to leverage revenue from their dozens (or hundreds) of semiconductor customers and invest billions into semiconductor production facilities (fabs). TSMC recently committed to spending $100 billion in capital investments over a span of three years.

And yet… even foundries have a hard time capturing profit in AI.

The fundamental problem lies in the fact that foundries have high fixed costs and low variable costs: billions to build a fab, but next to nothing to produce each additional wafer of chips. The natural incentive for foundries is to squeeze every penny of profit out of a fab after it’s been constructed. In other words, they’ll keep running fabs until variable costs outweigh the revenue.
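
In spreadsheet terms, the decision rule is simple. A sketch with purely illustrative numbers (the point is the asymmetry, not the values):

```python
fab_construction_cost = 20_000_000_000   # already spent: a sunk cost
price_per_wafer = 9_000                  # what customers pay per wafer
variable_cost_per_wafer = 3_000          # materials, chemicals, power, labor

# Rational rule once the fab exists: keep producing as long as each wafer
# covers its own variable cost, regardless of the sunk construction cost.
keep_running = price_per_wafer > variable_cost_per_wafer
contribution_per_wafer = price_per_wafer - variable_cost_per_wafer

print(keep_running)            # True
print(contribution_per_wafer)  # 6000 per wafer toward recouping the fab
```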

In applications like smartphones, even a 10% improvement is meaningful enough to make a difference in consumer behavior. For example, 3nm chips are roughly 10-15% more efficient than 5nm chips.

But in AI research, the key metric is cost-to-compute ratio, and that’s where foundries run into issues.

These cutting-edge chips offer only a marginal boost in compute at drastically higher costs. Building a new state-of-the-art facility requires billions in investment, while running an existing fab is almost all profit.

So if TSMC were the only company with fabs, it could unilaterally extract profit at the semiconductor level. After all, the design companies are unlikely to sink billions into building their own fabs. Unfortunately for TSMC, other fab players include Intel, GlobalFoundries, Samsung, and SMIC. Their fabs might not be as cutting-edge as TSMC’s, but they are just as competitive on pricing.

As a result, the semiconductor layer is a map of fiefdoms. Design companies like Nvidia are beholden to foundries. And foundries are unable to extract profit due to their unique business model.

The semiconductor layer lacks the differentiation needed to retain profit. It will be profitable in the AI wave, but it won’t be the primary beneficiary.

Cloud Platforms

What do these cloud platforms actually do?

AWS and Azure wrap themselves in marketing talk.

Azure: “Achieve your goals with the freedom and flexibility to build, manage, and deploy your applications anywhere.”

And AWS: “the world’s most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally.”

Thankfully, GCP has a slightly better explanation of what cloud platforms are:

Google Cloud consists of a set of physical assets, such as computers and hard disk drives, and virtual resources, such as virtual machines (VMs), that are contained in Google’s data centers around the globe.

To oversimplify things, a cloud platform is a computer in some remote location. You pay to access a portion of the computer’s resources for a set amount of time.

It’s a powerful business. Combined, the top three cloud platforms make over $100 billion per year.

AWS launched in 2002, Azure in 2010, and GCP in 2011.

This was perfect timing. AI research companies like DeepMind and OpenAI popped up around the same time, in 2010 and 2015 respectively.

This coincidence meant that AI research companies grew up in an environment where cloud platforms were the default choice for startups. After all, why plunk down millions for servers when you can access one in the cloud for just pennies?

These research companies were the perfect customer for the cloud platform business model. Platforms draw in newly formed AI research companies, leverage scale advantages to keep costs low, and reap the benefits as their customers scale up.

And boy, do research companies scale up.

OpenAI published GPT-1 with 117M parameters. GPT-2: 1.5B parameters. GPT-3: a whopping 175B parameters.

Even more staggering is how fast this happened - GPT-1 was published in 2018 while GPT-3 came out in 2020. In the span of just two years, parameter counts grew over 1,000x - with a corresponding explosion in training compute.

Unfortunately, there’s a limit to how far parameter count alone improves model performance. Megatron-Turing NLG had 530B parameters without meaningful improvements over GPT-3.

However, this doesn’t mean cloud platforms will lose their hold on the market. A limit to parameter size simply shifts the AI research battlefield to a game of timelines.

If a research company can ship their model to market 2-3 months ahead of competitors, then they’ve essentially “won”.

Stability AI publicly released Stable Diffusion, an AI image generation model, on August 22nd, 2022. Overnight, they took the world by storm, and weeks later they raised a $101 million seed round.

By being the first widely available, high-quality image generation tool, they received all the press and early users.

On the other hand, OpenAI’s DALL-E (another image generation tool) had been in a waitlisted beta since July 2022.

When it finally opened to everyone on September 28th, it received almost no attention. One Hacker News comment says it best:

It’s really amazing how DALL-E missed the boat. When it was launched, it was a truly amazing service that had no equal. In the months since then, both Midjourney and Stable Diffusion emerged and got to the point where they produce images of equal or better quality than DALL-E. And you didn’t have to wait in a long waitlist in order to gain access! They effectively gave these tools free exposure by not allowing people to use DALL-E. (johnfn)

Speed matters.

The new question for research companies becomes - how do you get speed?

The simple answer: get the fastest equipment.

The ability to ideate and execute on a model in minutes rather than days becomes the ultimate advantage, given enough cycles.

But it’s cost-prohibitive for research companies to buy and maintain fleets of the best machine learning chips. This “burden” falls on cloud providers, who enthusiastically leverage their capital to purchase the best chips and lock customers into their platforms.

What’s more, cloud platforms can use their capital advantage and large order volumes to bully their suppliers - semiconductor companies.

Previously, we talked about cloud platforms having the best machine learning chips. Well, the definition of “best” is subjective, depending on which platform you ask. GCP uses Tensor Processing Unit (TPU) chips that Google designed internally. Azure uses a mixture of Intel and Xilinx field-programmable gate arrays (FPGAs). AWS primarily uses Nvidia’s GPUs, with some workloads shouldered by its own Trainium chips and Intel’s Habana accelerators.

It’s pretty messy.

At a high level, this chaos means that no single semiconductor company is a dominant supplier to cloud platforms. Cloud platforms can pit competing semiconductor houses against each other to drive prices down.

This extends even to foundries. Google’s TPUs are built on an older 7nm process - marginally less compute, at far more attractive prices. In contrast, the iPhone 14 Pro that anyone can buy ships with a 4nm chip.

Cloud platforms are also diversified in other technical decisions.

GCP is known for its TPU, an application-specific integrated circuit (ASIC) that Google designed internally. In general, ASIC chips are designed to perform a small set of computations very well. For example, the TPU is optimized specifically for TensorFlow, Google’s open-source machine learning framework.
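
That coupling shows up directly in how you use the chip. Here’s a minimal sketch of the standard TensorFlow pattern for attaching to a Cloud TPU (details vary by TensorFlow version and environment; in a managed setting like a Cloud TPU VM the resolver can usually discover the device automatically):

```python
import tensorflow as tf

# Attach to the TPU and initialize it.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

# Anything built inside the strategy scope is compiled for and replicated
# across the TPU cores - the chip is reached through TensorFlow's stack.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )
```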

The benefits of ASICs are speed and cost. ASIC chips are powerful in their specific domain, relatively cheap to manufacture, and low in energy consumption.

The tradeoff is a lack of flexibility. Developing a specialized ASIC chip requires heavy up-front investment in both money and time. Then once developed, the chip needs to generate sufficient returns to justify the investment before it becomes obsolete (2-3 years).

For a special chip like the TPU, Google might only manufacture thousands of each generation since the chip is limited to the GCP platform.

Computational evolution presents another problem for ASICs. If AI training algorithms evolve significantly after the chip is designed (say, a matrix multiplication algorithm that needs 47 steps instead of 49), these chips can be rendered ineffective.

With this ASIC obsolescence problem in mind, Microsoft instead chose an FPGA-based strategy. FPGAs are known for their versatility - the hardware can be reconfigured after it’s deployed. More specifically, FPGAs can customize the level of numerical precision required for a particular layer in a deep neural network.

In their paper [1], Microsoft details how, “FPGAs accelerate both compute workloads, such as Bing web search ranking, and Azure infrastructure workloads, such as software-defined networks and network crypto.”

Having already built its (non-AI) cloud infrastructure on FPGAs, Azure can easily repurpose its Intel and Xilinx FPGAs between general cloud tasks and AI compute as needed.

The ASIC-versus-FPGA debate closely mirrors the RISC-versus-CISC debate of the early 90s. Intel focused on CISC, leading to decades of incredible growth. ARM and others chose RISC, which has become a cornerstone architecture for mobile computing.

Regardless of their choice of ASIC or FPGA, cloud platforms always have the luxury of buying (or building) the best.

They’ll continue to hold a dominant position and capture profit from AI innovations.

Data Labeling

Early machine learning models required heavily sanitized datasets which were manually collected and labeled by grad students.

As dataset size increased, having a bunch of grad students spending their nights and weekends labeling data simply didn’t scale.

Today, most AI companies have evolved to using third-party services such as Scale AI [2], Appen, Hive, Labelbox, Upwork, and Snorkel AI to process and label data.

In the case of 2D images, labels can include bounding boxes (2D boxes around objects), cuboids (3D boxes around objects), and semantic segmentation (assigning each region of the image to a category). Data collection and sanitization generally falls on the research company - especially given that data labeling companies charge on a per-task basis.
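
To make that concrete, a labeled image is usually delivered as a structured record alongside the raw pixels. A minimal sketch of a single bounding-box annotation (the field names are illustrative, not any particular vendor’s schema):

```python
# Illustrative annotation record; real schemas (COCO, vendor-specific, etc.) vary.
annotation = {
    "image_id": "img_000123.jpg",
    "labels": [
        {
            "category": "cat",
            "bbox": {"x": 34, "y": 120, "width": 200, "height": 180},  # pixels, top-left origin
            "labeler_id": "worker_42",
            "confidence": 0.98,  # reviewer agreement / QA score
        },
    ],
}
```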

Over the past few years, there’s been plenty of great reporting on data labeling companies, such as a Not Boring profile of Scale and a Forbes piece on Labelbox.

Going by valuation, Scale is undoubtedly the leader in the industry. Scale last raised at a $7.3B valuation, with Appen last valued at roughly $2.5B and Labelbox at $1B.

Rather than delving into these specific businesses, let’s talk about how data labeling companies fit into the AI ecosystem.

Machine learning models are essentially black boxes that make decisions based on patterns. Given an input, they produce an output (usually paired with a confidence score). Technically, it’s a bit more involved with discriminative models, but that’s a different discussion.

Imagine having a machine learning model and trying to teach it what a cat is. You’d give the model millions of cat pictures. Afterward, the model gets pretty good at picking out cats.

But what if a couple of dog pictures get mixed in with the cat pictures? The model will still correctly classify cats but may occasionally classify dog inputs as cats.
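
Here’s a toy sketch of that effect (assuming scikit-learn; the “images” are just synthetic feature vectors): flip a fraction of the training labels and a flexible model trained on the noisy labels tends to lose accuracy on clean test data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic "cat vs. dog" features: two Gaussian blobs standing in for real images.
X = np.vstack([rng.normal(0.0, 1.0, (1000, 10)), rng.normal(1.0, 1.0, (1000, 10))])
y = np.array([0] * 1000 + [1] * 1000)  # 0 = cat, 1 = dog
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for noise in (0.0, 0.1, 0.3):
    y_noisy = y_train.copy()
    flip = rng.random(len(y_noisy)) < noise        # "dog pictures mixed in with the cat pictures"
    y_noisy[flip] = 1 - y_noisy[flip]
    model = DecisionTreeClassifier(random_state=0).fit(X_train, y_noisy)
    acc = model.score(X_test, y_test)              # evaluated against clean labels
    print(f"label noise {noise:.0%}: test accuracy {acc:.3f}")
```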

As a result, having high-quality data is a prerequisite for accurate supervised and reinforcement learning (two types of machine learning models).

But high-quality data labeling is a commodity. Essentially, you’re taking human time and turning it into a product (you could say the same for most jobs).

Companies like Scale, Labelbox, and Appen all offer the same labeling product. The nuance that separates larger companies like Scale from smaller ones like Labelbox or Appen is the quality of the labeling and their throughput. Scale can process far more data and guarantee extremely high quality.

While that capability is necessary for the AI ecosystem and impressive in its own right, Scale is constrained by the limitations of the data labeling layer.

The problem becomes clear with a simple question - if you remove Scale, does OpenAI still exist?

The answer: yes.

According to recent OpenAI papers like “Learning to summarize from human feedback” (arXiv:2009.01325) and “Training language models to follow instructions with human feedback” (arXiv:2203.02155), only 36 and 40 labelers were employed, respectively.

Not exactly the amount of labeling necessary to justify hiring a third-party platform.

So although Scale is fairly dominant within the data labeling industry, it doesn’t wield enough leverage to accrue profit from the AI ecosystem as a whole.

Now, this doesn’t mean that data labeling isn’t profitable.

Scale and other data labeling companies are likely to see higher revenue and better margins from the increased funding in the AI ecosystem. Research companies flush with cash and short on time will always need third-party services to schlep the unsexy parts of producing a machine learning model.

But at the end of the day, this isn’t enough to truly capture profit and leverage in the whole industry.

Research Companies

We’re roughly 4,000 words into this post and finally, I’m getting to the companies that actually build machine learning models. Better late than never.

Let’s first talk about why research companies won’t see most of the profit from the AI ecosystem.

They’re expensive.

This manifests itself in three places: talent, compute, and monetization.

Talent and compute are both fairly straightforward.

The machine learning/AI boom started in roughly 2009. Until then, AI and machine learning were sleepy backwaters. So, the AI talent pool today (2023) is still fairly limited. Add in aggressive recruitment from behemoths like Google, Microsoft, Facebook, and Amazon, and the cost of talent skyrockets. Once hired, it doesn’t make sense to give these researchers a shoestring budget. To develop a model, a researcher needs to test, iterate, and train dozens of different models.

It’s money all the way.

But the talent and compute costs pale in comparison to monetization costs. Research companies are building platforms - and platforms are a hard game for startups to play.

Models such as GPT-3 or Stable Diffusion are better classified as platforms than products. The base GPT-3 isn’t very helpful to an everyday person. It’s cool, but not a painkiller. Application companies build atop the model with industry-specific fine-tuning and prompting for their audience.
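
As a sketch of what “building atop the model” can look like in practice - this uses the older openai Python SDK’s completion endpoint; the model name, prompt, and helper function are all illustrative, not any particular company’s product:

```python
import openai

openai.api_key = "sk-..."  # your API key

def ecommerce_copy(product: str, audience: str, tone: str = "playful") -> str:
    """An application-layer wrapper: an opinionated prompt template over a stock model."""
    prompt = (
        f"You are an e-commerce copywriter. Write a {tone}, three-sentence product "
        f"description of a {product} aimed at {audience}. End with a call to action.\n"
    )
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=120,
        temperature=0.7,
    )
    return response["choices"][0]["text"].strip()

print(ecommerce_copy("stainless steel water bottle", "weekend hikers"))
```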

Could a research company do the fine-tuning and customization themselves? Yes, but that’s not where their incentives lie. Research companies are incentivized to build the best models possible.

With the best models, research companies can attract exponentially more application companies that build atop the model. Research companies can then levy a platform tax on the application companies.

This is also why speed matters.

Application companies don’t really wait around for months to figure out which model to build on. They’ll build on the strongest model available at the time, upgrading only when performance gains outweigh engineering cost. Releasing a model two months before the competition means that the model will attract a decent majority of application companies.

But as we hinted at earlier, platforms are a hard game for startups. The central problem with platforms is monetization. A platform starts by generating value for developers and only monetizes when it achieves strong adoption.

In other words, platforms can’t monetize too early. This dynamic is particularly tough for startups because not monetizing means that the company needs to find capital from other sources, and they’ll tend to need more of it.

OpenAI has raised at least $1 billion, DeepMind was acquired for around $500 million, Anthropic raised roughly $500 million led by Sam Bankman-Fried, and Stability just raised a $101 million round.

These huge capital commitments come at the cost of significant equity concessions.

As noted previously, research companies can kill two birds with one stone by partnering with a cloud platform. They receive both capital to aggressively attract talent and lower cloud bills to take more bets.

At first glance, with this much funding, these research companies should be able to weather the initial lack of monetization as a platform. Unfortunately, the flood of funding cuts both ways - better-capitalized research companies mean more competition, which means more R&D spending and longer timelines before monetization.

It’s a bit of a death trap, one where research companies create immense value but capture very little of it.

Application Companies

And finally, we get to the last layer of the AI ecosystem - application companies.

Being a writer myself, I’ll look primarily at copywriting startups such as Copy AI, Jasper, Sudowrite, Lex, and others.

Let’s first talk about what application companies are.

In their own words, “Jasper is the AI Content Generator that helps you and your team break through creative blocks to create amazing, original content 10X faster.”

Copy AI’s tagline is, “Get better results in a fraction of the time. Finally, a writing tool you’ll actually use.”

Application companies add value in the productization (some fine-tuning, prompt engineering, and a pretty UI) and distribution of the underlying technology. They take a stock model, polish it, and make it easy for the average layperson to access and utilize.

The knee-jerk reaction to application companies is usually, “wow they’re pretty dependent on the research companies”. After all, the secret sauce is the model developed by research companies.

But surprisingly, research companies have less leverage over application companies than you’d think. That’s because each application company has multiple models to choose from. They could go with GPT-3, developed by OpenAI; they could use GPT-Neo or GPT-J, open-source models developed by EleutherAI; or they could even use BLOOM, a model built by the BigScience research collaboration.

There are various performance differences and engineering requirements involved in adapting and deploying these models in production. But the technical difficulty of doing so is drastically lower than that of developing your own model.
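
As a sketch of how low that barrier is - assuming the Hugging Face transformers library; the checkpoints named below are real public models, while the prompt and parameters are just illustrative:

```python
from transformers import pipeline

# Swapping the backing model is mostly a config change (setting aside the very
# real differences in GPU memory, latency, and output quality).
MODEL = "EleutherAI/gpt-neo-1.3B"   # or "EleutherAI/gpt-j-6B", "bigscience/bloom-560m", ...

generator = pipeline("text-generation", model=MODEL)
result = generator(
    "Write a tagline for a coffee subscription service:",
    max_new_tokens=40,
    do_sample=True,
    temperature=0.8,
)
print(result[0]["generated_text"])
```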

With that said, application companies can also develop their own specific machine learning models. These models are unlikely to be as sophisticated as the ones developed by research companies, but can gain an edge by being more application specific.

The Achilles’ heel of application companies is the low barrier to entry.

Anyone with a couple lines of code can spin up an application company. And with everyone having similar access to cutting-edge machine learning models, the product difference between a $1 billion application company and a hackathon project is remarkably small.

That means that application startups function largely as distribution channels. They exist to efficiently distribute the model to end consumers. They do so with various interesting strategies such as building in public (copy.ai) or a mailing list of marketers from a previous startup (Jasper).

That’s not a problem if these companies are bootstrapped and looking to eke out revenue in the low millions.

But both copy.ai and Jasper have raised venture capital - $14 million and $137 million, respectively. With that much capital, they’re aiming for home runs, and it remains to be seen whether the application layer can support that level of ambition.

Conclusion

At the end of the day, the AI field is still fairly nascent.

It’s possible that a couple of years (or months) from now, value capture will shift to the research or application layer. More likely, things will stay the same and cloud platforms will capture most of the profit in the AI ecosystem.

Over the next few years, a lot of new AI companies will be born and a lot of money will flow into them. But most of that money will flow to cloud platforms.

Footnotes

[0] PlayStation 2s and computational chemistry are two terms that I never thought would overlap in a news article.

[1] A caveat in the paper is, “Machine learning is one application that is sufficiently important to justify as a domain-specific ASIC at scale.”

[2] I previously worked at Scale AI, but these views are my own and do not represent the company.
