**A new analysis of nearly 45,000 real-world web pages shows that the average web page is far longer than most people expect, with an average length of 10,403 tokens and a median of 3,201 tokens.**

[The study](https://dejan.ai/blog/how-long-are-web-pages/), conducted using Gemini’s token counter, reveals a highly skewed web where a small percentage of extremely long pages dramatically inflate averages. 

It is a reality with major implications for [AI systems](https://www.stanventures.com/news/ai-assistants-keep-turning-to-best-lists-new-study-shows-6261/), retrieval design, and cost planning.

## What Was Analyzed in This Web Token Study?

The research examined 44,684 live URLs, processing their content using [Gemini’s native tokenization](https://www.stanventures.com/news/gemini-live-model-redefines-real-time-conversations-with-ai-4544/). 

![Page token distribution](https://dejan.ai/wp-content/uploads/2025/12/page_token_distribution_normal_x.png)

This matters because token counts, not word counts, are what modern large language models actually “see.”

Across this dataset:

- Total page content tokens: 464,854,727
- Total tokens (all): 541,062,817

The sample intentionally covered a wide range of real-world content types, including blog posts, long-form articles, academic papers, documentation pages, product listings and full PDF documents. 

Five pages returned zero tokens due to fetch failures or blocking, but otherwise the dataset reflects the web as AI systems encounter it today. 

What emerged was not just a picture of the typical page but a view of how widely page lengths vary.

## What Is the Usual Length of a Web Page in Tokens?

At first look, the median tells a comforting story.

The median web page length is 3,201 tokens, roughly 2,400 words or five pages of text. This aligns closely with what many people imagine when they think of an article, blog post, or informational page.

But the average tells a very different story.

The average length jumps to 10,403 tokens, more than three times the median. That gap immediately signals a right-skewed distribution, where a minority of very long pages pull the mean upward.
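As a quick arithmetic check, the reported average can be reproduced from the totals the study published. The sketch below assumes the page-content total is the numerator and that all 44,684 URLs (including the five zero-token pages) are in the denominator; the variable names are illustrative.

```python
# Figures reported in the study.
TOTAL_CONTENT_TOKENS = 464_854_727
PAGE_COUNT = 44_684
MEDIAN_TOKENS = 3_201

# Mean = total tokens divided by number of pages (assumed denominator).
mean_tokens = TOTAL_CONTENT_TOKENS / PAGE_COUNT

# How many times larger the mean is than the median.
mean_to_median = mean_tokens / MEDIAN_TOKENS

print(f"mean: {mean_tokens:,.0f} tokens")
print(f"mean / median: {mean_to_median:.2f}x")
```

A mean roughly 3.25 times the median is a common signature of a right-skewed distribution.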

Percentile data confirms this imbalance:

- 25th percentile: 1,396 tokens
- 75th percentile: 8,207 tokens

The middle half of the web lives in a relatively modest range, between 1,396 and 8,207 tokens, but the top quarter stretches much further than intuition suggests.

## How Is Web Content Distributed Across Token Ranges?

Looking at token ranges reveals where most web content actually lives.

Nearly 50% of all pages fall between 1,000 and 5,000 tokens, making this range the true “center of gravity” for typical web pages. These are the articles, guides, and explainers most people interact with daily.

But beyond that range, the web gets long fast.

About 18% of pages contain between 10,000 and 50,000 tokens, representing deep-dive guides, documentation hubs, or pages filled with extensive supplementary content.

Even more striking is the long tail:

- 1.8% fall between 50,000 and 100,000 tokens
- 1.5% sit between 100,000 and 500,000 tokens
- A small but real 0.04% exceed 500,000 tokens

Only 16 pages in the entire dataset crossed the half-million-token mark, but their existence fundamentally changes how averages behave.
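A distribution like the one above can be reproduced from any list of per-page token counts. The following is a minimal sketch, not the study's code: the bucket edges follow the ranges quoted in this article, the `bucket_shares` helper is hypothetical, and the sample list is made up for illustration.

```python
from bisect import bisect_right
from collections import Counter

# Bucket boundaries based on the ranges quoted in the article.
# A count equal to an edge falls into the higher bucket.
EDGES = [1_000, 5_000, 10_000, 50_000, 100_000, 500_000]
LABELS = ["<1K", "1K-5K", "5K-10K", "10K-50K", "50K-100K", "100K-500K", "500K+"]


def bucket_shares(token_counts):
    """Return the share of pages (0.0 to 1.0) that falls in each token range."""
    if not token_counts:
        raise ValueError("token_counts must not be empty")
    counts = Counter(LABELS[bisect_right(EDGES, n)] for n in token_counts)
    total = len(token_counts)
    return {label: counts[label] / total for label in LABELS}


# Hypothetical sample, NOT data from the study.
sample = [800, 1_396, 2_500, 3_201, 4_500, 8_207, 21_839, 35_852, 141_410, 600_000]
shares = bucket_shares(sample)

for label, share in shares.items():
    print(f"{label:>10}: {share:.0%}")
```

For reference, 16 of 44,684 pages works out to roughly 0.036%, which matches the rounded 0.04% figure.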

## How Extreme Is the Long Tail of Web Content?

Percentile analysis shows just how far the web stretches.

- 90th percentile: 21,839 tokens
- 95th percentile: 35,852 tokens
- 99th percentile: 141,410 tokens

That means the top 1% of pages exceed 140,000 tokens, the equivalent of 100+ pages of text.

These pages are usually not traditional articles. They are often full research PDFs, technical documentation portals, educational course material, scraped book chapters, or long policy and standards documents.

The most extreme case in the dataset contained over 3 million tokens, roughly equivalent to four to five full-length novels on a single URL.

## What Do These Findings Mean for AI Context Windows?

With today’s large language models offering context windows ranging from 32K to over 2 million tokens, this dataset offers reassurance and a warning.

On the reassuring side:

- 95% of web pages fit within a 128K context window
- The median page leaves plenty of room for multi-page retrieval
- Only 0.04% exceed 500,000 tokens, putting them beyond typical context limits

This means most single-page retrieval tasks are well within modern LLM capabilities.

But the warning lies in aggregation. Retrieval-augmented systems rarely pull just one page.

A typical RAG query retrieving 10 documents could range from ~14K tokens (ten 25th-percentile pages) to 350K+ tokens (ten 95th-percentile pages). That variability changes everything, from latency to cost.
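To make the aggregation effect concrete, the sketch below multiplies the study's percentile figures by the number of retrieved documents. It assumes, as a simplification, that every retrieved page sits at the same percentile and that a 128K window means 128,000 tokens; the `retrieval_budget` helper is hypothetical.

```python
# Per-page token counts at each percentile, as reported in the study.
PERCENTILE_TOKENS = {
    25: 1_396,
    50: 3_201,
    75: 8_207,
    90: 21_839,
    95: 35_852,
    99: 141_410,
}


def retrieval_budget(docs, context_window):
    """Map each percentile to (total tokens for `docs` pages, fits in window?)."""
    return {
        p: (docs * tokens, docs * tokens <= context_window)
        for p, tokens in PERCENTILE_TOKENS.items()
    }


# Assumption: a "128K" window is treated as 128,000 tokens.
budget = retrieval_budget(docs=10, context_window=128_000)

for p, (total, fits) in budget.items():
    print(f"p{p}: {total:>9,} tokens  fits in 128K: {fits}")
```

Under these assumptions, ten pages at or below the 75th percentile fit in a 128K window, while ten pages at the 90th percentile or above do not.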

## How Should RAG Systems Handle This Token Variance?

The study highlights several practical realities for AI engineers.

First, chunking strategy matters. With a median page around 3,000 tokens, chunk sizes aligned to this range make sense, but they are insufficient for outliers.

Second, long-form content requires special handling. A 140K-token page cannot be treated the same way as a 3K-token article. Hierarchical chunking, summaries, or selective retrieval become essential.
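One simple way to handle long pages is fixed-size chunking with overlap, shown here as a minimal sketch rather than the hierarchical approach itself. The `chunk_tokens` helper, the 3,000-token chunk size, and the 200-token overlap are assumptions for illustration, and the token IDs are stand-in integers rather than output from a real tokenizer.

```python
def chunk_tokens(tokens, chunk_size=3_000, overlap=200):
    """Split a token sequence into fixed-size chunks that overlap slightly.

    Each chunk starts `chunk_size - overlap` tokens after the previous one,
    so neighbouring chunks share `overlap` tokens of context.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    if not tokens:
        return []
    step = chunk_size - overlap
    # Stop once the remaining tail is already covered by the previous chunk.
    last_start = max(len(tokens) - overlap, 1)
    return [tokens[i:i + chunk_size] for i in range(0, last_start, step)]


# Stand-in token IDs (hypothetical); a real pipeline would use tokenizer output.
short_page = list(range(3_000))
long_page = list(range(141_410))

short_chunks = chunk_tokens(short_page)
long_chunks = chunk_tokens(long_page)

print(len(short_chunks), len(long_chunks))
```

Under these settings, a 141,410-token page (the 99th percentile) splits into 51 chunks, which could then be summarized or ranked before retrieval.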

Third, budgeting must account for outliers. While median costs might look manageable, average costs end up roughly 3× higher due to long-tail pages.

This is not a theoretical concern. It directly affects inference bills, latency expectations, and user experience.
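The budgeting gap can be sketched with simple arithmetic. The price below is a hypothetical placeholder, not a real rate for any model, and `estimated_cost` is an illustrative helper; the per-page figures are the study's median and mean.

```python
# Per-page figures reported in the study.
MEDIAN_TOKENS = 3_201
MEAN_TOKENS = 10_403

# Hypothetical price in USD per million input tokens (NOT a real rate).
USD_PER_MILLION_TOKENS = 1.00


def estimated_cost(pages, tokens_per_page, usd_per_million_tokens):
    """Estimated input cost in USD for processing `pages` pages."""
    return pages * tokens_per_page * usd_per_million_tokens / 1_000_000


pages = 100_000
median_based = estimated_cost(pages, MEDIAN_TOKENS, USD_PER_MILLION_TOKENS)
mean_based = estimated_cost(pages, MEAN_TOKENS, USD_PER_MILLION_TOKENS)

print(f"median-based estimate: ${median_based:,.2f}")
print(f"mean-based estimate:   ${mean_based:,.2f}")
print(f"ratio: {mean_based / median_based:.2f}x")
```

Because total spend scales with total tokens, the mean is the more useful per-page figure for budgeting a large crawl; the median describes the typical page, not the bill.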

## How Wrong Were People’s Guesses About Page Length?

Before publishing the data, the researcher ran a LinkedIn poll asking people to guess the average page size in tokens.

Out of 131 votes:

- 38% guessed 1,000 tokens
- 34% guessed 10,000 tokens
- 21% guessed 100 tokens
- 7% guessed 100,000 tokens

The correct average, 10,403 tokens, was guessed by only about a third of respondents.

Most people underestimated. And that’s understandable. The median supports the intuition that pages are closer to 1,000–3,000 tokens. But averages don’t respect intuition when long tails exist.

Interestingly, the small group who guessed 100,000 tokens weren’t entirely wrong; they just described the 99th percentile, not the average.

This gap between perception and reality helps explain why AI systems can run into unexpected cost spikes.

## Key Takeaways 

- The study analyzed 44,684 real-world web pages using Gemini’s token counter.
- The median web page is ~3,200 tokens, but the average jumps to ~10,400 tokens due to long pages.
- Nearly 50% of pages fall between 1,000–5,000 tokens, representing typical articles and blogs.
- About 18% of pages are 10,000–50,000 tokens, often deep guides or documentation.
- The top 1% exceed 140,000 tokens, equivalent to 100+ pages of text.
- The largest page analyzed contained 3+ million tokens, or several full novels.
- Average AI costs are ~3× the median due to long-tail content.
- Most people underestimate web page size, leading to flawed system design assumptions.