October 30, 2018

Web Caching History



Before we delve into the details of modern operations of a CDN, I thought now would be a good time to cover the history of web caching and the development of CDNs. This development followed a natural progression. So let's start at the beginning.


A short timeline:

1989 Work on HTTP begins

1990 The internet consists of about 2.6m users globally

1990 The world's first website goes live in December 1990 at CERN

This website used the first version of HTTP. It had one method: GET. You sent a GET request to the server and it returned the HTML of the web page.

1991 HTTP 0.9 formally documented the HTTP protocol.

Note the simplicity of the protocol. The goal was to document a version of HTTP that they could maintain compatibility with in the future.
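To make that simplicity concrete, here is a minimal Python sketch of an HTTP 0.9 exchange. The request is a single line (`GET`, a space, the document path) and the response is the raw HTML with no status line or headers. The function names and the in-memory `documents` table are hypothetical, chosen only for illustration.

```python
def build_request(path: str) -> bytes:
    """Build an HTTP 0.9 request: the word GET, a space, the path, then CRLF."""
    return f"GET {path}\r\n".encode("ascii")


def handle_request(raw: bytes, documents: dict[str, str]) -> bytes:
    """Answer an HTTP 0.9 request from an in-memory table of pages.

    The response is just the document body: no status line, no headers.
    """
    method, _, path = raw.decode("ascii").strip().partition(" ")
    if method != "GET":
        # HTTP 0.9 defines only GET; anything else is not a valid request.
        raise ValueError(f"unsupported method: {method!r}")
    # HTTP 0.9 has no status codes, so a missing page is reported as HTML too.
    return documents.get(path, "<html>Not found</html>").encode("ascii")


# Hypothetical one-page site.
documents = {"/index.html": "<html>Hello, web</html>"}
request = build_request("/index.html")
response = handle_request(request, documents)
```

With no headers at all, there is nowhere for a server to say how long a page may be kept, which is why caching could not be expressed in the protocol at this stage.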

Up until this point, caching is not a consideration on the internet. HTTP 0.9 contains no support for caching resources.

1991 10 websites on the internet

1992 50 websites on the internet

1993 623 websites on the internet

1994 10,000 websites on the internet

1995 23,500 websites on the internet

1995 In the past five years, growth of the internet has exploded: from 2.6m users in 1990 to over 44m globally, roughly 17x growth.

This is where the internet met its first growing pains. Servers were expensive and slow, and development techniques were inefficient with server resources. Websites were easy to overload: a moderate spike in traffic could take a website offline as it struggled to cope. With the rapid growth of the internet, this was a fairly frequent occurrence.

The architecture of a website was simple as well. A website would typically consist of a single web server. Around this time, the first dynamic websites started to appear, built using a web server and a database server.

Harvest Project

This is the beginning of caching on the internet. DARPA and the Internet Research Task Force launched the Harvest Project to develop a web cache and the Internet Cache Protocol (ICP). Some of you may recognize this project by the name it adopted after the Harvest Project ended: Squid.

Squid was one of the first cache servers. ICP did not use HTTP headers for caching, and Squid could be used in front of FTP, Gopher, NetNews, and HTTP servers, all common protocols in the early days of the internet.

To use Squid, you would install it on a server in front of your web server. It proxied requests to your web server, caching static resources on the Squid server.

The Squid server would physically sit next to your web server, so there was no improvement in latency; requests still needed to travel the same distance. What it accomplished was alleviating some of the load on your web server, allowing it to handle more users.
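The load reduction can be sketched in a few lines of Python. This is not how Squid itself is implemented; it is a minimal illustration, with a hypothetical `CachingProxy` class and a stand-in `origin` function playing the role of the web server, of a proxy that answers repeat requests from its own copy.

```python
from typing import Callable


class CachingProxy:
    """Sits in front of an origin server and remembers responses by path."""

    def __init__(self, origin: Callable[[str], bytes]) -> None:
        self.origin = origin
        self.cache: dict[str, bytes] = {}

    def get(self, path: str) -> bytes:
        if path not in self.cache:
            # Cache miss: forward the request to the origin and store the result.
            self.cache[path] = self.origin(path)
        # Cache hit (or freshly stored): answer without touching the origin.
        return self.cache[path]


origin_requests = []


def origin(path: str) -> bytes:
    """Stand-in web server that records every request it has to serve."""
    origin_requests.append(path)
    return f"contents of {path}".encode()


proxy = CachingProxy(origin)
for _ in range(1000):
    proxy.get("/logo.gif")
```

After a thousand requests for the same image, `origin_requests` holds a single entry: the web server did the work once and the proxy absorbed the rest. The sketch never expires anything, which a real cache has to do.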

Following Harvest, a number of caching products were launched using a variety of protocols. Eventually HTTP itself would standardize caching: HTTP/1.0 (1996) documented basic headers such as Expires and Pragma: no-cache, HTTP/1.1 (1997) added Cache-Control, and web proxies would adopt the HTTP standard for caching.
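As a rough illustration of what HTTP cache headers let a proxy decide, here is a simplified freshness check in Python. It looks at two real headers, `Cache-Control: max-age` and `Expires`, and prefers `max-age` when both are present. The function name `is_fresh` and the `stored_at` bookkeeping are our own assumptions for the sketch; a real cache also has to handle directives such as `no-store`, the `Age` header, and case-insensitive header names.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def is_fresh(headers: dict[str, str], stored_at: datetime, now: datetime) -> bool:
    """Return True if a cached response may still be served without refetching."""
    cache_control = headers.get("Cache-Control", "")
    for directive in cache_control.split(","):
        name, _, value = directive.strip().partition("=")
        if name.lower() == "max-age" and value.isdigit():
            # max-age is a lifetime in seconds, counted from when we stored it.
            return now - stored_at < timedelta(seconds=int(value))
    if "Expires" in headers:
        # Expires is an absolute HTTP date, e.g. "Thu, 01 Dec 1994 16:00:00 GMT".
        return now < parsedate_to_datetime(headers["Expires"])
    # No explicit lifetime: this sketch treats the response as stale.
    return False


stored = datetime(1997, 1, 1, 12, 0, tzinfo=timezone.utc)
half_minute_later = stored + timedelta(seconds=30)
still_fresh = is_fresh({"Cache-Control": "max-age=60"}, stored, half_minute_later)
```

The point of standardizing these headers is that the origin server, not the proxy, gets to say how long a copy stays valid.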

Akamai, The First CDN

One aspect of the internet we have not discussed yet is the speed of internet connections. The web of the '90s was far slower than today's internet. Here are the common internet speeds of the day:

  • Users: dial-up at 14.4Kbits/s - 56Kbits/s (that's 1.8-7Kbytes/s)
  • Servers: ISDN at 64Kbits/s - 128Kbits/s; or T1 at 1.5Mbits/s
  • Backbones: OC12 at 622Mbits/s - OC48 at 2.4Gbits/s

Latencies were much higher as well. A dial-up connection would typically have a latency of several hundred milliseconds. 300ms was not uncommon on a decent connection.
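To get a feel for what these numbers meant in practice, here is a back-of-the-envelope calculation in Python. The 50 KB page size is a hypothetical figure picked for illustration, and the estimate ignores protocol overhead, connection setup, and compression, so treat the results as rough lower bounds.

```python
def transfer_seconds(size_bytes: int, link_bits_per_second: float,
                     latency_seconds: float = 0.0) -> float:
    """Estimate the time to fetch one resource: latency plus raw transfer time."""
    return latency_seconds + (size_bytes * 8) / link_bits_per_second


PAGE_BYTES = 50_000  # hypothetical 50 KB page, chosen for illustration

# 300ms of latency, as on a decent dial-up connection.
dial_up_14k = transfer_seconds(PAGE_BYTES, 14_400, latency_seconds=0.3)
dial_up_56k = transfer_seconds(PAGE_BYTES, 56_000, latency_seconds=0.3)
```

Under these assumptions the page takes about 28 seconds on a 14.4Kbits/s modem and about 7.4 seconds at 56Kbits/s.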

This was the environment in which Akamai began developing the first CDN in 1998. Squid's reverse proxy server allowed servers to handle more users, but did nothing for latency on the internet. In the simplest terms, Akamai moved the reverse proxy server to the edge of the network, near the user.

This diagram has a slight misrepresentation: the HTML of a web page was still loaded directly from the web server. Within the HTML, you would point images and other static resources at Akamai's servers.

By moving static resources closer to the user, they could be loaded faster than if they were loaded from the web server. By reducing the distance, Akamai reduced the latency.

The cache servers in this model are independent of each other. They do not generally communicate with each other, and the cache on each server is a separate cache.

Today's CDNs

Akamai's model is still basically the model for how CDNs operate today. The protocols have improved, the servers are faster, and internet connections are faster, but the structure hasn't changed much. Instead of developing their own caching server, CDNs are typically built on top of Nginx, Varnish, or one of a few other off-the-shelf proxy servers. It would be accurate to describe most CDNs as operations companies (buying and providing bandwidth), not software development companies.

There have been a few notable improvements. Many CDNs now allow you to load the HTML through the CDN instead of from the web server, but frequently they do very little to speed up the retrieval of the HTML. New types of CDNs have appeared to optimize delivery of video and other types of content. While CDNs continue to be primarily concerned with caching static content, this will change given the complexity of modern websites.

Latencies and bandwidth on the internet have improved, allowing edge servers to be moved further from the user. When latencies were in the hundreds of milliseconds, edge servers placed within a few miles of the user delivered a significant improvement. Today, 50ms is enough time for a packet to go from California to New York. So instead of tens of thousands of edge cache servers, CDNs today frequently have a few dozen. With improvements in TLS 1.3, HTTP/2, QUIC and other developments that further reduce latencies, we expect this transition to continue.
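The cross-country figure can be sanity-checked with a little physics. The Python sketch below estimates one-way propagation delay over fiber, assuming light travels at about two thirds of its vacuum speed in glass. The distances are assumed straight-line values picked for illustration; real cable routes are longer, and routers add their own delay.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # in a vacuum


def propagation_delay_ms(distance_km: float, velocity_factor: float = 0.67) -> float:
    """One-way delay over fiber, where light travels at roughly 2/3 of c."""
    return distance_km / (SPEED_OF_LIGHT_KM_S * velocity_factor) * 1000


# Assumed straight-line distances; real cable routes are longer.
cross_country_ms = propagation_delay_ms(4_100)  # roughly California to New York
nearby_edge_ms = propagation_delay_ms(50)       # an edge server in the same metro area
```

Under these assumptions the coast-to-coast hop costs about 20ms each way, against a fraction of a millisecond for a nearby edge server. That gap mattered little next to a 300ms dial-up link, and it is small enough today that a few dozen well-placed locations can cover most users.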

The Future

The reduction in the number of PoPs CDNs operate is a necessary development to handle the dynamic content that is common on today's internet. This reduction allows more complex distributed behavior to be developed. For example, some CDNs have introduced Function as a Service (FaaS), while others have introduced complex routing and configuration support.

In a similar direction, we've built a distributed pseudo-time filesystem that allows the cache servers to coordinate and share cache data. And as we continue to improve our software, we intend to build support for more complex operations on top of this filesystem. This is the model we see CDNs adopting in the future.

In later articles we'll be covering some of the improvements CDNs have made, more about our distributed filesystem, and other innovations we have developed to improve on this model.