
About

Mission

Successful societies and institutions recognize the need to record their history - this provides a way to review the past, find explanations for current behavior, and spot emerging trends. In 1996 Brewster Kahle realized the cultural significance of the Internet and the need to record its history. As a result he founded the Internet Archive which collects and permanently stores the Web's digitized content.

In addition to the content of web pages, it's important to record how this digitized content is constructed and served. The HTTP Archive provides this record. It is a permanent repository of web performance information such as size of pages, failed requests, and technologies utilized. This performance information allows us to see trends in how the Web is built and provides a common data set from which to conduct web performance research.

About this fork

This is a port of the original codebase to Python, using the Pyramid framework for the pages, SQLAlchemy for managing database queries, and GViz-Data-Table for visualisation. The data differs minimally from httparchive.org: only unique, fully qualified domains are included and folder-based URLs are excluded. Furthermore, websites are only included once they have been crawled more than once. The aim behind the port was to make working with the dataset easier and to focus on query optimisation.
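
The sketch below shows roughly how these pieces fit together; the Page model, its columns, the route, and the in-memory SQLite engine are all invented for illustration and are not the real schema.

    # A minimal sketch of the stack: a Pyramid view backed by a SQLAlchemy
    # query. The Page model and its columns are hypothetical.
    from pyramid.config import Configurator
    from sqlalchemy import create_engine, Column, Integer, String
    from sqlalchemy.ext.declarative import declarative_base
    from sqlalchemy.orm import sessionmaker

    Base = declarative_base()
    Session = sessionmaker()


    class Page(Base):
        """Hypothetical table with one row per crawled page."""
        __tablename__ = 'pages'
        id = Column(Integer, primary_key=True)
        url = Column(String)
        bytes_total = Column(Integer)


    def pages_view(request):
        # Query via SQLAlchemy; the real views hand rows like these to
        # GViz-Data-Table to build the chart data sources.
        session = Session()
        rows = session.query(Page.url, Page.bytes_total).limit(10).all()
        return [{'url': url, 'bytes_total': total} for url, total in rows]


    def make_app():
        engine = create_engine('sqlite://')  # stand-in for the real database
        Base.metadata.create_all(engine)
        Session.configure(bind=engine)
        config = Configurator()
        config.add_route('pages', '/pages')
        config.add_view(pages_view, route_name='pages', renderer='json')
        return config.make_wsgi_app()

The 'json' renderer is used here only to keep the sketch self-contained; the real views render templates and chart data instead.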

FAQ

How is the list of URLs generated?

Starting in November 2011, the list of URLs is based solely on the Alexa Top 1,000,000 Sites (zip). Only unique, fully qualified domains with more than one crawl are included.
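
In code, those inclusion rules amount to something like the following sketch; the (url, crawl count) input format is assumed purely for illustration.

    # A sketch of the inclusion rules: unique, fully qualified domains with
    # more than one crawl; folder-based URLs are excluded.
    from urllib.parse import urlsplit


    def is_fully_qualified_domain(url):
        """True for bare hosts such as http://example.com/, not folder-based URLs."""
        parts = urlsplit(url)
        return bool(parts.hostname) and '.' in parts.hostname and parts.path in ('', '/')


    def select_urls(candidates):
        """Keep unique, fully qualified domains crawled more than once."""
        seen = set()
        for url, crawl_count in candidates:
            if not is_fully_qualified_domain(url) or crawl_count <= 1:
                continue
            host = urlsplit(url).hostname
            if host in seen:
                continue
            seen.add(host)
            yield url


    print(list(select_urls([
        ('http://example.com/', 2),        # kept
        ('http://example.com/blog/', 3),   # folder-based URL: excluded
        ('http://example.org/', 1),        # only one crawl so far: excluded
    ])))
    # ['http://example.com/']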

From November 2010 through October 2011 there were 18,026 URLs analyzed. This list was based on the union of the following lists:

How is the data gathered?

The list of URLs is fed to WebPagetest.org. (Huge thanks to Pat Meenan!)
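
A rough sketch of how a single URL might be submitted through WebPagetest's public HTTP API (runtest.php) is shown below; the parameters, the need for an API key, and the response layout are assumptions based on the publicly documented API rather than this project's actual submission code.

    # A sketch of submitting one URL for testing via WebPagetest's public
    # runtest.php endpoint. The API key and the response layout are
    # assumptions; consult the WebPagetest documentation for details.
    import requests

    WPT_ENDPOINT = 'http://www.webpagetest.org/runtest.php'


    def submit_test(url, api_key, runs=3):
        response = requests.get(WPT_ENDPOINT, params={
            'url': url,
            'k': api_key,   # API key for the public instance (assumed required)
            'runs': runs,   # the archive loads each URL 3 times
            'f': 'json',    # ask for a JSON response
        })
        response.raise_for_status()
        return response.json()


    # result = submit_test('http://www.w3.org/', 'YOUR_API_KEY')
    # print(result['data']['testId'])  # assumed response layout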

The WebPagetest settings are:

Each URL is loaded 3 times. The data from the median run (based on load time) is collected via a HAR file. The HTTP Archive collects these HAR files, parses them, and populates our database with the relevant information.
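
A sketch of what that parsing step might look like is below; the field names follow the HAR specification, while the particular aggregates are only illustrative and not the archive's exact schema.

    # Pull a few page-level stats out of one HAR file.
    import json


    def summarise_har(path):
        with open(path) as handle:
            log = json.load(handle)['log']
        entries = log['entries']
        return {
            'requests': len(entries),
            'bytes_in': sum(entry['response'].get('bodySize', 0)
                            for entry in entries
                            if entry['response'].get('bodySize', -1) >= 0),
            'failed': sum(1 for entry in entries
                          if entry['response']['status'] >= 400),
        }


    # print(summarise_har('www.w3.org.har'))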

For the HTTP Archive Mobile the data is gathered using Blaze.io's mobile web performance tool, Mobitest, on iPhones running iOS 4.3. Please see their methodology page for more information.

How accurate is the data, in particular the time measurements?

The "static" measurements (# of bytes, HTTP headers, etc. - everything but time) are accurate at the time the test was performed. It's entirely possible that the web page has changed since it was tested. The tests were performed using Internet Explorer 8. If the page's content varies by browser this could be a source of differences.

The time measurements are gathered in a test environment, and thus have all the potential biases that come with that:

Given these conditions, it's virtually impossible to compare WebPagetest.org's time measurements with those gathered in other browsers, at other locations, or over other connection speeds. They are best used for comparisons against other measurements within the HTTP Archive.

What are the limitations of this testing methodology (using lists)?

The HTTP Archive examines each URL in the list, but does not crawl the website's other pages. Although these lists of websites (for example, the Fortune 500 and the Alexa Top 500) are well known, an entire website doesn't necessarily map well to a single URL.

Because of these issues and more, it's possible that the actual HTML document analyzed is not representative of the website.

What's a "HAR file"?

HAR files are based on the HTTP Archive specification. They capture web page loading information in a JSON format. See the list of tools that support the HAR format.
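
Trimmed down to its essentials, a HAR file looks roughly like this; the values below are made up and the specification defines many more fields than shown here.

    {
      "log": {
        "version": "1.2",
        "creator": {"name": "WebPagetest", "version": "2.x"},
        "pages": [
          {"id": "page_1",
           "startedDateTime": "2011-10-01T12:00:00.000Z",
           "title": "http://www.w3.org/",
           "pageTimings": {"onContentLoad": 1500, "onLoad": 2500}}
        ],
        "entries": [
          {"pageref": "page_1",
           "startedDateTime": "2011-10-01T12:00:00.100Z",
           "time": 120,
           "request": {"method": "GET", "url": "http://www.w3.org/", "headers": []},
           "response": {"status": 200, "bodySize": 31432, "headers": []},
           "timings": {"dns": 10, "connect": 20, "wait": 60, "receive": 30}}
        ]
      }
    }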

How is the HTTP waterfall chart generated?

The HTTP waterfall chart is generated from the HAR file via JavaScript. The code is from Jan Odvarko's HAR Viewer. Jan is also one of the creators of the HAR specification. Thanks Jan!

When looking at Trends what does it mean to choose the "intersection" URLs?

The number and exact list of URLs changes from run to run. Comparing trends for "All" the URLs from run to run is a bit like comparing apples and oranges. For more of an apples to apples comparison you can choose the "intersection" URLs. This is the maximum set of URLs that were measured in every run.
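
Concretely, the intersection is just a set intersection of the URL lists across runs, as in this sketch (the run data is made up):

    # The "intersection" selection keeps only URLs present in every run.
    runs = {
        'Nov 15 2011': {'http://www.w3.org/', 'http://example.com/', 'http://example.org/'},
        'Dec 1 2011':  {'http://www.w3.org/', 'http://example.com/'},
        'Dec 15 2011': {'http://www.w3.org/', 'http://example.com/', 'http://example.net/'},
    }

    intersection = set.intersection(*runs.values())
    print(sorted(intersection))
    # ['http://example.com/', 'http://www.w3.org/']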

What are the definitions for the table columns for a website's requests?

The View Site page contains a table with information about each HTTP request in an individual page, for example http://www.w3.org/. The less obvious columns are defined here:

Definitions for each of the HTTP headers can be found in HTTP/1.1: Header Field Definitions.

How do I add a website to the HTTP Archive?

You can add a website to the HTTP Archive via the Add a Site page.

How do I get my website removed from the HTTP Archive?

You can have your site removed from the HTTP Archive via the Remove Your Site page.

How do I report inappropriate (adult only) content?

Please report any inappropriate content by creating a new issue. You may come across inappropriate content when viewing a website's filmstrip screenshots. You can help us flag these websites. Screenshots are not shown for websites flagged as adult only.

Who created the HTTP Archive?

Steve Souders created the HTTP Archive. It's built on the shoulders of Pat Meenan's WebPagetest system. Several folks on Google's Make the Web Faster team chipped in. Steve has received patches from several individuals including Jonathan Klein, Yusuke Tsutsumi, Carson McDonald, James Byers, Ido Green, Charlie Clark, and Mike Pfirrmann. Guy Leech helped early on with the design. More recently, Stephen Hay created the new logo.

The HTTP Archive Mobile test framework is provided by Blaze.io with much help from Guy (Guypo) Podjarny.

Who sponsors the HTTP Archive?

The HTTP Archive is possible through the support of these sponsors: Google, Mozilla, New Relic, O’Reilly Media, Etsy, Strangeloop, dynaTrace Software, and Torbit.

The HTTP Archive is part of the Internet Archive, a 501(c)(3) non-profit. Donations in support of the HTTP Archive can be made through the Internet Archive's donation page. Make sure to designate that your donation is for the "HTTP Archive".

Who do I contact for more information?

Please go to the HTTP Archive discussion list and submit a post.