Web Archive Datasets

Explore content from the Library's web archives

To better facilitate research and to better understand the needs of users interested in the Library of Congress Web Archives, the Library's Web Archiving Team is creating a number of derivative datasets for users to download, re-use, and explore. After participating in the Library's pilot to explore how we can better enable digital scholarship beyond basic browsing of records and URL search, the team began exploring ways to process our archives, the scale of which can be overwhelming to digital humanities researchers and other users. From this experimentation, the team created two derivative datasets that give researchers smaller pieces of data to help them engage with and learn more from our archives.

We plan to launch additional datasets with other types of content in the future, so keep an eye on this page and follow The Signal blog for updates.

Datasets and Tutorials

.gov Datasets

The following is a series of datasets each containing 1,000 files of related media types selected from .gov domains.

  • 3000 .gov Tabular Dataset Download (CSV, TSV, and XLS formats) - Each of these datasets consists of 1,000 files generated from indexes of the Web archives, which were used to derive a random list of 1,000 items identified as CSV, tab-separated (TSV), or Excel (XLS) files and hosted on .gov domains. Each set includes 1,000 unique CSV, TSV, or XLS files and minimal metadata about them, including links to their locations within the Library's web archive.
  • 1000 .gov PDF Dataset Download (673.5 MB zip file) - This dataset of 1,000 PDF files was generated from indexes of the Web archives, which were used to derive a random list of 1,000 items identified as PDF files and hosted on .gov domains. The set includes 1,000 unique PDF files and minimal metadata about these PDFs, including links to their locations within the Library's web archive.
  • 1000 .gov Audio Dataset Download (4.6 GB zip file) - This dataset of 1,000 audio files was generated from indexes of the Web archives, which were used to derive a random list of 1,000 items identified as audio files and hosted on .gov domains. The set includes 1,000 unique audio files and minimal metadata about them, including links to their locations within the Library's web archive.
  • 1000 .gov Image Dataset Download (128.16 MB zip file) - This dataset of 1,000 images was generated from indexes of the Web archives, which were used to derive a random list of 1,000 items identified as image files and hosted on .gov domains. The set includes 1,000 unique image files (primarily with GIF, JPG, PNG, and TIFF extensions) and minimal associated metadata, including links to their locations within the Library’s web archive.
  • 1000 .gov PowerPoint Dataset Download (3.2 GB zip file) - This dataset of 1,000 PowerPoint files was generated from indexes of the Web archives, which were used to derive a random list of 1,000 items identified as PowerPoint files and hosted on .gov domains.
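
As a first look at one of these downloads, it can help to tally what the zip file actually contains before unpacking it. The following is a minimal sketch in Python, assuming only that the download is a standard zip archive; the function name and the example file name in the note below are hypothetical, and the internal layout of the archives is not described on this page.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def count_extensions(zip_path):
    """Return a Counter of lowercase file extensions found in a zip archive."""
    counts = Counter()
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no file content
            # Zip member names always use forward slashes
            suffix = PurePosixPath(info.filename).suffix.lower()
            counts[suffix or "(none)"] += 1
    return counts
```

For example, calling count_extensions("dot_gov_images.zip").most_common() on a downloaded archive (the file name here is a placeholder) lists the extensions in descending order of frequency, which gives a quick check against the file types named in the dataset description.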

United States Elections Web Archive Datasets

DISCLAIMER: The election datasets are currently unavailable. We are working with the Web Archiving Section to transition these datasets to data.labs.loc.gov for an improved user experience. If you have questions or concerns, reach out using the contact form below.

The following datasets come from the United States Elections Web Archive. This collection includes campaign sites archived weekly during United States election seasons since 2000, documenting sites associated with presidential, congressional, and gubernatorial elections. The Web Archiving Team has also released a Jupyter Notebook that provides examples of how to further explore and analyze the data. This notebook includes annotated blocks of Python code that can be reused and altered to suit researcher needs. The dataset is also described in more detail in an accompanying blog post.

  • United States Elections Web Archive Indexes Dataset - This multi-part dataset includes metadata about archived web objects in the United States Elections Web Archive. This metadata is in the form of web archive capture (CDX) indexes which are generated as part of the process to provide access via Wayback software. In total, the dataset is 260 GB and contains 411,815 .cdx.gz files, divided by election year. Please see the individual years’ README files for more information.
    • United States Election 2000 Web Archive Indexes – This dataset includes CDX indexes from the United States Election 2000 crawl. Please see the README for further details, and you can fetch the files using this manifest.
    • United States Election 2002 Web Archive Indexes – This dataset includes CDX indexes from the United States Election 2002 crawl. Please see the README for further details, and you can fetch the files using this manifest.
    • United States Election 2004 Web Archive Indexes – This dataset includes CDX indexes from the United States Election 2004 crawl. Please see the README for further details, and you can fetch the files using this manifest.
    • United States Election 2006 Web Archive Indexes – This dataset includes CDX indexes from the United States Election 2006 crawl. Please see the README for further details, and you can fetch the files using this manifest.
    • United States Election 2008 Web Archive Indexes – This dataset includes CDX indexes from the United States Election 2008 crawl. Please see the README for further details, and you can fetch the files using this manifest.
    • United States Election 2010 Web Archive Indexes – This dataset includes CDX indexes from the United States Election 2010 crawl. Please see the README for further details, and you can fetch the files using this manifest.
    • United States Election 2012 Web Archive Indexes – This dataset includes CDX indexes from the United States Election 2012 crawl. Please see the README for further details, and you can fetch the files using this manifest.
    • United States Election 2014 Web Archive Indexes – This dataset includes CDX indexes from the United States Election 2014 crawl. Please see the README for further details, and you can fetch the files using this manifest.
    • United States Election 2016 Web Archive Indexes – This dataset includes CDX indexes from the United States Election 2016 crawl. Please see the README for further details, and you can fetch the files using this manifest.
    • United States Election 2018 Web Archive Indexes – This dataset includes CDX indexes from the United States Election 2018 crawl. Please see the README for further details, and you can fetch the files using this manifest.
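
For readers new to CDX indexes, the sketch below shows one way a single .cdx.gz file might be read in Python. It is not taken from the Library's notebook. It assumes each file begins with a conventional CDX header line (for example " CDX N b a m s k r M S V g"), where the first character is the field delimiter and each letter names a column; the readable field names in FIELD_NAMES are our own labels for those letters. Check the README for each election year to confirm the actual field order before relying on this.

```python
import gzip
from collections import Counter

# Readable labels for common CDX legend letters (labels are our own choice).
FIELD_NAMES = {
    "N": "urlkey",
    "b": "timestamp",
    "a": "original_url",
    "m": "mime_type",
    "s": "status_code",
    "k": "digest",
    "r": "redirect",
    "M": "meta_tags",
    "S": "compressed_size",
    "V": "offset",
    "g": "filename",
}


def parse_cdx_header(header_line):
    """Return (delimiter, field names) from a header like ' CDX N b a m s'."""
    line = header_line.rstrip("\r\n")
    if not line:
        raise ValueError("empty CDX header line")
    delimiter = line[0]
    parts = [part for part in line[1:].split(delimiter) if part]
    if not parts or parts[0] != "CDX":
        raise ValueError("not a CDX header line")
    # Unknown legend letters are kept as-is rather than guessed at
    return delimiter, [FIELD_NAMES.get(letter, letter) for letter in parts[1:]]


def read_cdx(path):
    """Yield one dict per capture record in a gzip-compressed CDX file."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        header = next(handle, None)
        if header is None:
            return  # empty file
        delimiter, fields = parse_cdx_header(header)
        for line in handle:
            values = line.rstrip("\r\n").split(delimiter)
            if len(values) != len(fields):
                continue  # skip lines that do not match the header
            yield dict(zip(fields, values))


def count_mime_types(path):
    """Tally the MIME types recorded in one CDX file."""
    return Counter(rec.get("mime_type", "unknown") for rec in read_cdx(path))
```

Reading records lazily with a generator keeps memory use low, which matters when working through a dataset of this size one file at a time.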

Web Cultures Web Archive Datasets

These datasets consist of content found in the American Folklife Center's Web Cultures Web Archive. As part of an effort to review the coverage of web crawls of GIPHY and Meme Generator, the Web Archiving Team queried our web archives data, first finding the number of unique GIFs and distinct meme instances in both web archives. Read Trevor Owens' post about this release, "Data Mining Memes in the Digital Culture Web Archive."

  • Meme Generator Dataset Download - The Meme Generator dataset was generated from content harvested from Meme Generator and includes 57,652 unique meme instances derived from base memes (meme images without text, waiting to be fashioned into meme instances). The dataset includes some minimal metadata for these memes. The dataset does not include the memes themselves; however, it does provide links to where you can access their web archive copies within the Library's web archive.
  • Exploring the Meme Generator Metadata - This Jupyter notebook demonstrates some of the basic things that can be done with the set of data from memegenerator using the Python programming language.
  • GIPHY Dataset Download - The GIPHY dataset was generated from content harvested from GIPHY and includes 10,972 unique GIFs. The dataset includes some minimal metadata for these GIFs. The dataset does not include the GIFs themselves; however, it does provide links to where you can access their web archive copies within the Library's web archive.
  • Exploring the GIPHY.com Metadata - This Jupyter notebook demonstrates an intermediate approach to exploring the GIPHY.com dataset produced by the Library of Congress using the Python programming language.

Webcomics Web Archive Datasets

The following dataset comes from the Webcomics Web Archive collection. This collection focuses on comics created specifically for the web and supplements the Library of Congress’ extensive holdings in comic books, graphic novels, and original comic art.

Miscellaneous Datasets

Documentation

The Library's Web Archiving program page provides background information and includes details about our collection policies, technical approach, and information for searching and browsing the entire web archive. We post about web archiving and new collections on the Signal Blog.

Researchers interested in using web archive datasets may also want to explore The Archives Unleashed Project, which is developing web archive search and data analysis tools for scholars, librarians, and archivists. Another resource for information about web archiving is the International Internet Preservation Consortium (we're a founding member). IIPC has a handy Awesome Web Archiving guide available for those interested in learning more about web archiving.

We need you

Our attempts to work through analysis of this data have helped our team identify some potential ways to enhance our crawls of these sites to better harvest more of their content. At the same time, we are also interested in exploring how making this kind of derivative data available to users might help spur further use and engagement with the collections. We are interested in hearing from users of these datasets – what you did with them, feedback on the documentation, what other derivative datasets might be of interest, and any other feedback or comments you have for us. Please write to us via our Contact Us form - we'd love to hear from you!
