Web Archive Datasets

Explore content from the Library's web archives

To better facilitate research and to better understand the needs of users who might be interested in the Library of Congress Web Archives, the Library's Web Archiving Team is working to create a number of derivative datasets and make them available for users to download, re-use and explore. After participating in the Library's pilot to explore how we can better enable digital scholarship beyond basic browsing of records and URL search, the team began exploring ways to process our archives, the scale of which can be overwhelming to digital humanities researchers and other users. From this experimentation, the team created two derivative datasets that give researchers smaller pieces of data to help them engage with and learn more from our archives.

We plan to launch additional datasets with other types of content in the future, so keep an eye on this page and follow The Signal blog for updates.

Datasets and Tutorials

Dot Gov Datasets

The following is a series of datasets each containing 1,000 files of related media types selected from .gov domains.

  • 3000 .gov Tabular Dataset Download (CSV External, TSV External, XLS External) - Each of these datasets consists of 1,000 files generated from indexes of the Web archives, which were used to derive a random list of 1,000 items identified as CSV, tab-separated (TSV), or Excel (XLS) files and hosted on .gov domains. Each set includes 1,000 unique CSV, TSV, or XLS files and minimal metadata about them, including links to their locations within the Library's web archive. If the zip file is too big, you can download the README (CSV README External, TSV README External, XLS README External) and/or the metadata csv (CSV METADATA External, TSV METADATA External, XLS METADATA External) separately.
  • 1000 .gov PDF Dataset Download (673.5 MB zip file) External - This dataset of 1,000 PDF files was generated from indexes of the Web archives, which were used to derive a random list of 1,000 items identified as PDF files and hosted on .gov domains. The set includes 1,000 unique PDF files and minimal metadata about these PDFs, including links to their locations within the Library's web archive. If the zip file is too big, you can download the README External and/or the metadata csv External separately.
  • 1000 .gov Audio Dataset Download (4.6 GB zip file) External - This dataset of 1,000 audio files was generated from indexes of the Web archives, which were used to derive a random list of 1,000 items identified as audio files and hosted on .gov domains. The set includes 1,000 unique audio files and minimal metadata about them, including links to their locations within the Library's web archive. If the zip file is too big, you can download the README External and/or the metadata csv External separately.
  • 1000 .gov Image Dataset Download (128.16 MB zip file) External - This dataset of 1,000 images was generated from indexes of the Web archives, which were used to derive a random list of 1,000 items identified as image files and hosted on .gov domains. The set includes 1,000 unique image files (primarily with GIF, JPG, PNG, and TIFF extensions) and minimal associated metadata, including links to their locations within the Library’s web archive. If the zip file is too big, you can download the README External and/or the metadata csv External separately.
  • 1000 .gov PowerPoint Dataset Download (3.2 GB zip file) External - This dataset of 1,000 PowerPoint files was generated from indexes of the Web archives, which were used to derive a random list of 1,000 items identified as PowerPoint files and hosted on .gov domains. The set includes 1,000 unique files and minimal metadata about these, including links to their locations within the Library's web archive. If the zip file is too big, you can download the README External and/or the metadata csv External separately.
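
Each dataset above ships with a metadata csv, so a first step is often to tally the values in one of its columns. The following is a minimal sketch using only the Python standard library. The column names (filename, mime_type, archive_url) and the sample rows are hypothetical placeholders, not the actual layout of the Library's files; check each dataset's README for the real column names.

```python
import csv
import io
from collections import Counter


def count_by_column(csv_text, column):
    """Count how often each value appears in one column of CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if column not in (reader.fieldnames or []):
        raise KeyError(f"column {column!r} not found in {reader.fieldnames}")
    return Counter(row[column] for row in reader)


# Hypothetical sample rows; the real metadata csv will have its own columns.
SAMPLE = """filename,mime_type,archive_url
a.pdf,application/pdf,https://example.org/a
b.pdf,application/pdf,https://example.org/b
c.gif,image/gif,https://example.org/c
"""

counts = count_by_column(SAMPLE, "mime_type")
for value, n in counts.most_common():
    print(value, n)
```

To use this on a downloaded file, read its contents into a string first, for example with open(path, newline="", encoding="utf-8").read(), and pass the result in place of SAMPLE.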

Web Cultures Datasets

These datasets comprise content found in the American Folklife Center's Web Cultures Web Archive. As part of an effort to review the coverage of web crawls of GIPHY and Meme Generator, the Web Archiving Team queried our web archives data, first finding the number of unique GIFs and distinct meme instances in both web archives. Read Trevor Owens' post about this release, "Data Mining Memes in the Digital Culture Web Archive" here.

  • Meme Generator Dataset Download External - The Meme Generator data set was generated from content harvested from Meme Generator and includes 57,652 unique meme instances derived from base memes (meme images without text, waiting to be fashioned into meme instances). The data set includes some minimal metadata for these memes. The data set does not include the memes themselves; however, it does provide links to where you can access their web archive copies within the Library's web archive.
  • Exploring the Meme Generator Metadata External - This Jupyter notebook demonstrates some of the basic things that can be done with the set of data from memegenerator using the Python programming language.
  • GIPHY Dataset Download External - The GIPHY dataset was generated from content harvested from GIPHY, and includes 10,972 unique GIFs. The data set includes some minimal metadata for these GIFs. The data set does not include the GIFs themselves; however, it does provide links to where you can access their web archive copies within the Library's web archive.
  • Exploring the GIPHY.com Metadata External - This Jupyter notebook demonstrates an intermediate approach to exploring the GIPHY.com data set produced by the Library of Congress using the Python programming language.

Miscellaneous Datasets

  • Iraq Selected Image Metadata (17.8 MB zip file) External - This dataset contains metadata for 306,954 image objects from the Iraq War 2003 Collection. The metadata was extracted using Apache Spark to query across the Library of Congress's Web Archive indexes. This dataset was created to satisfy a specific researcher request and contains metadata about image objects from 21 domains from the years 2003-2006, per the researcher's request.
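
Because this dataset spans several years, one simple way to explore it is to count image records per capture year. The sketch below assumes a column named timestamp holding 14-digit capture timestamps of the form YYYYMMDDhhmmss, as commonly used in web archive URLs. Both the column name and the sample rows are hypothetical, and the actual file may be organized differently.

```python
import csv
import io
from collections import Counter


def images_per_year(csv_text, timestamp_column="timestamp"):
    """Return {year: row count}, using the first four digits of each timestamp."""
    years = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        stamp = (row.get(timestamp_column) or "").strip()
        # Skip rows whose timestamp is missing or does not start with a year.
        if len(stamp) >= 4 and stamp[:4].isdigit():
            years[int(stamp[:4])] += 1
    return dict(sorted(years.items()))


# Hypothetical sample rows for illustration only.
SAMPLE_ROWS = """url,timestamp
http://example.org/a.jpg,20030401120000
http://example.org/b.jpg,20031115083000
http://example.org/c.jpg,20050620000000
http://example.org/d.jpg,
"""

print(images_per_year(SAMPLE_ROWS))
```

Rows with an empty or malformed timestamp are skipped rather than guessed at, so the totals may be lower than the number of rows in the file.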

Documentation

The Library's Web Archiving program page provides background information and includes details about our collection policies, technical approach, and information for searching and browsing the entire web archive. We post about web archiving and new collections on the Signal Blog.

Researchers interested in using web archive datasets may also want to explore The Archives Unleashed Project External, which is developing web archive search and data analysis tools for scholars, librarians, and archivists. Another resource for information about web archiving is the International Internet Preservation Consortium External (we're a founding member). IIPC has a handy Awesome Web Archiving External guide available for those interested in learning more about web archiving.

We need you

Our attempts to work through analysis of this data have helped our team identify some potential ways to enhance our crawls of these sites to better harvest more of their content. At the same time, we are also interested in exploring how making this kind of derivative data available to users might help spur further use and engagement with the collections. We are interested in hearing from users of these datasets – what you did with them, feedback on the documentation, what other derivative datasets might be of interest, and any other feedback or comments you have for us. Please write to us via our Contact Us form - we'd love to hear from you!