Web Archive Data Sets

Explore content from the Library's web archives

To facilitate research and to better understand the needs of users who might be interested in the Library of Congress Web Archives, the Library's Web Archiving Team is working to create a number of derivative data sets for users to download, reuse, and explore. After participating in the Library's pilot to explore how we can better enable digital scholarship beyond basic browsing of records and URL search, the team began exploring ways to process our archives, whose scale can be overwhelming to digital humanities researchers and other users. From this experimentation, the team created two derivative data sets that provide smaller pieces of data to help users engage with and learn more from our archives.

We plan to launch additional data sets with other types of content in the future, so keep an eye on this page and follow The Signal blog for updates.

Data Sets and Tutorials

Our initial offering is two data sets consisting of content found in the American Folklife Center's Web Cultures Web Archive. As part of an effort to review the coverage of web crawls of GIPHY and Meme Generator, the Web Archiving Team queried our web archives data, first finding the number of unique GIFs and distinct meme instances in each web archive.

  • Meme Generator Data Set Download External - The Meme Generator data set was generated from content harvested from Meme Generator and includes 57,652 unique meme instances derived from base memes (meme images without text, waiting to be fashioned into meme instances). The data set includes some minimal metadata for these memes. The data set does not include the memes themselves; however, it does provide links to where you can access their web archive copies within the Library's web archive.
  • Exploring the Meme Generator Metadata External - This notebook demonstrates some basic things that can be done with the Meme Generator data set.
  • GIPHY Data Set Download External - The GIPHY data set was generated from content harvested from GIPHY and includes 10,972 unique GIFs. The data set includes some minimal metadata for these GIFs. The data set does not include the GIFs themselves; however, it does provide links to where you can access their web archive copies within the Library's web archive.
  • Exploring the Metadata External - This notebook demonstrates an intermediate approach to exploring the data set produced by the Library of Congress.
  • 1000 .gov PDF Dataset Download (673.5 MB zip file) External - This dataset of 1,000 PDF files was generated from indexes of the Web archives, which were used to derive a random list of 1,000 items identified as PDF files and hosted on .gov domains. The set includes 1,000 unique PDF files and minimal metadata about these PDFs, including links to their locations within the Library's web archive. If the zip file is too big, you can download the README External and/or the metadata csv External separately.
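Each of the data sets above pairs the archived content with a small metadata file that includes links back to the Library's web archive. As a minimal sketch of how such metadata might be explored, the snippet below parses a CSV of meme-instance records with Python's standard library. The column names (`base_meme_name`, `alt_text`, `archived_url`) and the sample rows are hypothetical illustrations, not the actual schema of the Library's files; consult each data set's README for the real field names.

```python
import csv
import io

# Hypothetical sample of a metadata CSV. The real data sets' column
# names and values may differ -- check the README shipped with each set.
sample_csv = """base_meme_name,alt_text,archived_url
Example Meme A,"caption text one",http://example.org/archive/instance/1
Example Meme B,"caption text two",http://example.org/archive/instance/2
"""

def load_metadata(text):
    """Parse metadata CSV text into a list of dicts, one per record."""
    return list(csv.DictReader(io.StringIO(text)))

rows = load_metadata(sample_csv)
print(len(rows))                   # number of records parsed
print(rows[0]["base_meme_name"])   # first record's base meme name
```

In practice you would read the downloaded CSV from disk (`open("metadata.csv", newline="")`) instead of an in-memory string; `csv.DictReader` then gives you one dictionary per meme or GIF, keyed by the file's actual header row.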


Read Trevor Owens' post about this release, "Data Mining Memes in the Digital Culture Web Archive," on The Signal.

The Library's Web Archiving program page provides background information and includes details about our collection policies, technical approach, and information for searching and browsing the entire web archive. We post about web archiving and new collections on the Signal Blog.

Researchers interested in using web archive data sets may also want to explore The Archives Unleashed Project External, which is developing web archive search and data analysis tools for scholars, librarians, and archivists. Another resource for information about web archiving is the International Internet Preservation Consortium External (we're a founding member). IIPC has a handy Awesome Web Archiving External guide available for those interested in learning more about web archiving.

We need you

Our attempts to analyze this data have helped our team identify some potential ways to enhance our crawls of these sites and harvest more of their content. At the same time, we are interested in exploring how making this kind of derivative data available might spur further use of and engagement with the collections. We want to hear from users of these data sets: what you did with them, feedback on the documentation, what other derivative data sets might be of interest, and any other comments you have for us. Please write to us via our Contact Us form - we'd love to hear from you!
