
Web Archive Data Sets


Explore content from the Library's web archives

To better facilitate research and to better understand the needs of users who might be interested in the Library of Congress Web Archives, the Library's Web Archiving team is working to create a number of derivative data sets and make them available for users to download, re-use, explore, and experiment with. After participating in the Library's pilot to explore how we can better enable digital scholarship beyond basic browsing of records and URL search, the team began exploring ways to process our archives, the scale of which can be overwhelming to digital humanities researchers and other users. The goal is to create derivative data sets that give researchers smaller pieces of data to help them engage with and learn more from our archives.

We plan to launch additional data sets with other types of content in the future, so keep an eye on this page and follow The Signal blog for updates.

Data Sets

Our initial offering is two data sets comprising content found in the American Folklife Center's Web Cultures Web Archive. As part of an effort to review the coverage of web crawls of GIPHY and Meme Generator, the Web Archiving Team queried our web archives data, first finding the number of unique GIFs and distinct meme instances in both web archives.

  • Meme Generator Data Set Download - The Meme Generator data set includes 86,310 total memes harvested from Meme Generator. There are 57,652 unique meme instances derived from base memes (meme images without text, waiting to be fashioned into meme instances). The data set includes some minimal metadata for these memes. The data set does not include the memes themselves; however, it does provide links to where you can access their web archive copies within the Library's web archive.
  • GIPHY Data Set Download - The GIPHY data set was generated from content harvested from GIPHY, and includes 14,787 total GIFs, of which 10,972 are unique. The data set includes some minimal metadata for these GIFs, as well as links to where you can access their web archive copies.
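
For readers who want to check totals against unique counts like those quoted above, here is a minimal Python sketch. It assumes the downloaded metadata can be read as CSV and that one column identifies identical items; the column names `media_hash` and `archived_url`, the sample rows, and the function name are all hypothetical and are not taken from the actual data sets.

```python
import csv
import io


def count_total_and_unique(csv_text, key_column):
    """Return (total rows, number of distinct values in key_column)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    total = 0
    seen = set()
    for row in reader:
        total += 1
        seen.add(row[key_column])
    return total, len(seen)


# Hypothetical sample data; the real files' columns and values may differ.
SAMPLE = """media_hash,archived_url
abc123,https://example.org/archived/1
abc123,https://example.org/archived/2
def456,https://example.org/archived/3
"""

total, unique = count_total_and_unique(SAMPLE, "media_hash")
print(total, unique)
```

To try this on a real download, read the file's text and pass whichever column the documentation identifies as distinguishing one item from another.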

Documentation

Read Trevor Owens' post about this release, "Data Mining Memes in the Digital Culture Web Archive" here.

The Library's Web Archiving program page provides background information and includes details about our collection policies, technical approach, and information for searching and browsing the entire web archive. We post about web archiving and new collections on the Signal Blog.

Researchers interested in using web archive data sets may also want to explore The Archives Unleashed Project, which is developing web archive search and data analysis tools for scholars, librarians, and archivists. Another resource for information about web archiving is the International Internet Preservation Consortium (we're a founding member). IIPC has a handy Awesome Web Archiving guide available for those interested in learning more about web archiving.

We need you

Our attempts to work through analysis of this data have helped our team identify some potential ways to enhance our crawls of these sites to better harvest more of their content. At the same time, we are also interested in exploring how making this kind of derivative data available to users might help spur further use and engagement with the collections. We are interested in hearing from users of these data sets – what you did with them, feedback on the documentation, what other derivative data sets might be of interest, and any other feedback or comments you have for us. Please write to us via our Contact Us form - we'd love to hear from you!
