Web Archive Datasets

A collaboration to produce web archive datasets between LC Labs and the Web Archiving Program (retired 2025)

The Web Archive Datasets were created through a collaboration between LC Labs and the Library’s Web Archiving Program to better facilitate research in the Library of Congress Web Archives. Since their initial incubation by LC Labs in 2020, the datasets have moved to other locations and are now maintained by the Library's Web Archiving Program.

Web Archiving Experiment's Impact

Producing derivative datasets from the web archives helped computational researchers, learners, and teachers explore new ways of engaging with the web content of the past. Throughout its lifetime, the Web Archiving Experiment page saw over 5,500 views.

Through this collaboration, LC Labs and the Web Archiving Program:

Initiated processes and workflows for identifying datasets and packaging the data.
Created public documentation that supports engagement with and creative reuse of the datasets.
Developed a better understanding of the media objects that comprise the web archives.

Learn More About the Web Archiving Experiment

There are two data packages available on data.labs.loc.gov/dot-gov/: the United States Web Election Data Package and the Selected Dot Gov Media Types Data Package. Additionally, web archiving datasets in the Library's permanent collections can be found in the Select Datasets collection. Of these datasets, the following were initially part of the Selected Dot Gov Media Types Data Package experiment: PDF Dataset, Audio Dataset, XLS Dataset, TSV Dataset, CSV Dataset, Image Dataset, and PowerPoint Dataset.

Check out this Signal blog post to learn more about the Dot Gov Media Types datasets, and this Signal blog post to learn more about the United States Web Elections Data Package.

Want to see what the original Web Archive Dataset page looked like?

If you would like to see what this page looked like during the experiment, including the content it originally hosted, please visit the archived version of this page captured by the Library’s Web Archiving Program.