
Scraping Archived Data with the Wayback Machine

In a previous article, I wrote about the possibilities of the Wayback Machine for scientific writing. I argued that archiving web pages is essential for referencing, as it prevents link rot when cited web resources become unavailable. With this blog entry, I am looking at more or less the reverse question: How can one find and retrieve archived web pages for research purposes?

Archived web pages, as permanently stored data, are indispensable for reproducibility. But they are also valuable research resources, as they provide data for historical and comparative research. I will demonstrate this research significance with a historical analysis of static website generators. This first part shows how to use the Wayback Machine to retrieve archived web pages. The second part presents the results of the analysis, which would not have been possible without web archiving.

The nitty-gritty of this article comes from the excellent work of Bob Rudis, who wrote a collection of well-documented Tools to Work with the Various Internet Archive Wayback Machine APIs.

Some simple exercises

Preliminaries

There are three different APIs for the Wayback Machine:

  1. Wayback Availability API
  2. Memento API and
  3. Wayback CDX Server API

I will explain the first two. The CDX Server is for complex querying, filtering and analysis, and I did not use it for my example. First you have to install the wayback package and load Bob’s script collection.

if (!require("wayback")) {
        remotes::install_github("hrbrmstr/wayback", build_vignettes = TRUE)
        library(wayback)
}
## Loading required package: wayback

Does the Internet Archive have my research URL cached?

The URL I am looking for is: https://www.staticgen.com/. I am going to use archive_available(url, timestamp). The timestamp parameter is optional; if it is missing, the query date is used. If the URL is archived, the function returns the chronologically nearest archived version. The return value is a tibble with one observation and 5 variables:

staticgen_avail <- archive_available("https://www.staticgen.com/")
staticgen_avail
## # A tibble: 1 x 5
##   url         available closet_url               timestamp           status
##   <chr>       <lgl>     <chr>                    <dttm>              <chr> 
## 1 https://ww… TRUE      http://web.archive.org/… 2019-07-29 00:00:00 200
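
The optional timestamp lets you ask for the capture closest to a particular point in time instead of the most recent one. A minimal sketch (the date is arbitrary, and see the package documentation for the accepted timestamp formats):

# Ask for the snapshot closest to 2016-01-01 instead of the snapshot
# closest to today.
archive_available("https://www.staticgen.com/", timestamp = "20160101")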

Retrieve site mementos from the Internet Archive

Mementos are prior versions of web pages cached by web crawlers and stored in web archives. The Internet Archive is one of these web archives, but other systems exist as well, including systems that support versioning such as wikis or revision control systems. The HTTP-based Memento framework is a specification for Time-Based Access to Resource States.

The HTTP-based Memento framework bridges the present and past Web. It facilitates obtaining representations of prior states of a given resource by introducing datetime negotiation and TimeMaps. Datetime negotiation is a variation on content negotiation that leverages the given resource’s URI and a user agent’s preferred datetime. TimeMaps are lists that enumerate URIs of resources that encapsulate prior states of the given resource. The framework also facilitates recognizing a resource that encapsulates a frozen prior state of another resource.
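
Datetime negotiation can be tried out directly against the Internet Archive’s timegate, which is the same URL that get_mementos returns below. A minimal sketch using the httr package (my choice, not part of wayback; the preferred datetime is arbitrary):

library(httr)

# Ask the timegate for the memento closest to a preferred datetime via
# the Accept-Datetime request header (RFC 7089).
resp <- HEAD(
        "http://web.archive.org/web/https://www.staticgen.com/",
        add_headers(`Accept-Datetime` = "Mon, 01 Aug 2016 00:00:00 GMT")
)
# httr follows the redirect issued by the timegate, so the final URL is
# the URI of the selected memento.
resp$url
headers(resp)[["memento-datetime"]]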

There are several resources for a better understanding of the Memento framework.

The Open Wayback software used by the Internet Archive is fully compliant with RFC 7089, the specification of the Memento protocol.

With get_mementos(url, timestamp = format(Sys.Date(), "%Y")) we receive a short list of relevant links to the archived content. The function returns the four link relation types as outlined in the Request for Comments for the Memento framework.

  1. Link Relation Type “original”
  2. Link Relation Type “timemap”
  3. Link Relation Type “timegate”
  4. Link Relation Type “memento”

Besides these four main types of link relations, the function also provides the first, previous and last available memento. Normally the last memento is identical to the memento link relation type. In addition to the two columns link and rel there is a third one, ts, containing the time stamps (empty for the first three link relation types). The return value in total is a tibble with 7 observations (rows) and three columns.

staticgen_mntos <- get_mementos("https://www.staticgen.com/")
staticgen_mntos
## # A tibble: 7 x 3
##   link                                        rel       ts                 
##   <chr>                                       <chr>     <dttm>             
## 1 https://www.staticgen.com/                  original  NA                 
## 2 http://web.archive.org/web/timemap/link/ht… timemap   NA                 
## 3 http://web.archive.org/web/https://www.sta… timegate  NA                 
## 4 http://web.archive.org/web/20130905221150/… first me… 2013-09-05 22:11:50
## 5 http://web.archive.org/web/20190729021222/… prev mem… 2019-07-29 02:12:22
## 6 http://web.archive.org/web/20190801211621/… memento   2019-08-01 21:16:21
## 7 http://web.archive.org/web/20190801211621/… last mem… 2019-08-01 21:16:21

Get the point-in-time memento crawl list

Entering a URL into the search field of the Wayback Machine leads, in the interactive browser version, to the calendar view. Dates with archived content are circled in blue or green (green indicating a redirected URL). The bigger the circle, the more snapshots were archived on that date.

We get this dated crawl list with the second observation returned by the get_mementos function: the timemap link. Executing the next code chunk can take a while, depending on how many captures of the URL are archived. Be aware that this query puts a strain on the Wayback server, so do not repeat it several times; store the result on your hard disk instead.

staticgen_tm <- get_timemap(staticgen_mntos$link[2])
staticgen_tm
## # A tibble: 489 x 5
##    rel      link                     type       from         datetime      
##    <chr>    <chr>                    <chr>      <chr>        <chr>         
##  1 original http://staticgen.com:80/ <NA>       <NA>         <NA>          
##  2 self     http://web.archive.org/… applicati… Thu, 05 Sep… <NA>          
##  3 timegate http://web.archive.org   <NA>       <NA>         <NA>          
##  4 first m… http://web.archive.org/… <NA>       <NA>         Thu, 05 Sep 2…
##  5 memento  http://web.archive.org/… <NA>       <NA>         Sun, 06 Oct 2…
##  6 memento  http://web.archive.org/… <NA>       <NA>         Wed, 06 Nov 2…
##  7 memento  http://web.archive.org/… <NA>       <NA>         Mon, 25 Nov 2…
##  8 memento  http://web.archive.org/… <NA>       <NA>         Thu, 26 Dec 2…
##  9 memento  http://web.archive.org/… <NA>       <NA>         Mon, 20 Jan 2…
## 10 memento  http://web.archive.org/… <NA>       <NA>         Mon, 20 Jan 2…
## # … with 479 more rows

Note that the 489 rows do not all correspond to captures in the interactive calendar view: four rows relate to the link relation types mentioned above, and the last row is empty.
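
To follow my own advice about not straining the server, a simple pattern is to cache the timemap on disk and reload it in later sessions. A minimal sketch (the file name is my choice):

# Cache the timemap locally; delete the file to force a fresh download.
tm_file <- "staticgen_timemap.rds"
if (file.exists(tm_file)) {
        staticgen_tm <- readRDS(tm_file)
} else {
        staticgen_tm <- get_timemap(staticgen_mntos$link[2])
        saveRDS(staticgen_tm, tm_file)
}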

Summary: Putting it all together

We can put all the previous steps together into a new function, e.g. get_crawl(url). This function takes a URL and returns a list of all archived versions of this URL. A sketch of such a function follows below the list.

  • Check whether an archived version of the URL exists. If not: stop execution.
  • If an archived version exists, retrieve the mementos for this URL from the Internet Archive.
  • Get the point-in-time memento crawl list for this URL.
  • Clean up so that only memento links remain.
  • Delete the unnecessary columns type and from.
  • Convert the column datetime from class ‘character’ to a datetime class (‘POSIXct’).
  • Delete duplicate datetime records. (Sometimes more than one capture is taken on the same day, differing only in the URL and the port used.)
  • Filter the rows so that only those mementos remain which are suitable for the comparative analysis. For instance: take the first memento of every year, or of every month, etc.
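
The following sketch implements these steps with dplyr and lubridate. It is not the exact code from my RPubs article: the column names come from the timemap shown above, and keeping the first memento of every year is just one of the possible selection rules.

library(dplyr)
library(lubridate)

get_crawl <- function(url) {
        # Stop if the Internet Archive has no capture of this URL.
        avail <- archive_available(url)
        if (!avail$available) stop("No archived version found for: ", url)
        # Retrieve the mementos; the timemap link is the second row.
        mntos <- get_mementos(url)
        tm <- get_timemap(mntos$link[2])
        tm %>%
                # Keep only the memento rows (incl. first/last memento).
                filter(grepl("memento", rel)) %>%
                # Delete the unnecessary columns type and from.
                select(-type, -from) %>%
                # Convert datetime from character to POSIXct.
                mutate(datetime = parse_date_time(datetime, "a d b Y H M S")) %>%
                # Delete duplicate capture times.
                distinct(datetime, .keep_all = TRUE) %>%
                # Keep the first memento of every year.
                arrange(datetime) %>%
                group_by(year = year(datetime)) %>%
                slice(1) %>%
                ungroup()
}

staticgen_crawl <- get_crawl("https://www.staticgen.com/")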

As these steps are either already illustrated or not specific to the Wayback Machine, I will stop my explanations here. I have provided an article on RPubs with all the details. But keep in mind that I am not very experienced in R, and there may be much simpler and more elegant solutions.

Read my RPubs article with all the details of retrieving and web scraping. Also visit my tutorial on how to use the Wayback Machine interactively.

Page created: 2019-08-01 | Last modified: 2019-08-02