Archivr

Archivr is a project by the Qualitative Data Repository that automates the preservation of URLs in web archives.

Installation

The easiest way to install archivr is directly from this GitHub repository, using the devtools package:

library(devtools)
install_github("QualitativeDataRepository/archivr")
library(archivr)

Usage

The basic function is archiv, which takes a list of URLs and stores them in the Wayback Machine. It returns a data frame containing the callback data from the service.

arc_df <- archiv(list("www.example.com", "NOTAURL", "www.github.com"))
arc_df$wayback_url
#                                                        wayback_url
# 1 http://web.archive.org/web/20190128171132/http://www.example.com
# 2    http://web.archive.org/web/20190128171134/https://github.com/ ...

archivr can also archive all the URLs found on a webpage.

arc_url_df <- archiv.fromUrl("https://qdr.syr.edu/")
df <- data.frame(arc_url_df$url, arc_url_df$wayback_url)[8,]

#   arc_url_df.url                                    arc_url_df.wayback_url
# 8 http://syr.edu http://web.archive.org/web/20170110050058/http://syr.edu/

archivr can also archive all the URLs in a text file. It has been tested with .docx, .pdf, and Markdown files, although other text-based formats should also work. Note that text parsing can be error-prone, especially when the document has rich features such as tables or columns.

arc_url_df <- archiv.fromText("path_to_file")

To allow for pre-processing of URLs before archiving, archivr also provides access to the functions used to extract URLs from a webpage (extract_urls_from_webpage("URL")), from a file (extract_urls_from_text("filepath"), tested for .docx, Markdown, and .pdf), and from any supported text file in a folder (extract_urls_from_folder("filepath")).
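As an illustration of what such pre-processing might look like, the following minimal sketch tidies a vector of extracted URLs in base R before archiving. The helper clean_urls is hypothetical and not part of archivr, the sample URLs are made up, and the sketch assumes the extraction functions return something that can be treated as a character vector.

```r
# Hypothetical helper (not part of archivr): tidy a character vector of URLs.
clean_urls <- function(urls) {
  urls <- trimws(urls)                 # drop surrounding whitespace
  urls <- sub("[.,;:)]+$", "", urls)   # strip trailing punctuation picked up from prose
  urls <- urls[nzchar(urls)]           # drop empty strings
  unique(urls)                         # keep each URL only once
}

# Made-up sample standing in for the output of an extraction function.
extracted <- c(
  "https://example.com/a.",
  " https://example.com/a",
  "https://example.com/b)",
  ""
)

cleaned <- clean_urls(extracted)
print(cleaned)

# With archivr installed, the cleaned URLs could then be archived, e.g.:
# arc_df <- archiv(as.list(cleaned))
```

Removing duplicates before archiving avoids sending the same URL to the archiving service more than once.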

Exempting Urls

All of the functions that extract or archive URLs from a document or webpage accept an except parameter: a regular expression (evaluated with R's grepl function) that excludes matching URLs from extraction and archiving. For example,

arc_url_df <- archiv.fromText("article.pdf", except="https?:\\/\\/(dx\\.)?doi\\.org\\/")

will exclude DOI links from archiving.
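To preview what an except pattern would exclude, it can be tried with grepl on a few sample URLs first. The sketch below uses base R only; the sample URLs are made up, and it illustrates the regular expression itself rather than archivr's internal filtering.

```r
# Made-up sample URLs, used only to illustrate the pattern.
urls <- c(
  "https://doi.org/10.1000/182",
  "http://dx.doi.org/10.1000/182",
  "https://www.example.com/page"
)

# The same pattern as in the DOI example above.
except <- "https?:\\/\\/(dx\\.)?doi\\.org\\/"

# grepl() returns TRUE for URLs that match the pattern, i.e. those to exclude.
excluded <- grepl(except, urls)

# URLs that would remain after exclusion.
kept <- urls[!excluded]
print(kept)
```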

Checking archiving status

You can check whether URLs have been archived in the Internet Archive's Wayback Machine:

arc_url_df <- view_archiv(list("www.example.com", "NOTAURL", "www.github.com"), "wayback")

Using Perma.cc

If you wish to use perma.cc's archive, you will need to set your API key using:

set_api_key("YOUR_API_KEY")
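To keep the key out of scripts, one option is to read it from an environment variable. This is a minimal sketch, not a documented archivr workflow: the variable name PERMA_CC_API_KEY is an assumption chosen for this example.

```r
# PERMA_CC_API_KEY is a hypothetical variable name chosen for this sketch.
api_key <- Sys.getenv("PERMA_CC_API_KEY", unset = "")

if (nzchar(api_key) && requireNamespace("archivr", quietly = TRUE)) {
  # Pass the key to archivr without hard-coding it in the script.
  archivr::set_api_key(api_key)
} else {
  message("Set PERMA_CC_API_KEY and install archivr before using perma.cc.")
}
```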

If you wish to save the URLs in a particular perma.cc folder, you will need to set the default folder ID using:

set_folder_id("FOLDER_ID")

If you do not remember the IDs of your folders, you can retrieve them in a data frame using:

get_folder_ids()

You can check your current folder using:

get_folder_id()

You can then archive materials:

arc_df <- archiv(list("www.example.com", "NOTAURL", "www.github.com"), "perma_cc")

To check whether a list of URLs is archived in perma.cc, using its public API, use:

arc_url_df <- view_archiv(list("www.example.com", "NOTAURL", "www.github.com"), "perma_cc")

Archivr is developed and maintained by the Qualitative Data Repository at Syracuse University. It was originally authored by Ryan Deschamps (greebie on github.com) and Agile Humanities.