-
Notifications
You must be signed in to change notification settings - Fork 0
Description
In many web archives, the captures are stored in collections that might offer a way to better attribute the source of a capture, e.g., if it was captured manually or in a focused web crawl.
For example, this capture https://web.archive.org/web/20230409121919/https://www.google.co.uk/search?q=how+to+learn+cinematography&ie=UTF-8&oe=UTF-8&hl=en-gb&client=safari stems from the "Archive Team" initiative:

This information can be recovered from the HTML headers of the capture (stored in the SERP WARC in our data):
HTTP/2 200
server: nginx
date: Wed, 16 Apr 2025 11:07:38 GMT
content-type: text/html; charset=UTF-8
x-archive-orig-date: Sun, 09 Apr 2023 12:19:19 GMT
x-archive-orig-expires: -1
...
x-archive-src: archiveteam_urls_20230409122307_bd241a16/urls_20230409122307_bd241a16.1650286618.megawarc.warc.zst
...
The x-archive-src header points to the file (i.e., archiveteam_urls_20230409122307_bd241a16/urls_20230409122307_bd241a16.1650286618.megawarc.warc.zst) on the archive's servers, which are in turn structured in folders (i.e., archiveteam_urls_20230409122307_bd241a16).
Now, when looking for that folder in the Internet Archive's item search (https://archive.org/search?query=archiveteam_urls_20230409122307_bd241a16), we find, that the folder corresponds to the "item" in the archive (https://archive.org/details/archiveteam_urls_20230409122307_bd241a16) for which additional metadata is available:

It seems straight-forward to read this metadata and then add it to the AQL.
Metadata
Metadata
Assignees
Labels
Type
Projects
Status