Parse capture's collection in the archive from the WARC #94

@janheinrichmerker

In many web archives, captures are stored in collections, which could help attribute the source of a capture, e.g., whether it was captured manually or in a focused web crawl.
For example, this capture https://web.archive.org/web/20230409121919/https://www.google.co.uk/search?q=how+to+learn+cinematography&ie=UTF-8&oe=UTF-8&hl=en-gb&client=safari stems from the "Archive Team" initiative:
Image

This information can be recovered from the HTTP response headers of the capture (stored in the SERP WARC in our data):

HTTP/2 200 
server: nginx
date: Wed, 16 Apr 2025 11:07:38 GMT
content-type: text/html; charset=UTF-8
x-archive-orig-date: Sun, 09 Apr 2023 12:19:19 GMT
x-archive-orig-expires: -1
...
x-archive-src: archiveteam_urls_20230409122307_bd241a16/urls_20230409122307_bd241a16.1650286618.megawarc.warc.zst
...

The x-archive-src header gives the path of the file on the archive's servers that holds the capture (here, archiveteam_urls_20230409122307_bd241a16/urls_20230409122307_bd241a16.1650286618.megawarc.warc.zst). These files are in turn organized in folders (here, archiveteam_urls_20230409122307_bd241a16).
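A minimal sketch of how the folder name could be split off the header value. It assumes the value always has the shape `<folder>/<file>` as in the example above; the function name `item_identifier_from_src` is hypothetical and not part of the existing code base.

```python
from typing import Optional


def item_identifier_from_src(x_archive_src: str) -> Optional[str]:
    """Return the leading folder of an x-archive-src value, or None if it has none.

    Assumption: the header value looks like "<folder>/<file>".
    """
    # Tolerate surrounding whitespace and a leading slash.
    cleaned = x_archive_src.strip().lstrip("/")
    folder, separator, file_name = cleaned.partition("/")
    # Without both a folder and a file part there is nothing to look up.
    if not separator or not folder or not file_name:
        return None
    return folder


src = (
    "archiveteam_urls_20230409122307_bd241a16/"
    "urls_20230409122307_bd241a16.1650286618.megawarc.warc.zst"
)
print(item_identifier_from_src(src))
# prints "archiveteam_urls_20230409122307_bd241a16"
```

Returning `None` instead of raising keeps captures without a usable header in the data set; whether that is the right behavior here is a design choice left open.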

Searching for that folder name in the Internet Archive's item search (https://archive.org/search?query=archiveteam_urls_20230409122307_bd241a16) shows that the folder corresponds to an "item" in the archive (https://archive.org/details/archiveteam_urls_20230409122307_bd241a16), for which additional metadata is available:
Image
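The item metadata could also be read programmatically. The sketch below uses the Internet Archive's Metadata API (`https://archive.org/metadata/<identifier>`), which returns JSON with a `metadata` object whose `collection` field can be a single string or a list. The function names are hypothetical, and the sample response is made up for illustration; it is not the actual metadata of the item above.

```python
import json
from typing import Any, Dict, List
from urllib.parse import quote
from urllib.request import urlopen


def metadata_url(identifier: str) -> str:
    """Build the Metadata API URL for an Internet Archive item identifier."""
    return "https://archive.org/metadata/" + quote(identifier, safe="")


def collections_from_response(response: Dict[str, Any]) -> List[str]:
    """Extract the collection names from a parsed Metadata API response."""
    value = response.get("metadata", {}).get("collection", [])
    # The field holds a plain string when the item is in a single collection.
    if isinstance(value, str):
        return [value]
    return list(value)


def fetch_collections(identifier: str, timeout: float = 10.0) -> List[str]:
    """Fetch an item's collections over the network (not called below)."""
    with urlopen(metadata_url(identifier), timeout=timeout) as response:
        return collections_from_response(json.load(response))


# Made-up response, only to show the two shapes of the "collection" field.
sample = {"metadata": {"identifier": "example_item", "collection": ["col_a", "col_b"]}}
print(collections_from_response(sample))
print(collections_from_response({"metadata": {"collection": "col_a"}}))
```

Since many captures share the same item, caching the result per identifier would probably avoid most of the requests.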

It seems straightforward to read this metadata and then add it to the AQL.

Metadata

Labels

enhancement (New feature or request)

Projects

Status

In Progress
