Parse capture's collection in the archive from the WARC

In many web archives, captures are stored in collections, which could help to better attribute the source of a capture, e.g., whether it was captured manually or in a focused web crawl.
For example, this capture <https://web.archive.org/web/20230409121919/https://www.google.co.uk/search?q=how+to+learn+cinematography&ie=UTF-8&oe=UTF-8&hl=en-gb&client=safari> stems from the "Archive Team" initiative:
![Image](https://github.com/user-attachments/assets/5fa8d013-3f47-40ca-85e9-9cd034068e16)

This information can be recovered from the HTTP response headers of the capture (stored in the SERP WARC in our data):
```
HTTP/2 200 
server: nginx
date: Wed, 16 Apr 2025 11:07:38 GMT
content-type: text/html; charset=UTF-8
x-archive-orig-date: Sun, 09 Apr 2023 12:19:19 GMT
x-archive-orig-expires: -1
...
x-archive-src: archiveteam_urls_20230409122307_bd241a16/urls_20230409122307_bd241a16.1650286618.megawarc.warc.zst
...
```
The `x-archive-src` header points to the file on the archive's servers (i.e., `archiveteam_urls_20230409122307_bd241a16/urls_20230409122307_bd241a16.1650286618.megawarc.warc.zst`). These files are in turn organized in folders (here, `archiveteam_urls_20230409122307_bd241a16`).
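
Extracting the folder from the header value could look roughly like the following. This is only a minimal sketch: the function name `item_identifier_from_archive_src` is hypothetical, and it assumes that the folder is always the part of the `x-archive-src` value before the first `/`, as in the example above.

```python
from typing import Optional


def item_identifier_from_archive_src(archive_src: Optional[str]) -> Optional[str]:
    """Return the leading folder of an `x-archive-src` value, or None.

    Assumption: the value has the shape `<folder>/<file path>`.
    """
    if not archive_src:
        return None
    # Everything before the first "/" is the folder; the rest is the file path.
    folder, separator, _file_path = archive_src.strip().partition("/")
    # Without a "/" there is no folder component to report.
    if not separator or not folder:
        return None
    return folder


src = (
    "archiveteam_urls_20230409122307_bd241a16/"
    "urls_20230409122307_bd241a16.1650286618.megawarc.warc.zst"
)
print(item_identifier_from_archive_src(src))
# prints: archiveteam_urls_20230409122307_bd241a16
```

Returning `None` for missing or unexpected values keeps captures without an `x-archive-src` header from failing the parse.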

Now, when looking for that folder in the Internet Archive's item search (https://archive.org/search?query=archiveteam_urls_20230409122307_bd241a16), we find that the folder corresponds to an "item" in the archive (https://archive.org/details/archiveteam_urls_20230409122307_bd241a16), for which additional metadata is available:
![Image](https://github.com/user-attachments/assets/12b4f537-02df-46aa-b903-0672b8f9e9af)

It seems straightforward to [read this metadata](https://archive.org/developers/internetarchive/quickstart.html#reading-metadata) and then add it to the AQL.
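
As a rough sketch of that lookup, the snippet below reads an item's metadata and pulls out its collections. Note that it does not use the `internetarchive` Python library linked above but the Internet Archive's Metadata API endpoint (`https://archive.org/metadata/<identifier>`) via the standard library, to avoid an extra dependency. The function names `fetch_item_metadata` and `collections_from_metadata` are hypothetical, and how the result would be stored in the AQL is left open.

```python
import json
from typing import Any
from urllib.parse import quote
from urllib.request import urlopen


def fetch_item_metadata(identifier: str, timeout: float = 30.0) -> dict[str, Any]:
    """Fetch the JSON document describing an item from the Metadata API."""
    url = f"https://archive.org/metadata/{quote(identifier)}"
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def collections_from_metadata(payload: dict[str, Any]) -> list[str]:
    """Extract the collection names from a Metadata API response."""
    collection = payload.get("metadata", {}).get("collection", [])
    # The `collection` field may be a single string or a list of strings.
    if isinstance(collection, str):
        return [collection]
    return list(collection)
```

Calling `collections_from_metadata(fetch_item_metadata("archiveteam_urls_20230409122307_bd241a16"))` would then give the collections of the item from the example. With the linked library, `internetarchive.get_item(identifier).metadata` should expose the same `metadata` dictionary.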
