@janheinrichmerker (Contributor) commented Sep 22, 2025

  • Migrate from elasticsearch-dsl to (type-safe) elasticsearch-pydantic
  • Restructure parsing to use hard-coded (i.e., unit-testable) parsers
  • Migrate monitoring from Flask to FastAPI (now with a more flexible API)
  • Implement landing page download for web search result blocks (fixes #6, "Download referenced webpages from search results")
  • Remove unused legacy code
  • Update readme
  • Update dependencies
  • Update CI
  • Add dry-run option to test most CLI actions without actual writes
  • Add JSONL export (and local "fake" WARC store for testing)
  • Migrate "legacy" unit tests to work with the current parsing architecture
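
For reviewers unfamiliar with the pattern, here is a minimal sketch of how a dry-run flag and a JSONL export can fit together. It is an illustration only: the function name `export_jsonl` and its parameters are hypothetical and are not taken from this PR's code.

```python
import json
from pathlib import Path
from typing import Any, Iterable, Mapping


def export_jsonl(
    records: Iterable[Mapping[str, Any]],
    path: Path,
    dry_run: bool = False,
) -> int:
    """Write one JSON object per line and return the number of records.

    Hypothetical helper: with dry_run=True, records are still serialized
    (so serialization errors surface) but nothing is written to disk.
    """
    count = 0
    if dry_run:
        for record in records:
            json.dumps(record, ensure_ascii=False)
            count += 1
        return count
    with path.open("w", encoding="utf-8") as file:
        for record in records:
            file.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

In this sketch the dry run goes through the same serialization step as the real export, so it exercises most of the code path while skipping the write itself.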

Development

Successfully merging this pull request may close these issues.

Download referenced webpages from search results

2 participants