- have i18n stuff set 'Content-Language' in response header
- only add id_ to PDF replay (not other content types)
- add 'hdl' to schema (release extid)

human names: https://www.librarian.net/stax/5222/what-you-learn-in-library-school-whats-in-a-name/

- refactor publisher domain/link lookup into python code (not jinja2 template logic)
- async elasticsearch: https://elasticsearch-py.readthedocs.io/en/v7.10.1/async.html
    => not until elasticsearch-dsl support exists
- adopt a better "remove tags" library in clean_str() (replace bs4)
    https://stackoverflow.com/a/57648173/4682349

work in progress:
- SIM respect 'noindex' (verify)
- SIM fetch much less metadata (changes in API?)

copy editing:
- "how it works" page
- "Contribute" -> "How To Participate"
- web.archive.org not found -> resolved

workers:
- fetch worker: filter by changelog or 'updated' datetime
- work deletion: some bundle version/variant to allow deleting ES documents?
- SIM updater
    x what is the process for updating issue DB? cronjob? at least have a makefile target
    => not really a problem now that SIM is ~done

SIM:
x SIM parallelism
x fatcat lookup: ISSN/ISSN-L
    - ambiguity. sim_pubid only extra? are both ISSNs available?
- add page count to issn-db (?)

content/pipeline:
- add gzip to intermediate files pipeline commands
- makefile targets for bulk ingest

cleanups:
x "web assets" (CSS etc) in this repo or on *.archive.org in general
x have "json to IntermediateBundle" be a helper method, instead of multiple implementations
- better typing/annotation of work pipeline
- test coverage
- use settings.toml for defaults of CLI args

ponder:
- "search inside" phrasing
- "counts" target to summarize (to console)
- Onion-Location header
- description
- canonical links?
- jinja2: "if xyz is defined" better than "if xyz"
- "default" translation option (clear prefix, use browser default)
- should SERP page have an