docs: remove orphaned translated manual pages#2165

Draft
dscho wants to merge 2 commits into
git:gh-pagesfrom
dscho:remove-orphan-translated-pages

@dscho dscho commented May 16, 2026

This PR is the prerequisite for the follow-up update-dependencies PR (TBD), which bumps Hugo from 0.155.3 to 0.161.1 (alongside Pagefind, npm devDependencies, and a few related template and config migrations). Hugo refuses to build the site at the new version while 36 translated manual pages still use the pre-v0.144.0 front-matter shape `lang: <code>`. The existing `update-translated-manual-pages.yml` workflow cannot regenerate these files into the new shape because their upstream `<lang>/<docname>.adoc` sources have been removed from `jnavila/git-html-l10n`: the workflow only writes files it finds in the upstream source list, so outputs left behind by since-deleted sources are never revisited. I verified this locally by cloning `jnavila/git-html-l10n` at HEAD and confirming that none of the 36 removed files has a `<lang>/<docname>.adoc` or `.txt` counterpart there.

The second commit teaches `script/update-docs.rb` to avoid accumulating this kind of stale content in the future: it now tracks every `(docname, lang)` pair seen in the upstream source during a run (recorded above the unchanged-content skip so the cleanup correctly distinguishes "source vanished" from "source unchanged since last run"), then deletes any translation file whose `(docname, lang)` is not in the seen set. The cleanup is gated on `seen_translations` being non-empty so the "nothing has changed since the last run" early exit cannot delete anything, and files whose front matter begins with `redirect_to:` are preserved so the script's per-language redirect-stub mechanism is not undone.

The change was tested end-to-end by integrating this branch into my fork's gh-pages and re-running the deploy workflow: the build that previously failed at https://github.com/dscho/git-scm.com/actions/runs/25937227599 with the error "lang in front matter was deprecated in Hugo v0.144.0 and subsequently removed." now passes cleanly at https://github.com/dscho/git-scm.com/actions/runs/25958311574 (Hugo, Pagefind, lychee, Playwright).

dscho added 2 commits May 15, 2026 21:22
These 36 translated manual pages were generated by
`script/update-docs.rb` from `jnavila/git-html-l10n` at some point in
the past, but the corresponding `<lang>/<docname>.adoc` source files
no longer exist in that upstream repository. Verified locally by
cloning `jnavila/git-html-l10n` (HEAD as of this commit) and
checking, for each removed file, that neither
`<lang>/<docname>.adoc` nor `<lang>/<docname>.txt` is present.
Without an upstream source, no `update-docs.rb` invocation, even
with `RERUN=true` set, will refresh these files: the script iterates
the upstream source list and only writes files it finds, so an
output left behind from a since-deleted source is never revisited.

These translations have therefore drifted out of sync with the
English manual pages they once translated. Two of them,
`git-parse-remote/{pt_BR,ru}.html`, even translate a Git command
whose own English manual page has since been removed from `git.git`
(the `Documentation/git-parse-remote.adoc` source 404s on the
`git/git` HEAD).

The trigger for cleaning them up now, rather than later, is that the
upcoming Hugo upgrade refuses to build the site while these files
are present. They use the pre-v0.144.0 front-matter shape
`lang: <code>`, where `lang` is a reserved Hugo key whose handling
Hugo deprecated in v0.144.0 and now escalates to a build-aborting
error 15 minor releases later
(`common/hugo/hugo.go:deprecationLogLevelFromVersion`). The
`update-translated-manual-pages.yml` workflow with
`force-rebuild: true` already migrated the still-maintained
translations to the new `params.lang:` nesting (commit
`Update translated manual pages` on the parent branch); only these
36 orphans remained on the old shape because no upstream source
was found to re-emit them.
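
To make the two front-matter shapes and the version arithmetic concrete,
here is a minimal sketch. It assumes the pages carry YAML front matter
delimited by `---` lines; the helper names `legacy_lang_front_matter?`
and `minor_gap` are hypothetical and exist neither in this repository
nor in Hugo. Only the 15-release threshold is taken from the text above.

```ruby
require "yaml"

# Hypothetical helper: true when a page's front matter still has the old
# top-level `lang:` key instead of nesting it under `params:`.
def legacy_lang_front_matter?(text)
  # Assumption: front matter is a leading YAML block between `---` lines.
  match = text.match(/\A---\s*\n(.*?)\n---\s*$/m)
  return false unless match

  data = YAML.safe_load(match[1]) || {}
  data.is_a?(Hash) && data.key?("lang")
end

# Hypothetical helper: distance in minor releases between two Hugo
# versions, to be compared against the 15-release threshold.
def minor_gap(deprecated_in, current)
  current.split(".")[1].to_i - deprecated_in.split(".")[1].to_i
end

old_shape = "---\ntitle: git-add\nlang: fr\n---\n<p>...</p>\n"
new_shape = "---\ntitle: git-add\nparams:\n  lang: fr\n---\n<p>...</p>\n"

puts legacy_lang_front_matter?(old_shape) # old shape: top-level lang
puts legacy_lang_front_matter?(new_shape) # new shape: params.lang
puts minor_gap("0.144.0", "0.155.3")      # 11, still below the threshold
puts minor_gap("0.144.0", "0.161.1")      # 17, past the threshold
```

This matches what the PR observes: the site still builds with 0.155.3
but not with 0.161.1.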

A follow-up commit teaches `script/update-docs.rb` to detect this
case and remove orphaned outputs as part of every L10N regeneration
run, so this kind of stale content cannot accumulate again.

Assisted-by: Opus 4.7
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Before this change, `script/update-docs.rb`'s L10N path only ever
*added* or *overwrote* output under
`external/docs/content/docs/<docname>/<lang>.html`; it had no
mechanism to *remove* anything. As a result, when the upstream
`jnavila/git-html-l10n` repository deleted a translated source file
(typically because the corresponding English manual page itself had
been removed or replaced upstream), the previously-generated HTML
output was orphaned in this repository forever. The previous commit
swept up 36 such orphans by hand.

The fix is to track every `(docname, lang)` pair the script *saw*
in the upstream source during a run, then walk the on-disk output
tree once at the end and delete any translation file whose
`(docname, lang)` is not in that set.

The set is populated above the `next if !rerun && lang_data[lang]
== asciidoc_sha` short-circuit in the inner loop, so files whose
source is unchanged still count as "seen" and are *not* deleted.
This is the critical safety condition: an orphan must be
distinguishable from a file the script merely chose not to
re-render.
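
The ordering can be sketched as follows. This is an illustration, not
the actual code in `script/update-docs.rb`: the method
`collect_and_render` and the hash-based inputs are hypothetical
stand-ins, while `seen_translations`, `lang_data`, `asciidoc_sha` and
`rerun` follow the names used above.

```ruby
require "set"

# Hypothetical stand-ins: `upstream` maps docname => { lang => sha of the
# current source }; `lang_data_by_doc` holds the shas recorded last run.
def collect_and_render(upstream, lang_data_by_doc, rerun: false)
  seen_translations = Set.new
  rendered = []

  upstream.each do |docname, sources|
    lang_data = lang_data_by_doc.fetch(docname, {})
    sources.each do |lang, asciidoc_sha|
      # Record the pair BEFORE the unchanged-content skip ...
      seen_translations << [docname, lang]
      # ... so this short-circuit cannot make the file look orphaned.
      next if !rerun && lang_data[lang] == asciidoc_sha

      rendered << [docname, lang] # the real script would write HTML here
    end
  end

  [seen_translations, rendered]
end

upstream = { "git-add" => { "fr" => "abc", "pt_BR" => "def" } }
previous = { "git-add" => { "fr" => "abc" } } # fr unchanged since last run

seen, rendered = collect_and_render(upstream, previous)
p seen.include?(["git-add", "fr"]) # unchanged, but still counted as seen
p rendered                         # only pt_BR is re-rendered
```

Had the `<<` line been placed below the `next`, `fr` would be missing
from the set and would be deleted by the cleanup even though its source
still exists.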

Two further safeguards keep the cleanup conservative. First, the
whole pass is skipped when `seen_translations.empty?`, so the "nothing
has changed since the last run" early exit (the `next if !rerun &&
l10n["committed"] >= ts` at the top of the tag loop, which leaves
`seen_translations` empty) cannot trigger an erroneous deletion.
Second, files whose front matter begins with `redirect_to:` are
skipped: the `check_paths` loop in the same function writes these
redirect stubs whenever one translation links to another that does
not (yet) exist, and many of them sit in directories where the
"target" docname has never had a translation in this language.
Such stubs keep historic URLs of the form
`/docs/<docname>/<lang>` from 404'ing and must not be confused
with orphaned translation outputs. Verified empirically by running
`RERUN=true bundle exec ruby script/update-docs.rb
/path/to/git-html-l10n l10n` against a tree seeded with a
deliberately fake orphan: only the fake was removed; all 768
pre-existing redirect stubs were preserved.
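
A minimal sketch of such a cleanup pass, run against a throwaway
directory. The helpers `remove_orphans` and `redirect_stub?` are
hypothetical and not the code in `script/update-docs.rb`; the sketch
also assumes that a redirect stub has `redirect_to:` as the first key
after the opening `---` line.

```ruby
require "set"
require "tmpdir"
require "fileutils"

LANG_CODE = /\A[a-z]{2}(?:_[A-Z]+(?:-[A-Z]+)?)?\z/

# Assumption: a redirect stub starts with `---` followed by `redirect_to:`.
def redirect_stub?(path)
  lines = File.foreach(path).first(2)
  lines[0].to_s.strip == "---" && lines[1].to_s.start_with?("redirect_to:")
end

def remove_orphans(content_dir, seen_translations)
  # Safeguard 1: an empty set means the run exited early, not that
  # every upstream source vanished.
  return [] if seen_translations.empty?

  removed = []
  Dir.glob(File.join(content_dir, "*", "*.html")).sort.each do |path|
    docname = File.basename(File.dirname(path))
    lang = File.basename(path, ".html")
    next unless lang.match?(LANG_CODE)        # not a translation file
    next if seen_translations.include?([docname, lang])
    next if redirect_stub?(path)              # Safeguard 2
    File.delete(path)
    removed << path
  end
  removed
end

result = Dir.mktmpdir do |dir|
  doc = File.join(dir, "git-add")
  FileUtils.mkdir_p(doc)
  File.write(File.join(doc, "fr.html"), "---\nparams:\n  lang: fr\n---\n")
  File.write(File.join(doc, "ru.html"), "---\nparams:\n  lang: ru\n---\n")
  File.write(File.join(doc, "de.html"), "---\nredirect_to: git-add\n---\n")
  File.write(File.join(doc, "2.30.0.html"), "---\ntitle: git-add\n---\n")

  early_exit = remove_orphans(dir, Set.new)
  removed = remove_orphans(dir, Set[["git-add", "fr"]])
  { early_exit: early_exit,
    removed: removed.map { |f| File.basename(f) },
    remaining: Dir.children(doc).sort }
end
p result
```

In this toy tree only `ru.html` is removed: `fr.html` is in the seen
set, `de.html` is a redirect stub, and `2.30.0.html` is not a
translation file name.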

The lang-code regex `\A[a-z]{2}(?:_[A-Z]+(?:-[A-Z]+)?)?\z` matches
translation file names like `fr.html`, `pt_BR.html`, and
`zh_HANS-CN.html`, and deliberately rejects the versioned English
shape like `2.30.0.html` that lives in the same directories and is
written by the unrelated `index_doc` (English) code path.
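
The behaviour of the regex can be checked in isolation. In this sketch
the `.html` extension is stripped before matching, which is an
assumption about how the script applies the pattern; the helper
`translation_file?` is hypothetical.

```ruby
LANG_CODE = /\A[a-z]{2}(?:_[A-Z]+(?:-[A-Z]+)?)?\z/

# Hypothetical helper: match the file name without its `.html` suffix.
def translation_file?(filename)
  File.basename(filename, ".html").match?(LANG_CODE)
end

%w[fr.html pt_BR.html zh_HANS-CN.html 2.30.0.html].each do |name|
  puts "#{name}: #{translation_file?(name)}"
end
```

The first three names are accepted; `2.30.0.html` is rejected because
the pattern requires two lowercase letters at the start.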

Future runs of `update-translated-manual-pages.yml` will therefore
keep the output tree in sync with upstream, both adding new
translations and removing those whose source has gone away.

Assisted-by: Opus 4.7
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
@dscho dscho self-assigned this May 16, 2026