docs: remove orphaned translated manual pages #2165
Draft
dscho wants to merge 2 commits into
These 36 translated manual pages were generated by
`script/update-docs.rb` from `jnavila/git-html-l10n` at some point in
the past, but the corresponding `<lang>/<docname>.adoc` source files
no longer exist in that upstream repository. Verified locally by
cloning `jnavila/git-html-l10n` (HEAD as of this commit) and
checking, for each removed file, that neither
`<lang>/<docname>.adoc` nor `<lang>/<docname>.txt` is present.
Without an upstream source, no `update-docs.rb` invocation, even
with `RERUN=true` set, will refresh these files: the script iterates
the upstream source list and only writes files it finds, so an
output left behind from a since-deleted source is never revisited.
These translations have therefore drifted out of sync with the
English manual pages they once translated. Two of them,
`git-parse-remote/{pt_BR,ru}.html`, even translate a Git command
whose own English manual page has since been removed from `git.git`
(the `Documentation/git-parse-remote.adoc` source 404s on the
`git/git` HEAD).
The trigger for cleaning them up now, rather than later, is that the
upcoming Hugo upgrade refuses to build the site while these files
are present. They use the pre-v0.144.0 front-matter shape
`lang: <code>`, where `lang` is a reserved Hugo key. Hugo deprecated
its use in front matter in v0.144.0 and, 15 minor releases later,
escalates the deprecation to a build-aborting error
(`common/hugo/hugo.go:deprecationLogLevelFromVersion`). The
`update-translated-manual-pages.yml` workflow with
`force-rebuild: true` already migrated the still-maintained
translations to the new `params.lang:` nesting (commit
`Update translated manual pages` on the parent branch); only these
36 orphans remained on the old shape because no upstream source
was found to re-emit them.
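To make the two front-matter shapes concrete, here is a rough sketch of how a page still on the old shape could be detected. The helper `legacy_lang_front_matter?` is hypothetical, and the sketch assumes the front matter is YAML delimited by `---` lines at the top of the file; the actual generated files may differ in detail.

```ruby
require "yaml"

# Hypothetical check (not part of script/update-docs.rb): true when the
# page's front matter still carries a top-level `lang:` key instead of
# nesting it under `params:`.
def legacy_lang_front_matter?(page_text)
  # Assumption: front matter is the YAML between the first two `---` lines.
  match = page_text.match(/\A---\s*\n(.*?)\n---\s*\n/m)
  return false unless match

  data = YAML.safe_load(match[1]) || {}
  data.is_a?(Hash) && data.key?("lang")
end
```

A page starting with `lang: pt_BR` in its front matter is reported as legacy, while one that nests the value as `params:` / `lang: pt_BR` is not.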
A follow-up commit teaches `script/update-docs.rb` to detect this
case and remove orphaned outputs as part of every L10N regeneration
run, so this kind of stale content cannot accumulate again.
Assisted-by: Opus 4.7
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Before this change, `script/update-docs.rb`'s L10N path only ever
*added* or *overwrote* output under
`external/docs/content/docs/<docname>/<lang>.html`; it had no
mechanism to *remove* anything. As a result, when the upstream
`jnavila/git-html-l10n` repository deleted a translated source file
(typically because the corresponding English manual page itself had
been removed or replaced upstream), the previously-generated HTML
output was orphaned in this repository forever. The previous commit
swept up 36 such orphans by hand.
The fix is to track every `(docname, lang)` pair the script *saw*
in the upstream source during a run, then walk the on-disk output
tree once at the end and delete any translation file whose
`(docname, lang)` is not in that set.
The set is populated above the `next if !rerun && lang_data[lang]
== asciidoc_sha` short-circuit in the inner loop, so files whose
source is unchanged still count as "seen" and are *not* deleted.
This is the critical safety condition: an orphan must be
distinguishable from a file the script merely chose not to
re-render.
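The ordering constraint can be illustrated with a simplified stand-in for the inner loop. Everything here except the names `seen_translations`, `rerun`, `lang_data` and `asciidoc_sha` is hypothetical: the real loop in `script/update-docs.rb` has a different shape, and `render_pass` exists only for this sketch.

```ruby
require "set"

# Hypothetical stand-in for the inner loop. `sources` maps
# [docname, lang] to the SHA of the upstream file; `lang_data` maps the
# same key to the SHA recorded when the page was last rendered.
def render_pass(sources, lang_data, rerun: false)
  seen_translations = Set.new
  rendered = []
  sources.each do |(docname, lang), asciidoc_sha|
    # Record the pair BEFORE the short-circuit, so an unchanged source
    # still counts as seen and is never mistaken for an orphan.
    seen_translations << [docname, lang]
    next if !rerun && lang_data[[docname, lang]] == asciidoc_sha

    rendered << [docname, lang] # placeholder for the actual rendering
  end
  [seen_translations, rendered]
end
```

If the `seen_translations <<` line were moved below the `next`, every unchanged translation would be missing from the set and would look exactly like an orphan to the cleanup pass.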
Two further safeguards keep the cleanup conservative. First, the
whole pass is skipped when `seen_translations.empty?`, so the "nothing
has changed since the last run" early exit (the `next if !rerun &&
l10n["committed"] >= ts` at the top of the tag loop, which leaves
`seen_translations` empty) cannot trigger an erroneous deletion.
Second, files whose front matter begins with `redirect_to:` are
skipped: the `check_paths` loop in the same function writes these
redirect stubs whenever one translation links to another that does
not (yet) exist, and many of them sit in directories where the
"target" docname has never had a translation in this language.
Such stubs keep historic URLs of the form
`/docs/<docname>/<lang>` from 404'ing and must not be confused
with orphaned translation outputs. Verified empirically by running
`RERUN=true bundle exec ruby script/update-docs.rb
/path/to/git-html-l10n l10n` against a tree seeded with a
deliberately-fake orphan: only the fake was removed; all 768
pre-existing redirect stubs were preserved.
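A minimal sketch of a cleanup pass with both safeguards follows. It is not the code from this commit: `remove_orphans` and `content_dir` are hypothetical names, the directory layout is assumed to be `<content_dir>/<docname>/<lang>.html`, and the redirect-stub test assumes the front matter opens with a `---` line immediately followed by `redirect_to:`.

```ruby
require "set"

# Hypothetical cleanup pass. `content_dir` stands in for
# external/docs/content/docs; `seen_translations` is a Set of
# [docname, lang] pairs collected during the run.
def remove_orphans(content_dir, seen_translations)
  # Safeguard 1: an empty set means no source was scanned, so delete nothing.
  return [] if seen_translations.empty?

  removed = []
  Dir.glob(File.join(content_dir, "*", "*.html")).sort.each do |path|
    lang = File.basename(path, ".html")
    # Only consider translation files, not versioned English pages.
    next unless lang.match?(/\A[a-z]{2}(?:_[A-Z]+(?:-[A-Z]+)?)?\z/)

    docname = File.basename(File.dirname(path))
    next if seen_translations.include?([docname, lang])
    # Safeguard 2: keep redirect stubs.
    next if File.read(path).match?(/\A---\s*\nredirect_to:/)

    File.delete(path)
    removed << path
  end
  removed
end
```

Returning the list of removed paths makes a dry comparison against an expected set straightforward, which mirrors the "seeded fake orphan" test described above.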
The lang-code regex `\A[a-z]{2}(?:_[A-Z]+(?:-[A-Z]+)?)?\z` matches
the stem of translation file names like `fr.html`, `pt_BR.html`, and
`zh_HANS-CN.html`, and deliberately rejects the versioned English
shape like `2.30.0.html` that lives in the same directories and is
written by the unrelated `index_doc` (English) code path.
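The behavior of the regex is easy to check in isolation. In this sketch the constant `LANG_CODE` and the helper `translation_file?` are hypothetical names, and stripping the `.html` suffix with `File.basename` is an assumption about how the pattern is applied.

```ruby
# The pattern quoted in the commit message, applied to the file stem.
LANG_CODE = /\A[a-z]{2}(?:_[A-Z]+(?:-[A-Z]+)?)?\z/

# Hypothetical helper: true for translation outputs, false for
# versioned English pages that share the same directory.
def translation_file?(path)
  File.basename(path, ".html").match?(LANG_CODE)
end
```

With this helper, `fr.html`, `pt_BR.html` and `zh_HANS-CN.html` are accepted, while `2.30.0.html` is rejected because its stem does not start with two lowercase letters.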
Future runs of `update-translated-manual-pages.yml` will therefore
keep the output tree in sync with upstream, both adding new
translations and removing those whose source has gone away.
Assisted-by: Opus 4.7
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
This PR is the prerequisite for the follow-up `update-dependencies` PR (TBD), which bumps Hugo from 0.155.3 to 0.161.1 (alongside Pagefind, npm devDependencies, and a few related template and config migrations). Hugo refuses to build the site at the new version while 36 translated manual pages still sit at the pre-v0.144.0 front-matter shape `lang: <code>`. These files cannot be regenerated into the new shape by the existing `update-translated-manual-pages.yml` workflow because their upstream `<lang>/<docname>.adoc` sources have been removed from `jnavila/git-html-l10n`; the workflow only writes files it finds in the upstream source list, so outputs left behind from since-deleted sources are never revisited. I verified this locally by cloning `jnavila/git-html-l10n` at HEAD and confirming that none of the 36 removed files have a `<lang>/<docname>.adoc` or `.txt` counterpart there.

The second commit teaches `script/update-docs.rb` to avoid accumulating this kind of stale content in the future: it now tracks every `(docname, lang)` pair seen in upstream source during a run (recorded above the unchanged-content skip so the cleanup correctly distinguishes "source vanished" from "source unchanged since last run"), then deletes any translation file whose `(docname, lang)` is not in the seen set. The cleanup is gated on `seen_translations` being non-empty so the "nothing has changed since the last run" early exit cannot delete anything, and files whose front matter begins with `redirect_to:` are preserved so the script's per-language redirect-stub mechanism is not undone.

The change was end-to-end tested by integrating this branch into my fork's gh-pages and re-running the deploy workflow: the build that previously failed at https://github.com/dscho/git-scm.com/actions/runs/25937227599 with `lang in front matter was deprecated in Hugo v0.144.0 and subsequently removed.` now passes cleanly at https://github.com/dscho/git-scm.com/actions/runs/25958311574 (Hugo, Pagefind, lychee, Playwright).