Add to bibliography update a check of every URL if broken vs. accessible. by MattHeffron · Pull Request #337 · Interlisp/Interlisp.github.io

MattHeffron · 2026-06-19T05:53:04Z

Initial attempt. It produces too many false positives, reporting links as broken when they are not.
I tried using the Perl LWP library. It was faster than calling curl, but it gave even more false positives.
It may be having trouble with redirects.

stumbo · 2026-06-19T12:07:21Z

Sounds like the next step is to separate the return values and, based on those, decide whether it's a paywall or a broken link. We'll also need to be careful that we handle 429s correctly and back off on the number of requests we're making if needed.
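To make the idea of separating return values concrete, here is a minimal sketch in Python (the PR itself uses Perl and curl). The category names, the `classify_status` and `backoff_delay` helpers, and the choice of which codes count as "restricted" are all assumptions for illustration, not anything taken from bibSplit.pl. Only the delta-seconds form of `Retry-After` is handled; the HTTP-date form falls back to exponential backoff.

```python
from typing import Optional


def classify_status(status: int) -> str:
    """Map an HTTP status code to a coarse link verdict (hypothetical categories)."""
    if 200 <= status < 400:
        return "ok"
    if status in (401, 402, 403):
        # Paywall, login wall, or bot challenge: the link may still work in a browser.
        return "restricted"
    if status == 429:
        return "rate_limited"
    if status in (404, 410):
        return "broken"
    if 500 <= status < 600:
        # Server-side failure, often transient.
        return "server_error"
    return "unknown"


def backoff_delay(attempt: int, retry_after: Optional[str] = None,
                  base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retrying after a 429.

    Uses Retry-After when it is a plain number of seconds; otherwise doubles
    the delay on each attempt (attempt starts at 0), never exceeding cap.
    """
    if retry_after is not None and retry_after.strip().isdigit():
        return min(float(retry_after.strip()), cap)
    return min(base * (2 ** attempt), cap)
```

With this split, only "broken" would need to be reported as a dead link; "restricted" and "rate_limited" could be logged as warnings instead.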

Might want to add a flag to allow skipping the check - I imagine it takes a bit.
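A skip flag could look roughly like the following Python sketch. The flag name `--skip-url-check` and both helper functions are hypothetical. In the Perl script, the equivalent would presumably be done with an option parser such as Getopt::Long.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options for a bibliography update run (sketch)."""
    parser = argparse.ArgumentParser(description="Update the bibliography (sketch).")
    parser.add_argument(
        "--skip-url-check",
        action="store_true",
        help="do not test each bibliography URL (the check can be slow)",
    )
    return parser.parse_args(argv)


def should_check_urls(args) -> bool:
    """URL checking is on by default and off only when the flag is given."""
    return not args.skip_url_check
```

Defaulting to "check" keeps the behavior this PR adds, while letting a quick local run opt out.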

MattHeffron · 2026-06-20T19:23:26Z

Here's the bibSplit.err and the .hdr files generated by the bibSplit.pl I just committed.
bibSplit-info.zip

Since the errors are mostly 403s, I had hoped that adding --referer pointing at the bibliography might help. Nope.
Also, there are 3 SSL certificate errors, 1 DNS failure, and 1 failed to connect to server (possibly just a transient).
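The failure kinds listed above (HTTP errors, SSL certificate errors, DNS failure, failed connection) correspond to distinct curl exit codes, so they can be told apart without parsing error text. A minimal Python sketch follows. The category names and the decision about which categories are worth retrying are assumptions. The exit codes themselves (6, 7, 22, 28, 35, 60) are the ones documented in the curl man page.

```python
# Documented curl exit codes mapped to hypothetical category names.
CURL_EXIT_CATEGORIES = {
    0: "ok",
    6: "dns_failure",       # could not resolve host
    7: "connect_failure",   # failed to connect to host
    22: "http_error",       # HTTP status >= 400 when --fail is used
    28: "timeout",          # operation timed out
    35: "tls_error",        # SSL/TLS handshake problem
    60: "tls_error",        # peer certificate could not be verified
}

# Categories that may succeed on a later attempt (an assumption, not a rule).
TRANSIENT = {"connect_failure", "timeout"}


def categorize_curl_exit(code: int) -> str:
    """Return a coarse category for a curl exit code, or "other" if unmapped."""
    return CURL_EXIT_CATEGORIES.get(code, "other")


def worth_retrying(code: int) -> bool:
    """True when the failure looks transient enough to try again later."""
    return categorize_curl_exit(code) in TRANSIENT
```

For example, the "failed to connect to server" case would be exit code 7, which this sketch would retry before reporting.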

stumbo · 2026-06-26T11:08:49Z

I'm coming to the conclusion that URL testing is doomed to fail, or needs significant revisions to be valuable. I've been playing around with curl and comparing it to directly accessing the website. I was using this site as a test case:
http://jgs.lyellcollection.org/lookup/doi/10.1144/gsjgs.142.5.0925

It works if I access it via a browser, which reaches the final website via two redirects.

Testing the starting URL with curl fails:

curl --output /dev/null --silent --show-error --head --fail \
  --dump-header "test.hdr" --location \
  --referer "https://interlisp.org/history/bibliography;auto" \
  "http://jgs.lyellcollection.org/lookup/doi/10.1144/gsjgs.142.5.0925"

curl: (22) The requested URL returned error: 403

Examining the returned headers shows the problem: curl fails a challenge from Cloudflare. Basically, they are trying to prevent bots from accessing their site, and curl falls into the bot category.

I asked Claude to interpret the resulting headers and its assessment was:
The 403 is not a true "access denied" — it's a Cloudflare bot challenge that requires a JavaScript-capable browser to solve. No amount of tweaking curl flags (user agent, referer, etc.) will fix this, because the challenge fundamentally requires browser execution. When you open the URL in a browser, Chrome/Firefox solves the challenge automatically and you get through.

Given that everything I tried with curl failed, I'm pretty comfortable that its assessment is correct.

It also suggested we could adjust our handling of 403s based on the server: if the response has server: cloudflare, we could treat the 403 as a warning and ignore the return. The only problem with that approach is that we would need to build a library of every server that issues a challenge. That seems brittle.
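For reference, the server-based handling described above might be sketched as follows in Python. It reads a header dump like the one written by `--dump-header`, keeps only the final response (since `--location` records one header block per redirect hop), and flags a 403 as a likely bot challenge. The function names are hypothetical, and the `cf-mitigated: challenge` check is an assumption about how Cloudflare marks challenge responses that should be verified against real .hdr files. The brittleness concern still applies: this recognizes one vendor only.

```python
def last_response_headers(dump: str):
    """Return (status, headers) for the final response in a curl header dump.

    Header names are lower-cased; status is None if the status line is malformed.
    """
    blocks = [b for b in dump.replace("\r\n", "\n").split("\n\n") if b.strip()]
    if not blocks:
        return None, {}
    lines = blocks[-1].strip().split("\n")
    parts = lines[0].split()
    status = int(parts[1]) if len(parts) >= 2 and parts[1].isdigit() else None
    headers = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return status, headers


def looks_like_bot_challenge(status, headers) -> bool:
    """Heuristic: a 403 served by Cloudflare is treated as a challenge, not a dead link."""
    if status != 403:
        return False
    if headers.get("cf-mitigated", "").lower() == "challenge":
        return True
    return headers.get("server", "").lower() == "cloudflare"
```

A link flagged this way would be reported as a warning rather than as broken.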

Initial attempt. Too many "false positive" indicating broken when not.

abb2809

Added output of curl errors and headers from bibSplit.pl, for determining why they're getting errors from curl, but (most) not when entered in a browser.

ff0fcd4

MattHeffron self-assigned this Jun 20, 2026

MattHeffron added the enhancement New feature or request label Jun 20, 2026

MattHeffron added this to bibliography Jun 20, 2026

MattHeffron wants to merge 2 commits into main from mth74--Add_check_if_URL_is_broken
