Add to bibliography update a check of every URL if broken vs. accessible.#337

Draft
MattHeffron wants to merge 2 commits into
mainfrom
mth74--Add_check_if_URL_is_broken

@MattHeffron

Member

Initial attempt. It produces too many false positives: URLs reported as broken when they are actually accessible.
I also tried the Perl LWP library. It was faster than calling curl, but it gave even more false positives.
It may be having trouble with redirects.

@stumbo

stumbo commented Jun 19, 2026

Member

Sounds like the next step is to separate out the return values and, based on them, decide whether a failure is a paywall or a genuinely broken link. We'll also need to be careful that we handle 429s correctly and back off on the number of requests we're making if needed.
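A rough sketch of what that separation could look like, in POSIX shell so it can sit next to the curl call. The helper names `classify_status` and `backoff_delay`, the category labels, and the 60-second cap are all assumptions for illustration; none of this is taken from bibSplit.pl.

```shell
#!/bin/sh
# Sketch only: classify_status and backoff_delay are hypothetical helpers,
# and the category labels are assumptions, not what bibSplit.pl does today.

# Map an HTTP status code to a coarse category.
classify_status() {
  case "$1" in
    2??|3??)     echo "ok" ;;            # reachable (3xx only shows up if redirects are not followed)
    401|402|403) echo "restricted" ;;    # possibly a paywall, login wall, or bot challenge
    404|410)     echo "broken" ;;        # the resource is gone
    429)         echo "throttled" ;;     # we are sending requests too fast
    5??)         echo "server-error" ;;  # may be transient; worth retrying later
    *)           echo "unknown" ;;
  esac
}

# Exponential backoff: seconds to wait before retry attempt N (1-based),
# doubling from 1 second and capped at 60 seconds.
backoff_delay() {
  attempt=$1
  delay=1
  i=1
  while [ "$i" -lt "$attempt" ]; do
    delay=$((delay * 2))
    if [ "$delay" -ge 60 ]; then
      delay=60
      break
    fi
    i=$((i + 1))
  done
  echo "$delay"
}

# Possible use (not executed here):
#   code=$(curl --silent --output /dev/null --location --write-out '%{http_code}' "$url")
#   [ "$(classify_status "$code")" = "throttled" ] && sleep "$(backoff_delay "$attempt")"
```

Only a "throttled" result would trigger a wait; "restricted" and "broken" stay distinct so they can be reported differently. A server that sends a Retry-After header could be honored in place of the computed delay.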

Might want to add a flag to allow skipping the check - I imagine it takes a bit.

…ning why they're getting errors from curl, but (most) not when entered in a browser.
@MattHeffron

Member Author

Here are the bibSplit.err and .hdr files generated by the bibSplit.pl I just committed.
bibSplit-info.zip

Since most of the errors are 403s, I had hoped that passing the bibliography URL via --referer might help. It didn't.
There are also 3 SSL certificate errors, 1 DNS failure, and 1 failure to connect to the server (possibly just transient).
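These non-HTTP failures can be told apart by curl's exit code rather than by an HTTP status. A minimal sketch, assuming a hypothetical helper named `classify_curl_exit` and made-up category labels; the exit codes themselves are the ones curl documents.

```shell
#!/bin/sh
# Sketch only: classify_curl_exit is a hypothetical helper; the labels are
# assumptions. The numeric codes are curl's documented exit codes.

classify_curl_exit() {
  case "$1" in
    0)     echo "ok" ;;
    6)     echo "dns-failure" ;;      # could not resolve host
    7)     echo "connect-failure" ;;  # failed to connect; possibly transient
    22)    echo "http-error" ;;       # --fail saw an HTTP status of 400 or above
    28)    echo "timeout" ;;          # operation timed out
    35|60) echo "tls-error" ;;        # TLS handshake or certificate verification problem
    *)     echo "other" ;;
  esac
}

# Possible use (not executed here):
#   curl --silent --head --fail --location --output /dev/null "$url"
#   classify_curl_exit "$?"
```

Treating "connect-failure" and "timeout" as retry candidates, and "dns-failure" as a likely dead link, would keep the transient cases out of the broken list.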

@MattHeffron MattHeffron self-assigned this Jun 20, 2026
@MattHeffron MattHeffron added the enhancement New feature or request label Jun 20, 2026
@stumbo

stumbo commented Jun 26, 2026

Member

I'm coming to the conclusion that URL testing is doomed to fail, or needs significant revisions to be valuable. I've been playing around with curl and comparing it to directly accessing the website. I was using this site as a test case:
http://jgs.lyellcollection.org/lookup/doi/10.1144/gsjgs.142.5.0925

It works if I access it via a browser. It gets to the final website via two redirects:

  1. https://www.lyellcollection.org/doi/10.1144/gsjgs.142.5.0925
  2. https://www.lyellcollection.org/doi/abs/10.1144/gsjgs.142.5.0925

Testing the starting URL with curl fails:

curl --output /dev/null --silent --show-error --head --fail \
  --dump-header "test.hdr" --location \
  --referer "https://interlisp.org/history/bibliography;auto" \
  "http://jgs.lyellcollection.org/lookup/doi/10.1144/gsjgs.142.5.0925"

curl: (22) The requested URL returned error: 403

Examining the returned headers shows the problem: curl fails a challenge from Cloudflare. Basically, Cloudflare is trying to prevent bots from accessing the site, and curl falls into the bot category.

I asked Claude to interpret the resulting headers and its assessment was:
The 403 is not a true "access denied" — it's a Cloudflare bot challenge that requires a JavaScript-capable browser to solve. No amount of tweaking curl flags (user agent, referer, etc.) will fix this, because the challenge fundamentally requires browser execution. When you open the URL in a browser, Chrome/Firefox solves the challenge automatically and you get through.

Given that everything I tried with curl failed, I'm pretty comfortable that its assessment is correct.

It also suggested we could adjust our handling of 403s based on the server: if the response includes server: cloudflare, we could treat the 403 as a warning and ignore the return. The only problem with that approach is that we would need to build a library of every server that issues a challenge. That seems brittle.
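For reference, the server-based check could be sketched against the header file that --dump-header already writes. The helper names `is_cloudflare_response` and `verdict_for_403` and the "warning"/"error" labels are hypothetical, and the sketch does nothing about the brittleness concern: it only recognizes the one server string mentioned above.

```shell
#!/bin/sh
# Sketch only: both helpers are hypothetical. With --location the dumped file
# holds the headers of every hop, so this matches if ANY response in the
# file carries "server: cloudflare".

# is_cloudflare_response HEADER_FILE
is_cloudflare_response() {
  # HTTP header lines end in CR LF and header names are case-insensitive,
  # so strip the CRs and match without regard to case.
  tr -d '\r' < "$1" | grep -i -q '^server:[[:space:]]*cloudflare[[:space:]]*$'
}

# verdict_for_403 HEADER_FILE
# Downgrade a 403 to a warning when the response came from Cloudflare.
verdict_for_403() {
  if is_cloudflare_response "$1"; then
    echo "warning"
  else
    echo "error"
  fi
}

# Possible use (not executed here), after the curl call that wrote test.hdr:
#   verdict_for_403 "test.hdr"
```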
