| 1 | +# Link Checker Documentation |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +This repository includes an automated link checker (`check_links.py`) that verifies that every HTTP/HTTPS link in its Markdown files is reachable. |
| 6 | + |
| 7 | +## Quick Start |
| 8 | + |
| 9 | +Run the link checker: |
| 10 | + |
| 11 | +```bash |
| 12 | +python3 check_links.py |
| 13 | +``` |
| 14 | + |
| 15 | +## What It Does |
| 16 | + |
| 17 | +The script: |
| 18 | +1. Scans all `.md` files in the repository |
| 19 | +2. Extracts all HTTP/HTTPS URLs (both markdown links and plain URLs) |
| 20 | +3. Checks each link with an HTTP HEAD request, falling back to GET when the server does not support HEAD |
| 21 | +4. Categorizes each result (OK, Redirect, 404 Not Found, Timeout, SSL Error, Connection Error, or other error) |
| 22 | +5. Generates a detailed report |
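The extraction step (step 2) can be sketched roughly as follows. This is a minimal illustration, not the code in `check_links.py`: the `URL_PATTERN` regex and the `extract_urls` helper are hypothetical names, and the pattern deliberately stops at closing brackets and parentheses, so URLs that legitimately contain parentheses would be truncated.

```python
import re

# Hypothetical pattern: an http(s) URL runs until whitespace or a closing delimiter.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_urls(markdown_text):
    """Return the unique HTTP/HTTPS URLs in the text, in first-seen order."""
    seen = []
    for match in URL_PATTERN.finditer(markdown_text):
        # Drop sentence punctuation that directly follows a plain URL.
        url = match.group(0).rstrip(".,;:!?")
        if url not in seen:
            seen.append(url)
    return seen


sample = "See [docs](https://example.com/docs) and https://example.org/page."
print(extract_urls(sample))
# ['https://example.com/docs', 'https://example.org/page']
```

One pattern handles both markdown links and plain URLs here because the closing `)` of a markdown link ends the match.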
| 23 | + |
| 24 | +## Output |
| 25 | + |
| 26 | +- **Console output**: Summary and detailed list of broken links |
| 27 | +- **`link_check_results.json`**: Complete results in JSON format (gitignored) |
| 28 | +- **`LINK_CHECK_REPORT.md`**: Human-readable report of findings |
| 29 | + |
| 30 | +## Configuration |
| 31 | + |
| 32 | +You can modify these settings in `check_links.py`: |
| 33 | + |
| 34 | +- `TIMEOUT`: Request timeout in seconds (default: 10) |
| 35 | +- `MAX_WORKERS`: Number of parallel requests (default: 10) |
| 36 | +- `SKIP_PATTERNS`: URL patterns to skip checking |
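As an illustration of how these settings might look, the sketch below uses the documented defaults for `TIMEOUT` and `MAX_WORKERS`. The shape of `SKIP_PATTERNS` (a list of regular expressions), its example entries, and the `should_skip` helper are assumptions; check `check_links.py` for the actual format.

```python
import re

TIMEOUT = 10      # request timeout in seconds (documented default)
MAX_WORKERS = 10  # number of parallel requests (documented default)

# Assumed shape: a list of regular expressions. These entries are examples only.
SKIP_PATTERNS = [
    r"^https?://localhost(:\d+)?(/|$)",
    r"^https?://127\.0\.0\.1(:\d+)?(/|$)",
    r"^https?://example\.(com|org)(/|$)",
]


def should_skip(url, patterns=SKIP_PATTERNS):
    """Return True when the URL matches any skip pattern (hypothetical helper)."""
    return any(re.search(pattern, url) for pattern in patterns)
```

With these example patterns, `should_skip("http://localhost:8000/api")` is true and `should_skip("https://github.com")` is false.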
| 37 | + |
| 38 | +## Exit Codes |
| 39 | + |
| 40 | +- `0`: All links are functional |
| 41 | +- `1`: One or more broken links found |
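The exit code makes the checker easy to call from other tooling. A minimal sketch, assuming the script is run from the repository root; `links_ok` is a hypothetical wrapper, not part of `check_links.py`:

```python
import subprocess


def links_ok(command=("python3", "check_links.py")):
    """Run the checker and return True when it exits with code 0."""
    completed = subprocess.run(list(command))
    return completed.returncode == 0
```

A pre-commit hook or CI step could call `links_ok()` and abort when it returns `False`.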
| 42 | + |
| 43 | +## Interpreting Results |
| 44 | + |
| 45 | +### Link Statuses |
| 46 | + |
| 47 | +- **OK (200)**: Link is working correctly |
| 48 | +- **Not Found (404)**: Link is broken and should be fixed or removed |
| 49 | +- **Connection Error**: Could not connect (may be due to network restrictions) |
| 50 | +- **Timeout**: Request took too long |
| 51 | +- **Redirect**: Link redirects to another URL (informational) |
| 52 | + |
| 53 | +### Sandboxed Environments |
| 54 | + |
| 55 | +When running in sandboxed/restricted environments, many legitimate links may show as "Connection Error" due to network restrictions. These are NOT necessarily broken links. The script distinguishes between: |
| 56 | + |
| 57 | +- **404 errors**: Definitely broken (server responded but resource not found) |
| 58 | +- **Connection errors**: Cannot verify (network/DNS issues) |
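When reviewing results from a restricted environment, it can help to separate the two groups explicitly. The sketch below assumes each result is a dict with `url` and `status` keys and that the status strings are `"not_found"` and `"connection_error"`; the real structure of `link_check_results.json` may differ, so treat these names as placeholders.

```python
def triage(results):
    """Split results into definitely broken and unverifiable URLs.

    Assumes a hypothetical result shape: {"url": str, "status": str}.
    """
    broken = [r["url"] for r in results if r["status"] == "not_found"]
    unverified = [r["url"] for r in results if r["status"] == "connection_error"]
    return broken, unverified


example_results = [
    {"url": "https://a.test/ok", "status": "ok"},
    {"url": "https://a.test/gone", "status": "not_found"},
    {"url": "https://b.test/", "status": "connection_error"},
]
broken, unverified = triage(example_results)
print(broken)      # ['https://a.test/gone']
print(unverified)  # ['https://b.test/']
```

Only the first list needs fixing right away; the second is worth rechecking from a machine with unrestricted network access.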
| 59 | + |
| 60 | +## Maintenance |
| 61 | + |
| 62 | +Run the link checker periodically to catch: |
| 63 | +- Dead links as external resources move or are deleted |
| 64 | +- Typos in newly added links |
| 65 | +- Outdated documentation URLs |
| 66 | + |
| 67 | +## Example Output |
| 68 | + |
| 69 | +``` |
| 70 | +================================================================================ |
| 71 | +LINK CHECK SUMMARY |
| 72 | +================================================================================ |
| 73 | + |
| 74 | +Total links checked: 112 |
| 75 | + ✓ OK: 74 |
| 76 | + ⚠ Redirects: 0 |
| 77 | + ✗ Not Found (404): 0 |
| 78 | + ✗ Errors: 2 |
| 79 | + ⏱ Timeouts: 0 |
| 80 | + 🔒 SSL Errors: 0 |
| 81 | + 🔌 Connection Errors: 32 |
| 82 | +``` |
| 83 | + |
| 84 | +## Contributing |
| 85 | + |
| 86 | +When adding new links to markdown files: |
| 87 | +1. Add your links |
| 88 | +2. Run `python3 check_links.py` to verify they work |
| 89 | +3. Fix any broken links before committing |
| 90 | + |
| 91 | +## Technical Details |
| 92 | + |
| 93 | +The script uses: |
| 94 | +- **Python 3**: Built-in `re` module for link extraction |
| 95 | +- **requests**: HTTP library for checking links |
| 96 | +- **ThreadPoolExecutor**: Parallel link checking for speed |
| 97 | +- **JSON**: Structured output format |
| 98 | + |
| 99 | +Links are checked using HTTP HEAD requests first (faster), falling back to GET if HEAD is not supported by the server. |
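The HEAD-then-GET strategy and the parallel execution can be sketched as below. This is an illustration under stated assumptions, not the script's implementation: `check_link`, `check_all`, and the `send(method, url)` callable are hypothetical, and treating 405 and 501 as "HEAD not supported" is an assumption. A stand-in transport replaces real HTTP calls so the sketch runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumption: these status codes are taken to mean "HEAD not supported".
HEAD_UNSUPPORTED = {405, 501}


def check_link(url, send):
    """Try HEAD first; retry with GET when the server rejects HEAD.

    `send(method, url)` is a hypothetical callable returning an HTTP status code.
    """
    status = send("HEAD", url)
    if status in HEAD_UNSUPPORTED:
        status = send("GET", url)
    return url, status


def check_all(urls, send, max_workers=10):
    """Check URLs in parallel; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda u: check_link(u, send), urls))


def fake_send(method, url):
    """Stand-in transport used only for this example."""
    if url.endswith("/no-head"):
        return 405 if method == "HEAD" else 200
    return 404 if url.endswith("/missing") else 200


urls = ["https://a.test/ok", "https://a.test/no-head", "https://a.test/missing"]
print(check_all(urls, fake_send))
# [('https://a.test/ok', 200), ('https://a.test/no-head', 200), ('https://a.test/missing', 404)]
```

With the `requests` library, `send` could be a thin wrapper such as `lambda method, url: requests.request(method, url, timeout=10).status_code`, with exception handling added for timeouts and connection errors.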