Thoughts

for each URL, do the following (rough sketch after the list):
  • mark it as crawled
  • get page content
    • if that fails, mark the link as invalid
  • find all links in the content
    • check each link for dupes
    • add to pool or discard
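
A minimal Python sketch of that loop, using only the standard library (urllib for fetching, HTMLParser for link extraction). Names like crawl, pool, and LinkParser are placeholders for illustration, not code from this repo.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.error import URLError
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, limit=100):
    pool = deque([start_url])   # URLs waiting to be fetched
    crawled = set()             # URLs already visited
    invalid = set()             # URLs whose fetch failed

    while pool and len(crawled) < limit:
        url = pool.popleft()
        crawled.add(url)        # mark it as crawled

        try:
            with urlopen(url) as response:              # get page content
                content = response.read().decode("utf-8", errors="replace")
        except (URLError, ValueError):
            invalid.add(url)    # if that fails, mark the link as invalid
            continue

        parser = LinkParser()   # find all links in the content
        parser.feed(content)
        for href in parser.links:
            link = urljoin(url, href)
            # check each link for dupes; add to pool or discard
            if link not in crawled and link not in pool:
                pool.append(link)

    return crawled, invalid
```

Using a deque as the pool makes the crawl breadth-first; swapping it for a stack would make it depth-first.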