updated notes

2018-09-04 12:51:59 +01:00
parent 7d919039b6
commit 6abe7d68e0

@@ -1,9 +1,10 @@
## Thoughts
###### For each URL, do the following (a Python sketch of this loop follows the list):
* mark it as crawled
* get page content
* if that fails, mark the link as invalid
* find all links in the content
* check each link for dupes
* add to pool or discard
* ~~strip hashes and everything following (as they're in-page anchors)~~
* strip query-string args (handled in the normalisation sketch after this list)
* ~~use `pop()` on the set instead of `.remove()`~~
* ~~return false once the set is empty~~
* ~~`WebPage.parse_urls()` needs a `startswith()` comparison against the base URL~~
* ignore any links which aren't to pages (e.g. mailto: or asset links)
* better URL checking to extract the bare domain
* remove the base URL from the initial URLs, both with and without a trailing slash
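
A minimal sketch of the per-URL loop above, assuming `requests` for fetching. The `pool`/`crawled`/`invalid` sets and the `find_links()` helper are illustrative assumptions rather than the project's actual API (in the real code `WebPage.parse_urls()` presumably does the link extraction); `normalise()` is sketched in the next block.

```python
import re

import requests


def find_links(html):
    """Crude stand-in for link extraction: pull href values out of the page."""
    return re.findall(r'href="([^"]+)"', html)


def crawl(pool, crawled, invalid, base_url):
    """Drain the pool of pending URLs, following the steps in the notes."""
    while pool:                        # stop once the set is empty
        url = pool.pop()               # pop() rather than remove(), as noted above
        crawled.add(url)               # mark it as crawled
        try:                           # get page content
            response = requests.get(url, timeout=10)
            response.raise_for_status()
        except requests.RequestException:
            invalid.add(url)           # if that fails, mark the link as invalid
            continue
        for link in find_links(response.text):                      # find all links in the content
            link = normalise(link, base_url)                        # strip fragment/args, same-domain check
            if link and link not in crawled and link not in pool:   # check each link for dupes
                pool.add(link)                                      # add to pool, otherwise discard
```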
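
And a sketch of the URL clean-up rules from the remaining items (strip the `#` fragment, strip query args, ignore non-page schemes, check the bare domain rather than a raw prefix, collapse trailing-slash variants), using only the standard library; again an assumption about how the cleaning could look, not the existing implementation.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def normalise(link, base_url):
    """Return a cleaned absolute URL on the same domain as base_url, or None to discard it."""
    absolute = urljoin(base_url, link)               # resolve relative links (against the base URL here, for simplicity)
    parts = urlsplit(absolute)
    if parts.scheme not in ("http", "https"):        # ignore links which aren't to pages (mailto:, javascript:, ...)
        return None
    if parts.netloc != urlsplit(base_url).netloc:    # bare-domain check instead of a startswith prefix match
        return None
    # strip the #fragment (in-page anchor) and the ?query args
    cleaned = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return cleaned.rstrip("/")                       # trailing-slash variants collapse to one form
```

With this, the initial pool could be seeded as `pool = {normalise(base_url, base_url)}`, so the with- and without-trailing-slash forms of the base URL dedupe to a single entry.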