updated notes

2018-09-04 12:51:59 +01:00
parent 7d919039b6
commit 6abe7d68e0

@@ -1,9 +1,10 @@
 ## Thoughts
-###### for each URL, do the following:
-* mark it as crawled
-* get page content
-* if that fails, mark the link as invalid
-* find all links in the content
-* check each link for dupes
-* add to pool or discard
+* ~~strip hashes and everything following (as they're in-page anchors)~~
+* strip args
+* ~~use `pop()` on the set instead of `.remove()`~~
+* ~~return false once the set is empty~~
+* ~~`WebPage.parse_urls()` needs to compare startswith to base url~~
+* ignore any links which aren't to pages
+* better url checking to get bare domain
+* remove base url from initial urls with and without trailing slash
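
The old and new notes together describe a crawl loop plus a set of URL-cleanup rules. The following is a minimal sketch of how they might fit together, assuming `requests` and `BeautifulSoup` for fetching and link extraction; the `normalise` and `crawl` names and the `(crawled, invalid)` return shape are illustrative assumptions, not the repo's actual `WebPage` API.

```python
from urllib.parse import urldefrag, urljoin

import requests
from bs4 import BeautifulSoup


def normalise(url: str, base: str) -> str | None:
    """Clean a discovered link per the notes: drop the fragment
    (in-page anchor), drop query args, and discard anything that
    doesn't start with the base URL. Illustrative helper, not the
    repo's WebPage.parse_urls()."""
    url, _fragment = urldefrag(url)       # strip hashes and everything following
    url = url.split("?", 1)[0]            # strip args
    if not url.startswith(base):          # compare startswith to base url
        return None
    return url.rstrip("/")                # trailing-slash variants collapse to one entry


def crawl(base_url: str) -> tuple[set[str], set[str]]:
    """Pop URLs from a pool, fetch each page, queue same-site links,
    and record links that fail to load."""
    base = base_url.rstrip("/")           # handle base url with or without trailing slash
    pool = {base}
    crawled: set[str] = set()
    invalid: set[str] = set()

    while pool:                           # stop once the set is empty
        url = pool.pop()                  # pop() from the set instead of .remove()
        crawled.add(url)                  # mark it as crawled

        try:
            response = requests.get(url, timeout=10)   # get page content
            response.raise_for_status()
        except requests.RequestException:
            invalid.add(url)              # if that fails, mark the link as invalid
            continue

        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):   # find all links in the content
            link = normalise(urljoin(url, anchor["href"]), base)
            if link and link not in crawled and link not in pool:   # check each link for dupes
                pool.add(link)            # add to pool, otherwise discard

    return crawled, invalid
```

Keeping the pool, crawled, and invalid collections as sets gives the dedupe check for free and matches the note about using `pop()` rather than tracking an index; filtering non-page links (images, mailto, etc.) would slot naturally into `normalise` but is left out of this sketch.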