## Thoughts
###### for each URL, do the following (a sketch of this loop appears in code after the list):
* mark it as crawled
* get page content
* if that fails, mark the link as invalid
* find all links in the content
* check each link for dupes
* add to pool or discard
* ~~strip hashes and everything following (as they're in-page anchors)~~
* strip args
* ~~use `pop()` on the set instead of `.remove()`~~
* ~~return false once the set is empty~~
* ~~`WebPage.parse_urls()` needs to compare startswith to base url~~
* ignore any links which aren't to pages
* better url checking to get bare domain (url cleanup is sketched after the list)
* remove the base url from the initial urls, both with and without a trailing slash
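
A minimal sketch of the loop above, assuming `requests` for fetching; the names `crawl_step` and `LinkExtractor` are illustrative, not the actual `WebPage` code these notes refer to:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

import requests


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl_step(pool, crawled, invalid):
    """Process one URL from the pool; return False once the pool is empty."""
    if not pool:
        return False
    url = pool.pop()   # pop() both picks and removes, so no separate .remove()
    crawled.add(url)   # mark it as crawled
    try:
        response = requests.get(url, timeout=10)   # get page content
        response.raise_for_status()
    except requests.RequestException:
        invalid.add(url)   # if that fails, mark the link as invalid
        return True
    parser = LinkExtractor()
    parser.feed(response.text)   # find all links in the content
    for href in parser.links:
        link = urljoin(url, href)   # resolve relative hrefs against the page URL
        # check each link for dupes, then add to pool or discard
        if link not in crawled and link not in invalid:
            pool.add(link)
    return True
```

Driving it is then just `while crawl_step(pool, crawled, invalid): pass`; since `pop()` consumes the pool, an empty set is the natural stopping condition.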
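
The url-cleanup items lend themselves to small helpers. A sketch using `urllib.parse` from the stdlib; the helper names and the asset-extension list are assumptions (the notes don't show the real checks), and "bare domain" is read here as scheme plus hostname. These would slot into the link-handling step of the loop above:

```python
from urllib.parse import urldefrag, urlparse, urlunparse


def normalize_url(url):
    """Strip the fragment (in-page anchor) and the query args from a URL."""
    url, _fragment = urldefrag(url)   # strip hashes and everything following
    parts = urlparse(url)
    return urlunparse(parts._replace(query=""))   # strip args


def bare_domain(url):
    """Reduce a URL to its bare domain, e.g. https://example.com."""
    parts = urlparse(url)
    return f"{parts.scheme}://{parts.netloc}"


def is_page_on_site(url, base_url):
    """Keep only same-site links that look like pages rather than assets."""
    if not url.startswith(base_url):   # compare startswith to base url
        return False
    path = urlparse(url).path.lower()
    # ignore any links which aren't to pages (extension check is an assumption)
    return not path.endswith((".jpg", ".png", ".gif", ".css", ".js", ".pdf"))


def seed_pool(initial_urls, base_url):
    """Build the starting pool, dropping the base url with and without a trailing slash."""
    base = base_url.rstrip("/")
    skip = {base, base + "/"}
    return {normalize_url(u) for u in initial_urls if u not in skip}
```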