updated notes
notes.md
@@ -1,9 +1,10 @@
 ## Thoughts
 
-###### for each URL, do the following:
-* mark it as crawled
-* get page content
-* if that fails, mark the link as invalid
-* find all links in the content
-* check each link for dupes
-* add to pool or discard
+* ~~strip hashes and everything following (as they're in-page anchors)~~
+* strip args
+* ~~use `pop()` on the set instead of `.remove()`~~
+* ~~return false once the set is empty~~
+* ~~`WebPage.parse_urls()` needs to compare startswith to base url~~
+* ignore any links which aren't to pages
+* better url checking to get bare domain
+* remove base url from initial urls with and without trailing slash
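Several of the new TODO items are about URL clean-up: strip the hash fragment (an in-page anchor), strip query args, get the bare domain, and treat the base URL the same with or without a trailing slash. Below is a minimal sketch of what those checks could look like using `urllib.parse`; the names `normalise_url`, `bare_domain`, and `is_same_site` are illustrative and not taken from the repo.

```python
from urllib.parse import urlsplit, urlunsplit


def normalise_url(url: str) -> str:
    """Drop the fragment (in-page anchor), the query args, and any trailing slash."""
    parts = urlsplit(url)
    bare = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return bare.rstrip("/")


def bare_domain(url: str) -> str:
    """Bare domain only: the host, with no scheme, path, or port."""
    return urlsplit(url).hostname or ""


def is_same_site(url: str, base_url: str) -> bool:
    """The startswith-against-base-url comparison noted for parse_urls()."""
    return normalise_url(url).startswith(normalise_url(base_url))
```

Stripping the trailing slash during normalisation also covers the last item in the list: the base URL dedupes against itself whether it was seeded with or without the slash.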
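The removed side of the diff lists the per-URL crawl steps, and the struck-through items (use `pop()` instead of `.remove()`, stop once the set is empty) describe the loop around them. The following is a hedged sketch of that loop under some assumptions: it uses the `requests` library for fetching, reuses the helpers from the sketch above, and substitutes a crude regex `parse_urls()` for the project's `WebPage.parse_urls()`, whose real implementation isn't shown in the notes.

```python
import re

import requests


def parse_urls(html: str) -> list[str]:
    """Crude stand-in for WebPage.parse_urls(): absolute hrefs only."""
    return re.findall(r'href="(https?://[^"]+)"', html)


def crawl(base_url: str) -> tuple[set[str], set[str]]:
    base_url = normalise_url(base_url)
    pending = {base_url}            # pool of URLs still to visit
    crawled: set[str] = set()
    invalid: set[str] = set()

    while pending:                  # loop ends once the set is empty
        url = pending.pop()         # pop() instead of .remove()
        crawled.add(url)            # mark it as crawled
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()              # get page content
        except requests.RequestException:
            invalid.add(url)                         # if that fails, mark the link as invalid
            continue

        for link in parse_urls(response.text):       # find all links in the content
            link = normalise_url(link)               # strip hashes and args
            if link in crawled or link in pending or link in invalid:
                continue                             # check each link for dupes
            if not is_same_site(link, base_url):
                continue                             # ignore links outside the base URL
            pending.add(link)                        # add to pool; anything else is discarded

    return crawled, invalid
```

`pop()` picks and removes an arbitrary URL from the set in one step, which is presumably why the note prefers it over looking up an element and calling `.remove()` separately.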