updated notes

2018-09-04 12:51:59 +01:00
parent 7d919039b6
commit 6abe7d68e0

@@ -1,9 +1,10 @@
## Thoughts
###### For each URL, do the following (a Python sketch of this loop follows the list):
* mark it as crawled
* get page content
* if that fails, mark the link as invalid
* find all links in the content
* check each link for dupes
* add to pool or discard
* ~~strip hashes and everything following (as they're in-page anchors)~~
* strip query-string args (handled in the normalisation sketch after this list)
* ~~use `pop()` on the set instead of `.remove()`~~
* ~~return false once the set is empty~~
* ~~`WebPage.parse_urls()` needs a `startswith()` comparison against the base URL~~
* ignore any links which aren't to pages (e.g. mailto: or asset links)
* better URL checking to extract the bare domain
* remove the base URL from the initial URLs, both with and without a trailing slash
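
A minimal sketch of the per-URL loop above, assuming `requests` for fetching. The `pool`/`crawled`/`invalid` sets and the `find_links()` helper are illustrative assumptions rather than the project's actual API (in the real code `WebPage.parse_urls()` presumably does the link extraction); `normalise()` is sketched in the next block.

```python
import re

import requests


def find_links(html):
    """Crude stand-in for link extraction: pull href values out of the page."""
    return re.findall(r'href="([^"]+)"', html)


def crawl(pool, crawled, invalid, base_url):
    """Drain the pool of pending URLs, following the steps in the notes."""
    while pool:                        # stop once the set is empty
        url = pool.pop()               # pop() rather than remove(), as noted above
        crawled.add(url)               # mark it as crawled
        try:                           # get page content
            response = requests.get(url, timeout=10)
            response.raise_for_status()
        except requests.RequestException:
            invalid.add(url)           # if that fails, mark the link as invalid
            continue
        for link in find_links(response.text):                      # find all links in the content
            link = normalise(link, base_url)                        # strip fragment/args, same-domain check
            if link and link not in crawled and link not in pool:   # check each link for dupes
                pool.add(link)                                      # add to pool, otherwise discard
```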
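
And a sketch of the URL clean-up rules from the remaining items (strip the `#` fragment, strip query args, ignore non-page schemes, check the bare domain rather than a raw prefix, collapse trailing-slash variants), using only the standard library; again an assumption about how the cleaning could look, not the existing implementation.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def normalise(link, base_url):
    """Return a cleaned absolute URL on the same domain as base_url, or None to discard it."""
    absolute = urljoin(base_url, link)               # resolve relative links (against the base URL here, for simplicity)
    parts = urlsplit(absolute)
    if parts.scheme not in ("http", "https"):        # ignore links which aren't to pages (mailto:, javascript:, ...)
        return None
    if parts.netloc != urlsplit(base_url).netloc:    # bare-domain check instead of a startswith prefix match
        return None
    # strip the #fragment (in-page anchor) and the ?query args
    cleaned = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return cleaned.rstrip("/")                       # trailing-slash variants collapse to one form
```

With this, the initial pool could be seeded as `pool = {normalise(base_url, base_url)}`, so the with- and without-trailing-slash forms of the base URL dedupe to a single entry.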