From 6abe7d68e03b2b68847a4a78132b58e452f2a9b4 Mon Sep 17 00:00:00 2001
From: Simon Weald
Date: Tue, 4 Sep 2018 12:51:59 +0100
Subject: [PATCH] updated notes

---
 notes.md | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/notes.md b/notes.md
index 56c7794..737221e 100644
--- a/notes.md
+++ b/notes.md
@@ -1,9 +1,10 @@
 ## Thoughts

-###### for each URL, do the following:
- * mark it as crawled
- * get page content
- * if that fails, mark the link as invalid
- * find all links in the content
- * check each link for dupes
- * add to pool or discard
\ No newline at end of file
+ * ~~strip hashes and everything following (as they're in-page anchors)~~
+ * strip args
+ * ~~use `pop()` on the set instead of `.remove()`~~
+ * ~~return false once the set is empty~~
+ * ~~`WebPage.parse_urls()` needs to compare startswith to base url~~
+ * ignore any links which aren't to pages
+ * better url checking to get bare domain
+ * remove base url from initial urls with and without trailing slash
\ No newline at end of file
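
The URL-handling steps ticked off in the notes (strip the in-page anchor, strip query args, compare against the base URL with `startswith`) could be sketched as follows. This is a minimal illustration using the standard library's `urllib.parse`; the helper names `clean_url` and `is_internal` are hypothetical and not taken from the crawler itself.

```python
from urllib.parse import urlsplit, urlunsplit


def clean_url(url: str) -> str:
    """Strip the fragment (in-page anchor) and the query args from a URL."""
    parts = urlsplit(url)
    # Rebuild the URL with empty query and fragment components.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def is_internal(url: str, base_url: str) -> bool:
    """Keep only links that fall under the crawl's base URL."""
    # Normalise the base URL so it matches with and without a trailing slash.
    return clean_url(url).startswith(base_url.rstrip("/"))
```

Stripping the fragment before deduplication matters because `page` and `page#section` are the same document; dropping the query string is a coarser assumption that only holds when args don't select distinct pages.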