From 6abe7d68e03b2b68847a4a78132b58e452f2a9b4 Mon Sep 17 00:00:00 2001
From: Simon Weald
Date: Tue, 4 Sep 2018 12:51:59 +0100
Subject: [PATCH] updated notes

---
 notes.md | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/notes.md b/notes.md
index 56c7794..737221e 100644
--- a/notes.md
+++ b/notes.md
@@ -1,9 +1,10 @@
 ## Thoughts

-###### for each URL, do the following:
- * mark it as crawled
- * get page content
- * if that fails, mark the link as invalid
- * find all links in the content
- * check each link for dupes
- * add to pool or discard
\ No newline at end of file
+ * ~~strip hashes and everything following (as they're in-page anchors)~~
+ * strip args
+ * ~~use `pop()` on the set instead of `.remove()`~~
+ * ~~return false once the set is empty~~
+ * ~~`WebPage.parse_urls()` needs to compare startswith to base url~~
+ * ignore any links which aren't to pages
+ * better url checking to get bare domain
+ * remove base url from initial urls with and without trailing slash
\ No newline at end of file
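
The URL-handling steps ticked off in the notes (strip the in-page anchor, strip query args, compare against the base URL with `startswith`) could be sketched as follows. This is a minimal illustration using the standard library's `urllib.parse`; the helper names `clean_url` and `is_internal` are hypothetical and not taken from the crawler itself.

```python
from urllib.parse import urlsplit, urlunsplit


def clean_url(url: str) -> str:
    """Strip the fragment (in-page anchor) and the query args from a URL."""
    parts = urlsplit(url)
    # Rebuild the URL with empty query and fragment components.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def is_internal(url: str, base_url: str) -> bool:
    """Keep only links that fall under the crawl's base URL."""
    # Normalise the base URL so it matches with and without a trailing slash.
    return clean_url(url).startswith(base_url.rstrip("/"))
```

Stripping the fragment before deduplication matters because `page` and `page#section` are the same document; dropping the query string is a coarser assumption that only holds when args don't select distinct pages.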