Commit Graph

79 Commits

Author SHA1 Message Date
simon 75d3756bbc fix errors discovered by pycyodestyle 2018-09-16 16:04:07 +01:00
simon 5262c23281 add flags to README 2018-09-16 15:58:17 +01:00
simon 524f6a45cd improve documentation 2018-09-16 15:53:47 +01:00
simon a926090bed update requirements 2018-09-16 15:44:30 +01:00
simon 91cd988f52 more comments and progress output 2018-09-16 15:26:49 +01:00
simon f1855f5add re-order imports because I'm fussy 2018-09-16 09:06:30 +01:00
simon 336517e84a more documentation and add back some required imports 2018-09-16 09:00:43 +01:00
simon 7bc9fe0679 improved documentation and remove unneeded set 2018-09-16 08:56:44 +01:00
simon 6548f55416 improve documentation 2018-09-15 21:48:50 +01:00
simon 0244435fea remove unecessary imports 2018-09-15 21:38:51 +01:00
simon d6964672b6 commit of working async crawler 2018-09-15 21:30:02 +01:00
simon 3808f72f73 correct semaphore usage 2018-09-14 16:06:17 +01:00
simon 7ebe4855b8 remove unecessary classes2 2018-09-14 16:02:20 +01:00
simon db986b0eba async crawler in a mostly-working state 2018-09-14 16:01:12 +01:00
simon 36e1f7693f initial foray into asynchronous crawling 2018-09-12 22:54:12 +01:00
simon 8698c21fda return from WebPage to indicate whether a link was actually crawlable and only actually crawl it if it was 2018-09-12 08:03:08 +01:00
simon 273cf56a3b add some basic tests 2018-09-11 13:42:15 +01:00
simon 1af26f50f2 added a docstring 2018-09-11 13:42:02 +01:00
simon c40c5cea50 add async info 2018-09-10 21:29:46 +01:00
simon a6224f9b6a updated readme 2018-09-10 20:56:12 +01:00
simon b64711973f add new thoughts 2018-09-10 11:58:58 +01:00
simon 9e125dfae0 added comments and docstrings 2018-09-09 22:49:55 +01:00
simon f16f82fdfb improved completion message 2018-09-09 22:40:42 +01:00
simon a523154848 display count of crawled/uncrawled URLs whilst running 2018-09-09 22:35:55 +01:00
simon 9e754a5584 improve handling of gzip/deflated data detection 2018-09-09 11:21:46 +01:00
simon 1b005570ee implement gzip compression requests and handling 2018-09-09 10:53:09 +01:00
simon 17fa9f93f9 tick off gzip encoding 2018-09-09 10:52:37 +01:00
simon 1e51e10db2 update with changes 2018-09-09 10:22:18 +01:00
simon 225fd8b3ea update with changes 2018-09-09 10:22:03 +01:00
simon d686ae0bc4 update with changes 2018-09-09 10:21:45 +01:00
simon 69f5788745 update notes 2018-09-09 10:16:22 +01:00
simon b5d644a223 various minor improvements to exception handling 2018-09-09 10:16:03 +01:00
simon 6508156aa4 use lxml as the parser and only find links on a page if we've got the source 2018-09-09 10:06:25 +01:00
simon 738ab8e441 adjust robots handling to deal with 404s and enforce a user agent which allows us to initially obtain the user agent 2018-09-09 09:57:16 +01:00
simon fdd84a8786 manually retrieve robots.txt to ensure we can set the user-agent 2018-09-07 12:40:12 +01:00
simon ab0ab0a010 add more thoughts 2018-09-07 11:50:53 +01:00
simon 6a1259aa7d update plans to add gzip encoding 2018-09-06 17:33:10 +01:00
simon 164239b343 more thoughts 2018-09-06 17:31:12 +01:00
simon ce1f2745c9 update thoughts 2018-09-06 17:30:28 +01:00
simon e70bdc9ca1 update requirements.txt 2018-09-06 17:25:30 +01:00
simon d1c1e17f4f report runtime of script in generated sitemap 2018-09-06 17:20:59 +01:00
simon 816a727d79 ignore generated file 2018-09-06 17:08:56 +01:00
simon 84ab27a75e render results as HTML 2018-09-06 17:08:26 +01:00
simon 6d9103c154 improved content-type detection 2018-09-06 17:08:12 +01:00
simon e57a86c60a only attempt to read html 2018-09-06 16:30:11 +01:00
simon a3ec9451e3 implement parsing of robots.txt 2018-09-05 18:56:20 +01:00
simon f2c294ebdb added new ideas to implement 2018-09-04 15:40:11 +01:00
simon 1b9b207a28 attempt to remove base url with trailing slash (if discovered) 2018-09-04 13:57:52 +01:00
simon 6abe7d68e0 updated notes 2018-09-04 12:51:59 +01:00
simon 7d919039b6 removed unecessary modules 2018-09-04 10:14:27 +01:00