• Joined on 2018-08-15
simon pushed to master at misc/web-scraper 2018-09-09 09:21:47 +00:00
d686ae0bc4 update with changes
simon pushed to master at misc/web-scraper 2018-09-09 09:16:23 +00:00
69f5788745 update notes
b5d644a223 various minor improvements to exception handling
simon pushed to master at misc/web-scraper 2018-09-09 09:06:26 +00:00
6508156aa4 use lxml as the parser and only find links on a page if we've got the source
simon pushed to master at misc/web-scraper 2018-09-09 08:57:22 +00:00
738ab8e441 adjust robots handling to deal with 404s and enforce a user agent which allows us to initially obtain the user agent
simon pushed to master at misc/web-scraper 2018-09-07 11:40:14 +00:00
fdd84a8786 manually retrieve robots.txt to ensure we can set the user-agent
simon pushed to master at misc/web-scraper 2018-09-07 10:50:55 +00:00
ab0ab0a010 add more thoughts
simon pushed to master at misc/web-scraper 2018-09-06 16:33:11 +00:00
6a1259aa7d update plans to add gzip encoding
simon pushed to master at misc/web-scraper 2018-09-06 16:31:14 +00:00
164239b343 more thoughts
ce1f2745c9 update thoughts
simon pushed to master at misc/web-scraper 2018-09-06 16:25:32 +00:00
e70bdc9ca1 update requirements.txt
simon pushed to master at misc/web-scraper 2018-09-06 16:21:01 +00:00
d1c1e17f4f report runtime of script in generated sitemap
simon pushed to master at misc/web-scraper 2018-09-06 16:08:58 +00:00
816a727d79 ignore generated file
simon pushed to master at misc/web-scraper 2018-09-06 16:08:27 +00:00
84ab27a75e render results as HTML
6d9103c154 improved content-type detection
simon pushed to master at misc/web-scraper 2018-09-06 15:30:16 +00:00
e57a86c60a only attempt to read html
simon pushed to master at misc/web-scraper 2018-09-05 17:56:22 +00:00
a3ec9451e3 implement parsing of robots.txt
simon pushed to master at misc/web-scraper 2018-09-04 14:40:13 +00:00
f2c294ebdb added new ideas to implement
simon pushed to master at misc/web-scraper 2018-09-04 12:58:08 +00:00
1b9b207a28 attempt to remove base url with trailing slash (if discovered)
simon pushed to master at misc/web-scraper 2018-09-04 11:52:00 +00:00
6abe7d68e0 updated notes
simon pushed to master at misc/web-scraper 2018-09-04 09:14:28 +00:00
7d919039b6 removed unnecessary modules
simon pushed to master at misc/web-scraper 2018-09-04 08:21:56 +00:00
0726bcccb0 removed original file
05e907ecec too many changes to make a sensible commit message
simon pushed to master at misc/web-scraper 2018-08-31 18:18:02 +00:00
abc628106d added a docstring to the WebPage object