| Commit | Message | Date |
|---|---|---|
| 9a4e9ddfc7 | add test for missing robots.txt | 2018-09-18 10:53:13 +01:00 |
| 51f988e1bc | added more tests | 2018-09-17 21:44:20 +01:00 |
| 73c21e5bd3 | small improvements to docs and variables | 2018-09-17 21:44:04 +01:00 |
| eb2395d461 | minor change to README | 2018-09-17 08:11:26 +01:00 |
| c53f62b55d | add most changes suggested by pycodestyle | 2018-09-16 16:10:38 +01:00 |
| 75d3756bbc | fix errors discovered by pycodestyle | 2018-09-16 16:04:07 +01:00 |
| 5262c23281 | add flags to README | 2018-09-16 15:58:17 +01:00 |
| 524f6a45cd | improve documentation | 2018-09-16 15:53:47 +01:00 |
| a926090bed | update requirements | 2018-09-16 15:44:30 +01:00 |
| 91cd988f52 | more comments and progress output | 2018-09-16 15:26:49 +01:00 |
| f1855f5add | re-order imports because I'm fussy | 2018-09-16 09:06:30 +01:00 |
| 336517e84a | more documentation and add back some required imports | 2018-09-16 09:00:43 +01:00 |
| 7bc9fe0679 | improved documentation and remove unneeded set | 2018-09-16 08:56:44 +01:00 |
| 6548f55416 | improve documentation | 2018-09-15 21:48:50 +01:00 |
| 0244435fea | remove unnecessary imports | 2018-09-15 21:38:51 +01:00 |
| d6964672b6 | commit of working async crawler | 2018-09-15 21:30:02 +01:00 |
| 3808f72f73 | correct semaphore usage | 2018-09-14 16:06:17 +01:00 |
| 7ebe4855b8 | remove unnecessary classes2 | 2018-09-14 16:02:20 +01:00 |
| db986b0eba | async crawler in a mostly-working state | 2018-09-14 16:01:12 +01:00 |
| 36e1f7693f | initial foray into asynchronous crawling | 2018-09-12 22:54:12 +01:00 |
| 8698c21fda | return from WebPage to indicate whether a link was actually crawlable and only actually crawl it if it was | 2018-09-12 08:03:08 +01:00 |
| 273cf56a3b | add some basic tests | 2018-09-11 13:42:15 +01:00 |
| 1af26f50f2 | added a docstring | 2018-09-11 13:42:02 +01:00 |
| c40c5cea50 | add async info | 2018-09-10 21:29:46 +01:00 |
| a6224f9b6a | updated readme | 2018-09-10 20:56:12 +01:00 |
| b64711973f | add new thoughts | 2018-09-10 11:58:58 +01:00 |
| 9e125dfae0 | added comments and docstrings | 2018-09-09 22:49:55 +01:00 |
| f16f82fdfb | improved completion message | 2018-09-09 22:40:42 +01:00 |
| a523154848 | display count of crawled/uncrawled URLs whilst running | 2018-09-09 22:35:55 +01:00 |
| 9e754a5584 | improve handling of gzip/deflated data detection | 2018-09-09 11:21:46 +01:00 |
| 1b005570ee | implement gzip compression requests and handling | 2018-09-09 10:53:09 +01:00 |
| 17fa9f93f9 | tick off gzip encoding | 2018-09-09 10:52:37 +01:00 |
| 1e51e10db2 | update with changes | 2018-09-09 10:22:18 +01:00 |
| 225fd8b3ea | update with changes | 2018-09-09 10:22:03 +01:00 |
| d686ae0bc4 | update with changes | 2018-09-09 10:21:45 +01:00 |
| 69f5788745 | update notes | 2018-09-09 10:16:22 +01:00 |
| b5d644a223 | various minor improvements to exception handling | 2018-09-09 10:16:03 +01:00 |
| 6508156aa4 | use lxml as the parser and only find links on a page if we've got the source | 2018-09-09 10:06:25 +01:00 |
| 738ab8e441 | adjust robots handling to deal with 404s and enforce a user agent which allows us to initially obtain the user agent | 2018-09-09 09:57:16 +01:00 |
| fdd84a8786 | manually retrieve robots.txt to ensure we can set the user-agent | 2018-09-07 12:40:12 +01:00 |
| ab0ab0a010 | add more thoughts | 2018-09-07 11:50:53 +01:00 |
| 6a1259aa7d | update plans to add gzip encoding | 2018-09-06 17:33:10 +01:00 |
| 164239b343 | more thoughts | 2018-09-06 17:31:12 +01:00 |
| ce1f2745c9 | update thoughts | 2018-09-06 17:30:28 +01:00 |
| e70bdc9ca1 | update requirements.txt | 2018-09-06 17:25:30 +01:00 |
| d1c1e17f4f | report runtime of script in generated sitemap | 2018-09-06 17:20:59 +01:00 |
| 816a727d79 | ignore generated file | 2018-09-06 17:08:56 +01:00 |
| 84ab27a75e | render results as HTML | 2018-09-06 17:08:26 +01:00 |
| 6d9103c154 | improved content-type detection | 2018-09-06 17:08:12 +01:00 |
| e57a86c60a | only attempt to read html | 2018-09-06 16:30:11 +01:00 |