5f7d66912f
add test files
asyncio
simon
2018-09-19 08:39:05 +01:00
d4cd93e3d4
update docs
simon
2018-09-19 08:38:49 +01:00
f5f6afd1a4
correct tests with new arg names
simon
2018-09-19 08:37:55 +01:00
679b1b7b53
rename all instances of base_url to rooturl, add more documentation
simon
2018-09-18 18:24:15 +01:00
32d7f1e54b
add talking points
simon
2018-09-18 18:23:12 +01:00
f6265f18a7
initial test for AsyncCrawler
simon
2018-09-18 18:22:55 +01:00
9a4e9ddfc7
add test for missing robots.txt
simon
2018-09-18 10:53:13 +01:00
51f988e1bc
added more tests
simon
2018-09-17 21:44:20 +01:00
73c21e5bd3
small improvements to docs and variables
simon
2018-09-17 21:44:04 +01:00
eb2395d461
minor change to README
simon
2018-09-17 08:11:26 +01:00
c53f62b55d
add most changes suggested by pycodestyle
simon
2018-09-16 16:10:38 +01:00
75d3756bbc
fix errors discovered by pycodestyle
simon
2018-09-16 16:04:07 +01:00
5262c23281
add flags to README
simon
2018-09-16 15:58:17 +01:00
524f6a45cd
improve documentation
simon
2018-09-16 15:53:47 +01:00
a926090bed
update requirements
simon
2018-09-16 15:44:30 +01:00
91cd988f52
more comments and progress output
simon
2018-09-16 15:26:49 +01:00
f1855f5add
re-order imports because I'm fussy
simon
2018-09-16 09:06:30 +01:00
336517e84a
more documentation and add back some required imports
simon
2018-09-16 09:00:43 +01:00
7bc9fe0679
improved documentation and remove unneeded set
simon
2018-09-16 08:56:44 +01:00
6548f55416
improve documentation
simon
2018-09-15 21:48:50 +01:00
0244435fea
remove unnecessary imports
simon
2018-09-15 21:38:51 +01:00
d6964672b6
commit of working async crawler
simon
2018-09-15 21:30:02 +01:00
3808f72f73
correct semaphore usage
simon
2018-09-14 16:06:17 +01:00
7ebe4855b8
remove unnecessary classes2
simon
2018-09-14 16:02:20 +01:00
db986b0eba
async crawler in a mostly-working state
simon
2018-09-14 16:01:12 +01:00
36e1f7693f
initial foray into asynchronous crawling
simon
2018-09-12 22:54:12 +01:00
8698c21fda
return a value from WebPage indicating whether a link was crawlable, and only crawl it if it was
master
simon
2018-09-12 08:00:08 +01:00
273cf56a3b
add some basic tests
simon
2018-09-11 13:42:15 +01:00
1af26f50f2
added a docstring
simon
2018-09-11 13:42:02 +01:00
c40c5cea50
add async info
simon
2018-09-10 21:29:46 +01:00
a6224f9b6a
updated readme
simon
2018-09-10 20:56:12 +01:00
b64711973f
add new thoughts
simon
2018-09-10 11:58:58 +01:00
9e125dfae0
added comments and docstrings
simon
2018-09-09 22:49:55 +01:00
f16f82fdfb
improved completion message
simon
2018-09-09 22:40:42 +01:00
a523154848
display count of crawled/uncrawled URLs whilst running
simon
2018-09-09 22:35:55 +01:00
9e754a5584
improve handling of gzip/deflated data detection
simon
2018-09-09 11:21:46 +01:00
1b005570ee
implement gzip compression requests and handling
simon
2018-09-09 10:53:09 +01:00
17fa9f93f9
tick off gzip encoding
simon
2018-09-09 10:52:37 +01:00
1e51e10db2
update with changes
simon
2018-09-09 10:22:18 +01:00
225fd8b3ea
update with changes
simon
2018-09-09 10:22:03 +01:00
d686ae0bc4
update with changes
simon
2018-09-09 10:21:45 +01:00
69f5788745
update notes
simon
2018-09-09 10:16:22 +01:00
b5d644a223
various minor improvements to exception handling
simon
2018-09-09 10:16:03 +01:00
6508156aa4
use lxml as the parser and only find links on a page if we've got the source
simon
2018-09-09 10:06:25 +01:00
738ab8e441
adjust robots handling to deal with 404s and enforce a user agent which allows us to initially obtain the user agent
simon
2018-09-09 09:57:16 +01:00
fdd84a8786
manually retrieve robots.txt to ensure we can set the user-agent
simon
2018-09-07 12:40:12 +01:00
ab0ab0a010
add more thoughts
simon
2018-09-07 11:50:53 +01:00
6a1259aa7d
update plans to add gzip encoding
simon
2018-09-06 17:33:10 +01:00
164239b343
more thoughts
simon
2018-09-06 17:31:12 +01:00
ce1f2745c9
update thoughts
simon
2018-09-06 17:30:28 +01:00
e70bdc9ca1
update requirements.txt
simon
2018-09-06 17:25:30 +01:00
d1c1e17f4f
report runtime of script in generated sitemap
simon
2018-09-06 17:20:59 +01:00
816a727d79
ignore generated file
simon
2018-09-06 17:08:56 +01:00
84ab27a75e
render results as HTML
simon
2018-09-06 17:08:26 +01:00
6d9103c154
improved content-type detection
simon
2018-09-06 17:08:12 +01:00
e57a86c60a
only attempt to read html
simon
2018-09-06 16:30:11 +01:00
a3ec9451e3
implement parsing of robots.txt
simon
2018-09-05 18:56:20 +01:00
f2c294ebdb
added new ideas to implement
simon
2018-09-04 15:40:11 +01:00
1b9b207a28
attempt to remove base url with trailing slash (if discovered)
simon
2018-09-04 13:57:52 +01:00
6abe7d68e0
updated notes
simon
2018-09-04 12:51:59 +01:00
7d919039b6
removed unnecessary modules
simon
2018-09-04 10:14:27 +01:00
0726bcccb0
removed original file
simon
2018-09-04 09:21:55 +01:00
05e907ecec
too many changes to make a sensible commit message
simon
2018-09-04 09:21:26 +01:00
abc628106d
added a docstring to the WebPage object
simon
2018-08-31 19:18:00 +01:00
c436016e0c
remove unnecessary function
simon
2018-08-31 19:16:08 +01:00
03554fde80
add docstrings
simon
2018-08-31 19:15:35 +01:00
759f965e95
use more explicit names, use urljoin to combine urls
simon
2018-08-31 19:12:58 +01:00
0517e5bc56
crawler now initialises and populates crawled pool with urls it finds
simon
2018-08-31 19:02:21 +01:00
1b18aa83eb
corrected some small errors and added runner function
simon
2018-08-31 19:01:35 +01:00
5e0d9fd568
initial commit of crawler skeleton
simon
2018-08-31 18:26:49 +01:00
915def3a5d
rework url sanitiser to use urllib modules, move WebPage object to helpers
simon
2018-08-31 18:26:25 +01:00
453331d69d
simplified url qualifier
simon
2018-08-29 22:27:26 +01:00
2b812da26a
simplify UrlPoolManager to use a set instead of a dict
simon
2018-08-29 21:49:15 +01:00
fb096b4468
add scratchpad for notes
simon
2018-08-28 22:34:05 +01:00
5d94991167
start making the scraper an object
simon
2018-08-28 22:29:36 +01:00
482d23dd4f
blank __init__.py
simon
2018-08-28 22:29:11 +01:00
452de87f35
change name of pool management object to be more clear
simon
2018-08-28 22:28:49 +01:00
73cb883151
add a list manager object
simon
2018-08-28 22:28:16 +01:00
5c933fc5c9
initial commit of single-page scraper
simon
2018-08-28 18:29:34 +01:00
25f8c4c686
remove testing url with requests and assume that the user is correct
simon
2018-08-28 17:22:52 +01:00
0d0438670c
adjusted title
simon
2018-08-28 09:12:48 +01:00
8a1fd39dc4
added pycache dirs
simon
2018-08-27 19:38:13 +01:00
79b10798a3
initial commit of utils
simon
2018-08-27 19:37:41 +01:00
fb6b976391
initial commit of utils tests
simon
2018-08-27 19:36:43 +01:00
a04de7f4de
changed venv name
simon
2018-08-27 14:28:20 +01:00
665ec1d7a7
add readme
simon
2018-08-23 16:05:24 +01:00
65fc332925
ignore venv and vscode dirs
simon
2018-08-23 16:03:46 +01:00
c6ce63838f
bare script file
simon
2018-08-23 16:02:09 +01:00
c383fb7ee9
initial requirements file
simon
2018-08-23 15:59:18 +01:00
01a16a998c
initial gitignore
simon
2018-08-23 15:47:30 +01:00