Commit Graph

87 Commits

SHA1 Message Date
679b1b7b53 rename all instances of base_url to rooturl, add more documentation 2018-09-18 18:24:15 +01:00
32d7f1e54b add talking points 2018-09-18 18:23:12 +01:00
f6265f18a7 initial test for AsyncCrawler 2018-09-18 18:22:55 +01:00
9a4e9ddfc7 add test for missing robots.txt 2018-09-18 10:53:13 +01:00
51f988e1bc added more tests 2018-09-17 21:44:20 +01:00
73c21e5bd3 small improvements to docs and variables 2018-09-17 21:44:04 +01:00
eb2395d461 minor change to README 2018-09-17 08:11:26 +01:00
c53f62b55d add most changes suggested by pycodestyle 2018-09-16 16:10:38 +01:00
75d3756bbc fix errors discovered by pycodestyle 2018-09-16 16:04:07 +01:00
5262c23281 add flags to README 2018-09-16 15:58:17 +01:00
524f6a45cd improve documentation 2018-09-16 15:53:47 +01:00
a926090bed update requirements 2018-09-16 15:44:30 +01:00
91cd988f52 more comments and progress output 2018-09-16 15:26:49 +01:00
f1855f5add re-order imports because I'm fussy 2018-09-16 09:06:30 +01:00
336517e84a more documentation and add back some required imports 2018-09-16 09:00:43 +01:00
7bc9fe0679 improved documentation and remove unneeded set 2018-09-16 08:56:44 +01:00
6548f55416 improve documentation 2018-09-15 21:48:50 +01:00
0244435fea remove unnecessary imports 2018-09-15 21:38:51 +01:00
d6964672b6 commit of working async crawler 2018-09-15 21:30:02 +01:00
3808f72f73 correct semaphore usage (see the semaphore sketch below) 2018-09-14 16:06:17 +01:00
7ebe4855b8 remove unnecessary classes2 2018-09-14 16:02:20 +01:00
db986b0eba async crawler in a mostly-working state 2018-09-14 16:01:12 +01:00
36e1f7693f initial foray into asynchronous crawling 2018-09-12 22:54:12 +01:00
8698c21fda return a flag from WebPage indicating whether a link was actually crawlable, and only crawl it if it was 2018-09-12 08:03:08 +01:00
273cf56a3b add some basic tests 2018-09-11 13:42:15 +01:00
1af26f50f2 added a docstring 2018-09-11 13:42:02 +01:00
c40c5cea50 add async info 2018-09-10 21:29:46 +01:00
a6224f9b6a updated readme 2018-09-10 20:56:12 +01:00
b64711973f add new thoughts 2018-09-10 11:58:58 +01:00
9e125dfae0 added comments and docstrings 2018-09-09 22:49:55 +01:00
f16f82fdfb improved completion message 2018-09-09 22:40:42 +01:00
a523154848 display count of crawled/uncrawled URLs whilst running 2018-09-09 22:35:55 +01:00
9e754a5584 improve handling of gzip/deflated data detection 2018-09-09 11:21:46 +01:00
1b005570ee implement gzip compression requests and handling (see the gzip sketch below) 2018-09-09 10:53:09 +01:00
17fa9f93f9 tick off gzip encoding 2018-09-09 10:52:37 +01:00
1e51e10db2 update with changes 2018-09-09 10:22:18 +01:00
225fd8b3ea update with changes 2018-09-09 10:22:03 +01:00
d686ae0bc4 update with changes 2018-09-09 10:21:45 +01:00
69f5788745 update notes 2018-09-09 10:16:22 +01:00
b5d644a223 various minor improvements to exception handling 2018-09-09 10:16:03 +01:00
6508156aa4 use lxml as the parser and only find links on a page if we've got the source (see the link-extraction sketch below) 2018-09-09 10:06:25 +01:00
738ab8e441 adjust robots handling to deal with 404s and enforce a user agent, which allows us to initially obtain robots.txt 2018-09-09 09:57:16 +01:00
fdd84a8786 manually retrieve robots.txt to ensure we can set the user-agent (see the robots.txt sketch below) 2018-09-07 12:40:12 +01:00
ab0ab0a010 add more thoughts 2018-09-07 11:50:53 +01:00
6a1259aa7d update plans to add gzip encoding 2018-09-06 17:33:10 +01:00
164239b343 more thoughts 2018-09-06 17:31:12 +01:00
ce1f2745c9 update thoughts 2018-09-06 17:30:28 +01:00
e70bdc9ca1 update requirements.txt 2018-09-06 17:25:30 +01:00
d1c1e17f4f report runtime of script in generated sitemap 2018-09-06 17:20:59 +01:00
816a727d79 ignore generated file 2018-09-06 17:08:56 +01:00
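
A few of the techniques named in the log are worth sketching. The async crawler built up across 36e1f7693f, db986b0eba and 3808f72f73 bounds its concurrency with a semaphore. Below is a minimal sketch of that pattern, assuming aiohttp as the HTTP client; MAX_CONCURRENT, fetch and crawl are illustrative names, not taken from the repo.

```python
import asyncio
import aiohttp

MAX_CONCURRENT = 10  # assumed cap on simultaneous requests

async def fetch(session, sem, url):
    # Acquire the semaphore before opening the connection, so at most
    # MAX_CONCURRENT requests are ever in flight at once.
    async with sem:
        async with session.get(url) as response:
            return await response.text()

async def crawl(urls):
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    async with aiohttp.ClientSession() as session:
        # gather() schedules every fetch immediately; the semaphore,
        # not gather(), is what throttles concurrency.
        return await asyncio.gather(
            *(fetch(session, sem, url) for url in urls))

if __name__ == "__main__":
    pages = asyncio.run(crawl(["https://example.com/"]))
    print(len(pages), "pages fetched")
```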
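fdd84a8786 and 738ab8e441 describe fetching robots.txt by hand so a custom User-Agent header can be sent, with a missing file (404) treated as "everything allowed". A sketch under those assumptions, using only the standard library; USER_AGENT and load_robots are hypothetical names.

```python
import urllib.error
import urllib.request
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleCrawler/0.1"  # hypothetical agent string

def load_robots(root_url):
    parser = RobotFileParser()
    # Build the request by hand so we control the User-Agent header.
    request = urllib.request.Request(
        root_url.rstrip("/") + "/robots.txt",
        headers={"User-Agent": USER_AGENT},
    )
    try:
        with urllib.request.urlopen(request) as response:
            parser.parse(response.read().decode("utf-8").splitlines())
    except urllib.error.HTTPError:
        # No robots.txt (e.g. a 404): an empty rule set permits every path.
        parser.parse([])
    return parser

robots = load_robots("https://example.com")
print(robots.can_fetch(USER_AGENT, "https://example.com/some/page"))
```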
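1b005570ee and 9e754a5584 request compressed responses and detect gzip/deflate bodies before decoding. One plausible shape for that detection, checking the gzip magic number rather than trusting the Content-Encoding header alone; this is a sketch, not the repo's exact logic.

```python
import gzip
import zlib
import urllib.request

def fetch_decoded(url):
    request = urllib.request.Request(
        url, headers={"Accept-Encoding": "gzip, deflate"})
    with urllib.request.urlopen(request) as response:
        body = response.read()
        encoding = response.headers.get("Content-Encoding", "")
    if body[:2] == b"\x1f\x8b":      # gzip magic number
        body = gzip.decompress(body)
    elif encoding == "deflate":
        try:
            body = zlib.decompress(body)                   # zlib-wrapped
        except zlib.error:
            body = zlib.decompress(body, -zlib.MAX_WBITS)  # raw deflate
    return body.decode("utf-8", errors="replace")
```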
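Finally, 6508156aa4 parses pages with lxml and only extracts links when the page source was actually retrieved. A minimal sketch of that guard, assuming lxml is installed; find_links is a hypothetical name.

```python
from lxml import html

def find_links(source, page_url):
    # Pages whose source was never retrieved yield no links at all.
    if not source:
        return []
    tree = html.fromstring(source)
    tree.make_links_absolute(page_url)  # resolve relative hrefs
    return [link for _, attr, link, _ in tree.iterlinks()
            if attr == "href" and link.startswith(("http://", "https://"))]
```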