only attempt to read html

2018-09-06 16:30:11 +01:00
parent a3ec9451e3
commit e57a86c60a
2 changed files with 4 additions and 2 deletions

@@ -5,7 +5,7 @@
 * ~~use `pop()` on the set instead of `.remove()`~~
 * ~~return false once the set is empty~~
 * ~~`WebPage.parse_urls()` needs to compare startswith to base url~~
-* ignore any links which aren't to pages
+* ~~ignore any links which aren't to pages~~
 * better url checking to get bare domain
 * ~~remove base url from initial urls with and without trailing slash~~
 * investigate using [tldextract](https://github.com/john-kurkowski/tldextract) to match urls

@@ -61,6 +61,8 @@ class WebPage(object):
 request = urllib.request.Request(self.url, headers=self.headers)
 page = urllib.request.urlopen(request, timeout=5)
+headers = page.info()
+if headers['content-type'] == "text/html":
-self.source = page.read()
+    self.source = page.read()
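
A note on the content-type check above: servers often send the header with parameters, for example `text/html; charset=utf-8`, and an exact string comparison would skip those pages. Below is a minimal sketch of a more tolerant check, assuming the object returned by `page.info()` behaves like an `email.message.Message` (which `http.client.HTTPMessage` subclasses). The helper names `is_html` and `make_headers` are hypothetical and not part of this commit.

```python
from email.message import Message


def is_html(headers: Message) -> bool:
    """Return True when the response's media type is text/html."""
    # get_content_type() lowercases the value and drops parameters such as
    # "; charset=utf-8". It falls back to "text/plain" if the header is missing.
    return headers.get_content_type() == "text/html"


def make_headers(content_type: str) -> Message:
    """Hypothetical helper: build a header object for trying out is_html()."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg


if __name__ == "__main__":
    for value in ("text/html", "text/html; charset=utf-8", "application/pdf"):
        print(value, "->", is_html(make_headers(value)))
```

In `WebPage`, the condition could then read `if is_html(page.info()):`. Whether that fits the rest of the class is untested here.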