Thoughts
- strip hashes and everything following (as they're in-page anchors)
- strip args
- use pop() on the set instead of .remove()
- return False once the set is empty (see the sketch below)
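A rough sketch of those pool rules (pool is the url set from these notes; next_url is a hypothetical helper name):

    from urllib.parse import urlsplit, urlunsplit

    def next_url(pool):
        # return False once the set is empty
        if not pool:
            return False
        # pop() removes and returns an arbitrary url in one step,
        # instead of a separate lookup plus .remove()
        url = pool.pop()
        # strip args and the hash fragment (in-page anchors)
        scheme, netloc, path, _query, _fragment = urlsplit(url)
        return urlunsplit((scheme, netloc, path, "", ""))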
- WebPage.parse_urls() needs to compare startswith to base url (filtering sketch after this list)
- ignore any links which aren't to pages
- better url checking to get bare domain #wontfix
- remove trailing slash from any discovered url
- investigate lxml parser
- remove base url from initial urls with and without trailing slash
- investigate using tldextract to match urls #wontfix
- implement parsing of robots.txt
- investigate gzip encoding
- implement some kind of progress display
- async
- better exception handling
- randomise output filename
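A minimal sketch of that link filtering (keep_url is a hypothetical name, and the non-page extension list is a guess):

    def keep_url(url, base_url):
        # compare startswith to base url: only follow same-site links
        if not url.startswith(base_url):
            return None
        # remove trailing slash from any discovered url
        url = url.rstrip("/")
        # ignore links which aren't to pages
        if url.endswith((".jpg", ".png", ".gif", ".pdf", ".zip", ".css", ".js")):
            return None
        return url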
Async bits
in __main__:

    loop = asyncio.get_event_loop()
    try:
        loop.run_until_complete(main())
    finally:
        loop.close()
- initialises the loop and runs it to completion
- needs to handle errors (try/except/finally; one possible shape sketched below)
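A sketch of what that error handling might look like, assuming main() is the crawler's entry point (the except clause shown is illustrative):

    import asyncio

    if __name__ == "__main__":
        loop = asyncio.get_event_loop()
        try:
            loop.run_until_complete(main())
        except KeyboardInterrupt:
            # allow ctrl-c to stop the crawl without a traceback
            print("interrupted, shutting down")
        finally:
            # always close the loop, even after an error
            loop.close()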
async def run(args=None):
    tasks = []
    for url in pool:
        # appending the bare url was a bug: gather() needs
        # coroutines or futures, so wrap each fetch in a task
        tasks.append(asyncio.ensure_future(get_source(url)))
    # run all the fetches concurrently and wait for completion
    await asyncio.gather(*tasks)
Getting the contents of the page needs to be async too
from urllib.request import urlopen

async def get_source(url):
    # urlopen() is blocking, so it can't be awaited directly;
    # run it in the default thread pool executor instead
    loop = asyncio.get_event_loop()
    response = await loop.run_in_executor(None, urlopen, url)
    return response.read()
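Alternatively, a fetch that is async end to end using aiohttp (a third-party library not mentioned in these notes, so this is a sketch rather than the plan):

    import aiohttp

    async def get_source(url):
        # a session per call keeps the sketch self-contained;
        # a real crawler would share one session across fetches
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                return await response.text()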