Thursday, December 6, 2012

Web Scraping with Google App Engine

Here is a quick tutorial on scraping Google search results asynchronously with App Engine and caching the results in memcache. You should not use this code directly, because Google can block you; it is just a sample of how to scrape web pages, feeds, XML, and so on.

If you do want to do something like this, I recommend adding delays and making your scrapes behave more like a human visitor, although I believe that is against their TOS.

I used the async APIs here so that people who don't know them yet can learn them along the way. The code below is a complete, working Google search scraper; read the code comments to understand everything.
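If generators are new to you, a minimal sketch in plain Python (no App Engine required) may help before you read the full listing. The `Return` class, the `run` driver, `FAKE_CACHE`, and `cached_square` are hypothetical stand-ins made up for illustration; they are not part of ndb, and the real ndb event loop waits on futures rather than handing the yielded value straight back.

```python
class Return(Exception):
    """Hypothetical stand-in for ndb.Return: carries the tasklet's result."""
    def __init__(self, value):
        Exception.__init__(self, value)
        self.value = value


def run(gen):
    """Drive a generator to completion.

    A real event loop would wait on each yielded future; this toy driver
    treats every yielded object as an already-finished result and sends
    it straight back into the generator.
    """
    try:
        result = next(gen)
        while True:
            result = gen.send(result)
    except Return as done:
        return done.value
    except StopIteration:
        return None


FAKE_CACHE = {}  # stands in for memcache


def cached_square(n):
    # "yield" marks the points where an async call would be awaited
    cached = yield FAKE_CACHE.get(n)
    if cached is not None:
        # returning a value from a tasklet means raising Return
        raise Return(cached)
    value = yield n * n  # stands in for the slow remote call
    FAKE_CACHE[n] = value
    raise Return(value)


first = run(cached_square(4))   # computed, then stored in FAKE_CACHE
second = run(cached_square(4))  # served from FAKE_CACHE
```

The shape is the same as in the scraper: check the cache, return early on a hit, otherwise do the slow work, store it, and return it.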

The scraper itself is written for Python 2.7 and uses ndb.

app.yaml
application: your-application-id
version: 1
runtime: python27
api_version: 1
threadsafe: true

handlers:
- url: /.*
  script: main.app

libraries:
- name: lxml
  version: latest

main.py
import urllib
from urlparse import urlparse, parse_qs
from google.appengine.ext import webapp, ndb
from lxml import html

# make the function an ndb.tasklet so you don't need to wait for each search
@ndb.tasklet
def search_google_async(keyword):
    """
    The ndb context exposes async versions of the memcache and urlfetch
    calls and tries to batch or overlap them behind the scenes.
    """
    ctx = ndb.get_context()
    url = 'http://www.google.com/search?' + urllib.urlencode({'q': keyword})
    # If you don't know yield, read up on Python generators first.
    # Simple explanation: the function pauses at each yield, ndb runs the
    # pending operations (in batches where it can), and the function
    # resumes with the result before moving on to the next yield.
    # check first whether the results are already cached
    # (memcache_get gives None on a miss)
    cache = yield ctx.memcache_get(url)
    if cache is not None:
        # A tasklet returns its value by raising ndb.Return. To convert a
        # normal function to its async counterpart, add yield before every
        # async call, then change return to raise ndb.Return(...).
        # The results were cached, so return them.
        raise ndb.Return(cache)

    # use the async urlfetch method from the ndb context
    response = yield ctx.urlfetch(url)

    links = []
    if response.status_code == 200:
        raw_html = response.content
        # use the lxml library to convert the string to dom
        dom = html.fromstring(raw_html)
        # use a css selector to get all anchor tags
        anchors = dom.cssselect('a')
        for anchor in anchors:
            # get its href attribute (None if the anchor has no href)
            link = anchor.get('href')
            # Google puts every result link behind /url?q=, so you can
            # probably do a[href^="/url?q="] in the css selector instead
            if link and link.startswith('/url?q='):
                # the q= query string parameter stores the url of the result;
                # you can extract it however you want
                parsedUrl = urlparse('http://www.google.com' + link)
                queryStr = parse_qs(parsedUrl.query)
                # parse_qs maps each key to a list of values, so take the first
                links.append(queryStr['q'][0])

        # Now store the list of links in memcache, keyed by the url.
        # You can remove the yield here and let the write finish later,
        # since app = ndb.toplevel(app) means the request will not
        # terminate until all async operations are finished.
        yield ctx.memcache_set(url, links)
        
    # return the result links
    raise ndb.Return(links)

class MainHandler(webapp.RequestHandler):

    def get(self):
        keywords = [
            'how to make pizza',
            'where can i buy a dog',
            'how big is the grand canyon'
        ]
        # Calling a tasklet returns a future right away, so we start all
        # the searches first and wait afterwards. That keeps the number of
        # calls as low as possible and lets ndb handle the batching.
        futures = []
        for keyword in keywords:
            futures.append(search_google_async(keyword))
        # this is where everything waits for the results
        ndb.Future.wait_all(futures)
        # .get_result() gives the value you raised with ndb.Return in your tasklet
        self.response.out.write([future.get_result() for future in futures])


app = webapp.WSGIApplication([('/', MainHandler)],
                             debug=True)

# make sure all outstanding async tasks are finished before the request ends
app = ndb.toplevel(app)
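
The link-extraction step can also be tried on its own, outside App Engine. Below is a minimal sketch using only the standard library. The function name `extract_result_urls` and the sample hrefs are hypothetical, and the `/url?q=` link format is simply what the scraper above assumes; Google may change it at any time.

```python
try:
    from urllib.parse import urlparse, parse_qs  # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs      # Python 2.7


def extract_result_urls(hrefs):
    """Return the target url of every href that looks like /url?q=..."""
    urls = []
    for href in hrefs:
        # anchors without an href attribute show up as None
        if not href or not href.startswith('/url?q='):
            continue
        query = parse_qs(urlparse(href).query)
        # parse_qs maps each key to a list and drops blank values,
        # so q can be missing entirely
        targets = query.get('q')
        if targets:
            urls.append(targets[0])
    return urls


sample_hrefs = [
    '/url?q=http://example.com/page&sa=U',
    None,
    '/search?q=foo',
    '/url?q=http%3A%2F%2Fexample.org%2F%3Fa%3D1&ved=x',
    '/url?q=',
]
result_urls = extract_result_urls(sample_hrefs)
```

Note that `parse_qs` also percent-decodes the value, so the second result comes back as a plain url.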