Web Scraping with Google App Engine
Here is a quick tutorial on how to scrape Google search results asynchronously with App Engine and cache the results in memcache. You should not use this directly, because Google can block you; it is just a sample to show you how to scrape web pages, feeds, XML, and so on.
But if you do want to do something like this, I recommend adding delays and acting more like a human in your scrapes, though I believe even that is against Google's TOS.
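As a rough idea of what such a delay could look like, here is a minimal sketch. The polite_fetch helper and the 2-5 second jitter are my own assumptions, not part of the scraper below; ndb.sleep is the tasklet-friendly counterpart of time.sleep, so other tasklets keep running while one waits.

import random
from google.appengine.ext import ndb

@ndb.tasklet
def polite_fetch(url):
    # hypothetical helper: wait a random 2-5 seconds before each request
    # so the traffic pattern looks a little less bot-like
    yield ndb.sleep(random.uniform(2.0, 5.0))
    ctx = ndb.get_context()
    response = yield ctx.urlfetch(url)
    raise ndb.Return(response)

Keep in mind that any sleeping counts against App Engine's request deadline, so you can only fit so many delayed fetches into a single request.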
I added the use of async here for people who don't know how to use it yet, so they can learn in the process. The code below is a complete working Google search scraper; read the code comments to understand everything.
This is all done with Python 2.7 and ndb.
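If tasklets are new to you, here is a minimal standalone sketch of the yield/future mechanics the scraper below relies on. The fetch_length_async name and the example URLs are just placeholders for illustration; run something like this inside a request handler.

from google.appengine.ext import ndb

@ndb.tasklet
def fetch_length_async(url):
    # yield suspends this tasklet; ndb resumes it when the RPC finishes
    ctx = ndb.get_context()
    response = yield ctx.urlfetch(url)
    # a tasklet "returns" its value by raising ndb.Return
    raise ndb.Return(len(response.content))

# calling a tasklet returns a future immediately; nothing has blocked yet
futures = [fetch_length_async(u)
           for u in ['http://example.com', 'http://example.org']]
ndb.Future.wait_all(futures)  # both fetches run concurrently here
sizes = [f.get_result() for f in futures]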
app.yaml
application: your-application-id
version: 1
runtime: python27
api_version: 1
threadsafe: true

handlers:
- url: /.*
  script: main.app

libraries:
- name: lxml
  version: latest

main.py
import urllib
from urlparse import urlparse, parse_qs

from google.appengine.ext import webapp, ndb
from lxml import html


# make the function an ndb.tasklet so you don't need to wait for each search
@ndb.tasklet
def search_google_async(keyword):
    """
    ndb has all the async methods of memcache & urlfetch
    and tries to auto-batch everything behind the scenes
    """
    ctx = ndb.get_context()
    url = 'http://www.google.com/search?' + urllib.urlencode({'q': keyword})

    # if you don't know yield, you should read up on it a bit; google
    # "yield and generators" with python. simple explanation: your function
    # will stop here, do the operations in batches, then continue on with
    # the next yields

    # check first if you already cached the results
    cache = yield ctx.memcache_get(url)
    if cache:
        # tasklets return by raising an exception, so to convert a normal
        # function to its async counterpart you add yield before any async
        # calls and change return to raise
        raise ndb.Return(cache)

    # we use the async urlfetch method from the ndb context
    response = yield ctx.urlfetch(url)

    links = []
    if response.status_code == 200:
        raw_html = response.content
        # use the lxml library to convert the string to a dom
        dom = html.fromstring(raw_html)
        # use a css selector to get all anchor tags
        anchors = dom.cssselect('a')
        for anchor in anchors:
            # get its href attribute (can be None for anchors without one)
            link = anchor.get('href')
            # since google puts all the results like this, you could
            # probably do a[href^="/url?q="] in the css selector instead
            if link and link.startswith('/url?q='):
                # pull out the q= query string parameter; it stores the
                # url of the result (parse_qs returns a list per key)
                parsedUrl = urlparse('http://www.google.com' + link)
                queryStr = parse_qs(parsedUrl.query)
                links.append(queryStr['q'][0])

    # now we set the results in memcache with the url as key and the list
    # of links as value; you could remove yield here and let it batch,
    # since app = ndb.toplevel(app) means the request will not terminate
    # until all async methods are finished
    yield ctx.memcache_set(url, links)

    # we return the links of the result
    raise ndb.Return(links)


class MainHandler(webapp.RequestHandler):
    def get(self):
        keywords = [
            'how to make pizza',
            'where can i buy a dog',
            'how big is the grand canyon',
        ]
        # calling an ndb.tasklet returns a future, so we collect them all
        # first and let ndb handle the batching behind the scenes
        futures = []
        for keyword in keywords:
            futures.append(search_google_async(keyword))
        # here is where everything waits for the results
        ndb.Future.wait_all(futures)
        # .get_result() is the value you raised/returned in your tasklet
        self.response.out.write([future.get_result() for future in futures])


app = webapp.WSGIApplication([('/', MainHandler)], debug=True)
# make sure all unhandled async tasks are finished before the request ends
app = ndb.toplevel(app)
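As the comment in the anchor loop hints, you could push the /url?q= filtering into the css selector itself. Here is a sketch of that variant, assuming the lxml version bundled with App Engine supports CSS attribute prefix selectors (cssselect translates them to an XPath starts-with()):

# variant of the anchor loop above: let the css selector do the filtering
for anchor in dom.cssselect('a[href^="/url?q="]'):
    parsedUrl = urlparse('http://www.google.com' + anchor.get('href'))
    links.append(parse_qs(parsedUrl.query)['q'][0])

Once deployed, requesting / writes the three result lists to the response; repeat requests for the same keywords should be served from memcache until the entries are evicted.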