Saturday, August 02, 2014

Quick and Dirty Web Crawling with ScraPy


I haven't crawled with Python before - we use Apache Nutch (NutchGORA, actually) for large crawls, and it's been a while since I had to do small, focused crawls. Back then, my tool of choice was WebSPHINX, a small but powerful Java library for building crawlers.

We recently needed to crawl a small site, and I had heard good things about Scrapy, a Python toolkit for building crawlers and crawl pipelines, so I figured this was a good opportunity to learn how to use it. This post describes my experience, as well as the resulting code. Overall, this is what I did:

  1. Create a Scrapy crawler and download all the pages as HTML, as well as some document metadata. This writes to a single large JSON file.
  2. Pull out the HTML from the JSON into multiple HTML documents, one HTML file for each web page.
  3. Parse out the HTML and merge all metadata back into individual JSON files, one JSON per document.

I installed Scrapy using apt-get, based on the advice on this page. Earlier, I had tried "pip install" but it failed with unknown libffi errors.
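I did not go back and verify this, but the usual fix for that kind of pip build failure on Ubuntu is to install the native build dependencies first and then retry; something along the lines below (standard Ubuntu package names) would probably have worked too.

sudo apt-get install python-dev libffi-dev libssl-dev libxml2-dev libxslt1-dev
sudo pip install scrapy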

Once installed, the first thing to do is create a Scrapy project. Unlike most other Python modules, Scrapy is both a command-line toolkit and a library. The following command creates a stub project:

sujit@tsunami:~mtcrawler$ scrapy startproject mtcrawler
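For reference, the stub that startproject generates looks roughly like this (reproduced from memory, so the exact files may vary a little between Scrapy versions):

mtcrawler/
    scrapy.cfg              # deploy configuration
    mtcrawler/              # the project's Python package
        __init__.py
        items.py            # Item definitions (edited below)
        pipelines.py
        settings.py
        spiders/            # spider implementations go here
            __init__.py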

Based on the files in the project stub, I am pretty sure that the three steps I show above could have been done in one shot, but for convenience I just used Scrapy as a crawler, deliberately keeping the interaction with the target site as minimal as possible. I figured that fewer steps translate to fewer errors, and thus less chance of having to run this step multiple times. Once the pages were crawled, I could then iterate as many times as needed against the local files without bothering the website.

Scrapy needs an Item and a Spider implementation. The Spider's parse() method yields Request objects (for outlinks found on the page) or Item objects (representing the document being crawled). The default output mechanism captures the results of the crawl as a list of lists of Items.

The Item class is defined in items.py. After my changes, items.py looks like this:

# Source: mtcrawler/items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field

class MTCrawlerItem(Item):
    link = Field()
    body = Field()
    sample_name = Field()
    type_name = Field()

And here is the code for the Spider. Because the primary use case for Scrapy appears to be scraping interesting content from single web pages, the tutorial doesn't provide much help with multi-page crawls. My code is heavily adapted from this post by Milinda Pathirage.

# Source: mtcrawler/spiders/MTSpider.py
# -*- coding: utf-8 -*-
import time
import urlparse

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from scrapy.http.request import Request

from mtcrawler.items import MTCrawlerItem

ROOT_PAGE = "http://my_site_name.com/"
SLEEP_TIME = 1

class MTSpider(BaseSpider):
    name = "my_site_name"
    allowed_domains = ["my_site_name.com"]
    start_urls = [ ROOT_PAGE ]
    already_crawled = set()

    def get_page_id(self, url):
        params = urlparse.parse_qs(urlparse.urlparse(url).query)
        sample_name = "_"
        if params.has_key("Sample"):
            sample_name = params["Sample"][0]
        elif params.has_key("sample"):
            sample_name = params["sample"][0]
        type_name = "_"
        if params.has_key("Type"):
            type_name = params["Type"][0]
        elif params.has_key("type"):
            type_name = params["type"][0]
        return "::".join([sample_name, type_name])
        
    def parse(self, response):
        selector = Selector(response)
        for sel in selector.xpath("//a"):
            title = sel.xpath("text()").extract()
            if len(title) == 0: continue
            url = sel.xpath("@href").extract()
            if len(url) == 0: continue
            if "sample.asp" in url[0] or "browse.asp" in url[0]:
                child_url = url[0]
                if not child_url.startswith(ROOT_PAGE):
                    child_url = ROOT_PAGE + child_url
                page_id = self.get_page_id(child_url)
                if page_id in self.already_crawled:
                    continue
                self.already_crawled.add(page_id)
                yield Request(child_url, self.parse)
        # now download the file if it is a sample
        if "sample.asp" in response.url:
            item = MTCrawlerItem()
            item["link"] = response.url
            item["body"] = selector.select("//html").extract()[0]
            page_ids = self.get_page_id(response.url).split("::")
            item["sample_name"] = page_ids[0]
            item["type_name"] = page_ids[1]
            yield item
        time.sleep(SLEEP_TIME)

The site consists of two kinds of pages that we are interested in - browse.asp serves directory-style pages and sample.asp serves pages representing actual documents. The code above looks for outlinks in each page it visits, and if an outlink URL contains browse.asp or sample.asp, it yields a Request for the crawler to follow. In addition, if the current page was returned by sample.asp, it yields an Item containing the page body along with some metadata. We haven't specified a maximum depth to crawl - since a sample is uniquely identified by its sample_name and type_name, we instead maintain a set of (sample_name, type_name) values crawled so far and skip pages we have already seen. The crawler is run using the following command:

sujit@tsunami:~mtcrawler$ scrapy crawl my_site_name -o results.json -t json
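As an aside, the time.sleep() call and the manual already_crawled set in the spider do work that Scrapy can partly handle itself. Below is a sketch of the relevant knobs in the generated settings.py - these are standard Scrapy settings, but I did not use them for this crawl, and note that Scrapy's built-in duplicate filter only de-duplicates exact URLs, so it would not catch the same sample reached through different query strings the way the (sample_name, type_name) set does.

# Source: mtcrawler/settings.py (sketch only, not what was used above)
BOT_NAME = "mtcrawler"
SPIDER_MODULES = ["mtcrawler.spiders"]
NEWSPIDER_MODULE = "mtcrawler.spiders"

# politeness: let Scrapy wait between requests instead of calling time.sleep()
DOWNLOAD_DELAY = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 1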

This brings back 5,169 items. One issue with results.json is that Scrapy does not write the terminating "]" character - it could be a bug or something about my environment. In any case, I was unable to parse the file (or view it in Chrome) until I appended the terminating "]" myself (see the small fix-up snippet after the data sample below). The data looks something like this:

[
  [
    {
      body: "<html>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. 
        Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque 
        penatibus et magnis dis parturient montes, nascetur ridiculus mus. 
        Donec quam felis, ultricies nec, pellentesque eu, pretium quis, sem. 
        Nulla consequat massa quis enim. Donec pede justo, fringilla.</html>",
      type_name: "34-Neurosurgery",
      sample_name: "1234-Wound Closure",
      link: "http://www.mysite.com/sample.asp?Type=34-Neurosurgery&Sample=1234
        -Wound Closure"
    },
    ...
  ]
]
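The fix-up for the missing "]" is trivial; here is a minimal sketch of what I mean (this is not part of the pipeline scripts, just a one-off repair):

# append the missing "]" so results.json becomes valid JSON
with open("results.json", "a") as f:
    f.write("]")

# sanity check that the file now parses
import json
with open("results.json") as f:
    recs = json.load(f)
print "#-records:", len(recs[0])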

A nice convenience is Scrapy's shell (a customized Python REPL), which allows you to test your XPath expressions against live pages. I used it here, as well as later when parsing the HTML files. You can invoke it like so:

sujit@tsunami:~mtcrawler$ scrapy shell http://path/to/url.html
2014-08-02 08:30:25-0700 [scrapy] INFO: Scrapy 0.24.2 started (bot: mtcrawler)
2014-08-02 08:30:25-0700 [scrapy] INFO: Optional features available: ...
2014-08-02 08:30:25-0700 [scrapy] INFO: Overridden settings: ...
2014-08-02 08:30:25-0700 [scrapy] INFO: Enabled extensions: ...
2014-08-02 08:30:26-0700 [scrapy] INFO: Enabled downloader middlewares: ...
2014-08-02 08:30:26-0700 [scrapy] INFO: Enabled spider middlewares: ...
2014-08-02 08:30:26-0700 [scrapy] INFO: Enabled item pipelines: 
2014-08-02 08:30:26-0700 [default] INFO: Spider opened
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f7589495510>
[s]   item       {}
[s]   request    <GET http://path/to/url.html>
[s]   response   <200 http://path/to/url.html>
[s]   settings   <scrapy.settings.Settings object at 0x7f7589bb9150>
[s]   spider     <Spider 'default' at 0x7f7588ec3b90>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

>>> response.xpath("//title/text()").extract()[0]
u'Title of Sample Page'
>>> 

The next step is to extract the HTML from the large JSON file into multiple small files, one document per file. We do this with the simple program below:

# Source: mtcrawler/json_to_files.py
# -*- coding: utf-8 -*-
import json
import re
import md5
import os
import urllib

DATA_DIR = "/path/to/data/directory"
JSON_FILE = "results.json"

p = re.compile(r"\d+-(.*)")

def remove_leading_number(s):
    m = re.search(p, s)
    if m is None:
        return s
    else:
        return m.group(1)

def normalize(s):
    return urllib.quote_plus(s)
    
def get_md5(s):
    m = md5.new()
    m.update(s)
    return m.hexdigest()
    
fin = open(os.path.join(DATA_DIR, JSON_FILE), 'rb')
jobj = json.load(fin)
fin.close()
print "#-records:", len(jobj[0])
unique_recs = set()
rawhtml_dir = os.path.join(DATA_DIR, "raw_htmls")
os.makedirs(rawhtml_dir)
for rec in jobj[0]:
    type_name = remove_leading_number(normalize(rec["type_name"]))
    sample_name = remove_leading_number(normalize(rec["sample_name"]))
    body = rec["body"].encode("utf-8")
    md5_body = get_md5(body)
    print "%s/%s.html" % (type_name, sample_name)
    unique_recs.add(md5_body)
    dir_name = os.path.join(rawhtml_dir, type_name)
    try:    
        os.makedirs(dir_name)
    except OSError:
        # directory already made just use it
        pass
    fout = open(os.path.join(dir_name, sample_name) + ".html", 'wb')
    fout.write(body)
    fout.close()
print "# unique records:", len(unique_recs)

The program above reads the crawled data and writes out a directory structure organized as type_name/sample_name. I check content uniqueness by computing the MD5 digest of each page body, which gives 5,117 unique documents. However, because the same document can be reached via different paths (and the HTML markup presumably differs slightly between them), the actual number of unique documents across type and sample is 4,835.
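To make the relationship between the two counts concrete, here is a small sketch (it reuses jobj and get_md5 from the script above; the variable names are illustrative, not from the actual script):

unique_bodies = set()   # distinct page content, by MD5 digest
unique_names = set()    # distinct (type_name, sample_name) pairs
for rec in jobj[0]:
    unique_bodies.add(get_md5(rec["body"].encode("utf-8")))
    unique_names.add((rec["type_name"], rec["sample_name"]))
print "# unique by content:", len(unique_bodies)          # 5,117 in my run
print "# unique by (type, sample):", len(unique_names)    # 4,835 in my run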

We then parse the HTML files and the directory metadata back into a flat JSON format, one file per sample. Only 2,224 of these contain unique text, because the same document can be mapped to multiple categories.

# -*- coding: utf-8 -*-
import re
import os
import urllib
import json
import md5
from scrapy.selector import Selector

DATA_DIR = "/path/to/data/directory"
DISCLAIMER_TEXT = "anything else to real world is purely incidental"

def normalize(s):
    return urllib.unquote_plus(s)

blob_pattern = re.compile(r"<b[.*?>]*>[A-Z0-9:() ]+</b>")

def get_candidate_blob(line, sel):
    line = line.strip()
    # check for the obvious (in this case) <b>HEADING:</b> pattern
    m = re.search(blob_pattern, line)
    if m is not None:
        # return the matched block
        return line
    # some of the blobs are enclosed in a multi-line div block
    div_text = None
    for div in sel.xpath("//div[@style]"):
        style = div.xpath("@style").extract()[0]
        if "text-align" in style:
            div_text = div.xpath("text()").extract()[0]
            div_text = div_text.replace("\n", " ")
            div_text = re.sub(r"\s+", " ", div_text).strip()
            break
    if div_text is not None and DISCLAIMER_TEXT not in div_text:
        return div_text
    # finally drop down to just returning a long line (> 700 chars)
    # This could probably be more sophisticated, e.g. by checking the
    # text density of the line instead
    if len(line) > 700 and DISCLAIMER_TEXT not in line:
        return line
    return None
    
def unblobify(text):
    if text is None:
        return text
    # convert br tags to newline
    text = re.sub("<[/]*br[/]*>", "\n", text)
    # remove html tags
    text = re.sub("<.*?[^>]>", "", text)
    return text

def md5_hash(s):
    m = md5.new()
    m.update(s)
    return m.hexdigest()

rawhtml_dir = os.path.join(DATA_DIR, "raw_htmls")
json_dir = os.path.join(DATA_DIR, "jsons")
os.makedirs(json_dir)
json_fid = 0
doc_digests = set()
for root, dirnames, filenames in os.walk(rawhtml_dir):
    for filename in filenames:
        in_path = os.path.join(root, filename)
        fin = open(in_path, 'rb')
        text = fin.read()        
        fin.close()
        json_obj = {}
        # extract metadata from directory structure
        json_obj["sample"] = normalize(in_path.split("/")[-1].replace(".html", ""))
        json_obj["category"] = normalize(in_path.split("/")[-2])
        # extract metadata from specific tags in HTML
        sel = Selector(text=text, type="html")
        json_obj["title"] = sel.xpath("//title/text()").extract()[0]
        json_obj["keywords"] = [x.strip() for x in 
          sel.xpath('//meta[contains(@name, "keywords")]/@content').
          extract()[0].split(",")]
        json_obj["description"] = sel.xpath(
          '//meta[contains(@name, "description")]/@content').extract()[0]
        # extract dynamic text blob from text using regex
        for line in text.split("\n"):
            text_blob = get_candidate_blob(line, sel)
            if text_blob is not None:
                break
        if text_blob is None:
            print "=====", in_path
            continue
        json_obj["text"] = unblobify(text_blob)
        doc_digests.add(md5_hash(json_obj["text"]))
        print "Output JSON for: %s :: %s" % (json_obj["category"], json_obj["sample"])
        json_fname = "%04d.json" % (json_fid)
        fout = open(os.path.join(json_dir, json_fname), 'wb')        
        json.dump(json_obj, fout)
        fout.close()
        json_fid += 1
print "# unique docs:", len(doc_digests)

As mentioned earlier, the site is completely dynamic and renders through ASP pages. The objective of the code above is to extract the dynamic block of text from the page template. Unfortunately, there does not seem to be a single reliable way of recognizing this block. The three heuristics I used were: check for a regular expression that begins the majority of the texts (in this case the text does not contain line breaks), check for the contents of a div block whose style attribute contains "text-align", and finally check for lines longer than 700 characters. The last two also match the disclaimer text (which is constant across all the pages on the site), so I use a phrase from the disclaimer to eliminate those blocks. For this I use Scrapy's XPath API as well as some plain old regex hacking. An output file now looks like the following.

{
  category: "Neurosurgery",
  description: "Donec vitae sapien ut libero venenatis faucibus. Nullam quis 
    ante. Etiam sit amet orci eget eros faucibus tincidunt.",
  title: "Donec vitae sapien ut libero",
  text: "Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean 
    commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et 
    magnis dis parturient montes, nascetur ridiculus mus. Donec quam felis, 
    ultricies nec, pellentesque eu, pretium quis, sem. Nulla consequat massa 
    quis enim. Donec pede justo, fringilla vel, aliquet nec, vulputate eget, 
    arcu. In enim justo, rhoncus ut, imperdiet a, venenatis vitae, justo.",
  sample: "Wound Care",
  keywords: [
    "dor fundoplication", 
    "lysis of adhesions", 
    "red rubber catheter"
  ]
}
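As a quick illustration of the unblobify() step above, here is what it does to a made-up line that follows the same <b>HEADING:</b> pattern the regex looks for:

line = ('<b>DESCRIPTION:</b> Lorem ipsum dolor sit amet, consectetuer '
        'adipiscing elit.<br/>Aenean commodo ligula eget dolor.')
print unblobify(line)
# DESCRIPTION: Lorem ipsum dolor sit amet, consectetuer adipiscing elit.
# Aenean commodo ligula eget dolor.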

And that's all I have today. The next step is to analyze these files - I will share if I learn something new or find something interesting. By the way, if you are wondering about the Lorem Ipsum in the examples, it's deliberate, done to protect the website I was crawling for this work. The placeholder text for these examples was generated by this tool.

4 comments (moderated to prevent spam):

Anonymous said...

Scrapy is useful for web scraping. Sometimes I use Scrapy to scrape and sometimes I write custom scripts in PHP. I have created a custom web scraper tool to scrape websites like Ebay, Facebook, Yelp and many more.

Sujit Pal said...

Sorry about the delay in publishing your comment, and thanks for the link, this is very cool!

Unknown said...

Hello. Thank you for the wonderful tutorial! I added it to my website's list of great Scrapy-based website crawler tutorials.

Sujit Pal said...

You are welcome and thank you for including my post in your list - you have a very nice collection of scrapy articles on this page.