Here is a somewhat silly but useful little script I wrote. We often have to do a discovery crawl of client sites, using seed URLs they give us. One such time, we ended up with abnormally few records in the index, so naturally the question arose as to whether something was wrong with the crawl.
During the debugging process, one of our crawl engineers sent out an email with numbers pulled from Google's site search. So if you wanted to know how many pages were indexed by Google for a site (say foo.bar.com), you would enter the query "site:foo.bar.com" in the search box, and the number you are looking for would be available on the right hand side of the blue title bar of the results.
Results 1 - 10 of about 1001 from foo.bar.com (0.15 seconds)
                        ^^^^
His email started with the phrase, "According to Google...", which is the inspiration for the name of the script and this blog post. I thought of writing this script with the idea that we could tack it on at the end of the crawl as a quick check to verify that our crawler crawled the "correct" number of pages. Obviously the number returned by Google is an approximation, since the number of pages could have changed between their crawl and ours, so we only want to verify that we are within some tolerance, say 10%, of the Google numbers.
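The comparison itself is the easy part; a minimal sketch of such a check is shown below. The names crawledCount and googleCount are hypothetical stand-ins here: googleCount would come from the script described in this post, while crawledCount would come from the crawler's own index, which is crawler-specific.

# Minimal sketch of the post-crawl sanity check, assuming we already have
# crawledCount (from our own index, crawler-specific) and googleCount
# (scraped from Google by the script below). Both names are hypothetical.
def withinTolerance(crawledCount, googleCount, tolerance=0.10):
  # A zero Google count gives us nothing to compare against
  if (googleCount == 0):
    return (crawledCount == 0)
  return abs(crawledCount - googleCount) <= tolerance * googleCount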
The script can be run from the command line with a list of seed URLs as arguments. Here is an example of calling it and the resulting output.
sujit@sirocco:~$ ./acc2google.py \
  sujitpal.blogspot.com \
  www.geocities.com/sujitpal \
  foo.bar.baz
According to Google...
  #-pages for sujitpal.blogspot.com : 323
  #-pages for www.geocities.com/sujitpal : 71
  #-pages for foo.bar.baz : 0
  ----------------------------------------
  Total pages: 394
The last URL is bogus, but the script correctly reports 0 pages for it. The script can quite easily be modified to extract the seed URLs from your crawler configuration instead, although where that information lives is crawler-specific; a sketch of reading seeds from a file follows the listing. Here is the script.
#!/usr/bin/python
# acc2google.py
# Reads a list of sites from the command line params and reports the
# approximate number of pages that Google has indexed for each site.
#
import httplib
import locale
import re
import string
import sys
import urllib

def main():
  numargs = len(sys.argv)
  if (numargs < 2):
    print " ".join(["Usage:", sys.argv[0], "site ..."])
    sys.exit(-1)
  print "According to Google..."
  totalIndexed = 0
  for i in range(1, numargs):
    site = sys.argv[i]
    query = "site:" + urllib.quote_plus(site)
    try:
      # Issue the site: query against Google's regular search page
      conn = httplib.HTTPConnection("www.google.com")
      conn.request("GET", "".join(["/search?hl=en&q=", query,
        "&btnG=Google+Search"]))
      response = conn.getresponse()
    except:
      continue
    # Scrape the "Results 1 - 10 of about N" count out of the result HTML
    m = re.search("Results <b>\\d+</b> - <b>\\d+</b> " +
        "of about <b>([0-9][0-9,]*)</b>", response.read())
    if (not m):
      numIndexed = "0"
    else:
      numIndexed = m.group(1)
    print " " + " ".join(["#-pages for", site, ":", numIndexed])
    totalIndexed = totalIndexed + int(string.replace(numIndexed, ",", ""))
  print " ----------------------------------------"
  locale.setlocale(locale.LC_NUMERIC, '')
  print " " + " ".join(["Total pages:",
      locale.format("%.*f", (0, totalIndexed), True)])

if __name__ == "__main__":
  main()
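If you would rather pull the seed URLs from a file than pass them on the command line, a helper along the lines of the sketch below could replace the sys.argv loop. The seeds.txt file name and its one-URL-per-line format are just assumptions here; where the seed list actually lives depends on your crawler configuration.

# Hypothetical helper: read seed URLs from a plain-text file, one URL per
# line, skipping blank lines and #-comments. The file name seeds.txt is an
# assumption; the real location and format are crawler-specific.
def readSeeds(seedFile="seeds.txt"):
  seeds = []
  for line in open(seedFile):
    line = line.strip()
    if (line and not line.startswith("#")):
      seeds.append(line)
  return seeds

The main loop would then iterate over readSeeds() instead of sys.argv[1:].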
As you can see, there is not much to the script, but it can be very useful as an early warning system. There are also situations where you want to do a custom discovery crawl with a large number of seed URLs for a handpicked group of public sites, and it is useful to know how many records to expect in the index. In that case, it is easier to run this once rather than do site searches for each of the individual seed URLs.