Screen Scraping With A Saw: A Nokogiri Tutorial With Examples

Chances are you know a bit about screen scraping and already have an opinion on it, but if you don’t, here is a quick summary:

Screen scraping is taking something you can see on your computer, typically in a browser, and making it accessible inside your code so you can store it or perform some sort of operation on it.

Some people love it, some people hate it, and oftentimes both groups feel this way for the same reasons. Some love it because it is as much an art as a science: it can take a lot of creativity to work out the right patterns. Others dislike it for exactly that reason.

Screen scraping makes you dependent on code outside of your control. This can happen in other programming situations too, but many of those changes are made with care to avoid upsetting people (i.e. rational versioning). That isn’t the case with screen scraping: you’re taking someone’s code, which was most likely only intended to be viewed in a browser and absorbed by human beings, and feeding it to a computer.

Why do such a thing? Because sometimes you have to. Sometimes there isn’t a better way. Sometimes screen scraping is the better way. It all depends on your situation. Basically, you’ll screen scrape any time you need data that is viewable by a human being but hasn’t yet been formatted or delivered in a way that a computer might like.

Once upon a time this meant stocks and banking data inside of terminals. It can still mean that today, but when you hear screen scraping now, we’re primarily talking about the web, which means that chances are we’re really talking about HTML.

Install

We’re going to use Nokogiri, a Ruby gem (also a Japanese saw), to help us parse the HTML (or XML). You may have heard that it’s hard to install if you’ve used it before, but that’s not really true anymore. The header of the Nokogiri website contains: Install sudo gem install nokogiri. If you need more help than that, check out their installation guide.
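In other words, everything you need is one command away in your terminal:

sudo gem install nokogiri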

Nokogiri will let you get your HTML or XML from pretty much anywhere you’d like: from a file, from a string, from stdin, or, most likely, from the web. That’s what we’ll be doing. Like normal people, we get our beer from the store.
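Here’s a minimal sketch of those options (page.html and the example.com URL are just placeholders; note that on Ruby 3 and later, open-uri wants URI.open rather than plain open):

require 'nokogiri'
require 'open-uri'

# From a string
doc = Nokogiri::HTML("<html><body><h1>Hello</h1></body></html>")

# From a file
doc = Nokogiri::HTML(File.open("page.html"))

# From the web -- what we'll be doing below
doc = Nokogiri::HTML(open("http://example.com/"))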

Once you give Nokogiri data, you have to tell it what to do with it. That is, you have to tell it what nodes you want and what you want done with them. Nokogiri will let you edit documents, which means you can add or delete nodes, but we’re going to stick with grabbing data out of them for now.

You can communicate with Nokogiri in a few different ways: one is XPath, the other CSS selectors. Be warned that Nokogiri doesn’t always speak CSS selectors as well as you can. XPath is more powerful than CSS selectors, but it can also be more complicated. As always, which you use is up to you, as there is no “best” answer.
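To give you a quick taste of both dialects, here’s the same query written each way, using the product markup we’ll meet shortly (assume doc is a parsed Nokogiri document; the two are roughly, though not exactly, equivalent):

doc.xpath("//div[contains(@class,'product')]/a")
doc.css("div.product a")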

First we’ll take a look at doing things the XPath way, and then we’ll look at CSS selectors. Like your year of foreign language in college, we’re going to work with immersion. That is, we’ll develop the XPath together with a real example, but we’re not going to go through a tutorial or reference table or other stuff that you probably don’t really need to know right now.

If you disagree with me, you’re always free to check out some XPath tutorials. Don’t worry, I’ll link to them later.

Finding Caffeine

I’m a caffeine addict. This has only gotten worse over the years. I’ve resigned myself to this and decided that I need new and interesting caffeine delivery systems, preferably at a good price point.

This of course has led me to ThinkGeek. They have all kinds of edibles and goodies and fun stuff. They’re also a perfect example of why you might want to screen scrape. They’ve got a few RSS feeds, but none of them tell us exactly what we need to know.

So we’re going to get a list of all the items in the caffeine and edibles category on the site and display them in a terminal with pricing info.

You may want to check out the page before we start, but you don’t have to. This is a good time to talk a little more about the downsides of screen scraping, namely that you’re counting on the page not to change, or at least not to change in the time that you need to get the data.

Whether this applies to you is a toss-up. Many big sites don’t change often; it’s up to you to take that into account when you decide if screen scraping is the way to go for you. As an aside, in the time I spent writing and editing this article, the site we’re scraping changed slightly, causing me to change the XPath query we’re going to use, but not by much. Not to worry though, I’ll show you what I’m working off of at the time of this writing.

Getting Items

To begin with, we’re going to grab the names of all the items. This is important because, believe it or not, there are days when I don’t want to just order random chemicals or food to stuff in my face.

So here’s what an item “looks” like in the HTML:

<div class="product">
		<a href="/product/c399/" class="product_link">
			<img
				src="http://a.tgcdn.net/images/products/thumb/largesquare/c399_tactical_canned_bacon.jpg"
				title="Delicious bacon strips in a can - with a 10 plus year shelf life. Perfect for surviving zombie invasions."
				alt="Tac Bac - Tactical Canned Bacon"
				width="125"
				height="125"
			/>
			<h4>Tac Bac - Tactical Canned Bacon</h4>
		</a>
		<p>$22.99</p>
</div>

We’re going to look for repeated patterns, develop a rule set that is as consistent as possible, and then use Nokogiri to apply that rule set to the HTML. If that sounded confusing, don’t worry: it just means we’ll make an XPath and then give it to Nokogiri.

For example, in screen scraping we could be looking for things like “the second link in every paragraph” or even “all of the bold text that is not in a table”. Once we’ve found our pattern, we’ll translate it into XPath for Nokogiri to act on.
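For instance, here’s roughly how those two example patterns would translate, purely for illustration:

# "the second link in every paragraph"
doc.xpath("//p/a[2]")

# "all of the bold text that is not in a table"
doc.xpath("//b[not(ancestor::table)]/text()")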

ThinkGeek has made it pretty easy on us with the div class of product. Big pattern giveaway there. This is something to keep in mind: the people designing the sites you want to scrape often need things organized in the same or similar manner as you do.

Continuing down the tree, we see that all of the products are links. Since we’re trying to develop the most accurate patterns, we can check whether all of the product names live in h4 tags inside of those links. Going through the code you’ll see that this is always the case. So far so good; sounds pretty specific.

The next step in developing any pattern is to look for what could break it. Call it an outlier or a boundary condition, we’re just hunting for things we left out, or things we’re catching that we didn’t want to.

Here’s a good one: every product that is the last item in a row has a different class:

	<div class="product lastcol">
		<a href="/product/f05f/" class="product_link">
			<img
				src="/images/dot_clear.gif"
				title="Destroy sleep with this powerful energy shot - in a reusable shotgun shell bottle."
				alt="Zombie Blast Energy Shots 3 Pack"
				width="125"
				height="125"
				class="lazy"
				data-original="http://a.tgcdn.net/images/products/thumb/largesquare/f05f_zombie_blast_energy_shots.jpg"
			/>
			<h4>Zombie Blast Energy Shots 3 Pack</h4>
		</a>
						<p>$9.99</p>
	</div>

This means in order to get the name of the products, we’d say:

English: Starting at the root of the document: look in every div that has a class name containing the word ‘product’. Inside that find a link. In that link find h4 text.

XPath: //div[contains(@class,'product')]/a/h4

Why the contains in there? The XPath equality operator only matches complete values, in this case the entire class string. That means div[@class='product'] would match the first items in a row but not the last ones, whose class attribute is 'product lastcol'.
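You can see the difference for yourself with a quick Nokogiri session against a throwaway fragment (this snippet exists only to demonstrate the point):

require 'nokogiri'

html = '<div class="product"></div><div class="product lastcol"></div>'
doc = Nokogiri::HTML(html)

doc.xpath("//div[@class='product']").size            # => 1, misses "product lastcol"
doc.xpath("//div[contains(@class,'product')]").size  # => 2, catches both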

As with most things in programming, there is of course more than one way to do it. XPath allows us to be verbose and very specific, as we just demonstrated, but that doesn’t always mean we need to be. It’s possible to say exactly what we mean without using too many words.

Now that we’ve developed the verbose pattern, we can review the code and our statement and realize that there is no time where an h4 tag shows up inside of a link inside of a div and isn’t a product name. That means less chance for ambiguity, which means easier pattern recognition and easier screen scraping.

This means that, trimming the English down to match, we could say:

English: Starting at the root of the document: look in every div. Inside that find a link. In that link find h4 text.

XPath: //div/a/h4

Getting Prices

Now that we’ve retrieved all the product names, it’s time to get the prices. Here’s that example item again:

<div class="product">
		<a href="/product/c399/" class="product_link">
			<img
				src="http://a.tgcdn.net/images/products/thumb/largesquare/c399_tactical_canned_bacon.jpg"
				title="Delicious bacon strips in a can - with a 10 plus year shelf life. Perfect for surviving zombie invasions."
				alt="Tac Bac - Tactical Canned Bacon"
				width="125"
				height="125"
			/>
			<h4>Tac Bac - Tactical Canned Bacon</h4>
		</a>
		<p>$22.99</p>
</div>

We can use some of what we learned already. We see that we’re still going to be looking inside that div, but in this case pricing information seems to be contained in p tags.

Let’s look again for anything that will break our pattern. Checking the items on sale is a good start:

<div class="product">
	<a href="/product/e1d0/" class="product_link">
		<img
			src="/images/dot_clear.gif"
			title="Huggable plush bacon for kids and kids at heart 3 and older"
			alt="My First Bacon - Talking Plush"
			width="125"
			height="125"
			class="lazy"
			data-original="http://a.tgcdn.net/images/products/thumb/largesquare/e1d0_my_first_bacon.jpg"
		/>
		<h4>My First Bacon - Talking Plush</h4>
	</a>
				<div class="sale-tag"><img src="http://a.tgcdn.net/images/refresh/search/sale_tag.png" width="21" height="48" alt="Sale Tag" /></div>
				<p class="sale-price">
					$4.99
					<span class="sale-was-price"><s>$19.99</s></span>					</p>
						<p><span class="sale-savings">Save 75%</span></p>
</div>

Let’s also check things that are out of stock:

<div class="product">
		<a href="/product/da14/" class="product_link">
			<img
				src="/images/dot_clear.gif"
				title="It looks like bacon, it smells like bacon, but it won't give you diseases if you rub it all over your body. In a collectible tin!"
				alt="Bacon Soap"
				width="125"
				height="125"
				class="lazy"
				data-original="http://a.tgcdn.net/images/products/thumb/largesquare/da14_bacon_soap.jpg"
			/>
			<h4>Bacon Soap</h4>
		</a>
				<p style="padding:0 0 3px; font-size:0.9em; font-weight:400;" class="outofstock"><img src="http://a.tgcdn.net/images/other/checkmark_x.png" /> Out of stock!</p>
						<p>$5.99</p>
	</div>

This makes things a little trickier. We still have our div classes to work with, but we can’t say that every p inside of a div is going to give us what we need, especially when there’s styling information involved or when an item is out of stock.

That doesn’t mean that we can’t find a pattern though. This is a good time to review what we know:

  • Every item’s price is contained as a child of a div whose class contains the word product.
  • Prices are contained in paragraph tags.
  • Not all paragraph tags that are children of the div contain the price.
  • The paragraph tags that we don’t want have a style attribute.

English: Starting at the root of the document, take all the divs whose class contains the word product and get the text that is contained inside the paragraph tag that doesn’t have a style attribute.

XPath: //div[contains(@class,'product')]/p[not(@style)]/text()

The XPath is a bit different this time. Each time we’ve used XPath we’ve been after text (as opposed to id or class information), but only now are we using text(). text() is what’s called a “node test” in XPath lingo, and it lets you match, well, text nodes only. We’re also using the XPath function not() to eliminate tags that have style attributes.
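Once we have a parsed doc in hand (we’ll build one in a moment), the difference looks something like this, using the sample item from above:

# Without text() we get whole p elements back
doc.xpath("//div[contains(@class,'product')]/p[not(@style)]").first.to_s
# => "<p>$22.99</p>"

# With text() we get just the text nodes inside them
doc.xpath("//div[contains(@class,'product')]/p[not(@style)]/text()").first.to_s
# => "$22.99"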

For this example, I purposefully chose an XPath that parses the price regardless of stock. How much an item runs for is more broadly useful than whether a specific site has it in stock.

Now that we know how to say what we want in XPath, we still need to work with Nokogiri. While you can require only nokogiri, that doesn’t make a whole lot of sense if you’re trying to get HTML or XML from the web like we are, so we’re going to require both nokogiri and open-uri.

require 'nokogiri'
require 'open-uri'

Next, we need to tell Nokogiri where to get our document. We’ll use the Nokogiri::HTML method to do that (note that on Ruby 3 and later, open-uri exposes this through URI.open instead of plain open):

doc = Nokogiri::HTML(open("http://www.thinkgeek.com/caffeine/feature/desc/0/all"))

Now we need to tell Nokogiri what part of the document we want, starting with the item names. We’ll do that with the xpath method, which returns every node in the document that matches the XPath query.

Once the data is obtained, how it’s used will vary from project to project of course, but let’s take a look at a typical example: storing and displaying.

To store the names in an items array:

items = doc.xpath("//div/a/h4").collect {|node| node.text.strip}

and again with the prices in their own array:

prices = doc.xpath("//div[contains(@class,'product')]/p[not(@style)]/text()").collect {|node| node.text.strip}
prices.delete("")

We use prices.delete("") because some of the nodes will be blank. This is another thing to consider when screen scraping: not all the data will be in the right format as needed, and sometimes it needs to be massaged a bit.

So to put it all together, we come up with something like:

require 'nokogiri'
require 'open-uri'

# Fetch and parse the category page
doc = Nokogiri::HTML(open("http://www.thinkgeek.com/caffeine/feature/desc/0/all"))

# Product names: the h4 text inside each product link
items = doc.xpath("//div/a/h4").collect {|node| node.text.strip}

# Prices: the text of each unstyled p tag inside a product div
prices = doc.xpath("//div[contains(@class,'product')]/p[not(@style)]/text()").collect {|node| node.text.strip}
prices.delete("")

# Pair each name with its price and print them
items.zip(prices).each do |title, price|
  puts "#{title} #{price}"
end
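Run it and you should see one product per line, something like this (going off the items we looked at earlier):

Tac Bac - Tactical Canned Bacon $22.99
Zombie Blast Energy Shots 3 Pack $9.99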

Resources

Want to know everything about XPath?
Here is a tutorial.

Looking for lots of data to play with? The US Gov’t has all kinds of stuff if you’re into that kind of thing.

Have a question or a story about your best use of screen scraping? Share it in the comments, I’d love to hear it!