Movie Night And Bacon: A Mechanize Tutorial With Examples

We've talked about scraping or parsing data from a single page using Nokogiri, but what about those situations where the data is not directly accessible? For example, times when you need to perform a search or log in first. Essentially, any time you need interaction with the site, not just parsing, you can use Mechanize.

As I'm not a fan of "fake" examples, we're going to scrape a site that you can actually visit, and whose code you can view yourself. We're going to automate searching the AFI film catalog for a particular term and grabbing some information about the resulting movies. Who doesn't love movie night? Now you'll be able to easily search for some ideas.

The first step in any scraping project is to define your entry point. This is going to be the URL of the first page that we can address directly. In some cases this may be the first page of the site, but oftentimes you'll be able to link a little deeper, like straight to a login or search page.

Overview

Screen scraping is a fairly simple process on the surface. Most projects are going to look something like this:

  1. Get the webpage
  2. Interact with an element
  3. Get some data
  4. Get the new page
  5. Start over

First you'll get a page, then you'll interact with it, and then you'll get the resultant page, repeating until you end up where you need to be, at which point you'll gather data.

Building The Project

For us, that means we'll start at the first page of the catalog, which displays a search box, and work from there.

From this point on, I'm going to assume that you have Mechanize already installed and you've put in the requisite require. If not, you can take a look at the Mechanize page if you need install help.

Now we're going to do a bit of setup for our project. We know that we're going to end up getting some information about movies. To keep track of the information we gather for the movies, we're going to create a Struct, which lets us quickly and easily keep all the data together without having to explicitly create the class ourselves.

For now we're just going to create our struct with a place for the title, release year, and summary from the catalog.

Movie = Struct.new(:title, :year, :summary)
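As a quick illustration of how the Struct behaves, here is a minimal sketch. The sample title and summary are made up for the example and are not taken from the catalog.

```ruby
# Same Struct definition as in the tutorial.
Movie = Struct.new(:title, :year, :summary)

# Fields can be filled in positionally when the object is created...
movie = Movie.new("Example Title", "1999")   # hypothetical sample values

# ...or assigned later. Anything left unset defaults to nil.
movie.summary = "A made-up summary."

puts movie.title
puts movie.to_a.inspect
```

Because every field starts out as nil, we can create an empty Movie first and fill it in piece by piece as we visit each page.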

Setting up Mechanize

Now that we have a place to store everything, we're going to build our Mechanize object. Mechanize lets us decide how it identifies itself through the User-Agent string. We don't have to set the whole thing by hand (though you can if you want), as Mechanize has a helpful method, user_agent_alias. Check out GitHub for a full list of the available aliases.

agent = Mechanize.new { |agent| 
    agent.user_agent_alias = 'Windows Chrome'
}

We've got our brand new shiny Mechanize object, but we still haven't figured out what to do with it exactly. There are a few different ways to go about it. Mechanize can identify forms and elements by name, URL, or other properties specific to each element. To find these things out we can of course look in the source of the page, but Mechanize also lets us inspect a page with Ruby's PP (Pretty Print) library.

Here is a bit of the pretty print output for the catalog's search form:

#<Mechanize::Form
   {name "Search"}
   {method "POST"}
   {action "BasicSearch.aspx"}
   {fields
    [hidden:0x3ff4bd5097d0 type: hidden name: s value: ]
    [text:0x3ff4bd50944c type:  name: SearchText value: Simple search: enter title or name]
    [hidden:0x3ff4bd508fb0 type: hidden name: Field value: All]
    [hidden:0x3ff4bd508dd0 type: hidden name: SearchType value: All]}
   {radiobuttons}
   {checkboxes}
   {file_uploads}
   {buttons [imagebutton:0x3ff4bd50926c type: image name: Go1 value: ]}

See that section? Forms? That's what we're after. Now that we know the information about the form and we know the URL where it lives, we can get the page and then find the form. Once we tell Mechanize what form we want, we then need to specify which item on that form we're going to interact with.

Once we get the page, Mechanize is going to hand us back a page object that we can use to do the searching.

agent.get(afi_search_page) { |page|
    results = page.form_with(:name => 'Search') do |search|
        search.SearchText = 'Bacon'
    end.submit

Here we're going to find the form by name (our only option since there isn't an ID), and then we're going to take that form object, find the SearchText field, fill it in, and submit the form.

We'll be searching for movies about bacon, because we can, and it turns out that a few exist with bacon in the title.

So now that we have submitted a form, we capture the new page, and continue the cycle, acting on it. Recalling the process, we should now be at our search results page.

For the purposes of our example, we're only going to take a look at the first results on the page for movie title matches, not actors.

On this page we're going to grab the year each movie was released. After examining the pages in detail, we can see that on the movie-specific page the release dates are displayed in a wide variety of formats, so grabbing the year here keeps us from more complicated parsing later.

We could also grab the movie title at this point as well, but for the purposes of this example, we're going to get that from the movie page.

We're going to use the links_with method from the Mechanize::Page object that is returned. This gives us a list of all the link objects that match the criteria. Of course, since it's a list, we can loop through it and act on each one:

results.links_with(:href => /DetailView\.aspx\?s=&Movie=/).each do |link|

Since we don't know what the text of the link is going to be (that's the title of the movie), we're going to look for the pattern in the URLs to determine whether it's a link to a movie page. Once we pair that knowledge with the fact that the movie links all end with the year of release in parentheses, we can use that to get the years.

Since we will now have some movie data, let's make a Movie object to hold it:
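The pattern itself can be tried out on plain strings before pointing it at the site. In this sketch the href values are invented examples shaped like the catalog's links; only the DetailView.aspx?s=&Movie= prefix comes from the tutorial.

```ruby
# Pattern for movie detail links, with the dot escaped so it matches literally.
MOVIE_LINK = /DetailView\.aspx\?s=&Movie=/

# Hypothetical hrefs; the IDs and the non-movie pages are made up.
sample_hrefs = [
  "DetailView.aspx?s=&Movie=12345",
  "PersonView.aspx?s=&Person=678",
  "BasicSearch.aspx"
]

# grep keeps only the strings that match the pattern.
movie_hrefs = sample_hrefs.grep(MOVIE_LINK)
puts movie_hrefs
```

links_with applies the same kind of regular expression match to each link's href, so a pattern that behaves here should behave the same way on the results page.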

current_movie = Movie.new

And get the year:

current_movie.year = link.text.match(/\(\d{4}\)$/)[0].gsub(/\D/, "")

Here we're asking Mechanize to take the text of the link (the movie title with year) and apply a regular expression that matches any four digits surrounded by parentheses that are at the end. Then, we replace anything that's not a digit with nothing, leaving just the number for us.
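One thing to watch for: match returns nil when a link's text doesn't end with a year, and calling [0] on nil raises an error. A slightly more defensive version might look like the sketch below. The release_year helper and the sample link texts are hypothetical, added only for illustration.

```ruby
# Returns the four-digit year as a string, or nil when the text
# does not end with a year in parentheses.
def release_year(link_text)
  match = link_text.match(/\(\d{4}\)$/)
  return nil unless match

  match[0].gsub(/\D/, "")   # "(1929)" becomes "1929"
end

puts release_year("Some Bacon Movie (1929)")
puts release_year("No Year Here").inspect
```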

We could get the title as well at this point, but we'll grab it on the movie's individual page, just so we can do a little bit different scraping.

So now that we have the year of each movie and we know that each link in the list points to a movie's page, we'll go through, click on each one, and get the resultant page to do more scraping with.

Since we already found the links, in order to parse their text, we can simply click on them:

description_page = link.click

Now is where our XPath and Nokogiri knowledge comes in. The search method takes an XPath or CSS expression and returns the matching nodes.

Fortunately our movie title on this page is the only bold text that is also centered with HTML tags, allowing us to search with a simple XPath expression and parse each node:

# Get the movie title
description_page.search("//center//b").each do |node|
    current_movie.title = node.text.strip
end

While we're on the page we may as well also get the summary if it's there. This XPath is going to be a bit more complicated, and there are lots of ways to do it. In a nutshell, we're looking for a table row, nested somewhere inside a table data cell, that contains the text "Summary:", and then we want the second cell of that row. With all that in mind, here is an example of what we can use: //td//tr[contains(., 'Summary:')]/td[2]

Since not all pages have summaries, but may still have the section, we'll double check that the resultant node actually has any words in it, and if so, we'll store it:

description_page.search("//td//tr[contains(., 'Summary:')]/td[2]").each do |node|
    if node.text =~ /\w/
        current_movie.summary = node.text.strip
    end
end
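The "does it contain any words" check can be pulled out and tried on plain strings. The clean_summary helper below is a hypothetical name used only for this sketch, and the sample text is invented.

```ruby
# Returns the stripped text when it contains at least one word
# character, or nil when the cell holds only whitespace.
def clean_summary(raw_text)
  return nil unless raw_text =~ /\w/

  raw_text.strip
end

puts clean_summary("\n   A made-up summary.  ")
puts clean_summary("  \n\t ").inspect
```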

At this point we've found a form, filled in an element, submitted a form, found links, and parsed data. This is the essence of screen scraping with Mechanize. We have our Movie objects with the data gathered and could expand it to more advanced searches or other data about the movies if needed, but we'll leave it as is for today.

Here's everything all put together:

require 'mechanize'

Movie = Struct.new(:title, :year, :summary)

found_movies = []

afi_search_page = "http://www.afi.com/members/catalog/default.aspx"

agent = Mechanize.new { |agent| 
    agent.user_agent_alias = 'Windows Chrome'
}

agent.get(afi_search_page) { |page|
    # Fill in the search box and submit the form
    results = page.form_with(:name => 'Search') do |search|
        search.SearchText = 'Bacon'
    end.submit

    # Each matching link points at a movie's own page
    results.links_with(:href => /DetailView\.aspx\?s=&Movie=/).each do |link|
        current_movie = Movie.new

        # The link text ends with the release year in parentheses
        current_movie.year = link.text.match(/\(\d{4}\)$/)[0].gsub(/\D/, "")

        description_page = link.click

        # Get the movie title
        description_page.search("//center//b").each do |node|
            current_movie.title = node.text.strip
        end

        # Get movie summary if available
        description_page.search("//td//tr[contains(., 'Summary:')]/td[2]").each do |node|
            if node.text =~ /\w/
                current_movie.summary = node.text.strip
            end
        end

        # Keep the finished record so it can be used after the loop
        found_movies << current_movie
    end
}
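Once Movie objects have been collected into an array such as found_movies, printing them is straightforward. This sketch uses invented sample records in place of real scraped data, and the describe helper is a hypothetical name.

```ruby
Movie = Struct.new(:title, :year, :summary)

# Stand-in data; a real run would fill this array from the catalog.
found_movies = [
  Movie.new("Example Movie Two", "1954", nil),
  Movie.new("Example Movie One", "1931", "A made-up summary.")
]

# Build a one-line description, tolerating a missing summary.
def describe(movie)
  summary = movie.summary || "(no summary available)"
  "#{movie.title} (#{movie.year}): #{summary}"
end

# Oldest first.
found_movies.sort_by(&:year).each { |movie| puts describe(movie) }
```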

Have any questions? Something I could do better? Let me know in the comments, I'd love to hear what you have to say.
