Pitfalls of Parsing

I was pretty pleased with myself a few days ago. I was staring down some badly formed HTML pages containing real estate listings that I needed to scrape in order to include direct links to a few properties at koppelmanteam.com. Using a tidy, standards-compliant XML parser was out of the question. When I tried to load a page into one, I got an eyeful of error messages, starting with the absence of an XML DTD declaration. Tags didn’t close properly. Barely anything had a Class or ID, and the few that did didn’t have the attribute values quoted properly. This was no good.

RubyfulSoup was just what the doctor ordered. Within a half-hour of tinkering and making the best of so-so documentation, I was pulling out prices, MLS numbers, image URLs and all that other good stuff via syntax something like

doc.html.find_all('table')[4].each do |tr|
  listing['price'] = tr.find_all('td')[3].string
end

Not too shabby.

Like a good lad, I put the data on each property in a data structure. First I used Structs, then I tried a simple custom class, then I settled on keeping it simple with a hash. Then I bundled the hashes of a whole set of properties into an outer hash. Armed with my hash of hashes, the next step was to store it.

Since I’m not building an MLS search engine and simply wanted to be able to display these lists of properties as-is, there was no need to create ActiveRecord objects to store the individual properties’ information. All I needed was a way to store and retrieve the entire hash.

Enter ActiveRecord’s serialize declaration, which would automagically serialize my hash into a text column in the database. I’m using the acts_as_commentable plugin to store these search objects, so I had to put the declaration elsewhere, but for a normal AR model, it’s as simple as something like:

class Feed < ActiveRecord::Base
  belongs_to :community
  serialize :search_results
end

Easy as pie! Except it didn’t work.

Debugging and hair-pulling ensued. I tried serializing other things to the field, which worked perfectly fine. I tried serializing data structures that resembled my real one: hashes of strings, hashes of hashes, all fine. Eventually, after clearing away a few other bugs, it was still complaining that I was trying to serialize a Proc, which is a no-no. What? It was a hash of hashes of strings, right? I wasn’t creating anything with a custom constructor. What gave?
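For the record, that sanity check looks roughly like the sketch below. It leans on the fact that serialize stores the attribute as YAML by default, so a plain YAML round trip outside of Rails is a fair approximation. The listing keys and values here are made up for illustration; they are not the real data.

```ruby
require 'yaml'

# Hand-built stand-in for the scraped data: every value is a plain String.
# Keys and values are invented for this example.
listings = {
  'listing-1' => { 'price' => '$450,000', 'mls' => '1234567' },
  'listing-2' => { 'price' => '$525,000', 'mls' => '7654321' }
}

yaml     = YAML.dump(listings)  # roughly what ends up in the text column
restored = YAML.load(yaml)      # roughly what comes back out on read

# A hash of hashes of plain Strings survives the round trip unchanged.
puts restored == listings
```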

When I hand-built a hash of hashes seemingly identical to the ones I was generating, and that one serialized without complaint, it finally dawned on me: RubyfulSoup was the culprit.

I went into the Rails console, generated one of the real hashes of hashes that was failing, and started examining the members of the inner hashes with Object#class. And there it was. The RubyfulSoup #string method wasn’t returning Strings. It was returning instances of its own custom class, a NavigableString. Which made sense, of course. How else would one of the <TR> elements magically respond to a #find_all method to pull the <TD> tags inside it?
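Here is a rough sketch of that kind of inspection. It doesn't load RubyfulSoup at all: FakeNavigableString is a hypothetical stand-in, on the assumption that the real NavigableString is a String subclass, and the listing data is invented.

```ruby
# Hypothetical stand-in for RubyfulSoup's NavigableString,
# assumed here to be a subclass of String.
class FakeNavigableString < String; end

listings = {
  'listing-1' => {
    'price' => FakeNavigableString.new('$450,000'), # what #string handed back
    'mls'   => '1234567'                            # a genuine String
  }
}

# Collect every inner value whose class is not exactly String.
suspects = []
listings.each do |id, listing|
  listing.each do |field, value|
    suspects << [id, field, value.class] unless value.instance_of?(String)
  end
end

suspects.each { |id, field, klass| puts "#{id}/#{field}: #{klass}" }
```

Note that a subclass instance like this prints the same as a String and compares equal to one with ==, which is why eyeballing the hash gives nothing away. Only asking for the class does.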

Duh.

So tr.find_all('td')[3].string became tr.find_all('td')[3].string.to_s and all was well in the world.
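If the hash has already been built, the same fix can be applied after the fact. A minimal sketch, again using a hypothetical FakeNavigableString stand-in and invented data, and assuming Ruby 2.4 or later for Hash#transform_values:

```ruby
# Hypothetical stand-in for RubyfulSoup's NavigableString,
# assumed here to be a subclass of String.
class FakeNavigableString < String; end

listings = {
  'listing-1' => {
    'price' => FakeNavigableString.new('$450,000'),
    'mls'   => FakeNavigableString.new('1234567')
  }
}

# String#to_s called on an instance of a String subclass returns a plain String,
# so mapping every inner value through to_s normalizes the whole structure.
plain = listings.transform_values do |listing|
  listing.transform_values(&:to_s)
end

puts plain['listing-1']['price'].class
```

Calling to_s at the point of extraction, as above, is still the tidier option, since the odd objects never make it into the hash in the first place.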

If you ain’t returning a String, don’t call the method #string, I say.

Published on December 11, 2006 at 12:08 pm