Uptimetry 2.0: Advanced URL Monitoring with Nokogiri and HTTParty

June 24, 2013

We’re working on a great new feature for the upcoming Uptimetry 2.0 release. In the process, I cobbled together this magnificent Ruby one-liner that I simply couldn’t keep quiet about.

Nokogiri::HTML(HTTParty.get(url)).xpath("//*/@href").map(&:value).select {|e| e[0..3]=='http'}.select {|e| e.match(URI.parse(url).host).nil?}

That’s quite a mouthful. Let’s decompose:

HTTParty retrieves the contents of the given url.
HTTParty.get(url)

Nokogiri parses the response body as HTML.
Nokogiri::HTML(...)

Nokogiri evaluates an XPath expression that selects every href attribute in the document, whatever element it belongs to.
.xpath("//*/@href")

We map the resulting node set to the string values of those attributes.
.map(&:value)

Then we keep only the values starting with “http”.
.select {|e| e[0..3]=='http'}

Finally, we drop any URL that matches the host of the page itself, which removes links to resources on the same domain.
.select {|e| e.match(URI.parse(url).host).nil?}

We’re left with an array of URLs linking to external resources. If any of these links are dead, the user experience will suffer. Uptimetry (http://uptimetry.com) already offers a powerful cloud-based solution for URL monitoring. Soon, we will offer the option to crawl your web properties automatically for external links to monitor, saving you time and giving you peace of mind every month.
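
One caveat about that array: String#match turns the host into a regular expression, so the dot in “example.com” matches any character, and the host is looked for anywhere in the URL, query string included. A stricter variant, sketched below using only Ruby's standard uri library, compares parsed hosts instead. The helper name external_link? and the sample URLs are assumptions made for this example.

```ruby
require 'uri'

# Hypothetical helper: true when href is an absolute http(s) URL whose host
# differs from the host of page_url.
def external_link?(href, page_url)
  uri = URI.parse(href)
  return false unless uri.is_a?(URI::HTTP) # URI::HTTPS is a subclass of URI::HTTP
  uri.host != URI.parse(page_url).host
rescue URI::InvalidURIError
  false # unparseable hrefs are skipped rather than raised
end

page = 'http://example.com/'
hrefs = [
  'http://example.com/about',          # same host
  'http://other.org/?ref=example.com', # external, host appears in the query
  'http://exampleXcom.net/',           # external, "." in the pattern matches "X"
  '/relative/path',
  'mailto:me@example.com'
]

# The filter from the one-liner, applied to the same list.
original = hrefs.select { |e| e[0..3] == 'http' }
                .select { |e| e.match(URI.parse(page).host).nil? }

# The stricter host comparison.
strict = hrefs.select { |href| external_link?(href, page) }

p original # => []
p strict   # => ["http://other.org/?ref=example.com", "http://exampleXcom.net/"]
```

The trade-off is that an exact host comparison treats www.example.com and example.com as different sites, so a real crawler would probably need some rule for subdomains.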

One Comment
  1. Gavin Stark
    June 24, 2013 11:18 am

    Parsing the URL in the last phase of the processing just to get the host could be fairly expensive on pages with many hrefs, since it happens once per href. Also, e[0..3] generates additional objects (which you might be sensitive to). #select plus a nil? check can typically be expressed with #reject instead (though one might argue it is less readable). The #match could also be #include?.

    I might suggest something like the following instead (though the pre-extraction of the host value kinda kills your ‘one liner’ bit)

    host = URI.parse(url).host; Nokogiri::HTML(HTTParty.get(url)).xpath("//*/@href").map(&:value).select {|e| e.start_with?('http') }.reject {|e| e.include?(host)}

    or

    host = URI.parse(url).host; Nokogiri::HTML(HTTParty.get(url)).xpath("//*/@href").map(&:value).select {|e| e.start_with?('http') && !e.include?(host)}

    Kinda wishing Ruby had an “exclude?” for strings. ;-)
