Ruby Web Scraping

This post covers the main tools and techniques for web scraping in Ruby. We start with an introduction to building a web scraper with common Ruby HTTP clients and to parsing HTML documents in Ruby.

This approach to web scraping has its limitations, however, and can come with a fair dose of frustration. Particularly in the context of single-page applications, we will quickly run into major obstacles due to their heavy use of JavaScript. We will take a closer look at how to address this, using web scraping frameworks, in the second part of this article.

Note: This article assumes that the reader is familiar with the Ruby platform. While there is a multitude of gems, we will focus on the most popular ones and use their GitHub metrics (usage, stars, and forks) as indicators. While we won't be able to cover all the use cases of these tools, we will provide good grounds for you to get started and explore more on your own.


Part I: Static pages

0. Setup

To code along with this part, you will need to install the following gems:

gem install pry      # debugging tool
gem install nokogiri # HTML parsing gem
gem install httparty # HTTP request gem

Moreover, we will use open-uri, net/http, and csv, which are part of the standard Ruby library, so there's no need for a separate installation. As for Ruby, we are using version 3 for our examples, and our main playground will be the file scraper.rb.
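
If it helps to see everything in one place, a possible first version of scraper.rb might simply load all of these dependencies up front (purely a suggestion, assuming the gems above are installed):

# scraper.rb: a possible starting point
require 'open-uri'   # standard library: open URLs like files
require 'net/http'   # standard library: HTTP client
require 'csv'        # standard library: CSV output
require 'nokogiri'   # gem: HTML parsing
require 'httparty'   # gem: HTTP client with a friendlier syntax
require 'pry'        # gem: debugging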

1. Make a request with HTTP clients in Ruby

In this section, we will cover how to scrape a Wikipedia page with Ruby.

Imagine you want to build the ultimate Douglas Adams fan wiki. You would surely start by getting data from Wikipedia. In order to send a request to any website or web app, you need an HTTP client. Let's take a look at our three main options: net/http, open-uri, and HTTParty. You can use whichever of the clients below you like the most, and it will work with step 2.

Net::HTTP

Ruby's standard library comes with an HTTP client of its own, namely the net/http gem. In order to make a request to Douglas Adams' Wikipedia page, we first need to convert our URL string into a URI object, using the uri module (which open-uri loads for us). Once we have our URI, we can pass it to get_response, which will provide us with a Net::HTTPResponse object, whose body method gives us the HTML document.

require 'open-uri'
require 'net/http'

url = "https://en.wikipedia.org/wiki/Douglas_Adams"
uri = URI.parse(url)                    # turn the URL string into a URI object

response = Net::HTTP.get_response(uri)  # perform the GET request
html = response.body                    # the raw HTML document as a String

puts html
#=> "\n<!DOCTYPE html>\n<html class=\"client-nojs\" lang=\"en\" dir=\"ltr\">\n<head>\n<meta charset=\"UTF-8\"/>\n<title>Douglas Adams - Wikipedia</title>..."

Pro tip: Should you use Net::HTTP with a REST interface and need to handle JSON, simply require 'json' and parse the response with JSON.parse(response.body).
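
As a quick illustration of that tip, here is a minimal sketch that fetches JSON with Net::HTTP; the Wikipedia REST summary endpoint is simply an example URL we picked for this purpose:

require 'net/http'
require 'json'

# Example JSON endpoint: Wikipedia's REST page summary for our article
uri = URI.parse("https://en.wikipedia.org/api/rest_v1/page/summary/Douglas_Adams")
response = Net::HTTP.get_response(uri)

data = JSON.parse(response.body)   # parse the JSON body into a Ruby Hash
puts data["extract"]               # a short plain-text summary of the page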

That's it - it works! However, the syntax of net/http may be a bit clunky and less intuitive than that of HTTParty or open-uri, which are, in fact, just elegant wrappers for net/http.

HTTParty

The HTTParty gem was created to "make HTTP fun". Indeed, with its intuitive and straightforward syntax, the gem has become widely popular in recent years. The following two lines are all we need to make a successful GET request:

require "HTTParty"

response = HTTParty.get("https://en.wikipedia.org/wiki/Douglas_Adams")
html = response.body

puts html
# => "<!DOCTYPE html>\n" + "<html class=\"client-nojs\" lang=\"en\" dir=\"ltr\">\n" + "<head>\n" + "<meta charset=\"UTF-8\"/>\n" + "<title>Douglas Adams - Wikipedia</title>\n" + ...

get returns an HTTParty::Response object which, again, provides us with the details on the response and, of course, the content of the page. If the server provided a content type of application/json, HTTParty will automatically parse the response as JSON and return appropriate Ruby objects.
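
For example, pointing HTTParty at a JSON endpoint (again, Wikipedia's REST summary URL is just an example we picked) gives us a response we can treat like a plain Ruby hash:

require 'httparty'

response = HTTParty.get("https://en.wikipedia.org/api/rest_v1/page/summary/Douglas_Adams")

puts response.parsed_response.class   # => Hash, parsed automatically from application/json
puts response["extract"]              # indexing the response delegates to the parsed hash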

OpenURI

The simplest solution, however, is making a request with the open-uri gem, which also is a part of the standard Ruby library:

require 'open-uri'

html = URI.open("https://en.wikipedia.org/wiki/Douglas_Adams")
# => #<File:/var/folders/zl/8zprgb3d6yn_466ghws8sbmh0000gq/T/open-uri20200525-33247-1ctgjgo>

This provides us with a file-like handle (a Tempfile or StringIO, depending on the size of the response) and allows us to read from the URL as if it were a file, line by line, as shown below.
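
For example, we can slurp the whole document into a string with read, or iterate over it line by line:

require 'open-uri'

page = URI.open("https://en.wikipedia.org/wiki/Douglas_Adams")

html = page.read                        # read the entire body into a String
# page.each_line { |line| puts line }   # alternatively, process the document line by line
puts html[0, 60]                        # the first 60 characters of the HTML

Note that read consumes the handle, so re-open the URL (or keep the string around) if you need to go through the document again.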

The simplicity of OpenURI is right there in its name: it only sends one type of request (GET), and does it well, with sensible defaults for SSL and redirects.
