Nokogiri

Guide to Nokogiri


Nokogiri is an HTML, XML, SAX and Reader parser. Its main use, like that of any parser of this type, is what is known as web scraping: extracting data from sites on the internet through programmatic means. Because of the results it produces, the technique has found applications not only in web development but also in marketing and even scientific research.

As the internet has become ever more central to daily life, a large amount of data has become publicly available online, and we come into contact with it every time we perform a web search. The idea behind web scraping is to take that information, which is normally present in unstructured form in the files that make up websites, and transform it into something that can be analyzed. This produces a data source that can be used for various purposes: site indexing, price monitoring and comparison, reviews of competing products, gathering real estate listings, monitoring online presence and reputation, data integration, and research in general.

Web scraping has legal implications, so caution is needed. Whether it is legal depends on what you intend to do with the information obtained and on the terms and policies of the sites you are working with. Search engines like Google or Yahoo scrape the web constantly, but that context is very different from the one you face when developing an application for similar purposes. The recommendation is to always read the terms of the sites you work with, and to be careful about how the information obtained is used and where it is published.

Among the parsers used for this technique, Nokogiri stands out because it searches documents using native libraries (both C and Java), which makes it fast and standards-compliant. Nokogiri is a Ruby gem that turns a web page into an object, making it very simple to obtain information from it. Ruby gems are code libraries that community members have made available for other developers to use. As in any open-source context, the more useful a library is, the better known it becomes and the more it grows in features and performance. Nokogiri has become very popular thanks to its ease of use and its usefulness for analyzing HTML and XML files, the formats from which information is most commonly extracted.

Nokogiri allows you to parse HTML and XML files through different strategies (DOM, Reader, Pull, SAX; the first being the most common), and also allows working with CSS3 and XPath selectors as well as XSLT transformations.

It is also worth noting that being based on Ruby brings Nokogiri important advantages, and is one of the reasons the gem has become so widely used. Ruby is a simple yet powerful language, which makes it an attractive choice for people who are just starting out in web development. Additionally, the Ruby community has been one of the fastest growing in recent years, which means there are many developers working with the language, generating knowledge and good documentation that is available online.

Let's look at a couple of basic examples to see how simple Nokogiri is to work with. In the first, the code below fetches the news from Google News and, from there, builds a list of article titles.

The first step is to parse the contents of Google News into the object that Nokogiri creates from the HTML file. With the nokogiri and open-uri libraries loaded (require 'nokogiri' and require 'open-uri'), the following line is used:

    doc = Nokogiri::HTML(URI.open('https://news.google.com'))

Next, we define the CSS selectors used to reach the articles (these class names depend on the page's markup, so they may need updating if the site changes):

    article_css_class = ".esc-layout-article-cell"
    article_header_css_class = "span.titletext"

To extract all news articles, we use the following line:

    articles = doc.css(article_css_class)

And finally, for each article, we will get the first of its titles (since Google News includes more than one title per news item) as follows:

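    articles.each do |article|
      # All title nodes inside this article cell
      title_nodes = article.css(article_header_css_class)
      # Keep only the first title; it is nil when the cell has none
      prime_title = title_nodes.first
      puts prime_title.text if prime_title
    end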

The next example shows how to work with an HTML file and get the attributes that are defined in it.

Suppose we have the following HTML:

    html = "<html><head><title>This is the title of the page</title></head><body>This is the body of the page</body></html>"

To convert it to an object we include the line:

    data = Nokogiri::HTML.parse(html)

Getting the title of this page becomes as simple as typing the line below:

    data.title

In this example, printing that value gives us "This is the title of the page".

Similarly, if we wanted to get all the links on a page, we could call the xpath method on the object that Nokogiri generates, as follows:

    data.xpath("//a[@href]")

As we have seen, Nokogiri lets us work in a simple manner with the unstructured information normally found in HTML pages. It structures that information and brings it into an object model, which makes it a highly valuable tool when working with information on the web.