Web Scraping an Ajax-Heavy Page with Ruby

In the comment section of my previous blog post, a WordPress user named Dr Slater asked if I could help solve a coding problem. This is Dr Slater’s request:

I need a crawl to count the frequency of each users name on this page, and then hash it as an array or a similar solve, where the most frequently occurring user name is max (ranked 1) and least occurring user name is min

I believe one learns better by trying to solve problems oneself (questions are always welcome, of course). At the same time, I can see how this task can be daunting for someone new to coding. So, Dr Slater, I am going to aim for the best of both worlds: here I will lay out how I would do this so that you are able to accomplish it yourself.

Programming Language

You can crawl a web page with any programming language of your choice. Here I will be using Ruby, a language that I find a joy to use and that is user friendly. More info about installing Ruby can be found here.

Ruby Library

I will be using watir-webdriver, nokogiri and open-uri, Ruby libraries that will help us with this job (open-uri ships with Ruby, so it needs no separate install). I noticed that the website www.vindexfunding.com/all-funding-projects/ has awfully laid-out data, loaded through multiple Ajax requests, which is not friendly to web scraping. I guess that is the price to pay when using different third-party plugins on a WordPress site. I must admit I had to summon some Jedi powers to do this.

Anyway, after installing Ruby, go to the command line interface of your computer (if you are using OS X or Linux, it will be “Terminal”) and type

 gem install watir-webdriver
 gem install nokogiri

The Code

After this set up, let’s get to it.

Create a file for your script

 touch vindex_rank_scraping.rb

Open this file in your favorite code editor. I personally use Atom and recommend it.

Then add this script in the editor (caveat: not the prettiest code, but it served the purpose for a quick job):

# load the libraries
require 'nokogiri'
require 'open-uri'
require 'watir-webdriver'

# create a watir-webdriver instance that drives Firefox
browser = Watir::Browser.new :firefox
# open the page that lists all the projects
browser.goto "http://www.vindexfunding.com/all-funding-projects/"
# give the website some time to load
sleep 2

# start a nokogiri instance where we store the page's html
data = Nokogiri::HTML(browser.html)

# we will use this to store the links to individual projects so we can go to those pages later
link_to_individual_projects = []

# all the projects are within the div with id category-menu
active_projects = data.css('#category-menu')

# loop through the categories and get their link
active_projects.css('a').each do |link_tag|
  browser.link(:text => link_tag.text).when_present.click
  sleep 1
  ajax_called_data = Nokogiri::HTML(browser.html)
  card_name = ajax_called_data.css('.bbcard_name') # card where project is
  link_to_project = card_name.css('a').attribute('href').value # href to project
  link_to_individual_projects << link_to_project
end

# let's store here the backers for these projects. We are storing them no matter how much they have donated
backers_array = []

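# Go to each individual page and collect the backers' names
link_to_individual_projects.each do |page|
  browser.goto(page)
  sleep 2

  # open the "Backers" tab if the page has one (the original line located the link but never clicked it)
  backers_link = browser.link(:text => "Backers")
  backers_link.click if backers_link.exists?
  backers_tab = Nokogiri::HTML(browser.html)

  funders = backers_tab.css('#project-funders')
  backers_array << funders.text.strip
end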

# Setting up part where we rank the backers
rank = Hash.new(0)

# iterate over the array, counting duplicate entries
backers_array.each do |backers_name|
  rank[backers_name] += 1
end

# show who the good souls are
rank.each do |name, number|
  puts "#{name} appears #{number} times"
end
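
Dr Slater's request also asked for the names to be ordered from the most frequent (ranked 1) to the least frequent, which the loop above does not do. Here is a minimal sketch of that ranking step using only core Ruby. The helper name `rank_backers` and the sample names are made up for illustration, and breaking ties alphabetically is my own choice, not something the request specified.

```ruby
# Hypothetical helper: turns a list of names into [rank, name, count] rows,
# most frequent first. Ties are broken alphabetically so the order is stable.
def rank_backers(names)
  counts = Hash.new(0)
  names.each { |name| counts[name] += 1 }

  # negative count sorts the highest count first
  sorted = counts.sort_by { |name, count| [-count, name] }
  sorted.each_with_index.map { |(name, count), index| [index + 1, name, count] }
end

# Made-up sample data, not taken from the site
sample = ["Ann", "Bob", "Ann", "Cid", "Bob", "Ann"]

rank_backers(sample).each do |position, name, count|
  puts "#{position}. #{name} appears #{count} times"
end
```

In the script, you could call `rank_backers(backers_array)` in place of the sample list.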

Run the Program

Now you can run this script by typing the command

 ruby vindex_rank_scraping.rb

in the command line.

When I ran this on Jan 18th, 2016, it printed

 No backers yet! appears 2 times

This makes sense, as there were no backers yet for the two projects listed on the site.
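
Note that the script counted the site's placeholder text as if it were a backer's name, and that it stores each project's whole backers block as a single string. Once real backers show up, you will probably want to clean that text first. Below is a rough sketch of one way to do it. The helper name `extract_backer_names` is hypothetical, and it assumes each name sits on its own line in the scraped text, which I have not checked against the real markup.

```ruby
# Placeholder text the site showed when a project had no backers
NO_BACKERS_TEXT = "No backers yet!"

# Hypothetical helper: takes the raw text scraped from one project's
# backers block and returns a clean list of names.
# Assumption: names are separated by line breaks; adjust to the real markup.
def extract_backer_names(raw_text)
  raw_text.lines.map(&:strip).reject { |line| line.empty? || line == NO_BACKERS_TEXT }
end

# Made-up inputs for illustration
puts extract_backer_names("No backers yet!").inspect
puts extract_backer_names(" Ann \n\n Bob \n").inspect
```

In the script, this would mean writing `backers_array.concat(extract_backer_names(funders.text))` instead of `backers_array << funders.text.strip`, so that every name is counted on its own.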

Dr Slater, good luck with the rest and all the best in your journey.