Web Scraping an Ajax-Heavy Page with Ruby

In the comment section of my previous blog post, a WordPress user named Dr Slater asked if I could help solve a coding problem. This is Dr Slater’s request:

I need a crawl to count the frequency of each users name on this page, and then hash it as an array or a similar solve, where the most frequently occurring user name is max (ranked 1) and least occurring user name is min

I believe one learns better when trying to solve problems oneself (questions are always welcome, of course). At the same time, I can see how this task can be daunting for a person new to coding. So, Dr Slater, I am going to offer the best of both worlds: here I will lay out how I would do this so that you can accomplish it yourself.

Programming Language

You can crawl a web page with any programming language of your choice. Here I will be using Ruby, a language that I find user friendly and a joy to use. More info about installing Ruby can be found here.

Ruby Library

I will be using watir-webdriver, nokogiri and open-uri, which are Ruby libraries that will help us with this job. I noticed that the website www.vindexfunding.com/all-funding-projects/ has awfully laid-out data with multiple Ajax requests that are not friendly to web scraping. I guess that is the price to pay when using different third-party plugins for a WordPress site. I must admit I had to summon some Jedi powers to do this.

Anyway, after installing Ruby, open the command line interface of your computer (on OS X or Linux it is called “Terminal”). open-uri ships with Ruby, so only the other two libraries need to be installed. Type:

 gem install watir-webdriver
 gem install nokogiri

The Code

After this setup, let’s get to it.

Create a file for your script

 touch vindex_rank_scraping.rb

Open this file in your favorite coding editor. I personally use Atom and recommend it.

Then add this script in the editor (caveat: not the prettiest code, but it served the purpose for a quick job):

#load the libraries 
require 'nokogiri'
require 'open-uri'
require 'watir-webdriver'

#instance of watir webdriver where we open the page with firefox
browser = Watir::Browser.new :firefox
#self explanatory
browser.goto "http://www.vindexfunding.com/all-funding-projects/"
# giving some time for website to load
sleep 2

# start a nokogiri instance where we store the page's html
data = Nokogiri::HTML(browser.html)

#we will use this to store the links to individual projects so we can go to those pages later
link_to_individual_projects = []

#all the projects are within the div with id category-menu
active_projects = data.css('#category-menu')

# loop through the categories and get their link
active_projects.css('a').each do |link_tag|
  browser.link(:text => link_tag.text).when_present.click
  sleep 1
  ajax_called_data = Nokogiri::HTML(browser.html)
  card_name = ajax_called_data.css('.bbcard_name') # card where project is
  link_to_project = card_name.css('a').attribute('href').value # href to project
  link_to_individual_projects << link_to_project
end

# let's store here the backers for these projects. We are storing them no matter how much they have donated
backers_array = []

# Go to each individual page and collect the backers' names
link_to_individual_projects.each do |page|
  browser.goto(page)
  sleep 2

  # open the "Backers" tab before reading the page
  browser.link(:text => "Backers").when_present.click
  sleep 1
  backers_tab = Nokogiri::HTML(browser.html)

  # text of the element with id project-funders
  funders = backers_tab.css('#project-funders')
  backers_array << funders.text.strip
end

# Setting up part where we rank the backers
rank = Hash.new(0)

# iterate over the array, counting duplicate entries
backers_array.each do |backers_name|
  rank[backers_name] += 1
end

# show who the good souls are
rank.each do |name, number|
  puts "#{name} appears #{number} times"
end
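
The loop above prints the names in whatever order they were first seen, while the request asked for the most frequent name to be ranked 1. Here is a minimal sketch of how the counts could be turned into a ranking. The helper name ranked_backers and the sample data are made up for illustration, and breaking ties alphabetically is my own assumption.

```ruby
# Hypothetical helper: turns a name => count hash into a ranked list,
# most frequent first. Ties are broken alphabetically (an assumption).
def ranked_backers(counts)
  sorted = counts.sort_by { |name, count| [-count, name] }
  sorted.each_with_index.map do |(name, count), index|
    { rank: index + 1, name: name, count: count }
  end
end

# Made-up sample data, for illustration only
sample_counts = { "Alice" => 3, "Bob" => 1, "Carol" => 2 }

ranked_backers(sample_counts).each do |row|
  puts "#{row[:rank]}. #{row[:name]} (#{row[:count]})"
end
```

In the script you would pass it the rank hash instead of the sample data.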

Run the Program
Now you can run this script by typing the command

 ruby vindex_rank_scraping.rb

in the command line.

When I did this on Jan 18th 2016 it showed

 No backers yet! appears 2 times

It makes sense as there are no backers yet in the two projects listed on the site.
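
Notice that the placeholder text was counted as if it were a backer. Once real backers show up, something like the sketch below could skip the placeholder and count each name separately. I am assuming the names come out one per line in the scraped text, which I could not check since the pages had no backers; count_backers and the sample input are made up for illustration.

```ruby
PLACEHOLDER = "No backers yet!"

# Hypothetical helper: takes the raw text scraped from each project page
# and returns a name => count hash.
# ASSUMPTION: names arrive one per line; adjust the split to the real markup.
def count_backers(raw_texts, placeholder: PLACEHOLDER)
  names = raw_texts
          .flat_map { |text| text.split("\n") }
          .map(&:strip)
          .reject { |name| name.empty? || name == placeholder }

  # count duplicates, same idea as the rank hash in the script
  names.each_with_object(Hash.new(0)) { |name, counts| counts[name] += 1 }
end

# Made-up sample input, for illustration only
sample = ["No backers yet!", "Alice\nBob", "  Alice \n"]
count_backers(sample).each do |name, number|
  puts "#{name} appears #{number} times"
end
```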

Dr Slater, good luck with the rest and all the best in your journey.

Passing Params With Rails Render AKA Lesson on Rails Render vs Redirect_to

Problem: A user is directed to the purchase page of the website with a discount code:
url/discount_code=trial5&product=greatest-product
While purchasing, the user forgets to add a password and hits the submit button.
Rails (with the Devise gem) sends the user to url/users, but now the discount is gone!

This bug … well…bugged me for a morning. I hope this post helps others with the same problem.

This is how my RegistrationsController, which inherits from Devise, looked:

  def new
    @user = User.new
    @purchase = Purchase.new
    super
  end

  def create
    @user = User.new(user_params)
    if @user.save
      redirect_to desired_path
    else
      @purchase = Purchase.new
      return render 'new'
    end
  end
 

I figured what I needed was to send the discount params to render ‘new’. How could I do that? First, let me explain the difference between redirect_to and render.

Render
We can give render the name of an action to cause the corresponding view template to be rendered.

For instance, if you’re in an action named update, Rails will be attempting to display an update.html.erb view template. If you wanted it to display the edit form, associated with the edit action, then render can override the template selection.

When render :edit is executed it only causes the edit.html.erb view template to be displayed. The actual edit action in the controller will not be executed.

Redirect_to

redirect_to tells the browser to make a new HTTP request. When a user submits data, it comes in as a POST request. If we successfully process that data, we likely want to show the user what they just created. We could display the article using render in the same POST that sent us the data, but that is not the Rails way: if the user hits the back arrow in the browser, it may prompt them to submit the form data again. So when you successfully store data, you want to respond with an HTTP redirect, which forces the browser to start a new request.
That’s why, in the create method of the registrations controller, I want to redirect if @user is successfully saved. In this example, the application saves a user when a new user buys a product.
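
To make the difference concrete, here is a toy model in plain Ruby. It is not Rails code: ToyController, its method names and the /articles/1 path are all made up, and the class only imitates the two behaviours described above (render picks a template without running that template's action, redirect_to answers with a 302 and a location).

```ruby
# Toy stand-in for a controller. NOT Rails; it only imitates the two
# behaviours discussed in this post.
class ToyController
  attr_reader :actions_run, :response

  def initialize
    @actions_run = []
  end

  # Picks a template for the current request; does not call any action.
  def render(template)
    @response = { status: 200, template: "#{template}.html.erb" }
  end

  # Ends the current request by asking the browser to request another URL.
  def redirect_to(path)
    @response = { status: 302, location: path }
  end

  def edit
    @actions_run << :edit
    render :edit
  end

  # Saving failed: show the edit form again within the same request.
  def update_failed
    @actions_run << :update
    render :edit
  end

  # Saving worked: send the browser to a new request (hypothetical path).
  def update_succeeded
    @actions_run << :update
    redirect_to "/articles/1"
  end
end

failed = ToyController.new
failed.update_failed
puts failed.response.inspect     # the edit template was chosen...
puts failed.actions_run.inspect  # ...but the edit action never ran

saved = ToyController.new
saved.update_succeeded
puts saved.response.inspect      # a 302 pointing at the new location
```

Because the edit action never runs on the render path, anything the template needs has to be set up in the action that calls render.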

The Solution

After this brief lesson on the difference between render and redirect_to, here is the solution: to persist the discount when the new.html.erb page is rendered, load the discount object before rendering the template.

  def new
    @user = User.new
    @purchase = Purchase.new
    super
  end

  def create
    @user = User.new(user_params)
    if @user.save
      redirect_to desired_path
    else
      @purchase = Purchase.new
      @discount = Discount.find_by(code: params[:purchase][:discount_code])
      return render 'new'
    end
  end
 

params[:purchase][:discount_code] was already available when render ‘new’ ran; all I needed was to define @discount with the Discount object so that new.html.erb knew what to do with it.
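
One thing to watch: params[:purchase][:discount_code] raises an error if the form is submitted without a purchase key at all. Below is a small defensive sketch using dig. A plain Hash with symbol keys stands in for the Rails params here, and discount_code_from is a made-up helper name, so treat it as an illustration rather than what my controller actually does.

```ruby
# Hypothetical helper: returns the discount code from the params,
# or nil when it is missing or blank.
# ASSUMPTION: a plain Hash with symbol keys stands in for Rails params.
def discount_code_from(params)
  code = params.dig(:purchase, :discount_code).to_s.strip
  code.empty? ? nil : code
end

# Made-up sample params, for illustration only
with_code    = { purchase: { discount_code: "trial5" } }
without_code = { user: { email: "someone@example.com" } }

puts discount_code_from(with_code).inspect
puts discount_code_from(without_code).inspect
```

The result could then be handed to Discount.find_by(code: ...), which returns nil when nothing matches.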

Debugging is a skill that comes with time and a good understanding of the framework and tools you are using to develop software. Slowly but surely, we are all on our way there.