A few months ago I put together a side project at http://www.myrecipesavour.com/. Basically, the site lets you enter the URL of a recipe page and then parses the recipe into your collection.
So it turns out, reading data from another site is very easy with Nokogiri.
The source code is available here: https://github.com/abreckner/MyRecipeSavour
There is a lot I am going to cover in the next few posts based on this code base (like Devise and Heroku), but for now we are focussed on this file: https://github.com/abreckner/MyRecipeSavour/blob/master/app/models/site.rb
So we are going to look at the add_recipe method.
First we need to require a few packages:
require 'open-uri'
require 'rubygems'
require 'nokogiri'
Unfortunately, I haven't yet figured out a heuristic for separating a recipe web page into its components (title, ingredients, instructions, amounts, etc.), but as a workaround I maintain a catalogue of CSS selectors that define these elements per domain. When I read the page, I use Nokogiri to pull out those elements using the CSS selectors, i.e.
# URI.open comes from open-uri; the bare open(url) no longer fetches URLs on Ruby 3.0+
html = Nokogiri::HTML(URI.open(url).read) # open the page
title = html.css(site.title_selector).text.strip # read the title
I then populate a recipe object with these pieces:
recipe = Recipe.new
recipe.name = title
...
recipe.save
My code around the ingredients and instructions is a little more complex as my Recipe model has many Ingredients and Instructions (eventually I am going to allow users to manipulate them individually). Each ingredient/instruction is parsed based on a line break, so I need to pull in the ingredient array from Nokogiri and then merge it into a string separated by line breaks.
ingredients = html.css(site.ingredient_selector).children.inject(''){|sum, n| sum + n.text + "\n"}
...
Ingredient.multi_save(ingredients, recipe)
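The round trip between a list of scraped pieces and a newline-separated string can be sketched in plain Ruby. `join_lines` and `split_lines` are hypothetical helpers, not the actual `multi_save` implementation (which isn't shown in this post); they only illustrate how scraped text and a textarea's contents can share one code path.

```ruby
# Join scraped pieces into one newline-separated string,
# the same shape a textarea would submit.
def join_lines(items)
  items.map(&:strip).reject(&:empty?).join("\n")
end

# Split the string back into individual entries, ignoring blank lines.
def split_lines(text)
  text.each_line.map(&:strip).reject(&:empty?)
end

scraped = ['3 bananas', ' 2 cups flour ', '']
text = join_lines(scraped)     # what would be shown in the textarea
entries = split_lines(text)    # what would be saved, one entry per line
```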
The reason I convert it to a string and then back into an array is so that the user can later edit the ingredients via a textarea. To be fair, I actually wrote the multi_save code to take its input from a textarea before I did the screen scraping, and I wanted to reuse it.
The other interesting piece of this add_recipe method is that I store a new Site whenever a user tries to add a recipe from an "uncatalogued" site. This automatically builds up a list of the sites people are interested in saving recipes from and allows me to catalogue them at a later date.
site_domain = URI.parse(url).host
site = Site.find_by_domain site_domain
if site.nil?
  site = Site.new
  site.domain = site_domain
  site.url = url
  site.user = current_user
  site.save!
  false
else
  ... #Nokogiri scraping code goes here
end
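Here is the same lookup-or-record flow stripped of ActiveRecord so it runs on its own. The `KNOWN_SITES` hash, the `PENDING_SITES` array and `lookup_site` are stand-ins I've invented for the Site table; only `URI.parse(url).host` is taken from the code above, and the duplicate check is my own simplification.

```ruby
require 'uri'

# Hypothetical in-memory stand-ins for the Site table.
KNOWN_SITES = { 'www.example.com' => { title_selector: 'h1.recipe-title' } }
PENDING_SITES = []

# Returns the catalogued site for the URL, or records the domain
# for later cataloguing and returns false.
def lookup_site(url)
  domain = URI.parse(url).host
  site = KNOWN_SITES[domain]
  return site unless site.nil?

  already_recorded = PENDING_SITES.any? { |s| s[:domain] == domain }
  PENDING_SITES << { domain: domain, url: url } unless already_recorded
  false
end
```

A URL on a known domain returns its selectors, while an unknown one returns `false` and leaves an entry in `PENDING_SITES` to be catalogued later.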