Data extraction for Web 2.0: Screen scraping in Ruby/Rails
June 14th, 2006

Update: A lot of things happened since the publication of this article. First of all, I have updated this article with Hpricot and scRUBYt! examples - then I wrote the second part, and I hacked up a Ruby web-scraping toolkit, scRUBYt!, which also has a community web page - check it out, it’s hot right now!

Introduction

Despite the ongoing Web 2.0 buzz, the absolute majority of Web pages are still very Web 1.0: they heavily mix presentation with content. [1] This makes it hard or impossible for a computer to separate the wheat from the chaff: to sift out meaningful data from the rest of the elements used for formatting, spacing, decoration or site navigation.

To remedy this problem, some sites provide access to their content through APIs (typically via web services), but in practice nowadays this is limited to a few (big) sites, and some of them are not even free or public. In an ideal Web 2.0 world, where data sharing and site interoperability are among the basic principles, this should change soon(?) - but what should you do if you need the data NOW and not in the likely-to-happen future?

Manic Miner

The solution is called screen/Web scraping or Web extraction - mining Web data by observing the page structure and wrapping out the relevant records. In some cases the task is even more complex than that: the data can be scattered over several pages, a GET/POST request may be needed to get the input page for the extraction, or authorization may be required to navigate to the page of interest. Ruby has solutions for these issues, too - we will take a look at them as well.

The extracted data can be used in any way you like - to create mashups (e.g. chicagocrime.org by Django author Adrian Holovaty), to remix and present the relevant data (e.g. rubystuff.com by ruby-doc.org maintainer James Britt), to automate processes (for example, if you have several bank accounts, to get the sum of the money you have altogether without using your browser), to monitor/compare prices/items, to do meta-search, to create a semantic web page out of a regular one - just to name a few. The number of possibilities is limited only by your imagination.

Tools of the trade

In this section we will check out the two main possibilities (string and tree based wrappers) and take a look at HTree, REXML, RubyfulSoup and WWW::Mechanize based solutions.

String wrappers

The easiest (but in most cases inadequate) possibility is to view the HTML document as a string. In this case you can use regular expressions to mine the relevant data. For example, if you would like to extract the names of goods and their prices from a Web shop, and you know that they are both in the same HTML element, like:

<td>Samsung SyncMasta 21''LCD     $750.00</td>

you can extract this record from Ruby with this code snippet:

page.scan(/<td>(.*?)\s+(\$\d+\.\d{2})<\/td>/)
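
To try the expression without a Web shop at hand, here is a minimal sketch against an inline string. The first row is the record from above; the second row and the variable names are made up for illustration.

```ruby
# Sample markup: the first row comes from the article, the second one is hypothetical.
page = <<~HTML
  <table>
    <tr><td>Samsung SyncMasta 21''LCD     $750.00</td></tr>
    <tr><td>Hypothetical Keyboard   $19.99</td></tr>
  </table>
HTML

# The lazy (.*?) keeps the padding whitespace out of the name capture;
# with two groups, scan returns one [name, price] pair per match.
records = page.scan(/<td>(.*?)\s+(\$\d+\.\d{2})<\/td>/)

records.each { |name, price| puts "#{name}: #{price}" }
```

Each element of records is a two-element array, so the result can be fed straight into a hash or a CSV writer.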

Let’s see a real (although simple) example:

1 require 'open-uri'

2 url = "http://www.google.com/search?q=ruby"
3 URI.open(url) { |page|   # Kernel#open stopped accepting URLs in Ruby 3.0
4   page_content = page.read
5   links = page_content.scan(/<a class=l.*?href=\"(.*?)\"/).flatten
6   links.each { |link| puts link }
7 }

The first and crucial part of creating the wrapper program was the observation of the page source: We had to look for something that appears only in the result links. In this case this was the presence of the ‘class’ attribute, with value ‘l’. This task is usually not this easy, but for illustration purposes it serves well.

This minimalistic example shows the basic concepts: how to load the contents of a Web page into a string (line 4), and how to extract the result links on a Google search result page (line 5). After execution, the program will list the first 10 links of a Google search query for the word ‘ruby’ (line 6).
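
Since the markup of a live page can change at any time, it may help to test the expression offline first. The sketch below runs the same scan against a made-up fragment; the HTML is hypothetical and only mimics the ‘class=l’ convention described above.

```ruby
# Hypothetical result-page fragment; real search result markup differs and changes over time.
page_content = <<~HTML
  <a class=l href="http://www.ruby-lang.org/">Ruby Programming Language</a>
  <a href="/preferences">Preferences</a>
  <a class=l href="http://www.rubyonrails.org/">Ruby on Rails</a>
HTML

# Only anchors carrying class=l are matched; the navigation link is skipped.
links = page_content.scan(/<a class=l.*?href="(.*?)"/).flatten
links.each { |link| puts link }
```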

However, in practice you will mostly need to extract data which is not in a contiguous string, but contained in multiple HTML tags, or divided in a way where a string is not the proper structure for searching. In this case it is better to view the HTML document as a tree.[2]

Tree wrappers

The tree-based approach, although it enables more powerful techniques, has its problems, too: the HTML document can look very good in a browser, yet still be seriously malformed (unclosed/misused tags). It is a non-trivial problem to parse such a document into a structured format like XML, since XML parsers can work with well-formed documents only.
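
A quick way to see the problem is to hand a sloppy snippet to an XML parser. This is only a sketch, using REXML from Ruby’s standard distribution and an invented piece of markup:

```ruby
require 'rexml/document'

# Browsers render this happily, but <br> and <p> are never closed,
# so it is not well-formed XML. The markup itself is made up.
malformed = '<html><body><p>Price: $750.00<br></body></html>'

begin
  REXML::Document.new(malformed)
  parsed = true
rescue REXML::ParseException => e
  parsed = false
  puts "XML parser gave up: #{e.message.lines.first}"
end
```

The parser refuses the document at the first mismatched tag, which is exactly why a tidying step is needed before any tree-based querying.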

HTree and REXML

There is a solution (in most cases) for this problem, too: it is called HTree. This handy package is able to tidy up the malformed HTML input, turning it into XML - the recent version is capable of transforming the input into the nicest possible XML from our point of view: a REXML Document. (REXML is Ruby’s standard XML/XPath processing library.)

After preprocessing the page content with HTree, you can unleash the full power of XPath, which is a very powerful XML document querying language, highly suitable for web extraction.

Refer to [3] for the installation instructions of HTree.

Let’s revisit the previous Google example:

1 require 'open-uri'
2 require 'htree'
3 require 'rexml/document'

4 url = "http://www.google.com/search?q=ruby"
5 open(url) {
6  |page| page_content = page.read()
7  doc = HTree(page_content).to_rexml
8  doc.root.each_element('//a[@class="l"]') {
        |elem| puts elem.attribute('href').value }  
9 }

HTree is used in the 7th line only - it converts the HTML page (loaded into the page_content variable on the previous line) into a REXML Document. The real magic happens in the 8th line. We select all the <a> tags which have an attribute ‘class’ with the value ‘l’, then for each such element write out the ‘href’ attribute. [4] I think this approach is much more natural for querying an XML document than a regular expression. The only drawback is that you have to learn a new language, XPath, which is (mainly from version 2.0) quite difficult to master. However, just to get started you do not need to know much of it, yet you gain lots of raw power compared to the possibilities offered by regular expressions.
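
If you do not have HTree installed, the XPath part can still be tried on its own. The sketch below skips the tidying step and feeds REXML a small, already well-formed document; the markup and the variable names are invented for the illustration.

```ruby
require 'rexml/document'

# A hypothetical, already well-formed page, so no HTree step is needed here.
xhtml = <<~XML
  <html><body>
    <a class="l" href="http://www.ruby-lang.org/">Ruby</a>
    <a class="nav" href="/next">Next</a>
    <div><a class="l" href="http://www.rubyonrails.org/">Rails</a></div>
  </body></html>
XML

doc = REXML::Document.new(xhtml)

# Same query as in the example above: every <a> whose class attribute is "l",
# no matter how deeply it is nested.
hrefs = REXML::XPath.match(doc, '//a[@class="l"]').map { |elem| elem.attributes['href'] }
puts hrefs
```

Note that the second result link sits inside a <div>, which would already make the regular expression approach more fragile, while the XPath stays the same.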

Hpricot

Hpricot is “a Fast, Enjoyable HTML Parser for Ruby” by one of the coolest (Ruby) programmers of our century, why the lucky stiff. From my experience, the tag line is absolutely correct - Hpricot is both very fast (thanks to a C based scanner implementation) and really fun to use. It is based on HTree and JQuery, thus it can provide the same functionality as the previous HTree + REXML combination, but with much better performance and greater ease of use. Let’s see the Google example again - I guess you will understand instantly what I mean!

require 'rubygems'
require 'hpricot'
require 'open-uri'

doc = Hpricot(open('http://www.google.com/search?q=ruby'))
links = doc/"//a[@class=l]"
links.each { |link| puts link.attributes['href'] }

Well, though this was slightly easier than with the tools seen so far, this example does not really show the power of Hpricot - there is much, much, much more in store: different kinds of parsing, CSS selectors and searches, nearly full XPath support, and lots of chunky bacon! If you are doing something smaller and don’t need the power of scRUBYt!, my advice is to definitely use Hpricot from the tools listed here. For more information, installation instructions, tutorials and documentation check out Hpricot’s homepage!

RubyfulSoup

RubyfulSoup is a very powerful Ruby screen-scraping package, which offers possibilities similar to HTree + XPath. For people who are not handy with XML/XPath, RubyfulSoup may be a wise compromise: it’s an all-in-one, effective HTML parsing and web scraping tool with Ruby-like syntax. Although its expressive power lags behind XPath 2.0, it should be adequate in 90% of the cases. If your problem is in the remaining 10%, you probably don’t need to read this tutorial anyway ;-) Installation instructions can be found here: [5].

The google example again:

1  require 'rubygems'
2  require 'rubyful_soup'
3  require 'open-uri'

4  url = "http://www.google.com/search?q=ruby"  
5  open(url) { 
6    |page| page_content = page.read()
7    soup = BeautifulSoup.new(page_content)
8    result = soup.find_all('a', :attrs => {'class' => 'l'})
9    result.each { |tag| puts tag['href'] }
10 }

As you can see, the difference between the HTree + REXML and RubyfulSoup examples is minimal - basically it is limited to differences in the querying syntax. On line 8, you look up all the <a> tags with the specified attribute list (in this case a hash with a single pair { ‘class’ => ‘l’ }). The other syntactical difference is looking up the value of the ‘href’ attribute on line 9.

I have found RubyfulSoup the ideal tool for screen scraping from a single page - however, web navigation (GET/POST, authentication, following links) is not really possible, or obscure at best, with this tool (which is perfectly OK, since it does not aim to provide this functionality). However, there is nothing to fear - the next package does exactly that.

WWW::Mechanize

As of today, the prevalent majority of data resides in the deep Web - databases that are accessible via querying through web forms. For example, if you would like to get information on flights from New York to Chicago, you will (hopefully) not search for it on Google - you go to the website of Ruby Airlines instead, fill in the adequate fields and click on search. The information which appears is not available on a static page - it’s looked up on demand and generated on the fly - so until the very moment the web server generates it for you, it’s practically non-existent (i.e. it resides in the deep Web) and hence impossible to extract. At this point WWW::Mechanize comes into play. (See [6] for installation instructions.)

WWW::Mechanize belongs to the family of screen scraping products (along with http-access2 and Watir) that are capable of driving a browser. Let’s apply the ‘Show, don’t tell’ mantra - for everybody’s delight and surprise, illustrated on our Google scenario:

require 'rubygems'
require 'mechanize'

agent = WWW::Mechanize.new
page = agent.get('http://www.google.com')

search_form = page.forms.with.name("f").first
search_form.fields.name("q").first.value = "ruby"
search_results = agent.submit(search_form)
search_results.links.each {
     |link| puts link.href if link.class_name == "l" }

I have to admit that I have been cheating with this one ;-). I had to hack WWW::Mechanize to access a custom attribute (in this case ‘class’) because normally this is not available. See how I did it here: [7]

This example illustrates a major difference between RubyfulSoup and Mechanize: in addition to screen scraping functionality, WWW::Mechanize is able to drive the web browser like a human user: it filled in the search form and clicked the ‘search’ button, navigating to the result page, then performed screen scraping on the results.
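
For a simple GET form, what ‘filling in and submitting’ boils down to on the wire is requesting a URL with the encoded field values attached. Here is a rough sketch with Ruby’s standard uri library; the field name ‘q’ is the one used above, while the /search path is an assumption, and nothing is actually sent.

```ruby
require 'uri'

# Field name 'q' is taken from the example above; the /search path is an assumption.
uri = URI('http://www.google.com/search')

# encode_www_form escapes the values the way a browser does for a GET form.
uri.query = URI.encode_www_form('q' => 'ruby screen scraping')

puts uri
```

Tools like Mechanize earn their keep once cookies, hidden fields, POST requests and redirects enter the picture, which is where building URLs by hand stops being fun.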

This example also pointed out the fact that RubyfulSoup - although lacking navigation possibilities - is much more powerful in screen scraping. For example, as of now, you cannot extract arbitrary (say <p>) tags with Mechanize, and as the example illustrated, attribute extraction is not possible either - not to mention more complex, XPath-like queries (e.g. the third <td> in the second <tr>) which are easy with RubyfulSoup/REXML. My recommendation is to combine these tools, as pointed out in the last section of this article.
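
The ‘third <td> in the second <tr>’ kind of query can be sketched with REXML like this. The table is made up, and it is already well-formed, so no tidying step is involved.

```ruby
require 'rexml/document'

# A made-up, well-formed table used only to illustrate positional XPath.
table = <<~XML
  <table>
    <tr><td>a1</td><td>a2</td><td>a3</td></tr>
    <tr><td>b1</td><td>b2</td><td>b3</td></tr>
  </table>
XML

doc = REXML::Document.new(table)

# XPath positions are 1-based: second row, third cell.
cell = REXML::XPath.first(doc, '/table/tr[2]/td[3]')
puts cell.text
```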

scRUBYt!

scRUBYt! is a simple to learn and use, yet very powerful web extraction framework written in Ruby, based on Hpricot and Mechanize. Well, yeah, I made it :-) so this is kind of a self promotion, but I think (hopefully not just because of being overly biased ;-)) it is the most powerful web extraction toolkit available to date. scRUBYt! can navigate through the Web (like clicking links, filling text fields, crawling to further pages - thanks to Mechanize), and extract, query, transform and save relevant data from the Web page of your interest through a concise and easy to use DSL (thanks to Hpricot and lots of smart heuristics).

OK, enough talking - let’s see it in action! I guess this is rather annoying now for the 6th time, but let’s revisit the Google example once more! (for the last time, I promise :-)

require 'rubygems'
require 'scrubyt'

google_data = Scrubyt::Extractor.define do
  fetch          'http://www.google.com/ncr'
  fill_textfield 'q', 'ruby'
  submit

  result 'Ruby Programming Language' do
    link 'href', :type => :attribute
  end
end

google_data.to_xml.write($stdout, 1)
Scrubyt::ResultDumper.print_statistics(google_data)

Output:

  <root>
    <result>
      <link>http://www.ruby-lang.org/</link>
    </result>
    <result>
      <link>http://www.ruby-lang.org/en/20020101.html</link>
    </result>
    <result>
      <link>http://en.wikipedia.org/wiki/Ruby_programming_language</link>
    </result>
    <result>
      <link>http://en.wikipedia.org/wiki/Ruby</link>
    </result>
    <result>
      <link>http://www.rubyonrails.org/</link>
    </result>
    <result>
      <link>http://www.rubycentral.com/</link>
    </result>
    <result>
      <link>http://www.rubycentral.com/book/</link>
    </result>
    <result>
      <link>http://www.w3.org/TR/ruby/</link>
    </result>
    <result>
      <link>http://poignantguide.net/</link>
    </result>
    <result>
      <link>http://www.zenspider.com/Languages/Ruby/QuickRef.html</link>
    </result>
  </root>

    result extracted 10 instances.
        link extracted 10 instances.

You can download this example from here.

Though the code snippet is not really shorter, maybe even longer than the other ones, there are lots of things to note here: first of all, instead of loading the page directly (you can do that as well, of course), scRUBYt! allows you to navigate there by going to Google, filling in the appropriate text field and submitting the search. The next interesting thing is that you need no XPaths or other mechanism to query your data - you just copy’n’paste some examples from the page, and that’s it. Also, the whole description of the scraping process is more human friendly - you do not need to care about URLs, HTML, passing the document around, handling the result - everything is hidden from you and controlled by scRUBYt!’s DSL instead. You even get nice statistics on how much stuff was extracted. :-)

The above example is just the tip of the iceberg - there is much, much, much more in scRUBYt! than what you have seen so far. If you would like to know more, check out the tutorials and other goodies on scRUBYt!’s homepage.

WATIR

From the WATIR page:

WATIR stands for “Web Application Testing in Ruby”. Watir drives the Internet Explorer browser the same way people do. It clicks links, fills in forms, presses buttons. Watir also checks results, such as whether expected text appears on the page.

Unfortunately I have no experience with WATIR since I am a Linux-only nerd, using Windows for occasional gaming but not for development, so I cannot tell anything about it first hand, but judging from the mailing list contributions I think Watir is more mature and feature-rich than Mechanize. Definitely check it out if you are running on Win32.

The silver bullet

For a complex scenario, usually an amalgam of the above tools can provide the ultimate solution: the combination of WWW::Mechanize or WATIR (for automation of site navigation), RubyfulSoup (for serious screen scraping, where the above two are not enough) and HTree+REXML (for extreme cases where even RubyfulSoup can’t help you).

I have been creating industry-strength, robust and effective screen scraping solutions in the last five years of my career, and I can show you a handful of pages where even the most sophisticated solutions do not work (and I am not talking about scraping with RubyfulSoup here, but even more powerful solutions, like embedding Mozilla in your application and directly accessing the DOM etc.). So the basic rule is: there is no spoon (err… silver bullet) - and I know by experience that the number of ‘hard-to-scrape’ sites is rising (partially because of the Web 2.0 stuff like AJAX, but also because some people would not like their sites to be extracted and apply different anti-scraping masquerading techniques).

The described tools should be enough to get you started - additionally, you may have to figure out how to drill down to your stuff on the concrete page of interest.

In the next installment of this series, I will create a mashup application using the introduced tools, from some more interesting data than Google ;-) The results will be presented on a Ruby on Rails powered page, in a sortable AJAX table.

If you liked the article, subscribe to the rubyrailways.com feed!


[1] There are a lot of other issues (social aspect, interoperability, design principles etc.), but these fall outside the scope of the current topic.


[2] However, if the problem can be relatively easily tackled with regular expressions, it’s usually good to use them for several reasons: no additional packages are needed (this is even more important if you don’t have install rights), you don’t have to rely on the HTML parser’s output, and if you can use regular expressions, it’s usually the easier way to do so.


[3] Install HTree:
wget http://cvs.m17n.org/viewcvs/ruby/htree.tar.gz (or download it from your browser)
tar -xzvf htree.tar.gz
sudo ruby install.rb


[4] There are plenty of other (possibly smarter) ways to do this, for example using each_element_with_attribute, or a different, more effective XPath - I have chosen to use this method to get as close to the regexp example as possible, so it is easy to observe the difference between the two approaches for the same solution. For a real REXML tutorial/documentation visit the REXML site.


[5] The easiest way is to install rubyful_soup from a gem:
sudo gem install rubyful_soup
Since it was installed as a gem, don’t forget to require ‘rubygems’ before requiring rubyful_soup.


[6] sudo gem install mechanize


[7] I have added two lines to WWW::Mechanize source file page_elements.rb: To the class definition:
attr_reader :class_name
Into the constructor:
@class_name = node.attributes['class']


90 Responses to “Data extraction for Web 2.0: Screen scraping in Ruby/Rails”

  1. Ruby, Rails, Web2.0 » Blog Archive » Announcing screen-scraping series Says:

    [...] https://rubyrailways.com/data-extraction-for-web-20-screen-scraping-in-rubyrails [...]

  2. Chris Rose Says:

    There is a project to port Watir for Firefox, just FYI - it’s called FireWatir

    http://wiki.mozilla.org/SoftwareTesting:WatirandFirefox

  3. Chris Rose Says:

    those were supposed to be underscores around the and, between Watir and Firefox, in the url in my above comment - I don’t know how those got altered - sorry.

  4. peter Says:

    Chris,

    Thanks for the link! We are developing a screen scraping application just now which is a Firefox extension, so i am quite involved with Firefox and good to know about stuff like FireWatir.

    About the underscores - i guess it is WordPress. For example this was written as asterisk-this-asterisk and now you can see it in bold. Probably underscore is a shortcut for italic i guess…

  5. aa Says:

    I am not commenting on your blog because you had a captcha-esque ‘please add 10 and 0’ field, it derided me as ‘not knowing math’ when I entered “10”, and it eradicated all the contents of my post rather than letting me take the challenge again. Comment spam is annoying, but my time is better spent bitching about your way of handling it than actually rewriting my post and helping you out.

  6. peter Says:

    @aa:

    ;-) Sorry for the inconvenience…. Nobody ‘bitched’ about it yet, so I did not know (I have tried it once or twice and it worked OK for me). What do you suggest? I would not like to drop the captcha completely since then i am receiving LOTS of spam. Maybe somebody has a suggestion for a better system?

  7. peter Says:

    OK, I have turned off the captcha until i find something more convenient… So if the comments will be full of spam its because of that ;-)

  8. dominic Says:

    With WWW::Mechanize you can get the parsed rexml document and it also adds convenience methods to this REXML::Document
    agent = WWW::Mechanize.new
    page = agent.get('http://www.google.com')
    form = page.forms.first
    form.fields.name('q').value = 'ruby'
    search_results = agent.submit(form)
    search_results.root.each_element('//a[@class="l"]') {|elem| puts elem.attribute('href').value }

  9. Doug Bromley Says:

    Absolutely superb article. I’ve generally always put up with using old fashioned regexp in my screen scraping and didn’t know of these other methods until now. You’ve opened my eyes. Thank you.

  10. Pig Pen - Web Standards Compliant Web Design Blog » Blog Archive » Screen Scraping With Ruby Says:

    [...] Screen Scraping With Ruby - a tutorial. [...]

  11. Leonardo Pires Says:

    Super neat. I’m expecting a new one, specially using Gecko’s DOM.

  12. frank Says:

    Great introduction. You might want to add a link to your main page in the posts. I tried clicking the header but there seems to be no link. Now I will have to go back to Reddit to find out your blog’s url (I’m using sharpreader).

  13. peter Says:

    @Leonardo:

    I have a fully working and tested Java solution for that - but there i have every building stone (Java gecko widget - currently using SWT.Browser but there are alternatives like Ajax Toolkit Framework and XULRunner which are even better - and JavaXPCOM + W3CConnector to communicate between mozilla and java)

    The problem with Ruby is that although both of these things are there (RubyGecko, GTK::Mozembed and rbXPCOM), they are in a very-very immature state, i am not sure if even usable. So although i have all the know-how to build such a thing, i am not sure whether the building blocks allow me to do this.

    @frank:

    Thanks for the suggestion! I will do that ASAP.

  14. Leonardo Says:

    Do you published the solution’s source code? Maybe I can help…

  15. AaronT Says:

    Font size…

    Gee whats with the small font size on this page, the code blocks are unreadable unless font size is increased by the browser

  16. peter Says:

    Since more people have been complaining about the font size/line height i have modified it a bit for both the text and the source code. Thanks for the feedback, i am continuously trying to improve the look, so suggestions are welcome!

  17. peter Says:

    @Leonardo:

    Could you please PM me? I’d prefer to talk this over via e-mail rather than a WordPress comment page ;-)

  18. Me Says:

    I’ve been working on a detailed project to parse and quantify a complicated course listing website for my college. Unfortunately, the site is a HTML throwback to the early 90’s and does not differentiate between listings in any meaningful way. As a result, the only thing capable of parsing the sea of random tags is a set of carefully constructed regex’s. This would break very easily if they ever bothered to change how they did their markup, but it works in this case.

    As I work on this, I’m constructing a parsing toolkit designed to abstract some of the repetitive regex tasks I frequently go through. While gross overkill for a nicely formatted site, it’s the only thing that seems to work with this html eyesore.

  19. /home/chrisdo » WeekBits #25 Says:

    [...] Peter Szinek, owner of RubyRailWays, has announced a series of articles about screen-scraping subjects. The first article «Data extraction for Web 2.0: Screen scraping in Ruby/Rails» was recently published. [...]

  20. Noel Clarke Says:

    I would like to talk to you… I have been in a company that commercialized the first two methods you speak of - HTree and REXML. With a GUI designer.

    I have a few thoughts about commercial applications… that could be monetized.

  21. peter Says:

    @Noel:

    You can reach me at peter@[thissite].com. Feel free to send me an email!

  22. Fez Says:

    Thanks for the techniques listed here.

    I’m going to go make a few screens my bitch using these techniques.

  23. James Britt Says:

    Thanks for the mention of rubystuff.com. That site is itself created by scraping content from CafePress, using WWW::Mechanize.

    Shameless plug; I wrote about that here: http://neurogami.com/cafe-fetcher/

  24. RMX Says:

    HTree or HTML::XMLParser

    It seems HTML::XMLParser is already included in ruby (in either net/http or mechanize or rexml ?) and does pretty much the same thing as HTree without an extra download. Any reason you prefer HTree?

  25. Labnotes » Blog Archive » links for 2006-06-22 Says:

    [...] Ruby, Rails, Web2.0 » Data extraction for Web 2.0: Screen scraping in Ruby/Rails “In this section we will check out the two main possibilities (string and tree based wrappers) and take a look at HTree, REXML, RubyfulSoup and WWW::Mechanize based solutions.” (tags: scraping) [...]

  26. peter Says:

    @RMX:

    Well, the reason for this is very prosaic: I did not know HTML::XMLParser beforehand.
    I will check it out and see what’s the difference between HTree and XMLParser…

  27. greg.rubyfr.net»Blog Archive » [En bref] RWN 12, 18 juin 2006 Says:

    [...] Peter Szinek has studied the various options for screen scraping/Web extraction/automated Web navigation with Ruby, and has produced an article comparing the different Ruby libraries usable in this area. [...]

  28. Scraping sites with ruby Says:

    [...] Sometimes it feels a bit backwards scraping sites for microformats, maybe there’s scope for microformat returning webservices in the future. For the time being, if you’re wanting to parse sites in ruby there are several tools. I began by using the HTML lib which is used by assert_tag and friends in rails, but then ran into problems when giving it malformed XHTML. Now I’ve ended up with RubyfulSoup which is doing the job nicely. Other options are covered in this article. [...]

  29. Anonymous Says:

    Data extraction for Web 2.0: Screen scraping in Ruby/Rails…

    introduction to screen scraping/Web extraction with Ruby, evaluation of the tools along with installation instructions and examples….

  30. Ruby gets a stylish HTML scraper - scrAPI Says:

    [...] The indefatigable Assaf Arkin has done it again by developing a new Ruby HTML scraping toolkit, scrAPI. Peter Szinek recently wrote a popular article about scraping from Ruby using Manic Miner, RubyfulSoup, REXML, and WWW::Mechanize, but none of these are as immediately useful as scrAPI.. so why? [...]

  31. uday Says:

    I am pretty new to this web scraping stuff…can anyone tell me what are the major business usecases for this scraping? i know this web20 mashup’s does this but any commercial application does this?

    tia.

  32. Michael @ SEOG Says:

    Uday — There are a few different business cases I can think of. A primary one is marketing where you might want to build a contact list for your sales force to call or other sorts of targeting. There are many databases online that contain a lot of useful information.

    Other times maybe you are trying to automate a process you have to do often. I saw an author who used a technique like this to track sales. There are other examples like tracking ebay bids on certain items that a power seller might find useful. There are many times where you want to take data from a web page and turn it into structured data for your own purposes.

  33. Bob Says:

    Very helpful and interesting article. Wanted to ask your opinion on scrAPI as well. Looking forward to your next article on this subject.

  34. Bob Says:

    Hi,

    When I copy and paste your HTree example it gives this error:

    undefined method `HTree' for main:Object (NoMethodError)

    on the usage of the HTree class. The following seems to work fine:

    require 'open-uri'
    require 'htree/parse'
    require 'htree/rexml'
    require 'rexml/document'

    url = "http://www.google.com/search?q=ruby"
    open(url) { |page|
      page_content = page.read()
      # Parse the (possibly malformed) HTML and convert it to a REXML document
      doc = HTree.parse(page_content).to_rexml
      # Print the href of every result link
      doc.root.each_element('//a[@class="l"]') { |elem|
        puts elem.attribute('href').value
      }
    }

  35. Bob Says:

    Rubyful seems to change utf-8 characters, for instance   into %nbsp. Is this standard behaviour?

  36. Bob Says:

    Duh sorry about that, I meant to say the &nbsp; is translated into %nbsp

  37. Alex Says:

    Nice post…

  38. ramonsblog » Blog Archive » Rubinrote Wohnungssuche Says:

    [...] Clicking through the same web forms umpteen times. Selecting Hamburg as the federal state umpteen times and picking the preferred districts with the CTRL key held down. And every time a pop-under opens along with it. I’ve had enough! Motivated by a blog entry on Rubyrailways by Peter Szinek from beautiful Vienna (küss die Hand), I took a closer look at the Mechanize module by Michael Neumann and Aaron Patterson. Basically it simulates a web browser and can be [...]

  39. chuck sonic Says:

    HTML::XMLParser?

    It took me a while, but I figured out what RMX was talking about. For the curious:

    gem install htmltools

    then

    require_gem 'htmltools'
    require 'html/xmltree.rb'

    parser = HTMLTree::XMLParser.new(false, false)
    parser.parse_file_named('my.html')
    doc = parser.document # is a REXML::Document

    Check out lib/html/xmltree.rb at http://ruby-htmltools.rubyforge.org/doc/ for more info. Seems to be functionally identical to htree. Slightly easier to install, but also in my very limited testing almost twice as slow.

    -chuck

  40. Mr skin Says:

    Thanks for the information, I needed a pick me up.

  41. Mark Says:

    Hey ‘RMX’: I don’t see any HTML::XMLParser in the standard distribution. You would think before sending people on wild goose chases looking in three different places you say it might be (one of which, Mechanize, isn’t even there either) you would be a little more sure of it yourself. Check your facts next time.

  42. peter Says:

    @Mark:

    Don’t worry about RMX’s tips ;-) There is a better (by far) solution already: HPricot by why. I am working on my Ruby web-extraction framework right now - using HPricot - and I can tell you, it is absolutely the way to go. It is waaaay faster than any other tool, and they say it also has better shaky-html-parsing capabilities. Well, so far I did not have any problems with any page, and it is really, really lightning fast compared to HTree + REXML or RubyfulSoup.

  43. Mark Says:

    Ok Peter I’ll check that out. I’ll also look for your web abstraction framework.

  44. cesium62 Says:

    Yes, I agree with Bob. rubyfulsoup seems to translate html entity references like “&nbsp;” and “&eacute;” into “%nbsp” and “%eacute” respectively. Of course, the problem might be in the SGML parser code that rubyful soup uses. It sure would be nice if the community could discuss this problem and its solutions in more detail.

    Cs

  45. Elliott’s blog » Blog Archive » scRUBYt - Hot, New Ruby Web-Scraping Toolkit Released Says:

    [...] Article 1 [...]

  46. Kenny Says:

    Instead of altering the gem, you could just add this at the top of your example:

    class WWW::Mechanize::Link
      def class_name
        node.attributes['class']
      end
    end

  47. Peter Says:

    Yeah, that’s absolutely true and it’s definitely the Ruby way - unfortunately when I wrote this article I was totally new to Ruby and (coming from Java) I forgot about the possibility to reopen a class…
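
A minimal sketch of the class-reopening idea from the two comments above. It uses a stand-in `Link` class (hypothetical, not the real WWW::Mechanize::Link) so that it runs without any gem installed:

```ruby
# Stand-in for a library class we do not own (hypothetical, for illustration only).
class Link
  attr_reader :attributes

  def initialize(attributes)
    @attributes = attributes
  end
end

# Reopen the class later and add a helper; the original definition stays untouched.
class Link
  def class_name
    attributes['class']
  end
end

link = Link.new('href' => 'http://example.com/', 'class' => 'l')
puts link.class_name
```

The same pattern applies to a class loaded from a gem: the second `class` block adds the method at load time, so the installed gem files never need to be edited.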

  48. Dent Says:

    You could have a look at http://www.knowlesys.com, they provide a web data extraction service.

  49. Various tools for screen scrapping « Ruby on Rails Development on Windows Says:

    [...] Various tools for screen scrapping Filed under: Uncategorized — bngu @ 11:29 am I came across this article that discussed several tools for screen scraping. The tools mentioned are string wrappers and tree wrappers. String wrapper is basic and not very flexible. Tree wrappers have several options: HTree, Hpricot, RubyfulSoup, WWW::Mechanize, scRUBYt!, WATIR. For examples and in-depth discussion of each of the tools, check out the article. [...]

  50. scRUBYt! » Ruby Web Scraping Tool Guide - a Simple to Learn and Use, yet Powerful Web Scraping Toolkit Written in Ruby Says:

    [...] scRUBYt! is a simple to learn and use, yet powerful web scraping toolkit written in Ruby. The idea behind making scRUBYt! was to show a few simple concepts of Web extraction as a practical extension of this tutorial. [...]

  51. Webmaster Tips Says:

    Ruby Bikini - How to Process XML in Ruby…

    Continuing in the series of Brazilian bikini Web development tutorials, here is an experiment with the Yahoo Search API, Ruby and Brazilian bikinis….

  52. Jim Says:

    I would like to use this to send a website URL to a validator at http://validator.w3.org/. I have been reading your articles on the Information Acquisition Process, the tutorials, and so on. I have installed everything with no issues, and I’m just wondering where to start. You have these examples, but what do I do with them? Do they go in a controller that I have made, say Validator_controller? Could you possibly guide me through this, as I don’t really have a clue.

    What I’m trying to do is send a website URL to a validator like the one above, then grab all the validation results etc. and display them on a page in my web application. Any help would be gratefully received. Oh, I signed up to your forum, but I never received my activation email. I checked my email and it is correct; my login was solidariti, if you want to check.

    Thank you

  53. Undiggnified.com » Blog Archive » Ruling on Rails Says:

    [...] 2)https://rubyrailways.com/data-extraction-for-web-20-screen-scraping-in-rubyrails This is a brief overview of scraping methods in Ruby. The author is a wee bit biased (but very knowledgeable) towards his own scraper-class: ScrubyT. I have not used ScrubyT since I am on a WIN32 machine and it won’t work for me without some major tweaking. But he also goes over Hpricot, and Mechanize, which I use extensively. [...]

  54. 电子网 Says:

    Continuing in the series of Brazilian bikini Web development tutorials, here is an experiment with the Yahoo Search API, Ruby and Brazilian bikinis….

  55. jonybrv Says:

    Hi —

    I am new to this scraping technology. I was recently assigned a project which needs certain information to be scraped from multiple webpages. Presently they are doing it using Perl:LWP and RegEx on Win32. As there is no option for commercial software, please let me know your views and recommendations on any solutions that would address the need. Is the PERL:LWP module sufficient, or should I look for .NET modules?

    Thanks

  56. Pick Your RoR HTML Parsing Poison :: Fat Penguin Says:

    [...] Several options are available, but oh so popular is why’s Hpricot. It’s fast and enjoyable (although I experienced no joy while learning how to use it =) It also happens to be used in some of the other scraping/navigating libraries (WWW::Mechanize [rdoc] and scRUBYt!). [...]

  57. Don Svenson Says:

    I’ve been using Newbie Web Automation http://www.newbielabs.com and it does a pretty good job of scraping data from websites. It supports IE and Firefox. I’m interested to see how this Ruby data extraction tool would stack up.
    Does it come with a debugger?

  58. mashupbuch.blog » Screen-Scraping mit Ruby Says:

    [...] Even though the Web 2.0 wave is sloshing through the net with a great deal of noise, there are many websites, especially in the German-speaking net, that are still sitting completely high and dry. Many website operators have not yet heard of open, remixable data, e.g. in the form of a web service, or they resist it. But with screen or web scraping, almost all content can be used for mashups. At rubyrailways.com, various interesting approaches and libraries for Ruby programmers are presented and compared, including their pros and cons. [...]

  59. doug y'barbo Says:

    Hi: This is a first class tutorial–very professionally presented. It was also very useful; i read every word. I thought i was at least a competent practitioner of this skill, but apparently i’m not! Additionally, whether it was your intention or not, i think that reading this article helps anyone who hopes to acquire fluency w/ scRUBYt, by providing the context, or the problems w/ current libraries and techniques that led to scRUBYt development. After discovering it a few days ago, i’ve used scRUBYt several times on real problems on a professional project i’ve been working on for the past six months–scRUBYt worked as smoothly as a commercial app, no hitches. So i didn’t find any bugs, and i doubt i could offer any improvements that you or the Community hasn’t thought of already, but if that changes, i’ll post up. regards –doug

  60. frosty Says:

    I have written a javascript tool that is extremely efficient for web scraping. Check it out: http://www.feedmarklet.com/batchmarklet.html

  61. George Zachariah Says:

    Hpricot will fail if the html has got errors. In that case you could use tidy like this:

    require 'mechanize'
    require 'tidy'
    require 'hpricot'

    agent = WWW::Mechanize.new
    page = agent.get("http address")

    html = page.body # raw HTML string of the fetched page

    # Tidying up the html as there are errors
    xml = Tidy.open(:show_warnings => true) { |tidy|
      tidy.options.output_xml = true
      puts tidy.options.show_warnings
      xml = tidy.clean(html)
      #puts tidy.errors
      #puts tidy.diagnostics
      xml
    }

    # Convert to Hpricot Document
    doc = Hpricot(xml)

    # do rest of html processing

  62. hiutopor Says:

    Hi

    Very interesting information! Thanks!

    Bye

  63. Jaime Iniesta Says:

    Great tutorial!

    Just two comments: the first example does not return any results, I think it’s because Google now returns the “class” part after the “href”.

    And on the last example, the last line throws this error:

    scraping006.rb:15: uninitialized constant Scrubyt::ResultDumper (NameError)

    I’m on Ubuntu 7, ruby 1.8.5

  64. peter Says:

    Jaime,

    Yeah, the first problem is a classic for web scraping: if the source changes, your scraper stops working. There are several solutions for this problem (starting with the most primitive, recoding your scraper, up to sophisticated AI heuristics including scraper adaptation, machine learning etc). Thanks for noting it though, I’ll update it soon.

    As for the second problem: that’s fine - ResultDumper was dropped due to a rewrite and should be back in the future. However, it’s nothing big, it just showed some statistics of the results (like the link pattern matched 10 results etc). You can ignore it for now.
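
The attribute-order breakage described in comment 63 can be sidestepped, at least for simple markup, by collecting each tag's attributes into a hash before checking them. This is a rough sketch using only the standard library. `links_with_class` is a hypothetical helper, it assumes double-quoted attribute values, and a real parser such as Hpricot remains the more robust choice:

```ruby
# Return the href of every <a> tag carrying the given class,
# regardless of the order in which the attributes appear.
# Assumption: attribute values are double-quoted.
def links_with_class(html, css_class)
  html.scan(/<a\b([^>]*)>/i).map(&:first).map do |attrs|
    # Turn 'href="..." class="..."' into { 'href' => ..., 'class' => ... }
    pairs = attrs.scan(/([\w-]+)\s*=\s*"([^"]*)"/).to_h
    pairs['href'] if pairs['class'].to_s.split.include?(css_class)
  end.compact
end

html = '<a href="http://a.example/" class="l">A</a> ' \
       '<a class="l" href="http://b.example/">B</a> ' \
       '<a href="http://c.example/">C</a>'

puts links_with_class(html, 'l')
```

Both of the first two links are found even though their attributes are in a different order; the third is skipped because it has no matching class.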

  65. Data extraction for Web 2.0: Screen scraping in Ruby/Rails « Hot WWW News Says:

    [...] read more | digg story [...]

  66. Ruby On Rails - important BookMarks Says:

    [...] CuRL- Ex [...]

  67. BLogger Says:

    yadayada yada

  68. BLogger Says:

    http://www.dinamis.eu

  69. Aardvark Says:

    I landed on this site the other day while searching for screen scraping. I wanted to write a screen scraper to monitor the status page of my DSL modem because AT&T service has been exceptionally poor lately, and I felt I might gather some valuable or at least interesting information by logging the status for a few days. So, I tried each example, some worked some didn’t. The WWW::Mechanize example worked and returned search results from http://www.google.com. Cool, not quite what I wanted, but cool. I only ran it twice: once I ran the example exactly as it was on the page and a second time I ran it with a different search value. Then, I moved on with the Ruby learning and finally completed my modem status page scraper, which coincidentally was my first Ruby program. Now, Google has put my IP address on some sort of blacklist. I cannot conduct a search without first solving a CAPTCHA, then after they’ve updated a cookie in my browser, I’m good to go. If I clear my browser’s cookies, I get the Google Error page and again must enter the CAPTCHA. If I go through a proxy, no CAPTCHA. If anyone else has encountered this problem, do share. I don’t think it is a coincidence. I do hope my IP is erased from this supposed blacklist soon. It is such an annoyance.

  70. Michael Says:

    Thanks for your very well conceived and executed tutorial. I particularly appreciated your putting the various tools into context so that as a beginner I can make an informed decision about which to invest the time in learning.

    I think that Ruby would be more widely and effectively used if there were more tutorials providing this kind of detailed and substantive overview of various problem domains.

    Thanks again for this most helpful tutorial.

  71. acc617acdafa Says:

    acc617acdafa…

    acc617acdafa2014c6f3…

  72. John Says:

    I’m not sure if this is the forum for my question. I’m new to Ruby. I’m looking for ways to submit web forms and save the resulting web page in a pdf file.

    So I go to https://www.some-site.com (yes https); I click on a “start” button and a new page with a form to fill is displayed; I fill in the form using data from a csv file; I click on a button to submit the form; a new confirmation page is displayed. I want to save this confirmation page in a pdf file.

    I want to do this in Windows using simple Ruby scripts (without AJAX or RAILS or VB, etc.) Using just Ruby scripts (I think I used Watir too) and IE 6, I am able to do the form submit and navigation. However, I can’t seem to find a simple way to save the last page into a pdf file.

    I tried using the IE “print” function (CTRL-P) but I can only get the IE print dialog to come up; I don’t know how to supply the file name for the pdf printer’s “save as” pop-up window. Any ideas?

    Thanks.
    John

  73. zaglyani08 Says:

    Hi everyone! :) The internet is full of porn sites where downloading requires various activation codes or sending SMS messages to numbers,
    without knowing how much they will charge you for it! Without thinking long, I decided to create a site
    where all downloads are free! On this site you can find anything you want; I even added a section: Books!
    Another plus: the site is updated constantly! Anyone who is interested, please follow this link

  74. hipporu2008 Says:

    Lineage II Hellbound is a latest-generation multiplayer game.
    Several thousand human-controlled characters can take part in the game at the same time.
    A medieval fairy-tale world filled with wonders and dangers, monsters and heroes, will open up to you.
    As the game progresses your character gains experience, and new skills, weapons and spells become available.
    It is up to you whether to be a mage or a warrior, to spend your time fighting monsters or to plunge into the world of clan politics.
    In Lineage 2 you can take part in battles with dragons, where only the coordinated actions of a team of several
    dozen people can guarantee success; you can besiege castles or find yourself among the defenders of the walls,
    and raise your own dragon hatchling into a huge flying monster on whose back you can fly around the world.
    To register an account and download the La2 Hellbound game client, use our site http://la2.hippo.ru/
    Enjoy the game.

  75. Vietnam software outsourcing Says:

    Cool

    http://www.tuvinh.com

  76. More Light! More Light! :: Using Ruby for command line web lookups Says:

    [...] world of screen-scraping as it is called, doesn’t end there. If you need more advanced techniques for screen scraping a page, behold the power of the [...]

  77. gen Says:

    Very good article.

    http://scrappingexpert.com
    Web Data extraction Specialist.

  78. tip Says:

    I admire you on the willingness to share this info with others - good luck!

  79. Dlip Says:

    Great article,
    but does it work fine with sites that use JavaScript?

  80. peter Says:

    @Dlip: Sure, scRUBYt! does. Check out http://scrubyt.org.

  81. Eric Says:

    I am also writing a new article about data extraction from web screens. So thanks for your article as a base, I will link to it :-)

    Eric

  82. cafe world fans Says:

    The information presented is top notch. I’ve been doing some research on the topic and this post answered several questions.

  83. ClubPenguinCheats Says:

    Thanks for your very well conceived and executed tutorial. I particularly appreciated your putting the various tools into context so that as a beginner I can make an informed decision about which to invest the time in learning.

  84. Jamar Bushway Says:

    Hi, I found your site by googling for Manic Miner. Have you seen the cool clothes at manicminer.se

  85. zero cost commissions Says:

    Magnificent page, I share the same views. I wonder why this specific society does indeed not just think similar to me personally along with the web publication master ;-)

  86. car photos Says:

    But does it work fine with sites that use JavaScript?

  87. ptrz Says:

    The information presented is top notch. I’ve been doing some research on the topic and this post answered several questions.

  88. Jeff Says:

    Hi, thank you for your article,

    I used your script to collect relevant car data. It works like a charm. Thanks for your work!

  89. patrik Says:

    Thanks for sharing it

  90. backlink profile Says:

    This actually is my very first clock I have haunted this web site. I saw a lot of concerning stuff in your site. From the a lot of remarks on your contented articles, I guess I am not the only one! continue up the proud work.
