- Sliced bread (1928, Otto Frederick Rohwedder)
- Ruby (1995, Yukihiro “Matz” Matsumoto)
- HPricot (2006, Why the lucky stiff)
- Ruby on Rails (2005, David Heinemeier Hansson)
What a relief! I really had the urge to tell everybody how cool HPricot is, I just did not know how - until now. The cosmic balance is somewhat restored now that I have blurted out this post :-).
Needless to say, you need to take this list with a tiny droplet of humor: of course, if we consider development time, the amount and scope of offered solutions, innovation, community, book coverage etc., then Rails is a clear winner (and anyway, the two players are not in the same league). However, HPricot is a great example of how a not-new-at-all thing can be made much more usable, fast and “heaps of fun to use” (really) just by clever design and usage of the right tools (and a dash of a cool programmer’s charisma). It is one thing to come up with a purple cow in a non-saturated market with lots of space for innovation, and a different story to do the same when everything has already been said and done. And _why did it. Again.
I am writing a (not so) small web extraction framework in Ruby (planned release XMAS 2006) which relies heavily on HPricot as the HTML query language - so I dare say I know (at least some parts of) HPricot pretty well, yet it still keeps me totally amazed. What I like the most about it (besides that it is lightning-on-steroids fast compared to anything available for the same task, feature rich, reliable, stable etc.) is that it takes the ‘principle of least surprise’ to the next level: I would call it the ‘principle of almost no surprise’. If you have a bit of knowledge about org.w3c.dom, XML, XPath, XSLT and/or experience with other HTML/XML parsers/tools, you will have to refer to the documentation very rarely (of course there is a period of learning the basics and soaking up the HPricot philosophy, but you get up to speed really quickly).
Before I get to the proof that HPricot is able to solve the food problems in Africa or something, I need to cool down a bit :-): HPricot is not for everyone and not for every problem. If you need complex XPath evaluation, for instance, you will have to stick with the good old REXML (for now, at least - I read that _why will add more XPath support and other goodies in the future). In the present version, you won’t be able to evaluate things like axes (e.g. ancestor::html) or XPath functions (e.g. normalize-space), and not even XPaths with indices (like html/body/table[1]/tr[2]/td[5] - though I wrote a small script to remedy this problem temporarily).
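To make the REXML fallback a bit more concrete, here is a minimal sketch of the two things mentioned above that HPricot cannot do yet: an XPath with indices and an axis. The sample markup and the variable names are made up for illustration, and keep in mind that REXML is an XML parser, so this only works on well-formed input (real-world tag soup would need cleaning up first).

```ruby
require 'rexml/document'

# Hypothetical, well-formed sample markup (an assumption for this sketch).
sample = <<~HTML
  <html><body><table>
    <tr><td>a1</td><td>a2</td></tr>
    <tr><td>b1</td><td>  b2  </td></tr>
  </table></body></html>
HTML

doc = REXML::Document.new(sample)

# XPath with indices: the 2nd cell of the 2nd row of the 1st table.
cell = REXML::XPath.first(doc, '/html/body/table[1]/tr[2]/td[2]')

# Whitespace is trimmed in plain Ruby here instead of with normalize-space.
cell_text = cell.text.strip

# Axis: walk back up from the cell to its enclosing table.
table = REXML::XPath.first(cell, 'ancestor::table')

# Relative path from that table: its direct <tr> children.
row_count = REXML::XPath.match(table, 'tr').size

puts cell_text
puts row_count
```

If the page is not well-formed, REXML will most likely refuse to parse it, which is exactly the situation where a forgiving HTML parser earns its keep.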
There are a lot of HTML-extraction related questions on the Ruby mailing list (like how to extract every table cell from a <table> etc.). My advice is to always check out HPricot first: sometimes it can be overkill (if you can get what you want with a simple regexp, for example), but usually it is the right tool to parse and query even the ugliest HTML pages out there - unless you need heavy XPath/XQuery machinery, which is rarely the case in real life.
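For the "simple regexp" end of the spectrum, a sketch like the following is often all it takes to pull every cell out of a small, regular table. The sample markup is invented for this example, and the pattern is deliberately naive: it assumes every <td> is properly closed and that tables are not nested.

```ruby
# Hypothetical sample page; the markup and values are made up.
html = <<~HTML
  <table>
    <tr><td>apple</td><td class="num">1</td></tr>
    <tr><td>pear</td><td class="num">2</td></tr>
  </table>
HTML

# Non-greedy match of everything between <td ...> and </td>.
# /m lets "." span newlines, /i tolerates upper-case tags like <TD>.
cells = html.scan(%r{<td\b[^>]*>(.*?)</td>}mi).flatten.map(&:strip)

puts cells.inspect
# prints ["apple", "1", "pear", "2"]
```

As soon as the page has unclosed tags, nested tables or markup inside the cells, a pattern like this starts to fall apart, and that is the point where a real parser is the better choice.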
What else do I need to add? Great job, _why. Thanks man.

November 14th, 2006 at 3:43 am
Hpricot is truly lovely. No more handcrafted web scrapers for me - I feel like a professional web scraper with Hpricot. I feel like “this website was meant to give up its data” with Hpricot. I feel I don’t need microformats with Hpricot. Amen.
November 14th, 2006 at 4:00 am
Yeah, just by looking at the current web scraping arsenal available in Ruby, I have to second your opinion (well, with a small addition: FireWatir/Mechanize can come in handy if I need to automate some steps during the scraping).
However, this will (hopefully) change greatly when I release my web extraction framework ‘scRUBYt!’ (shameless self-promotion, but hey, this is my blog…). It’s built on HPricot and Mechanize, but extended with a LOT of powerful features - I have been working for a web extraction company for five years now, so I (hopefully :-)) have some ideas about what this should look like.
But until then, HPricot is surely the king - and let’s see after the first release of scRUBYt!…
November 14th, 2006 at 11:17 am
Nice post… I really must find the time to get in amongst HPricot - I have some fun projects which I sort of gave up on because I couldn’t bear writing the scraper bit of them. Sounds like scRUBYt might be a good fit too! Looking forward to hearing more on that one.
November 14th, 2006 at 11:58 am
Well, scRUBYt! development is at full steam, so stay tuned! Just a small example (this is already possible in the present version):
Task: Turn an HTML table into a comma-separated list.
scRUBYt! in action:
    table_data = P.table do
      P.row do
        P.cell 'This is the first <td> in the table!'
      end
    end
    table_data.to_csv     #we are done!
    table_data.to_xml     #if we want an XML
    table.row[2].cell[3]  #this gives us the 3rd <td> in the 2nd <tr>

This is a very primitive example, I could not come up with an easier one. In practice scRUBYt! will be capable of scraping much more complicated pages (like ebay or amazon), navigating on them, transforming the output etc.
November 14th, 2006 at 12:03 pm
Comment on the previous example: the line
    P.cell 'This is the first <td> in the table!'
‘tells’ scRUBYt! that a table cell looks like this (by copy&pasting its text content from the browser), and the other cells are automatically detected.