November 14th, 2006
  1. Sliced bread (1928, Otto Frederick Rohwedder)
  2. Ruby (1995, Yukihiro “Matz” Matsumoto)
  3. HPricot (2006, Why the lucky stiff)
  4. Ruby on Rails (2005, David Heinemeier Hansson)

What a relief! I really had the urge to tell everybody how cool HPricot is, but I just did not know how until now. The cosmic balance is somewhat restored now that I have blurted out this post :-).

Needless to say, you need to take this list with a tiny droplet of humor: of course, if we consider development time, the amount and scope of offered solutions, innovation, community, book coverage etc., then Rails is a clear winner (and anyway, the two players are not in the same league). However, HPricot is a great example of how a not-new-at-all thing can be made much more usable, fast and “heaps of fun to use” (really) just by clever design and usage of the right tools (and a dash of a cool programmer’s charisma). It is one thing to come up with a purple cow on a non-saturated market with lots of space for innovation, and a different story to do the same when everything has already been said and done. And _why did it. Again.

I am writing a (not so) small web extraction framework in Ruby (planned release: Christmas 2006) which relies heavily on HPricot as the HTML query language, so I dare say I know (at least some parts of) HPricot pretty well, yet it still keeps me totally amazed. What I like the most about it (besides being lightning-on-steroids fast compared to anything available for the same task, feature rich, reliable, stable etc.) is that it takes the ‘principle of least surprise’ to the next level: I would call it the ‘principle of almost no surprise’. Anyone who has a bit of knowledge about org.w3c.dom, XML, XPath, XSLT and/or experience with other HTML/XML parsers/tools will have to refer to the documentation very rarely (of course there is a period of learning the basics and soaking up the HPricot philosophy, but you get up to speed really quickly).

Before I get to the proof that HPricot is able to solve the food problems in Africa or something, I need to cool down a bit :-): HPricot is not for everyone and not for every problem. If you need complex XPath evaluation, for instance, you will have to stick with good old REXML (for now, at least: I read that _why will add more XPath support and other goodies in the future). In the present version, you won’t be able to evaluate things like axes (e.g. ancestor::html) or XPath functions (e.g. normalize-space), and not even XPaths with indices (like html/body/table[1]/tr[2]/td[5]), though I wrote a small script to remedy this problem temporarily.
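For the record, here is roughly what the REXML route looks like for the index and axis cases. This is only a minimal sketch: it assumes a well-formed page (REXML is an XML parser, so real-world tag soup may make it choke), and both the sample markup and the `node_at` helper are made up for illustration.

```ruby
require 'rexml/document'

# Made-up, well-formed sample page (REXML expects valid XML).
PAGE = <<~HTML
  <html>
    <body>
      <table>
        <tr><td>a1</td><td>a2</td></tr>
        <tr><td>b1</td><td>b2</td></tr>
      </table>
    </body>
  </html>
HTML

# Hypothetical helper: the first node matching an XPath, or nil if none does.
def node_at(context, xpath)
  REXML::XPath.first(context, xpath)
end

doc = REXML::Document.new(PAGE)

# XPath with indices: 2nd cell of the 2nd row of the 1st table.
cell = node_at(doc, '/html/body/table[1]/tr[2]/td[2]')
puts cell.text

# XPath axis: walk up from that cell to the enclosing <html> element.
puts node_at(cell, 'ancestor::html').name
```

The price is that the input has to be clean XML, which is exactly where a forgiving HTML parser earns its keep.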

There are lots of HTML-extraction related questions on the Ruby mailing list (like how to extract every table cell from a <table> etc.). My advice is to always check out HPricot first: sometimes it can be overkill (if you can get what you want with a simple regexp, for example), but usually it is the right tool to parse and query even the ugliest HTML pages out there, unless you need heavy XPath/XQuery machinery, which is rarely the case in real life.
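To show what I mean by the simple regexp case, here is a minimal sketch in plain Ruby that pulls the contents of every table cell out of a small, regular fragment. The sample markup and the `table_cells` name are just assumptions for the example.

```ruby
# Made-up sample markup; any small, regular table fragment would do.
SAMPLE_HTML = '<table><tr><td>apple</td><td>pear</td></tr>' \
              '<tr><td class="x">plum</td></tr></table>'

# Return the stripped contents of every <td>...</td> pair, in document order.
def table_cells(html)
  html.scan(%r{<td[^>]*>(.*?)</td>}im).flatten.map(&:strip)
end

p table_cells(SAMPLE_HTML)
```

This holds up only while the cells are flat: nested tables, unclosed tags or markup inside the cells will trip the pattern, and that is the point where a real parser is the right tool.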

What else do I need to add? Great job, _why. Thanks, man.


35 Responses to “”

  1. Dr Nic Says:

    Hpricot is truly lovely. No more handcrafted web scrapers for me - I feel like a professional web scraper with Hpricot. I feel like “this website was meant to give up its data” with Hpricot. I feel I don’t need microformats with Hpricot. Amen.

  2. peter Says:

    Yeah, just by looking at the current web scraping arsenal available in Ruby, I have to second your opinion (well, with a small addition: FireWatir/Mechanize can come in handy if I need to automate some steps during the scraping).

    However, this will (hopefully) change greatly when I release my web extraction framework ‘scRUBYt!’ (shameless self-promotion :-) but hey, this is my blog…). It’s built on HPricot and Mechanize, but extended with a LOT of powerful features - I have been working for a web extraction company for five years now, so I (hopefully :-)) have some ideas about what this should look like.

    But until then, surely HPricot is the king - and let’s see after the first release of scRUBYt!…

  3. Peter Says:

    Nice post… I really must find the time to get in amongst HPricot - I have some fun projects which I sort of gave up on because I couldn’t bear writing the scraper bit of them. Sounds like scRUBYt might be a good fit too! Looking forward to hearing more on that one too.

  4. peter Says:

    Well, scRUBYt! development is in full steam, so stay tuned! Just a small example (this is already possible in the present version):

    Task: Turn an HTML table into a comma-separated list.

    scRUBYt! in action:

    table_data = P.table do
                    P.row do
                      P.cell 'This is the first <td> in the table!'
                    end
                 end
    
    table_data.to_csv          # we are done!
    table_data.to_xml          # if we want XML instead
    table_data.row[2].cell[3]  # this gives us the 3rd <td> in the 2nd <tr>
    

    This is a very primitive example; I could not come up with an easier one. In practice scRUBYt! will be capable of scraping much more complicated pages (like eBay or Amazon), navigating them, transforming the output etc.

  5. peter Says:

    A comment on the previous example: The line

    P.cell 'This is the first <td> in the table!'
    

    ‘tells’ scRUBYt! what a table cell looks like (by copying and pasting its text content from the browser), and the other cells are detected automatically.

  6. al Says:

    huinya
