
November 14th, 2006

Sliced bread (1928, Otto Frederick Rohwedder)
Ruby (1995, Yukihiro “Matz” Matsumoto)
HPricot (2006, Why the lucky stiff)
Ruby on Rails (2005, David Heinemeier Hansson)

What a relief! I really had the urge to tell everybody how cool HPricot is; I just did not know how, until now. The cosmic balance is somewhat restored now that I have blurted out this post :-).

Needless to say, you need to take this list with a tiny droplet of humor. Of course, if we consider development time, the amount and scope of offered solutions, innovation, community, book coverage and so on, then Rails is the clear winner (and anyway, the two players are not in the same league). However, HPricot is a great example of how a not-new-at-all thing can be made much more usable, fast and “heaps of fun to use” (really) just by clever design and use of the right tools (and a dash of a cool programmer’s charisma). It is one thing to come up with a purple cow in a non-saturated market with lots of space for innovation, and a different story to do the same when everything has already been said and done. And _why did it. Again.

I am writing a (not so) small web extraction framework in Ruby (planned release: Xmas 2006) which relies heavily on HPricot as the HTML query language, so I dare say I know (at least some parts of) HPricot pretty well, yet it still amazes me. What I like most about it (besides being lightning-on-steroids fast compared to anything available for the same task, feature rich, reliable, stable and so on) is that it takes the ‘principle of least surprise’ to the next level: I would call it the ‘principle of almost no surprise’. Anyone who has a bit of knowledge of org.w3c.dom, XML, XPath or XSLT, or experience with other HTML/XML parsers and tools, will rarely have to refer to the documentation (of course there is a period of learning the basics and soaking up the HPricot philosophy, but you climb the learning curve quickly).

Before I get to the proof that HPricot is able to solve the food problems in Africa or something, I need to cool down a bit :-). HPricot is not for everyone and not for every problem. If you need complex XPath evaluation, for instance, you will have to stick with good old REXML (for now, at least: I read that _why will add more XPath support and other goodies in the future). In the present version you cannot evaluate axes (e.g. ancestor::html), XPath functions (e.g. normalize-space) or even XPaths with indices (like html/body/table[1]/tr[2]/td[5], though I wrote a small script to remedy this problem temporarily).
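
To make the kind of query meant here concrete, below is a minimal sketch of the REXML fallback, using an XPath with indices and an axis. The sample markup and the variable names are made up for illustration, and the sketch assumes well-formed input: REXML is an XML parser and will reject sloppy real-world HTML.

```ruby
require 'rexml/document'

# Made-up, well-formed sample page (an assumption for this sketch).
html = <<~HTML
  <html><body>
    <table>
      <tr><td>a1</td><td>a2</td></tr>
      <tr><td>b1</td><td>b2</td></tr>
    </table>
  </body></html>
HTML

doc = REXML::Document.new(html)

# XPath with indices: 2nd cell of the 2nd row of the 1st table.
cell = REXML::XPath.first(doc, '/html/body/table[1]/tr[2]/td[2]')
puts cell.text

# XPath axis: walk back up from that cell to the enclosing <html> element.
root = REXML::XPath.first(cell, 'ancestor::html')
puts root.name
```

The price of this route is the well-formedness requirement, which is exactly where a forgiving HTML parser earns its keep.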

There are lots of HTML-extraction questions on the Ruby mailing list (like how to extract every table cell from a <table>). My advice is to always check out HPricot first. Sometimes it is overkill (if you can get what you want with a simple regexp, for example), but usually it is the right tool to parse and query even the ugliest HTML pages out there, unless you need heavy XPath/XQuery machinery, which is rarely the case in real life.
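
For the table-cell question above, the “simple regexp” route might look like the following sketch. The sample markup and the pattern are assumptions for illustration; it only holds up on tidy, non-nested tables, and anything uglier is a job for a real parser.

```ruby
# Made-up sample markup (an assumption for this sketch).
html = <<~HTML
  <table>
    <tr><td>apple</td><td class="num">1</td></tr>
    <tr><td>pear</td><td class="num">2</td></tr>
  </table>
HTML

# Grab whatever sits between each <td ...> and the next </td>.
# The non-greedy (.*?) stops at the first closing tag, so nested
# tables or unclosed cells will produce wrong results.
cells = html.scan(%r{<td\b[^>]*>(.*?)</td>}mi).flatten

puts cells.inspect
```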

What else do I need to add? Great job, _why. Thanks, man.

35 Responses to “”

Dr Nic Says:
November 14th, 2006 at 3:43 am
Hpricot is truly lovely. No more handcrafted web scrapers for me - I feel like a professional web scraper with Hpricot. I feel like “this website was meant to give up its data” with Hpricot. I feel I don’t need microformats with Hpricot. Amen.
peter Says:
November 14th, 2006 at 4:00 am
Yeah, just by looking at the current web scraping arsenal available in Ruby, I have to second your opinion (well, with a small addition: FireWatir/Mechanize can come in handy if I need to automate some steps during the scraping).

However, this will (hopefully) change greatly when I release my web extraction framework ‘scRUBYt!’ (shameless self-promotion, but hey, this is my blog…). It’s built on HPricot and Mechanize, but extended with a LOT of powerful features. This is my fifth year working for a web extraction company, so I (hopefully :-)) have some ideas about how this should look.

But until then, HPricot is surely the king - and let’s see after the first release of scRUBYt!…
Peter Says:
November 14th, 2006 at 11:17 am
Nice post… I really must find the time to get in amongst HPricot - I have some fun projects which I sort of gave up on because I couldn’t bear writing the scraper bit of them. Sounds like scRUBYt might be a good fit too! Looking forward to hearing more on that one.
peter Says:
November 14th, 2006 at 11:58 am
Well, scRUBYt! development is in full steam, so stay tuned! Just a small example (this is already possible in the present version):

Task: Turn an HTML table into a comma-separated list.

scRUBYt! in action:
```
table_data = P.table do
                P.row do
                  P.cell 'This is the first <td> in the table!'
                end
             end

table_data.to_csv     #we are done!
table_data.to_xml     #if we want an XML
table_data.row[2].cell[3]  #this gives us the 3rd <td> in the 2nd <tr>
```
This is a very primitive example; I could not come up with an easier one. In practice scRUBYt! will be capable of scraping much more complicated pages (like eBay or Amazon), navigating them, transforming the output etc.
peter Says:
November 14th, 2006 at 12:03 pm
A comment on the previous example: the line
```
P.cell 'This is the first <td> in the table!'
```
‘tells’ scRUBYt! that a table cell looks like this (by copying and pasting its text content from the browser), and the other cells are automatically detected.
