
Archive for February, 2007


Friday, February 23rd, 2007

Can you imagine the on-line world without del.icio.us, reddit, digg, dzone and other Web2.0 social bookmarking sites? Sure you can - they were not always around, and nobody missed them before they appeared. However, since their debut, I guess no serious geek can exist without them anymore. The functionality and information richness these sites offer is unquestionable - however, more and more flaws and problems are popping up as people learn to use, monetize, abuse, trick and tweak them. I would like to present my current compilation of woes and worries, sprinkled with a few suggestions on how to handle them.

DISCLAIMER: this is my subjective view on these matters - I am not claiming the things presented here are objectively true - this is just my personal perception.

General Problems

I read a nice quote recently - unfortunately I cannot find it right now. It goes something like this: “Time is nature’s method of preventing things from happening all at once. It does not seem to work lately…”

Though the notion of a social bookmarking site did not even exist when someone thought up this quote, it captures the essential problem of these sites very well: too many things are happening all at once, and it is therefore impossible to process the amount of information pouring in from everywhere…

Information overload - I think this is not really a jaw-dropping, mind-boggling discovery - but since it is the root of all evil (not just in the context of Web2.0 or social sites, but for the whole web today) it deserves to be presented as the first problem in this list. Today it is almost certain that the thing you are looking for is on the Web (whether legally or illegally) - actually finding it is the much bigger problem! This applies to the social sites as well. A site like digg gets about 5000 article submissions every day - and even if you restrict yourself to the front page stories, it is virtually impossible to keep up with them unless you spend a (not so short) amount of time every day just browsing the site. O.K., this is not a Web2.0 or social site problem per se, but quite a hard one to solve nevertheless.
Proposed solution: I don’t have the foggiest idea. Basically an amalgam of the solutions presented in the next points…
Articles get pushed down quickly - which is inevitable and not even a terrible problem in itself, since this is how it should work - the worse thing is that the good stuff sinks just as fast as the crap - i.e. every new article hitting the front page makes all the others sink by 1 place.
Proposed solution: The articles could be weighted (+ points for more votes, more reads, more comments etc., - points for thumbs down, spam reports, complaints etc.) and the articles should sink relative to each other at any given moment - i.e. the weight should be recalculated dynamically all the time, and the hottest article should be the stickiest while the least-voted-for should give up its place to the upcoming, more interesting ones.
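The weighting idea can be sketched in a few lines of Ruby. This is only an illustration: the field names, the `WEIGHTS` values and the sample articles are assumptions made up for the example, not the algorithm of digg or any other site.

```ruby
# Hypothetical article record; the field names are illustrative assumptions.
Article = Struct.new(:title, :votes, :reads, :comments, :thumbs_down,
                     :spam_reports, keyword_init: true)

# Assumed weights: positive signals add to the score, negative ones subtract.
WEIGHTS = {
  votes: 3.0,
  reads: 0.1,
  comments: 1.0,
  thumbs_down: -2.0,
  spam_reports: -10.0
}.freeze

# Weighted sum of all signals for one article.
def article_weight(article)
  WEIGHTS.sum { |signal, weight| article[signal] * weight }
end

# Re-rank the whole page by current weight, heaviest first, so articles
# sink relative to each other instead of by arrival order.
def rank(articles)
  articles.sort_by { |article| -article_weight(article) }
end

articles = [
  Article.new(title: 'Fresh but ignored', votes: 1, reads: 20, comments: 0,
              thumbs_down: 0, spam_reports: 0),
  Article.new(title: 'Older but loved', votes: 40, reads: 900, comments: 12,
              thumbs_down: 2, spam_reports: 0),
  Article.new(title: 'Spammy', votes: 15, reads: 300, comments: 1,
              thumbs_down: 9, spam_reports: 4)
]

rank(articles).each { |article| puts article.title }
```

Calling `rank` again after every vote or spam report gives the dynamic recalculation described above. A real site would probably also let the weight decay with age, which this sketch leaves out.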
Good place, wrong time - if you submitted a very interesting article and the right guys did not see it at the right time, it will inevitably sink and never make it to the front page. It is possible that if you had submitted it half a day later, it would have been noticed by the critical mass needed to make it to the front page - the worst thing is that you never even know if this is so.
Proposed solution: Place a digg/dzone/del.icio.us/whatever button after or before the article - this way, people will be able to vote on your article after reading it, no matter how and when they got to your site. The article will stay on your site forever - whereas on digg it will be in a relevant place for just a few hours.
URL structure problems - sometimes the same document is represented by various URLs, which confuses most of the systems. The most frequent manifestations of this problem are: URLs with and without www (like https://www.rubyrailways.com and https://rubyrailways.com), a change of the URL style (from /?p=4 to /2002/4/5/stuff.html) or redirects, among other things.
Proposed solution: Decide on a URL scheme and use it forever (generally, /?p=4 is not a recommended style - /2002/4/5/post.html and other semantically meaningful URLs are preferred, see Cool URIs never change), and set your web server to turn http://www… into http:// (or the other way around). The sites could also remedy the situation by checking not just the URL but also the content of the document (like digg does just before submission).
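On the bookmarking site's side, the www/no-www case could be handled by reducing every submitted URL to one canonical form before comparing it with known ones. The following is a minimal sketch using Ruby's standard `uri` library; the helper name `canonical_url` and the choice of what counts as "the same" (www prefix, host case, default port, trailing slash) are assumptions of this example.

```ruby
require 'uri'

# Reduce URL variants of the same document to one canonical string.
# Assumption of this sketch: a leading "www.", the case of the host,
# an explicit default port and trailing slashes carry no meaning.
def canonical_url(url)
  uri   = URI.parse(url).normalize            # lowercases scheme and host
  host  = uri.host.sub(/\Awww\./, '')
  port  = uri.port == uri.default_port ? '' : ":#{uri.port}"
  path  = uri.path.sub(%r{/+\z}, '')
  path  = '/' if path.empty?
  query = uri.query ? "?#{uri.query}" : ''
  "#{uri.scheme}://#{host}#{port}#{path}#{query}"
end

puts canonical_url('https://www.rubyrailways.com')
puts canonical_url('https://rubyrailways.com/')
```

Both calls print the same string, so the two submissions can be recognised as one document. A change of URL style (from /?p=4 to a dated path) cannot be detected this way; that case needs redirects on the publishing site or a content comparison.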

Tagging

Tagging is a great way of describing the meaning of an item (in our case a document) in a concise and easy to understand way - from a good set of tags you should know immediately what the article is about just by reading them. The idea is not really brand new - scientific papers have been using this technique for ages (much like PageRank - long before PageRank was implemented by the google guys, it was an accepted and commonly used technique to rank scientific papers based on the number of citations they received in other relevant works).

Some sites have a predefined, finite set of tags (like dzone) while some allow custom ones (like del.icio.us - usually with suggestions based on the tags of others or on keywords extracted from the article). The problem with a predefined tag set is that you are restricted to the tags offered by the site - well, this is sometimes good because it gives you some guidelines about what is accepted on the site. There are much more interesting problems with sites that allow custom tags:

No commonly accepted, uniform tagging conventions - some of these sites accept space separated tags, some quoted ones, and some do not require or recommend any specific format. This is again a source of confusion, even inside the same system. Consider these examples:
```
ruby-on-rails
ruby on rails
ruby_on_rails
"ruby on rails"
RubyOnRails
ruby rails
ruby,rails
ruby+rails
RUBY-RAILS
ror
ROR
rails
programming:rails
```
and I could come up with tons of other ones. The problem is that all these tags are trying to convey the same information - namely that the article is about ruby on rails. Of course this is absolutely clear to any human being - however, much less so to a machine.
Proposed solution: It would be beneficial to agree on one accepted tagging convention (even if you cannot really force people to use it). The sites could use (even more) heuristics to merge tags with the same meaning into one. For example, if the user has a lot of ruby and rails bookmarks and tags something with ‘rails’, it is very likely that the meaning of the tag is ‘ruby on rails’ etc.
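A rough sketch of such a merging heuristic follows. The alias table, the canonical spelling `ruby-on-rails` and the 30% threshold in `resolve_in_context` are all assumptions picked for the illustration; a real site would have to curate or learn them.

```ruby
# Hand-written alias table (an assumption of this sketch).
ALIASES = {
  'ror'        => 'ruby-on-rails',
  'ruby-rails' => 'ruby-on-rails'
}.freeze

# Bring spelling variants of a tag to one canonical form.
def normalize_tag(raw)
  tag = raw.strip.delete('"')
  tag = tag.gsub(/([a-z])([A-Z])/, '\1-\2')    # RubyOnRails -> Ruby-On-Rails
  tag = tag.downcase.gsub(/[\s_+,]+/, '-')     # unify the separators
  ALIASES.fetch(tag, tag)
end

# Context heuristic from the text: a bare 'rails' from a user whose
# bookmarks are largely about Ruby most likely means 'ruby-on-rails'.
def resolve_in_context(tag, user_tags, min_share: 0.3)
  canonical = normalize_tag(tag)
  return canonical unless canonical == 'rails'

  ruby_tags = user_tags.count { |t| normalize_tag(t).start_with?('ruby') }
  share = ruby_tags / user_tags.size.to_f
  share >= min_share ? 'ruby-on-rails' : canonical
end

variants = ['ruby-on-rails', 'ruby on rails', 'ruby_on_rails',
            '"ruby on rails"', 'RubyOnRails', 'ruby rails', 'ruby,rails',
            'ruby+rails', 'RUBY-RAILS', 'ror', 'ROR']
puts variants.map { |v| normalize_tag(v) }.uniq.inspect
```

With these rules the eleven spellings collapse into a single tag. The namespaced form `programming:rails` is deliberately left alone, since it is not obvious what a prefix should mean across sites.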
Too many tags and no relations between them - I think everybody has, or at least has seen, a large del.icio.us bookmark farm. The problem with the tags at this point is that there are a lot of them, and they are presented in a flat structure, without any relation between them. (O.K., there is the tag cloud, but it is more eye candy in this sense.) With a really large number of tags (say hundreds of them) the whole thing can become really cumbersome.
Proposed solution: Visualization could help a lot here. Check out this image:

Example of a Clustered Tag Graph

I think such a representation would make the whole thing easier, especially if it were interactive (i.e. if you clicked the tag ‘ActiveRecord’, the graph would change to show the tags related to ‘ActiveRecord’). The idea is that all of your tags should be clustered (where related ones belong to one cluster - the above image is an example of a toread-ruby cluster) and the big graph should consist of the clusters, with each cluster’s main element highlighted for easy navigation. If you click a cluster, it would zoom in etc.
Granularity of tagging - this is a minor issue compared to the others, but I would like to see it nevertheless: it should be possible to mark and tag paragraphs or other smaller portions of the document, not just the whole document itself. Imagine a long tutorial primarily about Ruby metaprogramming. Say there is an exceptionally good paragraph on unit testing, which is about 0.1% of the whole text. It might be wrong to tag the whole document with ‘unit testing’, since it is not about unit testing - however, I would like to be able to capture the outstanding paragraph.
Proposed solution: Again, visual representation could help very much here. I would present a thumbnail of the page, big enough to make it possible to distinguish objects (paragraphs, images, tables), but small enough not to be clumsy. Then the user would be able to visually mark the relevant paragraph (with a pen tool), and tag just that. This should result in a bookmark tagged like this:

Example of More Granular Tagging

On lookup, you will see the relevant lines marked and will be able to orient yourself faster. To some people this may look like overkill - however, nobody forces you to use it! If you would like to stick with the good old tag-one-document method, it’s up to you - however, if you choose to tag some documents like this as well, you have the possibility.
Tagging a lot of things with the same tag is the same as tagging with none - consider that you have 500 items tagged with ‘Ruby’. True, you still don’t have to search the whole Web, which is much bigger than 500 documents, but it is still a real PITA to find something in 500 documents.
Proposed solution: the clustered tag graph could help to navigate - usually you are not looking for just ‘Ruby’ things but for ‘Ruby and testing and web scraping’, for example. Advanced search (coming in vol. 2), where you can specify which tags should be looked up and also what the document should contain, could remedy the problem, too.
Common ontologies, synonyms, typo corrections - O.K., these might seem to be rocket science compared to the other, simpler missing features - however, I think their correct implementation would mean a great leap for the usability of these systems. Take for example web scraping, my present area of interest. People are tagging documents dealing with web scraping with the following tags: web scraping, screen scraping, web mining, web extraction, data extraction, web data extraction, html extraction, html mining, html scraping, scraping, scrape, extract, html data mining - just off the top of my head. I did not think about them really hard - in fact there are many more. It would avoid a lot of confusion if all these terms were represented by a common expression - say ‘web scraping’.
Proposed solution: this is a really hard nut to crack, stemming from the fact that e.g. screen scraping can mean different things to different people. However, a heuristic could look up all the articles which are tagged with e.g. web scraping - and find the synonyms by going through the other tags on all those articles. It is not really hard to find out that ‘web scraping’ and ‘ruby’ or ‘subversion’ are not synonyms - however, after scanning enough documents, the link between ‘web scraping’ and ‘html scraping’ or ‘web data mining’ should be found by the system. The synonyms could also be exploited in the clustered tag graph.
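One simple way to approximate this heuristic is to compare the sets of documents that carry each tag: two tags that keep landing on the same documents are synonym candidates, while a broad tag like ‘ruby’ overlaps only a little. The sketch below uses Jaccard similarity for this. The toy data and the 0.5 threshold are assumptions for illustration, and Jaccard is my choice of measure, not something the existing sites are known to use.

```ruby
require 'set'

# Toy data: ids of the documents carrying each tag, aggregated over all users.
TAGGED_DOCS = {
  'web scraping'    => Set[1, 2, 3, 4, 5],
  'html scraping'   => Set[2, 3, 4, 5],
  'screen scraping' => Set[1, 2, 3, 5, 6],
  'ruby'            => Set[1, 5, 7, 8, 9, 10, 11, 12, 13, 14],
  'subversion'      => Set[15, 16, 17]
}.freeze

# Jaccard similarity: size of the intersection divided by size of the union.
def jaccard(a, b)
  union = (a | b).size
  union.zero? ? 0.0 : (a & b).size.to_f / union
end

# Tags whose document sets overlap strongly with those of the given tag,
# best match first.
def synonym_candidates(tag, index, threshold: 0.5)
  docs = index.fetch(tag)
  index.reject { |other, _| other == tag }
       .map { |other, other_docs| [other, jaccard(docs, other_docs)] }
       .select { |_, score| score >= threshold }
       .sort_by { |_, score| -score }
       .map(&:first)
end

puts synonym_candidates('web scraping', TAGGED_DOCS).inspect
```

On this data ‘html scraping’ and ‘screen scraping’ come out as candidates, while ‘ruby’ and ‘subversion’ do not. Whether a candidate really is a synonym would still need a human decision, for the reason given above.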

Voting

The idea of voting for articles as a means to get them on the front page (as opposed to editor-monitored, closed systems) seemed revolutionary from the beginning, and definitely the right way to rank the articles in a people-centered way - after all it is really simple: people vote on stuff that they like and find interesting, which means that the most interesting article gets to the front page. Or does it? Let’s examine this a bit…

Back to the good old web 1.0 - when Tim O’Reilly popularized the term Web2.0 in 2005, he presented a few examples of typical web1.0 vs web2.0 solutions, for example: Britannica Online vs Wikipedia, mp3.com vs napster etc. I wonder why he did not come up with slashdot (content filtered by editors) vs digg (content voted up by people). At that time everybody was so euphoric about Web2.0 that no one would have questioned this claim (neither did I at that time). However, it seems to me that after these sites evolved a bit, there is basically not that much difference between the two: according to this article, Top 100 Digg Users Control 56% of Digg’s HomePage Content. So instead of 10-or-something-like-that professionals, 100-or-something-like-that amateurs decide about the content of digg. So where is that enormous difference after all? Wisdom of crowds? Maybe wisdom of a few hundred people. Because of the algorithms used, if you don’t have much time to submit, digg, comment or look for articles all the time (read: a few hours a day) like these top diggers do, your vote won’t count for much anyway. Digg (and, as I have read, also reddit - and possibly sooner or later this fate awaits more sites?) became a place where “Everyone is equal, but some are more equal than others…”.
Proposed solution: None. I guess I will be attacked by a horde of web2.0-IloveDigg fanatics claiming that this is absolutely untrue, and since I have no real proof of this point (and don’t have the time/tools to produce one) I am not going to argue here.
Too easy or too hard to get to the front page - The consequence of some of the above points (Information overload; Good place, wrong time; Back to the good old web 1.0) is that if the limit to get to the front page is too high, it is virtually impossible to reach it (unless you are part of a digg cartel or you have a page which has a lot of traffic anyway + a digg button). However, if the limit is too low (hence it is too easy to get to the front page), people might be tempted to trick the system (by creating more accounts and voting for themselves, for example) just to get to the front page - which will result in a lot of low quality sites making it there. Though I don’t own a social bookmarking site, I bet that finding the right height of the bar is extremely hard - and it even has to change from time to time in response to more and more submissions, SEO tricks etc.
Proposed solution: A well-balanced mixture of silicon and carbon. Machines can do most of the job by analysing logs, the activities of the user on the page, thumbs up/down received from the user, articles submitted/voted/commented on and other types of usage mining. However, machines alone are definitely not enough (since they don’t have the foggiest idea about what’s in an article) - a lot of input is needed from humans, too: from the users on the one side (voting, burying, peer review etc.) and from the editors on the other. However, I think that all this is done already - and the result is not exactly perfect, I guess mainly because of the information overload - 5000 submissions a day (or 150,000 a month) is very hard to deal with…
Votes of experts should count more - In my opinion, it is not right that if a 12 year old script kiddie votes down an article and an expert with 20 years of experience votes it up, their votes are taken into account with equal weight. OK, I know there is peer review, and if the 12 year old makes a lot of stupid moves, he will be modded down - so he will open a new account and begin the whole thing again from scratch. On the other hand, the expert may not have time to hang around on digg and similar sites (because he is hacking up the next big thing instead of browsing) and therefore he might not get a lot of recognition from his peers on the given social site - which does show that he is an infrequent digg/dzone/whatever user, but tells nothing about his tech abilities.
Proposed solution: I think it is too late for this with the existing sites, but I would like to see a community of real tech people, developers, entrepreneurs and hackers of all sorts. How could this be done? Well, people should show what they have done so far - their blog, released open source software, mailing list contributions, sites they designed or any other proof that they are also doing something and not just criticizing others (it seems to me that the most abrasive people on-line are always those who do not have a blog, did not hack up something relevant or did not prove their abilities in any relevant way). This would also ensure that only one account belongs to one physical person. I know that this may sound like too much work (on both the site maintainer’s and the users’ side), but it could lay a foundation for a real tech-focused (or xyz-focused) social site. Of course this would not lock out people without any tangible proof of their skills - however, their votes would count less.
Everything can be hot only once - Most of the articles posted to the social bookmarking sites are ‘seasonal’ (i.e. they are interesting just for a given time period, or in conjunction with something hot at the moment) or news (like announcements, which are interesting for just a few days). On the other hand, there are also articles which are relevant for much longer - maybe months, years or even decades. However, because of the nature of these sites, they are out of luck - they can have their few days of fame only once.
One could argue that this is fine as it is - however, I am not sure about it. Take for example my popular article on Screen scraping in Ruby/Rails: I am getting a few thousand visitors from google and Wikipedia every month (which suggests that the article is still quite relevant) and close to zero from all the social sites, despite the fact that it was quite hot upon its arrival. Moreover, I have updated it with current information since its first appearance, so it is not even the same article anymore, but a newer, more relevant one.
Proposed solution: Let me demonstrate this on a del.icio.us example, where a certain number of recent bookmarks is needed to get to the ‘popular’ section (something similar to the notion of the front page on digg-style sites). In my opinion, this count should also depend on the number of bookmarks already received. Let’s see an example: Suppose a brand new article needs 50 recent bookmarks to get to del.icio.us/popular. After it gets there and a great stir is created around it, it gets bookmarked 300 times. Then, for the next 50 days it does not receive that much attention and gets 1 bookmark a day on average, so it has 350 bookmarks altogether. However, after these 50 days, for some reason (e.g. some related topic gets hot) 30 people bookmark it in a few hours. In my opinion, it should get popular again - and moreover, with these 30 (and not 50) bookmarks - because it was already popular once. This threshold should then be adjusted after the article gets popular again - if this happens and people don’t really bookmark it anymore despite it being featured on /popular, it should again need 50 (or more) bookmarks.
On digg style pages I would create a ‘sticky’ section for articles that are informative and interesting over a longer timespan. I would add another counter to the article (‘stickiness’) which should be voted up by both editors and users in a similar way as ‘hotness’ is now. Of course it is very subjective what should be sticky - it is easy to know that news is not sticky, but harder to decide this for other kinds of material.
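The del.icio.us example can be written down as a small rule. The numbers 50 and 30 come from the example above. The record type `Bookmarkable`, its fields and the way a "flopped" comeback is tracked are hypothetical, introduced only to make the sketch runnable.

```ruby
BASE_THRESHOLD      = 50  # recent bookmarks a brand new article needs
RETURNING_THRESHOLD = 30  # enough for an article that was popular before

# Hypothetical record: how often the article made it to /popular, and
# whether its last appearance there failed to attract new bookmarks.
Bookmarkable = Struct.new(:title, :times_popular, :last_run_flopped,
                          keyword_init: true)

# Number of recent bookmarks the article currently needs.
def popularity_threshold(item)
  return BASE_THRESHOLD if item.times_popular.zero?  # never popular so far
  return BASE_THRESHOLD if item.last_run_flopped     # the discount is used up
  RETURNING_THRESHOLD
end

def popular?(item, recent_bookmarks)
  recent_bookmarks >= popularity_threshold(item)
end

fresh     = Bookmarkable.new(title: 'Brand new', times_popular: 0,
                             last_run_flopped: false)
evergreen = Bookmarkable.new(title: 'Old favourite', times_popular: 1,
                             last_run_flopped: false)

puts popular?(fresh, 30)      # false: a new article still needs 50
puts popular?(evergreen, 30)  # true: 30 are enough the second time
```

The same structure would carry a ‘stickiness’ counter as well. How quickly the discount should be withdrawn after a weak comeback is a tuning question this sketch does not try to answer.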

Since I never had the chance to try these ideas in practice, I can’t tell how many of them (and to what extent) would work in real life. I guess there is no better method to find this out than to actually implement these features… and the other ones coming in vol. 2!

In the next part I would like to take a look at the remaining problems, connected with searching and navigation, comments and discussion, the human factor and miscellaneous problems which did not fit into any other category. Suggestions are warmly welcome, so if there are some interesting ideas, I will try to incorporate them into the next (or this) installment!



Sunday, February 4th, 2007

This article is a follow-up to the quite popular first part on web scraping - well, sort of. The relation is closer to that between Star Wars I and IV - i.e., in chronological order, the 4th comes first. To continue the analogy, I am probably in the same shoes as George Lucas was after creating the original trilogy: the series became immensely popular and there was demand for more - in both quantity and depth.

After I realized - not exclusively, but also - through the success of the first article that there is a need for this sort of stuff, I began to work on the second part. As stated at the end of the previous installment, I wanted to create a demo web scraping application to show some advanced concepts. However, I left out a major coefficient from my future-plan-equation: the power of Ruby.

Basically this web scraping code was my first serious Ruby program: I had come to know Ruby just a few weeks earlier, and I decided to try it out on a real-life problem. After hacking on this app for a few weeks, suddenly a reusable web scraping toolkit - scRUBYt! - began to materialize, which caused a total change of plan: instead of writing a follow-up, I decided to finish the toolkit, sketch a big picture of the topic, place scRUBYt! inside this frame and use it to illustrate the theoretical things described here.

The Big Picture: Web Information Acquisition

The whole art of systematically getting information from the Web is called ‘Web information acquisition’ in the literature. The process consists of 4 parts (see the illustration), which are executed in this order: Information Retrieval (IR), Information Extraction (IE), Information Integration (II) and Information Delivery (ID).

Information Retrieval

Navigate to and download the input documents which are the subject of the next steps. This is probably the most intuitive step - clearly, the information acquisition system first has to be pointed to the document which contains the data before it can perform the actual extraction.

The absolute majority of the information on the Web resides in the so-called deep web - backend databases and various legacy data stores which are not contained in static web documents. This data is accessible via interaction with web pages (which serve as a frontend to these databases) - by filling in and submitting forms, clicking links, stepping through wizards etc. A typical example could be an airport web page: an airport has the schedules of all the flights it offers in its databases, yet you can access this information only on the fly, by submitting a form containing your concrete request.

The opposite of the deep web is the surface web - static pages with a ‘constant’ URL, like the very page you are reading. In such a case, the information retrieval step consists of just downloading the URL. Not a really tough task.

However, as I said two paragraphs earlier, most of the information is stored in the deep web - various actions, like filling in input fields, setting checkboxes and radio buttons, clicking links etc., are needed to get to the actual page of interest, which can then be downloaded as the result of the navigation.

Besides the fact that this is not trivial to do automatically from a programming language, simply because of the nature of the task, there are a lot of pitfalls along the way, stemming from the fact that the HTTP protocol is stateless: the information provided to one request is lost when making the next request. To remedy this problem, sessions, cookies, authorization, navigation history and other mechanisms were introduced - so a decent information retrieval module has to take care of these as well.
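To make the statelessness point concrete, here is a minimal sketch of the bookkeeping a retrieval module has to do for cookies alone: remember what the server set in one response and send it back with the next request. The class name `TinyCookieJar` is made up for this illustration, and the jar deliberately ignores attributes such as Path, Domain, Expires and Secure, which a real client has to honour.

```ruby
# Minimal cookie jar: stores name=value pairs from Set-Cookie header values
# and replays them as one Cookie header value.
class TinyCookieJar
  def initialize
    @cookies = {}
  end

  # set_cookie_headers: array of raw Set-Cookie header values of a response.
  def store(set_cookie_headers)
    set_cookie_headers.each do |header|
      # Only the first "name=value" segment matters here; attributes after
      # the first ';' are dropped.
      name, value = header.split(';', 2).first.strip.split('=', 2)
      @cookies[name] = value.to_s
    end
  end

  # Value for the Cookie header of the next request, nil if the jar is empty.
  def header
    return nil if @cookies.empty?

    @cookies.map { |name, value| "#{name}=#{value}" }.join('; ')
  end
end

jar = TinyCookieJar.new
jar.store(['session_id=abc123; Path=/; HttpOnly', 'lang=en'])
puts jar.header
```

With Ruby's standard Net::HTTP, the input would come from `response.get_fields('set-cookie')` and the result would be assigned to `request['Cookie']` before sending the next request. Sessions, redirects and authorization each need similar care.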

Fortunately, in Ruby there are packages which offer exactly this functionality. Probably the best known is WWW::Mechanize, which is able to automatically navigate through Web pages as a result of interaction (filling in forms etc.) while keeping cookies, automatically following redirects and simulating everything else that a real user (or the browser in response to that user) would do. Mechanize is awesome - from my perspective it has one major flaw: you cannot interact with JavaScript websites. Hopefully this feature will be added soon.

Until that happy day, if someone wants to navigate through JS powered pages, there is a solution: (Fire)Watir. Watir is capable of doing similar things to Mechanize (I never did a head-to-head comparison, though it would be interesting) with the added benefit of JavaScript handling.

scRUBYt! comes with a navigation module, which is built upon Mechanize. In future releases I am planning to add FireWatir, too (just because of the JavaScript issue). scRUBYt! is basically a DSL for web scraping with a lot of heavy lifting behind the scenes. Though the real power lies in the extraction module, there are some goodies in the navigation module, too. Let’s see an example!

Goal: Go to amazon.com. Type ‘Ruby’ into the search text field. To narrow down the results, click ‘Books’, then for further narrowing ‘Computers & Internet’ in the left sidebar.

Realization:

  fetch           'http://www.amazon.com/'
  fill_textfield  'field-keywords', 'ruby'
  submit
  click_link      'Books'
  click_link      'Computers & Internet'

Result: This document.

As you can see, scRUBYt!’s DSL hides all the implementation details, making the description of the navigation as easy as possible. The result of the above few lines is a document - which is automatically fed into the scraping module, but this is already the topic of the next section.
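For readers wondering how such a DSL can work at all in Ruby, here is a toy sketch: a block is evaluated with `instance_eval`, so bare words like `fetch` become method calls on an object that records them. Only the five method names are taken from the example. The class `NavigationScript` and everything inside it are hypothetical and should not be read as scRUBYt!'s actual implementation.

```ruby
# Records navigation steps instead of performing them. A real navigation
# module would hand each step to an HTTP agent.
class NavigationScript
  attr_reader :steps

  def initialize(&block)
    @steps = []
    instance_eval(&block)  # run the block with self set to this object
  end

  def fetch(url)
    @steps << [:fetch, url]
  end

  def fill_textfield(name, value)
    @steps << [:fill_textfield, name, value]
  end

  def submit
    @steps << [:submit]
  end

  def click_link(text)
    @steps << [:click_link, text]
  end
end

script = NavigationScript.new do
  fetch           'http://www.amazon.com/'
  fill_textfield  'field-keywords', 'ruby'
  submit
  click_link      'Books'
  click_link      'Computers & Internet'
end

script.steps.each { |step| p step }
```

Recording the steps first and executing them later keeps the description of the navigation separate from the library that performs it, which is one plausible reason to structure a DSL this way.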

Information Extraction

I think there is no need to write about why one needs to extract information from the Web today - the ‘how’ is a much more interesting question.

Why is Web extraction such a tedious task? Because the data of interest is stored in HTML documents (after navigating to them, that is), mixed with other stuff like formatting elements, scripts or comments. Because the data lacks any semantic description, a machine has no idea what a web shop record is or what a news article might look like - it just perceives the whole document as a soup of tags and text.

Querying objects in systems which are formally defined and thus understandable to a machine is easy. For instance, if I want to get the first element of an array in Ruby, I can do it easily like this:

my_array.first


Another example of a machine-queryable structure could be an SQL table: to pull out the elements matching the given criteria, all that needs to be done is to execute an SQL query like this:

SELECT name FROM students WHERE age > 25

Now, try to do similar queries for a Web page. For example, suppose that you have already navigated to an eBay page by searching for the term ‘Notebook’. Say you would like to execute the following query: ‘give me all the records with a price lower than $400’ (and get the results into a data structure of course - not rendered inside your browser, since that works naturally without any problems).

The query was definitely an easy one, yet without implementing a custom script that extracts the needed information and saves it to a data structure (or using stuff like scRUBYt! - which does exactly this for you) you have no chance to get this information from the source code.
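To make the contrast concrete, here is a minimal sketch of how trivial the query becomes once the records already live in a data structure. The records and field names are made up for illustration - they do not come from a real eBay page, and getting them out of the page is exactly the hard part.

```ruby
# Hypothetical records, as they might look after a result page was scraped.
records = [
  { title: 'Notebook A', price: 349.0 },
  { title: 'Notebook B', price: 899.0 },
  { title: 'Notebook C', price: 399.99 }
]

# 'Give me all the records with a price lower than $400.'
cheap = records.select { |record| record[:price] < 400 }

cheap.each do |record|
  puts format('%s: $%.2f', record[:title], record[:price])
end
```

The query itself is a one-liner; all the effort goes into producing the records array in the first place.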

There are ongoing efforts to change this situation - most notably the semantic Web, common ontologies, different Web2.0 technologies like taxonomies, folksonomies, microformats or tagging. The goal of these techniques is to make documents understandable for machines and so eliminate the problems stated above. While there are some promising results in this area already, there is a long way to go until the whole Web is such a friendly place - my guess is that this will happen around Web88.0 in the optimistic case.

However, at the moment we are only at version 2.0 (at most), so if we would like to scrape a web page for whatever reason today, we need to cope with the difficulties we are facing. I wrote an overview on how to do this with the tools available in Ruby (update: there is a new kid on the block - Hpricot - which is not mentioned there).

The rough idea of those packages is to parse the Web page source into some meaningful structure (usually a tree), then provide a querying mechanism (like XPath, CSS selectors or some other tree navigation model). You could think now: ‘A-ha! So actually a web page can be turned into something meaningful for machines, and there is a formal model to query this structure - so where is the problem described in the previous paragraphs? You just write queries like you would in the case of a database, evaluate them against the tree or whatever and you are done’.
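For a well-behaved page, that naive approach might look like the following minimal sketch. It uses REXML (which ships with Ruby) purely for illustration; the markup is made up and deliberately well-formed XML, since REXML is a strict XML parser and real-world HTML usually needs a more forgiving one.

```ruby
require 'rexml/document'

# A tiny, made-up page. Real pages are far messier than this.
html = <<~HTML
  <html>
    <body>
      <table>
        <tr><td>Canon</td><td>$249.99</td></tr>
        <tr><td>Nikon</td><td>$999.00</td></tr>
      </table>
    </body>
  </html>
HTML

# Step 1: parse the source into a tree.
doc = REXML::Document.new(html)

# Step 2: query the tree - here, the first cell of every table row.
names = REXML::XPath.match(doc, '/html/body/table/tr/td[1]').map(&:text)

puts names.inspect
```

Note that the query talks about rows and cells, not about cameras or brands - the tree knows nothing about what the data means.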

The problem is that the machine’s understanding of the page and human thinking about querying this information are entirely different, and there is no formal model (yet) to eliminate this discrepancy. Humans want to scrape ‘web shop records with Canon cameras with maximal price $1000’, while the machine sees this as ‘the third <td> tag inside the eighth <tr> tag inside the fifth <table> … (lots of other tags) inside the <body> tag inside the <html> tag, where the text of the seventh <td> tag contains the string ‘Canon’ and the text of the ninth <td> is not bigger than 1000’ (to even get the value 1000 you have to use a regular expression or something similar to get rid of the currency symbol, which is most probably present, and other possible additional information).
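The last point, getting a usable number out of a price cell, can be sketched in a few lines. This is only an illustration: the rows are made up, parse_price is a hypothetical helper, and it assumes that ‘.’ is the decimal separator and ‘,’ groups thousands, which is not true for every shop.

```ruby
# Turn a scraped price string such as '$1,000.00' into a number.
# Assumes '.' is the decimal separator and ',' groups thousands.
def parse_price(text)
  cleaned = text.gsub(/[^\d.]/, '') # drop currency symbols, commas, spaces
  cleaned.empty? ? nil : cleaned.to_f
end

# Hypothetical table rows, already reduced to the text of two cells.
rows = [
  ['Canon camera A', '$249.99'],
  ['Nikon camera B', '$999.00'],
  ['Canon camera C', 'USD 2,499.00']
]

# 'Canon cameras with maximal price $1000', expressed on cell text.
canon_under_1000 = rows.select do |name, price|
  value = parse_price(price)
  name.include?('Canon') && value && value <= 1000
end

puts canon_under_1000.inspect
```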

So why is this so easy with a database? Because the data stored in there has a formal model (specified by the CREATE TABLE statement). Both you and the computer know exactly what a Student or a Camera looks like, and both of you are speaking the same language (most probably an SQL dialect).

This is totally different in the case of a Web page. A web shop record, a camera detail page or a news item can look like anything, and your only chance to find out what it looks like on the concrete Web page of interest is to exploit its structure. This is a very tedious task on its own (as I have said earlier, a Web page is a mess of real data, formatting, scripts, stylesheet information…). Moreover, there are further problems: for example, web shop records need not be uniform even inside the same page - certain records can miss some cells which others have, or some may keep part of their information on a detail page while others do not - so in some cases, identifying a data model is impossible or very complicated - and I did not even talk about scraping the records yet!

So what could be the solution?

Intuitively, there is a need for an interpreter which understands the human query and translates it to XPaths (or any querying mechanism a machine understands). This is more or less what scRUBYt! does. Let me explain how - it will be easiest through a concrete example.

Suppose you would like to monitor stock information on finance.yahoo.com! This is how I would do it with scRUBYt!:

#Navigate to the page
fetch 'http://finance.yahoo.com/'

#Grab the data!
stockinfo do
  symbol  'Dow'
  value   '31.16'
end

output:

  <root>
    <stockinfo>
      <symbol>Dow</symbol>
      <value>31.16</value>
    </stockinfo>
    <stockinfo>
      <symbol>Nasdaq</symbol>
      <value>4.95</value>
    </stockinfo>
    <stockinfo>
      <symbol>S&P 500</symbol>
      <value>2.89</value>
    </stockinfo>
    <stockinfo>
      <symbol>10-Yr Bond</symbol>
      <value>0.0100</value>
    </stockinfo>
  </root>

Explanation: I think the navigation step does not require any further explanation - we fetched the page of interest and fed it into the scraping module.

The scraping part is more interesting at the moment. Two things happened here: we have defined a hierarchical structure of the output data (like we would define an object - we are scraping StockInfos which have Symbol and Value fields, or children), and showed scRUBYt! what to look for on the page in order to fill the defined structure with relevant data.

How did I know I had to specify ‘Dow’ and ‘31.16’ to get these nice results? Well, by manually pointing my browser to ‘http://finance.yahoo.com/’, observing an example of the stuff I wanted to scrape, and leaving the rest to scRUBYt!. What actually happens under the hood is that scRUBYt! finds the XPath of these examples, figures out how to extract the similar ones and arranges the data nicely into a result XML (well, there is much more going on, but this is the rough idea). If anyone is interested, I can explain this in a further post.
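To give a feeling for that rough idea, here is a toy sketch of learning a path from examples. It is not scRUBYt!’s actual implementation: the page is a made-up, well-formed snippet, the helper names (find_by_text, steps_to) are hypothetical, and REXML is used only because it ships with Ruby.

```ruby
require 'rexml/document'

# A made-up, heavily simplified stand-in for the real page.
page = <<~HTML
  <html><body><table>
    <tr><td>Dow</td><td>31.16</td></tr>
    <tr><td>Nasdaq</td><td>4.95</td></tr>
    <tr><td>S&amp;P 500</td><td>2.89</td></tr>
  </table></body></html>
HTML
doc = REXML::Document.new(page)

# Find the first element whose text is exactly the given example.
def find_by_text(doc, example)
  REXML::XPath.match(doc, '//*').find { |el| el.text == example }
end

# Build the path steps from the root down to the element, adding an
# index only where same-named siblings make one necessary.
def steps_to(element)
  steps = []
  until element.is_a?(REXML::Document)
    siblings = element.parent.elements.to_a(element.name)
    position = siblings.index(element) + 1
    steps.unshift(siblings.size > 1 ? "#{element.name}[#{position}]" : element.name)
    element = element.parent
  end
  steps
end

symbol_steps = steps_to(find_by_text(doc, 'Dow'))
value_steps  = steps_to(find_by_text(doc, '31.16'))

# The steps both examples share lead to the record that contains them.
common = symbol_steps.zip(value_steps).take_while { |a, b| a == b }.map(&:first)

# Generalize: drop the index of the record step so it matches every row.
record_path = '/' + (common[0...-1] + [common.last.sub(/\[\d+\]\z/, '')]).join('/')
symbol_path = symbol_steps.drop(common.size).join('/')
value_path  = value_steps.drop(common.size).join('/')

records = REXML::XPath.match(doc, record_path).map do |row|
  { symbol: REXML::XPath.first(row, symbol_path).text,
    value:  REXML::XPath.first(row, value_path).text }
end

puts record_path
puts records.inspect
```

The sketch only works because every row here has the same shape; coping with rows that differ is where the real work lies.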

You could think now ‘O.K., this is very nice and all, but you have been talking about monitoring and I don’t really see how - the value 31.16 will change sooner or later and then you have to go to the page and re-specify the example again - I would not call this monitoring’.

Great observation. It’s true that scRUBYt! would not be of much use if changing examples were not handled (unless you would like to get the data only once, that is) - fortunately, the situation is dealt with in a powerful way!

Once you run the extractor and you think the data it scrapes is correct, you can export it. Let’s see what the exported finance.yahoo.com extractor looks like:

#Navigate to the page
fetch 'http://finance.yahoo.com/'

#Construct the wrapper
stockinfo "/html/body/div/div/div/div/div/div/table/tbody/tr" do
  symbol "/td[1]/a[1]"
  value "/td[3]/span[1]/b[1]"
end

As you can see, there are no concrete examples any more - the system generalized the information and now you can use this extractor to scrape the information automatically whenever you want - until the moment the guys at Yahoo change the structure of the page - which fortunately does not happen every other day. In that case the extractor should be regenerated with up-to-date examples (in the future I am planning to add automatic regeneration in such cases) and the fun can begin from the start once again.

This example just scratched the surface of what scRUBYt! is capable of - there is tons of advanced stuff to fine-tune the scraping process and get the data you need. If you are interested, check out http://scrubyt.org for more information!

Conclusion

The first two steps of information acquisition (retrieval and extraction) deal with the question ‘How to get the data I am interested in (querying)’. Up to the present version (0.2.0) scRUBYt! contains just these two steps - however, to do even these properly, I will need a lot of testing, feedback, bug fixing, stabilization, and heaps of new features and enhancements - because as you have seen, web scraping is not a straightforward thing to do at all.

The last two steps (integration and delivery) address the question ‘what to do with the data once it is collected, and how to do that (orchestration)’. These facets will be covered in a future installment - most probably once scRUBYt! contains these features as well.

If you liked this article and you are interested in web scraping in practice, be sure to install scRUBYt! and check out the community page for further instructions - the site is just taking off, so there is not too much yet - but hopefully enough to get you started. I am counting on your feedback, suggestions, bug reports, extractors you have created etc. to enhance both scrubyt.org and the scRUBYt! user experience in general. Be sure to share your experience and opinion!

Posted in Ruby, Web2.0