Netty-gritty: 2007

Monday

Tribute

Tribute to the Founding Fathers of the Web and the Net
What better way to start off Netty-gritty, than with a tribute to the Founding Fathers of the Web and the Net! - Tim Berners-Lee and Vint Cerf

Sir Timothy "Tim" John Berners-Lee, is considered the inventor of the World Wide Web.
In 1980, while at CERN, Berners-Lee proposed a project based on the concept of hypertext, to facilitate sharing and updating information among researchers. With help from Robert Cailliau he built a prototype system named ENQUIRE. He used similar ideas to those underlying the Enquire system to create the World Wide Web, for which he designed and built the first web browser and editor, and the first Web server.
Currently: Sir Timothy Berners-Lee is the director of the World Wide Web Consortium (which oversees the continued development of WWW).

Vinton Gray Cerf is referred to as one of the "founding fathers of the Internet" for his key technical and managerial role, together with Bob Kahn, in the creation of the Internet and the TCP/IP protocols which it uses.
Currently: Vint Cerf is the Chairman of the Internet Corporation for Assigned Names and Numbers (ICANN) Board of Directors. He is also the “Chief Internet Evangelist” at Google.

Sunday

Typosquatting

Typosquatting, also called URL hijacking, involves creating domain names that is a variation on a popular domain name with the expectation that the site will get traffic off of the original sight because of mistakes such as typographical errors (typos) made by Internet users while typing a website address into a web browser. Should a user accidentally enter an incorrect website address, they may be led to an alternative website owned by a cybersquatter.

Once in the typosquatter's site, the user may also be tricked into thinking that they are in fact in the real site; through the use of similar logos, website layouts or content. Some such sites may have been created with malicious intent to spam users with a popups and foist executable downloads. More common are benign sites, selling advertising to firms based on keywords similar to the misspelled word in the domain.

A company may try and preempt typosquatting by obtaining a number of websites with common misspellings and redirect them to the main, correctly spelled website. For example www.gooogle.com, www.goolge.com, www.gogle.com www.gewgle.com, and others, all redirect to http://www.google.com.

Egs of typosquatting: wikiepdia.org, wilipedia.org, goggle.com, googlr.com
(Such sites may be malicious in intent, check them out at your own risk!)

Saturday

CAPTCHA

While browsing the net (especially webmail services or blogs) you would have come across images with distorted letters presented to you, which you have to type in to be able to proceed. Ever wondered what they are called, and what they are for?

CAPTCHA stands for ‘Completely Automated Public Turing test to tell Computers and Humans Apart’.
A CAPTCHA is a type of challenge-response test used in computing to determine whether the user is human. As computers are unable to solve the CAPTCHA, any user entering a correct solution is presumed to be human.

A CAPTCHA is sometimes described as a reverse Turing test, because it is administered by a machine and targeted to a human, in contrast to the standard Turing test that is administered by a human and targeted to a machine.

CAPTCHAs can be deployed to protect systems vulnerable to e-mail spam, such as the webmail services of Gmail, Hotmail etc. CAPTCHAs have also found active use in stopping automated posting to blogs or forums, which may be for commercial promotion, harassment or vandalism.

So next time you 'solve' and type in the obscure letters presented to you in an image, you know what it is for and you have a name for it. :)

Want a hands-on experience of CAPTCHA right now? Try posting a comment on this blog :)

Friday

Internet bots

Internet bots, or simply bots, are software program that operates as an agent for a user and simulates human activity or run automated tasks over the internet. Typically, bots perform tasks that are structurally repetitive, at a much higher rate than would be possible for a human.

On the Internet, the most ubiquitous bots are the programs called spiders or crawlers, which are search engine programs that go out on the internet, follow links, and read through the pages in order to index the site in a search engine.

Bots may also be implemented where a response speed faster than that of humans is required (e.g., gaming bots and auction-site robots) or in situations where the emulation of human activity is required. A chatterbot is a program that can simulate talk with a human being. A shopbot is a program that shops around the Web on your behalf and locates the best price for a product you're looking for.

More malicious use of bots is the coordination and operation of an automated attack on networked computers, such as a denial-of-service attack. Bots can also be used to commit click fraud (to repeatedly click on ad links to bolster revenue, or to bolster the popularity of a site).

A spambot is an internet bot that attempts to spam large amounts of content on the Internet, usually adding advertising links.

Now, we get a clearer picture of how CAPTCHAs help protect sites (eg: Orkut) from spamming bots, don't we?

Thursday

Web 2.0

Although the term seems to suggest a new version of the World Wide Web, it does not refer to an update to any technical specifications, but to changes in the ways application developers and end-users perceive and use the internet.

Web 2.0 applications tend to interact much more with the end-user. With web 2.0, the end-user is no more just a user of the application but an active contributor of content and an integral part of the data driving the application, whether it is through Blogging (blogger), creating articles through Wiki (wikipedia), tagging contents (del.ico.us, digg, technorati), providing content (flickr, youtube, twitter) or doing Podcast.

Social nature of these web 2.0 applications have become phenomenally successful, allowing the application to leverage on the user-driven data (orkut).

So, that would lead us to ask what Web 1.0 is right? There was no Web 1.0 until Web 2.0. In other words, Web 1.0 was coined to differentiate existing applications from Web 2.0. Web 1.0 had static content which, according to Berners-Lee, could be considered the "read-only web". In other words, the early web allowed us to search for information and read it. There was very little in the way of user interaction or content contribution.

Web 2.0, would be "read-write" web if we stick to Berners-Lee's method of describing it.
For example, Kodakexpress is Web 1.0, while Flickr is Web 2.0

So, what next? Web 3.0, ofcourse! :)

The Web 3.0 would be something akin to a "read-write-execute" web.

Web 2.0 or 3.0, you don't need to upgrade anything or get new software. These are abstract ideas used to contemplate how the web applications are evolving from the perspective of application developers and end-users, in how they use the internet.

Wednesday

Myth busting

Myth busting the ‘404 Page not found’ story

Whenever we type in a non-existent web address (may be the url typed was incomplete or misspelled or the page no longer exists), we get a ‘404 - Page not found’ error. Which we have seen umpteen number of times. But why the error code '404'?

When internet was in its infancy, and was confined to CERN's internal network, in an office on the fourth floor (room 404), they placed the World Wide Web's central database: any request for a file was routed to that office, where requested files were located and transferred over the network, to the person who made that request. When faulty requests were made, for which files were not present, the standard message was: 'Room 404: file not found'. The room numbers remained in the error codes in the official release of HTTP when the Web left CERN to conquer the world, and are still displayed when a browser makes a faulty request to a Web server.

I am sure all of us would have heard this story. But is this true? Turns out, its just a popular myth.

Why 404 then? HTTP status codes were established by the World Wide Web Consortium (W3C) in 1992, as a part of the HTTP 0.9 spec. They were defined by Tim Berners-Lee. Berners-Lee based the HTTP status codes on FTP status codes.

The first '4' indicates a client error. The server is saying that you've done something wrong, such as misspelled URL or request to a page which is no longer there. Conversely, a 5xx error indicates a server-side problem. The middle '0' refers to a general syntax error. The last '4' just indicates the specific error in the group of 40x, which also includes 400: Bad Request, 401: Unauthorized, etc.

Further, there is no room 404 in CERN! The offices in building 4 at CERN start at 410 and work upwards.

So, myth busted :)

Tuesday

RSS Feed

RSS is used to describe the technology used in creating feeds.

A feed, also known as RSS feed, XML feed, syndicated content, or web feed, is frequently updated content published by a website. It is usually used for news and blog websites, but can also be used for distributing other types of digital content, including pictures, audio or video. Feeds can also be used to deliver audio content (usually in MP3 format) which you can listen to on your computer or MP3 player.

An RSS document, which is called a "feed", "web feed" or "channel" contains either a summary of content from an associated web site or the full text.

By retrieving the latest content, from the sites you are interested in, RSS feed allows you to easily stay informed. It eliminates the need to visit each site manually and there is no need to subscribe to each site's email newsletter.

RSS content can be read using software called an "RSS reader", "feed reader" or an "aggregator". The user subscribes to a feed by entering the feed's link into the reader or by clicking an RSS icon in a browser that initiates the subscription process. The reader checks the user's subscribed feeds regularly for new content, downloading any updates that it finds.

Some history: RDF Site Summary, the first version of RSS, was created by Ramanathan V. Guha at Netscape in March 1999. This version became known as RSS 0.9. In July 1999, Dan Libby produced a new version, RSS 0.91, and renamed RSS to Rich Site Summary. The RSS-DEV Working Group, introduced RSS 1.0 in December 2000, and reclaimed the name RDF Site Summary. In September 2002, Winer released a major new version of the format, RSS 2.0, that redubbed its initials Really Simple Syndication.

So, "RSS" in general, may to refer to any of the following formats:
a. Really Simple Syndication (RSS 2.0)
b. RDF Site Summary (RSS 1.0 and RSS 0.90)
c. Rich Site Summary (RSS 0.91)

Monday

Domain hack

In the web world, there are hundreds of thousands of sites/ domains, so how do you differentiate your website from the others. Go in for Domain hack :)

A domain hack is an unconventional domain name that combines domain labels, especially the top-level domain (TLD), to spell out the full "name" or title of the domain, making a kind of pun.

For example the domains http://blo.gs makes use of the TLD .gs (South Georgia and the South Sandwich Islands) to spell "blogs", and http://chronolo.gy uses the TLD .gy (Guyana) to spell "chronology".

The domains del.icio.us, an.geli.ca, and cr.yp.to make use of the TLDs .us (United States) , .ca (Canada) and .to (Tonga) to spell "delicious", "angelica", and "crypto" respectively.

In the context of domain hack, the "hack" represents a clever trick, and not a break-in (as in security).

Del.icio.us, is a social bookmarking site created by internet guru Joshua Schachter. The name was catchy, and it quickly became an internet celebrity. Del.icio.us, though not the first domain of this nature, is the best-known and most frequently-accessed domain hack.

The demand for uniqueness and visibility in the cyber world gave rise to an ever growing list of domain hacks such as magnol.ia, win.gs, mir.aculo.us, videoDi.sk, inter.net, Who.is, buyhom.es, groo.ms, Whocalled.us – to name a few.

We all know, the TLD for India is .in. Here are some domain hacks on .in :) http://adm.in (admin), http://einste.in (Einstein) and http://doma.in (domain). And hey, What about http://sach.in? ;) (To my delight, Sach.in is not taken yet :))

.me is a Internet country code TLD that has been assigned to Montenegro. Now think of the numerous possibilities of playing with the .me TLD. :)

Sunday

Google hacking

Google hacking is a term that refers to creating complex search engine queries, to retrieve information related to computer security. In its malicious form, it can be used to detect websites that are vulnerable to numerous exploits and vulnerabilities as well as locate private, sensitive information.

Google hacking involves the use advanced and less well known features of the Google search engine to reveal sensitive data about a particular target. Google Hacking, essentially makes use of advanced search operators to locate specific strings of text within search results.

While Google hacking is the general term used, many of the tactics and search operators can be used on any search engine.

So, you ready for some hacking? Here goes...

One can retrieve the Username & Password list from Microsoft FrontPage Servers by inputing the microscript below, in Google search field: -

"#-Frontpage-" inurl:administrators.pwd

What do you see? :)

Note the search operators such as ‘inurl:’, double quotes etc, used in the search string.

This is just a simple example. You might be able to get just about anything if you know what to look for, and how to look for it!

Saturday

Search Smart I

Search Smart (with Google search operators)

We all use Google (or other search engines) to retrieve info from the vast realms of the net. But do we search smart? Well, start searching smart; even write microscripts - to get what you are looking for, making use of the special operators provided by the search engines. Here is how to do it…

If a word is essential to getting the results you want, you can include it by putting a "+" sign in front of it.

If you want to search not only for your search term but also for its synonyms, place the tilde sign ("~") immediately in front of your search term. (If you give '~House'; Google automatically searches for House, Home, Dwelling etc too. If you give '~Food'; Google searches for Food, Nutrition, Cooking, Recipe etc).

To find pages that include either of two search terms, add an uppercase OR between the terms.

What if you want to search a string only within a particular site - give search terms you're looking for, followed by the word "site" and a colon followed by the domain name (If you want to find Hawking in Wikipedia only, give 'Hawking site:en.wikipedia.org')

You can use Numrange (two numbers, separated by two periods) to set ranges for everything from dates to weights to amounts (eg: 'Saarang 2004..2007'; '5000..10000 kg truck'; 'DVD player $50..$100'). Google searches for all the numbers between the range along with the keyword(s).

Google ignores common words and characters such as where, the, how.

Sometimes the best way to ask a question is to get Google to 'fill in the blank' for you. You can do this by adding an asterisk "*" in the part of the sentence or question that you want filled in. ('the parachute was invented by *')

I suggest you try out each of them its really interesting and useful!

Friday

Search Smart II

Cache Operator:

Say you want to see all occurrences of the words in your search string Highlighted, in a particular web page. What can you do? Use the ‘cache:’ operator.

Eg: If you wanted all the occurrences of the words ‘web’, ‘net’, ‘browser’ and ‘Google’ highlighted (for easy viewing) in the Netty-gritty blog page; you just need to give the string ‘cache:netty-gritty.blogspot.com web net browser google’ in Google. See for yourself, what it returns!

Now, don’t you think its very useful, especially when you need to find occurrences of strings in a webpage laden with lots of information.?

As you might have guessed the primary functionality of the ‘cache:’ operator is not to highlight search words. So what is the ‘cache:’ operator really meant for?

Google takes a snapshot of each page examined as it crawls the web, and caches these as a back-up. When you use the ‘cache:’ operator, you see the web page as it looked when it was indexed (this is the content Google uses for the search – so if you updated your website recently and want to know if the Google bots have already crawled through your content, you may use the ‘cache:’ operator).

Link Operator:

Know the secret behind Google's huge success as a search engine? It was the hypothesis 'that pages with the most links to them from other highly relevant Web pages must be the most relevant pages associated with the search' - by Larry Page and Sergey Brin, who were Ph.D. students at Stanford University then. Their research project was accordingly named Back-rub, which later went on to be the internet giant Google!

Obviously we could except, some operator which would list all webpages that have links to the specified webpage, right? Well, what we are looking for is the 'link:' operator.
Eg: 'link:www.google.com' will list webpages that have links pointing to the Google homepage. If you had a website/ blog and wanted to see all other pages that link to your page, you know what to do :)

Thursday

Search Smart III

Define Operator:
Use the ‘Define:’ operator to get definitions of word(s) without navigating to other sites. Eg: ‘define:mainframe’

Inurl Operator:
Say you forgot a url, but remember some words in the url. If you are going to search for that word(s), Google is going to throw you umpteen number of pages all of which will have the word(s), and finding your url will be like looking for a needle in a haystack. So is there a way to search words only in the url? Yes; there is. Use ‘inurl:’ or ‘allinurl:’ operators. Eg: ‘allinurl: Bill Gates’ will return only pages which have the words ‘Bill’ and ‘Gates’ in the url. You might ask why ‘inurl:’ operator while there is a ‘allinurl:’. Let me illustrate with an eg. Say you want ‘Gates’ to be in the url and ‘Paul Allen’ in the page, then you can use ‘inurl:gates paul allen’.

Filetype Operator:
If you want only pages with a particular type of file type (like doc, jpg, xls, php, xls, swf, pdf etc) to be returned, use the 'filetype:' operator. Eg. 'Hawking filetype:doc'

If you want to find the exact phrase (all words in the same order you entered), give the search string within double quotes. Eg: "Tribute to the Founding Fathers of the Web and the Net" (extremely helpful when you want to look up lyrics, quotes, dialogues etc)

If you want to search for some words and exclude some other words, use the '-' operator. For instance you want to search for virus, but you want the word computer not to be present in results (may be you are a medical student who wants to exclude pages on computer virus). The search would be 'virus -computer'

Google operators can be combined in a number of ways to form ‘Google Microscripts’ to write complex queries and to Search Smart :)

Wednesday

Link bombing

‘Link bombing’ (or Google bombing) is an attempt to influence the ranking of a given page in results returned by the search engine, often with humorous or political intentions.

If you are searching for something, search engines like Google, Ask, and Yahoo! will first find all the pages they think are relevant. Then, crucially, they need to decide which order to display search results in, ie how to rank the pages based on the relevance. One of the most important factors in deciding how relevant particular sites are, is to count how many other sites link to it (backlinks).

Search engines often use the number of backlinks that a website has as one of the factors for determining that website's search engine ranking. For example, Google's PageRank algorithm uses backlinks to help determine a site's rank. Therefore, websites often employ various techniques to increase the number of backlinks pointing to their website and to boost their page ranks. The more references, or links there are to a site, the more important it is deemed to be.

Because lots of people quote and link the Wikipedia website, for example, the Wikipedia site is seen as relevant and a good hit. That is why the site ranks quite highly for many search terms.

Thus in theory you could get your own site listed at the top for very targeted keywords using the same technique. This presents an opportunity for people to exploit this fact, and to create a "link bomb". They may encourage thousands of their peers to include a link to a particular page for some particular phrase/words. For eg. Link bombers had the Bush homepage in their blogs, and labeled it "miserable failure", as a result whenever one searches for ‘miserable failure’ due to the huge number of backlinks to the Bush homepage, the main search engine algorithms were fooled into thinking this was a relevant result for that search term, and Bush was driven to the top of the rankings. Mr Bush had been link bombed.

Tuesday

Search engine optimization

Search engine optimization (SEO) is the process of improving the volume and quality of traffic to a web site from search engines. The higher a site “ranks” the earlier it is presented in the search results, and more visit that site gets.

SEO mainly considers how search algorithms work and what people search for. Search engines use crawlers to find pages for their algorithmic search results. Search engine crawlers may look at a number of different factors when crawling a site. Not every page is indexed by the search engines. Distance of pages from the root directory of a site may also be a factor in whether or not pages get crawled.

A good knowledge on how search engines assign Page Ranks is vital to devise the SEO techniques for your site. While there are many complex SEO techniques, a few basic ones would be:

• Get quality inbound links (Search for your target keywords. Look at the pages that appear in the top results. Visit those pages and ask the site owners if they will link to you, they might - especially if you offer to link back)
• Make sure you get your pages get listed in Social bookmarking sites
• Create a descriptive title. Failure to put target keywords in the title tag may cause perfectly relevant web pages to be poorly ranked.
• Create a Site Map that includes all the pages in the site
• Have main keyphrase at the beginning of page title (keyphrase prominence) and keep keyphrases together (keyword proximity)
• Have Keyphrases in Description Meta Tag
• Avoid frames. Search engines may index only the framed content page and not the navigation frame.
• Make sure that all internal pages link to the homepage
• Target keywords should be at least two or more words long (too many sites will be relevant for a single word)

Black hat SEO or spamdexing, employ overly aggressive techniques such as link farms and keyword stuffing. Search engines look for sites that employ these techniques and may remove them from their indexes.

Monday

Podcasting

Podcasting is the method of distributing multimedia files, such as audio or video programs, over the Internet using syndication feeds, for playback on portable media players and personal computers.

The term "podcast" is a portmanteau of the acronym "Pod" – standing for "Portable on Demand" – and "broadcast". (The name iPod was coined with Pod, prefixed with the "i" commonly used by Apple for its products and services.)

Thus podcasting refers to posting or transmitting an audio/ video file to be downloaded and viewed/heard by other internet users either on a computer or MP3 player. It involves making a multimedia file (usually in MP3 format) that is updated frequently available for automatic download, via an RSS feed, so users can listen to the file at their convenience.

A podcaster begins by making a file available on the Internet, usually by posting the file on a webserver. The content provider (podcaster) then announces the existence of that file in a feed (RSS or Atom).

A podcast-specific aggregator or podcatcher (software installed on the users' computer/ portable media player) checks each feed at a specified interval. If the feed data has changed from when it was previously checked, the podcatcher determines the location of the most recent item and automatically downloads it. The downloaded episodes can then be played, replayed, or archived as with any other computer file.

Podcasting's initial appeal was to allow individuals to distribute their own radio-style shows, but the system quickly became used in a wide variety of other ways, including distribution of school lessons, official and unofficial audio tours of museums, conference meeting alerts and updates, and by police departments to distribute public safety messages.

Sunday

OpenID

Think of all the sites you have registered with - blogs, emails, social networking sites, social bookmarking sites, photo sharing sites, video sharing sites, feeds.... With Web 2.0 hitting the internet in a big way, you find a new site you can't live without, springing up every other day. Now that means you will be having 100s of usernames and passwords to remember. Ever thought why we can't have single digital identity across the Internet?

OpenID is a single sign-on system, a way to use a single digital identity across the internet. Using OpenID-enabled sites, you do not need to remember authentication tokens such as username and password. Instead, you only need to be registered on a website with an OpenID "identity provider" (IdP). Since OpenID is decentralized, any website can employ OpenID software as a way for users to sign in. It eliminates the need for multiple user names across different web sites, simplifying your online experience.

An OpenID provider is a service provider offering the service of registering and providing OpenID authentication. You get to choose the OpenID Provider that best meets your needs. A URL or XRI is chosen by the End User as their OpenID identifier, which can be used to login to any site that accepts OpenID.

OpenID is still in the adoption phase and is becoming popular, with organizations like AOL, Microsoft, Sun, Novell, Yahoo etc. beginning to accept and provide OpenIDs.

Saturday

Web Widgets

Web widget is a embeddable piece of code that can be installed and executed within any HTML-based web page.
Widgets may also be known as modules, snippets or plug-ins.

Applications can be integrated within a website by the placement of the small snippet of code. Widgets may be used to enhance the experience of the visitors by providing additional third party functionalities. While the third party sites that provide widgets, use it as a form advertisement or syndication for their sites.

Widgets are now commonplace and are used by bloggers, social network users, auction sites and personal web sites. iGoogle and Netvibes are among the most visited sites that make use of widgets extensively.

Think of all the link counters, advertising banners, flash animations, YouTube videos, Podcast players, calendars, count-down timers, Flickr photo presentations, social bookmarking buttons you have come across. Btw the clock and google search and Digg Top 10 on this blog are widgets I have embedded here :)

If you are looking at hosting some cool widgets on your site check out widgetbox.com, yourminis.com or pageflakes.com.

Netty-gritty

Monday

Tribute

Sunday

Typosquatting

Saturday

CAPTCHA

Friday

Internet bots

Thursday

Web 2.0

Wednesday

Myth busting

Tuesday

RSS Feed

Monday

Domain hack

Sunday

Google hacking

Saturday

Search Smart I

Friday

Search Smart II

Thursday

Search Smart III

Wednesday

Link bombing

Tuesday

Search engine optimization

Monday

Podcasting

Sunday

OpenID

Saturday

Web Widgets

My Blogs

Blog Archive

My Blogs