Sunday, February 8, 2009

Search Engine for Dummies - part 1

What would the Internet be without search engines? You have a humongous amount of data, but without a tool to find what you need, you would be at a loss.

Today you probably take it for granted that Google is the most popular search engine. But that wasn't always the case. If you've been around on the Internet long enough (i.e. a decade or more), you probably remember the days before Google became a household name. Before the turn of the new millennium, the most popular search engine was Altavista.

Back then Altavista was considered cutting-edge and a pioneer of truly usable Internet search. It was one of the top web destinations. But by the early 2000s, its popularity had quickly dropped due to the arrival of a new kid on the block: the venerable Google.

When Google entered the search engine market, many people thought the market was already saturated. There were already a bunch of search engines (you may recall names such as Lycos, Magellan, InfoSeek, Excite, etc.). But Google proved them wrong. It rose steadily in prominence, and soon enough it grabbed the top spot and has remained there ever since.

So how does Google do it? And where did Altavista and those other search engines fail?

All search engines begin with a web crawler. It is a piece of software which automatically crawls the web, collecting information about every page it encounters. This data is then stored and indexed in the search engine's database, making it available for search queries from users.
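To make the crawl-then-index idea concrete, here is a minimal sketch in Python. It is not how any real search engine is built: the "web" is a hypothetical in-memory dictionary (`WEB`, with made-up `*.example` pages) so that the example runs without a network, and the function names `crawl` and `build_index` are my own.

```python
from collections import deque
import re

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
WEB = {
    "a.example": ("nuclear weapon history", ["b.example", "c.example"]),
    "b.example": ("cooking recipes and kitchen tips", ["a.example"]),
    "c.example": ("nuclear power plants", []),
}

def crawl(start):
    """Visit every page reachable from `start` by following links."""
    seen, queue, pages = {start}, deque([start]), {}
    while queue:
        url = queue.popleft()
        text, links = WEB[url]
        pages[url] = text
        for link in links:
            # Skip links we have already queued or that lead nowhere.
            if link in WEB and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = {}
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index.setdefault(word, set()).add(url)
    return index

index = build_index(crawl("a.example"))
print(sorted(index["nuclear"]))  # ['a.example', 'c.example']
```

Answering a query then becomes a lookup in `index` rather than a scan of every page, which is why the indexing step matters.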

The quality of a search engine from the user's point of view depends on how relevant the results are to the search term. The primary difference between older-generation search engines (Altavista et al.) and newer-generation ones is the method used to decide which web pages are the most relevant and should be put in the result set.

The older search engines' method is based on the textual content of the web pages. In this method, a page's relevance to a search term is calculated based on how many times that search term occurs in the page. For example, if you search for "Nuclear Weapon", the search engine would probably put a page where this term occurs many times at the top of the search results.
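The counting method described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual algorithm of Altavista or any other engine; the pages in `PAGES` and the helper names `count_phrase` and `rank` are made up for the example.

```python
import re

def count_phrase(text, phrase):
    """Count how many times `phrase` occurs in `text`, ignoring case."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    target = phrase.lower().split()
    n = len(target)
    return sum(words[i:i + n] == target for i in range(len(words) - n + 1))

def rank(pages, phrase):
    """Return URLs that contain the phrase, most occurrences first."""
    scored = [(count_phrase(text, phrase), url) for url, text in pages.items()]
    matches = [(score, url) for score, url in scored if score > 0]
    # Sort by descending count; break ties alphabetically by URL.
    return [url for score, url in sorted(matches, key=lambda p: (-p[0], p[1]))]

# Hypothetical pages, invented for this example.
PAGES = {
    "history.example": "A nuclear weapon was first tested in 1945. "
                       "The nuclear weapon changed warfare.",
    "news.example": "Talks on nuclear weapon limits resumed today.",
    "recipes.example": "Ten easy pasta recipes.",
}

print(rank(PAGES, "Nuclear Weapon"))  # ['history.example', 'news.example']
```

The page that mentions the term twice comes first, the page that mentions it once comes second, and the page that never mentions it is left out.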

That sounds reasonable, right? But it turns out that this simple method has serious flaws. Suppose you create a page which contains nothing but the term "Nuclear Weapon" repeated a dozen times. This page would rank high in the result set, despite the fact that it's a useless page. This is the reason why results returned using this method typically have low quality. How do you improve the quality then?
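Here is a small Python sketch of that flaw, assuming the naive "count the occurrences" scoring described earlier. Both pages and the `score` helper are hypothetical, made up only to show the effect.

```python
def score(text, term):
    """Naive relevance: how many times the term appears in the text."""
    return text.lower().count(term.lower())

# A page with real content that mentions the term once.
useful = ("The nuclear weapon was developed in the 1940s. "
          "This article covers its history and the treaties that limit it.")

# A useless page: nothing but the term repeated a dozen times.
spam = " ".join(["Nuclear Weapon"] * 12)

pages = {"useful.example": useful, "spam.example": spam}
ranking = sorted(pages, key=lambda url: score(pages[url], "Nuclear Weapon"),
                 reverse=True)

print(ranking)  # ['spam.example', 'useful.example']
```

The useless page scores 12 against the useful page's 1, so it lands at the top of the results.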

to be continued...

How Big Is Your Internet Footprint?

A while ago, for one reason or another, I remembered a good old friend from my college years in Illinois, US, whom I hadn't heard from for more than five years. We now live at opposite ends of the earth. He stayed in the US after graduating, while I went back to my home country, Indonesia.

Time passed, life got me busy, and I no longer kept in touch with him. His old email was no longer active and we didn't have many mutual friends. So when I wanted to find out news about him, I was at a loss. What do you do when you're in a situation like this?

Well, if you use Facebook (or any other social networking site), most likely that would be the first place to go. That I did. But nope, he wasn't there. The next logical thing would be to go to Uncle Google. I did that too. Voila, after a few searches I found some info about him. Now I know that instead of Facebook he uses LinkedIn. I found out where he works, his Gmail address, and things he did for college project assignments. Searching a little more, I found out that last year he had married his longtime girlfriend (they had been together since the college days. Dude, what took you so long...). That's not all: I even found that just recently, they had a baby daughter. Had I searched deeper, I could probably have found his home address, phone number, and his shoe size (ok, probably not that one).

Pretty scary, huh?! His personal information is all exposed on the Internet. In the old days, you probably needed to be a cop or a private investigator to obtain that kind of information. Now, some Google skills are all you need.

What about yourself? If you have used the Internet for a while and you have been active (using a lot of web services, joining social networks, forum discussions, etc.), chances are your personal information is all over the web. You may think that some of the websites that you use have somehow violated your privacy. But the fact is, in most cases it is you who have told the world about yourself. Just think of what others can see in your profile on your favorite social network.

I hear it's now not uncommon for employers to find out as much as they can about their prospective employees via the Internet. The Internet can tell them what you don't mention in your CV. It's easier for an employer to Google your information than to ask your references. If you have been behaving "nicely", then you have nothing to worry about. But if you've been "bad", well, your Internet profile is like a criminal record which is difficult to erase. Perhaps now you should be more careful about what kind of footprint you leave on the Internet. In my case, it's already too late, my footprints are all over the place....