Friday, February 04, 2005

Semantics and Search Engines

Google and other search engines try to make the best of your queries, working out which results are the best they can offer.

Thanks to PageRank (or its equivalent in other search engines), they use votes from other websites to help them decide whether or not a page is related to your query.

But they also need to take synonyms into account. This is where semantics comes in.

Add "~" at the beginning of a word and you will get another set of results, based on semantic search. You get Nokia for phones, BMW for cars, etc.

How does it work? You may think co-occurrence is probably taken into account, but if it is, it is definitely weighted somehow:

Co-occurrence: take the number of results for kw1 (n1), the number of results for kw2 (n2), and the number of results for kw1+kw2 (n12).

=> c-index (correlation index) = c = n12/(n1 + n2 - n12)

For instance, mortgage is related to Microsoft (type ~mortgage in Google), but the c-index is very low for the two keywords.
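
To make the c-index concrete, here is a minimal Python sketch of the formula. The function name `c_index` and the counts used in the example are my own assumptions for illustration; they are not real search-engine figures, and nothing here claims to be what Google actually computes.

```python
def c_index(n1: int, n2: int, n12: int) -> float:
    """Co-occurrence index: c = n12 / (n1 + n2 - n12).

    n1  -- number of results for kw1
    n2  -- number of results for kw2
    n12 -- number of results for kw1+kw2 (both keywords)
    """
    if min(n1, n2, n12) < 0:
        raise ValueError("result counts cannot be negative")
    # n1 + n2 counts the pages containing both keywords twice,
    # so subtract n12 once to get pages containing at least one keyword.
    union = n1 + n2 - n12
    if union <= 0:
        return 0.0
    return n12 / union


# Hypothetical counts, chosen only to show a low index.
print(c_index(n1=1_000_000, n2=500_000, n12=1_000))
```

The index goes from 0 (the keywords never appear together) to 1 (they always appear together), so a pair like mortgage/microsoft would sit close to 0.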

In some cases, you get more results for the keyword itself than for ~keyword. This means the algorithm is not taking into account all the parameters it usually does. It keeps some off-page factors though, because some pages are displayed without any related keywords on them (easy to test). Which parameters are taken out is worth finding out.

How can Google associate "microsoft" with "mortgage", or "phone" with "nokia"? PageRank seems to be involved (all the sites that come first have a very decent PR, and big brands steal the top spots). There is definitely something to look into on this side.



The relation is not reciprocal:
~phone leads to nokia
~nokia doesn't lead to phone.
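
A small Python sketch shows why a plain co-occurrence index cannot explain this one-way behaviour on its own: n12/(n1 + n2 - n12) gives the same value whichever keyword you start from. A directed ratio (the share of one keyword's results that also contain the other) does depend on direction. The helper names and all the counts below are hypothetical, and the directed ratio is only my illustration of an asymmetric measure, not something Google is known to use.

```python
def c_index(n1, n2, n12):
    """Symmetric co-occurrence index: n12 / (n1 + n2 - n12)."""
    union = n1 + n2 - n12
    return n12 / union if union > 0 else 0.0


def directed_share(n_source, n12):
    """Share of the source keyword's results that also contain the other keyword."""
    return n12 / n_source if n_source > 0 else 0.0


# Hypothetical counts, not real search-engine figures.
n_phone, n_nokia, n_both = 2_000_000, 100_000, 80_000

# Swapping the keywords leaves the c-index unchanged.
print(c_index(n_phone, n_nokia, n_both) == c_index(n_nokia, n_phone, n_both))

# The directed shares differ depending on the starting keyword.
print(directed_share(n_phone, n_both))  # phone -> nokia
print(directed_share(n_nokia, n_both))  # nokia -> phone
```

Whatever Google does, it has to involve something directional like this, or some other signal on top of raw co-occurrence.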

I leave the problem open. How does Google build its own thesaurus? Is there any manual input?










