Amirkasra's blog

By Amirkasra, 11 years ago, In English

Hi everybody! Recently I got really excited about creating a search engine myself! I learned about an algorithm known as "document distance". It's very interesting in my view! If you don't know about it, it's an algorithm that computes the angle between the vectors of two documents: the more the documents differ in subject, the larger the angle between them. Google knows more! I wonder if anyone knows some basic algorithms like "docDist" that may be useful in making a search engine. If you have any idea of what kind of knowledge I should gain in order to build a search engine, please let me (and maybe others!) know.
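To make the idea concrete, here is a tiny sketch in Python of how I understand "document distance" (the function names are just my own): count the words of each document, treat the counts as a vector, and take the angle between the two vectors.

```python
import math
import re
from collections import Counter

def word_counts(text):
    # split the document into lowercase words and count how often each occurs
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def doc_distance(doc_a, doc_b):
    # angle (in radians) between the word-frequency vectors of the two documents
    a, b = word_counts(doc_a), word_counts(doc_b)
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return math.acos(min(1.0, dot / (norm_a * norm_b)))

# similar documents give a small angle, unrelated ones an angle near pi/2
print(doc_distance("the cat sat on the mat", "the cat lay on the mat"))
print(doc_distance("the cat sat on the mat", "binary search on sorted arrays"))
```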


»
11 years ago, # |

I think these can help you: Google, Yandex :)

»
11 years ago, # |

The greatest book I've ever seen about search engines and their structure is Introduction to Information Retrieval.

For example, you can find your "angle between documents" factor there, in the chapter about ranking factors and algorithms.

The book gives a fluent overview of the most fundamental components and algorithms used in search engines; I haven't seen such a precise and broad overview anywhere else, and I strongly recommend it to you. The knowledge from this book is sufficient to build a simple working prototype of a web search engine.

»
11 years ago, # |

Your search engine will naturally consist of two big parts: the crawler (which traverses the web and parses pages) and the ranker (which, given a query and a set of documents, decides which documents are more relevant to the query and which are less so).
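To give a rough picture of the crawler half, here is a toy breadth-first crawler using only the Python standard library; the names and the page limit are placeholders, and a real crawler would also need politeness delays, robots.txt handling and much more.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    # collects the href targets of all <a> tags on one page
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=10):
    # breadth-first traversal of the link graph, starting from seed_url
    seen, queue, pages = {seed_url}, deque([seed_url]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue  # unreachable page or non-text content, just skip it
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages  # url -> raw HTML, ready to be fed into the indexer
```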

The crawler will need to store its data somewhere. Search for "inverted indexes"; this is how most search engines represent their data internally. Parsing itself is rather straightforward, but it has its own challenges: for instance, people will try to trick your crawler by introducing hidden elements on the page that contain SEO-optimized content, and you will want your crawler to recognize such hidden elements and ignore them.
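As a minimal sketch of what an inverted index looks like, assuming the documents are already fetched and tokenized by whitespace (a real index also stores positions, term frequencies and so on):

```python
from collections import defaultdict

def build_inverted_index(documents):
    # documents: dict mapping doc_id -> text
    # result: dict mapping each word -> sorted list of doc_ids that contain it
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

def search(index, query):
    # doc_ids containing every query word (a simple AND query)
    postings = [set(index.get(word, ())) for word in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

docs = {1: "the cat sat on the mat", 2: "the dog chased the cat away"}
index = build_inverted_index(docs)
print(search(index, "the cat"))   # -> [1, 2]
print(search(index, "the dog"))   # -> [2]
```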

For the ranker, you will need to collect lots of different signals and use some model to rank documents based on them. It might be a hand-tuned model or some machine learning model. Some signals will be based on the document body; in particular, read about BM25:

http://en.wikipedia.org/wiki/Okapi_BM25
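To make that concrete, here is a small, unoptimized BM25 scorer in Python that follows the formula from the page above (k1 = 1.5 and b = 0.75 are common default choices, not anything mandated):

```python
import math
from collections import Counter

def bm25_scores(documents, query, k1=1.5, b=0.75):
    # documents: list of raw texts; returns one BM25 score per document for the query
    tokenized = [doc.lower().split() for doc in documents]
    n = len(tokenized)
    avgdl = sum(len(doc) for doc in tokenized) / n
    terms = query.lower().split()
    # document frequency: in how many documents each query term occurs
    df = {t: sum(1 for doc in tokenized if t in doc) for t in terms}
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for t in terms:
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            denom = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[t] * (k1 + 1) / denom
        scores.append(score)
    return scores

docs = ["the cat sat on the mat", "cats and dogs", "an article about search engines"]
print(bm25_scores(docs, "cat mat"))
```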

Some will be based on other pages linking to the current document. In particular, for a long time many search engines used techniques similar to PageRank:

http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf

as the basis of their rankers. These days search engines tend to lower the weight of links in favor of other signals, since buying links was one of the most common ways of tricking search engines into ranking a web site higher.
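As an illustration, here is a toy power-iteration PageRank over a small in-memory link graph (a damping factor around 0.85 is the commonly cited choice; everything else here is simplified compared to the paper):

```python
def pagerank(links, damping=0.85, iterations=50):
    # links: dict mapping each page to the list of pages it links to
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    if target in new_rank:   # ignore links leaving the crawled set
                        new_rank[target] += share
            else:
                # a dangling page spreads its rank evenly over all pages
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": []}
print(pagerank(graph))
```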

Once you are done with the crawler and the ranker, you will have a somewhat working search engine. After that you will want to concentrate on other aspects of it, such as correcting spelling, filtering out spam results and maybe introducing some relevant ads :)
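For the spelling part, one very simple starting point is to suggest the vocabulary word with the smallest edit distance to the query term; a rough sketch (real spell checkers use far more signals than this):

```python
def edit_distance(a, b):
    # Levenshtein distance, single-row dynamic programming
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def suggest(word, vocabulary):
    # pick the known word closest to the (possibly misspelled) query word
    return min(vocabulary, key=lambda known: edit_distance(word, known))

print(suggest("serach", ["search", "engine", "ranking", "crawler"]))   # -> search
```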