Evan Hoose's Website
Drew DeVault wrote this. (Read that first. It'll provide useful context I won't explain.)
I like Drew's work, and this got me thinking.
What is the best way to implement a search engine in this style?
This is a living document. I will make edits and other changes without warning. It is also not complete.
I am mostly using this as notekeeping for myself, and have made it publicly available in the hopes that it will be useful.
I do intend to keep playing around with this, but it is strictly on a spare time/I feel like it basis.
As I've started studying this, I've come across some resources. I'll link them here in case anyone else is looking.
Introduction to Information Retrieval -- A book by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Just started it, but it looks promising.
Information Retrieval Resources -- Resource link dump provided by the above authors.
Arden Dertat's Blog -- Conveniently, Arden Dertat has a series of blog posts about building search engines. It looks good, but I don't know enough about the topic to confirm or deny.
Wikipedia: Search engine indexing -- Exactly what it sounds like.
NOTE: Most of what is described below is either frontend, or the very front of the backend. Why? Because that's what I knew enough to write about when I started. I'm currently studying the resources linked above, and will update as I learn more.
We will have three main components: a web crawler, the data (search) servers, and the search clients.
I would argue that the crawler should take a list of domains to use as 'Tier 1' and then crawl links as described in Drew's post.
This crawler would then output a database which can be fed into the data servers, in both centralized and decentralized forms (for example, after running a crawl, the crawler could offer to also submit the database to the centralized data server).
This should help build up a central database, while still making it simple to run your own instance with a given focus.
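None of this is specified yet, but as a rough sketch of the crawler idea, here is a network-free Python version. The function names and the `{url: outgoing_links}` database shape are my own invention, and the page fetcher is injected as a callable so the logic can be tested without touching the network:

```python
import re
from collections import deque

def crawl(tier1_domains, fetch, max_pages=100):
    """Breadth-first crawl seeded from a list of 'Tier 1' domains.

    `fetch` is a callable url -> html string (injected so this sketch
    stays network-free).  Returns a toy database mapping each crawled
    URL to the set of links found on that page -- the kind of output
    that could be handed to a data server or submitted centrally.
    """
    queue = deque("https://" + d + "/" for d in tier1_domains)
    db = {}
    while queue and len(db) < max_pages:
        url = queue.popleft()
        if url in db:  # already crawled
            continue
        try:
            html = fetch(url)
        except Exception:
            continue  # unreachable page; skip it
        # Naive link extraction; a real crawler would parse HTML properly
        # and normalize/deduplicate URLs.
        links = set(re.findall(r'href="([^"]+)"', html))
        db[url] = links
        queue.extend(l for l in links if l.startswith("http"))
    return db
```

A real implementation would also need robots.txt handling, rate limiting, and URL canonicalization; this only shows the "seed with Tier 1, then follow links" shape.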
In order to combat abuse, I would have the central server do some form of sanity check against submitted databases, as well as blacklist sites that appear to be blogspam or otherwise not useful.
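The simplest version of that sanity check might just filter submitted entries against a domain blacklist before merging them. A minimal sketch, assuming the toy `{url: links}` database format from above (the blacklist contents and function name are hypothetical):

```python
from urllib.parse import urlparse

# Hypothetical blacklist the central server would maintain.
BLACKLIST = {"blogspam.example"}

def accept_submission(db):
    """Drop entries whose host is blacklisted before merging a
    submitted crawl database into the central one."""
    return {
        url: links
        for url, links in db.items()
        if urlparse(url).hostname not in BLACKLIST
    }
```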
These servers will take the databases output from the crawler, and use them to respond to search requests.
However, unlike a traditional search engine, these servers will only respond with data in some serialized format.
The purpose of this is threefold.
First, it simplifies the amount of development needed for the server, which seems like a Good Thing, to me at least.
Second, it would allow for more useful data to be sent with the same amount of bandwidth.
Third, it allows multiple search clients to be used, each with as many or as few features as they please. One feature in particular that I would find useful would be the ability to create a local search database that could be compared against before sending out network requests.
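No wire format has been decided, but to make the "serialized data only" idea concrete, here is one plausible shape using JSON. The field names are my own invention, not part of any spec:

```python
import json

def serialize_results(query, hits):
    """Serialize search hits into a plain JSON payload.

    The server sends only this; rendering is entirely up to whichever
    client the user chooses.  `hits` is a list of (url, title, snippet)
    tuples -- a made-up internal shape for this sketch.
    """
    return json.dumps({
        "query": query,
        "results": [
            {"url": u, "title": t, "snippet": s}
            for (u, t, s) in hits
        ],
    })
```

Because the payload is plain data, a heavyweight client could rank or filter it locally, while the most basic client just renders it as-is.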
The most basic client could be as simple as a web page which renders whatever data it receives from the central server.
Some other features that could be included client-side:
Building of a local search database.
Blacklisting of domains that you personally find non-useful.
A launcher for the web-crawler, which could be used to improve results that you find lacking.
A switcher for which search server you want to connect to.
And probably a few more, but the above would be my wishlist.
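The local-database feature in particular is easy to sketch: check the local store first, and only fall back to the network when it misses. Everything here (the dict-as-database, the injected `remote_search` callable) is illustrative, not a proposed interface:

```python
def search(query, local_db, remote_search):
    """Consult a local search database before sending a network request.

    `local_db` is a plain dict mapping queries to cached results, and
    `remote_search` is a callable query -> results that would normally
    hit a search server.  Remote results are cached for next time.
    """
    local_hits = local_db.get(query)
    if local_hits:
        return local_hits
    hits = remote_search(query)
    local_db[query] = hits  # cache so repeat queries stay offline
    return hits
```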
In my mind at least, this strikes a good balance between de/centralization.
The search server is much simpler than what it would have to be for a traditional engine.
It should be simpler to write your own clients/servers, as you could have standardized formats for search results and search databases.
Is dependent on the user having access to a good client. One could of course be provided, but unscrupulous people could sabotage the network by providing/advertising subpar clients.
Possibly leans too far towards decentralization. (Would users running crawls actually submit them to the central server?)
Related to the first two, is dependent on a given provider giving access to a good search server.
Is designed by someone who doesn't know much about either crawling or indexing, and therefore may be totally unviable for reasons I don't understand.