AS4 | Evan Hoose

A better search solution.

Drew DeVault wrote this. (Read that first. It'll provide useful context I won't explain.)

I like Drew's work, and this got me thinking.

What is the best way to implement a search engine in this style?

Before you begin:

This is a living document. I will make edits and other changes without warning. It is also not complete.

I am mostly using this as notekeeping for myself, and have made it publicly available in the hope that it will be useful.

I do intend to keep playing around with this, but strictly on a spare-time, when-I-feel-like-it basis.

Resources for learning:

As I've started studying this, I've come across some resources. I'll link them here in case anyone else is looking.

Introduction to Information Retrieval -- A book by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Just started it, but it looks promising.

Information Retrieval Resources -- Resource link dump provided by the above authors.

Arden Dertat's Blog -- Conveniently, Arden Dertat has a series of blog posts about building search engines. It looks good, but I don't know enough about the topic to confirm or deny.

Wikipedia: Search engine indexing -- Exactly what it sounds like.

Proposed Architecture

NOTE: Most of what is described below is either frontend, or the very front of the backend. Why? Because that's what I knew enough to write about when I started. I'm currently studying the resources linked above, and will update as I learn more.

We will have three main components:

The Web Crawler

I would argue that the crawler should take a list of domains to use as 'Tier 1' and then crawl links as described in Drew's post.
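A minimal sketch of how tier tracking could work, not a definitive implementation. It assumes that a domain first reached from a tier N page becomes tier N + 1, and that crawling stops past a configurable max_tier; both rules are my reading of the idea rather than something spelled out here. The names crawl, get_links and max_tier are hypothetical, and get_links stands in for the real fetch-and-parse step so the sketch runs without network access.

```python
from collections import deque
from urllib.parse import urlparse


def crawl(tier1_domains, seed_urls, get_links, max_tier=2):
    """Breadth-first crawl that records a tier for every domain it meets.

    get_links(url) is a hypothetical callable returning the URLs linked
    from a page; a real crawler would fetch and parse the page here.
    Returns a dict mapping each crawled URL to the tier of its domain.
    """
    domain_tier = {domain: 1 for domain in tier1_domains}
    seen = set()
    queue = deque(seed_urls)
    pages = {}

    while queue:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)

        tier = domain_tier.get(urlparse(url).netloc)
        # Skip unknown domains and anything beyond the allowed tier.
        if tier is None or tier > max_tier:
            continue
        pages[url] = tier

        for link in get_links(url):
            # Assumption: a domain first reached from tier N is tier N + 1.
            domain_tier.setdefault(urlparse(link).netloc, tier + 1)
            queue.append(link)

    return pages
```

With a single Tier 1 domain and max_tier=2, this crawls every reachable page on that domain plus the pages it links to directly on other domains, and goes no further.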

This crawler would then output a database which can be fed into the data servers, in both centralized and decentralized forms (for example, after running a crawl, the crawler could offer to also submit the database to the centralized data server).

This should help build up a central database, while still making it simple to run your own instance with a given focus.

In order to combat abuse, I would have the central server run some form of sanity check against each submitted database, as well as blacklisting sites that appear to be blogspam or otherwise not useful.
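One possible shape for such a check, sketched under assumptions. The record layout (a dict with "url" and "text" keys), the BLOCKED_DOMAINS set, and the MAX_TEXT_LENGTH limit are all hypothetical. Detecting blogspam automatically is a much harder problem and is not attempted here; this only covers the mechanical part.

```python
from urllib.parse import urlparse

BLOCKED_DOMAINS = {"spam.example"}  # hypothetical blacklist
MAX_TEXT_LENGTH = 1_000_000         # hypothetical size limit, in characters


def check_record(record):
    """Return a rejection reason for a submitted record, or None if it passes."""
    url = record.get("url")
    if not isinstance(url, str):
        return "invalid url"
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return "invalid url"
    if parsed.hostname in BLOCKED_DOMAINS:
        return "blocked domain"

    text = record.get("text")
    if not isinstance(text, str) or not text.strip():
        return "empty text"
    if len(text) > MAX_TEXT_LENGTH:
        return "text too long"
    return None


def filter_submission(records):
    """Split a submitted database into accepted records and rejections."""
    accepted, rejected = [], []
    for record in records:
        reason = check_record(record)
        if reason is None:
            accepted.append(record)
        else:
            rejected.append((record.get("url"), reason))
    return accepted, rejected
```

Keeping the rejection reasons around would let the central server report back to whoever submitted the database, rather than silently dropping records.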


Page Ranking

The data/search servers

These servers will take the databases output by the crawler, and use them to respond to search requests.
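To make this concrete, here is a rough sketch of the simplest thing such a server could do with a crawler database: build an inverted index and answer queries by intersecting the matching sets. The record layout ("url", "title", "text") and the function names are assumptions for illustration. There is no ranking at all; results come back in alphabetical URL order just to be deterministic.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(records):
    """Map each term to the set of URLs whose title or text contains it."""
    index = defaultdict(set)
    for record in records:
        for term in tokenize(record["title"] + " " + record["text"]):
            index[term].add(record["url"])
    return index


def search(index, query):
    """Return the URLs containing every term of the query (no ranking)."""
    terms = tokenize(query)
    if not terms:
        return []
    matches = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(matches)
```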

However, unlike a traditional search engine, these servers will only respond with data in some serialized format.

The purposes of this are multifold.

First, it reduces the amount of development needed for the server, which seems like a Good Thing, to me at least.

Second, it would allow more useful data to be sent with the same amount of bandwidth, since no page markup or styling has to travel with the results.

Third, it allows multiple search clients to be used, each with as many or as few features as they please. One feature in particular that I would find useful would be the ability to create a local search database that could be compared against before sending out network requests.
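A minimal sketch of the local-database idea, under one possible interpretation: the client keeps a local store keyed by normalized query and only goes to the network on a miss. The class name, the remote_search callable, and the JSON response shape ({"query": ..., "results": [...]}) are all hypothetical; the serialized format is still undecided.

```python
import json


class LocalFirstClient:
    """Check a local store before asking the remote search server."""

    def __init__(self, remote_search, local_db=None):
        # remote_search is a hypothetical callable: query -> JSON string.
        self.remote_search = remote_search
        # local_db maps a normalized query to a list of result dicts.
        self.local_db = {} if local_db is None else local_db

    def search(self, query):
        """Return (results, source), where source is "local" or "remote"."""
        key = " ".join(query.lower().split())
        if key in self.local_db:
            return self.local_db[key], "local"
        # Assumed response shape: {"query": ..., "results": [...]}.
        results = json.loads(self.remote_search(key))["results"]
        self.local_db[key] = results
        return results, "remote"
```

A more useful local database would probably hold an index of pages rather than whole queries, so that it could answer queries it has never seen before. The query-keyed version is just the smallest thing that shows the lookup order.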

The client

The most basic client could be as simple as a web page which renders whatever data it receives from the central server.
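As an illustration of how little such a client needs to do, here is a sketch that turns a serialized response into an HTML fragment. It assumes the server replies with JSON shaped like {"query": ..., "results": [{"url": ..., "title": ...}]}, which is a hypothetical format, and the function name is made up for this example.

```python
import json
from html import escape


def render_results(payload):
    """Render a JSON search response (assumed shape) as an HTML fragment."""
    data = json.loads(payload)
    items = []
    for result in data["results"]:
        # Escape everything from the server; it is untrusted input.
        items.append(
            '<li><a href="{}">{}</a></li>'.format(
                escape(result["url"], quote=True), escape(result["title"])
            )
        )
    heading = "<h1>Results for {}</h1>".format(escape(data["query"]))
    return heading + "\n<ul>\n" + "\n".join(items) + "\n</ul>"
```

Escaping matters more here than in a traditional search engine, since the data may have come from a database somebody else crawled and submitted.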

Some other features that could be included client-side:

And probably a few more, but the above would be my wishlist.

Pros/Cons of this architecture.