the dude abides

Why We Built Elasticsearch (at dotScale)

2013-06-07T00:00:00-04:00

A talk I gave at dotScale about why we built Elasticsearch.

Why We Built Elasticsearch (at dotScale) was originally published by Shay Banon at the dude abides on June 07, 2013.

VMWare Blogs - 10 “Bonsai Cool” Things About Elasticsearch

2013-02-05T00:00:00-05:00

An interview with me regarding Elasticsearch on VMWare Blogs.

VMWare Blogs - 10 “Bonsai Cool” Things About Elasticsearch was originally published by Shay Banon at the dude abides on February 05, 2013.

WIRED - Your Own Private Google

2012-12-07T00:00:00-05:00

An article on WIRED interviewing me (amongst others) regarding Elasticsearch and today’s ability of companies to build Google-like infrastructure using open source search solutions, and the awesomeness that is Lucene.

WIRED - Your Own Private Google was originally published by Shay Banon at the dude abides on December 07, 2012.

You Know, for Search (Inc)

2012-06-12T00:00:00-04:00

When I sat down to write the first lines of code for Elasticsearch, about 3 years ago, I looked at my wife and my 2-month-old daughter and knew perfectly what I was getting into. Well, as Miracle Max would say, I mostly did. I knew that it was going to be a commitment that would take a big chunk of my life to follow through on, I knew that I was building something useful that would make developers’ lives simpler, and I knew that there was a need for something like Elasticsearch out there.

Obviously, I knew all those things, but not many others did. It’s funny: as Tim Robbins found out in the classic Coen brothers movie, “The Hudsucker Proxy”, getting people to see a circle, state “You Know, for Kids!”, and make the leap to understand what it can actually be is not simple.

Thomas Jefferson said: “I am a great believer in Luck, and I find that the harder I work, the more I have of it”. Elasticsearch was and is certainly not a walk in the park. Building a distributed system that can handle massive amounts of data and still be usable is no simple task. I do feel lucky though: we live in an age where anyone can open a laptop and set out to change the world (a bit), with extra luck points for having an understanding spouse.

More than that, I feel lucky with the community that evolved around Elasticsearch. The speed at which Elasticsearch got adopted caught me by surprise. I knew that something was missing, I just couldn’t believe how quickly people would realize it as well. I am very proud of what happened around Elasticsearch, the ecosystem that developed around it, and the users actually taking and using it in real systems and in production.

But dodging buses like crazy for the past couple of years has left me exhausted, and users’ requests for something formal around Elasticsearch, the need to feel safe, have been increasing exponentially. You see, I have seen it happen for 10 years now: it usually starts with “let’s just have a search box so people can search some content”, and quickly evolves to using it in many other parts of the stack / system, quickly increasing its importance in the application.

Luckily (do you notice a trend?), a few months ago, my good friend from the Compass days, Uri Boness, mentioned that he was doing something around search, and I got to know Steven, who was leading it. Steven and I immediately hit it off. You see, Steven and I think very much alike, yet still differently. It’s a rare combination that doesn’t happen frequently, but when it does, exciting things happen. And to top it all, Steven comes with an extensive track record when it comes to Open Source projects. When I learned that Simon Willnauer, one of Lucene’s rock stars, is part of the team as well, the inevitable conclusion was not that hard to make. It Just Felt Right.

So, I am extremely happy to announce that we now have an Elasticsearch company, your basic one stop shop for anything to do with Elasticsearch. What does it mean? It means that we can basically move Elasticsearch harder, faster, better, and stronger, while providing all the services you might expect from an Open Source company.

On top of that, the Elasticsearch team also includes Nick White (the finance wizard) as the CFO; Chris Male and Martijn van Groningen, both amazingly talented developers and Lucene committers; and the valuable Elissa Nancarrow to handle, well, basically everything else. With the future holding additional talented people joining the company (and if you are up to it, we are hiring!), it really feels like we are on our way to building something really special.

I still vividly remember 10 years ago sitting in a one room apartment in London, with no job, first getting into the search space by writing “iCook” for my wife while she was studying to be a Chef at the Cordon Bleu. It’s been a long journey to get to this point, yet it feels like it has just begun…

You Know, for Search (Inc) was originally published by Shay Banon at the dude abides on June 12, 2012.

Data Design Patterns / Analytics - Elasticsearch (at bbuzz)

2012-06-04T00:00:00-04:00

A talk I gave at Berlin Buzzwords about data design patterns and analytics with Elasticsearch.

Side note: the title of the talk is completely messed up, as I decided a day before the conference to create a fresh talk about a more technically interesting topic…

Data Design Patterns / Analytics - Elasticsearch (at bbuzz) was originally published by Shay Banon at the dude abides on June 04, 2012.

Road to a Distributed Search Engine (at bbuzz)

2011-08-09T00:00:00-04:00

A talk I gave at Berlin Buzzwords about the road it took for Elasticsearch’s distributed architecture to happen.

Road to a Distributed Search Engine (at bbuzz) was originally published by Shay Banon at the dude abides on August 09, 2011.

Something New...

2010-09-08T00:00:00-04:00

After four great years of working at GigaSpaces, I am leaving to pursue my own venture. Working at GigaSpaces is certainly one of the best things I have done in my professional life, and leaving is not a simple decision. The people, the technology, and the innovation at GigaSpaces are amazing, and I am proud to have taken part in it.

What is next for me? As most people probably know, my technical love is search. Starting with Compass, and now with ElasticSearch, I am trying to build open source solutions that help get app level search functionality into any application easily (as most apps, if not all of them, need it ;) ). That is why I am going to start and build something around it (specifically, elasticsearch), but on that, I think a different blog post is in order.

Something New... was originally published by Shay Banon at the dude abides on September 08, 2010.

The Future of Compass & Elasticsearch

2010-07-07T00:00:00-04:00

It’s been a long time since I blogged about Compass, and I guess it’s about time to discuss Compass, ElasticSearch, and how they relate to one another.

I started Compass six years ago with a real belief that search is something that any application should have (and search here is not just full text search), and the aim was to have search integrated into a Java application as simply as possible.

Compass has pioneered some really exciting features, including the ability to map your domain model to a search engine (OSEM), later also XML and JSON support, and integration with other ORM libraries (Hibernate, JPA, and so on) to try and make the integration of search as seamless as possible to your typical Java stack application (which has been copied quite nicely by others as well ;) ).

During the lifecycle of Compass, I have also tried to address the scalability aspects of a search solution. By integrating with solutions such as GigaSpaces, Coherence, and Terracotta, the aim was to try and make a search based solution more scalable and usable by applications.

About 8 months ago, I started to think about Compass 3.0. I knew that it required a major rewrite in how it uses Lucene (Lucene 2.9 came with several major changes internally, mainly in how it handles low level readers and search), and also in how to better create a scalable search solution, being able to scale from a single box to many easily and seamlessly. The changes did not end there; I also wanted to create a solution where adding more advanced search features, such as facets and others, would be simple.

The more I thought about it, the more I understood that this basically entails a complete rewrite of Compass if it’s going to be done correctly. Also, I wanted to bring to the table the experience I had with search over the past years and how search should be done, which is hard with an existing codebase.

This is an important point, especially when it comes to scalable search, which I would like to touch on. The way that I started trying to solve scalable search using Compass was by creating a distributed Lucene Directory implementation. Of course, this does get you a bit further down the scalability road, but it’s very evident to people who know how Lucene works (or, for that matter, search engines) that this is not the preferred solution (I knew it as I was writing it). Even going up the stack and creating something similar to Lucandra won’t cut it… The proper way to solve the scalability problem is to run a “local” index (a shard) on each node, do map reduce when you execute a search, and do routing when you index (this is a very simplistic explanation).
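To make the shard idea a bit more concrete, here is a minimal sketch in plain Java. It is not Elasticsearch code: the class names (`ShardedIndexSketch`, `Shard`, `Hit`), the hash-based routing rule, and the term-count scoring are all assumptions made up for illustration. It only shows the shape of the approach: routing decides which local index receives a document, and a search is sent to every shard and the partial results are merged.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a sharded index: route on write, scatter and merge on search. */
class ShardedIndexSketch {

    /** A search hit; the score is simply how often the term occurs in the document. */
    static final class Hit {
        final String id;
        final int score;

        Hit(String id, int score) {
            this.id = id;
            this.score = score;
        }
    }

    /** One "local" index living on one node. A real shard would be a full Lucene index. */
    static final class Shard {
        private final Map<String, String> docs = new HashMap<>();

        void index(String id, String text) {
            docs.put(id, text);
        }

        List<Hit> search(String term) {
            String wanted = term.toLowerCase();
            List<Hit> hits = new ArrayList<>();
            for (Map.Entry<String, String> doc : docs.entrySet()) {
                int score = 0;
                for (String token : doc.getValue().toLowerCase().split("\\s+")) {
                    if (token.equals(wanted)) {
                        score++;
                    }
                }
                if (score > 0) {
                    hits.add(new Hit(doc.getKey(), score));
                }
            }
            return hits;
        }
    }

    private final List<Shard> shards = new ArrayList<>();

    ShardedIndexSketch(int numberOfShards) {
        for (int i = 0; i < numberOfShards; i++) {
            shards.add(new Shard());
        }
    }

    /** Routing (assumed rule): the document id alone decides which shard owns the document. */
    int shardFor(String id) {
        return Math.floorMod(id.hashCode(), shards.size());
    }

    void index(String id, String text) {
        shards.get(shardFor(id)).index(id, text);
    }

    /** "Map" the query to every shard, then "reduce" by merging and sorting the partial results. */
    List<Hit> search(String term, int topN) {
        List<Hit> merged = new ArrayList<>();
        for (Shard shard : shards) {
            merged.addAll(shard.search(term));
        }
        merged.sort(Comparator.comparingInt((Hit h) -> h.score).reversed()
                .thenComparing((Hit h) -> h.id));
        return new ArrayList<>(merged.subList(0, Math.min(topN, merged.size())));
    }
}
```

In this sketch the shards live in one process; in a distributed setup each `Shard` would sit on its own node, and the loop in `search` would become a set of remote calls.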

So, I started out building elasticsearch. It’s basically a solution built from the ground up to be distributed. I also wanted to create a search solution that can be used by any other programming language easily, which basically means JSON over HTTP, without sacrificing the ease of use within the Java programming language (or more specifically, the JVM).

To be honest, I am amazed at what has happened in just 8 months. ElasticSearch is up and running, providing all the core features I wanted it to have at the beginning. It’s a scalable search solution, with a JSON over HTTP interface as well as a really nice “native” Java API (it gets nicer in the upcoming 0.9 release).

Sadly, I have been spending a lot of time on elasticsearch, and almost no time on Compass itself, especially around the forum. For that, I deeply apologize to the amazing Compass users that have been there over the years.

So, what about the future of Compass? I see ElasticSearch as Compass 3.0. That’s how I would have wanted the next major version of Compass to look. This is not to say that the current ElasticSearch version implements all of Compass’s features, but the basis is there. The two main features that are missing are OSEM and ORM integration.

As for OSEM, ElasticSearch can already index JSON (and JSON like structures, for example, in the form of a Map). What is left to be done is to create a mapping layer from the object model to this JSON like structure. ORM level integration should work very similarly to how Compass implements it today.
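As a rough illustration of what such a mapping layer could look like at its simplest, the sketch below turns the fields of a flat object into a Map using reflection. It is not how Compass OSEM or ElasticSearch does it; `ObjectToMapSketch`, `toMap`, and the `Recipe` class are hypothetical names, and nested objects, collections, and per-field mapping settings are deliberately ignored.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.LinkedHashMap;
import java.util.Map;

/** Minimal object-to-Map mapper for flat objects; an illustration, not a real OSEM layer. */
class ObjectToMapSketch {

    /** Hypothetical domain class, used only for this example. */
    static class Recipe {
        String title;
        int servings;

        Recipe(String title, int servings) {
            this.title = title;
            this.servings = servings;
        }
    }

    /** Copies every non-static declared field of the object into a JSON like Map. */
    static Map<String, Object> toMap(Object source) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (Field field : source.getClass().getDeclaredFields()) {
            if (Modifier.isStatic(field.getModifiers()) || field.isSynthetic()) {
                continue; // skip constants and compiler-generated fields
            }
            field.setAccessible(true);
            try {
                result.put(field.getName(), field.get(source));
            } catch (IllegalAccessException e) {
                throw new IllegalStateException("Cannot read field " + field.getName(), e);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(toMap(new Recipe("Onion soup", 4)));
    }
}
```

The resulting Map is the kind of JSON like structure that could then be handed to the indexing API; a real mapping layer would also need to handle ids, nested objects, and the reverse direction.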

In terms of Java (JVM) level integration, ElasticSearch can easily start either embedded in or remote to the Java process, in either distributed mode or single node mode.

So, what should someone do today? If you are going to start a new project, I would suggest you take ElasticSearch for a spin, I am sure you will like it. Existing Compass users should start to give serious thought as to how to migrate to ElasticSearch. Hopefully, once OSEM is implemented in ElasticSearch, the migration will be simpler.

Regarding the current Compass 2.x version, it’s basically in maintenance mode. I will try and help in the forum as much as I can. I will gladly accept patches and apply them to trunk, and maybe even release a minor version for it. If someone would like to get more involved with it (administer the forum, help with the patches, releases, commit permission, and so on), I would be happy for it.

As far as I am concerned, the future is ElasticSearch. It is probably the most advanced open source distributed search solution you can find today, and its integration with Java (JVM) is a first class citizen. I hope that the Compass user base will follow…

The Future of Compass & Elasticsearch was originally published by Shay Banon at the dude abides on July 07, 2010.

ElasticSearch

2010-02-08T00:00:00-05:00

Well, the gig is out. What I have been working on for the past several months is now alive. ElasticSearch is an open source, distributed, RESTful search engine. More info about it can be found in the You Know, for Search blog post.

How does that relate to Compass? Good question, which deserves a proper blog post. I will be maintaining another blog just for ElasticSearch, and there is also a Twitter account you should follow.

Enjoy!

ElasticSearch was originally published by Shay Banon at the dude abides on February 08, 2010.

Quantum Physics and Data Grids

2009-05-06T00:00:00-04:00

One of quantum physics’ crazier notions is that two particles seem to communicate with each other instantly, even when they’re billions of miles apart. Albert Einstein, arguing that nothing travels faster than light, dismissed this as impossible “spooky action at a distance.”

The great man may have been wrong. A series of recent mind-bending laboratory experiments has given scientists an unprecedented peek behind the quantum veil, confirming that this realm is as mysterious as imagined.

Based on this theory, there have been several computer science related experiments, especially revolving around Quantum Computing and different encryption algorithms, that I won’t get into in this blog post. What I would like to suggest is how Quantum Physics can revolutionize the area of (In Memory ;) ) Data Grids.

One of the main problems of Data Grids is the ability to replicate state changes from one instance to the other, especially when using WAN. This highly ties into Brewer’s CAP Theorem, which states that:

When designing distributed web services, there are three properties that are commonly desired: consistency, availability, and partition tolerance. It is impossible to achieve all three. In this note, we prove this conjecture in the asynchronous network model, and then discuss solutions to this dilemma in the partially synchronous model.

What I am suggesting is that once the ability is there (and it is very close; there are already companies building highly secure computer systems using Quantum Physics), the CAP theorem will no longer be applicable.

As a thought experiment, imagine that a photon has an up spin, and that represents binary 1 (I know, it can get much more advanced than that, I am simplifying things). A down spin represents binary 0. Once photons are “entangled” (we bring up our data grid), and then we separate them (across the building or across the ocean, does not really matter), we can get “instantaneous replication”. Once we change the state of one photon, the other will change its state instantaneously (only when we check its state, but that is when we really care about it ;) ). By exhibiting this behavior, we actually can get all three properties of the CAP theorem.

Imagine as well the ability to store a “local cache” of the data. Since the size of data that can be “stored” is exponentially bigger than with current technology, and “state change” is not bounded by current technology (no need for wires), most people can have most of the data locally most of the time (which, in itself, is relative). Once local data is changed, there is no need for 2PC or something like that in order to update the master data. For all intents and purposes, we hold the master data :).

Quantum Physics is going to revolutionize the way we go about and use technology. What I talked about is just the tip of the iceberg, but I personally believe that once the technology starts maturing, the impact it will have on the world will dwarf the arrival of computers, the industrial revolution, or any other major event that occurred in our not so long history.

One of Einstein’s famous quotes regarding Quantum physics is “God doesn’t play dice with the universe”. I personally prefer the rebuttal by Niels Bohr, a big proponent of quantum uncertainty: “Quit telling God what to do.”

Quantum Physics and Data Grids was originally published by Shay Banon at the dude abides on May 06, 2009.

OSGi - Still Waiting for Magnum

2009-02-02T00:00:00-05:00

I have a love-hate relationship with OSGi. On one hand, I think that it’s a great spec that provides amazing capabilities of modularization within the JVM. On the other hand, it’s a monstrosity that, in order to use it, requires you to heavily lubricate your MANIFEST file, since you know where it’s going to end up. Before you jump up and replace that MANIFEST with something else and apply it to me, let me make my point.

I just watched the wonderful Zoolander movie, and in the movie, everybody waits for Zoolander to reveal his Magnum face. At the end, Zoolander manages to find his Magnum face. With OSGi, I am still waiting for it to happen.

We at Javaland somehow always manage to make things much more complex before we realize that and work very hard in order to simplify them (JSF anyone?). The same goes with OSGi. In 95% of the applications, I don’t really need OSGi. Most applications can live happily with the current deployment model (each web application in its own class loader, versioning done on a per web app level).

Worse yet, in order to separate my web application into several modules, I need to either work really hard (Import / Export), or rely on tooling. Personally, I hate to work hard, and hate even more to rely on tooling (websfear anyone?). Take for example Spring dm server: while a novel concept and an amazing technical achievement, in practice it just makes my life more complex for 99% of my typical applications. Sure, it can deploy a “plain” war file, which is great, but then am I not just better off with plain Tomcat? I guess that this exact reasoning was behind the upcoming Spring tc server.

Sure, application servers, or Eclipse for example, can use OSGi to their heart’s content, but the best thing about it is that they almost completely hide the fact that OSGi is used, either through tooling for Eclipse (which, in the case of Eclipse, is actually ok, since I develop plugins for Eclipse), or by supporting plain web applications in the application server. Another great example is Terracotta TIMs: sure, they use OSGi, but who the frack cares? (Personally, I think it was overkill, but as long as I, as an integration module developer, don’t know it, I don’t really care.)

From an architectural standpoint, by the way, I strongly believe that modularity within the same VM is almost never needed. Personally, I will almost never create a DAO module and a Service module, since most times I will roll out an upgrade that changes the whole application. Most times when it is needed, you are usually better off abstracting the module behind a REST API.

The main case where you do need it, IMO, is when working in a collocated architecture. In such cases, modules do need to reside within the same VM and share another “heavy” collocated module (such as a data grid node/partition). In such cases, a very simple dependency tree (easily achieved with a class loader hierarchy) can be created, with the common collocated module at its base. This is, by the way, why I like the Java Module System JSR: it’s much simpler than OSGi and focuses on modularity and nothing else.
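A class loader hierarchy of that kind can be sketched with nothing but the JDK. In the example below, one loader stands for the shared collocated module and two child loaders stand for the modules built on top of it. The names (`ModuleClassLoaderSketch`, `moduleLoader`, the `orders` and `emails` modules) are made up, and the jar lists are left empty so that the sketch runs without any real module jars.

```java
import java.net.URL;
import java.net.URLClassLoader;

/** Sketch of a simple module tree built from plain class loaders. */
class ModuleClassLoaderSketch {

    /** Creates a loader for one module's jars; lookups are delegated to the parent first. */
    static URLClassLoader moduleLoader(URL[] jars, ClassLoader parent) {
        return new URLClassLoader(jars, parent);
    }

    public static void main(String[] args) throws Exception {
        ClassLoader application = ModuleClassLoaderSketch.class.getClassLoader();

        // Base of the tree: the shared "heavy" collocated module (for example a data grid node).
        URLClassLoader shared = moduleLoader(new URL[0], application);

        // Two independent modules; each sees the shared module, but not the other module.
        URLClassLoader orders = moduleLoader(new URL[0], shared);
        URLClassLoader emails = moduleLoader(new URL[0], shared);

        System.out.println("same parent: " + (orders.getParent() == emails.getParent()));
        System.out.println("delegation works: " + (orders.loadClass("java.lang.String") == String.class));
    }
}
```

With real jars in place of the empty arrays, classes from the shared module would be loaded once and seen by both children, while each child's own classes would stay isolated from the other.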

I am not saying that modularity within the VM is not needed. I just want the following things: When it is not needed, I want things to remain the same. And when it is needed, pretty please, with sugar on top, keep it simple. No tooling and no 300-line MANIFEST files please.

OSGi - Still Waiting for Magnum was originally published by Shay Banon at the dude abides on February 02, 2009.

Actor Model and Data Grids

2009-01-18T00:00:00-05:00

The Actor Model has been getting a lot of (much deserved) hype over the past year, with languages like Scala and Erlang pretty much leading the way in reviving it.

First, here is what Wikipedia has to say about the Actor Model:

In computer science, the Actor model is a mathematical model of concurrent computation that treats “actors” as the universal primitives of concurrent digital computation: in response to a message that it receives, an actor can make local decisions, create more actors, send more messages, and determine how to respond to the next message received.

With Data Grids, a lot has been already said about collocation. When using a Data Grid, the next logical step (the first is to stop hitting the database so much) is to collocate business logic execution with data.

One way to collocate business logic with Data can be done by sending the work to the node where the data resides. Data Grid vendors realize that this is a much needed feature and provide means to do so (Coherence with the InvocableService or GigaSpaces Executors).

Another way to collocate data with business logic is to actually start the Data Grid node in a collocated manner with a set of business services that interact with it in a collocated and event driven manner (this is called a processing unit in GigaSpaces).

A business service in such a case is a simple service that usually reacts to events happening in the Data Grid, processes them, and produces results that cause other events to happen in the Data Grid.

If we take a step back for just a moment, you can see that the last paragraph sounds very similar to the Actor Model definition (not completely, but we are getting there…).

So, first, what constitutes an event in a Data Grid? Well, surprise surprise, Data. Data being written / changed / removed from the Data Grid forms events within the Data Grid.

So, let’s take a simple service. The service is running on each data grid node (the Data Grid is partitioned) in a collocated manner with each node. The service reacts to events occurring in the Data Grid (an Order has been submitted/written to the Data Grid). It then processes the event, and produces more events. More events can be, for example, changing the status of the Order to “Requires Validation” (which will trigger the validation service), as well as writing an “Email Message” with a link to the Order into the Data Grid (which will cause another service to email the fact that an Order has been received).

As you can see, the service above is an Actor, but not completely. We need to be able to control the “transactionality” of the Actor by making the whole process transactional (a good Data Grid is also a transactional Data Grid, especially for collocated executions).

We also need to make the Actor highly available. In this case, again, it should be simple. Data Grids usually provide the option to run backups for each partition. We should be able to just create the service on backup nodes as well, and once the backup node becomes primary, the service should kick in and start processing Data. While the service is running collocated with a node that is in backup state, it should be in an “inactive” mode, waiting for a failure of the primary node.

Last, we should be able to make the Actor highly concurrent and thread safe. If the Actor operates solely using the Data Grid API, then it is thread safe, as Data Grid APIs are thread safe and highly concurrent. There also should be a way to instantiate several instances of the Actor in order to fully utilize multi core systems.

Up until now, I tried to explain how Data Grids fit very nicely with the Actor Model. Let me finish with a simple example based on GigaSpaces APIs to show how all of the above can be achieved.

The diagram shows a simple polling container that wraps a service (Actor, the Order service in our example). The polling container takes (removes) events (Data), processes them, and writes back Data to the collocated Space (Data Grid) it is running with. Here is an example of how a service like that can be coded:

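import org.openspaces.core.GigaSpace;
import org.openspaces.events.EventDriven;
import org.openspaces.events.EventTemplate;
import org.openspaces.events.TransactionalEvent;
import org.openspaces.events.adapter.SpaceDataEvent;
import org.openspaces.events.polling.Polling;
import org.springframework.beans.factory.annotation.Autowired;

// Order, OrderState, and EmailMessage are application domain classes (not shown here).
@EventDriven @Polling(concurrentConsumers = 3) @TransactionalEvent
public class NewOrderActor {

    @Autowired
    GigaSpace gigaSpace;

    // Template used to match events: any Order in the NEW state.
    @EventTemplate
    Order newOrder() {
        Order template = new Order();
        template.setState(OrderState.NEW);
        return template;
    }

    // Called with each matching Order taken from the collocated Space.
    @SpaceDataEvent
    public Order newOrderArrived(Order newOrder) {
        EmailMessage email = new EmailMessage(newOrder);
        gigaSpace.write(email); // cause event to process emails

        newOrder.setState(OrderState.REQUIRES_VALIDATION);
        return newOrder; // written back to the Space, which triggers validation
    }
}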

The above example will use a template of “new Orders” to receive events. Once a new order is added to a certain primary partition, the polling container will take it and call the newOrderArrived event handler. Within the event handler, a new EmailMessage will be written to the Data Grid, which will start a chain of events that will (hopefully, more Actors are needed) send an email. It will also return an updated Order with its state changed that will be written back to the collocated Data Grid. The updated Order will cause the OrderValidator actor to kick in and do its magic. Of course, thanks to the @TransactionalEvent annotation, everything will happen under a single transaction.

That’s it. I hope that the “collocation” of Data Grids and the Actor Model makes as much sense to you as it does to me.

Actor Model and Data Grids was originally published by Shay Banon at the dude abides on January 18, 2009.

Lakers Vs. Celtics

2008-06-08T00:00:00-04:00

While waiting (yes, another sleepless night) for the second game in the Lakers Vs. Celtics NBA finals, I remembered one of my all time favorite arcade games that I played when I was a child. It was an NBA arcade game called Lakers versus Celtics and it was one of the main reasons for my love for the NBA (and the Lakers).

So, while waiting for the game, I managed to run it on my MacBook Pro. Here is what I needed to do: first, download DOSBox. Next, head over and download the game here. Using DOSBox is pretty simple: you start it, and then run mount c ~/path/where/you/unzipped/the/game. So much fun! Warcraft 2 anyone? :).

Lakers Vs. Celtics was originally published by Shay Banon at the dude abides on June 08, 2008.

Dune and Scalability

2008-02-22T00:00:00-05:00

I am reading the wonderful Dune book again. Another quote from Dune reminded me of problems we have when trying to build a scalable application:

Kynes looked at Jessica, said: “The newcomer to Arrakis frequently underestimates the importance of water here. You are dealing, you see, with the Law of the Minimum.”

She heard the testing quality in his voice, said, “Growth is limited by the necessity which is present in the least amount. And, naturally, the least favorable condition controls the growth rate.”

“It’s rare to find members of a Great House aware of planetological problems,” Kynes said. “Water is the least favorable condition for life on Arrakis. And remember that growth itself can produce unfavorable conditions unless treated with extreme care”.

So true. When we build an application and notice some performance problems, we first need to try and nail down the “least favorable condition”. For example, if we have a messaging system talking to a database, and we notice a bottleneck in the messaging system, we first need to tackle it.

But “growth itself can produce unfavorable conditions”. We might find that we increased the capacity of our messaging system twofold, only to find that our database became the next bottleneck, which allowed our system to grow by a much smaller factor.
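The arithmetic behind this can be shown in a few lines of Java. The sketch below treats the system as a chain of stages whose overall throughput is capped by the slowest one. The stage names and the capacity numbers (1000, 1500, 2000 messages per second) are invented for illustration, and `BottleneckSketch` is a hypothetical helper that assumes a non-empty map of stages.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** The "Law of the Minimum" for a chain of stages: the slowest stage caps the whole system. */
class BottleneckSketch {

    /** Returns the name of the stage with the lowest capacity (the "least favorable condition"). */
    static String bottleneck(Map<String, Double> capacityPerStage) {
        String slowest = null;
        for (Map.Entry<String, Double> stage : capacityPerStage.entrySet()) {
            if (slowest == null || stage.getValue() < capacityPerStage.get(slowest)) {
                slowest = stage.getKey();
            }
        }
        return slowest;
    }

    /** Overall throughput equals the capacity of the bottleneck stage. */
    static double throughput(Map<String, Double> capacityPerStage) {
        return capacityPerStage.get(bottleneck(capacityPerStage));
    }

    public static void main(String[] args) {
        Map<String, Double> stages = new LinkedHashMap<>();
        stages.put("messaging", 1000.0); // made-up capacities, in messages per second
        stages.put("database", 1500.0);
        double before = throughput(stages);

        stages.put("messaging", 2000.0); // double the capacity of the messaging system
        double after = throughput(stages);

        System.out.println("new bottleneck: " + bottleneck(stages));
        System.out.println("overall growth factor: " + (after / before));
    }
}
```

With these made-up numbers, doubling the messaging capacity moves the bottleneck to the database, and the system as a whole grows by a factor of 1.5 rather than 2.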

Even so, “unfavorable conditions” can be much worse, where we find that our whole architecture can simply not scale anymore, and we have to re-architect it to meet our needs. Extreme care, Kynes said, and he is right… .

Dune and Scalability was originally published by Shay Banon at the dude abides on February 22, 2008.

Great Quote from Dune

2008-02-19T00:00:00-05:00

I am reading the wonderful Dune book again and wanted to share the following quote:

Many have marked the speed with which the Muad’Dib learned the necessities of Arrakis. The Bene Gesserit, of course, know the basis of the speed. For the others, we can say that Muad’Dib learned rapidly because his first training was in how to learn. And the first lesson of all was the basic trust that he could learn. It is shocking to find how many people do not believe they can learn, and how many more believe learning to be difficult. Muad’Dib knew that every experience carries a lesson.

So true…

Great Quote from Dune was originally published by Shay Banon at the dude abides on February 19, 2008.

Cognitive Programming

2006-04-23T00:00:00-04:00

I really love sport activities, and when I was a teenager, I could not care less about programming, software, or studies; my happiness came from playing football (or soccer for you Americans), basketball, and so on. One of my most joyous days in high school was when my class (it was the “they are smart so we are gonna stick them in the same class so bullies don’t have to travel to different places to beat them up” class) got to the finals in my high-school football championship.

What I found was, that when I was playing basketball, my game really picked up after watching good NBA games. I used to sit in the middle of the night and watch Michael Jordan and Charles Barkley (my favorite basketball player) play, and imagine myself playing the same way. What I think happened was, that when I went to sleep, I was still thinking about all the moves and jump shots, and somehow it got into my mind. I did not practice more, but the mere fact that I was watching them, elevated my game to a whole new level (naturally, only until the bullies showed up and trashed our basketball).

The same, I think, happens with programming. You can read all the books in the world, but for me, the best thing is to look at good code. For example, after starting to delve into the Spring code base (long time ago), I think that I am a better programmer (thanks Jurgen ;)), and the fact that I look at the code late at night (I am a night crawler), before I collapse and go to sleep, enhances the assimilation process.

Yea, I know, there is a Catch 22 here (is there a situation where there isn’t?). If you want to write good code, you need to be able to identify good code. But in order to identify good code, you need to have a “good code” sonar. It does take time, but you can slowly graduate to a state where good code will simply make sense, and bad code will make you remember the time you saw Jersey Girl (the movie). The bad thing is that within our open source world (where you can actually see the code), a successful project does not necessarily mean good code (have a look at most Apache projects). So how do you know where to find good code to look at? That is a really difficult question. If you want to listen to me, start looking at the Spring source code (don’t get me wrong, there is some nasty stuff there as well). 15 minutes before you hit the sack, glance over how JMS server session is implemented, or the Task Execution abstraction. It will make you a better software developer, or at least expand your views.

Cognitive Programming was originally published by Shay Banon at the dude abides on April 23, 2006.

The Birth of Compass

2006-03-26T00:00:00-05:00

It all started when I moved to London. My wife had just started studying at the Cordon Bleu to be a chef (yea, I know, I’m a lucky SOB), and she really needed something to help her manage all her recipes. As we all know, most successful open source projects start with an “itch to scratch”, but only slightly less known is that, for most married men, the itch starts with the Missus ;).

Well, me being a geek software developer, I went out and checked all the current things out there, but there was nothing close to what I envisioned as good recipe management software. So, I started iCook (yea, I am a Mac geek as well).

While I was writing the software, I was looking for a job, and I really wanted to start using all the cool/resume enhancing Java projects. It meant that I decided to use Spring, Hibernate, and Eclipse RCP (an overkill? Definitely! But hey, I told my wife that I’m going to use it as a learning experience as well, and she, like the usual user, told me that she does not give a crap, as long as it works and works well).

I started to do the usual bits, define a proper domain model, using your usual Spring Dao, Service abstraction, and integrating all that with Eclipse RCP. Being an avid Mac user, I wanted something similar to most Mac software, a great user experience and I consider search to be a vital part of it. Basically it meant having a general search box that would search on all my domain model (Recipes, Articles, Books, Steps, Ingredients, …). Naturally, I turned to the best Java search engine out there, namely Lucene.

While trying to integrate Lucene into my app (hey, after a couple of days, I got a proper app running, listing and editing Recipes in Eclipse based RCP - the early crappy version) I stumbled into several problems with Lucene.

The first problem with Lucene was transaction support. Before you start, I know, my wife and iCook could not care less about transaction support in her single user recipe manager software, but it persisted in my mind as a drawback for integrating Lucene with a transactional system.

The second problem was the fact that I expected my application to have constant updates to the Lucene index. As most Lucene users know, this is not simple. The deletion of a Document is done using an IndexReader (funny name), and adding documents is done using the IndexWriter. If you add the same Document in Lucene (i.e. have the same business ids), it will be duplicated in the Lucene index unless you delete it first. And in order to get proper performance out of Lucene, you should batch your deletes, and then perform your updates.
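The behaviour described above can be modelled with a toy, in-memory index. The sketch below deliberately does not use the Lucene API; `AppendOnlyIndexSketch`, `Doc`, and the method names are hypothetical. It only shows why a plain add duplicates a document with the same business id, and how an update can be done as one batch of deletes followed by the adds.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy append-only index: adding never replaces, so updates must delete first. */
class AppendOnlyIndexSketch {

    /** A document identified by a business id. */
    static final class Doc {
        final String id;
        final String body;

        Doc(String id, String body) {
            this.id = id;
            this.body = body;
        }
    }

    private final List<Doc> docs = new ArrayList<>();

    /** A plain add: nothing checks whether a document with this id already exists. */
    void add(Doc doc) {
        docs.add(doc);
    }

    /** Deletes every stored document whose id is in the given set, in one pass. */
    void deleteAll(Set<String> ids) {
        docs.removeIf(doc -> ids.contains(doc.id));
    }

    /** Batched update: collect the ids, do all deletes first, then add the new versions. */
    void update(List<Doc> changed) {
        Set<String> ids = new HashSet<>();
        for (Doc doc : changed) {
            ids.add(doc.id);
        }
        deleteAll(ids);
        docs.addAll(changed);
    }

    /** Counts how many stored documents carry the given business id. */
    int countWithId(String id) {
        int count = 0;
        for (Doc doc : docs) {
            if (doc.id.equals(id)) {
                count++;
            }
        }
        return count;
    }
}
```

Calling `add` twice with the same id leaves two copies in the index, while `update` leaves exactly one per id (assuming the batch itself does not repeat an id).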

The third problem was the hurdle of mapping Lucene Documents into my domain model. I had to do the old JDBC to Domain model mapping that you usually do with Databases, and I thought that once I have Hibernate, I don’t care about this (as much) anymore. I wanted something similar for Lucene.

Don’t get me wrong, Lucene does very well in what it aims to provide, a low level Java search engine. But for me, it did not follow the rule of make simple things simple, and make difficult things possible. When you look at the current successful projects in the Java world, they are projects that help you deliver and develop at a much faster pace, usually breaking current misconceptions or habits.

So I dumped my iCook project for now, and started to work on this project, which was later called Compass. My main requirements for the project at that time were transaction support, identifiable “Documents”, and an Object mapping framework.

Naturally, once I got started, I found out that there are a lot of things missing when you just use Lucene: caching of IndexReaders and Searchers and invalidating them for better performance, and a lower level abstraction than OSEM (my name for Object Search Engine Mapping), which I called RSEM (Resource Search Engine Mapping) - think Spring Jdbc or iBatis for Lucene. Once you have OSEM, it only makes sense to integrate it with an ORM tool for seamless integration, and to add Spring support. And many others.

So the project grew and grew, I decided to release it on SourceForge under an open source license (initially LGPL, later moved to Apache 2, another story for another time), and the end result is that my wife still waits for iCook (it still means that I eat well!), and I spend my days and nights on Compass. I do have a wonderful wife :-).

The Birth of Compass was originally published by Shay Banon at the dude abides on March 26, 2006.