Friday, June 15, 2007

Hibernate Search - cool, but is it the right approach? Yeah baby!

Sanjiv Jivan wrote a blog entry questioning the "point" of Hibernate Search. He missed some critical steps in his argument, which I would like to correct. I started to answer on his blog, but since the answer was getting fairly long, I opted for a blog entry instead.

I think Sanjiv failed to understand which population Hibernate Search is targeting.
Hibernate Search is about ORM. If you don't use Hibernate, if you don't use JPA, forget about Hibernate Search, it's not for you.

His main point is: why use Hibernate Search instead of a straight Lucene + database (I'm assuming JDBC) solution? Five years ago he could have asked: why use an ORM rather than straight JDBC access? Because it does and optimizes 90% of the job for you and lets you focus on the 10% that is hard.
I won't explain why an ORM is usually (but not always) a good approach (everybody gets that nowadays), so let's focus on a different question: considering that Hibernate is used in a given application, should we go for a plain Lucene and JDBC layer as Sanjiv suggests, or should we go for Hibernate Search? Should we go for two different sets of APIs, programmatic models and model representations, or should we go for one unified model?

Let's see each of Sanjiv's concerns one at a time.

Why Hibernate Search rather than plain Lucene and JDBC?
Out of the box, setting up a plain Lucene and JDBC solution requires you to write the bridge. Lucene has its own world, the DB another one. Your code has to bind them together (write the optimized JDBC routine + the optimized Lucene index routine). It can be long, painful and buggy.
I doubt Sanjiv has had to do it before, or he would not talk like that :) Hibernate Search does the binding for you in your Hibernate backed application.
People are attracted by Hibernate Search because it lowers the barrier of entry to Lucene in a project by a great deal. This opens the Search capabilities to a lot of applications that would not have considered it with only plain Lucene in their hands.

Hibernate (Search) does not play well with massive indexing
Sanjiv claims that the initial indexing (or reindexing) is slow (he hasn't tried actually) and memory consuming.
Have a second look at the Hibernate Search reference documentation, the massive indexing procedure explicitly helps you to control the amount of memory spent.
In Lucene, one good rule of thumb is to use as much memory as possible to minimize IO access. So yes, the more memory you spend, the more efficient your Hibernate Search massive indexing will be. You have to think about the global system, not only a subpart.
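
To make the "controlled memory" idea concrete, here is a minimal sketch in plain Java. It is not the Hibernate Search API: the class `WindowedIndexer` and the `indexWriter` callback are hypothetical names standing in for "read rows from the database" and "hand them to Lucene". The only point it illustrates is that the number of objects held in memory is bounded by a batch size you choose.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper (not part of Hibernate Search): indexes rows in fixed-size windows. */
final class WindowedIndexer {
    private final int batchSize;
    private int peakBuffered = 0;
    private int totalIndexed = 0;

    WindowedIndexer(int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
    }

    /** Streams rows to the index writer, never buffering more than batchSize rows. */
    <T> void indexAll(Iterator<T> rows, Consumer<List<T>> indexWriter) {
        List<T> window = new ArrayList<>(batchSize);
        while (rows.hasNext()) {
            window.add(rows.next());
            peakBuffered = Math.max(peakBuffered, window.size());
            if (window.size() == batchSize) {
                flush(window, indexWriter);
            }
        }
        if (!window.isEmpty()) {
            flush(window, indexWriter); // last, partially filled window
        }
    }

    private <T> void flush(List<T> window, Consumer<List<T>> indexWriter) {
        indexWriter.accept(new ArrayList<>(window)); // hand over a copy of the window
        totalIndexed += window.size();
        window.clear(); // release the references so the buffer stays bounded
    }

    int peakBuffered() { return peakBuffered; }

    int totalIndexed() { return totalIndexed; }
}
```

A bigger batch size means fewer and larger writes, at the price of a bigger memory window; that is the trade-off to tune for the whole system.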

Event based indexing should not be used
Next Sanjiv tries to show that the event based indexing is wrong and that one should always use batch indexing. The honest answer is it depends.
Hibernate Search does not constrain you to index things per transaction (it's a pluggable strategy), and I never said that indexing at commit time was important. What is critical is not indexing before commit time (think about rollbacks).
As a matter of fact, the clustered mode (JMS mode) explicitly does not index at commit time, it delegates the work for later (and to someone else). The overhead of sending a message for later indexing (I'm not speaking of actual Lucene operations here) is minimal.
What do we gain? The usual on the fly vs batch mode benefits: no batch window, more homogeneous CPU consumption on systems, not having to take care of a batch job. I don't know about you, but the fewer batch jobs I have in my systems, the better I sleep.
By the way, is batch mode supported with Hibernate Search? Absolutely. Who likes to avoid batch jobs when possible? Most of the developers and ops guys I have met. When you need them, use them; when you don't, stop the masochism.
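
The "never index before commit" rule can be sketched in a few lines of plain Java. This is an illustration under my own assumptions, not the Hibernate Search implementation: `TransactionalIndexQueue` and its method names are hypothetical, and the `backend` callback stands for whatever strategy is plugged in (index right away, or send a message so that someone else indexes later).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical sketch: index work is queued per transaction and only released on commit. */
final class TransactionalIndexQueue {
    private final List<String> pending = new ArrayList<>();
    private final Consumer<List<String>> backend;

    /** backend is the pluggable part: direct indexing, or handing off for later indexing. */
    TransactionalIndexQueue(Consumer<List<String>> backend) {
        this.backend = backend;
    }

    /** Called when an entity is inserted, updated or deleted inside the transaction. */
    void entityChanged(String entityId) {
        pending.add(entityId); // nothing touches the index yet
    }

    /** Called once the database transaction has committed. */
    void afterCommit() {
        if (!pending.isEmpty()) {
            backend.accept(new ArrayList<>(pending));
        }
        pending.clear();
    }

    /** Called when the transaction rolls back: the queued work is simply dropped. */
    void afterRollback() {
        pending.clear();
    }

    int pendingCount() { return pending.size(); }
}
```

Because the index is only touched after a successful commit, a rollback never leaves phantom documents behind, whichever backend is used.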

To justify that batch mode should rule, Sanjiv used data mining and star / snowflake schemas as an example. This is a very specific kind of application where ORMs are almost never used. They could be, with some adjustments to the ORM, but that's another story, maybe my next project :) Anyway, this is out of the scope of Hibernate Search, see the very first point.

I agree that JMS is highly over-engineered and should be simplified in Java EE 6, but come on, setting up a queue is only a few clicks in a graphical console... it's not too bad. Don't tell me JMS is too hard (Hibernate Search does the JMS calls by the way, not you).

Hibernate Search does not support third party modifications in the database
It's actually a fairly well-known problem for people who use a 2nd level cache in ORMs. Has the 2nd level cache been banned from our toolbox? Clearly not. But once again, Hibernate Search works fine in batch mode, so this should solve Sanjiv's concerns.

Annotation based indexing definition is not flexible
Is that an inflexible approach? How practical would it be to change them on the fly? Changing which elements are indexed, or how, would require reindexing the whole data set. Quite possible, but definitely not something that is useful on the fly. As for boosting, I set my field boosting at query time; I find it more flexible than index time boosting, so I never had the issue Sanjiv is describing.

Why use the Hibernate Search query API?
Why not use straight Lucene queries and APIs? It's all about text in the end.
The nice thing about Hibernate Search is that it's really easy to replace an HQL query by a Lucene query: just replace the Query object and you're done, the rest of the code remains unchanged. Because it is that simple, people tend to use Hibernate Search and Lucene queries in a wider range of use cases, and not simply for a Yahoo-like search screen (we always talk about Google, let's switch for a while ;) ):
- save some DB CPU cycles and distribute it to cheaper machines
- efficient multi word queries
- wildcards
- etc
Here is a use case that is clearly not about plain text:
"increase visibility of all books where 'Paris Hilton' is mentioned and double the increase if 'prison' is also present"

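To spell out the 'Paris Hilton' rule above, here is a tiny plain Java sketch. It is only an illustration of the business rule, not how Lucene computes scores: the class `VisibilityBoost`, the `baseScore` and the `increase` parameters are hypothetical. In a real application you would express this with query-time boosts and let Lucene do the ranking.

```java
import java.util.Locale;

/** Hypothetical illustration of the ranking rule; not Lucene's scoring formula. */
final class VisibilityBoost {
    private VisibilityBoost() { }

    /**
     * Returns baseScore unchanged when 'Paris Hilton' is absent, baseScore + increase
     * when it is mentioned, and baseScore + 2 * increase when 'prison' is present too.
     */
    static double score(String text, double baseScore, double increase) {
        String lower = text.toLowerCase(Locale.ROOT);
        if (!lower.contains("paris hilton")) {
            return baseScore;
        }
        double applied = lower.contains("prison") ? 2 * increase : increase;
        return baseScore + applied;
    }
}
```

The rule is a ranking decision driven by the business, which is why it belongs in application code rather than in a generic text search box.
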
Hibernate Search queries can return either managed objects or projected properties (retrieving only a subset of the data). When to use what?
Sometimes you use property projections rather than object retrieval in HQL queries, either for ease of use or for performance reasons. It's more convenient to play with objects, but you pick the best tool for the job. I would say the same kind of rules can be applied with Hibernate Search between a regular query and a field projection.

Hibernate Search not suitable for high volume websites
I love this one. I have designed high volume websites backed by Lucene. I know what you gain, I know what you lose. Hibernate Search is full of best practices. The Hibernate Search clustering support is a good example of an architecture that an architect could mimic to scale with Lucene (up and out). But it's not the only one; it depends on the use case. That's why Hibernate Search does not impose an architecture, and that's why I prefer libraries over off-the-shelf products.

Would I recommend an off-the-shelf solution?
DBSight or Solr (which I know better) are interesting solutions indeed, but not for the same kind of projects, or at least not for the same integration strategy. We are comparing a library with a black box. BTW DBSight has a 3-minute install demo. I could not beat them, it took me 15 minutes on stage at JavaOne (but I walk and talk a lot :) )
I have never been a big fan of black boxes nicely integrated in my IT system, but if I had to choose such a solution I would also give the Google Search Appliance a try, the Google Mini is fairly cheap.


Anyway, Hibernate Search has been developed with practical solutions for practical problems, not theoretical considerations. Giving it a shot is the only way to judge.
Damn long post, sorry about that :(

14 comments:

Chris said...

Hi, Emmanuel, this is Chris Lu from DBSight. I share your feeling about other people having different opinions of your product. You know it can do more and better. But users may not get it.

On the other hand, don't argue with your "potential users". Either API should be documented more, or think from your potential users' view.

No doubt about it. Hibernate Search is a great fit for Hibernate users. With its tight and natural relation, Lucene can be very easy. But my point is, it's too tight, with respect to language choice, coding library choice, etc. Regarding ORM choice, I personally am out of the Java Hibernate trap and have turned to Ruby's ActiveRecord. Everything can be sooooo easy!

DBSight has a similar "Don't Repeat Yourself" philosophy to ActiveRecord, and makes it general.

You can use DBSight with Lucene query, and complicated database structure, and add/change search easily. I doubt Google Mini can really handle it. And you can check the price and what you can get to see if it's really cheap or not.

Solr is great, but the database side is lacking. DBSight's UI to select from complicated databases is good, but it's a dirty job. In open source society, the dirty job usually is not done, unless someone pays the money for the OSS, which would be nice. But not everyone has such a chance.

afsina said...

Dude, the subject is Hibernate and Lucene. Don't like this? Use Compass or, hell, hand code it, it is no big deal. Do you have to insert Ruby into each conversation? Talk when Ruby becomes 50 times slower, not 100 times as now. You guys are the worst.

Unknown said...

Yeah, documentation needs to be enhanced. That's why products are labeled beta.

BTW, I did not talk about it but NHibernate also has an NHibernate Search version thanks to Ayende (http://www.ayende.com/Blog/archive/2007/04/02/NHibernate-Search.aspx).
He will probably have to catch up with the latest version now though :)

Unknown said...

After reading your response I still feel that the points I raise are valid. I did mention and still do feel that Hibernate Search is an attractive option for applications not dealing with large amounts of data, as there are fewer moving parts. However I don't see this as the best available option for applications dealing with large volumes of data. I would be happy to be proved wrong as Hibernate Search matures.

In response to your point "Why Hibernate Search rather than plain Lucene and JDBC?", I did mention in my article that working with Lucene directly is more work than simply using Hibernate Search. But the point is there are tools like DBSight that do this in a flexible manner, making it just as easy to leverage full text search abilities.

In response to your point "Hibernate (Search) does not play well with massive indexing"
Do you feel that indexing the Amazon database with Hibernate Search is going to be comparable to just running Lucene against their database? If you do, I beg to differ. There is a significant overhead in reading managed Hibernate objects even if the session is flushed in batches. And I have tried this batch flushing approach on numerous occasions and carefully profiled it, but it just points to Hibernate taking up lots of CPU and memory.

Regarding "Event based indexing should not be used", I think I've made my points and your take is different. I suppose we can let the reader decide. I've mentioned that I feel that full text search is a separate concern in the sense that it is not directly related to the application's business logic. Since Hibernate Search runs in-process with the application and is pretty much hit with any update operation, it does add more risk to the main application, whose primary function is most likely not full text search. Risk is introduced with every line of new code that is hit, no matter how solid the code is. It's inevitable that there are going to be bugs. This is not a big deal for many applications, but on the other hand is an unnecessary risk for others. With the offline indexing approach directly against the database, it's not a big deal if one batch job fails. It can be rerun without affecting the primary application. As they say, the best optimization is code that is not run :)

I'm not a big fan of black boxes integrating with an IT system, but I don't think that argument is applicable here. It's more like Oracle managing indexes, building execution plans and running stats. I wouldn't want to be doing that with Hibernate.

Anyways, thanks for taking the time to respond to my post. I'll be keeping a close eye on how things progress with Hibernate Search.

Anonymous said...

still think compass is a better approach!

Unknown said...

If I had to index a data set the size of Amazon, I would break the indexing into several workers, then reconsolidate, which can be supported by Hibernate Search. In such an architecture, the pure indexing part is not the bottleneck; the DB can be brought to its knees.

The session is not flushed during the indexing process, it's just cleared on a regular basis. So the traditional flush overhead just does not happen. Once again, the memory consumption is constant (and can be defined) in such a strategy. Of course objects are created and destroyed, so the GC will work, but the memory window is constant.

Event-based vs batch-based depends on the application requirements, and most indexing solutions support both. But it's only part of the problem: indexed data needs to be consumed (through search), and *that* is directly part of your business logic. Tuning a search engine/query so that it returns the most appropriate information requires knowing the business fairly well and will often vary depending on contextual (meta)data. So far the best way to express those things is in your application code.

Thanks for having shared your thoughts, it's indeed an interesting discussion. It brings to light room for improvement, in either code or documentation :)

Anonymous said...

I cannot speak about Hibernate Search specifically, but my team did try to use Compass, which takes a similar approach via its HibernateGps, and we ultimately abandoned it for several reasons, some of which Sanjiv touched on.

Although it was a breeze to get Search up and running on a small data set, once we had larger data sets we noticed a not-insignificant drag on save/update times when we used in-database indexing. Switching to file-based indexing improved performance, but we had recurring problems with corruption of the index files, which would cause the whole app to hang on startup. Perhaps the corruption occurred when multiple threads were indexing under heavy load, we were never sure.

And we had the problem that not all data entered the database via Hibernate, and re-indexing was VERY slow on large (millions of recs) data sets (it actually never completed, we gave up after 6 hours).

After many headaches, we gave up and went to an HQL-based search of indexed database tables, which worked like a charm, took only one man day to get working, and has never given us any bug reports.

Unknown said...

Hi Dennis,
It really depends on the actual data, but for the record, one Hibernate Search user achieved more than 550 inserts/s with the default Lucene factors config, i.e. 3,000,000 records in 90 minutes (reading the data from the database and indexing).

Anonymous said...

A Hibernate Lucene back-end sounds very attractive, but I would agree with many of the points Sanjiv has raised.

I must add here, I firmly believe that decisions on which fields are to be indexed and how to process them should be an administrative concern rather than a programmatic one.

A few of the things which constantly change in even a moderate installation of Lucene are: stopwords, synonyms, protected words. Besides, maintaining the quality of search requires the ability to change analysers at runtime.

Does Hibernate address these concerns?

Unknown said...

"A few of the things which constantly change in even a moderate installation of Lucene are: stopwords, synonyms, protected words. Besides, maintaining the quality of search requires the ability to change analysers at runtime."

Adjusting the stop words and synonyms is just about how your analyzer is implemented.
When you say changing analyzers, do you mean changing the analyzer for your queries? Yes, it is possible.

Anonymous said...

We use Lucene indexing on our web site. All our data is in tables. We have several pages that pull data directly from tables. At the same time we need to give our users the capability to do full text searches. We need to index frequently so that we minimize the mismatch between direct table access and search through Lucene indexes. To claim that it is OK for applications to index nightly or weekly is just not practical. Also, we have had to manage scalability issues ourselves, so if some of that architecture is taken care of out of the box I am really interested, especially because we are already using Hibernate in our technology stack. However, some of the opposing comments ring true, so we will have to test for ourselves.

manish said...

For a social network website, what would you recommend: using Hibernate Search or a tool like DBSight? Also, I am using Hibernate for all my database operations.

Unknown said...

Of course I am not objective. I would recommend Hibernate Search.
But if you ask the DBSight guys, they would recommend DBSight ;)

To make it clearer, I see no reason why Hibernate Search would not be suitable for a search engine in a social network.

Anonymous said...

Hello Emmanuel,

I'm glad that you have answered Sanjiv's blog post.
First, I believe that if someone criticizes performance (an old Hibernate topic :)), they have to bring their benchmark analysis to the discussion.
I will start to evaluate HSearch for a database with some hundreds of tables and a few million records (this is part of the feasibility study to adopt a solution). In the future I will try to publish the results of the performance evaluation here as an additional comment.

Regards,

Jose Carlos Canova