I will be joining Hazelcast as CTO

I am very excited to announce that I will be joining the world-class team at Hazelcast as CTO.

Hazelcast (www.hazelcast.com) develops, distributes and supports the leading open source in-memory data grid. The product, also called Hazelcast, is a free open source download under the Apache license that any developer can include in minutes to enable them to build elegantly simple mission-critical, transactional, and terascale in-memory applications. The company provides commercially licensed Enterprise extensions, Hazelcast Management Console and professional open source training, development support and deployment support. The company is privately held and headquartered in Palo Alto, California.

What this means for Hazelcast Users

I will join my efforts to those of Hazelcast CEO Talip Ozturk and the team and bring my deep knowledge of caching to Hazlecast to complement that of the team. I will be out and about at conferences and visiting customers.

With the team we will be figuring out what great features to add to Hazelcast and how to improve the leading open source In-Memory Data Grid.

We will also be bring to market Enterprise Extensions which add high value features based around the Hazelcast core.

Hazelcast has made an announcement that puts this move into their own words.

What this means for Terracotta BigMemory Users

We will develop a comparative caching/operational store project and product based on Hazelcast core.   This will then be an alternative for BigMemory users.

What this means for Ehcache Users

The ownership of Ehcache was transferred to Terracotta 4 and a half years ago when I joined them who then took over maintenance of it.

While Ehcache remains widely used today, the open source version is only suitable for single node caching. This is not that useful for most production contexts so it is not directly competitive with Hazelcast or for that matter In-Memory Data Grids, which deal with clusters of computers.

I expect Ehcache will implement JCache and that in the future those ISVs and open source frameworks which currently define Ehcache as their caching layer will instead define it using JCache, of which Ehcache will be one provider.

Hazelcast is already developing their JCache implementation, which is already up on GitHub.

What this means for JCache

JCache is the new standard for caches and IMDGs. It includes a key-value API suitable for caches and operational stores. Importantly it was designed primarily for IMDGs. Listeners, loaders, writers and other user-defined classes are expected to be executed somewhere in the cluster, not in process.  And the spec defines and single and batch EntryProcessors, the defining feature of IMDG, which enables in-situ computation.

I will continue to act as spec lead on new maintenance releases of JCache. And I will also work with the Java EE 8 expert group who are including JCache in Java EE8. And I will be working with open source frameworks and ISVs as they move to add a JCache layer to their architectures.

Hazelcast will be one of the first to market with an implementation of JCache which should be available in a production-ready implementation in February.

As it is Apache 2 open source, I encourage open source frameworks and ISVs to include Hazelcast in their distributions as they add JCache. That way they can ship with an out of the box IMDG built in without locking themselves or their users/customers into a single vendor.

Introducing Deliberate Caching

A few weeks ago I attended a ThoughtWorks Technology Radar seminar. I worked at ThoughtWorks for years and think if anyone knows what is trending up and down in software development these guys do. At number 17 in Techniques with a rising arrow is what they called Thoughtful Caching. At drinks with Scott Shaw, I asked him what it meant.

What the trend is about is the movement from reactive caching to a new style. By reactive I mean you find out your system doesn’t perform or scale after you build it and it is already in production. Lots of Ehcache users come to it that way. This is a trend I am very happy to see.

Deliberate Caching

The new technique is:

  • proactive
  • planned
  • implemented before the system goes live
  • deliberate
  • is more than turning on caching in your framework and hoping for the best – this is the Thoughtful part
  • uses an understanding of the load characteristics and data access patterns
We kicked around a few names for this and came up with Deliberate Caching to sum all of this up.
The work we are doing standardising Caching for Java and JVM based languages, JSR107, will only aid with this transition. It will be included in Java EE 7 which even for those who have lost interest in following EE specifically will still send a signal that this is an architectural decision which should be made deliberately.

Why it has taken this long?

So, why has it taken until 10 years after Ehcache and Memcache and plenty of others came along for this “new” trend to emerge?  I think there are a few reasons.

Some people think caching is dirty

I have met plenty of developers who think that caching is dirty. And caching is cheating. They think it indicates some architectural design failure that is best of being solved some other way.
One of the causes of this is that many early and open source caches (including Ehcache) placed limits on the data safety that could be achieved. So the usual situation is that the data in the cache might but was not sure to be correct. Complicated discussions with Business Analysts were required to find out whether this was acceptable and how stale data was allowed to be. This has been overcome by the emergence of enterprise caches, such as Enterprise Ehcache, so named because they are feature rich and contain extensive data safety options, including in Ehcache’s case: weak consistency, eventual consistency, strong consistency, explicitly locking, Local and XA transactions and atomic operations.  So you can use caching even in situations where the data has to be right.

Following the lead of giant dotcom

The other thing that has happened is that as giant dotcoms it cannot have escaped anyone’s notice that they all use tons of caching. And that they won’t work if the caching layer is down. So much so that if you are building a big dot com app it is clear that you need to build a caching layer in.

Early Performance Optimisation is seen as an anti -pattern

Under Agile we focus on the simplest thing that can possibly work. Requirements are expected to keep changing. Any punts you take on future requirements may turn out to be wrong and your effort wasted. You only add things once it is clear they are needed. Performance and scalability tend to get done this way as well. Following this model you find out about the requirement after you put the app in production and it fails. This same way of thinking causes monolithic systems with single data stores to be built which later turn out to need expensive re-architecting.

I think we need to look at this as Capacity Planning. If we get estimated numbers at the start of the project for number of users, required response times, data volumes, access patterns etc then we can capacity plan the architecture as well as the hardware. And in that architecture planning we can plan to use caching. Because caching affects how the system is architected and what the hardware requirements are, it makes sense to do it then.

 

 

Terracotta acquired by Software AG

As you probably have heard, Terracotta has been acquired by Software AG. This is an exciting development for both companies. Ari Zilka, CTO of Terracotta has a comprehensive blog post detailing the acquisition and its implications.

For me, it means I keep working for Terracotta, but now Terracotta is a wholly owned business unit within Software AG.

Ehcache will remain available in its current two editions: open source under the Apache 2 license and commercial with value-add features. And of course it will get even more investment as part of the larger organization.

I joined Terracotta 21 months ago. It has been an amazing ride so far for me and for Ehcache. I am looking forward to the next chapter. Right now my area of focus is on standardizing Java caching by leading the specification of JSR107. Once that is done we will implement the specification in Ehcache.

Holy Moly, Batman. Apple upgraded my Maven

I ran a Maven build this morning and it broke. Strange, as I had not changed anything. Maven’s ability to simply break because of a change to a non-versioned dependency or even more arcane, a versioned dependency with a non-versioned dependency of it’s own (a transitive dependency) is legendary.  So I thought that was the trouble. On my other machine I ran a build and the Maven output “looked” different. Stranger still. So I did a mvn version on both.

Whoa! My Big Mac was 2.11 and my MacBook Pro was 3.0.2.

It came down in the “Java for Mac  OS X  10.6 Update 4” which was released on 10 March.

Now a lot of folks are not ready to move to 3. For myself firstly I don’t need the grief until it is rock solid. Secondly, I use the site plugin and reporting extensively and this is all getting a big makeover in 3 which is not yet done.

There are lots of ways to go back to your old version.

My was is to add the old version into the front of my PATH in .bash_profile:

 

News on JSR107 (JCACHE) and JSR342 (Java EE 7)

JSR342

JSR342 was created on 14 March 2011. JSR107, or JCACHE, is included: In JSR342’s words:

The following new JSRs will be candidates for inclusion in the Java EE 7 platform:

Concurrency Utilities for Java EE (JSR-236)
JCache (JSR-107)

Isn’t JSR107 inactive?

But how could this happen if JSR107 is inactive?

Well the answer is that we are reactivating it. Oracle (various staff) and Terracotta (mostly me) have started work on the specification with the hope of having a draft spec for review by 20 April. To actually make sure it happens Oracle have allocated resources to work on this project. They are being led by Cameron Purdy, who is co-spec lead along with myself of JSR107 and the founder of Coherence. And of course I founded and still continue to lead Ehcache as CTO of it at Terracotta.

To be officially reactivated we need to submit the draft spec. So reactivation should happen on 20 April.

Motivations for finishing JSR107

Today there are two leading widely scoped frameworks for developing enterprise applications: Spring and Java EE. With the release of Spring 3.1, Spring, heavily influenced by Ehcache Annotations for Spring, has significantly enhanced their caching API. It is easier for Spring because they are a single vendor framework and can do things outside of a standards process. Java EE is still lacking any general purpose caching API. There are some use-specific APIs scattered throughout such as in JPA, but nothing developers can write to. And I know there is a significant need for a general purpose API. So, Java EE 7 wants a general purpose caching API, and this is the primary reason for finishing JSR107.

Another reason is that in-process caching is now heavily commoditised but not standardised. There are more than 20 open source in-process caching projects and another 5 or so commercial distributed caches. But if a user wants to change implementations, they need to change their code. This is akin to database access not having the JDBC standard. So we need to provide a standard API so that users can change caching implementations at low cost.

Scope of JSR107

There has been a bit of discussion about this, but it is most likely that the scope will be as it has been plus two new areas:

  1. Generics – similar to collections, allow caches to be created with defined keys and values
  2. Annotations and integration with JSR342 – allow caching annotations so that for example the return value from any functional method can be cached

The draft specification as it has existed for a few years is available under a net.sf.jsr107cache package name on SourceForge. And Ehcache provides an implementation of that draft spec via ehcache-jcache.

Get Involved

It seems a good time to look at the expert group and make a general invitation for new members.

If you are interested and would like to spend some time on this, please email me at gluck At gregluck.com and I can explain how to start. Additions to membership of the JSR107 are by voting of the existing members.

Ehcache 2.4 with the new Search feature is out!

Ehcache 2.4 launches today. The big new feature right in the core of Ehcache 2.4 is Search.

It uses a new fluent API which looks like this:

Results results = cache.createQuery().includeKeys().addCriteria(age.eq(32).and(gender.eq(“male”))).execute();

In short, it lets you further offload the database. With Ehcache now supporting up to 2TB and linear scale-out you can do more than ever.

What is searchable?

You can search against predefined indexes of keys, values, or attributes extracted from values.

Attributes can be extracted using JavaBeans conventions or by specifying a method to call.

For example to declare a cache searchable and extract age as a JavaBean and gender as a method out of a Person class:

Caches can also be made searchable programmatically. And Custom Value Extractors can be created so that you can index and search against virtually anything you put in the cache.

Search Query Language

Ehcache Search introduces EQL, a fluent, Object Oriented query language which we call EQL, following DSL principles, which should feel familiar and natural to Java programmers.

Here is a full example. Search for men whose names start with “Greg”, and then order the results by age. Don’t return more than 10 results. We want to include keys and values in the results. Finally we iterate through the Results.

Query query = cache.createQuery();
query.includeKeys();
query.includeValues();
query.addCriteria(name.ilike(“Greg*”).and(gender.eq(Gender.MALE))).addOrderBy(age, Direction.ASCENDING).maxResults(10);

Results results = query.execute();
System.out.println(” Size: ” + results.size());
for (Result result : results.all()) {
System.out.println(“Got: Key[” + result.getKey()
+ “] Value class [” + result.getValue().getClass()
+ “] Value [” + result.getValue() + “]”);
}

EQL is very rich. There is a large number of  Criteria such as ilike, lt, gt, between and or which you use to build up complex queries. There are also Aggregators such as min, max, average, sum and count which will summarise the results.

Like NoSQL, EQL executes against a single cache – there are no joins. If you need to combine the results of searches from two caches, you can perform two searches  and then combine the results yourself.

Standalone and Distributed

Search is built into Ehcache core. It works with standalone in-process caching and will work for distributed caches in the forthcoming Terracotta 3.5 platform release which goes GA in March and is available as a release candidate now.

Distributed cache search is indexed and executes on the Terracotta Server Array using a scatter gather pattern. The EQL is sent to each cache partition (the scatter), returning partial results (the gather) to the requesting Ehcache node which then combines the results and presents them to the caller. Terracotta servers utilise precomputed indexes to speed queries.

Indeed, the distributed cache performance has an important property: searches execute in O(logN)/partitions time. So if you have 50GB of cache in one partition which takes 40ms to search and then you double the data, you can hold the execute time constant by simply adding another partition. Generally, you can hold execution time constant for any size of data with this approach.

The standalone cache takes a different approach. Most in-process caches are relatively small. And Ehcache is lightning fast. We don’t use indexes but instead visit each element in the cache a maximum on once, resolving the EQL. It takes 5ms to run a typical query against a 10,000 element cache. Generally the performance is O(N) but even a 1 million entry cache will take less than a second to search using this approach.

Sample Use Cases

Caches can also be made searchable programmatically. And Custom Value Extractors can be created so that you can index and search against virtually anything you put in the cache.

Database Search Offload

Take a shipping company that creates 50GB of consignment notes per week. Customers search by consignment note id but also by addressee name. Most searches (95%) are done within two weeks of the creation of a consignment note.   The consignment notes get stored in a relational database that grows and grows. Searches against the database now take 650ms which take enquiry outside it’s SLA.

Solution: Put the last two weeks of data in the cache. Index by consignment note id, first name, last name and date.  Search the cache first and only search the database if there the consignment note is not found. This takes about  50ms and provides a 95% database offload.

Real-Time Analytics Search

In-house analytics engines processes large amounts of data and compute some result from it. The results need to be queried very quickly to enable processing of a within a business transaction. And the results need to be updated through the day in response to business events. Some examples are credit card fraud scoring, or a holding position in a trading application.

Create a distributed cache and index it as required. Various roll-ups are cached and updated after the system of record has been written to with new transactions. Use Ehcache’s bulk loading mode to quickly upload the results of overnight analytics runs.

Searches execute much more quickly than it would take to compute positions from scratch using the system of record, enabling the real-time analytics.

More Information

You can get Ehcache 2.4 from ehcache.org. Search is fully documented along with downloadable demos and performance tests here.

What I am Reading This Xmas

I thought I would share a few of the topics I am reading:

Xmas Research//Distributed  Systems

The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines Luiz André Barroso and Urs Hölzle 2009- s00193ed1v01y200905cac006.pdf
Lessons from Giant-Scale Services, Eric Brewer 2001

Xmas Research//Distributed  Systems/untitled folder
Xmas Research//Distributed Hashtable
Xmas Research//Distributed Hashtable/Chord (peer-to-peer) – Wikipedia, the free encyclopedia.webarchive
Xmas Research//Paxos
Xmas Research//Paxos/08-08-TA–XOpenEtc-I-1.pdf
Xmas Research//Paxos/Chubby.txt
Xmas Research//Paxos/paxos_made_live.pdf
Xmas Research//Paxos/the Paxos Commit algorithm.webarchive
Xmas Research//Search
Xmas Research//Search/block-oriented-suffixarrays.pdf
Xmas Research//Search/bloom-filter-index.pdf
Xmas Research//Search/compressed inverted.pdf
Xmas Research//Search/compressed-bwt.pdf
Xmas Research//Search/dynamic-fm.pdf
Xmas Research//Search/index-techniques.pdf
Xmas Research//Search/practical-compressed-indexes-review.pdf
Xmas Research//Search/xml-index.pdf
Xmas Research//Search/xpath-query.pdf
Inverted Indexes – http://en.wikipedia.org/wiki/Inverted_index

(The files are up on Dropbox but I discovered you cannot create a public link to a shared folder to everyone. You need to email to share. And there was no blindingly obvious way to share a link to my public folder, which kinds of seems amazing)

Ehcache: The Year In Review

Is it that time of year again. A time for reflection and of New Year’s resolutions. I therefore thought this was good time to reflect on what has been happening with Ehcache.

Ehcache and Terracotta got together in August 2009. We got our first combined release done with Ehcache backed by Terracotta three months later. That was really only the beginning. We have done 5 major releases to date with of course another one in the cooker right now.

The big news is that Ehcache and Terracotta together have been a great success. Most Terracotta users now use it as a distributed backing store for Ehcache. Most of Terracotta’s revenue comes from that use case. This means that Ehcache’s needs are driving the evolution of Terracotta.

At the same time, Ehcache as an open source project has seen a huge investment. Most of the features added are designed to work in Ehcache open source as well as with Terracotta as the backing distributed store.

So let’s look at what we added in 2010 and what is in the cooker so far for 2011.

What we did in 2010

New Hibernate Provider

When we merged Terracotta had just done a Hibernate provider and Ehcache had one too. Neither supported the new Hibernate 3.4 SPI which uses CacheRegionFactories instead of CachingProviders. So we combined the two into a new implementation and added support for the new SPI at the same time. This meant that for the first time Ehcache supported all of the cache strategies, including transactional, in Hibernate and importantly supported them across a caching cluster.  (http://ehcache.org/documentation/hibernate.html)

XA Transactions

Right now XA transactions have fallen from favour partly because of flaky support for them in XAResource implementations out there. But what if you could create a canonically correct implementation that could be absolutely relied on in the world’s most demanding transactional applications? We hired one of the Java world’s foremost experts in transactions (Ludovic Orban, the author of the Bitronix transaction manager) and came out with just that. We challenge anyone to prove that our implementation is not correct and does not fully deal with all failure scenarios. If you need to be absolutely sure that the cache is in sync with your other XAResources with a READ_COMMITTED isolation level you have come to the right place. (http://ehcache.org/documentation/jta.html)

Terabyte Scale Out

Ehcache backed by Terracotta initially held keys in memory in each application that had Ehcache in it. This effectively limited the size of the caching clusters that could be created. With a new storage strategy, we blew the lid of that and stopped storing the keys in memory. The result – horizontal scaling to the terabyte level.

Write-through, behind and every which way

What happens when you have off-loaded reads from your database but now your writes are killing you? The answer is write-behind. You write to the cache. It calls a CacheWriter which you implement and connect to the cache which is called periodically with batches of transactions. In your CacheWriter you open a transaction write the batch and then close the transaction. Much easier for the database. And all done in HA with the write-behind queue is safe because it is stored on the Terracotta cluster.

More caching in more places

We were really happy to extend our support for popular frameworks. During the year:

  • Ehcache became the default caching provider for Grails
  • We created an OpenJPA provider
  • We created Ruby Gems for JRuby and Rails 2 and 3 caching providers
  • We created a Google App Engine module

Acknowledgement of the CAP Theorem

Originally Ehcache with Terracotta was consistent above all else. During the year we flexed both ways to allow CAP tradeoffs to be made by our users. We added XA and manual locking modes on the even stricter side and we added an unlocked reads view of a cache and even coherent=false for scale out without coherence on the looser side. And you can choose this cache by cache. There is a tradeoff between latency and consistency, so you choose the highest speed you can afford for a particular cache.

And rather than just blocking on a network partition, we added NonStopCache, a cache decorator which allows you to choose whether to favour availability or consistency.

BigMemory

BigMemory was a big hit. It surprised a lot of people and frankly it surprised us. We were looking to solve our own GC issues in the Terracotta server and found something that was more generally useful than that one use case. So we added BigMemory to Ehcache standalone as well as the server. In the server we have lifted our field engineering recommendation from 20GB of storage per server partition to 100Gb. And we have tested BigMemory itself out to 350GB and it works great!

A new Disk Store

Let’s say you are using a 100GB in Ehcache standalone. When you restart your JVM you want the cache to be there otherwise it might take hours or days to repopulate such a large cache. So we created a new DiskStore that keeps up with BigMemory. It writes at the rate of 10MB/s. So when it is time to shutdown your JVM it just needs to do a final sync and then your are done. And it starts up straight away and gradually loads data into memory. A nice complement to BigMemory and very important.

Ehcache Monitor/Terracotta Dev Console Improvements

For those using Ehcache standalone we have only ever had a JMX API. That is fine but we found many people built their own web app to gather stats. So we did the same and the result was Ehcache Monitor. One of the highlights is the charts including a chart of estimated memory use per cache.

The Terracotta Developer Console got an Ehcache panel, and as we added features to Ehcache we added more to the panel. If you are using Ehcache with a backing Terracotta store then it is a full featured tool which gives you deep introspection.

What is Coming in 2011

Speed, speed and more speed

What does everybody want? More speed. We are splitting hairs in our concurrency model to enable as much speed as possible for each use case. We now have two and will be adding more modes to allow the best tuning for each use case.

Search

Ehcache is based on a Map API. Maps have keys and values. They have a peculiar property – you need to know the key to access the value. What if you want to search for a key, or you want to index values in numerous ways and search those. All of this is coming to Ehcache in February 2011 and is available right now in beta. Oh and one cool thing: search performance is O(log n/partitions). So as your data grows and spreads out onto more Terracotta server partitions, your search performance stays constant! (http://ehcache.org/documentation/search.html)

New Transaction Modes

We already did the hard one: XA. Now we are adding Local Transactions. If you just want transactionality within the caches in your CacheManager and there are no other XAResources, you can use a Local Transaction. It will be three times faster than an XA cache. (http://ehcache.org/documentation/jta.html)

.NET Client

Quite a few customers use Java but also some .NET. And they want to be able to share caches. We have lots of users happily using ehcache for cross-platform use cases, but are planning on extending our cross-platform support still further – for example with a native .NET client

Bigger BigMemory

We are looking at ongoing speedups and testing against larger and larger memory sizes for BigMemory. We are also looking to provide further speed in BigMemory by allowing pluggable Serialization strategies. This will allow our users to use their Serialization framework of choice – and there are now quite a few.