I sat down at my newish PowerMac this morning to do some hacking. I bought it in January 2006. It has a 149GB hard drive which is now full. My wife complained about iMovie HD being a little slow, which should have been a warning sign. Anyway time to dust off my old script to find the large files on the computer. I initially forgot I had such a script. I googled for something to grab and did not find much. So I will put this on my blog in the hope the next time I am looking for it it will be easier to find.

I am thinking about writing a Dashboard widget to do this. I wrote my first widget last night. It embeds a DOJO widget. Very nice and easy, with a few gotchas not covered in the documentation. I will blog about that another time. Why write a widget to execute a simple script? I am realising that there is a growing group of Mac power users who have not discovered the joy of Unix, and are probably unlikely to.

find . -type f -size +500000c -exec ls -ldh {} \; 2>/dev/null

The script above finds all files larger than 500,000 bytes, starting from the directory you are in and working down. Change the number up and down as desired. You will need to be root to find all files on the system. On Mac OS X it is a good idea to start at /Users, which is where most of the stuff you might want to delete will be.
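For anyone who would rather not use the shell, the same search can be sketched in Python. This is only a rough equivalent of the find command above, not a replacement for it: the function name find_large_files and the default threshold are my own choices for illustration, and unreadable directories are silently skipped, much as 2>/dev/null hides the errors.

```python
import os


def find_large_files(root, min_bytes=500_000):
    """Yield (size, path) for regular files under root larger than min_bytes."""
    # onerror swallows permission errors, similar to 2>/dev/null in the shell version.
    for dirpath, _dirnames, filenames in os.walk(root, onerror=lambda err: None):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # Skip symlinks and anything that is not a regular file (like -type f).
                if os.path.islink(path) or not os.path.isfile(path):
                    continue
                size = os.path.getsize(path)
            except OSError:
                # The file vanished or cannot be read; move on.
                continue
            if size > min_bytes:
                yield size, path
```

Something like sorted(find_large_files("/Users"), reverse=True) would then list the biggest files first.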

And what, you might be wondering, did I find? We bought a Sony HDR-HC3 High Definition wide screen camcorder a few months back and have been shooting hi-def movies ever since, along with 4MB photos.

I am writing this on Writely which has impressed me with its WYSIWYG editing in JavaScript.

I have been doing a lot of architecture work lately using Confluence and discovered they have a Rich Text Editor mode.
It turns out that it, like Writely, does not support Safari as yet. I am using it in Firefox. But Confluence is using tinymce
(http://tinymce.moxiecode.com/) and it is cool.

One of the best things is that you can search within the page you are editing. Browsers do not support finding within form fields,
which is what you are doing in the wiki editing mode. The ability to do this is huge. Writely also supports this. I guess it works
because the browser searches the DOM, and these tools write to the DOM as you type.

We have a Python based monitoring application that regularly tops the CPU list, using more CPU than the application it is monitoring.

I am a bit of a newcomer to Python. It is in stable 8th position on the Tiobe programming index (http://www.tiobe.com/tpci.htm). It is mature, has tons of libraries available for it and is suitable for a broad range of tasks.

Python is Sloooooooow

There is something wrong with the monitoring app being hungrier than the app it is monitoring.

So, how fast is Python? Benchmarks always seem to lead to lots of fights. My raw data, and my subsequent conclusion, are based on:

  1. A loop test I wrote. Java was 200 times faster. I had some Python guys look it over. It is valid, but the result is most likely so bad because ints are primitives in Java but fully fledged objects in Python.
  2. Computer Language Shootout: http://shootout.alioth.debian.org/debian/benchmark.php?test=all&lang=java&lang2=python

    The graphical results for a series of different benchmarks are shown below. Java is up to 150 times faster. The average is around 10 times faster.

  3. http://www.timestretch.com/FractalBenchmark.html This benchmark is 1.25 seconds Java, 15 seconds Python. Java is about 10 times faster.
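The loop test from item 1 is not reproduced here, so the following is only a guess at the general shape of such a test, not the code I actually ran. The name loop_test and the iteration count are made up for illustration. Every pass through the loop works on full integer objects in Python, where a Java version would use a primitive int.

```python
import time


def loop_test(iterations=1_000_000):
    """Sum the integers below `iterations` in a plain loop and time it."""
    start = time.perf_counter()
    total = 0
    for i in range(iterations):
        # Each addition produces a new int object rather than updating a primitive.
        total += i
    elapsed = time.perf_counter() - start
    return total, elapsed
```

Timing the same loop written in Java gives the kind of comparison described in item 1, though the exact ratio will depend on the machine and the interpreter version.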

My conclusion is that Python is about 10 times slower than Java. (Though not of interest right now, I also took a look at Ruby. It seems to be about 15 times slower). Does this matter? Right now we have a problem and it does matter.

Fixing It

So, what to do? The Python books I have suggest using C libraries for the bits that are slow. So, how to tell what is slow? Fortunately Python comes with a very simple and easy to use profiler. To add profiling to your app simply:

import profile
profile.run('main()') #or whatever your entry point is called

We did this for our monitoring app and found, sure enough, that the largest performance antipattern of all time had reared its head yet again: XML. In our case it was a python lib called tramp xml. It works recursively and seems to hit all the bad points of Python performance. Fortunately we can change our file format to non XML and avoid the issue.
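To see which calls dominate, it helps to sort the profiler output. The sketch below is one way to do that with the standard library's cProfile and pstats modules; it is not what our monitoring app does. The main function here is a hypothetical stand-in for a real entry point, and profile_entry_point is a name invented for this example.

```python
import cProfile
import io
import pstats


def main():
    # Hypothetical stand-in for the application's real entry point.
    return sum(i * i for i in range(100_000))


def profile_entry_point(func, top=10):
    """Run func under the profiler and return a report of the costliest calls."""
    profiler = cProfile.Profile()
    profiler.runcall(func)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so expensive call trees float to the top.
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()
```

Printing the result of profile_entry_point(main) shows the top entries by cumulative time, which is where a recursive XML parser would be expected to show up.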

What about psyco?

We also considered accelerating Python with psyco (http://psyco.sourceforge.net/).

We are running 64-bit Linux on AMD64. So the requirement for “A 32-bit Pentium or any other Intel 386 compatible processor. Sorry, no other processor is supported. Psyco does not support the 64-bit x86 architecture, unless you have a Python compiled in 32-bit compatibility mode.” sort of kills it for us.

Secondly, psyco is being deprecated in favour of PyPy (http://codespeak.net/pypy/dist/pypy/doc/news.html). PyPy is not yet ready for primetime.

What about a C Lib?

The usual solution to Python performance problems is to use a C library to speed up whatever is your chokepoint in Python. I think that is a valid approach, and one we would have used had we not been able to simply remove the XML.

What about Jython?

I have been playing with Jython lately. Unfortunately Jython does not use the JIT, so Java performance sucks.

Conclusions

  1. Python is slooooow. 10 times slower than Java, which itself is about two times slower than C.
  2. If you are not a C shop, the usual solution of porting the slow bits to C will be a bit too hard
  3. As more production apps migrate to 64-bit AMD64 and EM64T, psyco falls away as a solution. (For non-386 processors it has never been an option.)
  4. Carefully consider the performance requirements of your application before you select Python as the implementation language.

Can Google be used to measure the popularity of something? Maybe. That is the approach of the famous TIOBE Programming Community Index. ( See http://www.tiobe.com/tpci.htm).

As the maintainer of an open source project, it is something I look at. I am happy to see that ehcache now returns more results than any other Java cache, with 348,000 results.

So, is it the most popular? Who knows. But downloads and Google suggest it is.

(I have been very quiet lately. This is my first post in 7 weeks. Blame it all on a new job. I do have some pent up blogs though…)

I have just become a new member of JSR-107, the caching API. This one has been around for a long time and has gotten bogged down. I am shortly going to start an ehcache implementation of JSR-107. I have already done a first pass over ehcache and added extra features and done some restructuring to make that job a lot easier.

The key benefit I can see in JSR-107 is that, if it becomes the standard, like JDBC, then you can write to it and have an almost zero cost of switching to a different caching provider. At present, users of Hibernate and Spring have much the same benefit, based on the cache plugin approach taken by those tools. But there are many caching users who use cache APIs directly. They will get the same benefit.

Hopefully this will also encourage cache implementations to provide a lowest common denominator of functionality.

Since ehcache-1.2 was released a few days ago, there have been plenty of people taking a peek at it.

It is sort of fun to look at the web traffic and downloads of ehcache since the project's inception in 2003. The first jump happened in December 2004 with the release of ehcache-1.1. There was steady growth after that and then another jump with the release of 1.2. To date there have been 31,796 direct downloads from the SourceForge site.

The thing that is a little harder to measure is how many people’s machines it is sitting on. Ehcache is included with the very popular Hibernate and Spring frameworks, along with lots of other things. That ease of redistribution is a killer advantage of open source.

Web Traffic

Downloads

After 10 months of development, ehcache-1.2 has been released.

Thanks to all the developers who contributed to the release through feature requests, bug reports and patches during the beta program.

The 1.2 release of ehcache has many new features including:

  • Flexible, extensible, high performance distributed caching. The default implementation supports cache discovery via multicast or manual configuration. Updates are delivered either asynchronously or synchronously via custom RMI connections. Additional discovery or delivery schemes can be plugged in by third parties.
  • New FIFO and LFU caching policies in addition to the standard LRU.
  • Introduced CacheManagerEventListener and CacheEventListener interfaces and default implementations.
  • Multiple CacheManagers per virtual machine.
  • Programmatic flushing of application state to persistent caches
  • Significant (up to 7 fold) DiskStore performance increases.
  • API for Objects in addition to Serializable. Non-serializable Objects can use all parts of ehcache except for DiskStore and replication. Two new methods on Element: getObjectValue and getKeyValue are the only API differences between the Serializable and Object APIs.
  • Backward Compatibility with ehcache-1.1. All users of ehcache-1.1 should be able to upgrade to ehcache-1.2.
  • Tested with Hibernate2.1.8 and Hibernate3.1.3, which can utilise all of the new features except for Object API and multiple session factories each using a different ehcache CacheManager. A new net.sf.ehcache.hibernate.EhCacheProvider makes those additional features available to Hibernate-3.1.3. A version of the new provider should make it into the Hibernate3.2 release.
  • Tested with ehcache-constructs.
  • Apache 2.0 license

Enjoy!

Well another weekend spent on ehcache.

Added a long requested feature to accept keys and values that do not implement Serializable. This makes ehcache suitable for a lot more purposes. I was always afraid of the subtleties involved with NonSerializable Elements getting into the system and then not being able to be persisted to DiskStores or replicated. But it is just a matter of gracefully degrading and logging warnings. It was quite simple really.

Also fixed some minor bugs to do with rare edge conditions. This makes ehcache more robust.

There is only one old bug to fix. No patches to process. And pretty much all the feature requests that are going to get into this release are in.

So what is stopping me from releasing 1.2? I am getting desperate for reasons not to release it. The problem with open source projects is that either no one uses them and you can change them as often as you like, or they are widely used like ehcache and you become ultra cautious.

I probably want to do the following before I release:

- fix that last bug
- I promised the Hibernate guys a new ehcache plugin.
- more torture tests for replication.

Perhaps next weekend…

For a couple of weeks now I have been an ex-ThoughtWorker rather than a ThoughtWorker. It turns out there are a few of us, and Adewale Oshineye has thoughtfully created an alumni aggregator at blogs.thoughtworks.com/alumni. Being a consultant can require a big travel commitment, so for a lot of people consulting has a limited lifespan.

Now that I am no longer aggregated on the main page of blogs.thoughtworks, I feel I can make the occasional wry comment. Also, TW Australia is headquartered in Melbourne, Victoria, upon which I am about to play a prank.

I have been doing a lot of driving lately down to my 40 ha (100 acre) farm. I notice a lot of Victorians on the roads here in Queensland, a good 1700km from home. As I drive I cannot help but notice the wording on the number plates. Some say “On the move” and some say “The place to be”. In fact the plate has changed from “On the move” to “The place to be”. Let me repeat that: on the move to the place to be! How unfortunate when you see those number plates outside of Victoria.

I am moving my Blog+Wiki to Movable Type. SnipSnap has served me well. It is Java based, which suits me fine, because I am a Java guy. I run several Linux boxes from my home, which are permanently Internet connected.

However I am planning to move overseas for a year with my family. We will be leasing our house, so I must find a new home for either my servers or my sites. I was quoted a cost of co-location of $200 per month per server, which for my two servers would be $4,800 for the year. I can move the sites to a shared Linux environment for $150 per year.

Doing so requires me to move my commercial software site http://simonsayssoftware.com.au and my non-commercial blog to LAMP technologies. So that’s what I am doing.

Hosting companies require those wishing to run Java to either use a dedicated server or co-locate. Why is that? In two words, the Virtual Machine. Java is plenty fast enough compared with PHP or Perl. The problem is the memory the VM consumes. On a shared environment, you need to use small amounts of resources, and then hand them back when you are done. The Java VM grows and rarely shrinks. I have never seen one shrink by more than a few percent. This is customisable but still very limited.

So, will Java always be a non-starter for the shared environments of hosting companies? There are two reasons for optimism.

Firstly, JDK5 introduces Class Data Sharing to Sun-based VMs. See http://java.sun.com/j2se/1.5.0/docs/guide/vm/class-data-sharing.html. Apple have been doing this for a while on OS X. See http://developer.apple.com/documentation/Java/Conceptual/Java141Development/index.html. Sun claims a saving per VM of at least 6MB.

Secondly, there is the Gnu gcj project, which provides Java compiled to native code. It does not use a Virtual Machine. See http://gcc.gnu.org/java/. This project is rapidly maturing. On Fedora Core 2, a whole stack of Java applications compiled with gcj are available, including ant and Tomcat. RedHat are the ones pushing hard for this. I remove these at the moment, because they are the wrong versions for my work and system-wide stuff gets confusing. Having had a read of some user experiences, it seems that it is still a bit raw. If it matures into a first-class Java implementation, then I think the way will be open to use Java on shared hosting environments.

The problem with Virtual Machines is that they generally do not hand back to the OS memory freed from the VM Heap after garbage collection. It depends on the application, but for an application using 100MB of RAM, free memory will be significant, and at times in the order of tens of megabytes. Gcj promises to solve this problem.