Java Synchronization (Mutual Exclusion) Benchmark

I’ve created another benchmark. This time, I’ve benchmarked the different ways of synchronizing a little code using mutual exclusion on this code.

The code to protect will be very simple. It’s a simple counter :

//Init
int counter = 0; 

//Critical section
counter++;

The critical section, if not protected with synchronization system, will not function properly due to possible interleavings (read the article on synchronization if you don’t know what is interleaving).

I’ve used 3 different synchronizers to synchronize this increment :

  1. synchronized block
  2. Semaphores (fair and unfair)
  3. Explicit locks (fair and unfair)

I’ve also used a third way to solve the problem with AtomicInteger. This is not the same as the other ways because it does not provide mutual exclusion but this is a good way to synchronize simple values, like integers or boolean, but also references. The atomicity of the operations of the AtomicInteger is made using the compare-and-swap operation of the operating system. So there is no waiting operations. So there is less context switches and result in more performing code normally.

Here is the code of these 4 ways to solve the problems :

private class SynchronizedRunnable implements Runnable {
    private int counter = 0;

    @Override
    public synchronized void run() {
        counter++;
    }
}

private class ReentrantLockRunnable implements Runnable {
    private int counter = 0;

    private Lock lock;

    private ReentrantLockRunnable(boolean fair) {
        super();

        lock = new ReentrantLock(fair);
    }

    @Override
    public void run() {
        lock.lock();

        try {
            counter++;
        } finally {
            lock.unlock();
        }
    }
}

private class SemaphoreRunnable implements Runnable {
    private int counter = 0;

    private final Semaphore semaphore;

    private SemaphoreRunnable(boolean fair) {
        super();

        semaphore = new Semaphore(1, fair);
    }

    @Override
    public void run() {
        try {
            semaphore.acquire();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }

        try {
            counter++;
        } finally {
            semaphore.release();
        }
    }
}

private class AtomicIntegerRunnable implements Runnable {
    private AtomicInteger counter = new AtomicInteger(0);

    @Override
    public void run() {
        counter.incrementAndGet();
    }
}

I used Runnable to facilitate the testing and timing of the different mechanisms.

The test is made in two phases :

  1. Test with only one thread with a sophisticated benchmark framework. This act also as warmup for the different code.
  2. Test with several threads (several test with increasing number of threads). The test is made using a little code I wrote for the occasion. Each method is executed 2²³ times (8388608 times exactly).

The source code is available at the end of the post.

The test has been launched on a Ubuntu 10.04 with a Java 6 virtual machine. The computer has a 64 bit Core 2 Duo 3.16 Ghz processor and 6Go of DDR2.

So let’s see the results. First with one thread :

Synchronization Benchmark - One Thread

Synchronization Benchmark - One Thread

The first thing we see is that the AtomicInteger is the fastest version. This is because AtomicInteger do not use waiting operation, so this result in less context switches and more performances. But this is not exactly the case of the benchmark, so let’s concentrate on the 5 others methods. We see that the synchronized method is the fastest and that fair methods are a little slower than unfair, but not a lot.

Now, we’ll test the scalability of all these methods using several threads.

Synchronization - 2 threads

Synchronization - 2 threads

This method we can see that the fair methods are awfully slow compared to the the unfair versions. Indeed adding fairness to a synchronizer is really heavy. When fair, the threads acquire the locks in the order they ask for. With nonfair locks, barging is allowed. So when a thread try to acquire the lock and its available, it can acquire it even if there is threads waiing for the lock. It’s heavier to provide fairness because there is a lot more context switches. The problem was not here with only one thread because it’s always fair.

The results for the other versions are the same as with one thread.

Let’s add two more threads :

Synchronization - 4 threads

Synchronization - 4 threads

The fair versions are more and more slows when we add threads. The scalability of these methods is really bad. Let’s see the graph without the fair versions :

Synchronization - 4 threads

Synchronization - 4 threads

This time we can see some differences. The synchronized method is the slower this time and semaphore has a little advantage. Let’s see with 8 threads :

Synchronization - 8 threads

Synchronization - 8 threads

Here the synchronized method is really slower than the other methods. It appears that the algorithm of the synchronized block is less scalable than the explicit locks and semaphore versions. Let’s watch what happens with other number of threads :

Synchronization - 32 threads

Synchronization - 32 threads

Synchronization - 128 threads

Synchronization - 128 threads

I’ve also made the test with other number of threads (16, 64 and 256), but the results are the same as the other.

We can made several conclusions based on the results :

  1. Fair versions are slow. If you don’t absolutely need fairness, don’t use fair locks or semaphores
  2. Semaphores and explicit locks have the same performances. This is because the 2 classes (Semaphore and ReentrantLock) are based on the same class AbstractQueueSynchronizer that is used by almost all synchronization mechanisms of Java
  3. Explicit locks and semaphores are more scalable than synchronized blocks. But that depend on the virtual machine, I’ve seen other results indicating that the difference is a lot smaller
  4. The AtomicInteger is the most performing method. This class doesn’t provide mutual exclusion, but provide thread safe methods to works on simple values (there is version for Long, Double, Boolean and even Reference)

So that’s all for this benchmark. I hope you found it interesting.

The sources of the benchmark : Synchronization Benchmark Sources

 

  • Alexey

    Hi,

    Fair locks are not much slower. Let me explain.

    Let’s for simplicity assume you have just two cores.

    - First thread enters critical section.
    - Other threads blocked and OS switch second core context to some other task.
    - First thread leaves critical section.
    - But it takes a good chuck of time before other thread will be resumed by OS, while other threads are waking from sleep, first thread can do several loop (entering/leaving) critical section without blocking.

    Non blocking entering critical section is just couple of atomic CPU operation with memory, so your measurement are very close to atomic int. Your tests are not measuring really thread manipulation overhead.

    In case of fair lock situation is different
    - First thread enters CS
    - Other thread is suspended
    - First thread leaves CS
    - Semaphore waits for second thread to wake up, and first thread is blocked and put to queue.

    So in case of fair lock you always has pessimistic scenario then you are have to block and resume thread using calls to OS kernel.

    Thank you for article.
    Regards

    • Baptiste Wicht

      Hi,

      Thanks for the reply. But I don’t understand in what your example shows that the fairness is not that slower. For a little number of threads, my results shows that the performances are not as bad but fairness has not good scalability, isn’t it ?

      Your tests are not measuring really thread manipulation overhead.

      I didn’t understand that point too, sorry…

      If you have an idea on how to improve that test, I can make the test to verify the results, no problem.

      Regards

      • Alexey

        Locks implementations are usually implemented following way:
        1. Use some atomic memory access to acquire lock right
        2. If failed, then add thread to queue, and call OS to suspend thread execution

        Step 2 is order of magnitude more expensive (kernel call + thread context switch)

        In your tests fair lock always do both step in each loop, while non-fair do only 1 one in majority of loop cycles.

        In other word for unfair lock your benchmark measure almost only step 1 time, while for fair locks step 1 + step 2.

        In other words your test case is biased to unfair lock. In real life you will not see so much difference. Fair locks will still have disadvantage (cause they are more strict).

        • Baptiste Wicht

          Ok, this time I understand what you want to say. I completely agree, it’s not the implementation of the fair lock who are slower. It’s almost the same performances at this level. This is only that fair locks makes a lot more context switches that the unfair locks. So like belpk say, this is not the internals mechanism who’s slow, this is the use of this mechanism.

          And yes, in my test, i can agree that the fair locks are faster because of the test and the lot of iterations made by each test.

  • Alexey

    Hi,

    Fair locks are not much slower. Let me explain.

    Let’s for simplicity assume you have just two cores.

    - First thread enters critical section.
    - Other threads blocked and OS switch second core context to some other task.
    - First thread leaves critical section.
    - But it takes a good chuck of time before other thread will be resumed by OS, while other threads are waking from sleep, first thread can do several loop (entering/leaving) critical section without blocking.

    Non blocking entering critical section is just couple of atomic CPU operation with memory, so your measurement are very close to atomic int. Your tests are not measuring really thread manipulation overhead.

    In case of fair lock situation is different
    - First thread enters CS
    - Other thread is suspended
    - First thread leaves CS
    - Semaphore waits for second thread to wake up, and first thread is blocked and put to queue.

    So in case of fair lock you always has pessimistic scenario then you are have to block and resume thread using calls to OS kernel.

    Thank you for article.
    Regards

    • Baptiste Wicht

      Hi,

      Thanks for the reply. But I don’t understand in what your example shows that the fairness is not that slower. For a little number of threads, my results shows that the performances are not as bad but fairness has not good scalability, isn’t it ?

      Your tests are not measuring really thread manipulation overhead.

      I didn’t understand that point too, sorry…

      If you have an idea on how to improve that test, I can make the test to verify the results, no problem.

      Regards

      • Alexey

        Locks implementations are usually implemented following way:
        1. Use some atomic memory access to acquire lock right
        2. If failed, then add thread to queue, and call OS to suspend thread execution

        Step 2 is order of magnitude more expensive (kernel call + thread context switch)

        In your tests fair lock always do both step in each loop, while non-fair do only 1 one in majority of loop cycles.

        In other word for unfair lock your benchmark measure almost only step 1 time, while for fair locks step 1 + step 2.

        In other words your test case is biased to unfair lock. In real life you will not see so much difference. Fair locks will still have disadvantage (cause they are more strict).

        • Baptiste Wicht

          Ok, this time I understand what you want to say. I completely agree, it’s not the implementation of the fair lock who are slower. It’s almost the same performances at this level. This is only that fair locks makes a lot more context switches that the unfair locks. So like belpk say, this is not the internals mechanism who’s slow, this is the use of this mechanism.

          And yes, in my test, i can agree that the fair locks are faster because of the test and the lot of iterations made by each test.

  • belpk

    Alexey, there’s nothing wrong with his tests nor with this article. The article doesn’t really say it wants to measure the actual internals of the synchronization mechanisms, or the actual overhead causes by only the concurrency API. It doesn’t seem it really wants to measure the internals, but rather the *use* of them — which is more relevant to the end-user (i.e. the developer). And there the article is absolutely right: the *use* of fair locks will be slower in most trivial scenarios; it doesn’t matter what the performance of the underlying mechanisms is, compared to unfair locks: that’s not the point, and the reason is found in the article as well.

    • Alexey

      Sure, article it very helpful.
      I just want to explain why result looks this way. Test and results are valid, but it is also important to interpret them in right way.

      Many thanks to author for this article.

  • belpk

    Alexey, there’s nothing wrong with his tests nor with this article. The article doesn’t really say it wants to measure the actual internals of the synchronization mechanisms, or the actual overhead causes by only the concurrency API. It doesn’t seem it really wants to measure the internals, but rather the *use* of them — which is more relevant to the end-user (i.e. the developer). And there the article is absolutely right: the *use* of fair locks will be slower in most trivial scenarios; it doesn’t matter what the performance of the underlying mechanisms is, compared to unfair locks: that’s not the point, and the reason is found in the article as well.

    • Alexey

      Sure, article it very helpful.
      I just want to explain why result looks this way. Test and results are valid, but it is also important to interpret them in right way.

      Many thanks to author for this article.

  • Hans-Peter

    Just as a side note: if your diagrams used a logarithmic scale, you would not have to exclude the fair versions. Also it would have been nice to include a unsynchronized version to see how much of the time is spent on the synchronization and how much was spent on the method calling etc.