[logback-dev] [JIRA] Commented: (LBCORE-97) Starvation on AppenderBase.doAppend

Mon Jun 1 03:39:03 CEST 2009

On May 31, 2009, at 6:11 AM, Joern Huxhorn wrote:
>>
>>
>
> The sledgehammer synchronization using synchronized on the instance  
> itself is definitely a mistake as I've shown in the bug report.
> A first (and absolutely necessary step!!) is to replace it with a  
> fair ReentrantLock so Logback isn't responsible for starvation  
> situations anymore.

While that may be true of AppenderBase, the point I've been making is  
that since UnsynchronizedAppenderBase doesn't lock the issue is  
avoided completely (at least at this level).
>
>>
>> Thread 1 can call logger.info() and before the appender even gets a  
>> chance to lock Thread 2 can call logger.info(). The CPU might give  
>> conrol to Thread 2 and let it finish ahead of thread 1. Even if you  
>> have multiple CPUs, as long as you have more threads hitting the  
>> same logic than there are CPUs you will run into this issue. This  
>> is just a fact of life when dealing with multithreaded  
>> applications. If you look at a standard debug log with a fair  
>> amount of logging you will see this all the time. It isn't readily  
>> apparent because the granularity of the time is usually a  
>> millisecond, but occasionally you will see log records that appear  
>> to be out of order.
>>
>
> I'm talking about the following situation in case of JDK 1.6 on  
> multi-core systems:
> Results for usingSynchronized:
> runnables[0]: SynchronizedRunnable[counter=394, running=false]
> runnables[1]: SynchronizedRunnable[counter=1, running=false]
> runnables[2]: SynchronizedRunnable[counter=1, running=false]
> runnables[3]: SynchronizedRunnable[counter=1, running=false]
> runnables[4]: SynchronizedRunnable[counter=1, running=false]
> runnables[5]: SynchronizedRunnable[counter=1, running=false]
> runnables[6]: SynchronizedRunnable[counter=1, running=false]
> runnables[7]: SynchronizedRunnable[counter=1, running=false]
> runnables[8]: SynchronizedRunnable[counter=1, running=false]
> runnables[9]: SynchronizedRunnable[counter=605, running=false]
>
> This is the result of the app appended to http://jira.qos.ch/browse/LBCORE-97 
>  in that case.
> Please take a look at the code. All 10 threads are executing the  
> same Runnable:
> public static class SynchronizedRunnable
> 	implements Runnable
> {
> 	private final Object lockObject;
> 	private int counter;
> 	private boolean running;
> 	
> 	public SynchronizedRunnable(Object lockObject)
> 	{
> 		this.lockObject=lockObject;
> 		this.counter=0;
> 		this.running=false;
> 	}
> 	
> 	public void run()
> 	{
> 		running=true;
> 		for(;;)
> 		{
> 			synchronized(lockObject)
> 			{
> 				counter++;
> 				try
> 				{
> 					Thread.sleep(10);
> 				}
> 				catch(InterruptedException ex)
> 				{
> 					break;
> 				}
> 			}
> 		}
> 		running=false;
> 	}
> 	
> 	public String toString()
> 	{
> 		return "SynchronizedRunnable[counter="+counter+", running="+running 
> +"]";
> 	}
> }
>
> I'm simulating the AppenderBase situation here, lots of threads are  
> trying to access the very same resource.
> With Java 1.6 on multi-core systems, only <number of cores> threads  
> will ever get the chance to obtain the object monitor. All other  
> threads are starving.
> And this is a valid simplification because it shows exactly what's  
> happening in our webapp right now. We had to disable logging to get  
> it back into a usable state.
>
> Using a fair ReentrantLock fixes this. It makes sure that threads  
> are handled fairly concerning the shared resource, i.e. they are  
> served in a "first come, first served" order.

The issue here isn't really that there is a problem with how  
synchronized locks work. If there were Tomcat would fall on its face.  
Take a look at https://svn.apache.org/repos/asf/tomcat/trunk/java/org/apache/catalina/loader/WebappClassLoader.java 
. I guarantee you this is being called far more often then logging is  
and it doesn't exhibit these problems.

The real culprit here is how much time is being spent inside the  
synchronized block. When you look at the WebappClassLoader you will  
see that locks are held for very short periods of time. This is  
contrasted with Appenders that hold the lock for an extremely long  
period of time.

To prove this I simply added

for (int i=0; i < 100000; ++i) { if lock == null { throw new  
NullPointerException(); } }

before the synchronized or locked block (using lockObject instead of  
lock for the synchronzied block).  While the counters don't come out  
equal they are very close. This shows that the problem is really how  
much time is spent in the synchronized block vs the actual application.

The significance of this is that even when you "fix" the locking so it  
is fair your application performance is still going to suck because it  
is spending so much time in the Appender vs doing actual work.
>
>
> I assume you hadn't understood the point of LBCORE-97 until now...  
> and please don't take that as an insult ;)

I the understanding of it. I just don't think it is the right way to  
fix the problem.
>
> I wrote code under the assumption that synchronized would be at  
> least sufficiently semi-fair, Ceki did in both LOG4J and Logback,  
> and reality up until Java 1.6 on multi-core seemed to indicate that  
> it's ok to assume that total fairness isn't necessary... but all  
> this changed because synchronized is absolutely unfair now - and the  
> documentation shows that the fault is on our side. We were simply  
> working with wrong assumptions.
>
> This just proves my point that locking should be handled by the  
> logging system in a correct, i.e. fair, way. It is "downright  
> dangerous" to assume that every person that implements an appender  
> is aware of this situation and the problem will only show up on Java  
> 1.6 multi-core systems. This is a *very* big risk.
>
> But I agree 100% that the transformation/evaluation of the logging  
> event can and should be handled outside of whatever fair locking  
> mechanism is used. This would solve some issues and seems very  
> logical. We should definitely aim for this in the future... Ideally,  
> AppenderBase would be changed to support this, IMHO.
>
> It would probably be a better idea to use a
>
> protected final ReentrantReadWriteLock readWriteLock=new  
> ReentrantReadWriteLock(true);
>
> fair ReentrantReadWriteLock in AppenderBase instead.
> doAppend would then use the write lock while actual implementations  
> would have the freedom of using either read or write lock, as  
> needed. This would remove the need of synchronizing the worker queue  
> you mentioned earlier. I'm not sure if we really need a  
> ReadWriteLock, though, as removal from the queue would be a write,  
> too. But *some* appender implementation could have a use case for  
> read-only-access...
>
> One way to handle transformation of the event outside the lock is  
> what I suggested in http://jira.qos.ch/browse/LBCLASSIC-45.
> Just dumb down LoggingEvent enough so that it already is detached  
> from the rest of the system after it's created. But someone, and I  
> honestly can't remember but I think it was you, said that it would  
> be bad if the MDC was already transformed into Map<String,String> at  
> that point...
>
> I've just changed my patch to make lock protected instead of  
> private. Seems to be a much better idea :p
>
> And just to get my point straight once more:
> A logging system MUST NOT be a point of starvation! It will  
> naturally always be a chokepoint but it MUST NEVER be a point of  
> starvation!
>
On this we agree. If switching to a ReentrantReadWriteLock fixes the  
problem for AppenderBase then I certainly don't have a problem with  
that. Personally, I see it as sort of a band-aid though. I'd like to  
go through and deal with all the code that isn't thread-safe and move  
the locking down to just those areas where it can't be avoided. Every  
reference you can find on multithreaded programming will tell you to  
hold locks for as short a duration as possible. Logback and Log4j  
don't do that which opens them up to these kinds of issues.

Again, even if you make it "fair" the amount of time the lock is being  
held means AppenderBase will almost certainly still be the biggest  
bottleneck in your system when logging is enabled.

Ralph