[logback-user] Logback becomes excruciatingly slow during rollover

Vesuvius Prime vesuvius_prime at hotmail.com
Tue Apr 26 19:13:58 UTC 2016


An update:


The customer disabled Splunk and is still able to reproduce the problem. They also report that Process Explorer shows no other process holding a handle to the log files.


________________________________
From: Vesuvius Prime <vesuvius_prime at hotmail.com>
Sent: Sunday, April 24, 2016 6:57 PM
To: logback-user at qos.ch
Subject: Logback becomes excruciatingly slow during rollover


Hi all,


We use the Virgo server, which comes with Logback 1.0.7.v20121108-1250. One of the users of our product started complaining that after about a week of use it became so slow that it was basically unusable. We had a live session with the client, and it turned out that the problem was in Logback. The client runs Windows; so far the problem has not been reported on Linux. Here's what I did to determine that Logback was the cause of the problem:


1. The appender is defined as:


   <appender name="LOG_FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
      <file>D:/logs/product.log</file>
      <rollingPolicy class="ch.qos.logback.core.rolling.FixedWindowRollingPolicy">
         <FileNamePattern>D:/logs/product_%i.log.zip</FileNamePattern>
         <MinIndex>0</MinIndex>
         <MaxIndex>9</MaxIndex>
      </rollingPolicy>

      <triggeringPolicy class="ch.qos.logback.core.rolling.SizeBasedTriggeringPolicy">
         <MaxFileSize>50MB</MaxFileSize>
      </triggeringPolicy>
      <encoder class="ch.qos.logback.classic.encoder.PatternLayoutEncoder">
         <Pattern>[%d{yyyy-MM-dd'T'HH:mm:ss.SSSXXX}] [%-5level] %-28.28thread %X{sessionGuid} %-64.64logger{64} %X{medic.eventCode} %msg %ex%n</Pattern>
      </encoder>
   </appender>


In the appender configuration above, I modified the directories and log names a little bit, so that the actual product name would not be visible.


2. All the rolled-over zip files were generated and present, and the main log file, product.log, had exceeded its maximum size of 50 MB and had reached 64 MB. We had something like:


product.log           <------- 64 MB
product_0.log.zip
product_1.log.zip
product_2.log.zip
product_3.log.zip
product_4.log.zip
product_5.log.zip
product_6.log.zip
product_7.log.zip
product_8.log.zip
product_9.log.zip


3. This was suspicious, so I examined the log file and saw many messages that were logged out of order, e.g.


09:55:43.676
09:55:42.836
09:55:46.165
09:55:44.583
...


Furthermore, even the simplest operations, e.g. simply sending a redirect response back to the browser, were taking several seconds. (However, before sending the redirect we were logging the URL we were redirecting to, so I suspected that the logger itself was extremely slow.)


4. I moved the archived log files to a different directory and they were recreated *immediately*. I did this 7-8 times, each time with exactly the same result -- the .zip files were recreated immediately.


5. The main log, product.log, was locked by the JVM, so I couldn't move or delete it. However, I opened it in a text editor, selected the whole content (Ctrl+A), deleted it (Del) and saved the file. As soon as I did that, the performance problems disappeared. The product began working normally, and the log file started growing at a normal speed again; previously it had been growing extremely slowly.
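
To make the symptom concrete, here is a tiny toy model of the append path -- my reading of the stack traces further below, *not* the actual Logback source, and all names are made up. The point is only the shape: the size check and the zip both run on the calling thread, inside the same lock that every other logging thread needs. If rollover never manages to shrink product.log back below 50 MB, every single log statement pays the cost of compressing a 64 MB file, which would explain both why the archives reappear immediately and why emptying the file by hand makes the problem vanish.

    // Toy model (NOT Logback source; class and method names are invented).
    import java.io.File;

    public class AppendPathModel {
        private static final long MAX_FILE_SIZE = 50L * 1024 * 1024;   // 50MB, as in our config
        private final File activeFile = new File("D:/logs/product.log");
        private final Object appenderLock = new Object();

        void append(String message) {
            synchronized (appenderLock) {          // every logging thread funnels through here
                if (activeFile.length() >= MAX_FILE_SIZE) {
                    rollover();                    // shift product_8 -> product_9, ..., then zip the
                }                                  // oversized product.log into product_0.log.zip -- synchronously
                write(message);
            }
        }

        private void rollover() { /* rename the archive chain and compress the active file */ }
        private void write(String message) { /* encode the event and append it to product.log */ }
    }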



So, we told the users to apply the following workaround whenever the preconditions from step 2 above were reached and the product became very slow:


(1) Open the main log file, select all, delete, save. (A scripted equivalent is sketched below.)

(2) Delete all the archived .zip files.
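
For step (1), a hypothetical scripted equivalent (we have not actually automated this; it is simply the programmatic version of "select all, delete, save") would be:

    // Hypothetical helper: truncate product.log in place without deleting it,
    // so the JVM's existing file handle stays valid.
    import java.io.FileOutputStream;
    import java.io.IOException;

    public class TruncateLog {
        public static void main(String[] args) throws IOException {
            // Opening the file in non-append mode truncates it to zero bytes.
            new FileOutputStream("D:/logs/product.log", false).close();
        }
    }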


This workaround has been working fine so far. However, we still don't know why Logback behaves like this, which is why I'm writing this email. Can someone who knows the Logback source code figure out what might be happening and what triggers it?


Here are some additional details that we've been able to figure out:


1. When the users were experiencing the problem during the live session, I opened Process Explorer and saw that Splunk was holding a lock on the main log file -- product.log. I asked the users to kill the Splunk service and they did, but that didn't solve the problem. However, depending on how Logback is written, it may turn out that it did indeed go crazy because of Splunk, but that once it has gone crazy, killing the other process that holds a lock on the file no longer helps.
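
In case it helps anyone test the file-lock theory, below is a small stand-alone demo (my own test program, nothing from our product) of the Windows behaviour I suspect: a file cannot be renamed while any handle is open on it, which is exactly the situation an indexer like Splunk creates while Logback is trying to roll files over. On Linux the same rename normally succeeds, which may be why only our Windows users see the problem.

    // Stand-alone demo: renaming a file on Windows fails while a handle is open on it.
    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;

    public class WindowsRenameDemo {
        public static void main(String[] args) throws IOException {
            File held = File.createTempFile("held", ".log");
            File target = new File(held.getParentFile(), held.getName() + ".renamed");

            try (FileWriter writer = new FileWriter(held)) {   // this open handle plays the role of Splunk
                writer.write("some data");
                writer.flush();
                // Prints false on Windows: the open handle blocks the rename.
                System.out.println("rename while held: " + held.renameTo(target));
            }
            // On Windows this now prints true: once the handle is closed, the rename goes through.
            System.out.println("rename after close: " + held.renameTo(target));
            target.delete();
        }
    }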


2. We managed to reproduce the same behavior in our company by letting two different appenders write to the same log file. As soon as the preconditions from step 2 above were reached, everything became extremely slow, the logs couldn't roll over and the main log started growing beyond its configured limit of 50 MB. However, the users do *not* have a configuration in which two appenders write to the same file. I'm only describing how we managed to reproduce the slowness in case it helps you figure out what's wrong. The end result is the same -- Logback slows to a crawl -- but the trigger in our users' environment is apparently different and is not related to appenders writing to the same file.
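
For completeness, here is roughly how our in-house reproduction is wired up -- the programmatic equivalent of two XML appenders sharing the same <file>. It is a sketch, not our real code: the names, paths, pattern and filler loop are invented, and the setters are the Logback 1.0.x ones (setMaxFileSize takes a String there).

    // Sketch of the reproduction: two rolling appenders, one physical log file.
    import ch.qos.logback.classic.Logger;
    import ch.qos.logback.classic.LoggerContext;
    import ch.qos.logback.classic.encoder.PatternLayoutEncoder;
    import ch.qos.logback.classic.spi.ILoggingEvent;
    import ch.qos.logback.core.rolling.FixedWindowRollingPolicy;
    import ch.qos.logback.core.rolling.RollingFileAppender;
    import ch.qos.logback.core.rolling.SizeBasedTriggeringPolicy;
    import org.slf4j.LoggerFactory;

    public class SameFileRepro {

        // One rolling appender, configured like our LOG_FILE appender above.
        static RollingFileAppender<ILoggingEvent> build(LoggerContext ctx, String name) {
            RollingFileAppender<ILoggingEvent> app = new RollingFileAppender<ILoggingEvent>();
            app.setContext(ctx);
            app.setName(name);
            app.setFile("D:/logs/product.log");

            FixedWindowRollingPolicy rolling = new FixedWindowRollingPolicy();
            rolling.setContext(ctx);
            rolling.setParent(app);
            rolling.setFileNamePattern("D:/logs/product_%i.log.zip");
            rolling.setMinIndex(0);
            rolling.setMaxIndex(9);
            rolling.start();

            SizeBasedTriggeringPolicy<ILoggingEvent> trigger = new SizeBasedTriggeringPolicy<ILoggingEvent>();
            trigger.setContext(ctx);
            trigger.setMaxFileSize("50MB");   // String setter in 1.0.x; newer releases take a FileSize
            trigger.start();

            PatternLayoutEncoder enc = new PatternLayoutEncoder();
            enc.setContext(ctx);
            enc.setPattern("[%d] [%-5level] %logger - %msg%n");
            enc.start();

            app.setRollingPolicy(rolling);
            app.setTriggeringPolicy(trigger);
            app.setEncoder(enc);
            app.start();
            return app;
        }

        public static void main(String[] args) {
            LoggerContext ctx = (LoggerContext) LoggerFactory.getILoggerFactory();
            Logger root = ctx.getLogger(org.slf4j.Logger.ROOT_LOGGER_NAME);

            // Two independent appenders pointed at the SAME active file and archives.
            // (Recent Logback versions detect this kind of collision; 1.0.7 allows it.)
            root.addAppender(build(ctx, "LOG_FILE_A"));
            root.addAppender(build(ctx, "LOG_FILE_B"));

            // Hammer the logger until product.log crosses the 50 MB threshold a few times.
            for (int i = 0; i < 10000000; i++) {
                root.warn("filler message number {} -- padding padding padding padding padding", i);
            }
        }
    }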


3. A thread dump while we were reproducing the problem (with two appenders writing to the same file) reveals the following:


3.1. One thread holds a lock (0x00000000dfa5d600):


"Thread-48" #199 daemon prio=5 os_prio=0 tid=0x0000000024b61000 nid=0x772c runnable [0x000000002422e000]
   java.lang.Thread.State: RUNNABLE
    at java.util.zip.Deflater.deflateBytes(Native Method)
    at java.util.zip.Deflater.deflate(Deflater.java:442)
    - locked <0x00000000fdb8dcf0> (a java.util.zip.ZStreamRef)
    at java.util.zip.Deflater.deflate(Deflater.java:364)
    at java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:251)
    at java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211)
    at java.util.zip.ZipOutputStream.write(ZipOutputStream.java:331)
    - locked <0x00000000fdb8da00> (a java.util.zip.ZipOutputStream)
    at ch.qos.logback.core.rolling.helper.Compressor.zipCompress(Compressor.java:110)
    at ch.qos.logback.core.rolling.helper.Compressor.compress(Compressor.java:58)
    at ch.qos.logback.core.rolling.FixedWindowRollingPolicy.rollover(FixedWindowRollingPolicy.java:147)
    at ch.qos.logback.core.rolling.RollingFileAppender.rollover(RollingFileAppender.java:111)
    - locked <0x00000000dfa5d460> (a ch.qos.logback.core.spi.LogbackLock)
    at ch.qos.logback.core.rolling.RollingFileAppender.subAppend(RollingFileAppender.java:144)
    - locked <0x00000000dfa5d600> (a ch.qos.logback.core.rolling.SizeBasedTriggeringPolicy)
    at ch.qos.logback.core.OutputStreamAppender.append(OutputStreamAppender.java:103)
    at ch.qos.logback.core.UnsynchronizedAppenderBase.doAppend(UnsynchronizedAppenderBase.java:88)
    at ch.qos.logback.core.spi.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:48)
    at ch.qos.logback.classic.Logger.appendLoopOnAppenders(Logger.java:280)
    at ch.qos.logback.classic.Logger.callAppenders(Logger.java:267)
    at ch.qos.logback.classic.Logger.buildLoggingEventAndAppend(Logger.java:449)
    at ch.qos.logback.classic.Logger.filterAndLog_0_Or3Plus(Logger.java:403)
    at ch.qos.logback.classic.Logger.log(Logger.java:803)
    at org.apache.commons.logging.impl.SLF4JLocationAwareLog.warn(SLF4JLocationAwareLog.java:185)
    ...


3.2. A multitude of threads are blocked and waiting to acquire that lock, e.g.:


"Thread-55" #206 daemon prio=5 os_prio=0 tid=0x000000001f50d000 nid=0x7288 waiting for monitor entry [0x00000000263ef000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at ch.qos.logback.core.rolling.RollingFileAppender.subAppend(RollingFileAppender.java:142)
    - waiting to lock <0x00000000dfa5d600> (a ch.qos.logback.core.rolling.SizeBasedTriggeringPolicy)
    at ch.qos.logback.core.OutputStreamAppender.append(OutputStreamAppender.java:103)
    at ch.qos.logback.core.UnsynchronizedAppenderBase.doAppend(UnsynchronizedAppenderBase.java:88)
    at ch.qos.logback.core.spi.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:48)
    at ch.qos.logback.classic.Logger.appendLoopOnAppenders(Logger.java:280)
    at ch.qos.logback.classic.Logger.callAppenders(Logger.java:267)
    at ch.qos.logback.classic.Logger.buildLoggingEventAndAppend(Logger.java:449)
    at ch.qos.logback.classic.Logger.filterAndLog_0_Or3Plus(Logger.java:403)
    at ch.qos.logback.classic.Logger.log(Logger.java:803)
    at org.apache.commons.logging.impl.SLF4JLocationAwareLog.warn(SLF4JLocationAwareLog.java:185)
    ...




I have attached the logging configuration from the Virgo server. It's called serviceability.xml. (That's the default configuration file name for the Virgo server.)

I'm sorry the mail turned out so long. I hope this will not discourage you from reading it. I will be extremely grateful if you guys can help us figure out what makes Logback go crazy.