[logback-dev] [JIRA] Created: (LBCORE-246) ResilientOutputStreamBase does not recover from transient network outage when the appender uses the LayoutWrappingEncoder

Thomas P Rees (JIRA) noreply-jira at qos.ch
Tue Feb 21 18:06:26 CET 2012


ResilientOutputStreamBase does not recover from transient network outage when the appender uses the LayoutWrappingEncoder
-------------------------------------------------------------------------------------------------------------------------

                 Key: LBCORE-246
                 URL: http://jira.qos.ch/browse/LBCORE-246
             Project: logback-core
          Issue Type: Bug
          Components: Layout
    Affects Versions: 0.9.30
         Environment: Windows 
            Reporter: Thomas P Rees
            Assignee: Logback dev list


Our customer has a problem with network glitches, so we upgraded from 0.9.17 to 0.9.30 to take advantage of the ResilientOutputStream. The particular appender that caused problems at the customer site used a custom layout class, so we wrapped that class in a LayoutWrappingEncoder in build.xml:

  <appender name="AUDIT" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <File>${logging.audit.root}/audit/audit.log</File>
    <encoder class="ch.qos.logback.core.encoder.LayoutWrappingEncoder">
      <layout class="com.chiliad.log.util.PropertiesLayout" />
    </encoder>
    <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
     <!-- stuff deleted here -->
    </rollingPolicy>
  </appender>

We tested the ResilientOutputStream by editing logback.xml to write the logfile to another system, then we removed the ethernet cable from that other system. A couple minutes later we plugged the ethernet cable back into the other system. Stdout of the logging system reported

    2012/02/14 16:33:47 | INFO   | jvm 1    | srvmain | 16:33:47,414 |-ERROR in c.q.l.c.recovery.ResilientFileOutputStream at 12297237 - IO failure while writing to file [\\cushman\e\logs\audit\audit.log] java.io.IOException: The specified network name is no longer available

There was also a long stack trace from Java with more detail about the IOException.

However immediately after this stack trace the following message appeared

    2012/02/14 16:33:47 | INFO   | jvm 1    | srvmain | 16:33:47,408 |-INFO in c.q.l.c.recovery.ResilientFileOutputStream at 12297237 - Recovered from IO failure on file [\\cushman\e\logs\audit\audit.log]

Furthermore, logging did not actually recover. The audit log on the remote server did not get any more messages, and stdout continued to record the messaages shown above, always paired with the complaint about the IO failure and then saying that it recovered from the IO failure.

I downloaded the source code and found the source of the problem. The LayoutWrappingEncoder always calls write and then flush together:

    outputStream.write(convertToBytes(txt));
    outputStream.flush();

In ResilientOutputStreamBase the write method checks for a bad state in the output stream, but the flush method does not:

  public void write(int b) {
    if (isPresumedInError()) {
      if (!recoveryCoordinator.isTooSoon()) {
        attemptRecovery();
      }
      return; // return regardless of the success of the recovery attempt
    }
    try {
      os.write(b);
      postSuccessfulWrite();
    } catch (IOException e) {
      postIOFailure(e);
    }
  }

  public void flush() {
    if (os != null) {
      try {
        os.flush();
        postSuccessfulWrite();
      } catch (IOException e) {
        postIOFailure(e);
      }
    }
  }

So, the write method encounters the IOException and sets the presumedClean flag to false. However, the flush method does not call isPresumedInError and the flush succeeds (I presume because there is no data to flush). The flush method then calls postSuccessfulWrite, which sets the presumedClean flag to true.

I fixed this problem by changing the flush method to call isPresumedInError::

  public void flush() {
    if (os != null) {
      if (isPresumedInError()) {
        if (!recoveryCoordinator.isTooSoon()) {
          attemptRecovery();
        }
        return; // return regardless of the success of the recovery attempt
      }
      try {
        os.flush();
        postSuccessfulWrite();
      } catch (IOException e) {
        postIOFailure(e);
      }
    }
  }

Now when we run the test of unplugging the ethernet cable on the system where the audit logs are written and then plug the cable back in the logger starts logging again and the logfile resumes recording messages.

Does this solution look correct to you?
thanks
Tom


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://jira.qos.ch/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        


More information about the logback-dev mailing list