Android Development Master Class Notes - Chapter 02

Walkthrough of the Lesson 2 homework from the Android Development Master Class: solving a Native Crash by hooking system code

Handling a Native Crash

Finding the problem
Analyzing the problem
Solving the problem

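This section deals with a TimeoutException raised by the system's FinalizerWatchdogDaemon. It is raised when a finalize() method takes longer than 10 seconds to run. Before fixing it, we first need to understand what FinalizerWatchdogDaemon is: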

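FinalizerWatchdogDaemon is a nested class of Daemons and extends Daemons.Daemon. When an app is launched, Zygote forks a process, and the daemons are started as part of creating that child process. Creation takes three steps:

1. VM_HOOKS.preFork(): does the preparation needed before the fork.

2. nativeForkAndSpecialize(): creates the child process.

3. VM_HOOKS.postForkCommon(): starts Zygote's four daemon threads, including FinalizerWatchdogDaemon.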

public static int forkAndSpecialize(int uid, int gid, int[] gids, int runtimeFlags,
          int[][] rlimits, int mountExternal, String seInfo, String niceName, int[] fdsToClose,
          int[] fdsToIgnore, boolean startChildZygote, String instructionSet, String appDataDir) {
        VM_HOOKS.preFork();
        // Resets nice priority for zygote process.
        resetNicePriority();
        int pid = nativeForkAndSpecialize(
                  uid, gid, gids, runtimeFlags, rlimits, mountExternal, seInfo, niceName, fdsToClose,
                  fdsToIgnore, startChildZygote, instructionSet, appDataDir);
        // Enable tracing as soon as possible for the child process.
        if (pid == 0) {
            Trace.setTracingEnabled(true, runtimeFlags);

            // Note that this event ends at the end of handleChildProc,
            Trace.traceBegin(Trace.TRACE_TAG_ACTIVITY_MANAGER, "PostFork");
        }
        VM_HOOKS.postForkCommon();
        return pid;
    }

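// Step 1:
   // Stop the four daemon threads (heap trimming, reference queue, finalizer, finalizer watchdog),
   // so that none of them is running while the child process is being forked.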
   public void preFork() {
       Daemons.stop();
       waitUntilAllThreadsStopped();
       token = nativePreFork();
   }
   
   ....
   
   /**
    * Called by the zygote in both the parent and child processes after
    * every fork. In the child process, this method is called after
    * {@code postForkChild}.
    */
    // Step 3: start the daemons
   public void postForkCommon() {
       Daemons.start();
   }

With the creation process understood, let's look at the four daemon threads mentioned above:

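ReferenceQueueDaemon: the reference queue daemon. A reference object can be associated with a queue when it is created. When the object it refers to is collected by the GC, the reference object is added to that queue. ReferenceQueueDaemon performs this enqueueing, which is how the application learns which referents have been collected.
FinalizerDaemon: the finalizer daemon. Objects that override finalize() are not reclaimed immediately when the GC decides to collect them. They are placed in a queue, FinalizerDaemon calls finalize() on each of them, and only then are they reclaimed.
FinalizerWatchdogDaemon: the finalizer watchdog daemon. It monitors FinalizerDaemon. If the finalize() call for an object runs longer than a fixed limit, the watchdog exits the VM.
HeapTaskDaemon: the heap trimming daemon. It performs heap trimming, that is, it returns free heap memory to the system.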
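The reference-queue behaviour described for ReferenceQueueDaemon can be observed from plain Java. The following is a minimal sketch, not Android source: the class name ReferenceQueueDemo and the helper awaitEnqueued are made up for illustration, and System.gc() is only a hint, so the helper may return null if no collection happens within the timeout.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;

class ReferenceQueueDemo {
    /**
     * Waits up to timeoutMs for any reference to appear on the queue,
     * nudging the GC on each round. Returns the reference, or null on timeout.
     */
    static Reference<?> awaitEnqueued(ReferenceQueue<?> queue, long timeoutMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (System.nanoTime() - deadline < 0) {
            System.gc(); // a hint only; the runtime may ignore it
            try {
                Reference<?> ref = queue.remove(50); // block for at most 50 ms
                if (ref != null) {
                    return ref;
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return null;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        ReferenceQueue<Object> queue = new ReferenceQueue<>();
        // The referent has no strong reference, so it is eligible for collection.
        WeakReference<Object> ref = new WeakReference<>(new Object(), queue);
        Reference<?> enqueued = awaitEnqueued(queue, 2000);
        System.out.println("same reference came back: " + (enqueued == ref));
    }
}
```

Note that the queue hands back the reference object itself, not the collected referent, which matches the description above.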

Of the four, FinalizerWatchdogDaemon is the one whose job is to monitor how long finalize() takes. Here is its source:

@Override public void runInternal() {
           while (isRunning()) {
               if (!sleepUntilNeeded()) {
                   // We have been interrupted, need to see if this daemon has been stopped.
                   continue;
               }
               final Object finalizing = waitForFinalization();
               if (finalizing != null && !VMRuntime.getRuntime().isDebuggerActive()) {
                   finalizerTimedOut(finalizing);
                   break;
               }
           }
       }

waitForFinalization returns an object, stored in finalizing. If it is not null and no debugger is attached, finalizerTimedOut is called. First, waitForFinalization:

/**
* Return an object that took too long to finalize or return null.
* Wait MAX_FINALIZE_NANOS.  If the FinalizerDaemon took essentially the whole time
* processing a single reference, return that reference.  Otherwise return null.
*/
private Object waitForFinalization() {
  long startCount = FinalizerDaemon.INSTANCE.progressCounter.get();
  // Avoid remembering object being finalized, so as not to keep it alive.
  if (!sleepFor(MAX_FINALIZE_NANOS)) {
    // Don't report possibly spurious timeout if we are interrupted.
    return null;
  }
  if (getNeedToWork() && FinalizerDaemon.INSTANCE.progressCounter.get() == startCount) {
    // ...
    Object finalizing = FinalizerDaemon.INSTANCE.finalizingObject;
    sleepFor(NANOS_PER_SECOND / 2);
    //...
    if (getNeedToWork()
        && FinalizerDaemon.INSTANCE.progressCounter.get() == startCount) {
      return finalizing;
    }
  }
  return null;
}

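As the comment on waitForFinalization says, if finalizing a single object takes longer than MAX_FINALIZE_NANOS (10 s), the method returns the object being finalized (FinalizerDaemon.INSTANCE.finalizingObject); otherwise it returns null. As noted above, a non-null return value leads to a call to finalizerTimedOut: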

private static void finalizerTimedOut(Object object) {
            // The current object has exceeded the finalization deadline; abort!
            String message = object.getClass().getName() + ".finalize() timed out after "
                    + (MAX_FINALIZE_NANOS / NANOS_PER_SECOND) + " seconds";
            Exception syntheticException = new TimeoutException(message);
            // We use the stack from where finalize() was running to show where it was stuck.
            syntheticException.setStackTrace(FinalizerDaemon.INSTANCE.getStackTrace());

            // Send SIGQUIT to get native stack traces.
            try {
                Os.kill(Os.getpid(), OsConstants.SIGQUIT);
                // Sleep a few seconds to let the stack traces print.
                Thread.sleep(5000);
            } catch (Exception e) {
                System.logE("failed to send SIGQUIT", e);
            } catch (OutOfMemoryError ignored) {
                // May occur while trying to allocate the exception.
            }

            //...
            if (Thread.getUncaughtExceptionPreHandler() == null &&
                    Thread.getDefaultUncaughtExceptionHandler() == null) {
                // If we have no handler, log and exit.
                System.logE(message, syntheticException);
                System.exit(2);
            }

            // Otherwise call the handler to do crash reporting.
            // We don't just throw because we're not the thread that
            // timed out; we're the thread that detected it.
            Thread.currentThread().dispatchUncaughtException(syntheticException);
        }
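The progress-counter check used by waitForFinalization can be tried outside Android with a small simulation. This is a sketch under stated assumptions, not the platform code: WatchdogSketch, watch and runWithWatchdog are hypothetical names, a sleeping worker stands in for a slow finalize(), the timeout is in milliseconds instead of 10 s, and the getNeedToWork() check and the half-second recheck are left out.

```java
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

class WatchdogSketch {
    // Stand-in for FinalizerDaemon.progressCounter: bumped after each finished task.
    static final AtomicInteger progressCounter = new AtomicInteger();

    /** Sleeps for timeoutMs and reports a timeout only if the counter did not move. */
    static TimeoutException watch(long timeoutMs) {
        int startCount = progressCounter.get();
        try {
            Thread.sleep(timeoutMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null; // interrupted: do not report a possibly spurious timeout
        }
        if (progressCounter.get() == startCount) {
            return new TimeoutException("no progress within " + timeoutMs + " ms");
        }
        return null;
    }

    /** Runs a task lasting taskMs on a worker thread while the watchdog observes it. */
    static TimeoutException runWithWatchdog(long taskMs, long timeoutMs) {
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(taskMs); // stands in for a slow finalize()
            } catch (InterruptedException ignored) {
                return; // abandoned task: no progress is recorded
            }
            progressCounter.incrementAndGet();
        });
        worker.setDaemon(true);
        worker.start();
        TimeoutException result = watch(timeoutMs);
        worker.interrupt(); // clean up the worker whether or not it finished
        return result;
    }

    public static void main(String[] args) {
        System.out.println("fast task: " + runWithWatchdog(10, 300));
        System.out.println("slow task: " + runWithWatchdog(5_000, 300));
    }
}
```

As in finalizerTimedOut, the exception is created by the observing thread rather than thrown by the stuck one, which is why the real code copies the finalizer thread's stack trace into it.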

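finalizerTimedOut builds a TimeoutException and, instead of throwing it, passes it to the uncaught-exception handler through dispatchUncaughtException. When no handler is installed, it logs the error and calls System.exit(2). That call is unusual in everyday code, where System.exit(0) is the norm. So what does the argument to exit mean?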

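For System.exit(int code), only 0 means the program exited normally. By convention, any other value means it exited because of an error or exception.

1-127: codes defined by the application itself.

128-255: by Unix shell convention, 128 + N means the process was terminated by signal N, for example SIGSEGV or SIGTERM.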
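The ranges above can be written down as a small classifier. This is only an illustration of the shell convention, with a hypothetical class name ExitCodes; it treats 129-255 as "128 + signal number" and leaves 128 itself in the error range, since signal 0 does not denote a real signal.

```java
class ExitCodes {
    /** Describes an exit status in the range 0-255 using the common shell convention. */
    static String describe(int status) {
        if (status < 0 || status > 255) {
            throw new IllegalArgumentException("exit status out of range: " + status);
        }
        if (status == 0) {
            return "success";
        }
        if (status <= 128) {
            return "error code " + status;
        }
        return "terminated by signal " + (status - 128);
    }

    public static void main(String[] args) {
        System.out.println(describe(0));   // normal exit
        System.out.println(describe(2));   // the value passed by finalizerTimedOut
        System.out.println(describe(143)); // 128 + 15, where 15 is SIGTERM
    }
}
```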

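Back to the TimeoutException. The analysis above shows where the exception originates, so the fix is to keep finalizerTimedOut from running, in other words to stop FinalizerWatchdogDaemon. It is essentially a thread, and its parent class provides a stop method, so the first idea is to reach the class through reflection and call stop: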

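import java.lang.reflect.Field;
import java.lang.reflect.Method;

// Look up the hidden watchdog singleton and call Daemon.stop() on it.
final Class<?> clazz = Class.forName("java.lang.Daemons$FinalizerWatchdogDaemon");
final Field field = clazz.getDeclaredField("INSTANCE");
field.setAccessible(true);
final Object watchdog = field.get(null); // INSTANCE is a static field
final Method method = clazz.getSuperclass().getDeclaredMethod("stop");
method.setAccessible(true);
method.invoke(watchdog);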

This looks fine, but on Android versions below 6.0 it can run into thread-synchronization problems. To see why, compare the stop source in Android 6.0 and above with that of Android 5.1:

Android 7.0:

public void stop() {
  Thread threadToStop;
  synchronized (this) {
    threadToStop = thread;
    thread = null;
  }
  if (threadToStop == null) {
    throw new IllegalStateException("not running");
  }
  interrupt(threadToStop);
  while (true) {
    try {
      threadToStop.join();
      return;
    } catch (InterruptedException ignored) {
    } catch (OutOfMemoryError ignored) {
      // An OOME may be thrown if allocating the InterruptedException failed.
    }
  }
}

Android 5.1:

public void stop() {
  Thread threadToStop;
  synchronized (this) {
    threadToStop = thread;
    thread = null;
  }
  if (threadToStop == null) {
    throw new IllegalStateException("not running");
  }
  threadToStop.interrupt();
  while (true) {
    try {
      threadToStop.join();
      return;
    } catch (InterruptedException ignored) {
    }
  }
}

The comparison shows that Android 6.0 and above interrupt the thread through the helper method interrupt(threadToStop), whereas Android 5.1 calls Thread.interrupt() directly. Here is that interrupt method:

public synchronized void interrupt(Thread thread) {
  if (thread == null) {
    throw new IllegalStateException("not running");
  }
  thread.interrupt();
}

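This makes the difference clear: below Android 6.0 the interrupt call is not synchronized, so problems can arise when several threads are involved. For that reason the demo provided with the course takes a different approach: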

final Field thread = clazz.getSuperclass().getDeclaredField("thread");
thread.setAccessible(true);
thread.set(watchdog, null);

This approach sets the daemon's thread field directly to null. In the runInternal method of FinalizerWatchdogDaemon, the loop condition is:

while(isRunning()){
	//...
}

protected synchronized boolean isRunning() {
  return thread != null;
}

Once thread is null, isRunning() returns false and the while loop exits, which has the same effect as calling stop. This stops the 10-second watch on finalize() and so avoids the TimeoutException.
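The effect of clearing the field can be reproduced on a desktop JVM with a miniature daemon. This is a sketch under assumptions, not the Android class: NullThreadSketch and MiniDaemon are hypothetical names, and the loop body is a short sleep rather than the watchdog's real work. Unlike the demo, the reflective write here is done while holding the daemon's monitor, so that the synchronized isRunning() is guaranteed to see it.

```java
import java.lang.reflect.Field;

class NullThreadSketch {
    /** Miniature of Daemons.Daemon: the loop runs while the 'thread' field is non-null. */
    static class MiniDaemon implements Runnable {
        private Thread thread;

        synchronized void start() {
            thread = new Thread(this, "MiniDaemon");
            thread.setDaemon(true);
            thread.start();
        }

        synchronized boolean isRunning() {
            return thread != null;
        }

        @Override
        public void run() {
            while (isRunning()) {
                try {
                    Thread.sleep(20); // stands in for the daemon's periodic work
                } catch (InterruptedException ignored) {
                    // keep looping; only the field decides when to stop
                }
            }
        }
    }

    /** Clears the private 'thread' field via reflection; returns true if the loop exited. */
    static boolean stopByClearingThreadField(MiniDaemon daemon) {
        try {
            Field field = MiniDaemon.class.getDeclaredField("thread");
            field.setAccessible(true);
            Thread worker;
            synchronized (daemon) { // same monitor that isRunning() uses
                worker = (Thread) field.get(daemon);
                field.set(daemon, null);
            }
            if (worker == null) {
                return false; // the daemon was never started
            }
            worker.join(2000);
            return !worker.isAlive();
        } catch (ReflectiveOperationException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        MiniDaemon daemon = new MiniDaemon();
        daemon.start();
        System.out.println("stopped: " + stopByClearingThreadField(daemon));
    }
}
```

The loop only notices the change the next time it evaluates isRunning(), so a daemon that is in the middle of a long sleep keeps sleeping until that sleep ends.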