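On a senior colleague's Weibo I read that he always finds things to keep himself busy and holds himself to a strict standard. That struck a chord with me. By comparison I feel ashamed, because I have never managed to do the same; that is probably the gap between the experts and beginners like me. Now the problems are coming to me on their own, again and again, and if I kept dodging them instead of facing them seriously, I would have no excuse.
A friend asked a question in our chat group: on Ubuntu Server 14.04, with PHP5-FPM installed via apt-get install, he compiled phpredis himself, but the phpredis extension never got loaded.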
ubuntu@shnj-b-batch-30-222:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.2 LTS
Release: 14.04
Codename: trusty
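And this is what he ran:
ubuntu@shnj-b-batch-30-222:~$ sudo /etc/init.d/php5-fpm restart
The result: no output at all. ps aux showed that all the original FPM processes were still there. sudo /etc/init.d/php5-fpm stop could not stop them either, and it also printed nothing. Out of options, he ran killall php5-fpm to kill every FPM process. When he then tried sudo /etc/init.d/php5-fpm start, PHP5-FPM would not start.
So here are the questions. Why can't sudo /etc/init.d/php5-fpm stop stop PHP5-FPM? Is the signal not reaching FPM? Why can't it be started again after the kill? Did PHP hit something unexpected while loading the extension, so that the process crashed and exited? How do we investigate? Someone in the group suggested using gdb to see whether PHP failed while loading the extension. But in my view a lot happens before PHP ever loads an extension: the /etc/init.d/php5-fpm script is what launches /usr/sbin/php5-fpm, and before doing so it has plenty to do, such as reading shell variables and other system configuration. After the launch, PHP still reads its ini configuration, reads the extension's .so file, and so on. I wanted to see first roughly how far the file operations got before the program ended, so I decided to start with strace.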
00:49:06.070713 stat("/usr/sbin/php5-fpm", {st_mode=S_IFREG|0755, st_size=9102528, ...}) = 0
00:49:06.070735 faccessat(AT_FDCWD, "/usr/sbin/php5-fpm", X_OK) = 0
00:49:06.070762 faccessat(AT_FDCWD, "/etc/default/php5-fpm", R_OK) = -1 ENOENT (No such file or directory)
00:49:06.070789 open("/lib/init/vars.sh", O_RDONLY) = 3
00:49:06.070810 fcntl(3, F_DUPFD, 10) = 11
00:49:06.070827 close(3)= 0
// (lines omitted)
00:49:06.070904 stat("/etc/default/rcS", {st_mode=S_IFREG|0644, st_size=691, ...}) = 0
00:49:06.070929 open("/etc/default/rcS", O_RDONLY) = 3
00:49:06.070949 fcntl(3, F_DUPFD, 10) = 12
00:49:06.070965 close(3)= 0
// (lines omitted)
00:49:06.073407 open("/lib/lsb/init-functions", O_RDONLY) = 3
00:49:06.073430 fcntl(3, F_DUPFD, 10) = 11
00:49:06.073447 close(3)= 0
00:49:06.073463 fcntl(11, F_SETFD, FD_CLOEXEC) = 0
00:49:06.073480 read(11, "# /lib/lsb/init-functions for Debian"..., 8192) = 3154
00:49:06.073788 pipe([3, 4])= 0
// (lines omitted)
00:49:06.076034 faccessat(AT_FDCWD, "/lib/lsb/init-functions.d/50-ubuntu-logging", R_OK) = 0
00:49:06.076056 open("/lib/lsb/init-functions.d/50-ubuntu-logging", O_RDONLY) = 3
// (lines omitted)
00:49:06.076259 stat("/etc/lsb-base-logging.sh", 0x7fff86a1d090) = -1 ENOENT (No such file or directory)
00:49:06.076281 read(11, "", 8192) = 0
00:49:06.076297 close(11) = 0
00:49:06.076318 geteuid() = 0
00:49:06.076333 stat("/sbin/initctl", {st_mode=S_IFREG|0755, st_size=193512, ...}) = 0
00:49:06.076354 faccessat(AT_FDCWD, "/sbin/initctl", X_OK) = 0
// (lines omitted)
00:49:06.076568 wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 11197
00:49:06.082628 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=11197, si_status=0, si_utime=0, si_stime=0} ---
00:49:06.082642 rt_sigreturn() = 11197
00:49:06.082659 wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 11196
00:49:06.082753 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=11196, si_status=0, si_utime=0, si_stime=0} ---
00:49:06.082764 rt_sigreturn() = 11196
00:49:06.082796 exit_group(1) = ?
00:49:06.082837 +++ exited with 1 +++
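The timestamps in the trace above suggest it was captured with something along the lines of strace -f -tt -o FILE /etc/init.d/php5-fpm stop; that exact command is an assumption, since the post does not record it. To reduce such a log to the list of files the script touched, a small filter is enough. This is a minimal sketch, and trace_paths is a name made up here, not part of strace.

```shell
#!/bin/sh
# trace_paths FILE
# Print every path passed to stat/open/faccessat in an strace log,
# in first-seen order and without duplicates.
# Hypothetical helper for illustration only.
trace_paths() {
    # Keep only the three syscalls of interest, pull out the first
    # double-quoted argument, then drop repeated paths.
    grep -E ' (stat|open|faccessat)\(' "$1" |
        sed -n 's/^[^"]*"\([^"]*\)".*/\1/p' |
        awk '!seen[$0]++'
}
```

Calling trace_paths on the saved log gives one path per line, which makes it easier to line the trace up against the init script.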
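From the strace output we can see that /etc/init.d/php5-fpm accessed (open or stat), in this order:
/usr/sbin/php5-fpm
/lib/init/vars.sh
/etc/default/rcS
/lib/lsb/init-functions
/sbin/initctl
After that it ended with +++ exited with 1 +++. It never got as far as acting on the pid file and carrying out the stop command. Let's look at what /etc/init.d/php5-fpm actually contains.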
#// Note: lines 23-39 of the /etc/init.d/php5-fpm script
# Exit if the package is not installed
[ -x "$DAEMON" ] || exit 0
# Read configuration variable file if it is present
[ -r /etc/default/$NAME ] && . /etc/default/$NAME
# Load the VERBOSE setting and other rcS variables
. /lib/init/vars.sh
# Define LSB log_* functions.
# Depend on lsb-base (>= 3.0-6) to ensure that this file is present.
. /lib/lsb/init-functions
# Don't run if we are running upstart
if init_is_upstart; then #// Note: in shell, when the function tested by if returns 0, the then branch runs.
    exit 1
fi
#// Note: lines 261-270 of /lib/lsb/init-functions
# If the currently running init daemon is upstart, return zero; if the
# calling init script belongs to a package which also provides a native
# upstart job, it should generally exit non-zero in this case.
init_is_upstart()
{
    if [ -x /sbin/initctl ] && /sbin/initctl version 2>/dev/null | /bin/grep -q upstart; then
        return 0
    fi
    return 1
}
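The guard is easy to misread, because the shell treats a return value of 0 as "true". Here is a minimal sketch of the same logic, with the initctl version text passed in as an argument so that it runs on any machine. looks_like_upstart and guard_status are hypothetical names, and the version strings used to call them are only illustrative.

```shell
#!/bin/sh
# Stand-in for init_is_upstart: returns 0 (shell "true") when the given
# version text mentions upstart. The real function reads the output of
# `/sbin/initctl version` instead of taking an argument.
looks_like_upstart() {
    printf '%s\n' "$1" | grep -q upstart
}

# Run the same guard as the init script inside a subshell and print the
# exit status that the caller would later see in $?.
guard_status() {
    status=0
    (
        if looks_like_upstart "$1"; then
            exit 1    # upstart is running: leave without doing anything
        fi
        exit 0        # otherwise the script would carry on
    ) || status=$?
    echo "$status"
}
```

For example, guard_status 'init (upstart 1.12.1)' prints 1, which is the silent exit status seen at the end of the trace, while any text without the word upstart gives 0.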
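It is fairly clear that the script exits with exit 1 right after the init_is_upstart() check: the order of file accesses in the trace (open and stat) matches the order in the script, and the exit status of 1 matches the guard. Running echo $? right after the script confirms it. Knowing where it exits, the fix looks trivial: comment out the three lines if init_is_upstart; then / exit 1 / fi and be done.
After commenting out lines 27, 28 and 29 of /etc/init.d/php5-fpm, I tried /etc/init.d/php5-fpm start, stop and restart and, "sure enough", they all worked...
Comment out lines 27, 28 and 29 of /etc/init.d/php5-fpm and that's it????
Why does the script contain that check, and why does it exit afterwards without doing anything to FPM? With the change in place, starting, restarting and stopping FPM all worked, but something still felt wrong. Back on the virtual machine I ran a few more tests, and something even stranger showed up. With FPM started as above, I casually typed service php5-fpm stop and got stop: Unknown instance:. Hm? The service tool cannot manage a process started by /etc/init.d/php5-fpm? I remembered that it could, so I immediately tested on a server. The server runs Ubuntu Server 12.04, and there the two could manage processes started by each other.
Another question, then: why does everything work on Ubuntu 12.04 but not on Ubuntu 14.04? Is it because of the three lines I commented out? What changed with the system upgrade? To find out, I read the service script carefully.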
# Ubuntu 14.04.2 LTS, /usr/sbin/service, lines 118-145
if [ -r "/etc/init/${SERVICE}.conf" ] && which initctl >/dev/null \
   && initctl version | grep -q upstart
then
   # Upstart configuration exists for this job and we're running on upstart
   case "${ACTION}" in
      start|stop|status|reload)
         # Action is a valid upstart action
         exec ${ACTION} ${SERVICE} ${OPTIONS}
      ;;
      restart)
         # Map restart to the usual sysvinit behavior.
         stop ${SERVICE} ${OPTIONS} || :
         exec start ${SERVICE} ${OPTIONS}
      ;;
      force-reload)
         # Upstart just uses reload for force-reload
         exec reload ${SERVICE} ${OPTIONS}
      ;;
   esac
fi
# Otherwise, use the traditional sysvinit
if [ -x "${SERVICEDIR}/${SERVICE}" ]; then
   exec env -i LANG="$LANG" PATH="$PATH" TERM="$TERM" "$SERVICEDIR/$SERVICE" ${ACTION} ${OPTIONS}
else
   echo "${SERVICE}: unrecognized service" >&2
   exit 1
fi
### Below is part of the service script on my server running Ubuntu 12.04 ###
# Ubuntu 12.04.5 LTS, /usr/sbin/service, lines 118-144
if [ -r "/etc/init/${SERVICE}.conf" ]; then
   # Upstart configuration exists for this job
   case "${ACTION}" in
      start|stop|status|reload)
         # Action is a valid upstart action
         echo "${ACTION} ${SERVICE} ${OPTIONS}"
         exec ${ACTION} ${SERVICE} ${OPTIONS}
      ;;
      restart)
         # Map restart to the usual sysvinit behavior.
         stop ${SERVICE} ${OPTIONS} || :
         exec start ${SERVICE} ${OPTIONS}
      ;;
      force-reload)
         # Upstart just uses reload for force-reload
         exec reload ${SERVICE} ${OPTIONS}
      ;;
   esac
fi
# Otherwise, use the traditional sysvinit
if [ -x "${SERVICEDIR}/${SERVICE}" ]; then
   exec env -i LANG="$LANG" PATH="$PATH" TERM="$TERM" "$SERVICEDIR/$SERVICE" ${ACTION} ${OPTIONS}
else
   echo "${SERVICE}: unrecognized service" >&2
   exit 1
fi
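Taking php5-fpm as $SERVICE, the code reads as follows. On 14.04, if /etc/init/${SERVICE}.conf exists, initctl exists, and the output of initctl version contains the string upstart, then start, stop or reload ${SERVICE} ${OPTIONS} is executed (12.04 only checks for the .conf file). If the conditions are not met, the $SERVICEDIR/$SERVICE script is run to manage the program. In other words, the wrapper chooses between upstart and sysvinit according to the init system the operating system is currently using.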
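The dispatch rule of the 14.04 script can be boiled down to a small decision function. This is a sketch, not the real service code: pick_backend is a made-up name, the two directories are parameters that default to the paths from the text, the "is upstart running" test is reduced to a yes/no argument, and the ACTION handling is left out.

```shell
#!/bin/sh
# pick_backend SERVICE UPSTART_RUNNING [JOB_DIR] [SCRIPT_DIR]
# Prints the path the 14.04 service wrapper would take:
#   upstart      - a job file exists and upstart is the running init
#   sysvinit     - fall back to the executable init script
#   unrecognized - neither is available
pick_backend() {
    service=$1
    upstart_running=$2          # "yes" or "no"; stands in for `initctl version | grep -q upstart`
    job_dir=${3:-/etc/init}
    script_dir=${4:-/etc/init.d}

    if [ -r "$job_dir/$service.conf" ] && [ "$upstart_running" = "yes" ]; then
        echo upstart
    elif [ -x "$script_dir/$service" ]; then
        echo sysvinit
    else
        echo unrecognized
    fi
}
```

With a job file present and upstart running, pick_backend php5-fpm yes prints upstart; remove the job file and the same call falls through to sysvinit.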
The man page for service describes it like this:
DESCRIPTION
service runs a System V init script or upstart job in as predictable environment as possible, removing most
environment variables and with current working directory set to /.
The SCRIPT parameter specifies a System V init script, located in /etc/init.d/SCRIPT, or the name of an
upstart job in /etc/init. The existence of an upstart job of the same name as a script in /etc/init.d will
cause the upstart job to take precedence over the init.d script. The supported values of COMMAND depend on
the invoked script, service passes COMMAND and OPTIONS to the init script unmodified. For upstart jobs, start,
stop, status, are passed through to their upstart equivilents. Restart will call the upstart ‘stop’ for the
job, followed immediately by the ‘start’, and will exit with the return code of the start command. All
scripts should support at least the start and stop commands. As a special case, if COMMAND is --full-restart,
the script is run twice, first with the stop command, then with the start command. This option has no effect
on upstart jobs.
service --status-all runs all init scripts, in alphabetical order, with the status command. This option only
calls status for sysvinit jobs, upstart jobs can be queried in a similar manner with 'initctl list'.
FILES
/etc/init.d
The directory containing System V init scripts.
/etc/init
The directory containing upstart jobs.
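The man page mentions two init systems. Upstart is the init system that originated on Ubuntu, which at first also used System V init; Debian seems to offer it now as well. System V init was the original init system on most Unix-like systems. Later, because of its shortcomings, such as performance and poor support for hot-plugged USB devices, most Linux distributions moved to systemd, though most of them kept compatibility with System V init. So the service script wraps both init systems and picks the corresponding operation according to the init system in use. On Ubuntu:
The upstart tool is initctl, and initctl start, stop, restart and reload can be shortened to start, stop, restart and reload.
System V init scripts rely on the start-stop-daemon helper (on CentOS, daemon). service wraps the code paths for both init systems. /etc/init.d is the directory of System V init scripts; /etc/init is the directory of upstart job configuration.
I deliberately compared the service scripts on the virtual machine and on the server, and the only difference I found was that the upstart check has become stricter; nothing major... I did, however, find something odd: on Ubuntu 14.04, service php5-fpm status and service --status-all | grep php5-fpm do not get their answer the same way. The former is the state as upstart sees it, obtained with upstart's status command; the latter is the System V init result, obtained from the script under $SERVICEDIR/$SERVICE. Well...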
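Knowing how the service wrapper works still did not answer my earlier question: why can the service script and /etc/init.d/php5-fpm manage each other's processes on Ubuntu 12.04 but not on 14.04? I know that System V init and upstart do not share state for the same service, so why does it work on 12.04? After a lot of testing (and a lot of detours), I decided to go back and check the server environment once more. Before I give the result of that check, let me describe what I found along the detours.
I mentioned the error stop: Unknown instance: earlier. That one is easy to understand: php5-fpm was started by /etc/init.d/php5-fpm (with the three lines commented out), so it is managed the System V init way. service php5-fpm stop goes through upstart, which obviously has no record of that PHP5-FPM instance, hence stop: Unknown instance:. What happens the other way round, if I use System V init to shut down a php5-fpm that was started by upstart?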
root@vmware-cnxct:/home/cfc4n# service php5-fpm start
php5-fpm start/running, process 9582
root@vmware-cnxct:/home/cfc4n# start php5-fpm
start: Job is already running: php5-fpm
root@vmware-cnxct:/home/cfc4n# ps -ef|grep php
root 9582 1 0 16:04 ? 00:00:00 php-fpm: master process (/etc/php5/fpm/php-fpm.conf)
www-data 9584 9582 0 16:04 ? 00:00:00 php-fpm: pool www
www-data 9585 9582 0 16:04 ? 00:00:00 php-fpm: pool www
www-data 9586 9582 0 16:04 ? 00:00:00 php-fpm: pool www
www-data 9587 9582 0 16:04 ? 00:00:00 php-fpm: pool www
www-data 9588 9582 0 16:04 ? 00:00:00 php-fpm: pool www
root@vmware-cnxct:/home/cfc4n# service php5-fpm status
php5-fpm start/running, process 9582
root@vmware-cnxct:/home/cfc4n# status php5-fpm
php5-fpm start/running, process 9582
root@vmware-cnxct:/home/cfc4n# service --status-all|grep php
[ + ] php5-fpm
root@vmware-cnxct:/home/cfc4n# /etc/init.d/php5-fpm status
* php5-fpm is running
# Note: try to stop the php5-fpm daemon started by upstart using start-stop-daemon
root@vmware-cnxct:/home/cfc4n# /etc/init.d/php5-fpm stop
root@vmware-cnxct:/home/cfc4n# ps -ef|grep php
www-data 10232 1 0 16:19 ? 00:00:00 php-fpm: pool www
www-data 10233 1 0 16:19 ? 00:00:00 php-fpm: pool www
root 10297 6433 0 16:19 pts/0 00:00:00 grep --color=auto php
root@vmware-cnxct:/home/cfc4n# status php5-fpm
php5-fpm stop/waiting
root@vmware-cnxct:/home/cfc4n# /etc/init.d/php5-fpm status
* php5-fpm is not running
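Attentive readers will notice that the original FPM master process is gone and two child processes remain (their parent is PID 1 because their own parent exited unexpectedly and the init process, PID 1, adopted them). Judging by the remaining PIDs, they are not the original children either.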
root@vmware-cnxct:/home/cfc4n# cat /var/run/php5-fpm.pid
10223
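The original pid file and the Unix domain socket file were left behind as well... So start-stop-daemon can stop the original PHP5-FPM process (it simply sends SIGTERM, as kill would, which of course can stop a process it did not start). But why are processes left over, and why do the children have different PIDs from before?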
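One way to confirm that a pid file is a leftover is to ask the kernel whether the recorded PID still exists. A minimal sketch, assuming a POSIX shell: pidfile_state is a hypothetical helper, and kill -0 only tests for existence and permission, it delivers no signal.

```shell
#!/bin/sh
# pidfile_state FILE
# Print "missing", "stale" or "alive" for a pid file.
# Run it as root: without permission to signal the process, kill -0
# fails too, and a live process would be reported as stale.
pidfile_state() {
    [ -r "$1" ] || { echo missing; return 0; }
    pid=$(cat "$1")
    # kill -0 sends nothing; it succeeds only if the process exists
    # and we are allowed to signal it.
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        echo alive
    else
        echo stale
    fi
}
```

On the machine from the session above, pidfile_state /var/run/php5-fpm.pid would be the call to try; a result of stale means the file no longer points at a running process.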
/var/log/upstart/php5-fpm.log contains these entries:
[08-Jul-2015 03:53:43] NOTICE: systemd monitor interval set to 10000ms
[08-Jul-2015 03:54:13] WARNING: [pool www] child 1722 exited on signal 15 (SIGTERM) after 29.959611 seconds from start
[08-Jul-2015 03:54:13] NOTICE: [pool www] child 1733 started
[08-Jul-2015 03:54:13] WARNING: [pool www] child 1721 exited on signal 15 (SIGTERM) after 29.961075 seconds from start
[08-Jul-2015 03:54:13] NOTICE: [pool www] child 1734 started
[08-Jul-2015 03:54:13] NOTICE: Terminating ...
[08-Jul-2015 03:54:13] NOTICE: exiting, bye-bye!
[08-Jul-2015 03:54:13] NOTICE: fpm is running, pid 1739
[08-Jul-2015 03:54:13] NOTICE: ready to handle connections
[08-Jul-2015 03:54:13] NOTICE: systemd monitor interval set to 10000ms
[08-Jul-2015 03:54:18] WARNING: [pool www] child 1745 exited on signal 9 (SIGKILL) after 4.946135 seconds from start
[08-Jul-2015 03:54:18] NOTICE: [pool www] child 1746 started
[08-Jul-2015 03:54:18] WARNING: [pool www] child 1744 exited on signal 9 (SIGKILL) after 4.947469 seconds from start
[08-Jul-2015 03:54:18] NOTICE: [pool www] child 1747 started
[08-Jul-2015 03:54:18] ERROR: An another FPM instance seems to already listen on /var/run/php5-fpm.sock
[08-Jul-2015 03:54:18] ERROR: FPM initialization failed
[08-Jul-2015 03:54:18] ERROR: An another FPM instance seems to already listen on /var/run/php5-fpm.sock
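The init script's start-stop-daemon sends SIGTERM to the FPM process, as kill would, telling it to exit. Upstart notices that the process has exited and, to keep the service up, starts a new one, which start-stop-daemon then kills with SIGKILL. Yet child processes survive... The original pid file and the Unix domain socket file are also left behind, which is why a later start finds the sock file still present and fails.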
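The tug-of-war is easier to see when the child exits in the log are counted per signal. A small sketch with standard tools; signal_summary is a name invented here, and it assumes the "exited on signal N (NAME)" wording shown in the log excerpt.

```shell
#!/bin/sh
# signal_summary FILE
# Count the "child ... exited on signal N (NAME)" lines in a php-fpm log
# and print "<NAME> <count>" per signal, sorted by signal name.
signal_summary() {
    sed -n 's/.*exited on signal [0-9][0-9]* (\([A-Z0-9]*\)).*/\1/p' "$1" |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

Run against the excerpt above, it reports SIGKILL 2 and SIGTERM 2: one round of polite termination, then a round of forced kills five seconds later.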
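Back to the earlier question. I went and rechecked the configuration. On Ubuntu 12.04, /lib/lsb/init-functions does not define init_is_upstart at all. More importantly, there is no /etc/init/php5-fpm.conf in /etc/init/, which means that on Ubuntu 12.04 php5-fpm is managed by System V init... Having learned the truth, I could almost cry...
So the three lines I commented out in /etc/init.d/php5-fpm obviously have to be restored. To start, stop or restart PHP5-FPM (and not only PHP5-FPM), just use the service wrapper the system provides.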
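Summary: new skills picked up this time
Use service to manage services. Let service decide, based on the current operating system, whether to use upstart, sysvinit or systemd, rather than relying on our own assumptions.
On Ubuntu 14.04, service XXXX status and service --status-all | grep XXXX obtain the state of XXXX in different ways: the former via upstart, the latter via sysvinit.
On Ubuntu 14.04, using the sysvinit way (/etc/init.d/php5-fpm stop) to stop FPM processes started by upstart (start php5-fpm or service php5-fpm start) fails to kill all of them. SIGTERM goes straight to the FPM master; upstart sees that the process with the original PID is gone and immediately creates a new one; then the sysvinit side (the start-stop-daemon process) checks a second time and kills again. The master of the newly started FPM children has been killed, so with their parent gone they are adopted by init, PID 1... When you see something like this, most often the master process was terminated unexpectedly.
Some programs print a hint in this situation, for example "Stopping or restarting the networking job is not supported.", which at least tells us that this way is not supported. FPM should echo a line telling the user that managing the service this way is not supported; that would make life easier for everyone.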
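What such a hint could look like, as a sketch only: this is not what the packaged init script does, the wording of the message is invented here, and refuse_under_upstart is a hypothetical helper that takes the upstart check as a command so that it can be tried without touching a real system.

```shell
#!/bin/sh
# refuse_under_upstart NAME CHECK...
# Runs CHECK (for example init_is_upstart). If it succeeds, explain on
# stderr why the init script is stepping aside and return 1; otherwise
# return 0 so that the script can carry on.
refuse_under_upstart() {
    name=$1
    shift
    if "$@"; then
        echo "$name is managed by upstart on this system; use 'service $name <action>' instead." >&2
        return 1
    fi
    return 0
}
```

Inside an init script the guard would then read refuse_under_upstart "$NAME" init_is_upstart || exit 1, which keeps the exit status of 1 but no longer leaves the user staring at an empty terminal.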