线上一个短信推送接口存在慢速响应继而雪崩效应导致整体服务异常,先看些统计数据:
接口请求量统计:每天18点有批量推送, 且规律
服务器负载,显著毛刺现象:
DB连接数满:
分析:
- 每日18点开始批量推送,负载飙升;到20点负载爆表,出现大量慢响应
- RDS连接数为什么满?存在慢查SQL导致历史任务不断堆积,在临界条件下并发数打满RDS的连接数。
- 大量慢请求,导致负载急剧升高,CPU爆了!满请求的原因:
Token::mobileFilter()
中读库时读取16万条黑名单手机号,array_diff()函数效率不高。 - 慢请求和RDS连接数满,之间谁是因,谁是果?慢请求是因,导致任务堆积,连接数累积,打满。
影响范围:
此接口耗时在1s以上,随时间积累最终20点后服务器CPU占有率100%
改进方法:
- 修改慢查SQL,只取当前手机号是否对应黑名单,读库记录数从16万—> 请求手机号数N,读库耗时从1.3秒下降为60毫秒
- 修改慢查SQL后,根据查询SQL可以给
blacklist_mobile
表增加mobile字段索引 - array_diff()在小数量下并无性能问题
反思:
- 排障步骤:框架日志—> Nginx日志慢请求 —-> PHP slow log —> ECS/RDS负载,不同问题步骤不一,根据现象分析原因定位问题
- 线上服务缺少接口耗时、框架ERROR日志监控
- 其实黑名单表设计之初为解决用户投诉后屏蔽短信使用,但随着后续扩展了用户恢复TD后屏蔽营销短信的功能,导致该表记录增长迅速,慢SQL随着数据量增长情况恶化
优化后的效果:
- 上线后,短信推送量不变,接口性能从1.5s->400ms
- 机器负载的毛刺现象消失,负载稳定在预期范围
- 大量并发短信推送阶段,服务器负载稳定,送达率、成功率各项数据稳定
优化前后机器负载情况:(6月7日、8日的负载在高峰推送期间保持在较低水平-下图红色框中部分,无毛刺现象)
相关代码:
从04-Jun-2017 19:51:40开始,大量Token::mobileFilter()
函数慢请求,代码如下:1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20public static function mobileFilter($mobile, $token){
$res = \DB::table('xxxxxx')->select('blacklist_status')->where('token', $token)->first();
// 若该token的black_status值为0,则不需要经过电话黑名单校验
if (empty($res['blacklist_status'])) {
return $mobile;
}
try {
// 原始代码:
// $blacklist = \DB::table('blacklist_mobile')->where('status', self::BLACK_STATUS)->lists('mobile')
// 该表有16万条黑名单记录!
// 优化后代码:
$blacklist = \DB::table('blacklist_mobile')->where('status', self::BLACK_STATUS)->whereIn('mobile', $mobile)->lists('mobile');
} catch (\RuntimeException $e) {
Log::error("DB table blacklist_mobile error" . $e->getMessage());
return $mobile;
}
$mobile_send = array_diff($mobile, $blacklist); // 16万条记录的数组执行array_diff()
return $mobile_send;
}
服务器Nginx请求日志,分钟级请求量:1
2
3
4
5
6
7
8
9
10
11$ cat nginx_log/message.2017-06-04-access.log | grep -v HEAD | grep message | grep "2017:20:00" | awk -F[][] '{print $6}' | sort | uniq -c | sort -nr | head
162 04/Jun/2017:20:00:55 +0800 // 此时累计请求量已很大,导致负载飙高
161 04/Jun/2017:20:00:56 +0800
81 04/Jun/2017:20:00:54 +0800
64 04/Jun/2017:20:00:57 +0800
49 04/Jun/2017:20:00:48 +0800
42 04/Jun/2017:20:00:58 +0800
20 04/Jun/2017:20:00:53 +0800
18 04/Jun/2017:20:00:59 +0800
18 04/Jun/2017:20:00:49 +0800
12 04/Jun/2017:20:00:52 +0800
Nginx慢请求:1
2
3
4remote_addr=[] http_x_forward=[] time=[04/Jun/2017:20:00:49 +0800] request=[POST /v1/message HTTP/1.0] status=[200] byte=[248] elapsed=[62.014] refer=[] ua=[Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.63 Safari/535.7] cookie=[-]
remote_addr=[] http_x_forward=[] time=[04/Jun/2017:20:00:49 +0800] request=[POST /v1/message HTTP/1.0] status=[200] byte=[248] elapsed=[59.037] refer=[] ua=[Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.63 Safari/535.7] cookie=[-]
remote_addr=[] http_x_forward=[] time=[04/Jun/2017:20:00:49 +0800] request=[POST /v1/message HTTP/1.0] status=[200] byte=[248] elapsed=[57.245] refer=[] ua=[Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.63 Safari/535.7] cookie=[-]
remote_addr=[] http_x_forward=[] time=[04/Jun/2017:20:00:49 +0800] request=[POST /v1/message HTTP/1.0] status=[200] byte=[248] elapsed=[66.404] refer=[] ua=[Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.63 Safari/535.7] cookie=[-]
php慢日志:
- Token::mobileFilter()函数慢,线上request_slowlog_timeout = 5
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45[04-Jun-2017 19:51:29] [pool www] pid 106028
script_filename = message/public/index.php
[0x00007f5ac3b96888] array_diff() message/app/Models/Token.php:211
[0x00007f5ac3b96730] mobileFilter() message/app/Http/Controllers/SmsController.php:195
[0x00007f5ac3b965b8] mobileFilter() message/app/Http/Controllers/SmsController.php:163
[0x00007f5ac3b96450] checkParams() message/app/Http/Controllers/SmsController.php:106
[0x00007fff14beafc0] index() unknown:0
[0x00007f5ac3b96300] call_user_func_array() message/vendor/laravel/framework/src/Illuminate/Routing/Controller.php:246
[0x00007f5ac3b961c8] callAction() message/vendor/laravel/framework/src/Illuminate/Routing/ControllerDispatcher.php:162
[0x00007f5ac3b96088] call() message/vendor/laravel/framework/src/Illuminate/Routing/ControllerDispatcher.php:107
[0x00007fff14beb640] Illuminate\Routing\{closure}() unknown:0
[0x00007f5ac3b95f60] call_user_func() message/vendor/laravel/framework/src/Illuminate/Pipeline/Pipeline.php:141
[0x00007fff14beb9e0] Illuminate\Pipeline\{closure}() unknown:0
[0x00007f5ac3b95de8] call_user_func() message/vendor/laravel/framework/src/Illuminate/Pipeline/Pipeline.php:101
[0x00007f5ac3b95ca8] then() message/vendor/laravel/framework/src/Illuminate/Routing/ControllerDispatcher.php:108
[0x00007f5ac3b95b40] callWithinStack() message/vendor/laravel/framework/src/Illuminate/Routing/ControllerDispatcher.php:67
[0x00007f5ac3b959e8] dispatch() message/vendor/laravel/framework/src/Illuminate/Routing/Route.php:204
[0x00007f5ac3b958b8] runWithCustomDispatcher() message/vendor/laravel/framework/src/Illuminate/Routing/Route.php:134
[0x00007f5ac3b95760] run() message/vendor/laravel/framework/src/Illuminate/Routing/Router.php:701
[0x00007fff14bec4b0] Illuminate\Routing\{closure}() unknown:0
[0x00007f5ac3b95638] call_user_func() message/vendor/laravel/framework/src/Illuminate/Pipeline/Pipeline.php:141
[0x00007fff14bec850] Illuminate\Pipeline\{closure}() unknown:0
[04-Jun-2017 19:51:52] [pool www] pid 106029
script_filename = message/public/index.php
[0x00007f5ac3b96f08] explode() message/vendor/laravel/framework/src/Illuminate/Support/helpers.php:404
[0x00007f5ac3b96d80] data_get() message/vendor/laravel/framework/src/Illuminate/Support/Arr.php:303
[0x00007f5ac3b96c30] pluck() message/vendor/laravel/framework/src/Illuminate/Support/helpers.php:257
[0x00007f5ac3b96b30] array_pluck() message/vendor/laravel/framework/src/Illuminate/Support/Collection.php:368
[0x00007f5ac3b969f8] lists() message/vendor/laravel/framework/src/Illuminate/Database/Query/Builder.php:1513
[0x00007f5ac3b96888] lists() message/app/Models/Token.php:206
[0x00007f5ac3b96730] mobileFilter() message/app/Http/Controllers/SmsController.php:195
[0x00007f5ac3b965b8] mobileFilter() message/app/Http/Controllers/SmsController.php:163
[0x00007f5ac3b96450] checkParams() message/app/Http/Controllers/SmsController.php:106
[0x00007fff14beafc0] index() unknown:0
[0x00007f5ac3b96300] call_user_func_array() message/vendor/laravel/framework/src/Illuminate/Routing/Controller.php:246
[0x00007f5ac3b961c8] callAction() message/vendor/laravel/framework/src/Illuminate/Routing/ControllerDispatcher.php:162
[0x00007f5ac3b96088] call() message/vendor/laravel/framework/src/Illuminate/Routing/ControllerDispatcher.php:107
[0x00007fff14beb640] Illuminate\Routing\{closure}() unknown:0
[0x00007f5ac3b95f60] call_user_func() message/vendor/laravel/framework/src/Illuminate/Pipeline/Pipeline.php:141
[0x00007fff14beb9e0] Illuminate\Pipeline\{closure}() unknown:0
[0x00007f5ac3b95de8] call_user_func() message/vendor/laravel/framework/src/Illuminate/Pipeline/Pipeline.php:101
[0x00007f5ac3b95ca8] then() message/vendor/laravel/framework/src/Illuminate/Routing/ControllerDispatcher.php:108
[0x00007f5ac3b95b40] callWithinStack() message/vendor/laravel/framework/src/Illuminate/Routing/ControllerDispatcher.php:67
[0x00007f5ac3b959e8] dispatch() message/vendor/laravel/framework/src/Illuminate/Routing/Route.php:204
- DB慢
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16$ cat php/log/www.log.slow | grep "20:00:58" -A2| grep execute
[0x00007f5ac3b970e0] execute() message/vendor/laravel/framework/src/Illuminate/Database/Connection.php:294
[0x00007f5ac3b970e0] execute() message/vendor/laravel/framework/src/Illuminate/Database/Connection.php:294
[0x00007f5ac3b970e0] execute() message/vendor/laravel/framework/src/Illuminate/Database/Connection.php:294
[0x00007f5ac3b970e0] execute() message/vendor/laravel/framework/src/Illuminate/Database/Connection.php:294
[0x00007f5ac3b970e0] execute() message/vendor/laravel/framework/src/Illuminate/Database/Connection.php:294
[0x00007f5ac3b970e0] execute() message/vendor/laravel/framework/src/Illuminate/Database/Connection.php:294
[0x00007f5ac3b970e0] execute() message/vendor/laravel/framework/src/Illuminate/Database/Connection.php:294
[0x00007f5ac3b970e0] execute() message/vendor/laravel/framework/src/Illuminate/Database/Connection.php:294
[0x00007f5ac3b970e0] execute() message/vendor/laravel/framework/src/Illuminate/Database/Connection.php:294
[0x00007f5ac3b970e0] execute() message/vendor/laravel/framework/src/Illuminate/Database/Connection.php:294
[0x00007f5ac3b970e0] execute() message/vendor/laravel/framework/src/Illuminate/Database/Connection.php:294
[0x00007f5ac3b970e0] execute() message/vendor/laravel/framework/src/Illuminate/Database/Connection.php:294
[0x00007f5ac3b970e0] execute() message/vendor/laravel/framework/src/Illuminate/Database/Connection.php:294
[0x00007f5ac3b970e0] execute() message/vendor/laravel/framework/src/Illuminate/Database/Connection.php:294
[0x00007f5ac3b970e0] execute() message/vendor/laravel/framework/src/Illuminate/Database/Connection.php:294
框架日志:1
2
3
4
5
6
7
8-rbash-4.2$ cat message/storage/logs/laravel-2017-06-04.log | grep ERROR | head
[2017-06-04 19:53:37] production.ERROR: App\Exceptions\RrcPDOException: SQLSTATE[HY000] [1040] Too many connections in message/vendor/laravel/framework/src/Illuminate/Database/Connectors/Connector.php:47
[2017-06-04 19:53:37] production.ERROR: App\Exceptions\RrcPDOException: SQLSTATE[HY000] [1040] Too many connections in message/vendor/laravel/framework/src/Illuminate/Database/Connectors/Connector.php:47
[2017-06-04 19:53:37] production.ERROR: App\Exceptions\RrcPDOException: SQLSTATE[HY000] [1040] Too many connections in message/vendor/laravel/framework/src/Illuminate/Database/Connectors/Connector.php:47
-rbash-4.2$
-rbash-4.2$ cat message/storage/logs/laravel-2017-06-04.log | grep ERROR | tail
[2017-06-04 20:01:02] production.ERROR: App\Exceptions\RrcPDOException: SQLSTATE[HY000] [1040] Too many connections in message/vendor/laravel/framework/src/Illuminate/Database/Connectors/Connector.php:47
[2017-06-04 20:01:02] production.ERROR: App\Exceptions\RrcPDOException: SQLSTATE[HY000] [1040] Too many connections in message/vendor/laravel/framework/src/Illuminate/Database/Connectors/Connector.php:47
总结
- 黑名单表设计之初并未考虑大数据量,而随业务发展表规模不断膨胀,而代码逻辑未相应调整优化
- 第一版代码存在过于原始的处理方式,2行优化代码完成SQL优化
- 基础服务的监控应具有更高优先级,对系统资源的优化收益较大
- 及早发现问题防患于未然,时刻对线上服务保持敬畏之心
- 一个合格的工程师应严格要求自己,精益求精