Background

A while ago I ran into a problem at a customer site. When I arrived at the project site in the morning, technical support reported that the database could not be queried and all 32 CPU cores on the database server were pegged at 100%.

To restore service as quickly as possible, and with the business side's approval, we simply restarted PostgreSQL. After the restart, CPU usage returned to normal and did not start climbing again.

Problem Diagnosis

The pg_stat_statements extension was already installed on site. Looking at pg_stat_statements directly, I found one SQL statement that had been running continuously from the early hours of the morning until 9 a.m., more than 16,800 executions in total, averaging a hefty 4.5 s each. It turned out to be a newly deployed scheduled job, and after the restart this SQL was no longer seen running. I pulled the statement out of the logs: it was more than 150 lines long, full of left joins and subqueries, and untangling the logic of a query like that by hand would have been a huge amount of work. So I went straight to EXPLAIN (ANALYZE, BUFFERS) to look at the execution plan. As you would expect, the plan was also very long, but reading through it revealed the bottleneck:
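For reference, the heaviest statements can be pulled out of pg_stat_statements with a query along these lines. This is only a minimal sketch: the column names are total_time / mean_time before PostgreSQL 13 and total_exec_time / mean_exec_time from 13 onward.

-- Top statements by cumulative execution time (PostgreSQL 13+ column names).
SELECT calls,
       round(total_exec_time::numeric, 2) AS total_ms,
       round(mean_exec_time::numeric, 2)  AS mean_ms,
       left(query, 100)                   AS query_snippet
FROM   pg_stat_statements
ORDER  BY total_exec_time DESC
LIMIT  10;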

->  Nested Loop Left Join  (cost=1000.14..628415.70 rows=8047 width=196) (actual time=141.912..1794.075 rows=1968 loops=1)
                      Buffers: shared hit=360497
                      ->  Gather  (cost=1000.00..627036.71 rows=8047 width=75) (actual time=141.886..1792.694 rows=1968 loops=1)
                            Workers Planned: 6
                            Workers Launched: 6
                            Buffers: shared hit=358529
                            ->  Parallel Seq Scan on pt  (cost=0.00..625232.01 rows=1341 width=75) (actual time=106.850..1782.563 rows=281 loops=7)
                                  Filter: ((status = 1) AND (entity_id = ANY ('{xxx*n}'::bigint[])))
                                  Rows Removed by Filter: 2096053
                                  Buffers: shared hit=358529
                      ->  Index Scan using "idxxxxx ASC NULLS LAST" on plab  (cost=0.14..0.16 rows=1 width=300) (actual time=0.000..0.000 rows=0 loops=1968)
                            Index Cond: ((pt.property_id)::text = (property_id)::text)
                            Buffers: shared hit=1968

From this excerpt of the plan you can see that the 6-worker parallel sequential scan of the pt table took about 1.8 s (roughly 1,792 ms of actual time on the Gather node).
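As a side check, the size of pt and how often it is being sequentially scanned can be confirmed from the statistics views. A minimal sketch, using the anonymized table name pt from the plan above:

-- How many live rows does pt hold, and how often is it scanned sequentially vs. by index?
SELECT relname, n_live_tup, seq_scan, idx_scan
FROM   pg_stat_user_tables
WHERE  relname = 'pt';

-- Physical size of the table's main fork (excludes indexes and TOAST).
SELECT pg_size_pretty(pg_relation_size('pt'));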

Other fragments of the execution plan that touch the pt table:

->  Materialize  (cost=1000.00..632096.42 rows=6 width=34) (actual time=5.555..8.164 rows=27 loops=81)
                Buffers: shared hit=358526
                ->  Gather  (cost=1000.00..632096.39 rows=6 width=34) (actual time=449.888..661.183 rows=27 loops=1)
                      Workers Planned: 6
                      Workers Launched: 6
                      Buffers: shared hit=358526
                      ->  Parallel Seq Scan on pt t3  (cost=0.00..631095.79 rows=1 width=34) (actual time=464.883..649.423 rows=4 loops=7)
                            Filter: (((property_type)::text = 'xx2'::text) AND (status = 1) AND (entity_id = ANY ('{xxx*n}'::bigint[])))
                            Rows Removed by Filter: 2096331
                            Buffers: shared hit=358526
->  Materialize  (cost=1000.00..632096.52 rows=7 width=34) (actual time=5.551..8.138 rows=28 loops=81)
                Buffers: shared hit=358526
                ->  Gather  (cost=1000.00..632096.49 rows=7 width=34) (actual time=449.649..659.117 rows=28 loops=1)
                      Workers Planned: 6
                      Workers Launched: 6
                      Buffers: shared hit=358526
                      ->  Parallel Seq Scan on pt t3_1  (cost=0.00..631095.79 rows=1 width=34) (actual time=432.199..643.978 rows=4 loops=7)
                            Filter: (((property_type)::text = 'xx1'::text) AND (status = 1) AND (entity_id = ANY ('{xxx*n}'::bigint[])))
                            Rows Removed by Filter: 2096330
                            Buffers: shared hit=358526

These fragments make it clear that scanning pt with a parallel sequential scan performs poorly.

Try creating an index?

create index idx_xxx on pt (entity_id,status);
-- the Nested Loop that previously took ~1.8 s becomes a Hash Left Join that finishes in about 1.5 ms
             ->  Hash Left Join  (cost=13.26..6795.78 rows=8391 width=196) (actual time=0.027..1.547 rows=1968 loops=1)
                      Hash Cond: ((pt.property_id)::text = (plab.property_id)::text)
                      Buffers: shared hit=635
                      ->  Index Scan using idx_xxx on pt  (cost=0.56..6751.44 rows=8391 width=75) (actual time=0.020..1.100 rows=1968 loops=1)
                            Index Cond: ((entity_id = ANY ('{xxx*n}'::bigint[])) AND (status = 1))
                            Buffers: shared hit=635
                      ->  Hash  (cost=11.20..11.20 rows=120 width=300) (actual time=0.003..0.003 rows=0 loops=1)
                            Buckets: 1024  Batches: 1  Memory Usage: 8kB
                            ->  Seq Scan on plab  (cost=0.00..11.20 rows=120 width=300) (actual time=0.002..0.002 rows=0 loops=1)

With the index in place, the overall SQL went from 4.5 s down to 44 ms.

Summary of the troubleshooting:

  1. If the root cause cannot be identified quickly, ask for a restart first so that service can be restored as soon as possible; after the restart some scheduled jobs will be disconnected and will not reconnect, which speeds up recovery.

  2. If CPU usage keeps climbing rapidly even after the restart, call pg_stat_statements_reset() to clear the pg_stat_statements view, then monitor for a while to see which SQL statement is burning the time (a short sketch follows this list).

  3. 100% CPU is usually caused by slow SQL, where "slow SQL" covers full table scans, scanning too much data, in-memory sorts, on-disk sorts, lock contention, and so on. In this incident, follow-up analysis showed that the full table scans pushed the disks to their limit, leaving the CPUs constantly waiting and driving all 32 CPU threads to 100%.

  4. To locate the bottleneck in a plan, the keywords to watch are "actual" and "Filter": the former pinpoints where the time is spent, the latter shows which columns are being filtered without the help of an index.

  5. If everything had been running fine and the problem suddenly appears one day, it may well be newly deployed business SQL. Ask the business team exactly what changed; for a production system in particular, changes must be tested before going live.

  6. If the CPU bottleneck shows up periodically, a scheduled job is the most likely culprit. To pin it down, compare across periods: the same time each day, the same time each week, the same time each month, and so on.

  7. For optimizing a long SQL statement, go straight to the execution plan, find the bottleneck, and act on it: create indexes, check whether existing indexes are reasonable (is the column order of a multi-column index correct? are there duplicate indexes? a sketch for listing a table's indexes follows this list), and check whether the configuration parameters fit the current workload. The root cause of this incident was simply that the new table had no suitable index.
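A minimal sketch of the monitoring step from item 2, assuming pg_stat_statements is installed:

-- Clear the accumulated statistics, let the workload run for a while,
-- then re-run the pg_stat_statements query from the diagnosis section above.
SELECT pg_stat_statements_reset();

And for the index sanity check in item 7, the existing index definitions on a table can be listed like this (sketch only, using the anonymized table name pt; duplicates and column order still have to be judged against the actual queries):

-- All index definitions on pt: look for duplicated leading columns and
-- multi-column indexes whose column order does not match the query predicates.
SELECT indexname, indexdef
FROM   pg_indexes
WHERE  tablename = 'pt';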

Source: https://www.cnblogs.com/duanleiblog/p/14192944.html