Background

A while ago I ran into a problem at a customer site. When I arrived at the project site in the morning, technical support reported that the database could not be queried and all 32 CPU cores on the database server were pegged at 100%.

To restore service as quickly as possible, and with the business side's approval, we simply restarted PostgreSQL. After the restart, CPU usage returned to normal and did not start climbing again.

Problem Diagnosis

The pg_stat_statements extension was already installed on site. Looking at pg_stat_statements directly, I found one SQL statement that had been running continuously from the early hours of the morning until 9 a.m., more than 16,800 executions in total, averaging a hefty 4.5 s each. It turned out to be a newly deployed scheduled job, and after the restart this SQL was no longer seen running. I pulled the statement out of the logs: it was more than 150 lines long, full of left joins and subqueries, and untangling the logic of a query like that by hand would have been a huge amount of work. So I went straight to EXPLAIN (ANALYZE, BUFFERS) to look at the execution plan. As you would expect, the plan was also very long, but reading through it revealed the bottleneck:
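For reference, the heaviest statements can be pulled out of pg_stat_statements with a query along these lines. This is only a minimal sketch: the column names are total_time / mean_time before PostgreSQL 13 and total_exec_time / mean_exec_time from 13 onward.

-- Top statements by cumulative execution time (PostgreSQL 13+ column names).
SELECT calls,
       round(total_exec_time::numeric, 2) AS total_ms,
       round(mean_exec_time::numeric, 2)  AS mean_ms,
       left(query, 100)                   AS query_snippet
FROM   pg_stat_statements
ORDER  BY total_exec_time DESC
LIMIT  10;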

->  Nested Loop Left Join  (cost=1000.14..628415.70 rows=8047 width=196) (actual time=141.912..1794.075 rows=1968 loops=1)
                      Buffers: shared hit=360497
                      ->  Gather  (cost=1000.00..627036.71 rows=8047 width=75) (actual time=141.886..1792.694 rows=1968 loops=1)
                            Workers Planned: 6
                            Workers Launched: 6
                            Buffers: shared hit=358529
                            ->  Parallel Seq Scan on pt  (cost=0.00..625232.01 rows=1341 width=75) (actual time=106.850..1782.563 rows=281 loops=7)
                                  Filter: ((status = 1) AND (entity_id = ANY ('{xxx*n}'::bigint[])))
                                  Rows Removed by Filter: 2096053
                                  Buffers: shared hit=358529
                      ->  Index Scan using "idxxxxx ASC NULLS LAST" on plab  (cost=0.14..0.16 rows=1 width=300) (actual time=0.000..0.000 rows=0 loops=1968)
                            Index Cond: ((pt.property_id)::text = (property_id)::text)
                            Buffers: shared hit=1968

From this excerpt of the plan you can see that the 6-worker parallel sequential scan of the pt table took about 1.8 s (roughly 1,792 ms of actual time on the Gather node).
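As a side check, the size of pt and how often it is being sequentially scanned can be confirmed from the statistics views. A minimal sketch, using the anonymized table name pt from the plan above:

-- How many live rows does pt hold, and how often is it scanned sequentially vs. by index?
SELECT relname, n_live_tup, seq_scan, idx_scan
FROM   pg_stat_user_tables
WHERE  relname = 'pt';

-- Physical size of the table's main fork (excludes indexes and TOAST).
SELECT pg_size_pretty(pg_relation_size('pt'));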

Other fragments of the execution plan that touch the pt table:

->  Materialize  (cost=1000.00..632096.42 rows=6 width=34) (actual time=5.555..8.164 rows=27 loops=81)
                Buffers: shared hit=358526
                ->  Gather  (cost=1000.00..632096.39 rows=6 width=34) (actual time=449.888..661.183 rows=27 loops=1)
                      Workers Planned: 6
                      Workers Launched: 6
                      Buffers: shared hit=358526
                      ->  Parallel Seq Scan on pt t3  (cost=0.00..631095.79 rows=1 width=34) (actual time=464.883..649.423 rows=4 loops=7)
                            Filter: (((property_type)::text = 'xx2'::text) AND (status = 1) AND (entity_id = ANY ('{xxx*n}'::bigint[])))
                            Rows Removed by Filter: 2096331
                            Buffers: shared hit=358526
->  Materialize  (cost=1000.00..632096.52 rows=7 width=34) (actual time=5.551..8.138 rows=28 loops=81)
                Buffers: shared hit=358526
                ->  Gather  (cost=1000.00..632096.49 rows=7 width=34) (actual time=449.649..659.117 rows=28 loops=1)
                      Workers Planned: 6
                      Workers Launched: 6
                      Buffers: shared hit=358526
                      ->  Parallel Seq Scan on pt t3_1  (cost=0.00..631095.79 rows=1 width=34) (actual time=432.199..643.978 rows=4 loops=7)
                            Filter: (((property_type)::text = 'xx1'::text) AND (status = 1) AND (entity_id = ANY ('{xxx*n}'::bigint[])))
                            Rows Removed by Filter: 2096330
                            Buffers: shared hit=358526

These fragments make it clear that scanning pt with a parallel sequential scan performs poorly.

Try creating an index?

create index idx_xxx on pt (entity_id,status);
-- the Nested Loop that previously took ~1.8 s becomes a Hash Left Join that finishes in about 1.5 ms
             ->  Hash Left Join  (cost=13.26..6795.78 rows=8391 width=196) (actual time=0.027..1.547 rows=1968 loops=1)
                      Hash Cond: ((pt.property_id)::text = (plab.property_id)::text)
                      Buffers: shared hit=635
                      ->  Index Scan using idx_xxx on pt  (cost=0.56..6751.44 rows=8391 width=75) (actual time=0.020..1.100 rows=1968 loops=1)
                            Index Cond: ((entity_id = ANY ('{xxx*n}'::bigint[])) AND (status = 1))
                            Buffers: shared hit=635
                      ->  Hash  (cost=11.20..11.20 rows=120 width=300) (actual time=0.003..0.003 rows=0 loops=1)
                            Buckets: 1024  Batches: 1  Memory Usage: 8kB
                            ->  Seq Scan on plab  (cost=0.00..11.20 rows=120 width=300) (actual time=0.002..0.002 rows=0 loops=1)

With the index in place, the overall SQL went from 4.5 s down to 44 ms.

Summary of the troubleshooting:

  1. If the root cause cannot be identified quickly, ask for a restart first so that service can be restored as soon as possible; after the restart some scheduled jobs will be disconnected and will not reconnect, which speeds up recovery.

  2. If CPU usage keeps climbing rapidly even after the restart, call pg_stat_statements_reset() to clear the pg_stat_statements view, then monitor for a while to see which SQL statement is burning the time (a short sketch follows this list).

  3. 100% CPU is usually caused by slow SQL, where "slow SQL" covers full table scans, scanning too much data, in-memory sorts, on-disk sorts, lock contention, and so on. In this incident, follow-up analysis showed that the full table scans pushed the disks to their limit, leaving the CPUs constantly waiting and driving all 32 CPU threads to 100%.

  4. To locate the bottleneck in a plan, the keywords to watch are "actual" and "Filter": the former pinpoints where the time is spent, the latter shows which columns are being filtered without the help of an index.

  5. If everything had been running fine and the problem suddenly appears one day, it may well be newly deployed business SQL. Ask the business team exactly what changed; for a production system in particular, changes must be tested before going live.

  6. If the CPU bottleneck shows up periodically, a scheduled job is the most likely culprit. To pin it down, compare across periods: the same time each day, the same time each week, the same time each month, and so on.

  7. For optimizing a long SQL statement, go straight to the execution plan, find the bottleneck, and act on it: create indexes, check whether existing indexes are reasonable (is the column order of a multi-column index correct? are there duplicate indexes? a sketch for listing a table's indexes follows this list), and check whether the configuration parameters fit the current workload. The root cause of this incident was simply that the new table had no suitable index.
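A minimal sketch of the monitoring step from item 2, assuming pg_stat_statements is installed:

-- Clear the accumulated statistics, let the workload run for a while,
-- then re-run the pg_stat_statements query from the diagnosis section above.
SELECT pg_stat_statements_reset();

And for the index sanity check in item 7, the existing index definitions on a table can be listed like this (sketch only, using the anonymized table name pt; duplicates and column order still have to be judged against the actual queries):

-- All index definitions on pt: look for duplicated leading columns and
-- multi-column indexes whose column order does not match the query predicates.
SELECT indexname, indexdef
FROM   pg_indexes
WHERE  tablename = 'pt';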

Source: https://www.cnblogs.com/duanleiblog/p/14192944.html