今天在查找日志信息的时候,发现了一条有趣的 SQL 写法,随即记录下来。
在 SQL 中EXISTS (SELECT 1 ...)
是一种常见的用法,用于检查子查询是否返回任何行。这里的 SELECT 1是一个习惯写法。
子查询 SELECT 1 不关心具体返回的值,只关心是否有行被返回,而并不关心具体返回什么数据。如果子查询返回至少一行,EXISTS 结果为 TRUE;否则为 FALSE。
数据准备
-- 客户表
CREATE TABLE customers (
customer_id INT PRIMARY KEY,
name VARCHAR(50) NOT NULL
);
-- 订单表
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT,
amount DECIMAL(10,2)
);
-- 插入客户数据
INSERT INTO customers (customer_id, name) VALUES
(1, '张三'),
(2, '李四'),
(3, '王五'),
(4, '赵六');
-- 插入订单数据
INSERT INTO orders (order_id, customer_id, amount) VALUES
(101, 1, 100.00),
(102, 1, 200.00),
(103, 3, 150.00),
(104, 4, 300.00);
select * from customers c where exists (
select 1 from orders o where o.customer_id=c.customer_id
);
customer_id | name
-------------+------
1 | 张三
3 | 王五
4 | 赵六
可以看到,用户 id 为 2 的李四在订单表里没有数据,所以没有显示。
这种写法跟 in 有什么区别呢?
我们来分别看下执行计划就知道了。
explain analyze select * from customers c where exists (
select 1 from orders o where o.customer_id=c.customer_id
);
Hash Join (cost=34.12..53.95 rows=270 width=122) (actual time=0.026..0.028 rows=3 loops=1)
Hash Cond: (c.customer_id = o.customer_id)
-> Seq Scan on customers c (cost=0.00..15.40 rows=540 width=122) (actual time=0.009..0.010 rows=4 loops=1)
-> Hash (cost=31.63..31.63 rows=200 width=4) (actual time=0.013..0.013 rows=3 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> HashAggregate (cost=29.63..31.63 rows=200 width=4) (actual time=0.011..0.012 rows=3 loops=1)
Group Key: o.customer_id
Batches: 1 Memory Usage: 40kB
-> Seq Scan on orders o (cost=0.00..25.70 rows=1570 width=4) (actual time=0.003..0.004 rows=4 loops=1)
Planning Time: 0.110 ms
Execution Time: 0.054 ms
explain analyze select * from customers c where customer_id in (
select customer_id from orders o where o.customer_id=c.customer_id
);
Seq Scan on customers c (cost=0.00..8020.90 rows=270 width=122) (actual time=0.016..0.022 rows=3 loops=1)
Filter: (SubPlan 1)
Rows Removed by Filter: 1
SubPlan 1
-> Seq Scan on orders o (cost=0.00..29.62 rows=8 width=4) (actual time=0.002..0.002 rows=1 loops=4)
Filter: (customer_id = c.customer_id)
Rows Removed by Filter: 2
Planning Time: 0.092 ms
Execution Time: 0.037 ms
exists 的执行计划呢,是先对orders 进行全表扫描,然后按customer_id 进行去重后构建哈希表。然后全表扫描customers,再通过哈希匹配找到符合条件的行数。orders 只需要扫描 4 次。
in 的执行计划呢,则是先全表扫描orders,使用过滤条件customer_id = c.customer_id,然后全表扫描customers,每行都会触发子查询,也就是orders要扫描 4*4 行。
因为数据量有点小,两者执行时间并不明显,接下来我们插入一万条数据再来看看差别:
-- 清空现有数据(可选)
TRUNCATE TABLE orders CASCADE;
TRUNCATE TABLE customers CASCADE;
-- 插入 10,000 个客户
INSERT INTO customers (customer_id, name)
SELECT
id,
'客户_' || id || '_' ||
CASE WHEN random() < 0.3 THEN '集团'
WHEN random() < 0.6 THEN '有限公司'
ELSE '个人'
END
FROM generate_series(1, 10000) AS id;
-- 插入 50,000 个订单(平均每个客户5单)
INSERT INTO orders (order_id, customer_id, amount)
SELECT
id,
-- 80%的订单分配给活跃客户(前2000名),20%分配给其他客户
CASE WHEN id <= 40000
THEN floor(random() * 2000) + 1
ELSE floor(random() * 8000) + 2001
END,
-- 订单金额:50%在100-500之间,30%在500-2000,20%大额订单
CASE
WHEN random() < 0.5 THEN (random() * 400 + 100)::numeric(10,2)
WHEN random() < 0.8 THEN (random() * 1500 + 500)::numeric(10,2)
ELSE (random() * 10000 + 2000)::numeric(10,2)
END
FROM generate_series(1, 10000) AS id;
analyze customers;
analyze orders;
我们直接贴耗时时长
exists 只用了Execution Time: 3.655 ms
而 in 则用了Execution Time: 2950.236 ms
差距对比这就很明显了吧。
总结
select 1 不涉及实际列的计算或扫描,只需检查是否存在符合条件的行。
执行效率对 in 查询更加高效,看执行计划可知。
一般用于子查询关联查询,如: select * from customers c where exists (select 1 from orders o where o.customer_id=c.customer_id);
评论