今天在查找日志信息的时候,发现了一条有趣的 SQL 写法,随即记录下来。

在 SQL 中EXISTS (SELECT 1 ...) 是一种常见的用法,用于检查子查询是否返回任何行。这里的 SELECT 1是一个习惯写法。

子查询 SELECT 1 不关心具体返回的值,只关心是否有行被返回,而并不关心具体返回什么数据。如果子查询返回至少一行,EXISTS 结果为 TRUE;否则为 FALSE。

数据准备

-- 客户表
CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    name VARCHAR(50) NOT NULL
);

-- 订单表
CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_id INT,
    amount DECIMAL(10,2)
);

-- 插入客户数据
INSERT INTO customers (customer_id, name) VALUES
(1, '张三'),
(2, '李四'),
(3, '王五'),
(4, '赵六');

-- 插入订单数据
INSERT INTO orders (order_id, customer_id, amount) VALUES
(101, 1, 100.00),  
(102, 1, 200.00), 
(103, 3, 150.00),  
(104, 4, 300.00); 
select * from customers c where exists (
select 1 from orders o where o.customer_id=c.customer_id
);
 customer_id | name 
-------------+------
           1 | 张三
           3 | 王五
           4 | 赵六

可以看到,用户 id 为 2 的李四在订单表里没有数据,所以没有显示。

这种写法跟 in 有什么区别呢?

我们来分别看下执行计划就知道了。

explain analyze select * from customers c where exists (
select 1 from orders o where o.customer_id=c.customer_id
);
Hash Join  (cost=34.12..53.95 rows=270 width=122) (actual time=0.026..0.028 rows=3 loops=1)
  Hash Cond: (c.customer_id = o.customer_id)
  ->  Seq Scan on customers c  (cost=0.00..15.40 rows=540 width=122) (actual time=0.009..0.010 rows=4 loops=1)
  ->  Hash  (cost=31.63..31.63 rows=200 width=4) (actual time=0.013..0.013 rows=3 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 9kB
        ->  HashAggregate  (cost=29.63..31.63 rows=200 width=4) (actual time=0.011..0.012 rows=3 loops=1)
              Group Key: o.customer_id
              Batches: 1  Memory Usage: 40kB
              ->  Seq Scan on orders o  (cost=0.00..25.70 rows=1570 width=4) (actual time=0.003..0.004 rows=4 loops=1)
Planning Time: 0.110 ms
Execution Time: 0.054 ms

explain analyze select * from customers c where customer_id in (
select customer_id from orders o where o.customer_id=c.customer_id
);
Seq Scan on customers c  (cost=0.00..8020.90 rows=270 width=122) (actual time=0.016..0.022 rows=3 loops=1)
  Filter: (SubPlan 1)
  Rows Removed by Filter: 1
  SubPlan 1
    ->  Seq Scan on orders o  (cost=0.00..29.62 rows=8 width=4) (actual time=0.002..0.002 rows=1 loops=4)
          Filter: (customer_id = c.customer_id)
          Rows Removed by Filter: 2
Planning Time: 0.092 ms
Execution Time: 0.037 ms

exists 的执行计划呢,是先对orders 进行全表扫描,然后按customer_id 进行去重后构建哈希表。然后全表扫描customers,再通过哈希匹配找到符合条件的行数。orders 只需要扫描 4 次。

in 的执行计划呢,则是先全表扫描orders,使用过滤条件customer_id = c.customer_id,然后全表扫描customers,每行都会触发子查询,也就是orders要扫描 4*4 行。

因为数据量有点小,两者执行时间并不明显,接下来我们插入一万条数据再来看看差别:

-- 清空现有数据(可选)
TRUNCATE TABLE orders CASCADE;
TRUNCATE TABLE customers CASCADE;

-- 插入 10,000 个客户
INSERT INTO customers (customer_id, name)
SELECT 
    id, 
    '客户_' || id || '_' || 
    CASE WHEN random() < 0.3 THEN '集团'
         WHEN random() < 0.6 THEN '有限公司'
         ELSE '个人' 
    END
FROM generate_series(1, 10000) AS id;

-- 插入 50,000 个订单(平均每个客户5单)
INSERT INTO orders (order_id, customer_id, amount)
SELECT
    id,
    -- 80%的订单分配给活跃客户(前2000名),20%分配给其他客户
    CASE WHEN id <= 40000 
         THEN floor(random() * 2000) + 1 
         ELSE floor(random() * 8000) + 2001 
    END,
    -- 订单金额:50%在100-500之间,30%在500-2000,20%大额订单
    CASE 
        WHEN random() < 0.5 THEN (random() * 400 + 100)::numeric(10,2)
        WHEN random() < 0.8 THEN (random() * 1500 + 500)::numeric(10,2)
        ELSE (random() * 10000 + 2000)::numeric(10,2)
    END
FROM generate_series(1, 10000) AS id;

analyze customers;
analyze orders;

我们直接贴耗时时长

exists 只用了Execution Time: 3.655 ms

而 in 则用了Execution Time: 2950.236 ms

差距对比这就很明显了吧。

总结

select 1 不涉及实际列的计算或扫描,只需检查是否存在符合条件的行。

执行效率对 in 查询更加高效,看执行计划可知。

一般用于子查询关联查询,如: select * from customers c where exists (select 1 from orders o where o.customer_id=c.customer_id);