MySQL使用笔记:重新组织表的内容

本文以一个实例(GTFS的stop_times表)介绍如何重新组织表的内容。

1. 一个实例

GTFS有一张表stop_times用来描述公交车的时刻表,每一组(trip_id相同)描述一趟公交车从起点到终点的时刻表,取stop_times.txt部分数据如下:

trip_id,stop_id,stop_sequence,arrival_time,departure_time,stop_headsign,pickup_type,drop_off_type,shape_dist_traveled
4503599630773892,3377699720881838,0,15:00:00,15:00:00,,2,1,
4503599630773892,3377699720882320,1,15:02:00,15:02:00,,2,2,
4503599630773892,3377699720888273,2,15:03:00,15:03:00,,2,2,
4503599630773892,3377699720888299,3,15:04:00,15:04:00,,2,2,
4503599630773892,3377699720888297,4,15:05:00,15:05:00,,2,2,
4503599630773892,3377699720888295,5,15:05:00,15:05:00,,2,2,
4503599630773892,3377699720888272,6,15:05:00,15:05:00,,2,2,
4503599630773892,3377699720888296,7,15:06:00,15:06:00,,2,2,
4503599630773892,3377699720888298,8,15:07:00,15:07:00,,2,2,
4503599630773892,3377699720888300,9,15:08:00,15:08:00,,2,2,
4503599630773892,3377699720888274,10,15:09:00,15:09:00,,2,2,
4503599630773892,3377699720882321,11,15:10:00,15:10:00,,2,2,
4503599630773892,3377699720885633,12,15:12:00,15:12:00,,2,2,
4503599630773892,3377699720888973,13,15:13:00,15:13:00,,2,2,
4503599630773892,3377699720888974,14,15:13:00,15:13:00,,2,2,
4503599630773892,3377699722609740,15,15:14:00,15:14:00,,2,2,
4503599630773892,3377699720882322,16,15:14:00,15:14:00,,2,2,
4503599630773892,3377699720888357,17,15:16:00,15:16:00,,2,2,
4503599630773892,3377699720888347,18,15:17:00,15:17:00,,2,2,
4503599630773892,3377699720888349,19,15:18:00,15:18:00,,2,2,
4503599630773892,3377699720888741,20,15:19:00,15:19:00,,2,2,
4503599630773892,3377699720888350,21,15:20:00,15:20:00,,2,2,
4503599630773892,3377699720882324,22,15:22:00,15:22:00,,2,2,
4503599630773892,3377699720888352,23,15:24:00,15:24:00,,2,2,
4503599630773892,3377699720882326,24,15:24:00,15:24:00,,2,2,
4503599630773892,3377699720882327,25,15:28:00,15:28:00,,2,2,
4503599630773892,3377699722609739,26,15:29:00,15:29:00,,2,2,
4503599630773892,3377699723931112,27,15:29:00,15:29:00,,1,2,

可见,这一趟公交车总共停了28个站,15:00:00从起点出发,15:29:00到达终点。

现在想找出每一趟公交车的起点和终点,那么新表就形如:

-- 新表格式
trip_id, 起始站的stop_id(stop_sequence最小的时候,即0), 终点站的stop_id(stop_sequence最大的时候)

-- 举例
trip_id, stop_sequence, stop_id, stop_sequence, stop_id
4503599630773892,0,3377699720881838,27,3377699723931112

2. 重新组织表的内容

新表中的每一行都是从stop_times抽取,这就需要对stop_times的内容重新组织。基本思路是:分别建立起始站和终点站两张临时表(皆以trip_id作为索引),再根据trip_id将两个表合并。源代码如下:

SELECT T1.trip_id, T1.stop_sequence, T1.stop_id, T2.stop_sequence, T2.stop_id
FROM
    -- create a new table T1: trip_id, stop_sequence=0, stop_id (first stop)
    (SELECT st_first1.trip_id, st_first1.stop_sequence, st_first1.stop_id
    FROM stop_times st_first1
    INNER JOIN 
        -- filter out the first stop: trip_id, stop_sequence=0
        (SELECT stop_times.trip_id, MIN(CAST(stop_times.stop_sequence AS UNSIGNED)) AS first_stop
        FROM stop_times
        GROUP BY stop_times.trip_id
        ) st_first2
    ON st_first1.trip_id=st_first2.trip_id AND st_first1.stop_sequence=st_first2.first_stop
    ) T1
    
LEFT JOIN -- combine T1 and T2

    -- create a new table T2: trip_id, stop_sequence=MAX, stop_id (last stop)
    (SELECT st_last1.trip_id, st_last1.stop_sequence, st_last1.stop_id
    FROM stop_times st_last1
    INNER JOIN
        -- filter out the last stop: trip_id, stop_sequence=MAX
        (SELECT stop_times.trip_id, MAX(CAST(stop_times.stop_sequence AS UNSIGNED)) AS last_stop
        FROM stop_times
        GROUP BY stop_times.trip_id
        ) st_last2
    ON st_last1.trip_id=st_last2.trip_id AND st_last1.stop_sequence=st_last2.last_stop
    ) T2
    
ON T1.trip_id=T2.trip_id
LIMIT 10;

机智的你一眼就看出上面的代码执行效率很低,因为包含了很多嵌套。不管怎么说,至少正确实现了,效率再慢慢改进吧。

运行得到的结果如下:

trip_id    stop_sequence    stop_id    stop_sequence    stop_id
4503599630773892    0    3377699720881838    27    3377699723931112
4503599630773894    0    3377699720881838    27    3377699723931112
4503599630773896    0    3377699720881838    27    3377699723931112
4503599630773898    0    3377699720881838    27    3377699723931112
4503599630773900    0    3377699720881838    27    3377699723931112
4503599630773902    0    3377699720881838    27    3377699723931112
4503599630810392    0    3377699722882625    17    3377699720888950
4503599630810394    0    3377699722882625    17    3377699720888950
4503599630810396    0    3377699722882625    17    3377699720888950
4503599630810398    0    3377699722882625    17    3377699720888950

参考资料:
[1]StackOverflow: SQL Select only rows with Max Value on a Column

发表评论

电子邮件地址不会被公开。 必填项已用*标注