本文以一个实例(GTFS的stop_times表)介绍如何重新组织表的内容。
1. 一个实例
GTFS有一张表stop_times用来描述公交车的时刻表,每一组(trip_id相同)描述一趟公交车从起点到终点的时刻表,取stop_times.txt部分数据如下:
trip_id,stop_id,stop_sequence,arrival_time,departure_time,stop_headsign,pickup_type,drop_off_type,shape_dist_traveled 4503599630773892,3377699720881838,0,15:00:00,15:00:00,,2,1, 4503599630773892,3377699720882320,1,15:02:00,15:02:00,,2,2, 4503599630773892,3377699720888273,2,15:03:00,15:03:00,,2,2, 4503599630773892,3377699720888299,3,15:04:00,15:04:00,,2,2, 4503599630773892,3377699720888297,4,15:05:00,15:05:00,,2,2, 4503599630773892,3377699720888295,5,15:05:00,15:05:00,,2,2, 4503599630773892,3377699720888272,6,15:05:00,15:05:00,,2,2, 4503599630773892,3377699720888296,7,15:06:00,15:06:00,,2,2, 4503599630773892,3377699720888298,8,15:07:00,15:07:00,,2,2, 4503599630773892,3377699720888300,9,15:08:00,15:08:00,,2,2, 4503599630773892,3377699720888274,10,15:09:00,15:09:00,,2,2, 4503599630773892,3377699720882321,11,15:10:00,15:10:00,,2,2, 4503599630773892,3377699720885633,12,15:12:00,15:12:00,,2,2, 4503599630773892,3377699720888973,13,15:13:00,15:13:00,,2,2, 4503599630773892,3377699720888974,14,15:13:00,15:13:00,,2,2, 4503599630773892,3377699722609740,15,15:14:00,15:14:00,,2,2, 4503599630773892,3377699720882322,16,15:14:00,15:14:00,,2,2, 4503599630773892,3377699720888357,17,15:16:00,15:16:00,,2,2, 4503599630773892,3377699720888347,18,15:17:00,15:17:00,,2,2, 4503599630773892,3377699720888349,19,15:18:00,15:18:00,,2,2, 4503599630773892,3377699720888741,20,15:19:00,15:19:00,,2,2, 4503599630773892,3377699720888350,21,15:20:00,15:20:00,,2,2, 4503599630773892,3377699720882324,22,15:22:00,15:22:00,,2,2, 4503599630773892,3377699720888352,23,15:24:00,15:24:00,,2,2, 4503599630773892,3377699720882326,24,15:24:00,15:24:00,,2,2, 4503599630773892,3377699720882327,25,15:28:00,15:28:00,,2,2, 4503599630773892,3377699722609739,26,15:29:00,15:29:00,,2,2, 4503599630773892,3377699723931112,27,15:29:00,15:29:00,,1,2,
可见,这一趟公交车总共停了28个站,15:00:00从起点出发,15:29:00到达终点。
现在想找出每一趟公交车的起点和终点,那么新表就形如:
-- 新表格式 trip_id, 起始站的stop_id(stop_sequence最小的时候,即0), 终点站的stop_id(stop_sequence最大的时候) -- 举例 trip_id, stop_sequence, stop_id, stop_sequence, stop_id 4503599630773892,0,3377699720881838,27,3377699723931112
2. 重新组织表的内容
新表中的每一行都是从stop_times抽取,这就需要对stop_times的内容重新组织。基本思路是:分别建立起始站和终点站两张临时表(皆以trip_id作为索引),再根据trip_id将两个表合并。源代码如下:
SELECT T1.trip_id, T1.stop_sequence, T1.stop_id, T2.stop_sequence, T2.stop_id FROM -- create a new table T1: trip_id, stop_sequence=0, stop_id (first stop) (SELECT st_first1.trip_id, st_first1.stop_sequence, st_first1.stop_id FROM stop_times st_first1 INNER JOIN -- filter out the first stop: trip_id, stop_sequence=0 (SELECT stop_times.trip_id, MIN(CAST(stop_times.stop_sequence AS UNSIGNED)) AS first_stop FROM stop_times GROUP BY stop_times.trip_id ) st_first2 ON st_first1.trip_id=st_first2.trip_id AND st_first1.stop_sequence=st_first2.first_stop ) T1 LEFT JOIN -- combine T1 and T2 -- create a new table T2: trip_id, stop_sequence=MAX, stop_id (last stop) (SELECT st_last1.trip_id, st_last1.stop_sequence, st_last1.stop_id FROM stop_times st_last1 INNER JOIN -- filter out the last stop: trip_id, stop_sequence=MAX (SELECT stop_times.trip_id, MAX(CAST(stop_times.stop_sequence AS UNSIGNED)) AS last_stop FROM stop_times GROUP BY stop_times.trip_id ) st_last2 ON st_last1.trip_id=st_last2.trip_id AND st_last1.stop_sequence=st_last2.last_stop ) T2 ON T1.trip_id=T2.trip_id LIMIT 10;
机智的你一眼就看出上面的代码执行效率很低,因为包含了很多嵌套。不管怎么说,至少正确实现了,效率再慢慢改进吧。
运行得到的结果如下:
trip_id stop_sequence stop_id stop_sequence stop_id 4503599630773892 0 3377699720881838 27 3377699723931112 4503599630773894 0 3377699720881838 27 3377699723931112 4503599630773896 0 3377699720881838 27 3377699723931112 4503599630773898 0 3377699720881838 27 3377699723931112 4503599630773900 0 3377699720881838 27 3377699723931112 4503599630773902 0 3377699720881838 27 3377699723931112 4503599630810392 0 3377699722882625 17 3377699720888950 4503599630810394 0 3377699722882625 17 3377699720888950 4503599630810396 0 3377699722882625 17 3377699720888950 4503599630810398 0 3377699722882625 17 3377699720888950
参考资料:
[1]StackOverflow: SQL Select only rows with Max Value on a Column
微信赞赏
支付宝赞赏