Scraping web data generated dynamically by jQuery: Shanghai-Hong Kong Stock Connect historical data on Eastmoney as an example

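Using the Shanghai-Hong Kong Stock Connect (沪港通) historical data on Eastmoney (东方财富) as an example, this post shows how to fetch web data that jQuery loads dynamically.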

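1. What to scrape

Tushare only provides the net buy amount of the Shanghai-Hong Kong Stock Connect (net buy amount = buy turnover - sell turnover), but I also need the buy turnover and the sell turnover separately. Eastmoney happens to publish both, as shown below: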

Figure 1: Shanghai-Hong Kong Stock Connect historical data

However, the page numbers in the pager are not links to real URLs. Each one has href="javascript:" instead, as shown below:

Figure 2: Page navigation

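2. Analysis

The HTML for page 2 is <a target="_self" href="javascript:" data-page="2" rel="noopener">2</a>. An href of javascript: means that clicking the link runs JavaScript code and the address does not change. Here the code is empty, so the link itself does nothing.

That raises a question: the <a></a> element has no onclick="js_method()" attribute either, so which JS function runs and loads the data? Most likely a handler is bound to these links from a script rather than in the markup. Either way, the browser's Network panel shows which request a click triggers.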

3. Scraping the data

The data we want is loaded by JavaScript. In the browser, right-click --> Inspect --> Network, then find the JSON data returned to the JS script.

Figure 3: The JS request for the Shanghai-Hong Kong Stock Connect historical data

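There it is: get?callback=jQuery112309076618069356868_1612589090585&.... The complete Request URL is http://dcfm.eastmoney.com/em_mutisvcexpandinterface/api/js/get?callback=jQuery112304606706430005292_1612531660704&st=DetailDate&sr=-1&ps=10&p=1&type=HSGTHIS&token=894050c76af8597a853f5b408b759f5d&js=%7Bpages%3A(tp)%2Cdata%3A(x)%7D&filter=(MarketType%3D1). The response is in effect a call to the callback function named in the callback parameter, such as jQuery112309076618069356868_1612589090585() (the name is generated anew on each page load, which is why the two names quoted here differ). A callback is defined as follows (jQuery is a JavaScript library that greatly simplifies JS programming):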

A callback is a function that is passed as an argument to another function and is executed after its parent function has completed.
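On the scraping side, the callback is just a wrapper around the payload. The following is a minimal sketch of peeling that wrapper off in Python, assuming the response has the shape name(payload); the helper name unwrap_jsonp is hypothetical and not part of the original code.

```python
import re

def unwrap_jsonp(text):
    """Split 'name(payload)' into (name, payload); raise ValueError otherwise."""
    # Callback name, an opening parenthesis, the payload, and a closing
    # parenthesis, optionally followed by a semicolon.
    match = re.match(r'^\s*([\w$.]+)\s*\((.*)\)\s*;?\s*$', text, re.DOTALL)
    if match is None:
        raise ValueError('not a JSONP-style response')
    return match.group(1), match.group(2)

name, payload = unwrap_jsonp(
    'jQuery112304606706430005292_1612531660704({pages:145,data:[]})')
print(name)     # jQuery112304606706430005292_1612531660704
print(payload)  # {pages:145,data:[]}
```

Note that the payload here is not strict JSON, because the keys pages and data are unquoted, so json.loads cannot parse it as a whole.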

On the Headers tab you can see the Query String Parameters (p is the page number):

callback: jQuery112304606706430005292_1612531660704
st: DetailDate
sr: -1
ps: 10
p: 1
type: HSGTHIS
token: 894050c76af8597a853f5b408b759f5d
js: {pages:(tp),data:(x)}
filter: (MarketType=1)
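These parameters map directly onto the percent-encoded query string of the Request URL. As a sketch, the URL can be rebuilt from a dict with the standard library; the helper name build_url is hypothetical, and the comments on st, sr and ps are my guesses, since only p is documented above.

```python
from urllib.parse import urlencode

BASE_URL = 'http://dcfm.eastmoney.com/em_mutisvcexpandinterface/api/js/get'

def build_url(page):
    """Rebuild the Request URL for a given page number."""
    params = {
        'callback': 'jQuery112304606706430005292_1612531660704',
        'st': 'DetailDate',              # guess: field to sort by
        'sr': -1,                        # guess: sort direction
        'ps': 10,                        # guess: page size (10 rows came back)
        'p': page,                       # page number
        'type': 'HSGTHIS',
        'token': '894050c76af8597a853f5b408b759f5d',
        'js': '{pages:(tp),data:(x)}',
        'filter': '(MarketType=1)',
    }
    # safe='()' keeps the parentheses literal, as in the URL the browser sent.
    return BASE_URL + '?' + urlencode(params, safe='()')

print(build_url(1))
```

With safe='()' the result for page 1 is character-for-character the Request URL shown earlier, including js=%7Bpages%3A(tp)%2Cdata%3A(x)%7D.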

The Response is:

jQuery112304606706430005292_1612531660704({pages:145,data:[{"MarketType":1.0,"DetailDate":"2021-02-05T00:00:00","DRZJLR":3962.18,"DRYE":48037.82,"LSZJLR":636794.880000001,"DRCJJME":2715.71,"MRCJE":28084.09,"MCCJE":25368.38,"LCGCode":"600704","LCG":"物产中大","LCGZDF":10.0962,"SSEChange":3496.33,"SSEChangePrecent":-0.00157916078883799},{"MarketType":1.0,"DetailDate":"2021-02-04T00:00:00","DRZJLR":2369.05,"DRYE":49630.95,"LSZJLR":634079.170000001,"DRCJJME":1069.98,"MRCJE":28478.16,"MCCJE":27408.18,"LCGCode":"600583","LCG":"海油工程","LCGZDF":10.0467,"SSEChange":3501.86,"SSEChangePrecent":-0.00439256136081261},{"MarketType":1.0,"DetailDate":"2021-02-03T00:00:00","DRZJLR":1923.2,"DRYE":50076.8,"LSZJLR":633009.190000001,"DRCJJME":566.299999999999,"MRCJE":28155.03,"MCCJE":27588.73,"LCGCode":"600596","LCG":"新安股份","LCGZDF":10.0379,"SSEChange":3517.31,"SSEChangePrecent":-0.00463256435217674},{"MarketType":1.0,"DetailDate":"2021-02-02T00:00:00","DRZJLR":1684.64,"DRYE":50315.36,"LSZJLR":632442.890000001,"DRCJJME":298.84,"MRCJE":27462.14,"MCCJE":27163.3,"LCGCode":"600803","LCG":"新奥股份","LCGZDF":10.0248,"SSEChange":3533.68,"SSEChangePrecent":0.00810206317326993},{"MarketType":1.0,"DetailDate":"2021-02-01T00:00:00","DRZJLR":4212.37,"DRYE":47787.63,"LSZJLR":632144.050000001,"DRCJJME":2846.62,"MRCJE":26253.6,"MCCJE":23406.98,"LCGCode":"600740","LCG":"山西焦化","LCGZDF":10.0318,"SSEChange":3505.28,"SSEChangePrecent":0.00637655861065096},{"MarketType":1.0,"DetailDate":"2021-01-29T00:00:00","DRZJLR":1922.0,"DRYE":50078.0,"LSZJLR":629297.430000001,"DRCJJME":485.43,"MRCJE":27270.84,"MCCJE":26785.41,"LCGCode":"600970","LCG":"中材国际","LCGZDF":10.0637,"SSEChange":3483.07,"SSEChangePrecent":-0.00630780730233531},{"MarketType":1.0,"DetailDate":"2021-01-28T00:00:00","DRZJLR":-2238.38,"DRYE":54238.38,"LSZJLR":628812.000000001,"DRCJJME":-3660.4,"MRCJE":23665.29,"MCCJE":27325.69,"LCGCode":"601216","LCG":"内蒙君正","LCGZDF":10.1031,"SSEChange":3505.18,"SSEChangePrecent":-0.0190745912787477},{"MarketType":1.0,"DetailDate":"2021-01-27T00:00:00","DRZJLR":226.730000000003,"DRYE":51773.27,"LSZJLR":632472.400000001,"DRCJJME":-1231.26,"MRCJE":25394.48,"MCCJE":26625.74,"LCGCode":"600143","LCG":"金发科技","LCGZDF":10.0114,"SSEChange":3573.34,"SSEChangePrecent":0.00109541299311103},{"MarketType":1.0,"DetailDate":"2021-01-26T00:00:00","DRZJLR":-1547.56,"DRYE":53547.56,"LSZJLR":633703.660000001,"DRCJJME":-2916.14,"MRCJE":26748.18,"MCCJE":29664.32,"LCGCode":"603687","LCG":"大胜达","LCGZDF":10.0327,"SSEChange":3569.43,"SSEChangePrecent":-0.0151231706509503},{"MarketType":1.0,"DetailDate":"2021-01-25T00:00:00","DRZJLR":2456.61,"DRYE":49543.39,"LSZJLR":636619.800000001,"DRCJJME":1626.93,"MRCJE":33213.59,"MCCJE":31586.66,"LCGCode":"600516","LCG":"方大炭素","LCGZDF":10.0629,"SSEChange":3624.24,"SSEChangePrecent":0.00484924100644619}]})

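Now it is straightforward. The plan: request the Request URL with the page parameter p running from 1 to 145 to get responses like the one above, then extract the data we want from them.

Step 1: Request the URL to get the Response shown above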
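The upper bound of 145 comes from the pages field in the response, and it will grow as new trading days are added. Here is a small sketch of reading it from the response text instead of hard-coding it, assuming the pages:N pattern shown above; the helper name page_count is hypothetical.

```python
import re

def page_count(text):
    """Return the integer after 'pages:' in the JSONP response text."""
    match = re.search(r'\bpages\s*:\s*(\d+)', text)
    if match is None:
        raise ValueError('no "pages" field found')
    return int(match.group(1))

sample = 'jQuery112304606706430005292_1612531660704({pages:145,data:[]})'
print(page_count(sample))  # 145
```

One request for page 1 is then enough to learn how many pages to loop over.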

import requests

request_url = 'http://dcfm.eastmoney.com/em_mutisvcexpandinterface/api/js/get?callback=jQuery112304606706430005292_1612531660704&st=DetailDate&sr=-1&ps=10&p={page}&type=HSGTHIS&token=894050c76af8597a853f5b408b759f5d&js=%7Bpages%3A(tp)%2Cdata%3A(x)%7D&filter=(MarketType%3D1)'

response = requests.get(request_url.format(page=1))    # fetch page 1 as a first test
print(response.text)

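Step 2: Extract the data and parse it as JSON

response.text is a string. The data we care about sits in the square brackets [...] after data:, so we use a regular expression to pull out the [...] part: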

import re

s = re.findall(r'\[.*?\]', response.text)[0]    # Extract substrings between Square brackets

Parse the resulting substring as JSON:

import json

l = json.loads(s) # return a list of dicts
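The non-greedy pattern \[.*?\] stops at the first ], which works for the response shown above because no value contains a bracket. As a hedged alternative, the sketch below slices from the first [ to the last ], assuming the payload holds exactly one top-level array; the helper name extract_records and the sample string are made up for illustration.

```python
import json

def extract_records(text):
    """Parse the data array out of a JSONP response text."""
    start = text.index('[')    # the first '[' opens the data array
    end = text.rindex(']')     # the last ']' closes it
    return json.loads(text[start:end + 1])

# Made-up sample whose first value contains brackets.
sample = 'cb({pages:1,data:[{"LCG":"A[1]","MRCJE":1.5},{"LCG":"B","MRCJE":2.0}]})'
records = extract_records(sample)
print(len(records))       # 2
print(records[0]['LCG'])  # A[1]
```

On this sample the non-greedy regex would cut the string off after A[1] and json.loads would fail, while the slice keeps the whole array.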

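Step 3: Pick out the desired fields

json.loads(s) returns a list in which each element is a dict holding one day's record. On the web page, the same day is displayed as: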

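Column on the page               Value (amounts in units of 100 million yuan)
Date                             2021-02-05
Net buy amount (day)             27.16
Buy turnover                     280.84
Sell turnover                    253.68
Cumulative net buy amount        6367.95
Net capital inflow (day)         39.62
Quota balance (day)              480.38
Top gainer                       物产中大
Top gainer change                10.10%
SSE Composite Index              3496.33
SSE Composite Index change       -0.16%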
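The JSON values are in millions, while the page shows units of 100 million yuan and percentages. The sketch below reproduces the displayed figures from the first record of the Response. The pairing of JSON keys with page columns is inferred from the matching numbers, not from any documentation, and the helper names to_yi and to_percent are hypothetical.

```python
# First record from the Response above (only the numeric fields used here).
record = {
    'DRZJLR': 3962.18, 'DRYE': 48037.82, 'LSZJLR': 636794.880000001,
    'DRCJJME': 2715.71, 'MRCJE': 28084.09, 'MCCJE': 25368.38,
    'LCGZDF': 10.0962, 'SSEChangePrecent': -0.00157916078883799,
}

def to_yi(value_in_millions):
    """Format a value given in millions as units of 100 million, 2 decimals."""
    return f'{value_in_millions / 100:.2f}'

def to_percent(fraction):
    """Format a fraction such as -0.0016 as a percentage string."""
    return f'{fraction * 100:.2f}%'

print(to_yi(record['DRCJJME']))                # 27.16   net buy amount
print(to_yi(record['MRCJE']))                  # 280.84  buy turnover
print(to_yi(record['MCCJE']))                  # 253.68  sell turnover
print(to_percent(record['SSEChangePrecent']))  # -0.16%
```

The other amounts line up in the same way: LSZJLR gives 6367.95, DRZJLR gives 39.62 and DRYE gives 480.38.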

Here I only extract the buy turnover and the sell turnover. The key code is:

import datetime

# headers is a dict with a browser User-Agent; it is defined in the full code in section 4
lists = [['trade_date', 'hgt_in', 'hgt_out', 'hgt_net', 'hgt_total']]  # unit: million
for i in range(1, 146):
    response = requests.get(request_url.format(page=i), headers=headers)
    s = re.findall(r'\[.*?\]', response.text)[0]    # Extract substrings between Square brackets

    for d in json.loads(s): # json.loads(s) returns a list of dicts
        #{'MarketType': 1.0, 'DetailDate': '2021-02-05T00:00:00', 'DRZJLR': 3962.18, 'DRYE': 48037.82, 'LSZJLR': 636794.880000001, 'DRCJJME': 2715.71, 'MRCJE': 28084.09, 'MCCJE': 25368.38, 'LCGCode': '600704', 'LCG': '物产中大', 'LCGZDF': 10.0962, 'SSEChange': 3496.33, 'SSEChangePrecent': -0.00157916078883799}

        # change the format of trade_date
        dt_trade_date = datetime.datetime.strptime(d['DetailDate'], '%Y-%m-%dT%H:%M:%S')
        trade_date = dt_trade_date.strftime('%Y%m%d')
        hgt_in = d['MRCJE']
        hgt_out = d['MCCJE']

        print(trade_date)
        lists.append([trade_date, hgt_in, hgt_out, hgt_in-hgt_out, hgt_in+hgt_out])

Done :-)

The resulting data looks like this (unit: million yuan):

trade_date,hgt_in,hgt_out,hgt_net,hgt_total
20210205,28084.09,25368.38,2715.709999999999,53452.47
20210204,28478.16,27408.18,1069.9799999999996,55886.34
20210203,28155.03,27588.73,566.2999999999993,55743.759999999995
20210202,27462.14,27163.3,298.84000000000015,54625.44
20210201,26253.6,23406.98,2846.619999999999,49660.58
20210129,27270.84,26785.41,485.4300000000003,54056.25
20210128,23665.29,27325.69,-3660.399999999998,50990.979999999996
20210127,25394.48,26625.74,-1231.260000000002,52020.22
20210126,26748.18,29664.32,-2916.1399999999994,56412.5
20210125,33213.59,31586.66,1626.9299999999967,64800.25
...
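Values such as 2715.709999999999 are ordinary binary floating-point noise from the subtraction and addition. If cleaner output is wanted, one option is to round the derived columns to two decimals, matching the precision of the source data. A minimal sketch, with build_row as a hypothetical helper that is not part of the original code:

```python
import datetime

def build_row(record):
    """Turn one JSON record into [trade_date, in, out, net, total]."""
    trade_date = datetime.datetime.strptime(
        record['DetailDate'], '%Y-%m-%dT%H:%M:%S').strftime('%Y%m%d')
    hgt_in = record['MRCJE']
    hgt_out = record['MCCJE']
    # Round only the derived values; the inputs already have two decimals.
    return [trade_date, hgt_in, hgt_out,
            round(hgt_in - hgt_out, 2), round(hgt_in + hgt_out, 2)]

row = build_row({'DetailDate': '2021-02-05T00:00:00',
                 'MRCJE': 28084.09, 'MCCJE': 25368.38})
print(row)  # ['20210205', 28084.09, 25368.38, 2715.71, 53452.47]
```

The rounded net value 2715.71 also agrees with the DRCJJME field of the same record.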

4. Full code

The code is short, so instead of putting it on GitHub I'll just paste it here at the end of the post:

#!/usr/bin/env python3
import requests
import json
import re
import csv
import datetime

def main():
    # Step 1: Extract data [trade_date, hgt_in, hgt_out, hgt_in-hgt_out, hgt_in+hgt_out]
    # 'http://dcfm.eastmoney.com/em_mutisvcexpandinterface/api/js/get?callback=jQuery112304606706430005292_1612531660704&st=DetailDate&sr=-1&ps=10&p=1&type=HSGTHIS&token=894050c76af8597a853f5b408b759f5d&js=%7Bpages%3A(tp)%2Cdata%3A(x)%7D&filter=(MarketType%3D1)'
    request_url = 'http://dcfm.eastmoney.com/em_mutisvcexpandinterface/api/js/get?callback=jQuery112304606706430005292_1612531660704&st=DetailDate&sr=-1&ps=10&p={page}&type=HSGTHIS&token=894050c76af8597a853f5b408b759f5d&js=%7Bpages%3A(tp)%2Cdata%3A(x)%7D&filter=(MarketType%3D1)'

    headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Mobile Safari/537.36'}

    lists = [['trade_date', 'hgt_in', 'hgt_out', 'hgt_net', 'hgt_total']]  # unit: million
    for i in range(1, 146):
        response = requests.get(request_url.format(page=i), headers=headers)
        s = re.findall(r'\[.*?\]', response.text)[0]    # Extract substrings between Square brackets

        for d in json.loads(s): # json.loads(s) returns a list of dicts
            #{'MarketType': 1.0, 'DetailDate': '2021-02-05T00:00:00', 'DRZJLR': 3962.18, 'DRYE': 48037.82, 'LSZJLR': 636794.880000001, 'DRCJJME': 2715.71, 'MRCJE': 28084.09, 'MCCJE': 25368.38, 'LCGCode': '600704', 'LCG': '物产中大', 'LCGZDF': 10.0962, 'SSEChange': 3496.33, 'SSEChangePrecent': -0.00157916078883799}

            # change the format of trade_date
            dt_trade_date = datetime.datetime.strptime(d['DetailDate'], '%Y-%m-%dT%H:%M:%S')
            trade_date = dt_trade_date.strftime('%Y%m%d')
            hgt_in = d['MRCJE']
            hgt_out = d['MCCJE']

            print(trade_date)
            lists.append([trade_date, hgt_in, hgt_out, hgt_in-hgt_out, hgt_in+hgt_out])

    # Step 2: save to file
    with open('hgt_in_out.csv', 'w', newline='') as f:  # newline='' avoids blank rows on Windows
        writer = csv.writer(f)
        writer.writerows(lists)

if __name__ == '__main__':
    main()
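The script above hard-codes 145 pages and sends its requests back to back. As a possible variation, and only a sketch under my own assumptions, the paging loop can read the page count from each response and pause between requests. The function crawl_all_pages and its fetch parameter are hypothetical; fake_fetch stands in for the real HTTP call so the example runs offline.

```python
import json
import re
import time

def crawl_all_pages(fetch, delay=0.0):
    """Call fetch(page) for pages 1, 2, ... up to the reported page count."""
    records = []
    page, total_pages = 1, 1
    while page <= total_pages:
        text = fetch(page)
        # The response reports the total number of pages as pages:N.
        total_pages = int(re.search(r'pages\s*:\s*(\d+)', text).group(1))
        # Slice out the data array and parse it as JSON.
        records.extend(json.loads(text[text.index('['):text.rindex(']') + 1]))
        page += 1
        if delay and page <= total_pages:
            time.sleep(delay)    # pause between requests
    return records

def fake_fetch(page):
    """Stand-in for the real request: pretends there are 3 one-row pages."""
    return 'cb({pages:3,data:[{"p":%d}]})' % page

print(crawl_all_pages(fake_fetch))  # [{'p': 1}, {'p': 2}, {'p': 3}]
```

For real use, fetch could be something like lambda p: requests.get(request_url.format(page=p), headers=headers, timeout=10).text, with delay set to a second or so.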

