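Scraping WeChat Mini Program Content: WeChat Index as an Example

This post uses WeChat Index (微信指数) as an example to show how to scrape content from a WeChat mini program.

1. Approach

Content loaded inside a WeChat mini program has no visible URL, so it cannot be fetched directly. One workable approach is to route the phone's HTTP traffic through an HTTP proxy running on a laptop. Charles can do this.

Charles is a web debugging proxy that supports HTTP proxying, HTTPS proxying, HTTP monitoring, and reverse proxying. In the project's own words: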

Charles is an HTTP proxy / HTTP monitor / Reverse Proxy that enables a developer to view all of the HTTP and SSL / HTTPS traffic between their machine and the Internet. This includes requests, responses and the HTTP headers (which contain the cookies and caching information).

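2. Environment setup

My setup is a macOS laptop plus an iPhone. First download Charles from its official website, then follow the steps below.

(1) Install the certificates

Install the certificate on the laptop: in the menu bar, click Help -> SSL Proxying -> Install Charles Root Certificate.

The phone also needs the Charles certificate. In the menu bar, click Help -> SSL Proxying -> Install Charles Root Certificate on a Mobile Device or Remote Browser. A dialog pops up.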

Figure 1. The IP address and port shown in the Charles dialog

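On the phone, set the network's HTTP proxy to the IP address and port shown in the dialog: go to Settings -> Wi-Fi, tap the "i" to the right of the network name, scroll to the bottom, and configure HTTP Proxy.

Download the Charles certificate on the phone: open chls.pro/ssl in the phone's browser and the certificate profile downloads automatically. Then install the certificate (Profile) from the phone's Settings.

Finally, enable full trust for the installed root certificate under Settings -> General -> About -> Certificate Trust Settings.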

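(2) Configure Charles

Turn off the macOS proxy. By default Charles also captures the laptop's own traffic. To cut down on noise, uncheck Proxy -> macOS Proxy so that only the phone's HTTP traffic is captured.

Set the HTTP proxy port under Proxy -> Proxy Settings -> Proxies. It must match the port shown in the dialog opened by Help -> SSL Proxying -> Install Charles Root Certificate on a Mobile Device or Remote Browser.

Enable SSL proxying: under Proxy -> SSL Proxying Settings -> SSL, check Enable SSL Proxying. Click Add to specify which hosts and ports may be proxied. The blunt option is to set both Host and Port to *.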

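3. Capturing WeChat Index

In WeChat on the phone, search for 微信指数 (WeChat Index), open the WeChat Index mini program, and search for a keyword. The corresponding HTTPS traffic shows up in the Charles window. By analyzing that traffic you can write a crawler that fetches the data automatically.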

Figure 2. Traffic captured by Charles

As the capture shows, the request to search.weixin.qq.com/... is the HTTPS request sent when a keyword is searched. Its URL is:

https://search.weixin.qq.com/cgi-bin/searchweb/wxindex/querywxindexgroup?query_sug_list=&group_query_sug_list=%3B&group_query_list=Cn%3BChina&wxindex_query_list=China&gid=521846948515557828&openid=ov4n***0iN9M&search_key=16106895***_3038585660

The URL carries quite a few parameters. The query string contains:

Parameter             Value
query_sug_list        (empty)
group_query_sug_list  ;
group_query_list      Cn;China
wxindex_query_list    China
gid                   521846948515557828
openid                ov4n***0iN9M
search_key            16106895***_3038585660

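openid is the ID the WeChat mini program uses to identify the user, and wxindex_query_list is the keyword being queried. Note that search_key is updated dynamically, so a value captured earlier may need to be refreshed.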
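The parameter breakdown can also be reproduced in code rather than read off the Charles window. The following is a minimal sketch that uses only Python's standard library to decode the captured URL shown earlier. The masked values (`***`) are kept exactly as they appear in this post, so the output is illustrative only.

```python
from urllib.parse import urlsplit, parse_qs

# The captured request URL, with the masked values left untouched.
url = (
    "https://search.weixin.qq.com/cgi-bin/searchweb/wxindex/querywxindexgroup"
    "?query_sug_list=&group_query_sug_list=%3B&group_query_list=Cn%3BChina"
    "&wxindex_query_list=China&gid=521846948515557828"
    "&openid=ov4n***0iN9M&search_key=16106895***_3038585660"
)

parts = urlsplit(url)

# keep_blank_values=True keeps parameters such as query_sug_list that are empty.
# parse_qs returns a list per key; each key appears once here, so take item 0.
params = {
    name: values[0]
    for name, values in parse_qs(parts.query, keep_blank_values=True).items()
}

for name, value in params.items():
    print(f"{name:22}{value!r}")
```

Percent-encoded characters are decoded along the way, which is why `%3B` shows up as `;` in the output.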

The response can be viewed under the JSON Text tab in Charles:

{
    "data": {
        "follow_group_gid": "521846948561492458",
        "group_wxindex": [{
            "query": "China",
            "sug": "",
            "wxindex_str": "1046726,952811,1231842,1328559,1113814,1229927,1147777,1115248,859039,1152225,1136789,1131971,1029275,1115375,848046,844365,1215277,1109188,1182704,1762680,1257487,1047824,909230,1188578,1264397,1262510,1244853,1250531,940216,905975,1124739,1109681,1162519,1312195,1419075,1292797,1026179,1261010,1386209,1523541,1591826,1421808,1179703,974055,1159993,1171809,1153880,1130496,1080624,975301,886678,991995,1073214,932342,952776,979360,748935,726215,959261,1144377,1046903,998359,1072579,857873,722989,836615,910001,915710,935492,799013,703706,631521,810869,755020,800770,843455,698834,511534,700016,870081,1331160,910981,918299,875975,737716,659149,807353,899437,654414,944741"
        }],
        "is_new_group": true,
        "timestamp": 1610681122
    },
    "errcode": 0,
    "msg": "",
    "retcode": 0
}

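wxindex_str holds exactly 90 values, one per day over 90 days, and timestamp corresponds to the current date. With a little processing the data can be saved in whatever format you need. For example, I scraped the WeChat Index for the 180 stocks in the SSE 180 index (the constituent list changes over time, so check the latest SSE 180 list). A value of -1 marks a name that WeChat Index has not indexed. Part of the data is shown below: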

Date,浦发银行,白云机场,上海机场,包钢股份,华能国际,民生银行,宝钢股份,华能水电,中国石化,南方航空,中信证券,三一重工,招商银行,中直股份,保利地产,中国联通,国投资本,宇通客车,葛洲坝,特变电工,上汽集团,国金证券,北方稀土,东方航空,中国卫星,金发科技,中国船舶,华创阳安,天坛生物,中国巨石,生益科技,复星医药,生物股份,新湖中宝,恒瑞医药,安琪酵母,恒顺醋业,万华化学,白云山,华夏幸福,恒力石化,浙江龙盛,江西铜业,金地集团,国电南瑞,片仔癀,通威股份,亨通光电,中金黄金,方大炭素,贵州茅台,华海药业,中天科技,中国软件,山东黄金,恒生电子,长电科技,海螺水泥,用友网络,北大荒,青岛啤酒,绿地控股,华鑫股份,东方明珠,福耀玻璃,海尔智家,均胜电子,三安光电,中航资本,中粮糖业,辽宁成大,华域汽车,闻泰科技,中航沈飞,通策医疗,水井坊,华新水泥,山西汾酒,百联股份,海通证券,王府井,通化东宝,中炬高新,国投电力,伊利股份,航发动力,张江高科,长江电力,华安证券,中泰证券,江苏银行,杭州银行,东方证券,中国电影,招商证券,大秦铁路,南京银行,隆基股份,春秋航空,中信建投,渝农商行,中国神华,太平洋,恒立液压,财通证券,中国国航,工业富联,新城控股,天风证券,兴业银行,北京银行,中国铁建,东兴证券,国泰君安,陕西煤业,上海银行,红塔证券,广汽集团,农业银行,中国平安,中国人保,交通银行,新华保险,三六零,兴业证券,中国中铁,工商银行,大智慧,东吴证券,中国太保,上海医药,中国中冶,中国人寿,长城汽车,邮储银行,中国建筑,中国电建,华泰证券,中银证券,中国卫通,中国中车,晶科科技,光大证券,中国交建,中海油服,京沪高铁,光大银行,中国石油,招商轮船,正泰电器,浙商证券,中国银河,中国中免,紫金矿业,方正证券,浙商银行,永辉超市,建设银行,中国核电,中国银行,中国重工,南京证券,中科曙光,甘李药业,汇顶科技,公牛集团,药明康德,海天味业,今世缘,万泰生物,韦尔股份,绝味食品,口子窖,璞泰来,良品铺子,华友钴业,欧派家居,兆易创新,洛阳钼业,中微公司
2020-10-21,976120,273475,146712,6835,5706,967299,28458,-1,2099555,528757,318673,115239,5339272,-1,60008,2398685,-1,34292,217085,32705,167650,122109,-1,468479,977720,51041,144446,-1,9972,8769,9545,72310,-1,5200,171834,34839,10674,46406,3283104,167743,-1,8146,12363,43517,2578,263810,32315,10540,5033,7838,530912,16917,16969,81837,35031,22803,21515,73389,9553,329040,1738850,35421,1465,208460,56009,112309,24941,58491,6924,-1,17503,7368,-1,-1,7039,142049,9315,60113,4900,93845,693616,7365,3663,18805,14805,-1,15119,14330,64628,58652,240575,160761,55723,457785,398539,8297,167834,174728,643796,216264,-1,10034,3982603,-1,25064,85956,15635,74114,69290,1943230,725062,298385,16541,126173,46773,622798,25537,21399,2411954,5917345,1132331,1773879,478059,-1,986648,155878,5565369,998157,72306,25892,46096,20760,2060860,630760,592461,838388,136540,280652,30604,6587,51686,-1,63120,45473,3792,60189,759197,1052613,6695,20855,38937,59404,-1,177548,38890,246120,1885386,4239861,29237,6717566,37577,149159,27279,13525,25822,15728,99763,54475,154021,-1,-1,14786,54593,-1,998419,21326,404187,36232,15793,-1
2020-10-22,979602,290973,170557,7246,4810,972305,28607,-1,2130922,554083,248928,116027,5267684,-1,47714,2568067,-1,25541,173763,31463,173671,99877,-1,450897,1050472,24393,121190,-1,10666,6713,10596,69935,-1,6500,117169,39157,12146,43702,844148,218547,-1,6217,10497,42769,7558,249669,18027,7513,4562,10466,506726,30397,24792,117420,55192,28325,27972,59960,17624,308505,403092,23191,1775,177161,55902,86296,16538,62168,6329,-1,13119,13917,-1,-1,8831,134590,17294,48849,4040,80749,909899,8541,6144,6466,9350,-1,13933,15229,36953,34759,476786,155773,54285,447853,353627,6272,218478,88619,595805,188500,-1,4261,2979312,-1,26247,90399,6310,190957,62789,1491974,1159605,248516,14570,113557,19277,559881,14245,31531,2198305,477300,1466694,2971975,471109,-1,564359,125534,5664150,818788,56877,108085,67007,14736,1307884,485215,487871,883130,123854,271876,29190,4329,64994,-1,59829,51551,20435,91740,841829,1145311,4655,21235,37774,38772,-1,166889,42752,239146,1773777,4341686,28783,7542809,27603,27338,23781,12965,22770,11232,127807,167715,147288,-1,-1,5886,75847,-1,835374,22288,541185,35468,16390,-1
2020-10-23,855378,262125,147726,6883,4250,947212,31557,-1,2531747,568606,320341,111226,4870527,-1,118013,2275770,-1,35589,180819,38053,167181,91489,-1,456007,1059881,30108,119054,-1,6648,7070,15413,86080,-1,9500,91941,33601,6983,69079,1355064,158039,-1,6921,14859,45394,7642,264003,40863,9300,6765,17345,557987,13350,21885,88528,38196,28367,22647,77667,16266,449518,454633,18658,2875,234833,59653,86904,15492,41930,6583,-1,12178,4968,-1,-1,6695,132141,15031,67356,5501,69955,986376,12963,4567,10169,17371,-1,15443,14336,28032,61778,287817,167007,51523,409999,308862,9663,137018,87987,601639,187168,-1,10209,2380722,-1,30448,102353,8594,112807,65661,1338505,1024026,312660,16917,126053,16763,610360,11860,34089,2260976,462663,1019889,2772589,544924,-1,303926,186637,5700277,856599,42146,69734,30512,14919,1445089,446844,467208,820913,132599,287817,32201,4430,66001,-1,65879,65513,10888,672263,887660,1204570,6336,25955,38964,35343,-1,120879,72499,175202,2461064,4648142,21122,10777591,19514,24811,41817,22870,39999,71433,132721,60592,147174,-1,-1,4294,64411,-1,1210805,15272,751169,36003,8782,-1
2020-10-24,1259856,283534,112598,8159,6237,890052,28222,-1,2287935,539254,461917,148101,5990823,-1,51718,2424515,-1,34506,202601,40782,152955,81270,-1,762638,1066600,19899,93702,-1,5261,9409,36614,66951,-1,4000,84581,34794,13267,100644,1113199,135053,-1,3455,11668,64305,2884,282734,81088,8268,7750,19936,548567,12320,17350,183201,36493,24304,20758,60255,23543,243637,417911,23273,3719,223159,43207,99430,11717,28012,5991,-1,11965,3334,-1,-1,6793,149907,15473,48277,5287,68596,994379,5297,6058,16238,19088,-1,15492,10819,39579,46218,439179,154499,43864,568483,257424,7514,117824,85215,687064,392295,-1,5146,2376241,-1,21222,151511,31966,89571,131913,1107321,971800,307319,14912,112931,14069,656577,13517,29826,2288824,601256,1037901,2618623,454191,-1,205926,176018,6019357,1185496,29986,90910,34087,11284,1339548,484407,603933,680419,109342,334149,26758,4528,53684,-1,57910,92918,9029,884030,1076941,1052111,7205,23610,33289,30422,-1,100449,58620,162456,3077292,4243851,32322,8575182,15324,18037,24342,15617,26543,57236,89380,44037,142777,-1,-1,17141,66150,-1,1130705,10922,648321,22199,11146,-1
2020-10-25,623497,215608,83512,6269,2817,498104,15366,-1,1974749,461857,181063,105482,4926225,-1,87966,2309307,-1,29705,142852,22803,176078,63336,-1,352506,994441,14905,194789,-1,1612,6257,28349,47421,-1,4400,60416,26342,3558,49928,1627363,110163,-1,3543,1731,31097,2711,368739,15496,5430,11238,13315,553229,5709,14410,104092,20525,12873,12181,39934,12868,236433,519325,17866,1195,248482,37076,110696,6813,14608,4804,-1,9421,3559,-1,-1,6317,170802,10721,46195,2486,45075,1318578,4703,7947,12067,12732,-1,6143,10170,22303,27375,467789,247346,41273,487540,100834,6435,84789,52645,504472,110795,-1,4149,1674499,-1,25763,97889,12764,46625,68041,635280,903859,238172,13315,80904,13818,866280,5173,19722,2001996,363236,776552,1528645,271382,-1,103423,137712,5225614,998234,18835,63837,38589,9065,960248,454670,432602,740329,103276,286153,13880,1019,53423,-1,39388,50614,4836,2367871,830257,917815,2799,19938,20236,21693,-1,79985,29764,118313,2735583,3121943,19541,8646805,9920,9994,19553,12670,17434,18903,71706,27032,149362,-1,-1,2906,71368,-1,1255249,11229,783291,18629,6588,-1
2020-10-26,448674,214673,197353,5907,3348,464217,13752,-1,2058186,821547,247241,85515,4775637,-1,40851,2649746,-1,24037,134736,21999,185058,58322,-1,358109,1060986,14490,110991,-1,6737,3953,27907,34994,-1,1400,58514,26101,5359,59346,1503288,97822,-1,2154,6697,22063,2129,537507,23179,6329,10400,6834,664488,6703,13781,85727,17277,16457,9869,36081,14768,244211,448400,16994,452,197283,41452,89791,6198,16541,3894,-1,9441,3296,-1,-1,6285,163885,10035,54347,2117,39549,1084267,9376,6030,9527,14922,-1,7145,10385,25274,22391,172341,231462,47138,409979,109759,5800,77500,47418,490704,142237,-1,8223,1549572,-1,38915,84617,8629,49198,75093,626059,575185,212254,12772,150691,24754,703594,9860,25453,1810835,334431,654120,1528060,248079,-1,80112,117713,4592919,966736,22384,38525,25246,10342,827445,486196,455507,596566,106569,243897,12852,5632,50628,-1,35276,42217,5699,570600,666766,789247,3104,11655,17652,19798,-1,98709,27895,136048,2103807,2999577,24166,8441844,10023,12090,14279,9416,11629,12451,54658,25896,172823,-1,-1,2632,66711,-1,1236032,13729,466470,26510,4996,-1
2020-10-27,865569,227312,375688,10104,3201,837040,53655,-1,2104137,708863,267308,111337,5376249,-1,48147,2687785,-1,39978,193928,34927,193268,80268,-1,1051918,949406,21049,89795,-1,7335,11701,49263,54966,-1,4400,83975,36722,7020,45080,775622,179682,-1,7003,9647,28953,4975,537949,29169,4855,12746,11317,4356154,11707,16665,140587,60886,67085,24096,112794,23193,548749,466054,131924,1705,238667,45619,95329,7239,31993,5079,-1,15420,5572,-1,-1,10648,137277,17082,55898,4298,65678,956367,9734,6360,8004,13224,-1,20398,13976,29022,46498,253417,158606,65013,509410,251870,6235,135119,65497,472222,218669,-1,42206,1906113,-1,43219,93223,14985,98720,100815,1024370,583117,232250,19101,154829,21878,452654,20264,30058,2257094,1368934,1116460,2445331,477594,-1,95779,149050,5580784,1606969,47343,48867,40476,9934,1326315,511312,531720,688762,119122,952422,35424,11536,43636,-1,243168,51029,5268,578725,647760,1004382,2904,24015,45294,24248,-1,107758,61979,196603,1324737,4342313,21621,7164608,10704,41326,25674,11075,28872,11534,78896,42923,138605,-1,-1,5817,67614,-1,1337956,18611,531411,31721,9069,-1
2020-10-28,915680,231949,269618,9322,6497,1255811,26054,-1,1933704,583073,397400,109724,4266613,-1,78958,2828894,-1,39086,241836,27186,173003,79410,-1,640274,733873,30997,119155,-1,27600,36567,24751,53007,-1,7300,82838,33904,18330,63236,689875,165945,-1,17403,6929,37793,5903,383951,19834,8566,3133,9324,1164512,10983,14122,107702,49388,54747,16522,116522,17896,268155,468712,52402,506,174353,99920,108443,23138,42974,7412,-1,17518,5615,-1,-1,38261,147022,18279,72903,3525,86171,756127,7729,13411,9072,37495,-1,20859,14652,32350,41454,352757,158732,63970,443561,226116,9963,126505,64192,685501,208414,-1,14937,1521866,-1,31944,113764,9646,110713,82816,2231758,705375,204375,18109,128701,17417,472765,16365,27739,2980011,808708,1050860,1318636,512870,-1,98894,170145,5443009,1356308,34039,59933,63895,9547,1256622,491073,992224,858039,118993,427423,26511,13922,49897,-1,124357,51356,9856,349192,712800,924606,6958,20573,43457,19455,-1,84545,52003,313457,1029614,4132184,26129,7981820,13650,32750,26201,16943,29469,16784,118916,89750,157973,-1,-1,3577,68895,-1,1510092,17682,801832,31053,10800,-1
2020-10-29,1017003,269989,241466,7618,16985,1162067,21264,-1,1965328,587191,488351,100437,4702714,-1,131052,2871217,-1,73593,348127,36327,177309,70643,-1,578531,1049033,16371,98106,-1,28557,22114,22679,55668,-1,6100,63651,32390,31783,153389,578178,159454,-1,7140,10212,32077,4201,1466062,34136,9494,6016,11497,1418399,11135,26260,87085,32903,43387,25744,117887,14984,448709,449044,29894,1505,220316,54730,102524,49472,49285,5425,-1,28345,6651,-1,-1,17358,1292583,18371,65317,6652,76337,875249,11080,10251,9547,35752,-1,23141,15904,26190,42723,255438,168672,70362,378496,302307,9351,168932,74861,640510,206632,-1,12211,1383387,-1,26231,95901,10292,112027,81140,2168607,618138,247466,22718,142539,16399,498507,13196,25816,2425637,880273,1075954,1398798,525827,-1,111026,117558,5535729,1331949,38591,178297,56697,12176,1284441,440879,876719,824165,121496,384885,46142,7458,36386,-1,75382,44575,7468,216215,711198,970271,4583,19734,30348,19939,-1,119756,71661,323783,1048301,4177322,27849,7601288,18938,24565,48273,23179,30371,21006,100598,51732,153720,-1,-1,5770,70489,-1,1725171,26219,607162,27994,12083,-1
...
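The "little processing" mentioned above, turning wxindex_str and timestamp into dated rows, can be sketched roughly as follows. This is not the script that produced the table. It makes two assumptions that the response itself does not confirm: that timestamp should be read in China Standard Time (UTC+8), and that the last value belongs to the timestamp's date (adjust the hypothetical `end_offset_days` parameter if the series actually ends a day earlier). The helper names `index_by_date` and `to_csv` are made up for this example.

```python
import csv
import io
from datetime import datetime, timedelta, timezone

# Assumption: interpret the timestamp in China Standard Time (UTC+8).
CST = timezone(timedelta(hours=8))


def index_by_date(wxindex_str, timestamp, end_offset_days=0):
    """Pair each value in wxindex_str with a calendar date, oldest first.

    Assumption: values are consecutive days and the last one belongs to
    the timestamp's date minus end_offset_days.
    """
    values = [int(v) for v in wxindex_str.split(",")]
    last_day = datetime.fromtimestamp(timestamp, tz=CST).date()
    last_day -= timedelta(days=end_offset_days)
    first_day = last_day - timedelta(days=len(values) - 1)
    return [(first_day + timedelta(days=i), v) for i, v in enumerate(values)]


def to_csv(rows, keyword):
    """Render (date, value) rows as CSV text with a 'Date,<keyword>' header."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Date", keyword])
    for day, value in rows:
        writer.writerow([day.isoformat(), value])
    return buf.getvalue()


# Shortened input: the first three values and the timestamp from the response above.
rows = index_by_date("1046726,952811,1231842", 1610681122)
print(to_csv(rows, "China"))
```

For several keywords, the per-keyword columns can be merged on the date to produce a wide table like the one shown above.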

The headers of the HTTP request are also visible in Charles:

headers = {
    "Referer": "https://servicewechat.com/wxc026e7662ec26a3a/10/page-frame.html",
    "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 14_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148 MicroMessenger/7.0.21(0x17001522) NetType/WIFI Language/en"
}

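4. Other issues

(1) Keyword not yet indexed

Some keywords have not been indexed by WeChat Index. In that case the response looks like this: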

{
    "data": {
        "follow_group_gid": "",
        "group_wxindex": [],
        "is_new_group": false,
        "timestamp": 1610893860
    },
    "errcode": -1,
    "msg": "BatchGetWxIndex Failed",
    "retcode": -1
}

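(2) "操作太频繁,请稍后尝试" (too many requests)

When WeChat Index shows "操作太频繁,请稍后尝试" ("Operation too frequent, please try again later"), log in with a different WeChat account and update openid and search_key accordingly.

It also helps to pause between keywords while crawling, for example time.sleep(random.randint(1, 3)).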
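A minimal sketch of that throttling idea is below. The helper `crawl_with_delay` and its parameters are hypothetical names introduced for illustration. It takes the fetch function as an argument, so it makes no network calls itself, and the demo swaps in a stand-in fetcher and a recording `sleep` so it runs instantly.

```python
import random
import time


def crawl_with_delay(keywords, fetch, min_delay=1, max_delay=3, sleep=time.sleep):
    """Call fetch(keyword) for every keyword, pausing between calls.

    The pause is a random whole number of seconds in [min_delay, max_delay],
    and there is no pause before the first keyword.
    """
    results = {}
    for i, keyword in enumerate(keywords):
        if i > 0:
            sleep(random.randint(min_delay, max_delay))
        results[keyword] = fetch(keyword)
    return results


# Demo: a stand-in fetcher and a list that records the requested pauses.
pauses = []
demo = crawl_with_delay(
    ["China", "Cn"],
    fetch=lambda word: [len(word)],  # placeholder for a real request
    sleep=pauses.append,             # record the delay instead of waiting
)
print(demo, pauses)
```

In real use, `fetch` would be a function that issues the actual request for one keyword, and `sleep` would be left at its default.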

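(3) response.text vs. response.content

response.content returns the raw body as binary data (bytes). response.text returns the body decoded to Unicode text. Requests takes the encoding from the HTTP headers or from a value you set yourself (for example response.encoding = 'utf-8'). If neither is available, it guesses the encoding with chardet (The Universal Character Encoding Detector), or with charset_normalizer in newer releases.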
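The difference between bytes and text can be illustrated without touching the network. The sketch below simulates a response body with plain Python values: `body_bytes` stands in for response.content and `body_text` for response.text. The JSON payload is made up for the example.

```python
import json

# Stand-in for response.content: the raw UTF-8 bytes of a JSON body.
body_bytes = '{"query": "微信指数", "retcode": 0}'.encode("utf-8")

# Stand-in for response.text when the encoding is correctly set to UTF-8.
body_text = body_bytes.decode("utf-8")

# Decoding with the wrong codec does not raise here, but the Chinese
# characters come out garbled. This is what a bad encoding guess looks like.
wrong_text = body_bytes.decode("latin-1")

# json.loads accepts both str and UTF-8 encoded bytes, so parsing the raw
# bytes sidesteps the encoding guess entirely.
from_bytes = json.loads(body_bytes)
from_text = json.loads(body_text)

print(type(body_bytes).__name__, type(body_text).__name__)
print(from_bytes == from_text)
print(wrong_text == body_text)
```

This is one reason to pass response.content to json.loads for JSON APIs: the result does not depend on which encoding Requests happens to pick.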

5. Key code

The key code is as follows:

import json
import requests
from requests.adapters import HTTPAdapter

# Reuse one session and retry failed connections up to 5 times.
s = requests.Session()
s.mount('http://', HTTPAdapter(max_retries=5))
s.mount('https://', HTTPAdapter(max_retries=5))

URL = "https://search.weixin.qq.com/cgi-bin/searchweb/wxindex/querywxindexgroup"

def scrap_wechat_index(open_id, search_key, search_word):
    """
    Return the list of WeChat Index values for search_word.
    """
    # Pass the query as params so that requests URL-encodes the keyword.
    params = {
        "wxindex_query_list": search_word,
        "openid": open_id,
        "search_key": search_key,
    }

    headers = {
        "Referer": "https://servicewechat.com/wxc026e7662ec26a3a/10/page-frame.html",
        "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 14_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148 MicroMessenger/7.0.21(0x17001522) NetType/WIFI Language/en"
    }

    response = s.get(URL, params=params, headers=headers, timeout=3)

    # Parse the raw bytes; json.loads handles UTF-8 encoded bytes directly.
    content = json.loads(response.content)

    if content['retcode'] == -1:  # keyword is not indexed by WeChat Index
        l_wechat_index = [-1] * 90
    else:
        try:
            # Comma-separated string of 90 daily values.
            wxindex_str = content['data']['group_wxindex'][0]['wxindex_str']
            l_wechat_index = wxindex_str.split(',')
        except KeyError:
            print('KeyError: You might need to update the field search_key.')
            l_wechat_index = ['KeyError'] * 90
        except IndexError:
            print('IndexError', search_word)
            l_wechat_index = ['IndexError'] * 90

    return l_wechat_index
