Improve the usage instructions for the Taobao crawler

DropsDevopsOrg · Oct 27, 2019 · 7f16ed5
1 parent ea895b6
commit 7f16ed5

Showing 5 changed files with 152 additions and 11 deletions.
diff --git a/TaobaoCrawler/Readme.md b/TaobaoCrawler/Readme.md
@@ -3,14 +3,68 @@
 
 ## Taobao crawler basics
 
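-* [webdriver Taobao page login]()
-* [webdriver Weibo page login]()
-* [pyppeteer Taobao page login]()
-* [Fetching browser data]()
-* [Fetching API data]()
+* [webdriver Taobao page login](TODO)
+* [webdriver Weibo page login](TODO)
+* [pyppeteer Taobao page login](TODO)
+* [Fetching browser data](TODO)
+* [Fetching API data](TODO)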
 
 # Code examples
 
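+Apart from the xsign key approach, the main headache when crawling Taobao data is being detected and served a slider CAPTCHA.
+
+This open-source program drives the browser through webdriver in code and routes the traffic through mitmproxy, which filters out the browser parameters that would otherwise tell Taobao's JavaScript that you are using webdriver. With those parameters filtered, the slider verification can be passed easily even when it does appear.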
+
+![](https://raw.githubusercontent.com/Hatcat123/GraphicBed/master/Img/20191027124113.gif)
+
+
+## Usage
+
+- [x] python3.5
+- [x] requirements.txt
+- [x] webdriver.exe
+- [x] mongodb
+- [x] mitmproxy
+
+>Running on Windows
+
+1. Start the MongoDB database and configure the database password.
+
+```
+mongod.exe
+
+```
+
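+2. Download mitmproxy.
+
+ - Download it directly from the official site (search online for installation instructions).
+ - Or install it with pip: `pip install mitmproxy`. If you can find mitmproxy.exe, mitmdump.exe, and mitmweb.exe under your Python installation's package directory, the install succeeded.
+
+3. Install `requirements.txt`
+
+>My package list is rather messy, so a virtual environment is recommended.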
+
+```
+pip install -r requirements.txt -i https://pypi.douban.com/simple
+```
+
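+4. The `webdriver` used in the code is headless Firefox. Both for passing the slider CAPTCHA and for filtering browser parameters, Firefox turned out to have a higher success rate.
+
+5. Start mitmproxy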
+
+```
+mitmdump -p 8888 -s proxy.py (path to the proxy script)
+```
+6. Run the program
+
+> Running inside a virtual environment is recommended
+
+```
+python TK_crawler.py
+```
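+After that you will see this interface. In brief, the GUI is written with Tk, which is rather cumbersome. I wrote it about half a year ago when I was less experienced, and looking at it now the logic and code structure are terrible; many shared parts were never factored out. All of the features do work, though.
+>Tested on the weekend of October 27, 2019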
+
 ## Opening the program
 
 ![](https://raw.githubusercontent.com/Hatcat123/GraphicBed/master/Img/20190416182559.png)
@@ -90,3 +144,4 @@
 Close the browser, then close the program
 
 ![](https://raw.githubusercontent.com/Hatcat123/GraphicBed/master/Img/20190416182442.png)
+
diff --git a/TaobaoCrawler/TK_crawler.py b/TaobaoCrawler/TK_crawler.py
@@ -605,7 +605,7 @@ def test_time(over_time):
     else:return False
 
 if __name__ == '__main__':
-    if test_time('2019-4-17 16:0:0'):
+    if test_time('2020-4-17 16:0:0'):
         window = tk.Tk()  # root container
         window.title("淘宝同款信息采集器定制版ByAjay13")  # title of the root container
         MainPage(window)
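
Note on the hunk above: the body of `test_time` is not shown in this diff, so the following is only a minimal sketch of what such an expiry gate might look like, assuming it returns True while the current time is still before the hard-coded cut-off. The name `is_before_deadline` and the `now` parameter are hypothetical and exist only for illustration.

```python
from datetime import datetime


def is_before_deadline(over_time, now=None):
    """Hypothetical stand-in for test_time: True while `now` precedes `over_time`.

    `over_time` uses the same unpadded format as the literal in the diff,
    for example '2020-4-17 16:0:0'.
    """
    deadline = datetime.strptime(over_time, '%Y-%m-%d %H:%M:%S')
    current = now if now is not None else datetime.now()
    return current < deadline


if __name__ == '__main__':
    commit_day = datetime(2019, 10, 27)
    print(is_before_deadline('2019-4-17 16:0:0', now=commit_day))
    print(is_before_deadline('2020-4-17 16:0:0', now=commit_day))
```

Under that assumption, moving the literal from 2019 to 2020 would let the GUI start again on the commit date, because the old cut-off had already passed.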

diff --git a/TaobaoCrawler/proxy.py b/TaobaoCrawler/proxy.py
@@ -0,0 +1,20 @@
+# -*- coding: utf-8 -*-
+from mitmproxy import ctx
+
+def response(flow):
+	if 'um.js' in flow.request.url or '120.js' in flow.request.url or '/sufei_data/3.7.5/index.js' in flow.request.url:
+		# Hide the webdriver flag that Selenium detection looks for
+		flow.response.text = flow.response.text + 'Object.defineProperties(navigator,{webdriver:{get:() => false}}); '
+
+	for webdriver_key in ['webdriver', '__driver_evaluate', '__webdriver_evaluate', '__selenium_evaluate',
+						  '__fxdriver_evaluate', '__driver_unwrapped', '__webdriver_unwrapped', '__selenium_unwrapped',
+						  '__fxdriver_unwrapped', '_Selenium_IDE_Recorder', '_selenium', 'calledSelenium',
+						  '_WEBDRIVER_ELEM_CACHE', 'FirefoxDriverw', 'driver-evaluate', 'webdriver-evaluate',
+						  'selenium-evaluate', 'webdriverCommand', 'webdriver-evaluate-response', '__webdriverFunc',
+						  '__webdriver_script_fn', '__$webdriverAsyncExecutor', '__lastWatirAlert',
+						  '__lastWatirConfirm', '__lastWatirPrompt', '$chrome_asyncScriptInfo',
+						  '$cdc_asdjflasutopfhvcZLmcfl_']:
+		ctx.log.info('Remove "{}" from {}.'.format(webdriver_key, flow.request.url))
+		flow.response.text = flow.response.text.replace('"{}"'.format(webdriver_key), '"NO-SUCH-ATTR"')
+	flow.response.text = flow.response.text.replace('t.webdriver', 'false')
+	flow.response.text = flow.response.text.replace('FirefoxDriver', '')
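
Review note on `proxy.py`: as committed, the hook rewrites every response and logs a "Remove" line for every key, whether or not the key occurs in the body. A hedged sketch of a narrower variant follows, assuming that only JavaScript and HTML bodies need scrubbing. `scrub_webdriver_markers`, `WEBDRIVER_KEYS` (abbreviated here), `DETECTION_SCRIPTS`, `HIDE_WEBDRIVER_JS`, and the content-type filter are illustrative names and choices, not part of the commit. The standard `logging` module stands in for `ctx.log` so the sketch runs without mitmproxy installed.

```python
import logging

# Abbreviated copy of the key list in proxy.py; the committed list is longer.
WEBDRIVER_KEYS = [
    'webdriver', '__driver_evaluate', '__webdriver_evaluate',
    '__selenium_evaluate', '__fxdriver_evaluate',
    '$cdc_asdjflasutopfhvcZLmcfl_',
]

# Same script URLs and injected snippet as in proxy.py.
DETECTION_SCRIPTS = ('um.js', '120.js', '/sufei_data/3.7.5/index.js')
HIDE_WEBDRIVER_JS = 'Object.defineProperties(navigator,{webdriver:{get:() => false}}); '


def scrub_webdriver_markers(text):
    """Return (scrubbed_text, list_of_keys_that_were_replaced)."""
    removed = []
    for key in WEBDRIVER_KEYS:
        quoted = '"{}"'.format(key)
        if quoted in text:
            text = text.replace(quoted, '"NO-SUCH-ATTR"')
            removed.append(key)
    text = text.replace('t.webdriver', 'false')
    text = text.replace('FirefoxDriver', '')
    return text, removed


def response(flow):
    """mitmproxy response hook; skipping non-JS/HTML bodies is an assumption."""
    content_type = flow.response.headers.get('content-type', '')
    if 'javascript' not in content_type and 'html' not in content_type:
        return
    text = flow.response.text
    if not text:
        return
    url = flow.request.url
    if any(name in url for name in DETECTION_SCRIPTS):
        text += HIDE_WEBDRIVER_JS
    text, removed = scrub_webdriver_markers(text)
    for key in removed:
        # Log only the keys that were actually found in this response.
        logging.info('Removed "%s" from %s.', key, url)
    flow.response.text = text
```

Keeping the string rewriting in a plain function makes it possible to check the replacements without starting a proxy, while the hook itself stays a thin wrapper.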
diff --git a/TaobaoCrawler/requirements.txt b/TaobaoCrawler/requirements.txt
@@ -0,0 +1,63 @@
+altgraph==0.16.1
+appdirs==1.4.3
+autopep8==1.4.3
+beautifulsoup4==4.7.1
+certifi==2018.8.24
+cffi==1.12.3
+chardet==3.0.4
+Click==7.0
+colorama==0.4.1
+copyheaders==0.0.1
+cssselect==1.0.3
+docx==0.2.4
+et-xmlfile==1.0.1
+Flask==1.0.2
+FOFA==1.0.1
+future==0.17.1
+getmac==0.8.1
+gevent==1.4.0
+greenlet==0.4.15
+idna==2.7
+itchat==1.3.10
+itsdangerous==1.1.0
+jdcal==1.4
+Jinja2==2.10
+lxml==4.3.2
+macholib==1.11
+MarkupSafe==1.1.0
+openpyxl==2.6.2
+pefile==2019.4.18
+Pillow==6.0.0
+prompt-toolkit==2.0.9
+py-backwards==0.7
+py-backwards-astunparse==1.5.0.post3
+pycodestyle==2.5.0
+pycparser==2.19
+pyee==5.0.0
+PyExecJS==1.5.1
+PyInstaller==3.4
+pymongo==3.7.2
+pypng==0.0.19
+pyppeteer==0.0.25
+PyQRCode==1.2.1
+PyQt5==5.12.1
+PyQt5-sip==4.19.15
+pyquery==1.4.0
+pywin32-ctypes==0.2.0
+requests==2.21.0
+retrying==1.3.3
+schedule==0.6.0
+selenium==3.14.0
+six==1.12.0
+soupsieve==1.8
+tkMessageBox==0.1
+tqdm==4.31.1
+typed-ast==1.3.1
+urldecode==0.1
+urllib3==1.23
+wcwidth==0.1.7
+websockets==7.0
+Werkzeug==0.14.1
+wit==5.1.0
+xlrd==1.2.0
+xlwt==1.3.0
diff --git a/WechatCrawler/readme.md b/WechatCrawler/readme.md
@@ -1,12 +1,14 @@
 
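-Crawling of WeChat official accounts will be released once this busy period is over
-
-Some features will undergo further development and be released as a commercial edition; self-media users can get in touch for early access
+Regarding crawling of WeChat official accounts: some features will undergo further development and be released as a commercial edition; self-media users can get in touch for early access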
 
 ![](https://raw.githubusercontent.com/Hatcat123/GraphicBed/master/Img/20190515130702.png)
 
 
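-On crawling official accounts: there are three conventional approaches. 1. Crawl the Sogou WeChat interface. 2. Intercept WeChat's request and response data through a proxy. 3. Hook WeChat's objects to crawl passively.
+On crawling official accounts: [there are three conventional approaches](https://github.com/DropsDevopsOrg/ECommerceCrawlers/wiki/%E5%BE%AE%E4%BF%A1%E5%85%AC%E4%BC%97%E5%8F%B7%E7%88%AC%E5%8F%96%E7%A0%94%E7%A9%B6).
+
+ - 1. Crawl the Sogou WeChat interface.
+ - 2. Intercept WeChat's request and response data through a proxy.
+ - 3. Hook WeChat's objects to crawl passively.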
 
 
 ## Official account aggregation platform
@@ -17,6 +19,7 @@
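 The official account aggregation platform is built with a combination of the layui front-end template and Bootstrap templates, and the server application is written in Python with Flask. It aggregates WeChat official accounts that focus on security and provides customers with a high-quality aggregation service.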
 
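 * Solves the technical difficulty of collecting from ordinary official accounts.
+* Supports automated, unattended collection.
 * Presents data in a friendly interface, with responsive layouts for three classes of device.
 * Provides API data endpoints for easy integration, so users can build on them.
 * Index queries are highly optimized and the service responds quickly.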
@@ -28,7 +31,7 @@
 
 ## Deployment requirements
 
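-A minimum-spec Windows server and Linux server are sufficient. Docker image deployment is supported.
+A minimum-spec Windows server (optionally plus a Linux server) is sufficient. Docker image deployment is supported.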
 
 ## What does the platform include?