How to scrape JavaScript-generated data with Scrapy

1 answer

龙氏风采
Recommended on 2017-10-06 · Zhidao Partner, Internet Expert

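Solutions:
Use a third-party middleware that provides a JavaScript rendering service, such as scrapy-splash.
Use WebKit, or a library built on WebKit.
Splash is a JavaScript rendering service: a lightweight browser with an HTTP API, implemented in Python on top of Twisted and QT. Twisted and QT make the service asynchronous, so it can take advantage of WebKit's concurrency.
The rest of this answer explains how to use scrapy-splash.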
Install the scrapy-splash library with pip:
    $ pip install scrapy-splash
scrapy-splash talks to the Splash HTTP API, so it needs a running Splash instance. Splash is usually run in Docker, so Docker has to be installed first.
Install Docker and start it.
Pull the image:
    $ docker pull scrapinghub/splash
Run scrapinghub/splash with Docker:
    $ docker run -p 8050:8050 scrapinghub/splash
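Before wiring Splash into Scrapy, it can help to call the rendering service directly. The following is a minimal sketch, assuming the container is published on localhost:8050 as in the docker run command; the helper names splash_render_url and fetch_rendered_html are made up for illustration. It relies on Splash's render.html endpoint and its url and wait arguments.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def splash_render_url(splash_url, page_url, wait=0.5):
    """Build the render.html URL that asks Splash to render page_url."""
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash_url.rstrip('/')}/render.html?{query}"


def fetch_rendered_html(splash_url, page_url, wait=0.5, timeout=30):
    """Fetch page_url through Splash and return the rendered HTML as text.

    Needs a running Splash instance, so it is only defined here, not called.
    """
    target = splash_render_url(splash_url, page_url, wait)
    with urlopen(target, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


# Example call (requires the Docker container to be running):
# html = fetch_rendered_html("http://localhost:8050", "http://example.com")
```

If the commented call returns HTML, the service is reachable and the same address can be used as SPLASH_URL.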
Configure the Splash service (all of the following goes in settings.py):
1) Add the Splash server address:
    SPLASH_URL = 'http://localhost:8050'
2) Add the Splash middlewares to DOWNLOADER_MIDDLEWARES:
    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    }
3) Enable SplashDeduplicateArgsMiddleware:
    SPIDER_MIDDLEWARES = {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    }
4) Set a custom DUPEFILTER_CLASS:
    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
5) Set a custom cache storage backend (needed when Scrapy's HTTP cache is in use):
    HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
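A malformed SPLASH_URL (for instance a missing colon after the scheme) is an easy mistake to make in settings.py. As a rough sketch, a small standard-library check can catch it early; check_splash_url is a hypothetical helper, not part of scrapy-splash.

```python
from urllib.parse import urlsplit


def check_splash_url(value):
    """Return value unchanged if it looks like an http(s) URL with a host.

    Raises ValueError otherwise.
    """
    parts = urlsplit(value)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"SPLASH_URL does not look like a valid URL: {value!r}")
    return value


# In settings.py this could wrap the assignment:
# SPLASH_URL = check_splash_url('http://localhost:8050')
```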
Example
Get the HTML content:
    import scrapy
    from scrapy_splash import SplashRequest

    class MySpider(scrapy.Spider):
        name = "myspider"  # Scrapy requires a spider name; any unique name works
        start_urls = ["http://example.com", "http://example.com/foo"]

        def start_requests(self):
            # Route every start URL through Splash instead of fetching it directly
            for url in self.start_urls:
                yield SplashRequest(url, self.parse, args={'wait': 0.5})

        def parse(self, response):
            # response.body is a result of render.html call; it
            # contains HTML processed by a browser.
            # ...
            pass
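What goes into parse depends on the page. As an illustration only, here is a standard-library sketch that pulls the page title out of a rendered HTML string; TitleParser and extract_title are hypothetical names. Inside a spider, the string would come from response.text, and Scrapy's own response.css or response.xpath selectors would be the more usual choice.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is recorded; later ones are ignored.
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_data(self, data):
        if self._in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False


def extract_title(html):
    """Return the stripped title text, or None if the page has no <title>."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip() if parser.title is not None else None
```

Because Splash returns the page after the browser has run its scripts, a title set by JavaScript would show up in this string, which is the point of rendering through Splash in the first place.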
