Example: a blog article crawler in Python


Date: 2021-02-07 11:22 Author: admin


The code is as follows:
#!/usr/bin/python
# -*- coding: utf-8 -*-
# JCrawler
# Author: Jam <810441377@qq.com>

import time
import urllib2
from bs4 import BeautifulSoup

# Target site
TargetHost = "http://adirectory.blog.com"
# User Agent
UserAgent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.117 Safari/537.36'
# Link extraction rules: each list is a chain of find/findAll steps applied in order
# Category link extraction rules
CategoryFind = [{'findMode':'find','findTag':'div','rule':{'id':'cat-nav'}},
                {'findMode':'findAll','findTag':'a','rule':{}}]
# Article link extraction rules
ArticleListFind = [{'findMode':'find','findTag':'div','rule':{'id':'content'}},
                   {'findMode':'findAll','findTag':'h2','rule':{'class':'title'}},
                   {'findMode':'findAll','findTag':'a','rule':{}}]
# Pagination URL pattern ("#page" is replaced with the page number)
PageUrl = 'page/#page/'
PageStart = 1
PageStep = 1
# Marker text that signals there are no more pages
PageStopHtml = '404: Page Not Found'

def GetHtmlText(url):
    # Fetch a URL with browser-like headers and return the raw response body
    request = urllib2.Request(url)
    request.add_header('Accept', "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp")
    request.add_header('Accept-Encoding', "*")
    request.add_header('User-Agent', UserAgent)
    return urllib2.urlopen(request).read()

def ArrToStr(varArr):
    # Concatenate the string form of every item (used to turn matches back into HTML)
    returnStr = ""
    for s in varArr:
        returnStr += str(s)
    return returnStr


def GetHtmlFind(htmltext, findRule):
    # Apply each rule in order; every step after the first re-parses the
    # HTML produced by the previous step's matches
    findReturn = BeautifulSoup(htmltext)
    returnText = ""
    for f in findRule:
        if returnText != "":
            findReturn = BeautifulSoup(returnText)
        if f['findMode'] == 'find':
            findReturn = findReturn.find(f['findTag'], f['rule'])
        if f['findMode'] == 'findAll':
            findReturn = findReturn.findAll(f['findTag'], f['rule'])
        returnText = ArrToStr(findReturn)
    return findReturn

def GetCategory():
    # Fetch the home page and collect every category link as {'name', 'url'}
    categorys = []
    htmltext = GetHtmlText(TargetHost)
    findReturn = GetHtmlFind(htmltext, CategoryFind)

    for tag in findReturn:
        print "[G]->Category:" + tag.string + "|Url:" + tag['href']
        categorys.append({'name': tag.string, 'url': tag['href']})
    return categorys

def GetArticleList(categoryUrl):
    # Walk a category page by page and collect article links as {'name', 'url'}
    articles = []
    page = PageStart
    while True:
        htmltext = ""
        pageUrl = PageUrl.replace("#page", str(page))
        print "[G]->PageUrl:" + categoryUrl + pageUrl
        # Retry loop: a 404 ends the category, a 504 waits and retries,
        # any other HTTP error stops retrying
        while True:
            try:
                htmltext = GetHtmlText(categoryUrl + pageUrl)
                break
            except urllib2.HTTPError as e:
                print "[E]->HTTP Error:" + str(e.code)
                if e.code == 404:
                    htmltext = PageStopHtml
                    break
                if e.code == 504:
                    print "[E]->HTTP Error 504: Gateway Time-out, Wait"
                    time.sleep(5)
                else:
                    break

        if htmltext.find(PageStopHtml) >= 0:
            print "End Page."
            break
        else:
            findReturn = GetHtmlFind(htmltext, ArticleListFind)

            # Keep only links that have text and point back to the target site
            for tag in findReturn:
                if tag.string is not None and tag['href'].find(TargetHost) >= 0:
                    print "[G]->Article:" + tag.string + "|Url:" + tag['href']
                    articles.append({'name': tag.string, 'url': tag['href']})

            page += PageStep

    return articles

# Entry point: collect the categories, then walk each category's article list
print "[G]->GetCategory"
Mycategorys = GetCategory()
print "[G]->GetCategory->Success."
time.sleep(3)
for category in Mycategorys:
    print "[G]->GetArticleList:" + category['name']
    GetArticleList(category['url'])
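
The script above targets Python 2 (urllib2 and print statements). As a rough illustration of how its paging loop could look on Python 3, here is a minimal sketch using only the standard library. The names iter_pages, fetch, fetch_page, retry_wait and max_retries are hypothetical and do not appear in the original. The sketch also makes two assumptions that differ from the original: it caps the number of 504 retries (the script retries without limit), and it re-raises any other HTTP error instead of silently leaving the retry loop.

```python
import time
import urllib.error
import urllib.request

PAGE_URL = "page/#page/"  # same pattern as PageUrl in the script


def fetch(url):
    """Fetch a URL and return its body as text (assumes UTF-8 content)."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")


def iter_pages(category_url, fetch_page=fetch, start=1, step=1,
               retry_wait=5, max_retries=3):
    """Yield (page_number, html) until a 404 marks the end of the listing."""
    page = start
    while True:
        url = category_url + PAGE_URL.replace("#page", str(page))
        html = None
        for attempt in range(max_retries + 1):
            try:
                html = fetch_page(url)
                break
            except urllib.error.HTTPError as e:
                if e.code == 404:
                    return  # no more pages in this category
                if e.code == 504 and attempt < max_retries:
                    time.sleep(retry_wait)  # gateway timeout: wait, then retry
                    continue
                raise  # retries exhausted or unexpected error: surface it
        yield page, html
        page += step
```

Passing the fetch function in as a parameter keeps the paging logic testable without network access; in real use, each yielded html string would be handed to the link extraction step.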

(Editor: admin)





