
Crawling Tens of Millions of Taobao Items with Python


Date: 2021-12-08 14:52  Author: admin


This article presents a worked example of crawling tens of millions of Taobao items with Python, shared here for reference. The implementation is as follows:

import itertools
import json
import time
from queue import Queue
from threading import Thread
from urllib.parse import quote_plus

import leveldb
import requests

URL_BASE = 'http://s.m.taobao.com/search?q={}&n=200&m=api4h5&style=list&page={}'


def url_get(url):
    """Fetch a URL with browser-like headers and return the response text."""
    # print('GET ' + url)
    header = dict()
    header['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
    header['Accept-Encoding'] = 'gzip,deflate,sdch'
    header['Accept-Language'] = 'en-US,en;q=0.8'
    header['Connection'] = 'keep-alive'
    header['DNT'] = '1'
    # header['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36'
    header['User-Agent'] = 'Mozilla/12.0 (compatible; MSIE 8.0; Windows NT)'
    return requests.get(url, timeout=5, headers=header).text


def item_thread(cate_queue, db_cate, db_item):
    """Worker: take a category from the queue and crawl all of its result pages."""
    while True:
        try:
            cate = cate_queue.get()
            # A category counts as done only if its stored state is b'OK'.
            post_exist = True
            try:
                state = db_cate.Get(cate.encode('utf-8'))
                if state != b'OK':
                    post_exist = False
            except KeyError:
                # py-leveldb raises KeyError for a missing key.
                post_exist = False
            if post_exist:
                print('cate-{}: already exists ... Ignore'.format(cate))
                continue
            db_cate.Put(cate.encode('utf-8'), b'crawling')
            for item_page in itertools.count(1):
                url = URL_BASE.format(quote_plus(cate), item_page)
                # Try each page up to 5 times before giving up on the category.
                for tr in range(5):
                    try:
                        items_obj = json.loads(url_get(url))
                        break
                    except KeyboardInterrupt:
                        raise
                    except Exception:
                        if tr == 4:
                            raise
                # An empty page means the category has no more results.
                if len(items_obj['listItem']) == 0:
                    break
                for item in items_obj['listItem']:
                    item_obj = dict(
                        _id=int(item['itemNumId']),
                        name=item['name'],
                        price=float(item['price']),
                        query=cate,
                        category=int(item['category']) if item['category'] != '' else 0,
                        nick=item['nick'],
                        area=item['area'])
                    # Items are keyed by ID, so re-crawled items overwrite themselves.
                    db_item.Put(str(item_obj['_id']).encode('utf-8'),
                                json.dumps(item_obj, ensure_ascii=False).encode('utf-8'))
                print('Get {} items from {}: {}'.format(len(items_obj['listItem']), cate, item_page))
                # Register newly discovered related categories as 'waiting'.
                if 'nav' in items_obj:
                    for na in items_obj['nav']['navCatList']:
                        try:
                            db_cate.Get(na['name'].encode('utf-8'))
                        except KeyError:
                            db_cate.Put(na['name'].encode('utf-8'), b'waiting')
            db_cate.Put(cate.encode('utf-8'), b'OK')
            print(cate, 'OK')
        except KeyboardInterrupt:
            break
        except Exception as e:
            print('An {} exception occurred'.format(e))


def cate_thread(cate_queue, db_cate):
    """Producer: every 10 seconds, enqueue each category not yet marked b'OK'."""
    while True:
        try:
            for key, value in db_cate.RangeIter():
                if value != b'OK':
                    print('CateThread: put {} into queue'.format(key.decode('utf-8')))
                    cate_queue.put(key.decode('utf-8'))
            time.sleep(10)
        except KeyboardInterrupt:
            break
        except Exception as e:
            print('CateThread: {}'.format(e))


if __name__ == '__main__':
    db_cate = leveldb.LevelDB('./taobao-cate')
    db_item = leveldb.LevelDB('./taobao-item')
    # Seed category: the search query the crawl starts from.
    orig_cate = '正装'
    try:
        db_cate.Get(orig_cate.encode('utf-8'))
    except KeyError:
        db_cate.Put(orig_cate.encode('utf-8'), b'waiting')
    cate_queue = Queue(maxsize=1000)
    cate_th = Thread(target=cate_thread, args=(cate_queue, db_cate))
    cate_th.start()
    item_th = [Thread(target=item_thread, args=(cate_queue, db_cate, db_item)) for _ in range(5)]
    for item_t in item_th:
        item_t.start()
    cate_th.join()
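The listing above mixes two steps inside item_thread that are easier to follow (and to test) on their own: retrying a failed request, and turning one `listItem` entry into the record that gets stored. The following is only a minimal sketch of those two steps, assuming the response fields used in the listing. The names `fetch_with_retry` and `parse_item`, the `flaky_fetch` stand-in, and the sample payload are hypothetical and not part of the original script.

```python
import json


def fetch_with_retry(fetch, url, attempts=5):
    """Call fetch(url) and parse the result as JSON, retrying on failure.

    `fetch` is any callable that returns response text, for example the
    url_get function from the listing. The last error is re-raised once
    all attempts are used up.
    """
    for attempt in range(attempts):
        try:
            return json.loads(fetch(url))
        except Exception:
            if attempt == attempts - 1:
                raise


def parse_item(item, query):
    """Convert one entry of items_obj['listItem'] into the stored record."""
    return {
        '_id': int(item['itemNumId']),
        'name': item['name'],
        'price': float(item['price']),
        'query': query,
        # An empty category string is stored as 0, as in the listing.
        'category': int(item['category']) if item['category'] != '' else 0,
        'nick': item['nick'],
        'area': item['area'],
    }


if __name__ == '__main__':
    # Hypothetical payload shaped like the fields the listing reads.
    sample = {'itemNumId': '123', 'name': 'suit', 'price': '199.50',
              'category': '', 'nick': 'shop', 'area': 'Hangzhou'}
    calls = []

    def flaky_fetch(url):
        # Fails twice, then returns valid JSON, to exercise the retry path.
        calls.append(url)
        if len(calls) < 3:
            raise IOError('simulated timeout')
        return json.dumps({'listItem': [sample]})

    data = fetch_with_retry(flaky_fetch, 'http://example.invalid/')
    print(parse_item(data['listItem'][0], 'suit'))
```

Passing the fetch function in as an argument keeps the retry logic independent of requests, so the same helper could wrap url_get in the crawler and a fake function in a test.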

I hope this article is helpful for your Python programming.

(Editor: admin)





