How iOS Engineers Can Use Python to Write a Web Crawler

A simple, working crawler written in Python

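Original article: How iOS Programmers Can Use Python to Write a Web Crawler
After reading 叶孤城's article "How iOS Programmers Can Use Python to Write a Web Crawler", I tried writing crawlers of my own for practice.
I have since found that Beautiful Soup 3, which the original article used, is no longer being developed; Beautiful Soup 4 is recommended for current projects.
The sites it crawled have also stopped working one after another, so I am writing down the changes I made on my side, for reference only.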

Simple, working crawler code in Python

by 伍雪颖

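import re
import urllib.request

def getHtml(url):
    # Download the page and decode it to text (Python 3: urlopen lives in urllib.request)
    with urllib.request.urlopen(url) as response:
        return response.read().decode('utf-8', errors='ignore')

def getImage(source):
    # Capture every src="..." value that ends in a literal .jpg
    imgre = re.compile(r'src="([^"]+?\.jpg)"')
    images = imgre.findall(source)
    # Save the images as 0.jpg, 1.jpg, ... in the current directory
    for x, link in enumerate(images):
        urllib.request.urlretrieve(link, '%s.jpg' % x)
    # Return the links so the caller has something to print
    return images

source = getHtml('http://tieba.baidu.com/p/3237470549')
print(getImage(source))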
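
The pattern can be tried on a small inline HTML string before pointing it at a live page. The sketch below is an illustration only: the helper name `find_jpg_links` and the sample markup are made up for this example, and it escapes the dot before `jpg` so that only a literal `.jpg` extension matches.

```python
import re

# Hypothetical helper: collect the src value of every .jpg image in an HTML string.
JPG_SRC = re.compile(r'src="([^"]+?\.jpg)"')

def find_jpg_links(source):
    return JPG_SRC.findall(source)

# Made-up sample markup; no network access is needed.
sample = '''
<img src="http://example.com/a.jpg" width="10">
<img src="http://example.com/b.png">
<img class="x" src="http://example.com/c.jpg">
'''
print(find_jpg_links(sample))
```

Because the character class `[^"]` cannot cross a closing quote, the `.png` image is skipped instead of being swallowed into a longer match.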


Feedback and criticism are welcome.

<pre>
$ sudo pip install beautifulsoup4
or
$ sudo easy_install beautifulsoup4
Note: when initializing, pass the parser explicitly: BeautifulSoup(html, "html.parser")
</pre>

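Site used: 千图网 (58pic.com)
The page structure is shown in the figure below; the key element is the div with class 'show-area-pic'.
From there, just grab the img elements inside it.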

[Figure: 千图网 page layout (千图网页布局.jpg)]
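
The structure described above can be tried out offline with Beautiful Soup 4. This is a minimal sketch under stated assumptions: the markup in `sample_html` is invented to mimic a div with class `show-area-pic`, and `image_links` is a hypothetical helper, not code from the original article. It uses `find_all`, the Beautiful Soup 4 spelling of the search method.

```python
from bs4 import BeautifulSoup

# Invented markup that mimics the structure described above.
sample_html = '''
<div class="header"><img src="http://example.com/logo.png"></div>
<div class="show-area-pic">
  <img src="http://example.com/pic/1.jpg" title="first">
  <img src="http://example.com/pic/2.jpg" title="second">
</div>
'''

def image_links(html):
    # Name the parser explicitly so Beautiful Soup 4 does not have to guess one.
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # Only look inside divs carrying the show-area-pic class.
    for div in soup.find_all("div", class_="show-area-pic"):
        for img in div.find_all("img"):
            src = img.get("src")
            if src:  # skip img tags without a src attribute
                links.append((img.get("title"), src))
    return links

print(image_links(sample_html))
```

The logo image is ignored because it sits in a different div, which is the point of anchoring the search on the class name rather than scanning every img on the page.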

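For the operational details, see the article "How iOS Programmers Can Use Python to Write a Web Crawler"; I won't repeat them here. By default the images are saved to a new folder named 58pic on the desktop. The code could also be changed to store results in sqlite, which is fairly simple; I will update this later.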
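
The sqlite variant mentioned above is not part of the original code. As a rough sketch of what it could look like, using only Python's built-in `sqlite3` module: the table name `images`, its columns, and the helpers `open_db` and `save_record` are all assumptions made for this illustration.

```python
import sqlite3

def open_db(path=":memory:"):
    # ":memory:" keeps the sketch self-contained; pass a file path to persist.
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS images ("
        " url TEXT PRIMARY KEY,"
        " title TEXT,"
        " saved_path TEXT)"
    )
    return conn

def save_record(conn, url, title, saved_path):
    # INSERT OR IGNORE skips URLs that were already recorded.
    conn.execute(
        "INSERT OR IGNORE INTO images (url, title, saved_path) VALUES (?, ?, ?)",
        (url, title, saved_path),
    )
    conn.commit()

conn = open_db()
save_record(conn, "http://example.com/pic/1.jpg", "first", "58pic/1.jpg")
save_record(conn, "http://example.com/pic/1.jpg", "first", "58pic/1.jpg")  # duplicate, ignored
print(conn.execute("SELECT COUNT(*) FROM images").fetchone()[0])
```

Using the image URL as the primary key means a re-run of the crawler does not record the same image twice.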

The full code is as follows:
<pre>

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import os
import urllib.error
import urllib.request

from bs4 import BeautifulSoup


def getAllImageLink(start, end):
    # Save everything into ~/Desktop/58pic, creating the folder if needed
    mainPath = os.path.join(os.path.expanduser('~/Desktop'), '58pic')
    if not os.path.exists(mainPath):
        os.makedirs(mainPath)
    else:
        print('folder already exists')

    # Walk the numbered pages in [start, end); a for loop advances the
    # counter on every path, so a failed page can no longer stall the loop
    for i in range(start, end):
        url = 'http://www.58pic.com/yuanchuang/%d.html' % i
        try:
            response = urllib.request.urlopen(url, data=None, timeout=120)
        except urllib.error.URLError as e:
            print('%s : %s' % (url, e.reason))
            continue
        print(url)
        html = response.read()
        if not html.strip():
            print('not exist URL: %s' % url)
            continue
        soup = BeautifulSoup(html, "html.parser")
        # The images sit inside the div with class show-area-pic
        liResult = soup.find_all('div', attrs={"class": "show-area-pic"})
        for li in liResult:
            for image in li.find_all('img'):
                href = image.get('src')
                if not href:
                    continue
                # Use the last path segment of the URL as the file name
                fileSavePath = os.path.join(mainPath, href.split('/')[-1])
                urllib.request.urlretrieve(href, fileSavePath)
                print(fileSavePath)


if __name__ == '__main__':
    start = 19840500
    getAllImageLink(start, start + 100)
</pre>
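
One detail the listing leaves open is what happens when a `src` attribute is relative or carries a query string. A possible way to handle both with the standard library is sketched below; `local_path_for` is a hypothetical helper and the image URLs are invented, so treat this as an illustration and not as part of the original code.

```python
import os
from urllib.parse import urljoin, urlparse

def local_path_for(page_url, src, folder):
    # Resolve relative or protocol-relative src values against the page URL.
    absolute = urljoin(page_url, src)
    # Take the file name from the URL path only, ignoring any ?query part.
    name = os.path.basename(urlparse(absolute).path)
    return absolute, os.path.join(folder, name)

page = 'http://www.58pic.com/yuanchuang/19840500.html'
print(local_path_for(page, '//pic.example.com/a/b/cover.jpg?v=2', '58pic'))
print(local_path_for(page, '/static/thumb.png', '58pic'))
```

The first value of each returned pair is the URL to download, and the second is where the file would be saved.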
