How do I build an IP pool for a Python crawler? - Proxy IP

How do I build an IP pool for a Python crawler?

Asked: 2016/8/15 11:42:31

As the title says: how do I avoid getting banned?

#1 (unknown user)

If your crawler has fairly high stability requirements, just buy proxy IPs.
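
For readers who go the paid route, a pool can be as small as a round-robin list that retires endpoints after repeated failures. A minimal sketch, assuming the provider hands you plain `host:port` endpoints; the `ProxyPool` class, its `max_failures` threshold, and the addresses in the usage note are all hypothetical, not any vendor's API:

```python
import itertools


class ProxyPool:
    """Round-robin pool that retires proxies after repeated failures (illustrative only)."""

    def __init__(self, proxies, max_failures=3):
        unique = list(dict.fromkeys(proxies))  # drop duplicates, keep order
        self._failures = {p: 0 for p in unique}
        self._max_failures = max_failures
        self._cycle = itertools.cycle(unique)

    def get(self):
        """Return the next live proxy, skipping retired ones."""
        for _ in range(len(self._failures)):
            proxy = next(self._cycle)
            if self._failures[proxy] < self._max_failures:
                return proxy
        raise RuntimeError("no live proxies left in the pool")

    def report_failure(self, proxy):
        """Count one more consecutive failure for this proxy."""
        self._failures[proxy] += 1

    def report_success(self, proxy):
        """Reset the failure counter after a successful request."""
        self._failures[proxy] = 0

    @staticmethod
    def as_requests_proxies(proxy):
        """Build the mapping that requests accepts via its proxies= argument."""
        return {"http": f"http://{proxy}", "https": f"http://{proxy}"}
```

A crawler would call `pool.get()` before each request, pass `pool.as_requests_proxies(proxy)` to `requests.get(..., proxies=...)`, and report the outcome back so that dead endpoints drop out of the rotation.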

#2 (anonymous user)

Use multiple Tor instances.
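
One way to read this suggestion: run several Tor processes, each with its own `SocksPort`, and spread requests across them so that traffic leaves through different circuits. A minimal sketch, assuming the instances are already listening locally (the ports are whatever you configured when launching Tor) and that `requests` is installed with SOCKS support (`pip install requests[socks]`); `tor_proxies`, `make_rotation`, and `fetch_via_tor` are illustrative names:

```python
import itertools


def tor_proxies(port, host="127.0.0.1"):
    """Proxy mapping for one Tor instance; socks5h also resolves DNS through Tor."""
    url = f"socks5h://{host}:{port}"
    return {"http": url, "https": url}


def make_rotation(ports):
    """Endless iterator that cycles through the proxy mappings of all instances."""
    return itertools.cycle([tor_proxies(p) for p in ports])


def fetch_via_tor(url, rotation, timeout=60):
    """Fetch url through the next Tor instance in the rotation."""
    import requests  # imported lazily so the helpers above work without it

    return requests.get(url, proxies=next(rotation), timeout=timeout)
```

With three instances on ports 9050 to 9052 (an assumed setup), `rotation = make_rotation([9050, 9051, 9052])` followed by repeated `fetch_via_tor(url, rotation)` calls sends each request through a different instance in turn. The generous default timeout is deliberate, since requests over Tor tend to be slow.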

#3 (unknown user)

Write a second crawler that scrapes proxy IP listings from sites that publish them.
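
The extraction step of such a crawler can be sketched as follows, assuming the listing exposes addresses as plain `ip:port` text (or as IP and port separated by whitespace once the tags are stripped). The function name is made up for illustration, and real listing sites differ in layout and often need a proper HTML parser instead:

```python
import ipaddress
import re

# An IPv4-looking token, then a colon or whitespace, then a 2-5 digit port.
PROXY_RE = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3})[:\s]+(\d{2,5})\b")


def extract_proxies(text):
    """Return unique 'ip:port' strings found in text, in order of appearance."""
    found = []
    for ip, port in PROXY_RE.findall(text):
        try:
            ipaddress.IPv4Address(ip)  # rejects octets above 255
        except ValueError:
            continue
        if not 0 < int(port) < 65536:
            continue
        candidate = f"{ip}:{port}"
        if candidate not in found:
            found.append(candidate)
    return found
```

The result is only a list of candidates; scraped addresses still need to be tested before use.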

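#4 (unknown user)

If you don't need much stability, scrape proxies. You don't have to reinvent the wheel: there are plenty of ready-made crawlers for this on GitHub.
If you need more stability, buy proxies. With the free proxies found online, a per-request success rate of 50% is already quite good, while some paid proxies reach success rates above 80%.
If you run on a single machine and want neither to scrape nor to buy proxies, you can try Tor, although requests through it take a long time.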
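
To put those success rates in perspective: if each request through a proxy succeeds independently with probability p, then retrying through up to k different proxies succeeds with probability 1 - (1 - p)^k. The independence assumption is a simplification (proxies from one list often fail together), and the helper names below are illustrative:

```python
def success_after_retries(p, attempts):
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1 - (1 - p) ** attempts


def attempts_needed(p, target):
    """Smallest number of tries whose overall success probability reaches target."""
    if not 0 < p <= 1 or not 0 < target < 1:
        raise ValueError("need 0 < p <= 1 and 0 < target < 1")
    attempts = 1
    while success_after_retries(p, attempts) < target:
        attempts += 1
    return attempts


if __name__ == "__main__":
    for rate in (0.5, 0.8):
        print(rate, success_after_retries(rate, 3), attempts_needed(rate, 0.99))
```

Under this model, a 50% proxy needs 7 tries to reach 99% overall success, while an 80% proxy needs only 3, which is one way to quantify what the paid option buys.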

#5 (站大爷 user)

Buy proxy IPs from 站大爷.

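#6 (unknown user)

Just buy them; the free ones are all very unstable.
I just dug up some code I wrote a while back for scraping xici proxies, which you can use as a reference. It isn't completely finished and I can't be bothered to fix it up, mainly because the share of usable IPs is so low that it isn't worth much.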

# Only MySQL is supported, via pymysql. Make sure your server has no existing database named proxy, otherwise this will fail.
import calendar
import time

import pymysql
import requests
from bs4 import BeautifulSoup

def dbstatus(user,pwd):# Check whether the proxy database exists; create it if it does not.
    conn=pymysql.connect(host='127.0.0.1',user=user,passwd=pwd,charset='utf8')
    cur=conn.cursor()
    conn.select_db('information_schema')
    dbstat=cur.execute('SELECT * from SCHEMATA where SCHEMA_NAME="proxy"')
    if dbstat==0:
        print('Database does not exist, creating it...')
        cur.execute("CREATE DATABASE proxy DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci")# Create the proxy database, utf8 by default
        conn.select_db('proxy')
        print('Creating table temp...')
        cur.execute("""
        CREATE TABLE `temp` (
        `id` int(11) NOT NULL AUTO_INCREMENT,
        `ip` varchar(255) COLLATE utf8_bin NOT NULL,
        `port` varchar(255) COLLATE utf8_bin NOT NULL,
        `place` varchar(255) COLLATE utf8_bin,
        `iptype` varchar(255) COLLATE utf8_bin NOT NULL,
        `time` varchar(255) COLLATE utf8_bin NOT NULL,
        PRIMARY KEY (`id`)
        )ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin AUTO_INCREMENT=1 ;
        """)
        print('Table temp created')
        print('Creating table iptable...')
        cur.execute("""
        CREATE TABLE `iptable` (
        `id` int(11) NOT NULL AUTO_INCREMENT,
        `ip` varchar(255) COLLATE utf8_bin NOT NULL,
        `port` varchar(255) COLLATE utf8_bin NOT NULL,
        `place` varchar(255) COLLATE utf8_bin,
        `iptype` varchar(255) COLLATE utf8_bin NOT NULL,
        `time` varchar(255) COLLATE utf8_bin NOT NULL,
        PRIMARY KEY (`id`)
        )ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin AUTO_INCREMENT=1 ;
        """)
        print('Table iptable created')
        print('Setup complete')
    cur.close()
    conn.close()

def xicidaili():# Load the xicidaili (西刺代理) listings into the database
    nows=time.time()
    home='http://www.xicidaili.com'
    guonei='http://www.xicidaili.com/nn/'# domestic proxies
    guowai='http://www.xicidaili.com/wn/'# foreign proxies
    def getiplist(url):
        count=0
        sql="INSERT INTO temp(ip,port,place,iptype,time) VALUES (%s,%s,%s,%s,%s)"
        headers_base = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Encoding': 'gzip, deflate, sdch',
            'Accept-Language': 'en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4,zh-TW;q=0.2',
            'Connection': 'keep-alive',
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.130 Safari/537.36',
        }
        conn=pymysql.connect(host='127.0.0.1',user='root',passwd='root',db='proxy',charset='utf8')
        cur=conn.cursor()
        s = requests.session()
        res = s.get(url, headers=headers_base)
        ipsoup=BeautifulSoup(res.content.decode('utf8'), "html.parser")
        nextpagetgae=ipsoup.find_all(attrs={'class':'next_page'})
        while nextpagetgae!=[]:
            iptable=ipsoup.find(attrs={'id':'ip_list'})
            iplist=iptable.find_all('tr')
            for i in iplist[1:]:# skip the header row
                ipinfo=i.find_all(name='td')
                ip=ipinfo[1].string
                port=ipinfo[2].string
                place=ipinfo[3].string
                iptype=ipinfo[5].string
                updatetime='20'+ipinfo[9].string# the site prints a two-digit year
                t=(int(updatetime[:4]),int(updatetime[5:7]),int(updatetime[8:10]),int(updatetime[11:13]),int(updatetime[14:16]),0,0,0,0)
                # timegm reads the tuple as-is, with no local-timezone shift (mktime would apply one),
                # so the +28800 correction below holds whatever timezone this machine is in.
                updatetimes=calendar.timegm(t)
                if nows+28800-updatetimes>86400*1:# Stop once an entry is more than X days old; 28800 corrects for the 8-hour UTC+8 offset, 86400 is one day
                    conn.commit()
                    cur.close()
                    conn.close()
                    print('Imported %d rows'%count)
                    return
                if place is None:# the location is sometimes wrapped in a link
                    place=ipinfo[3].a.string
                cur.execute(sql,[ip,port,place.strip(),iptype,updatetime])
                count+=1
            conn.commit()
            if nextpagetgae[0].get('href') is None:# last page reached
                break
            nextpage=home+nextpagetgae[0].get('href')
            ipsoup=BeautifulSoup(s.get(nextpage,headers=headers_base).content.decode('utf8'), "html.parser")
            nextpagetgae=ipsoup.find_all(attrs={'class':'next_page'})
        cur.close()
        conn.close()
        print('Imported %d rows'%count)
    getiplist(guonei)
    getiplist(guowai)

dbstatus('root','root')
xicidaili()
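
Since so few scraped addresses actually work, a validation pass before use is usually worthwhile. The following is a minimal sketch and not part of the poster's code: `probe` and `filter_working` are hypothetical helpers, the test URL `https://httpbin.org/ip` is an assumption (any stable endpoint you are allowed to hit would do), and the checker is injectable so the filtering logic can be exercised without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def probe(proxy, test_url="https://httpbin.org/ip", timeout=5):
    """Return True if a GET through proxy ('ip:port') succeeds within timeout."""
    import requests  # imported lazily so filter_working is usable without it

    proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    try:
        return requests.get(test_url, proxies=proxies, timeout=timeout).ok
    except requests.RequestException:
        return False


def filter_working(candidates, check=probe, workers=20):
    """Check candidates concurrently and keep the ones that pass, in input order."""
    candidates = list(candidates)
    if not candidates:
        return []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(check, candidates))
    return [c for c, ok in zip(candidates, results) if ok]
```

Threads suit this job because each check spends nearly all of its time waiting on the network; running the filter over the rows in `temp` before copying survivors elsewhere is one possible use.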

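#7 (unknown user)

Following on from @黄哥's approach: Python code that scrapes proxy IPs that are usable in real time. I wrote it a while ago. Note: it was first published on my 开源中国 (OSChina) blog.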

#8 (unknown user)

Scrape proxies.