The ```href``` of each ```a``` tag is the link to an article. With bs4, we grab all the ```a``` tags inside the ```#main-news``` block.
We search for them with a CSS selector.
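As a quick illustration of what ```select('#main-news h4 a')``` matches, here is a minimal sketch on a made-up HTML fragment (the headlines and hrefs are invented):
```
from bs4 import BeautifulSoup

# Made-up fragment with the same structure as the Sohu news list
html = '''
<div id="main-news">
  <h4><a href="//www.sohu.com/a/111111111_000001">first headline</a></h4>
  <h4><a href="//www.sohu.com/a/222222222_000002">second headline</a></h4>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.select('#main-news h4 a'))   # list of the two <a> tags
```
The function below runs the same selector against the real list page: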
```
import requests
from bs4 import BeautifulSoup

url='http://www.sohu.com/c/8/1460'

def main():
    '''
    Visit the home page and scrape the links in the news list
    '''
    global session
    session=requests.session()
    with session.get(url) as resp:
        if resp.status_code == 200:
            soup = BeautifulSoup(resp.text,"html.parser")
            # CSS selector, returns a list
            l=soup.select('#main-news h4 a')
            print(l)
```
The result is:
```
[<a href="//www.sohu.com/a/…">2018年宁夏处分区管干部104人 追回外逃人员4人</a>,
 <a href="//www.sohu.com/a/…">金隅集团20亿元公司债券明日上市</a>,
 <a href="//www.sohu.com/a/…">福州地铁2号线将于5月1日前开通载客试运营</a>,
 <a href="//www.sohu.com/a/…">阿拉伯经济峰会仅三国首脑出席,东道主黎巴嫩呼吁叙利亚难民回家</a>]
```
Note that the links above do not start with ```http:```.
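The crawler below simply prepends ```http:``` to these protocol-relative links. An equivalent, slightly more general way is urljoin from the standard library; a minimal sketch (the sample href is made up):
```
from urllib.parse import urljoin

href = '//www.sohu.com/a/123456789_000000'   # made-up, protocol-relative href
full_url = urljoin('http://www.sohu.com/', href)
print(full_url)                              # -> http://www.sohu.com/a/123456789_000000
# The code in this article does the same thing with a plain string concat: 'http:' + href
```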
## 3. Get the ```href``` attribute of the ```a``` tags
```
import requests
from bs4 import BeautifulSoup
import time

url='http://www.sohu.com/c/8/1460'

def main():
    '''
    Visit the home page and scrape the links in the news list
    '''
    global session
    session=requests.session()
    with session.get(url) as resp:
        if resp.status_code == 200:
            soup = BeautifulSoup(resp.text,"html.parser")
            l=soup.select('#main-news h4 a')
            print(l)
            if l:
                for item in l:
                    print(type(item),type(item['href']),item.get('href'))
                    time.sleep(1)  # sleep 1 second; hammering the server gets the client blocked
                    get_article('http:'+item.get('href'))  # prepend http:
```
Note that this code calls the get_article() function, which we have not defined yet.
## 4. Parse the article detail page and scrape the title and content
Define a function get_article() that takes the link of an article detail page, sends the request with requests, parses the page with bs4, extracts the title and content, prepares to download the images referenced in the content, removes the parts we do not need, and writes the title and content to an HTML file.
As follows:
```
def get_article(url):
    '''
    Fetch the link, extract the article title and content, download the images
    '''
    global session
    print("\n\narticle url",url)
    try:
        resp=session.get(url)
        if resp.status_code==200:
            soup = BeautifulSoup(resp.text,"html.parser")
            title=soup.select('h1')[0].text.strip()
            article=soup.select('article')[0]
            print('title',title)
            # print('content',article.prettify())
            article2=download_img(article)  # download the images in the article content
            article2.select('#backsohucom')[0].decompose()  # remove the "back to Sohu" link
            article2.select('p[data-role="editor-name"]')[0].decompose()  # remove the editor line
            write_html(url[url.rfind('/')+1:],title,article2.prettify())  # write title and content to an HTML file
            # print('content with rewritten image paths:',article)
            print('-'*20,'\n\n\n')
    except BaseException as e:
        raise
```
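The two decompose() calls above delete a tag from the parsed tree in place. A minimal standalone sketch of that behaviour (the HTML fragment is made up):
```
from bs4 import BeautifulSoup

html = '<article><p>keep me</p><p data-role="editor-name">editor: someone</p></article>'
soup = BeautifulSoup(html, 'html.parser')

# Remove the editor line, exactly as get_article() does on the real page
soup.select('p[data-role="editor-name"]')[0].decompose()
print(soup.prettify())   # only <p>keep me</p> remains inside <article>
```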
## 5. Download the images in the article content
Pass in the article content and find all of its ```img``` tags. Download each image via its ```src``` attribute into the img folder, and rewrite the ```src``` attribute of each ```img``` tag so that it points to the local copy of the image.
```
def download_img(article):
    '''
    Download the images
    '''
    try:
        imgs=article.select('img')
        print('number of images in this article:',len(imgs))
        for item in imgs:
            url=item['src']
            print('image url:',url)
            if not url.startswith('http:'):
                url='http:'+url
            name=url[url.rfind('/')+1:]
            resp=session.get(url)
            # the response body is the binary image data
            img = resp.content
            # write the image data to a local file
            with open('./img/'+name,'wb') as f:
                f.write(img)
            # rewrite the src attribute of the img tag to point at the local image
            item['src']='../img/'+name
        return article  # return the article so the caller gets the rewritten content
    except BaseException as e:
        raise
```
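This function assumes an ```img``` folder already exists next to the script. One way to guarantee that before crawling is the standard-library call os.makedirs:
```
import os

# Create ./img if it is missing; a no-op if it already exists
os.makedirs('./img', exist_ok=True)
```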
## 6. Write the title and article content to an HTML file
```
def write_html(filename,title,article):
    '''
    Generate the HTML file
    '''
    with open('./html/'+filename+'.html','w',encoding='utf-8') as f:
        f.write(title)
        f.write('\n<br/>\n')
        f.write(article)
```
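A quick usage sketch (the filename and body are made up for illustration; like the img folder above, the ```html``` folder has to exist first):
```
import os

os.makedirs('./html', exist_ok=True)  # make sure the output folder exists
# Writes ./html/290000000_120392.html: title, a line break, then the article markup
write_html('290000000_120392', 'Example title', '<article><p>Example body</p></article>')
```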
## 7. Fetch the next page
This page loads more data as you scroll down. Inspecting it with Chrome DevTools, we can see that scrolling triggers a request whose URL contains the keyword feed; see the screenshot below:
![](http://www.malaoshi.top/upload/0/0/1EF2eaU4c80D.png)
In that URL, the page parameter is the page number, starting from 2.
**Response content:** it is not plain JSON; it is wrapped in a jQuery callback (JSONP), so it needs some extra processing before it can be parsed.
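For illustration, a minimal sketch of stripping such a callback wrapper down to the JSON array (the sample response string and its numbers are made up):
```
import json

# Made-up JSONP response in the same shape the feed endpoint returns
text = 'jQuery1124049905598042107147_1547826301316([{"id": 290000000, "authorId": 120392}]);'

# Keep only what sits between the first '[' and the last ']' -- the same trick used in next_page() below
payload = text[text.find('['):text.rfind(']')+1]
data = json.loads(payload)
print(data[0]['id'], data[0]['authorId'])   # -> 290000000 120392
```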
The code is as follows:
```
import time
import json

url2='http://v2.sohu.com/public-api/feed?scene=CATEGORY&sceneId=1460&page=%d&size=20&callback=jQuery1124049905598042107147_1547826301316&_=1547826301366'
headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'
    ,'Accept': '*/*'
    ,'Accept-Encoding': 'gzip, deflate'
    ,'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8'
    ,'Connection': 'keep-alive'
    ,'Host': 'v2.sohu.com'
    ,'Referer': 'http://www.sohu.com/c/8/1460'
    ,'Cookie': 'SUV=180508174953J48I; gidinf=x099980109ee0dcaa6cfc80400002e3662edd8b4046d; beans_mz_userid=fybEf0w6qeeq; IPLOC=CN2200; reqtype=pc; t=1548041648603'
}

def next_page():
    '''
    Fetch the next pages
    '''
    global session
    for item in range(2,30):
        time.sleep(1)
        try:
            print('\n\nfetching page %d of the list-------------------------\n'%item)
            url=url2%item
            resp=session.get(url,headers=headers)
            if resp.status_code==200:
                text=resp.text
                # print(text[-1:])
                if text[-1:]!=';':
                    print(text)
                    print('incomplete response------------')
                    continue
                text=text[text.find('['):text.rfind(']')+1]  # strip the JSONP callback wrapper
                print(text)
                j=json.loads(text)
                print('\n\nnumber of items:',len(j))
                for d in j:
                    time.sleep(2)
                    print('\n\n',d['id'],d['authorId'])
                    get_article('http://www.sohu.com/a/%d_%d'%(d['id'],d['authorId']))
        except BaseException as e:
            raise
```
## Full code
```
# coding: utf-8
import requests
from bs4 import BeautifulSoup
import time
import json

url='http://www.sohu.com/c/8/1460'
url2='http://v2.sohu.com/public-api/feed?scene=CATEGORY&sceneId=1460&page=%d&size=20&callback=jQuery1124049905598042107147_1547826301316&_=1547826301366'
headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'
    ,'Accept': '*/*'
    ,'Accept-Encoding': 'gzip, deflate'
    ,'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8'
    ,'Connection': 'keep-alive'
    ,'Host': 'v2.sohu.com'
    ,'Referer': 'http://www.sohu.com/c/8/1460'
    ,'Cookie': 'SUV=180508174953J48I; gidinf=x099980109ee0dcaa6cfc80400002e3662edd8b4046d; beans_mz_userid=fybEf0w6qeeq; IPLOC=CN2200; reqtype=pc; t=1548041648603'
}
session=None

def main():
    '''
    Visit the home page and scrape the links in the news list
    '''
    global session
    session=requests.session()
    with session.get(url) as resp:
        if resp.status_code == 200:
            soup = BeautifulSoup(resp.text,"html.parser")
            l=soup.select('#main-news h4 a')
            print(l)
            if l:
                for item in l:
                    time.sleep(1)
                    print(type(item),type(item['href']),item.get('href'))
                    get_article('http:'+item.get('href'))
            next_page()

def get_article(url):
    '''
    Fetch the link, extract the article title and content, download the images
    '''
    global session
    print("\n\narticle url",url)
    try:
        resp=session.get(url)
        if resp.status_code==200:
            soup = BeautifulSoup(resp.text,"html.parser")
            title=soup.select('h1')[0].text.strip()
            article=soup.select('article')[0]
            print('title',title)
            # print('content',article.prettify())
            article2=download_img(article)
            article2.select('#backsohucom')[0].decompose()
            article2.select('p[data-role="editor-name"]')[0].decompose()
            write_html(url[url.rfind('/')+1:],title,article2.prettify())
            # print('content with rewritten image paths:',article)
            print('-'*20,'\n\n\n')
    except BaseException as e:
        raise

def write_html(filename,title,article):
    '''
    Generate the HTML file
    '''
    with open('./html/'+filename+'.html','w',encoding='utf-8') as f:
        f.write(title)
        f.write('\n<br/>\n')
        f.write(article)

def download_img(article):
    '''
    Download the images
    '''
    try:
        imgs=article.select('img')
        print('number of images in this article:',len(imgs))
        for item in imgs:
            url=item['src']
            print('image url:',url)
            if not url.startswith('http:'):
                url='http:'+url
            name=url[url.rfind('/')+1:]
            resp=session.get(url)
            # the response body is the binary image data
            img = resp.content
            # write the image data to a local file
            with open('./img/'+name,'wb') as f:
                f.write(img)
            # rewrite the src attribute of the img tag to point at the local image
            item['src']='../img/'+name
        return article  # return the article so the caller gets the rewritten content
    except BaseException as e:
        raise

def next_page():
    '''
    Fetch the next pages
    '''
    global session
    for item in range(2,30):
        time.sleep(1)
        try:
            print('\n\nfetching page %d of the list-------------------------\n'%item)
            url=url2%item
            resp=session.get(url,headers=headers)
            if resp.status_code==200:
                text=resp.text
                # print(text[-1:])
                if text[-1:]!=';':
                    print(text)
                    print('incomplete response------------')
                    continue
                text=text[text.find('['):text.rfind(']')+1]
                print(text)
                j=json.loads(text)
                print('\n\nnumber of items:',len(j))
                for d in j:
                    time.sleep(2)
                    print('\n\n',d['id'],d['authorId'])
                    get_article('http://www.sohu.com/a/%d_%d'%(d['id'],d['authorId']))
        except BaseException as e:
            raise

if __name__ == "__main__":
    main()
```
Original source: http://www.malaoshi.top/show_1EF2eVQdskM9.html