肿瘤康复网,内容丰富有趣,生活中的好帮手!
肿瘤康复网 > python爬取新浪新闻意义_爬取新浪新闻

python爬取新浪新闻意义_爬取新浪新闻

时间:2021-12-30 14:28:39

相关推荐

[Python] 纯文本查看 复制代码import requests

import os

from bs4 import BeautifulSoup

import re

# 爬取具体每个新闻内容

def getNews(url,title):

response = requests.get(url)

response.encoding = 'utf-8'

soup = BeautifulSoup(response.text,'lxml')

news_Date = soup.find('div',class_='date-source').span.string

news_Source = soup.find('div',class_='date-source').a.string

news_Content = soup.find('div',id='article').get_text()

rep = pile("[\s+\.\!\/_,$%^*(+\"\']+|[+<>?、~*()]+")

title = rep.sub('', title)

title = title.replace(':', ':')

folder = os.path.join(os.getcwd(),'news\\') # 这里有个坑:python中字符串的最后一个字符是斜杠会导致出错。

if not os.path.exists(folder):

os.mkdir(folder)

file_name = folder + title + '.txt'

#print(file_name)

with open(file_name,'w',encoding="utf-8") as file:

file.write(news_Date)

file.write(news_Source)

file.write(title)

file.write(news_Content)

# 获取各个新闻标题和链接

def getNews_title(url):

response = requests.get(url)

response.encoding = 'utf-8'

soup = BeautifulSoup(response.text,'lxml')

tag = soup.find('div',class_='ct_t_01')

for tag in tag.find_all('a',attrs={"target":"_blank"}):

news_site = tag.get('href')

news_title = tag.get_text()

getNews(news_site,news_title)

#运行程序

url = "/"

getNews_title(url)

如果觉得《python爬取新浪新闻意义_爬取新浪新闻》对你有帮助,请点赞、收藏,并留下你的观点哦!

本内容不代表本网观点和政治立场,如有侵犯你的权益请联系我们处理。
网友评论
网友评论仅供其表达个人看法,并不表明网站立场。