How to Scrape News Websites with a Python Crawler: A Hands-On Tutorial

悠悠楠杉

2025-11-21

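Once on the article detail page, the goal is to extract four key fields: title, keywords, description, and body. The title usually sits inside the <h1> tag; the description may be in <meta name="description">; the body is mostly concentrated in a <div> container with a specific class.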

python
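import requests
from bs4 import BeautifulSoup

def extract_article(url):
    # `headers` is the request-header dict defined earlier in the tutorial
    res = requests.get(url, headers=headers, timeout=10)
    art_soup = BeautifulSoup(res.text, 'lxml')

    # Title: text of the first <h1>, with a fallback when it is missing
    h1 = art_soup.find('h1')
    title = h1.get_text(strip=True) if h1 else "Unknown title"

    # Description: content attribute of <meta name="description">
    desc_tag = art_soup.find('meta', attrs={'name': 'description'})
    description = desc_tag.get('content', '') if desc_tag else ""

    # Body: join the text of every <p> inside the article container
    content_div = art_soup.find('div', class_='article-content')
    paragraphs = content_div.find_all('p') if content_div else []
    full_text = '\n'.join(p.get_text(strip=True) for p in paragraphs)

    # Keep roughly the first 1000 characters
    body = full_text[:1000]
    return title, description, body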
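A hard cut at 1000 characters can end in the middle of a sentence. Since the aim is only "roughly" 1000 characters, one possible refinement is to cut at the last sentence-ending punctuation mark inside that window. This is a minimal sketch; the helper name truncate_at_sentence and its parameters are illustrative assumptions, not part of the original tutorial.

```python
def truncate_at_sentence(text, limit=1000, enders="。！？.!?"):
    """Cut text to at most `limit` characters, preferring a sentence boundary."""
    if len(text) <= limit:
        return text
    head = text[:limit]
    # Index of the last sentence-ending character inside the window (-1 if none)
    cut = max(head.rfind(ch) for ch in enders)
    # Fall back to a plain hard cut when no boundary is found
    return head[:cut + 1] if cut != -1 else head
```

It could stand in for the slice, for example body = truncate_at_sentence(full_text), at the cost of sometimes returning noticeably fewer than 1000 characters.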

For keyword extraction, jieba can handle both Chinese word segmentation and TF-IDF weighting:

python
import jieba.analyse

keywords = jieba.analyse.extract_tags(body, topK=5, withWeight=False)

Finally, organize these fields into a dictionary, which is convenient to store as JSON or write to a database.

Saving the Data and Further Processing

Once scraping is finished, the result can be written to a local file:

python
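import json

# Collect the extracted fields into one record
data = {
    "title": title,
    "keywords": keywords,
    "description": description,
    "body": body,
    "source_url": url,
}

# ensure_ascii=False keeps Chinese text readable in the output file
with open("newsdata.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)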
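Opening the file in "w" mode keeps only the most recent article, so a crawl over many pages may be better served by the database option mentioned earlier. The following is a minimal sketch using Python's built-in sqlite3 module; the table name articles, the column layout, and the helper save_article are assumptions made for illustration, not something the original tutorial prescribes.

```python
import json
import sqlite3

def save_article(conn, data):
    """Insert one article dict, skipping URLs that are already stored."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS articles (
               source_url  TEXT PRIMARY KEY,
               title       TEXT,
               keywords    TEXT,
               description TEXT,
               body        TEXT
           )"""
    )
    cur = conn.execute(
        "INSERT OR IGNORE INTO articles VALUES (?, ?, ?, ?, ?)",
        (
            data["source_url"],
            data["title"],
            # The keyword list is stored as a JSON string
            json.dumps(data["keywords"], ensure_ascii=False),
            data["description"],
            data["body"],
        ),
    )
    conn.commit()
    return cur.rowcount == 1  # True only when a new row was written
```

A call such as save_article(sqlite3.connect("news.db"), data) would then add the record, where news.db is a hypothetical file name. Using the URL as the primary key means re-crawling the same page does not create duplicates.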

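For long-running jobs, consider using the schedule library to run the crawl as a timed task, or move to the Scrapy framework as the basis for a more robust, potentially distributed crawler.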
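To illustrate the idea of a timed task without adding a dependency, here is a sketch of a periodic loop built only on the standard library. It is a stand-in for the schedule library, not an example of its API; run_periodically, job, and the sleep parameter are hypothetical names introduced for this example.

```python
import time

def run_periodically(job, interval_seconds, max_runs=None, sleep=time.sleep):
    """Call job() repeatedly, pausing interval_seconds between runs."""
    runs = 0
    while max_runs is None or runs < max_runs:
        try:
            job()
        except Exception as exc:
            # Keep the loop alive if a single crawl fails
            print(f"job failed: {exc}")
        runs += 1
        # Sleep only when another run will follow
        if max_runs is None or runs < max_runs:
            sleep(interval_seconds)
    return runs
```

Calling run_periodically(crawl_once, 1800), with crawl_once being whatever function performs one full crawl, would repeat the job every 30 minutes until the process is stopped. The injectable sleep argument exists only so the loop can be tested without actually waiting.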

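The whole process looks simple, but in practice it means constantly debugging selectors, coping with page changes, and handling encoding errors. A truly skilled crawler developer understands not only code but also content structure and web ethics. Mastering this skill lets you "pan for gold" on the internet, but always remember: technology goes far only when it is used for good.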

Copyright: 至尊技术网

Article link: https://www.zzwws.cn/archives/38948/ (when reposting, please credit the source and include the article link)

License: Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
