Crawling Toutiao Ajax Requests: Scraping Toutiao Data

悠悠楠杉

2025-06-22

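1. Install the required libraries

First, make sure the requests, beautifulsoup4, and lxml libraries are installed. The script imports BeautifulSoup from bs4 and uses lxml as the parser. All three can be installed with pip:

`pip install requests beautifulsoup4 lxml`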
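
As a quick sanity check after installation, the sketch below reports which of the modules used by the script cannot be imported. It relies only on the standard library; the `REQUIRED` tuple is an illustrative name, and it lists import names rather than pip package names.

```python
import importlib.util

# Import names differ from pip package names: beautifulsoup4 is imported as bs4.
REQUIRED = ("requests", "bs4", "lxml")

# find_spec returns None when a top-level module cannot be located.
missing = [name for name in REQUIRED if importlib.util.find_spec(name) is None]

if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

If anything is reported as missing, rerunning the pip command for that package should resolve it.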

2. Write the crawler code

```python
import re

import requests
from bs4 import BeautifulSoup


def getarticledata(url):
    # Send the request with a browser-like User-Agent
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
    }
    response = requests.get(url, headers=headers, timeout=10)
    soup = BeautifulSoup(response.text, 'lxml')

    # Extract the title, keywords, description, and body text
    title_tag = soup.find('h1', class_='c-single-title-main')
    title = title_tag.get_text() if title_tag else 'No title'
    tags = re.findall(r'class="c-single-info-tags" data-text="([^"]+)"', str(soup))
    keywords = tags[0] if tags else 'No keywords'
    summary_tag = soup.find('div', class_='c-single-summary')
    description = summary_tag.get_text() if summary_tag else 'No description'

    content = ''
    for p in soup.find_all('p'):  # Join paragraph text into the body
        content += p.get_text() + '\n\n'
        if len(content) > 1000:  # Stop once about 1000 characters are collected
            break
    if len(content) < 1000:
        # Too short: fall back to the first <div class="c-single-content">.
        # Searching by class avoids a KeyError on <div> tags without a class attribute.
        content_div = soup.find('div', class_='c-single-content')
        if content_div:
            content += content_div.get_text() + '\n\n'
    return {
        'title': title,
        'keywords': keywords,
        'description': description,
        'content': content[:1000]  # Cap the body at 1000 characters
    }


def generatemarkdown(data):
    # Assemble the extracted fields into a Markdown document
    markdowncontent = f"# {data['title']}\n"
    markdowncontent += f"## Keywords\n{data['keywords']}\n"
    markdowncontent += f"## Description\n{data['description']}\n"
    markdowncontent += f"## Body\n{data['content']}\n"
    return markdowncontent


# Example URL (replace with a real article URL)
url = 'https://www.toutiao.com/a6788488526732659749/'
articledata = getarticledata(url)
markdown_output = generatemarkdown(articledata)
print(markdown_output)  # Print the Markdown-formatted result
```
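
The script above reads server-rendered HTML. When a page loads its data through Ajax instead, the response is usually JSON rather than markup, and the parsing step changes accordingly. The following is a minimal sketch of that pattern using only the standard library. The endpoint `AJAX_BASE`, the query parameter names, and the payload shape (`data`, `title`, `article_url`) are all hypothetical placeholders, not Toutiao's actual API; the real values would have to be read from the browser's network panel.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint and parameter names, used only for illustration.
AJAX_BASE = "https://example.com/api/search"


def build_ajax_url(keyword: str, offset: int = 0, count: int = 20) -> str:
    """Build the URL for one page of results from the assumed JSON endpoint."""
    params = {"keyword": keyword, "offset": offset, "count": count}
    return f"{AJAX_BASE}?{urlencode(params)}"


def parse_items(payload_text: str) -> list:
    """Pull title/URL pairs out of a JSON payload shaped like {"data": [...]}."""
    payload = json.loads(payload_text)
    items = []
    for entry in payload.get("data") or []:
        title = entry.get("title")
        if title:  # Skip entries without a title, such as ads or placeholders
            items.append({"title": title, "url": entry.get("article_url", "")})
    return items


# A hand-written sample standing in for a real response body.
SAMPLE = '{"data": [{"title": "Hello", "article_url": "https://example.com/a/1"}, {"ad": true}]}'

print(build_ajax_url("python", offset=20))
print(parse_items(SAMPLE))
```

In a real crawler, the text passed to `parse_items` would come from something like `requests.get(build_ajax_url(...), headers=headers).text`, and paging would be done by increasing `offset`.
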
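Note: the `getarticledata` script above is only an example. In practice, comply with the site's robots.txt rules and copyright requirements, and respect its terms of use. Because Toutiao uses anti-crawling measures, a more involved strategy may be needed, such as routing requests through proxies or handling JavaScript-rendered content. The example assumes static content that can be fetched directly.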
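
One way to act on the robots.txt advice is to check each URL against the site's rules before requesting it. The sketch below uses `urllib.robotparser` from the standard library. The rules in `SAMPLE_RULES` and the crawler name `my-crawler` are made up for illustration and are not Toutiao's real robots.txt.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical rules for illustration; NOT the real robots.txt of any site.
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""

parser = build_parser(SAMPLE_RULES)
allowed = parser.can_fetch("my-crawler", "https://example.com/articles/1")
blocked = parser.can_fetch("my-crawler", "https://example.com/private/data")
print(allowed, blocked)  # True False
```

Against a live site, `RobotFileParser` can also download the file itself: call `set_url()` with the robots.txt address, then `read()`, and query `can_fetch()` in the same way.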

Copyright: 至尊技术网

Article link: https://www.zzwws.cn/archives/30549/ (when reposting, please credit the source and include this link)

License: Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
