Efficiently Scraping Web Table Data with Python: A Practical Guide to pandas.read_html (Scraping Web Content into Excel with Python)

悠悠楠杉

2025-12-11

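Scraping tabular data from web pages is a frequent need in data analysis and web crawling. The traditional approach usually means parsing the HTML by hand with a dedicated parsing library, whereas the read_html function in pandas handles the common cases with very little code. This article walks through practical usage and the techniques that make it effective.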

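I. Why Choose read_html?

Compared with tools such as BeautifulSoup or Scrapy, the core advantages of pandas.read_html are:
1. No hand-written parsing code: it automatically finds <table> tags and converts each one into a DataFrame
2. Established parsers under the hood: it relies on lxml, or on BeautifulSoup with html5lib; these are not bundled with pandas and must be installed separately (for example, pip install lxml)
3. One line of code: going from a URL to structured data takes a single function call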

import pandas as pd
tables = pd.read_html("https://example.com/stock")
print(tables[0].head())  # print the first 5 rows of the first table
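
To see what read_html returns without depending on a live site, here is a minimal sketch that parses an inline HTML snippet. The HTML string and its column names are invented for illustration. The string is wrapped in StringIO because recent pandas versions deprecate passing literal HTML text directly.

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML standing in for a downloaded page.
HTML = """
<table>
  <tr><th>Symbol</th><th>Price</th></tr>
  <tr><td>AAA</td><td>10.5</td></tr>
  <tr><td>BBB</td><td>20.0</td></tr>
</table>
"""

# read_html always returns a list of DataFrames, one per <table> it finds.
tables = pd.read_html(StringIO(HTML))
first = tables[0]

print(len(tables))
print(first)
```

The point to remember is that the result is a list, so a single table still has to be picked out by index.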

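II. Four Steps in Practice

1. Basic scraping: from URL to DataFrame

Pass the page URL directly to scrape publicly accessible tables (keep anti-scraping restrictions in mind):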

url = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP"  
gdp_tables = pd.read_html(url, attrs={"class": "wikitable"})

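2. Precise targeting: the attrs and match parameters

When a page contains several tables, you can pick the right one by HTML attribute or by text:
- attrs: matches the table's HTML attributes (such as class or id)
- match: keeps only tables containing the given text (a string or regular expression)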

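# Scrape tables whose class is "data" and that contain the text "2023"
target_tables = pd.read_html(url, attrs={"class": "data"}, match="2023")
target_table = target_tables[0]  # read_html returns a list, even for a single match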
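
The effect of the two filters can be sketched on a small inline page. The three tables, their class names and their values are all made up for this example; only attrs and match are real read_html parameters.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two data tables and one navigation table.
HTML = """
<table class="data">
  <tr><th>Year</th><th>Revenue</th></tr>
  <tr><td>2022</td><td>100</td></tr>
</table>
<table class="data">
  <tr><th>Year</th><th>Revenue</th></tr>
  <tr><td>2023</td><td>120</td></tr>
</table>
<table class="nav">
  <tr><th>Link</th></tr>
  <tr><td>2023 archive</td></tr>
</table>
"""

# Filter by attribute only: both tables with class="data".
by_class = pd.read_html(StringIO(HTML), attrs={"class": "data"})

# Filter by text only: every table that contains "2023".
by_text = pd.read_html(StringIO(HTML), match="2023")

# Combine both filters: class="data" AND contains "2023".
by_both = pd.read_html(StringIO(HTML), attrs={"class": "data"}, match="2023")

print(len(by_class), len(by_text), len(by_both))
```

When no table satisfies the filters, read_html raises a ValueError rather than returning an empty list, so a lookup on a page whose layout may change is best wrapped in try/except.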

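3. Handling complex structure: multi-level headers and missing values

When a table has merged cells or a complex header:
- Use the header parameter to say which rows form the header
- Use skiprows to skip rows that get in the way
- Use na_values to mark the strings that represent missing values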

tables = pd.read_html(url, header=[0, 1], skiprows=2, na_values=["N/A"])
df = tables[0]  # read_html returns a list of DataFrames, so pick one
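
As a rough illustration of what a two-row header turns into, the sketch below parses an invented table with merged header cells. The table contents and the "n.d." missing-value marker are assumptions made for the example; flattening the column names at the end is one possible convention, not something read_html does on its own.

```python
from io import StringIO

import pandas as pd

# Hypothetical table: "Country" spans two header rows, "GDP" spans two columns.
HTML = """
<table>
  <tr><th rowspan="2">Country</th><th colspan="2">GDP</th></tr>
  <tr><th>2022</th><th>2023</th></tr>
  <tr><td>A</td><td>1.0</td><td>n.d.</td></tr>
  <tr><td>B</td><td>2.0</td><td>2.5</td></tr>
</table>
"""

# header=[0, 1] builds MultiIndex columns from the first two rows;
# na_values turns the custom "n.d." marker into NaN.
df = pd.read_html(StringIO(HTML), header=[0, 1], na_values=["n.d."])[0]

# Flatten ("GDP", "2022") into "GDP 2022"; dict.fromkeys removes the
# repeated label that a rowspan cell produces, e.g. ("Country", "Country").
flat = df.copy()
flat.columns = [" ".join(dict.fromkeys(col)) for col in df.columns]

print(flat)
```

Flat column names are usually easier to work with in later filtering or export steps than tuple-valued ones.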

4. Handling dynamic pages: combine with Selenium

For pages rendered by JavaScript, first use Selenium to obtain the rendered HTML source:

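from io import StringIO
from selenium import webdriver
driver = webdriver.Chrome()
try:
    driver.get(url)
    html = driver.page_source  # HTML after JavaScript has run
finally:
    driver.quit()  # always release the browser
# Wrap the string in StringIO: passing literal HTML text is deprecated in recent pandas
tables = pd.read_html(StringIO(html))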

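III. Solutions to Five Common Problems

Encoding errors: pass the encoding parameter (for example, encoding="utf-8")
Timeouts: read_html has no timeout parameter; download the page with an HTTP client that supports one (for example, requests.get(url, timeout=30)), wrap the call in exception handling, and pass the returned HTML to read_html
Anti-scraping restrictions: send browser-like request headers (such as User-Agent) when downloading the page yourself
Performance: prefer the lxml parser (requires the lxml library to be installed)
Data cleaning: combine with pandas methods such as dropna() and fillna()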
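
Several of these fixes can be combined in one place. The sketch below is one possible arrangement, not the only one: it uses urllib from the standard library in place of a third-party HTTP client, the User-Agent value is an arbitrary assumption, and download_html and clean_table are hypothetical helper names.

```python
from io import StringIO
from urllib.request import Request, urlopen

import pandas as pd

# Assumed header value; any realistic browser User-Agent string would do.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; table-scraper-example)"}


def download_html(url, timeout=30, encoding="utf-8"):
    """Fetch a page with a browser-like header and a timeout.

    Returns the decoded text, or None if the download fails.
    """
    request = Request(url, headers=HEADERS)
    try:
        with urlopen(request, timeout=timeout) as response:
            # Decode explicitly so the encoding is under our control.
            return response.read().decode(encoding, errors="replace")
    except OSError as exc:  # covers URLError, HTTPError and timeouts
        print(f"Download failed: {exc}")
        return None


def clean_table(html):
    """Parse the first table, drop fully empty rows, fill other gaps with 0."""
    df = pd.read_html(StringIO(html))[0]
    return df.dropna(how="all").fillna(0)
```

A typical call would be html = download_html(url) followed by clean_table(html) when html is not None. Whether filling gaps with 0 is appropriate depends on the data; for text columns a different fill value or no fill at all may be the better choice.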

IV. Advanced Tips: API Integration and Automation

Wrap the scraping logic in a function so that it can run as a scheduled task:

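import time
import schedule  # third-party "schedule" package, not APScheduler

target_url = "https://example.com/stock"  # placeholder URL

def scrape_table(url):
    try:
        tables = pd.read_html(url, flavor="lxml")
        return tables[0].dropna(axis=1)  # drop columns that contain missing values
    except Exception as e:
        print(f"Scrape failed: {e}")
        return None

# Run hourly with the schedule library (APScheduler or cron would also work)
schedule.every(1).hour.do(scrape_table, url=target_url)
while True:
    schedule.run_pending()
    time.sleep(60)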