StaleElementReferenceException 不再是问题：Google Colab 上的 Selenium 技巧

亿牛云爬虫专家

于 2024-07-01 13:52:26 发布

阅读量446

点赞数 5

分类专栏： seleuium python 爬虫代理文章标签： selenium 测试工具 Google Colab python 网络爬虫爬虫代理代理IP

本文链接：https://blog.csdn.net/ip16yun/article/details/140099156

版权

爬虫代理同时被 3 个专栏收录

223 篇文章 3 订阅

订阅专栏

python

124 篇文章 0 订阅

订阅专栏

seleuium

45 篇文章 0 订阅

订阅专栏

背景介绍

在现代网页数据抓取领域，Selenium 是一款强大的工具，它使得自动化浏览和数据提取变得异常简单。然而，当面对动态页面时，许多爬虫开发者常常会遇到一个令人头疼的问题——StaleElementReferenceException。这一异常的出现，往往会让我们的爬虫任务陷入停滞。今天，我们将在 Google Colab 环境中，结合代理 IP 技术，深入探讨如何有效解决这一问题，并以澎湃新闻的热点新闻页面为示例，进行实际操作。

问题陈述

StaleElementReferenceException 异常通常发生在尝试访问页面上已经发生变化或被更新的元素时。简单来说，当页面重新加载或部分内容更新时，之前定位到的元素引用就会失效，导致此异常的抛出。这对于动态页面的数据抓取尤为常见，且难以预测。

解决方案

为了解决这一问题，我们需要采取一些预防和恢复措施。具体步骤如下：

显式等待（Explicit Waits）：等待元素加载或更新完毕，再进行下一步操作。
捕获异常并重试：在捕获到StaleElementReferenceException异常时，重新定位元素并重试操作。
代理 IP 技术：使用亿牛云爬虫代理来分散请求压力，避免频繁刷新页面。

以下是详细的实现代码，演示如何在 Google Colab 上使用 Selenium 和代理 IP 技术，并抓取澎湃新闻的热点新闻：

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import StaleElementReferenceException
import time

# 设置代理信息 亿牛云爬虫代理加强版
proxy = "http://username:password@www.16yun.cn:8100"

# 配置Selenium使用代理
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument(f'--proxy-server={proxy}')

# 启动浏览器
driver = webdriver.Chrome(options=chrome_options)

def fetch_hot_news(url):
    driver.get(url)
  

    try:
        # 显式等待，直到热点新闻元素加载完成
        element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, "//div[contains(@class, 'news-content')]/h2/a"))
        )
        # 返回热点新闻标题
        hot_news = [elem.text for elem in driver.find_elements(By.XPATH, "//div[contains(@class, 'news-content')]/h2/a")]
        return hot_news
  

    except StaleElementReferenceException as e:
        print("捕获到StaleElementReferenceException，重试中...")
        time.sleep(1)  # 等待一秒后重试
        return fetch_hot_news(url)
  

    except Exception as e:
        print(f"发生其他异常：{e}")
  

    finally:
        driver.quit()

# 测试函数
url = "https://www.thepaper.cn/"
hot_news = fetch_hot_news(url)
print(hot_news)

案例分析

在上面的代码中，我们首先配置了 Selenium 使用亿牛云的代理 IP。通过代理服务器，我们能够分散请求流量，减少对目标网站的访问压力，从而降低频繁更新页面的可能性。接下来，我们定义了一个 fetch_hot_news 函数，用于抓取澎湃新闻网站上的热点新闻标题。
在函数内部，我们使用显式等待确保热点新闻元素加载完毕，并在捕获到 StaleElementReferenceException 异常时，等待一秒后重新尝试抓取数据。这一措施有效地避免了因为元素更新导致的抓取失败。