Python Self-Study Day 15: Web Crawler - Working with Cookies and Crawling Consecutive Pages

Posted on 2019-12-23, filed under Python

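Basic steps of a web crawler

Have the program imitate a browser and open a network connection.
Study the structure of the page's HTML source.
Parse the source with the BeautifulSoup4 module and write the logic that extracts the parts you want.
If needed, save the result to a text or CSV file.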
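The last step is not demonstrated in the examples of this post, so here is a minimal sketch of it, assuming the titles have already been collected into a list. The function name save_titles, the file name titles.csv, and the sample titles are made up for illustration.

```python
import csv
import os
import tempfile

def save_titles(titles, path):
    """Write one title per row to a CSV file, preceded by a header row."""
    # newline="" lets the csv module control line endings itself
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["title"])
        for title in titles:
            writer.writerow([title])

# Hypothetical sample data standing in for scraped titles
sample_titles = ["[問卦] first sample title", "Re: [新聞] second, with a comma"]

# Write to the system temp directory so the sketch runs anywhere
out_path = os.path.join(tempfile.gettempdir(), "titles.csv")
save_titles(sample_titles, out_path)
```

The csv module quotes fields that contain commas, so titles with punctuation survive a round trip through csv.reader.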

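Application 1: Attaching cookie information to the Request

Last time's application: get all article titles on the newest page of the ptt - movie board.

This time: get all article titles on the newest page of the ptt - Gossiping board (八卦版), which requires a cookie.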

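Concepts

A cookie is a small piece of data that a website stores in the user's browser to improve the user experience. Examples include the scroll position when the page was last closed, or Facebook login information that spares the user from logging in every time (which is also why a friend can take over your account if you forget to log out).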

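What makes the Gossiping board special is that entering it first shows a content-rating prompt asking whether you are over eighteen.

Right-click on the page >> Inspect >> open Developer Tools >> click Application >> click Cookies: only four entries are listed.

After you click the "I am over 18" button, a new entry with Name: over18 and Value: 1 appears. With this entry in place, the browser's next visit skips the rating prompt and goes straight to the article list.

So if a Python program imitates a browser connection but does not send over18=1, what it actually receives is the rating-prompt page, and it cannot extract any article titles.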
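One way to notice that a request landed on the rating prompt is to look at the final URL after redirects. The sketch below assumes the prompt is served from the path /ask/over18; that path is an assumption about the site, not something shown in this post, and is_age_check is a hypothetical helper name.

```python
from urllib.parse import urlparse

# Assumed path of the rating prompt; confirm it in the browser's address bar
AGE_CHECK_PATH = "/ask/over18"

def is_age_check(final_url):
    """Return True if the final URL (after redirects) is the rating prompt."""
    return urlparse(final_url).path == AGE_CHECK_PATH

# Illustrative URLs only; no network request is made here
prompt_url = "https://www.ptt.cc/ask/over18?from=%2Fbbs%2FGossiping%2Findex.html"
board_url = "https://www.ptt.cc/bbs/Gossiping/index.html"
print(is_age_check(prompt_url))
print(is_age_check(board_url))
```

With urllib, the final URL of a response is available through response.geturl() inside the with block, so a check like this could run before parsing.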

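Reload the page and look at the Request Headers: when the browser sends a request to the site, it includes a cookie field. That is why, in the code below, the headers argument of the Request object must carry the cookie information when Python imitates the browser.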
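To check which headers a Request object will carry before anything is sent, it can be built and inspected offline. This is a small sketch using only urllib.request; the shortened User-Agent string is a placeholder, not the one used in this post.

```python
import urllib.request as req

request = req.Request(
    "https://www.ptt.cc/bbs/Gossiping/index.html",
    headers={
        "cookie": "over18=1",          # same cookie as in the browser
        "User-Agent": "Mozilla/5.0",   # shortened placeholder value
    },
)

# Nothing is sent yet; this only inspects what would be sent.
# urllib normalizes header names, so "cookie" is stored as "Cookie".
print(request.get_header("Cookie"))
print(request.header_items())
```

Because HTTP header names are case-insensitive, writing "cookie" or "Cookie" in the dictionary makes no difference to the server.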

Implementation

Wrap connecting, parsing the source, and collecting all article titles on the page into one function, as follows:

# Connect to PTT (批踢踢實業坊) - Gossiping board
# https://www.ptt.cc/bbs/Gossiping/index.html
import urllib.request as req
import bs4

# Wrap the logic in a function
def getTitle(url):
    # Build a Request that carries the cookie and a browser-like User-Agent
    request = req.Request(url, headers={
        "cookie": "over18=1",         # cookie information
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36"
    })
    with req.urlopen(request) as response:
        data = response.read().decode("utf-8")
    root = bs4.BeautifulSoup(data, "html.parser")
    text_titles = root.find_all("div", class_="title")
    for title in text_titles:
        # Skip entries that have no <a> tag inside the title block
        if title.a is not None:
            print(title.a.string)


# Call the function
url = "https://www.ptt.cc/bbs/Gossiping/index.html"
getTitle(url)

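Application 2: Crawling the three newest pages in a row

Crawl the three newest pages of the Gossiping board in a row.

Concepts

Think about it

Observe how a user changes pages with the buttons on the page to reach the next page and browse a new list of articles.

Answer

The page has a ‹ 上頁 (previous page) button. Open Developer Tools again and inspect that node. What sets this button apart from the other buttons is its tag content, the text ‹ 上頁. Moreover, the value of its href attribute is exactly the latter part of the URL of the page we want.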
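Since the href holds only the latter part of the address, it has to be combined with the site's address to form a full URL. As an alternative to joining the strings by hand, the standard library's urljoin resolves a relative link against the page it was found on. A minimal sketch, using a sample href value in the shape PTT uses:

```python
from urllib.parse import urljoin

current_url = "https://www.ptt.cc/bbs/Gossiping/index.html"
href = "/bbs/Gossiping/index38927.html"  # sample value of the button's href

# Resolve the relative href against the page it came from
prev_url = urljoin(current_url, href)
print(prev_url)  # https://www.ptt.cc/bbs/Gossiping/index38927.html
```

This keeps working if the site ever returns a full URL in href, because urljoin leaves absolute links unchanged.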

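Implementation

Modify the code from the previous application so that, after collecting all article titles on the page, the function finds the <a> tag and returns the new URL.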

# Connect to PTT (批踢踢實業坊) - Gossiping board
# https://www.ptt.cc/bbs/Gossiping/index.html
import urllib.request as req
import bs4

# Wrap the logic in a function
def getTitle(url):
    request = req.Request(url, headers={
        "cookie": "over18=1",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36"
    })
    with req.urlopen(request) as response:
        data = response.read().decode("utf-8")
    root = bs4.BeautifulSoup(data, "html.parser")
    text_titles = root.find_all("div", class_="title")
    for title in text_titles:
        if title.a is not None:
            print(title.a.string)

    # Use the distinctive "‹ 上頁" text to find the <a> tag, then return the new URL
    nextlink = root.find("a", string="‹ 上頁")
    return nextlink["href"]


# Call the function three times, once per page
url = "https://www.ptt.cc/bbs/Gossiping/index.html"
for _ in range(3):
    # The returned URL is only the latter part, so prepend the site address
    url = "https://www.ptt.cc" + getTitle(url)

node["attribute name"] returns the value of that attribute. For the class attribute, the value comes back as a list.
print(nextlink)
print(nextlink["href"])
print(nextlink["class"])

# <a class="btn wide" href="/bbs/Gossiping/index38927.html">‹ 上頁</a>
# /bbs/Gossiping/index38927.html
# ['btn', 'wide']

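Application 3: Crawling pages back to the oldest

The board actually has more than thirty thousand pages, and crawling all of them would take too long, so start from page six (index6.html) and crawl backward until the oldest page.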

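Concepts

Think about it

Observe how the ‹ 上頁 button on the oldest page differs from the ‹ 上頁 button on other pages.

Answer

On the oldest page the ‹ 上頁 button cannot be clicked, because its class attribute contains the extra value disabled.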
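The check can be tried offline on hand-written HTML before running it against the site. The sketch below assumes markup shaped like the button described above; the two snippets and the helper name prev_page_href are made up for illustration, and the real page may differ in detail.

```python
import bs4

def prev_page_href(html):
    """Return the href of the ‹ 上頁 link, or None if it is missing or disabled."""
    root = bs4.BeautifulSoup(html, "html.parser")
    link = root.find("a", string="‹ 上頁")
    if link is None:
        return None
    # class is a multi-valued attribute, so it is a list; default to [] if absent
    if "disabled" in link.get("class", []):
        return None
    # .get returns None instead of raising if the tag has no href
    return link.get("href")

# Hand-written snippets imitating the two kinds of button (assumed markup)
normal = '<a class="btn wide" href="/bbs/Gossiping/index5.html">‹ 上頁</a>'
oldest = '<a class="btn wide disabled">‹ 上頁</a>'
print(prev_page_href(normal))
print(prev_page_href(oldest))
```

Using .get instead of square brackets avoids a KeyError when an attribute is absent, which is useful if the disabled button carries no href.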

Implementation

Modify the code from the previous application: when the <a> tag's class attribute contains disabled, return None; otherwise return the URL.

# Connect to PTT (批踢踢實業坊) - Gossiping board
# https://www.ptt.cc/bbs/Gossiping/index6.html
import urllib.request as req
import bs4

# Wrap the logic in a function
def getTitle(url):
    request = req.Request(url, headers={
        "cookie": "over18=1",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36"
    })
    with req.urlopen(request) as response:
        data = response.read().decode("utf-8")
    root = bs4.BeautifulSoup(data, "html.parser")
    text_titles = root.find_all("div", class_="title")
    for title in text_titles:
        if title.a is not None:
            print(title.a.string)

    # Use the distinctive "‹ 上頁" text to find the <a> tag
    nextlink = root.find("a", string="‹ 上頁")
    if "disabled" in nextlink["class"]:
        return None               # disabled: there is no older page
    else:
        return nextlink["href"]   # not disabled: return the URL


# Call the function until the oldest page is reached
url = "https://www.ptt.cc/bbs/Gossiping/index6.html"
while True:
    prev_page_url = getTitle(url)
    if prev_page_url is None:
        break
    url = "https://www.ptt.cc" + prev_page_url
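The stopping logic can also be separated from the fetching, which makes it possible to cap the number of pages and to test the loop without a network connection. A minimal sketch, assuming a hypothetical crawl helper; the max_pages and delay_seconds parameters are illustrative additions, not part of the code above.

```python
import time

def crawl(start_url, visit_page, max_pages=10, delay_seconds=0.0):
    """Follow page links until visit_page returns None or max_pages is reached.

    visit_page(url) must process one page and return the full URL of the
    next page to visit, or None when there is nothing left.
    """
    visited = []
    url = start_url
    while url is not None and len(visited) < max_pages:
        visited.append(url)
        url = visit_page(url)
        if url is not None and delay_seconds > 0:
            time.sleep(delay_seconds)  # pause between requests
    return visited

# Offline stand-in for real pages: each key maps to the next page to visit
fake_links = {"page3": "page2", "page2": "page1", "page1": None}
print(crawl("page3", fake_links.get))               # ['page3', 'page2', 'page1']
print(crawl("page3", fake_links.get, max_pages=2))  # ['page3', 'page2']
```

To use it with the function from this post, visit_page would be a small wrapper that calls getTitle and prepends "https://www.ptt.cc" to any non-None result.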

References:
彭彭's course: Python 網路爬蟲 Web Crawler 基本教學 (Python Web Crawler basic tutorial)
