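Python Self-Study Day 14: A First Look at Web Crawlers

Posted 2019-12-22, updated 2019-12-23, category: Python

Why learn web crawling?

In an age of information overload, tens of millions of new web pages appear every day. If you want to analyze a tech company's growth potential, the data you need includes revenue growth rate, core business growth rate, equity ratio, fixed asset ratio, and more. The current year or quarter is not enough either: you may need ten consecutive years of figures. If you want to bet on sports, you need years of NBA scores for different teams, plus each player's minutes played, field goal percentage, free throw percentage, and so on.

A web crawler automates the repetitive steps of collecting data and then extracting and processing the information. It saves a great deal of time and avoids manual calculation mistakes (unless your own program logic is wrong, of course).

To write a web crawler, you first need a basic understanding of web pages. If you are not yet familiar with HTML and CSS, start by learning about HTML tags and attributes, and read the section 「爬蟲是怎麼辦到的？」 (How do crawlers do it?) in the reference below.

References:
網路爬蟲淺談 (A Brief Introduction to Web Crawlers)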

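Getting the HTML source of a page

Basic connection

As mentioned in the previous post on fetching public Open Data, the urlopen() function in the urllib.request module imitates a browser and connects to the target URL. It sends a request and its parameters to the web server and receives the server's response. urlopen() then wraps the response in an object and returns it to a variable in your program.

The following code reads the HTML source of a page: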

import urllib.request as req

url = "https://ithelp.ithome.com.tw/users/20111390/ironman/1791"
with req.urlopen(url) as res:
    page_data = res.read().decode("utf-8")
print(page_data)

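read(): reads the response body (the HTML source) as bytes, which is why the code above calls decode("utf-8") to turn it into a string.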
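The code above hard-codes "utf-8". As a minimal sketch, the charset can instead be read from the response headers, with UTF-8 as a fallback. To keep the sketch runnable offline, a data: URL stands in for a real website; that URL and the sample HTML are assumptions for illustration and are not part of the original post.

```python
import base64
import urllib.request as req

# Assumption: a data: URL stands in for a real site so this runs without a network.
html = "<html><head><title>測試頁</title></head></html>"
encoded = base64.b64encode(html.encode("utf-8")).decode("ascii")
url = "data:text/html;charset=utf-8;base64," + encoded

with req.urlopen(url) as res:
    raw = res.read()  # read() returns bytes, not str
    # Use the charset declared in Content-Type, or fall back to UTF-8
    charset = res.headers.get_content_charset() or "utf-8"

page_data = raw.decode(charset)
print(type(raw).__name__, charset)
print(page_data)
```

With a real http(s) URL the same two lines inside the with block apply, since the response headers there offer the same get_content_charset() method.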

However, the code above raises an error for a URL like this one:

url = "https://www.ptt.cc/bbs/movie/index.html"

# urllib.error.HTTPError: HTTP Error 403: Forbidden

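403 Forbidden means, roughly, that you are not allowed to access the site: the server received the request but refused to serve it. The cause here is that our imitation of a browser is not convincing enough, because some request headers are missing.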

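Creating a Request object with headers

When Python imitates a browser and sends a request to a web server, the request needs to carry some parameters (request headers).

Here we need the value of the user-agent header, which can be found in the browser's developer tools. Right-click on the page and choose Inspect:

In the developer tools:

Click Network.
Reload the page so that the request is sent to the server again.
Click the entry for the content the server returned.
Find Request Headers.
Find user-agent and copy its value.

The code below uses the Request class to create a Request object that carries headers. The headers argument takes a dictionary, to which we add a User-Agent key-value pair. Connecting with urlopen() then no longer raises the error: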

import urllib.request as req

url = "https://www.ptt.cc/bbs/movie/index.html"

# Create a Request object with headers
URL = req.Request(url, headers={
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36"
})

# Connect using the Request object
with req.urlopen(URL) as res:
    page_data = res.read().decode("utf-8")
print(page_data)
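To see the effect of the User-Agent header without depending on a live site, here is a rough sketch that runs a tiny local server. BrowserOnlyHandler and fetch_status are hypothetical names made up for this illustration, and the rule "reject anything without Mozilla in the User-Agent" is an assumption about how such a server might behave, not a description of how PTT actually filters requests.

```python
import threading
import urllib.error
import urllib.request as req
from http.server import BaseHTTPRequestHandler, HTTPServer


class BrowserOnlyHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in server: only answers clients that look like a browser."""

    def do_GET(self):
        if "Mozilla" not in self.headers.get("User-Agent", ""):
            self.send_error(403)  # the same status code the article ran into
            return
        body = "<html><title>ok</title></html>".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch_status(url, headers):
    """Return the HTTP status code, whether the request succeeded or not."""
    # An empty ProxyHandler makes the opener ignore system proxy settings,
    # so the request goes straight to the local stand-in server.
    opener = req.build_opener(req.ProxyHandler({}))
    try:
        with opener.open(req.Request(url, headers=headers)) as res:
            return res.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 403 when the server refuses the request


# Port 0 lets the operating system pick any free port
server = HTTPServer(("127.0.0.1", 0), BrowserOnlyHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
local_url = f"http://127.0.0.1:{server.server_port}/"

without_agent = fetch_status(local_url, {})
with_agent = fetch_status(local_url, {"User-Agent": "Mozilla/5.0 (demo)"})

server.shutdown()
server.server_close()
print(without_agent, with_agent)
```

The first request goes out with urllib's default User-Agent and is refused, while the second carries a browser-like value and succeeds. Catching urllib.error.HTTPError also lets a crawler report the status code instead of stopping with a traceback.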

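Parsing the source with BeautifulSoup4

Installation

BeautifulSoup4 is not a built-in module, so install it first.

$ pip install beautifulsoup4

Importing and parsing

Note that the module is imported as bs4, not beautifulsoup4.

Create a BeautifulSoup object by passing in the HTML source page_data read in the previous section, and specify Python's built-in HTML parser, "html.parser".

The resulting BeautifulSoup object is a tree. Every tag is parsed into a node, and every node is an object that you work with through attributes (bs4_object.attribute) and methods (bs4_object.method()).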

import bs4
parsed_data = bs4.BeautifulSoup(page_data, "html.parser")
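To make "every node is an object" concrete, here is a small sketch that parses a short HTML string and inspects the resulting nodes. The sample HTML is made up for illustration; only the bs4 calls themselves come from the library.

```python
import bs4

# Assumption: a tiny hand-written page instead of a downloaded one
html = (
    '<html><head><title>Demo</title></head>'
    '<body><p class="intro">Hello <a href="/about">about</a></p></body></html>'
)
soup = bs4.BeautifulSoup(html, "html.parser")

p = soup.p
print(type(soup).__name__)        # the root object is a BeautifulSoup instance
print(type(p).__name__)           # each tag becomes a Tag object
print(p.name)                     # the tag name as a string
print(p.attrs)                    # the attributes as a dictionary
print(p.a["href"])                # a single attribute, read like a dictionary key
print(type(p.a.string).__name__)  # text inside a tag is a NavigableString
```

Note that class is stored as a list, because an HTML element can carry several classes at once.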

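BeautifulSoup(str[, parser]): BeautifulSoup is a class. It creates the object for the top-level (root) node of the tree and parses the HTML source.

str: the HTML source string to parse.
parser: the parser to use, which tells BeautifulSoup how to parse the HTML string. It is optional. If omitted, BeautifulSoup picks the best parser that is installed, in the order lxml >> html5lib >> Python's built-in parser (html.parser). Different parsers can build different trees from the same HTML, so naming the parser explicitly keeps results consistent across machines.

References:
Beautiful Soup 4.4.0 文档 - 指定文档解析器
Beautiful Soup 4.4.0 文档 - 安装解析器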

Commonly used attributes of the root object

Attribute	Description
head	The <head> element and everything inside it.
body	The <body> element and everything inside it.
title	The <title> element.

# Connects to iT邦幫忙 - python 入門到分析股市, by Summer
# https://ithelp.ithome.com.tw/users/20111390/ironman/1791
import bs4
parsed_data = bs4.BeautifulSoup(page_data, "html.parser")
print(parsed_data.title)

# Output:
# <title>python 入門到分析股市 :: 2019 iT 邦幫忙鐵人賽</title>

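Navigating the tree

Traversing the tree lets you visit every node once, without repeats, to find the nodes you want.

Commonly used attributes of child nodes

Attribute	Description
Tag name: h1, p, a ...	The child node (element) with that tag name, such as <h1>, <p>, or <a>. If there are several, only the first is returned. <head>, <body>, and <title> are tag names too.
string	The node's text, when there is exactly one piece of it. If the node contains several pieces of text, bs4 cannot tell which one you mean and returns None.
strings	All text inside the node. A generator, so you can iterate over it.
stripped_strings	All text inside the node, with surrounding whitespace stripped and whitespace-only strings skipped. A generator, so you can iterate over it.
children	All direct children of the node. An iterator, so you can loop over it.
parent	The node's parent, one level up.
previous_sibling, next_sibling	The previous and next nodes on the same level as this node.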

Example

# Connects to iT邦幫忙 - python 入門到分析股市, by Summer
# https://ithelp.ithome.com.tw/users/20111390/ironman/1791
import bs4
parsed_data = bs4.BeautifulSoup(page_data, "html.parser")

print(parsed_data.h1)
# Output:
# <h1 class="header__logo pull-left"><a href="/"><img alt="iT邦幫忙" class="img-responsive" src="https://ithelp.ithome.com.tw/storage/image/logo.svg"/></a></h1>


for li in parsed_data.ul.children:
    print(li)
# Output:
# <li class="menu__item">
# <a class="menu__item-link menu__item-link--pl" href="https://ithelp.ithome.com.tw/questions">技術問答</a>
# </li>
# 
# <li class="menu__item">
# <a class="menu__item-link" href="https://ithelp.ithome.com.tw/articles?tab=tech">技術文章</a>
# </li>
# 
# ...
# 
# <li class="menu__item menu__item--ironman">
# <a class="menu__item-link hidden-xs" href="https://ithelp.ithome.com.tw/2020ironman?sc=nav" target="_blank">鐵人賽</a>
# </li>


first_li = parsed_data.li
print(first_li)
print(repr(first_li.next_sibling))
# Output:
# <li class="menu__item">
# <a class="menu__item-link menu__item-link--pl" href="https://ithelp.ithome.com.tw/questions">技術問答</a>
# </li>
# '\n'

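repr(obj): returns a printable representation of an object. It is handy for making whitespace-only strings such as '\n' visible in the output.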

Note

Child nodes such as parsed_data.a also accept .head without raising an error, but there may be nothing to return. For example:

# The a node contains no head, only an img
print(parsed_data.a.head)
print(parsed_data.a.img)

# None
# <img alt="iT邦幫忙" class="img-responsive" src="https://ithelp.ithome.com.tw/storage/image/logo.svg"/>
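Because a missing tag comes back as None, anything chained onto it fails. The sketch below, on a made-up fragment (an assumption for illustration), shows the failure and a simple guard.

```python
import bs4

# Assumption: a small hand-written fragment, similar in shape to the logo link above
html = '<div><a href="/"><img alt="logo" src="logo.svg"/></a></div>'
soup = bs4.BeautifulSoup(html, "html.parser")

missing = soup.a.head  # there is no <head> inside <a>, so this is None
print(missing)

# Chaining onto a missing tag raises AttributeError, because None has no .title
try:
    soup.a.head.title
    chained_ok = True
except AttributeError:
    chained_ok = False
print(chained_ok)

# Checking for None first keeps the lookup safe
img = soup.a.img
alt_text = img["alt"] if img is not None else None
print(alt_text)
```

The same guard matters when crawling real pages, where an expected tag can be absent for some entries.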

To print the result with indentation, add .prettify(). For example:

print(parsed_data.h1.prettify())

# <h1 class="header__logo pull-left">
#  <a href="/">
#   <img alt="iT邦幫忙" class="img-responsive" src="https://ithelp.ithome.com.tw/storage/image/logo.svg"/>
#  </a>
# </h1>

References:
Beautiful Soup 4.4.0 文档 - 遍历文档树

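Searching the tree

Filter for the nodes you want by tag name, attributes, and text.

The find functions and their parameters

Function	Description
find()	Returns the first node that matches, or None if nothing matches.
find_all()	Returns all nodes that match, as a list-like result you can iterate over.

find_all(name, attrs, recursive, string, limit, **kwargs)
find(name, attrs, recursive, string, **kwargs)

name: the tag name, passed as a string.
attrs: the tag's attributes and values, passed as a dictionary.
recursive: whether to search all descendants of the node. Defaults to True. With False, only the direct children are searched.
string: the tag's text content, passed as a string.
limit: the maximum number of results to return, passed as a number.
**kwargs: keyword arguments. Common tag attributes can be passed directly as keywords, such as id, href, and class_ (class is a reserved word in Python, so remember the trailing underscore).

Because the find functions take quite a few parameters, it is a good idea to pass everything except name by keyword. That makes mistakes less likely.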

Example

# Connects to iT邦幫忙 - python 入門到分析股市, by Summer
# https://ithelp.ithome.com.tw/users/20111390/ironman/1791
import bs4
parsed_data = bs4.BeautifulSoup(page_data, "html.parser")

div_1 = parsed_data.find_all("div", class_="profile-header__name")
a_1 = parsed_data.find_all(
    "a", attrs={"href": "https://ithelp.ithome.com.tw/questions"})
print(div_1)
print(a_1)

# Output:
# [<div class="profile-header__name">
#             Summer <span class="profile-header__account">(summer0531)</span>
# </div>]
# [<a class="menu__item-link menu__item-link--pl" href="https://ithelp.ithome.com.tw/questions">技術問答</a>]
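The example above covers class_ and attrs. Here is a short sketch of the remaining parameters (recursive, limit, string) and of find() returning None, run against a made-up list of posts. The HTML and its ids and paths are assumptions for illustration.

```python
import bs4

# Assumption: a hand-written list of posts, one of them nested one level deeper
html = """<div id="list">
  <div class="title"><a href="/p/1">First post</a></div>
  <div class="title">(deleted)</div>
  <div class="title"><a href="/p/2">Second post</a></div>
  <section><div class="title"><a href="/p/3">Nested post</a></div></section>
</div>"""
soup = bs4.BeautifulSoup(html, "html.parser")
container = soup.find("div", id="list")

# Default (recursive=True): every matching descendant, however deep
all_titles = container.find_all("div", class_="title")
# recursive=False: direct children only, so the nested one is skipped
direct_titles = container.find_all("div", class_="title", recursive=False)
# limit: stop after the first two matches
first_two = container.find_all("a", limit=2)
# string: match on the tag's text content
exact = container.find("a", string="Second post")
# find() returns None when nothing matches
missing = container.find("a", href="/nope")

print(len(all_titles), len(direct_titles))
print([a["href"] for a in first_two])
print(exact["href"], missing)
```

Passing everything except the tag name by keyword, as recommended above, also makes each call read like a description of what is being searched for.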

References:
Beautiful Soup 4.4.0 文档 - 搜索文档树

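Basic web crawler workflow

Have the program imitate a browser and open a network connection.
Study the structure of the HTML source.
Parse the source with the BeautifulSoup4 module and write the logic that extracts the parts you want.
If needed, save the result to a text or CSV file.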
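The last step, saving to CSV, could look roughly like the sketch below. It skips the network step and parses a made-up HTML fragment; the column names title and href are assumptions for illustration. The result is written to an in-memory buffer so it can be inspected directly.

```python
import csv
import io

import bs4

# Assumption: a hand-written fragment standing in for a downloaded page
html = """<div class="title"><a href="/p/1">First post</a></div>
<div class="title">(deleted)</div>
<div class="title"><a href="/p/2">Second post</a></div>"""
soup = bs4.BeautifulSoup(html, "html.parser")

# Extract (title, link) pairs, skipping entries that have no <a>
rows = []
for div in soup.find_all("div", class_="title"):
    if div.a is not None:
        rows.append((str(div.a.string), div.a["href"]))

# Write a header row followed by the data rows
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["title", "href"])
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file instead, replace the buffer with open("titles.csv", "w", newline="", encoding="utf-8"), where titles.csv is a placeholder file name. Passing newline="" is what the csv module documentation recommends, so that no extra blank lines appear on Windows.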

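Application

Get the titles of all posts on the latest page of the PTT movie board. A post may have been deleted, in which case its entry has no <a> tag, so we cannot simply grab the content of the <a> tags. Instead we look for the enclosing <div> tag, as shown in the figure: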

# Connects to 批踢踢實業坊 (PTT) - movie board
# https://www.ptt.cc/bbs/movie/index.html
import urllib.request as req
import bs4

# Open the connection, sending a browser-like User-Agent
url = "https://www.ptt.cc/bbs/movie/index.html"
URL = req.Request(url, headers={
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36"
})
with req.urlopen(URL) as res:
    page_data = res.read().decode("utf-8")

# Parse the source
parsed_data = bs4.BeautifulSoup(page_data, "html.parser")

# Get every post title; deleted posts have no <a>, so skip them
text_titles = parsed_data.find_all("div", class_="title")
for title in text_titles:
    if title.a is not None:
        print(title.a.string)

# Output:
# Re: [  雷] 星戰9──討論那一幕
# [公告] 板規 2019/08/24
# [公告] 板規新增每日發文上限規定
# Fw: [公告] 請使用安全的連線方式連線本站
# [公告] 獎季發文限制放寬

References:
彭彭的課程：Python 網路爬蟲 Web Crawler 基本教學