Scraping data with webdriver on Heroku @ kevin's blog

Problem:

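I use the webdriver from the selenium module for scraping, but Heroku does not support webdriver out of the box, so a buildpack has to provide it. The main problem: according to what I found while searching online, the xvfb-google-chrome buildpack is not compatible with the heroku-16 stack.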

Solution:

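There are two ways to solve this: switch the app's current stack to the cedar-14 stack, or look for a different buildpack. I went with switching to cedar-14.

The LINE bot I have been writing lately needs more advanced scraping that reads page content generated dynamically, so I had no choice but to use the selenium webdriver. What I actually want is to scrape Google Image Search for the real location of each image, with a URL ending in .jpg, to help me get a few things done. Below is a fragment of my code. This took me more than a week to sort out, but I am getting more familiar with it.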

import json
import os
from urllib.parse import quote

from selenium import webdriver


def get_image_link(search_query):
    """Return up to three https image URLs from a Google search for the query."""
    img_urls = []
    chrome_options = webdriver.ChromeOptions()
    # Both paths come from the Heroku config vars listed at the end of this post.
    chrome_options.binary_location = os.getenv('GOOGLE_CHROME_BIN')
    chrome_options.add_argument('--disable-gpu')
    chrome_options.add_argument('--no-sandbox')
    # Keyword names as accepted by selenium 3.8.0, the version used in this post.
    driver = webdriver.Chrome(chrome_options=chrome_options,
                              executable_path=os.getenv('CHROMEDRIVER_PATH'))
    try:
        # Drop the trailing 4-character keyword and search for "餐點價格" (meal prices).
        t = search_query[:-4] + '餐點價格'
        url = 'https://www.google.com/search?q=' + quote(t)
        driver.get(url)
        imges = driver.find_elements_by_xpath('//div[contains(@class,"rg_meta notranslate")]')
        count = 0
        for img in imges:
            # Each rg_meta div holds JSON; its "ou" field is read as the image URL.
            img_url = json.loads(img.get_attribute('innerHTML'))["ou"]
            print(str(count) + '--->' + str(img_url))
            if not img_url.startswith('https'):
                continue
            img_urls.append(img_url)
            if count > 1:  # stop once three links are collected
                break
            count = count + 1
    finally:
        driver.quit()  # always release the browser, even on errors
    return img_urls
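
The snippet above only checks that a link starts with https, while the stated goal was links that end in .jpg. Here is a minimal sketch of that extra filter, assuming the check should apply to the URL path so that a query string after .jpg does not matter. `pick_jpg_links` and its `limit` parameter are hypothetical names, not part of the original bot:

```python
from urllib.parse import urlparse


def pick_jpg_links(urls, limit=3):
    """Keep https links whose path ends in .jpg, up to `limit` items."""
    picked = []
    for url in urls:
        if len(picked) >= limit:
            break
        parts = urlparse(url)
        # Compare the path only, so "photo.jpg?size=large" still counts.
        if parts.scheme == 'https' and parts.path.lower().endswith('.jpg'):
            picked.append(url)
    return picked
```

It could be applied to the list returned by get_image_link before the bot replies, so that only direct .jpg links are sent back.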

Result:

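I just type a restaurant name followed by "menu", and my LINE bot scrapes the image automatically and sends it back to the end user. I quite like it: whenever I was deciding what to eat I used to have to open a browser and google the menu, whereas with this bot a few typed words are enough to get the latest menu. It still fails sometimes, so I keep fixing it.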
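
The trigger described here (restaurant name followed by "menu") lines up with the `search_query[:-4]` slice in the code, since "menu" is four characters long. Here is a small sketch of making that check explicit, assuming the keyword is always the literal suffix "menu". `to_search_term` is a hypothetical helper name and not something the original bot defines:

```python
def to_search_term(message, keyword='menu', suffix='餐點價格'):
    """Turn '<restaurant name>menu' into a search term, or None if no keyword."""
    text = message.strip()
    if not text.lower().endswith(keyword):
        return None  # not a menu request; leave it to other handlers
    name = text[:len(text) - len(keyword)].strip()
    if not name:
        return None  # the message was only the keyword
    return name + suffix
```

Returning None for messages without the keyword avoids slicing four characters off text that was never a menu request.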

References:

1. Heroku notes on using webdriver

https://devcenter.heroku.com/articles/heroku-ci#known-issues

2. Heroku notes on switching the stack for webdriver

https://devcenter.heroku.com/articles/cedar-14-stack

3. Buildpacks and config vars that need to be set up on Heroku

The two buildpacks to add are:

1. https://github.com/heroku/heroku-buildpack-chromedriver
2. https://github.com/heroku/heroku-buildpack-xvfb-google-chrome

The two environment variables to add are:

1. CHROMEDRIVER_PATH

/app/.chromedriver/bin/chromedriver

2. GOOGLE_CHROME_BIN

/app/.apt/usr/bin/google-chrome
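
Because the code reads both paths with `os.getenv`, a missing config var may only surface later as a less obvious webdriver error. Here is a minimal sketch of failing early instead. `load_chrome_paths` is a hypothetical helper, and it accepts the environment as a parameter only so that it is easy to try out locally:

```python
import os

REQUIRED_VARS = ('CHROMEDRIVER_PATH', 'GOOGLE_CHROME_BIN')


def load_chrome_paths(env=None):
    """Return (chromedriver_path, chrome_binary) or raise if either is unset."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError('Missing config vars: ' + ', '.join(missing))
    return env['CHROMEDRIVER_PATH'], env['GOOGLE_CHROME_BIN']
```

The two returned values correspond to the `executable_path` argument and the `binary_location` option used in get_image_link.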

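Finally, you also need to add selenium==3.8.0 to requirements.txt. This one cost me a lot of time: at first I did not specify a version, and it was very unstable and crashed often. While searching I found advice to pin selenium==3.8.0, because that version seems to be the most stable.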

https://github.com/haruspring-jokt/tenkibot/issues/4

The tip comes from the helpful Japanese user at the link above.

https://stackoverflow.com/questions/41059144/running-chromedriver-with-python-selenium-on-heroku
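
Since the unpinned install was what caused the instability, it can help to confirm the pin is really in place before deploying. This is a rough sketch, assuming a plain requirements.txt with one requirement per line. `pinned_version` is a hypothetical helper and handles only the simple `name==version` form:

```python
def pinned_version(requirements_text, package='selenium'):
    """Return the version pinned with '==' for `package`, or None."""
    for raw in requirements_text.splitlines():
        line = raw.split('#', 1)[0].strip()  # drop comments and whitespace
        if '==' not in line:
            continue
        name, version = line.split('==', 1)
        if name.strip().lower() == package:
            return version.strip()
    return None
```

Reading the file and comparing the result against '3.8.0' is enough to catch a requirements.txt that lists selenium without a version.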

(Thanks for reading)


Posted by KV on PIXNET
