Recursively Crawling an Entire Website with Wget

GNU Wget is a computer program that retrieves content from web servers. It is part of the GNU Project. Its name derives from World Wide Web and get. It supports downloading via HTTP, HTTPS, and FTP.

— Wikipedia

Wget is powerful. It works well for recursively crawling a website's content, and using it is certainly faster than writing your own crawler.

Below is a Bash script that uses Wget to crawl content. It is based on: http://www.linuxjournal.com/content/downloading-entire-web-site-wget

#!/bin/bash

###
# Get website contents recursively
#
# @author YanWen <i@yanwen.email>
# @modified 2017-11-30
# @references http://www.linuxjournal.com/content/downloading-entire-web-site-wget
###

###
# Download a website with wget (supports a single domain only)
#
# @param string href    URL to start crawling from
# @param string domain  Domain the crawl is restricted to
# @return 0 after wget has run; 1 if wget is missing or the argument count is wrong
###
function wget_get() {
  # Make sure the wget command is available
  if ! command -v wget > /dev/null 2>&1; then
    echo 'Error: wget is not installed' >&2
    return 1
  fi

  if [ "$#" -ne 2 ]; then
    echo 'Error: expected exactly 2 arguments: href and domain' >&2
    return 1
  fi

  local href=$1 domain=$2

  echo "wget $href in domain $domain..."
  wget --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --domains "$domain" --no-parent "$href"

    # --recursive        # NOTE: Follow links and download recursively
    # --no-clobber       # NOTE: Don't overwrite any existing files
    # --page-requisites  # NOTE: Get all the elements that compose the page (images, stylesheets, and so on)
    # --html-extension   # NOTE: Save HTML files with an .html suffix (older name of --adjust-extension)
    # --convert-links    # NOTE: Convert links so that they work locally
    # --restrict-file-names=windows  # NOTE: Modify filenames so that they will work in Windows as well
    # --domains "$domain"  # NOTE: Don't follow links outside the domains
    # --no-parent        # NOTE: Don't ascend to the parent directory of $href
  echo "Done"
}

# Run
if [ "$#" -ne 2 ]; then
  echo 'Usage: ./wget_get <href> <domain>' >&2
  exit 1
else
  wget_get "$1" "$2"
fi
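
Because the script takes the domain as a separate argument, it can be handy to derive it from the URL instead of typing it twice. The following is a minimal sketch, not part of the original script: `domain_from_url` is a hypothetical helper name, and it assumes an ordinary host name (it does not handle IPv6 literals such as `[::1]`).

```shell
#!/bin/bash

# Hypothetical helper: print the host part of a URL.
# Assumes a plain host name; IPv6 literals are not handled.
domain_from_url() {
  local url=$1
  local host=${url#*://}   # drop the scheme, e.g. "http://"
  host=${host%%/*}         # drop the path
  host=${host##*@}         # drop any "user:password@" prefix
  host=${host%%:*}         # drop any ":port" suffix
  printf '%s\n' "$host"
}
```

With this helper, a call might look like `./wget_get "$url" "$(domain_from_url "$url")"`, where `url` holds the start address.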

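Of course, this simple script can only recursively crawl content under a single specified domain. Restricting the crawl to several domains requires further changes.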
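
One possible way to make that change is sketched here. Wget's `--domains` option accepts a comma-separated list, and `--span-hosts` is needed before a recursive crawl will leave the starting host at all. The function name `build_wget_args` and the `example.com` domains in the usage note are illustrative assumptions, and the function only prints the argument list rather than running wget, so the result can be inspected first.

```shell
#!/bin/bash

# Hypothetical helper: build the wget arguments for a crawl limited to
# several domains. Prints one argument per line.
# Usage: build_wget_args <href> <domain> [<domain>...]
build_wget_args() {
  if [ "$#" -lt 2 ]; then
    echo 'Usage: build_wget_args <href> <domain> [<domain>...]' >&2
    return 1
  fi

  local href=$1
  shift

  # Join the remaining arguments with commas: "a.com b.com" -> "a.com,b.com"
  local IFS=,
  local domain_list="$*"

  printf '%s\n' \
    --recursive \
    --no-clobber \
    --page-requisites \
    --html-extension \
    --convert-links \
    --restrict-file-names=windows \
    --span-hosts \
    "--domains=$domain_list" \
    --no-parent \
    "$href"
}
```

In Bash 4 or later the output could be fed to wget with `mapfile -t args < <(build_wget_args http://example.com/ example.com static.example.com)` followed by `wget "${args[@]}"`.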

Its real strength is that it can also pull down resources such as CSS and JS files.
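
A quick way to see what a crawl actually pulled down is to count the saved files by extension. This is only a rough sketch: `count_assets` is a hypothetical helper, and it assumes the mirror was saved under a single directory (by default wget creates one named after the host).

```shell
#!/bin/bash

# Hypothetical helper: count the .html, .css and .js files under a directory.
# Usage: count_assets <directory>
count_assets() {
  local dir=$1 ext count
  for ext in html css js; do
    # wc -l may pad its output with spaces, so strip all whitespace
    count=$(find "$dir" -type f -name "*.$ext" | wc -l | tr -d '[:space:]')
    printf '%s %s\n' "$ext" "$count"
  done
}
```

For example, `count_assets www.linuxjournal.com` would print one line per extension with the number of matching files.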

References:

  1. linuxjournal.com – downloading-entire-web-site-wget
  2. Wikipedia – Wget

Author: YanWen

Web developer
