Master the art of downloading large files on Linux from the terminal

wget is a free command line tool that downloads files from the terminal non-interactively; it can work in the background without needing the user to log on. Downloading from the browser may require the user's constant presence for downloads to finish, which is not always ideal when dealing with large file downloads.

Advantages of using `wget`

wget is built to cater to slow and/or unstable networks. The wget utility retries when a download fails and supports downloading from where the last download failed (if the server the user is downloading from supports Range headers).
Furthermore, the utility allows the use of proxy servers which can aid in faster retrievals and lighter network loads.

wget supports recursive downloads. A recursive download refers to the process of downloading a file and then using that file to download additional files, repeating this process until all necessary files have been downloaded. This is often used in downloading a website or other collection of files from the internet, where the initial file (e.g., a HTML page) contains links to other files (e.g., images, stylesheets, etc.) that are also needed to render the content fully. Therefore, wget can be tweaked to work as a website crawler.

Last but not least, wget supports downloads using the FTP, HTTP, and HTTPS protocols, all internet protocols used to transfer data between computers. FTP (File Transfer Protocol) is a standard network protocol that transfers files from one host to another over a TCP-based network. FTP allows clients to upload, download, and manage files on a remote server. HTTP (Hypertext Transfer Protocol) is a standard protocol for sending and receiving information on the World Wide Web. It is the foundation of data communication on the web and is used to transmit data from a server to a client, such as a web browser. HTTPS (HTTP Secure) is a variant of HTTP that uses a secure SSL/TLS connection to encrypt data sent between a server and a client. This makes it more difficult for someone to intercept and read the data as it is transmitted. HTTPS is commonly used to protect sensitive information, such as login credentials and financial transactions.

The wget comes installed on most Linux distributions. To install it on Debian-based distributions use:

sudo apt install wget

The wget syntax is:

wget [options] [URL]

wget invokes the wget utility
[options] instruct wget on what to do with the URL provided. Some options have a long-form and short-form version.
[URL] file or folder to be downloaded. This can also be a website link to download

Download Options

We will cover the following download options

Retrying when a connection fails.
Saving downloaded files with a different name.
Continuation of partially downloaded files.
Download to a specific folder.
Using proxies.
Limit download speeds.
Extracting from multiple URLs.
Mirror a webpage.
Extract entire websites.
Extract a website as if you were Googlebot.

Retrying when a connection fails.

By default wget retries 20 times except for fatal errors. Examples of fatal errors are connection refused or 404 (not found) errors.

  # Long form 
  wget --tries=number http://example.com/myfile.zip

  # Short form
  wget -t number http://example.com/myfile.zip

The number can be specified as 0 or inf for infinite retries

  # Long form 
  wget --tries=0 http://example.com/myfile.zip

  # Short form
  wget -t 0 http://example.com/myfile.zip

Saving downloaded files with a different name.

The default behavior is to save the file with the name derived from the URL. Using the output file option, you can specify a new name for the file.
```
  # Long form
  wget -O newfilename.zip http://example.com/myfile.zip
  # short form
  wget --output-file=newfilename.zip http://example.com/myfile.zip
```
In the code above, the file downloaded myfile.zip will be saved as newfilename.zip .

However, it is important to note that the use of -O is not intended to mean simply “use the name newfilename instead of myfilezip as provided in the URL;”. Rather, it is analogous to shell redirection that works like :
```
  wget -O - http://example.com/myfile.zip > newfilename.zip
```
newfilename.zip will be truncated, and all downloaded content will be saved there.

Continuation of partially downloaded files.

The continue option allows the continuation of a partially downloaded file. This can be a file that was downloaded by another wget instance or another program.
```
  # Long form
  wget --continue http://example.com/myfile.zip

  # Short form
  wget -c http://example.com/myfile.zip
```
Assuming there was a partial download with the name myfile.zip, wget assumes the local copy is the first portion of the file and attempts to retrieve the rest of the file from the server using an offset equal to the length of the local file. In the event, the file found locally, and that on the server are of equal size, wget prints an explanatory message

Consequently, should the file on the server be smaller than the one locally, the assumption made is the file on the server is an updated version. However, if the file on the server is bigger than locally due to a new version, the download will continue and the result becomes a garbled file.

The continue option only works with FTP servers and HTTP servers that support the Range header.

Download to a specific folder

To download a file with wget and save it to a specific folder, the -P or --directory-prefix option followed by the path to the directory where files are to be saved are used.

For example:

# Long Form
wget -directory-prefix=/path/to/directory http://example.com/myfile.zip

# Short form
wget -P /path/to/directory http://example.com/myfile.zip

Using proxies

To use a proxy with wget, the --proxy option followed by the proxy URL are used.

For example:

wget --proxy=http://myproxy.example.com:8080 http://example.com/myfile.zip

This downloads the file myfile.zip from http://example.com using the HTTP proxy at http://myproxy.example.com:8080.

If the proxy requires authentication, use the --proxy-user and --proxy-password options to specify the username and password.

For example:

wget --proxy=http://myproxy.example.com:8080 --proxy-user=myusername --proxy-password=mypassword http://example.com/myfile.zip

The --proxy-type option can also be used to specify the type of proxy being used. Valid values include http, socks4, and socks5.

For example:

wget --proxy=http://myproxy.example.com:8080 --proxy-type=socks5 http://example.com/myfile.zip

If all downloads require going through a proxy every time, modify the ~/.wgetrc file. You can do this with nano or your favorite text editor

nano ~/.wgetrc

Add the following lines to the file

use_proxy = on
http_proxy =  http://username:password@proxy.server.address:port/
https_proxy =  http://username:password@proxy.server.address:port/

Limit download speeds

The --limit-rate option is used to limit the download speed of wget. For example, to limit the download speed to 100KB/s:

wget --limit-rate=100k http://example.com/file.zip

The maximum download speed in bytes per second can be specified by using the B suffix. For example, to limit the download speed to 100KB/s:

wget --limit-rate=100000B http://example.com/file.zip

Extracting multiple URLs

To download multiple files with wget, specify multiple URLs on the command line, separated by spaces.

For example:

wget http://example.com/file1.zip http://example.com/file2.zip http://example.com/file3.zip

This will download the files file1.zip, file2.zip, and file3.zip . If you have a list of URLs in a file, you can use the -i or --input-file option to tell Wget to read the URLs from the file.

For example:

# Long form
wget --input-file=url-list.txt
# Short form
wget -i url-list.txt

Mirror a webpage

To create a mirror of a webpage, you can use the wget command-line utility with the -m or --mirror option. This option tells wget to download the webpage and all the files it references (such as images, stylesheets, etc.), recursively and save them to your local machine.

For example:

# Long form
wget --mirror http://example.com

# Short form
wget -m http://example.com

This downloads the webpage at http://example.comand all the files it references, and save them to the current working directory. The files will be saved in a directory structure that mirrors the structure of the website, with the home page saved as index.html and other pages saved in subdirectories as necessary.

You can also use the -k or --convert-links option to tell wget to convert the links in the downloaded HTML pages to point to the local copies of the files so that the downloaded copy of the website can be browsed offline.

For example:

# Long form
wget --mirror --convert-links http://example.com

# Short form
wget -m -k http://example.com

Extract entire websites

To download an entire website with wget, you can use the --recursive option to follow links and download all the files it finds, recursively.

For example:

wget --recursive http://example.com

Extract the website as Googlebot

To extract a website as if you were Googlebot using Wget, you can use the --user-agent option to set the user agent string to "Googlebot". For example:

wget --user-agent="Googlebot" http://example.com

The website will be downloaded using the user agent string "Googlebot", which tells the server that the request is coming from Google's web crawler. This can be useful if you want to see how Googlebot would render a website or if you want to download a website while bypassing any restrictions that may be in place based on the user agent.

You can also combine options demonstrated, for example:

wget --user-agent="Googlebot" -O example.html http://example.com

Resources:

https://www.gnu.org/software/wget/manual/wget.html#Examples

How to download large files on Linux from the terminal

Advantages of using `wget`

Download Options

Retrying when a connection fails.

Saving downloaded files with a different name.

Continuation of partially downloaded files.

Download to a specific folder

Using proxies

Limit download speeds

Extracting multiple URLs

Mirror a webpage

Extract entire websites

Extract the website as Googlebot

Comments

DevOps

alias commands in Linux

More from this blog

How to transfer files from Linux to Windows: The SMB network file-sharing protocol

alias commands in Linux

Building microservices using Terraform, Ansible, Docker, Docker Compose, and Github Actions.

How to publish an npm package.

Command Palette

Advantages of using wget

Download Options

Retrying when a connection fails.

Saving downloaded files with a different name.

Continuation of partially downloaded files.

Download to a specific folder

Using proxies

Limit download speeds

Extracting multiple URLs

Mirror a webpage

Extract entire websites

Extract the website as Googlebot

Comments

DevOps

alias commands in Linux

More from this blog

Advantages of using `wget`