Using rsync and cron to automate incremental backups

There are two kinds of people in the world: those who back up, and those who have never experienced severe data loss.

Data loss is a serious concern for both individuals and companies that rely on computers for their everyday life or operations. Those who have a Unix-based system have powerful tools to prevent it, such as rsync to back up the information and cron to automate the backup process. In previous posts I wrote about the basics of rsync and its usage as a daemon, as well as the basics of cron. In this post the focus is on the backup functionality of rsync, and on using it together with cron to automate the backup process.

Topics
Full-backup and incremental backup.
The basics of the --backup parameter
Using a backup directory
Using the parameters --backup and --delete together
rsync and cron for automated incremental backups.
Rotatory incremental backup with rsync and cron
Example case:
Footnotes

Full-backup and incremental backup.

A full backup, as its name implies, is a copy of each and every one of the files and directories (and subdirectories) into a different location. This is a good way to back up your files (and the initial step of any incremental or differential backup system). However, as the number and the size of your files grow, it becomes burdensome and time consuming to do a full backup every single day, or at whatever interval you require based on the importance of the information and the frequency at which it changes. And yet, you should back up your information as often as required, especially if you are working with critical information.

Here is where incremental backups come in. As I wrote before in Update the contents of a folder, rsync allows us to transfer only the most recent changes to our files, as well as the new files that were created since the last copy, by using its delta-transfer algorithm. When we enable the backup functionality of rsync, rsync keeps a copy of the files that are about to be modified, either as a similarly named file with a suffix or in a directory dedicated to the backups, and if we use the --delete parameter, it also keeps the files that are about to be deleted. Once all of these files have been copied into the backup directory or given a new name, rsync updates the files with the new information, and if the --delete parameter was used, the files that were removed from the source folder are deleted in the destination folder. rsync already skips files whose size and modification time have not changed, with or without --backup, so the -u parameter is not necessary for this.

In this manner, we end up with a fully updated folder, as well as a backup of the files that were modified since the last time they were copied, using a smaller data transfer and without needing the space for a second full backup. Should we need to revert to a previous backup, all we need to do is copy the files from the backup over the updated ones and delete any files that were created after the time of the backup that we are restoring.

The basics of the --backup parameter

To create a backup of the files that are about to be modified we use the parameter --backup (or -b in its short form), e.g.:

rsync -a --backup source/ destination/
rsync -ab source/ destination/

By default, the suffix ~ is appended to the name of the file that is saved as a backup, i.e., when we update file01.txt, this file will be copied into a file called file01.txt~, and then file01.txt will be updated with the new content. To change the suffix of the file that is created as a backup, we use the --suffix parameter, e.g.:

rsync -ab --suffix=.old source/ destination/

This will add the suffix .old to the backup files, i.e., file01.txt would be copied into file01.txt.old, and then file01.txt would be updated with the new information. We can add the date of the backup to the suffix by using the command date together with the suffix parameter:

rsync -ab --suffix=_`date +%F` source/ destination/

This would append the date in the format YYYY-MM-DD to the name of the file, e.g., if today were February 12, 2010, and we backup file01.txt, the resulting file would be called: file01.txt_2010-02-12

To restore from the backup if we only attached a suffix, we could use the following commands:

suffix='~'  # the suffix that was given to --suffix (~ is the default of rsync)
for name in *"$suffix"; do
    cp "$name" "$(basename "$name" "$suffix")"
done

You can use mv instead of cp if you don't want to leave the backup file. For example, if we used --suffix=.old with rsync:

for name in *.old; do
    cp "$name" "$(basename "$name" .old)"
done

As a note, if you use a custom suffix for the backups, it is recommended to add ~ at the end of the suffix anyway, so that it is easier to tell normal files and backup files apart, and easier to handle the backups should we need to restore the files.
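
The loops above only look at the current directory. As a rough sketch of how the same idea could be applied to a whole directory tree, the following function (restore_suffix is a hypothetical name, not something that rsync provides) walks every subdirectory with find. It assumes that no file name contains a newline.

```shell
#!/bin/sh
# restore_suffix DIR SUFFIX
# Copy every "<name><SUFFIX>" backup found under DIR over "<name>".
# Hypothetical helper for illustration; it is not part of rsync.
restore_suffix() {
    dir=$1
    suffix=$2
    find "$dir" -type f -name "*$suffix" | while IFS= read -r backup; do
        # Strip the suffix from the end of the path to get the original name.
        original=${backup%"$suffix"}
        cp -p -- "$backup" "$original"
    done
}
```

It could then be called as restore_suffix destination .old. The backup files are left in place, as with cp in the loops above.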

Using a backup directory

While using a suffix to back up the contents of a folder is enough in some cases, for more complex file and directory structures we can store the backup of these files, along with the whole directory and subdirectory structure, in a different folder instead of just renaming the files. Another benefit of this is a better organization of the backups. We specify the directory where we want to store the backup with the parameter --backup-dir. A relative path is taken to be inside the destination directory. If the specified directory doesn't already exist, rsync will create it. For example:

rsync -ab --backup-dir=backup source/ destination/

This will create a folder called backup in the destination folder, and backup in there any files that are about to be modified. If we want the backup folder to be created in a different location, we can use:

rsync -abv --backup-dir=/home/juan/Backups/example_backup source/ destination/

This would store the backup in a directory called example_backup inside of /home/juan/Backups.

Using the parameters --backup and --delete together

As previously mentioned in this post, we can also delete from the destination folder the files that were already deleted in the source folder, which synchronizes the folders. With the --backup parameter, the files that are to be deleted in the destination folder are copied into the backup directory as well, or given a suffix if that is what you are using, before rsync deletes them from the folder, e.g.:

rsync -ab --delete source/ destination/

But be careful with the parameter --delete. If we are storing backups in the destination folder, or in a directory inside of the destination folder, the --delete parameter is going to delete old backups, as they are not in the source folder. Or it will attempt to, as in the following situation:

Say we already have a folder called backup inside of the destination directory, and we run rsync again with --backup-dir=backup. As rsync is going to attempt to delete every file and folder that is not in the source directory, it would back up the backup folder, which would create a backup folder inside our already existing backup folder, and then it would attempt to delete the backup folder and fail because it is using it to back up files. Run this multiple times and you may end up with a backup folder inside a backup folder inside a backup folder. In short, an undesirable mess.

If we are creating the backup folder inside of the destination folder, we should take the precaution of adding exclusion rules that leave the backup folder(s) out of the rsync run that is creating the backups. For example, in this particular case:

rsync -ab --backup-dir=backup --delete --exclude=backup source/ destination/
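
To see what this combination of parameters does without touching real data, it can be tried on throwaway folders. The following is only a sketch: every path in it is a temporary, made-up one, and it does nothing if rsync is not installed.

```shell
#!/bin/sh
# Throwaway demo of --backup-dir, --delete and --exclude working together.
# All the paths are temporary and hypothetical.
if command -v rsync >/dev/null 2>&1; then
    demo=$(mktemp -d)
    mkdir "$demo/source" "$demo/destination"
    echo "first version" > "$demo/source/keep.txt"
    echo "to be removed" > "$demo/source/gone.txt"

    # First run: a plain full copy, there is nothing to back up yet.
    rsync -ab --backup-dir=backup --delete --exclude=backup \
        "$demo/source/" "$demo/destination/"

    # Change one file and delete another one in the source.
    echo "second version, longer" > "$demo/source/keep.txt"
    rm "$demo/source/gone.txt"

    # Second run: the previous versions are moved into destination/backup.
    rsync -ab --backup-dir=backup --delete --exclude=backup \
        "$demo/source/" "$demo/destination/"

    # List what ended up in the destination.
    find "$demo/destination" -type f | sort
fi
```

After the second run, the listing should show the updated keep.txt in destination, and the previous keep.txt together with gone.txt inside destination/backup. Keep in mind that --exclude=backup also skips anything called backup in the source folder.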

Some of this may not look that useful right now, but since the goal is to automate the backup process with cron, the --backup-dir, --delete and --exclude parameters will become very handy.

rsync and cron for automated incremental backups.

This section assumes that you know how to specify the time in a crontab. For more information about this, you can check my previous post Using cronjobs to automate tasks.

If we were to make a full daily backup, we could use the following crontab:

@daily rsync -au --delete source/ destination/

This will synchronize the destination folder with the source folder, and we would have a full backup at the destination. However, we would only have the files exactly as they were in the source at the time of the last copy. We can't go back to any other point in time. To keep a copy of the files as they were before we update them, that is, an incremental backup, we should use:

# In a crontab, % must be written as \% (cron turns a bare % into a newline)
@daily rsync -ab --backup-dir=old_`date +\%F` --delete --exclude='old_*' source/ destination/

Now we would have the full backup in the destination directory, and a backup of the files as they were before this last run in a separate directory. If this backup occurred on February 12, 2010, then the folder would be called old_2010-02-12. This would of course create a different backup folder for each and every day that it is run. To restore the files to their state on a certain day, we would need to copy the backup folders over the full backup folder one after another, starting with the most recent and ending with the one for the day that we want to restore, and then delete any new files that were created after that day.
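
The restore procedure just described can be sketched as a small function. This is only an illustration under a few assumptions: restore_before is a hypothetical name, the backup folders follow the old_YYYY-MM-DD naming used above, and the result is written to a separate folder instead of over the full backup, so that nothing is lost if something goes wrong. It does not delete the files that were created after the chosen day; that step still has to be done by hand.

```shell
#!/bin/sh
# restore_before DEST WANTED TARGET
# Rebuild in TARGET the files as they were just before the backup run of
# the day WANTED (YYYY-MM-DD). Hypothetical helper for illustration; it
# assumes that DEST holds the full backup plus the old_YYYY-MM-DD folders.
restore_before() {
    dest=$1
    wanted=$2
    target=$3
    mkdir -p "$target"

    # Start from the current full backup, leaving the old_* folders out.
    for entry in "$dest"/* "$dest"/.[!.]*; do
        [ -e "$entry" ] || continue
        case ${entry##*/} in
            old_*) ;;
            *) cp -Rp "$entry" "$target"/ ;;
        esac
    done

    # Overlay the backup folders from the newest one down to WANTED, so
    # that the oldest version of each file is the one that is written last.
    # Dates in this format sort chronologically, so a reverse sort is enough.
    ls -d "$dest"/old_* 2>/dev/null | sort -r | while IFS= read -r dir; do
        if expr "x${dir##*/old_}" \< "x$wanted" >/dev/null; then
            break
        fi
        cp -Rp "$dir"/. "$target"/
    done
}
```

For example, restore_before destination 2010-02-12 /tmp/restored would rebuild in /tmp/restored the files as they were before the run of February 12, 2010.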

Rotatory incremental backup with rsync and cron

A rotatory incremental backup is one in which we reuse a fixed number of backup folders for the incremental backups. Instead of, for example, adding the whole date to the name of the folder when we are doing a daily backup, we could use only the day of the week. This would create the folders old_0, old_1 ... old_6, and then it would use the folder old_0 again at the start of the next week. We would remove this folder if it exists, and recreate it for the new backup. In this manner we could go back in time up to 7 days. The obvious advantage of this is that we would not end up with a massive number of backup folders (and a lot of old data), since we would be reusing the space and the names of the folders. If seven days is too short, we could use the day of the month to be able to go back about 30 days in time, or the day of the year to be able to go back up to 365 days in time.
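
As a small sketch of the three rotation periods mentioned above, the folder suffix can be taken directly from the date command. The function name rotation_suffix and the period names week, month and year are made up for this example.

```shell
#!/bin/sh
# rotation_suffix PERIOD
# Print the folder suffix for a rotation of one week, month or year.
# Hypothetical helper; the period names are made up for this sketch.
rotation_suffix() {
    case $1 in
        week)  date +%w ;;   # day of the week, 0-6 (Sunday is 0)
        month) date +%d ;;   # day of the month, 01-31
        year)  date +%j ;;   # day of the year, 001-366
        *) echo "usage: rotation_suffix week|month|year" >&2; return 1 ;;
    esac
}

# Name of the backup folder to recycle today with a weekly rotation.
echo "old_$(rotation_suffix week)"
```

Inside a script the % characters can be written as they are. They only need to be escaped as \% when the date command is typed directly into a crontab line.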

Example case

A rotatory daily incremental backup to a remote computer that has rsync over SSH access, which allows me to go back in time up to a year.

We start by setting up the environment for this. We need, of course, a remote computer that allows us to use rsync over SSH. I am going to use Dreamhost as an example, since it gives me 50GB to do whatever I want with (apart from the unlimited space for hosting websites) and it does give me this sort of access, so I use it to back up information that is not extremely private (I encrypt private information when I back it up).

Since I am using Dreamhost as an example, I have to say that the backup users don't actually have a full SSH shell available, even though the hosting users that I create do, but we can use rsync over SSH, and we can use SFTP. If you have a full SSH shell available on the host where you perform the backups, you can follow the guide Passwordless SSH using digital signatures instead of the procedure described here to create authorized_keys via SFTP.

So, let's get started.

First of all, I create an entry in ~/.ssh/config as seen in Defining SSH servers, and name it remote_backup. The entry looks like this:

Host remote_backup
HostName server_name.dreamhost.com
User my_username

After this, I create the key pair for the digital signature if I don't already have one:

ssh-keygen -t rsa -b 2048

Then I log in via SFTP, create the .ssh folder, and copy my newly generated or already existing id_rsa.pub into .ssh/authorized_keys on the remote server. Since I am already there, before I exit I create the folders Backups and my_work that I am going to use:

sftp remote_backup
(Type the password. After logging in we are in an SFTP shell, where we use these commands)
mkdir .ssh
cd .ssh/
put -p /home/juan/.ssh/id_rsa.pub authorized_keys
cd ..
mkdir Backups
mkdir Backups/my_work
exit

Remember to use your own information (host, usernames). If all went right, you can now simply type sftp remote_backup and it will give you the SFTP prompt right away, without the need to type a password.

Since I need to delete the folder that I am going to recycle as well as perform the backup, I am going to create a script that does both, and call this script daily with cron. The first time that this script is run it is going to create a full backup, as the destination directory on the host will be empty. In the case of Dreamhost, which I am using in the example, I will delete the folder that I am going to recycle via SFTP. I also include the command to do this in an SSH shell in case you have one available; just uncomment it if you need it and comment out the SFTP part. I am going to call this file backup_auto.sh, store it in my Scripts folder, and give it execution privileges:

touch ~/Scripts/backup_auto.sh
chmod u+x ~/Scripts/backup_auto.sh

And then open the file in your favorite text editor, and add the following lines:

#!/bin/sh

SUFFIX=$(date +%j)

# Delete the folder in the remote host via SSH
# ssh remote_backup 'ls Backups/my_work/backup_'$SUFFIX' && rm -r Backups/my_work/backup_'$SUFFIX

# Delete the folder in the remote host via SFTP
# (note that the rmdir of sftp only removes a folder that is empty)
sftp -b /dev/fd/0 remote_backup <<EOF
cd Backups/my_work
rmdir backup_$SUFFIX
exit
EOF

# Update the information, creating a backup folder and ignoring the rest of
# the backup folders. The list of files is given with its full path, because
# a relative path would be looked up in the current working directory.
rsync -ab --recursive --files-from=/home/juan/Scripts/to_backup.txt \
    --backup-dir="backup_$SUFFIX" --delete --filter='protect backup_*' \
    /home/juan/ remote_backup:Backups/my_work/
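
One caveat with the SFTP part: the rmdir command of sftp only removes an empty folder, so a recycled backup folder that still holds last year's files would survive it. A possible workaround, sketched here under the assumption that the host accepts rsync the same way it does for the backup itself, is to synchronize an empty temporary folder over the one that we want to recycle. The function name empty_dir_with_rsync is hypothetical.

```shell
#!/bin/sh
# empty_dir_with_rsync TARGET
# Empty TARGET (a local path or a host:path) by synchronizing an empty
# folder over it. TARGET is created if it does not exist yet.
# Hypothetical helper for illustration.
empty_dir_with_rsync() {
    empty=$(mktemp -d) || return 1
    # -r is needed for --delete to act; no other attributes are copied.
    rsync -r --delete "$empty"/ "$1"/
    status=$?
    rmdir "$empty"
    return $status
}
```

In the script it could be called as empty_dir_with_rsync "remote_backup:Backups/my_work/backup_$SUFFIX" before the backup itself. The folder is left in place but empty, which is enough for rsync to reuse it as the backup directory.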

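And now I create a file called to_backup.txt, and add to it the files and folders that I want to back up to the remote host. The paths are relative to the folder that we specify as the source folder, which in my case is my home folder, so these files and folders reside in that folder. Note that rsync looks up a relative --files-from path in the current working directory, not in the folder of the script, so a relative path only works when the script is run from the folder that contains the file (cron normally starts jobs in the home folder). Giving --files-from the full path to the file avoids this problem: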

touch ~/Scripts/to_backup.txt

Open it in your favorite editor and add the files and folders that you want to backup, in my case this would be:

.vimrc
.vim/
Projects/
Documents/Work/
Scripts/
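
Before the first run it can be useful to confirm that every entry of the list actually exists, since a typo here means that something silently stays out of the backup. The following is a minimal sketch of such a check; check_backup_list is a hypothetical helper, not an rsync feature.

```shell
#!/bin/sh
# check_backup_list LIST SOURCE
# Print each entry of LIST that does not exist under SOURCE, and return
# a non-zero status if anything is missing. Hypothetical helper.
check_backup_list() {
    list=$1
    source_dir=$2
    missing=0
    # The "|| [ -n ... ]" part also handles a last line without a newline.
    while IFS= read -r entry || [ -n "$entry" ]; do
        [ -n "$entry" ] || continue    # skip blank lines
        if [ ! -e "$source_dir/$entry" ]; then
            echo "missing: $entry"
            missing=1
        fi
    done < "$list"
    return $missing
}
```

With the files of this example it would be called as check_backup_list ~/Scripts/to_backup.txt /home/juan, and it prints nothing when every entry is found.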

At this point, we can test the script with:

sh ~/Scripts/backup_auto.sh

The final step is to edit our crontab file and make it run the script daily:

crontab -e

Add at the bottom a line like this:

@daily /home/juan/Scripts/backup_auto.sh

And that is all. Feel free to send me an e-mail if you find any factual errors or if something was confusing.

Footnotes

This example uses Dreamhost because it is what I have available at the moment, other than local folders.

As I said in previous posts about rsync, the number of options in this command is huge, and a post covering every aspect of rsync would be far too long, as rsync can be customized to handle virtually any particular need that someone may have. The purpose of this guide is to make it simple to use rsync for what I believe are common needs. Any of the articles covering rsync may be expanded in the future, or corrected if I find any factual errors.