djangoproject.com | python.org | nginx.org
version seven.
  http://demongin.org
demongin.org - PostgreSQL Backup Tips for n00bz

PostgreSQL Backup Tips for n00bz

A crash course in the basics of automating PostgreSQL backups using basic bash, simple SQL and some intermediate-level python.


Sunday, 2010-03-28 | AlmostEffortless, Careerism, Programming

Full disclosure: I'm no postgres expert. In fact, I'm pretty much a total noob myself.

Which, ironically, makes me uniquely qualified to write the Postgres tutorial that follows. An old hand is exactly the guy who you don't want to write a guide about common newbie problems because, as the old writers' saw goes, you do your best writing when you write what you know and the guy who has been working with PG since the 80's definitely isn't going to have problems with his pg_hba.conf or basic SQL commands.

But I have recently had problems like that, and I managed to solve them. About two weeks back, I had two databases on two servers that were only getting backed up when the whole filesystem got backed up: MySQL is still my bread and butter--it powers most of the applications I support--and while I had a robust enough scheme for making, date-stamping and backing up dumps from my various MySQL databases, I had nothing of the sort in place to protect my Postgres data. And while this was acceptable (due, primarily, to the non-mission-critical nature of the data contained within those databases), it was far from optimal.

I ended up writing my program in python, but what follows will be language-neutral: I plan to mostly talk about things that happen between you (the sysadmin) and your filesystem. I will stop to make some remarks at the end that will be python-specific, but the main points I have to make are more widely applicable in that all you need to know to make use of them is a little bit of bash.

Selecting all Databases

The first thing you're going to need, if you're going to be backing up multiple databases on multiple servers, is a way to automate the query by which you determine which databases you're going to need to dump: every whole-database backup starts with a list of databases.

In MySQL, this is drop-dead easy: a quick "show databases" will get you all the information you need. In Postgres, the pg_database table is going to be your friend. Connect as the postgres user (or someone with equivalent access) and take a look at it: it's chock full of the Good Stuff That Kids Go For.

When you're writing your automated backup, you can retrieve a list of all of your postgres databases thus:
select datname from pg_database;
It is probably an incredibly bad habit that I should not spread to others (but hey: that's what newbies do!), but I like to create a "root" user in my postgres databases. It's an old habit, one of those ones that dies hard, from my MySQL days and, even though I know it's wrong, it makes administering my postgres databases a little bit easier: rather than doing # su postgres -c psql and acting as the built-in postgres user whenever I need to shuffle things around in my database, I just keep a .pgpass (more on that in a second) in my UNIX root's homedir and do a quick psql whenever I need to get into my database.

It also comes in handy to have "root" user in your database when you're automating backups: you can execute your backup script as UNIX root, execute the necessary database functions without switching users and so on.

In order to create this guy and give him the appropriate permissions, you can toss off the following easy-peasy bash, enter a password and be done with it:
# su postgres -c "createuser -daP root"

.pgpass

The final thing that will save you some time when you're whipping up an automated database dumping scheme from scratch is know how to work a .pgpass file. You can (and probably should) read the full documentation here, but what it boils down to is summarized in the following bullet points:
  1. The file belongs in the home directory of the UNIX user who will be doing the backup.
  2. The file must not be world- or group-readable.
  3. The basic syntax is
    dbhostname:port:database:dbusername:password
  4. You can substitute an asterisk (i.e. wildcard) for any of those values except "dbusername" and "password".
So basically, since your automatic backup will probably be executed by your (UNIX) root user, you're going to want to whip up a .pgpass in his homedir with a little bash like this:
# echo "localhost:5432:*:root:xxxxxxxxxx" > .pgpass
Then you're going to modify the permissions on that file like this:
# chmod 0600 ~/.pgpass
And that's that. Your access and permissions are good to go and you're ready to use the scripting language of your choice to set up an automatic backup.

And when you do get around to scripting that backup, you're probably going to want to gzip or bzip your dumps: as most backups end up traveling across some kind of network, it helps to have them be as compressed as possible. The simplest bash one-liner for gzipping up your SQL dumps is probably the one where you throw them all into one big archive like the man page recommends:
cat database1.sql database2.sql | gzip -9 > databaseDump.gz 
But when I wrote my program (in python), I did the actual dump commands with subprocess and created individual archives for my dumps. Here's a snippet:
outputFile = open(dbName+todaysDate+".sql.gz", "wb")
command1 = ["/usr/bin/pg_dump",dbName]
command2 = ["gzip","-9"]
p1 = subprocess.Popen(command1, stdout=subprocess.PIPE)
p2 = subprocess.Popen(command2, stdin=p1.stdout, stdout=outputFile)
p1.wait()
p2.wait()
outputFile.close()
This, as I mentioned, leaves you with a collection of individual archives that you can then stash in some place safe until some programmer's hasty application deploy nukes yesterday's production data and you get to play hero.