Converting a MS SQL Server database to PostgreSQL

Posted by mop Wed, 11 Aug 2004 16:46:00 GMT

Herewith some notes from my attempt to migrate a database instance from Microsoft SQL Server 7 to PostgreSQL 7.3. My journey began with Ian Harding’s how-to, and it’s a good place to start.

export from SQL Server

The bcp utility is a quick and flexible command line utility that extracts raw table data (or query result) to a file. It works pretty much as advertised, with the only tricky parts being the treatment of nulls and character encoding. Ian suggested using the -k parameter which forces bcp to use a null character (x00) to represent an empty field, it’s probably the right thing to do. Unfortunately bcp does not distinguish between empty fields(i.e. value is null) and fields containing an empty string. Character encoding can be dealt with in two ways: the -c parameter will force all text data into ASCII text, or the -w parameter will encode text as UTF-16. The two-byte representation would be a no-brainer, except that the data swells to (almost) twice the original size.

Here’s what I used for each table in the database:

 bcp dbname..tablename out ’filename’ -w -k -t "<f-end>" -r "<record-end>" -b 1000

where -b is the number of rows per transaction, and the -t and -r parameters indicate the field and record delimiters. The key when choosing delimiters is to avoid conflicts with field values.

mangle the exported data

Here’s what I did to the export of each table (after moving files to a Linux box):

 # transform to 8 bit encoding
 recode utf-16..utf-8 $1

 # TODO check for literal ’
’, ’	’

 # replace back slash with forward slash
 perl -pi -e ’s!\!/!g’ $1

 # replace tabs with literal ’	’
 perl -pi -e ’s/	/\t/g’ $1
 # replace line breaks with literal ’
’
 perl -pi -e ’s/
/\n/g’ $1

 # replace field delimiter with tabs
 perl -pi -e ’s/<f-end>/	/g’ $1
 # replace record delimiters with line break
 perl -pi -e ’s/<record-end>/
/g’ $1

 # remove Windoze line feeds
 perl -pi -e ’s/
//g’ $1

 # remove nulls
 perl -pi -e ’s/x00//g’ $1

Here’s the step by step explanation:

  • bcp exports Unicode using UTF-16, PostgreSQL expects UTF-8; and UTF-8 is easier to move around via SCP
  • the backslash character is significant when importing into PostgreSQL, and I couldn’t think of a reason to keep them in a field value
  • PostgreSQL uses the backslash to encode tabs and line breaks within fields values
  • obviously tabs and line breaks are used as delimiters
  • just housekeeping, I don’t think the line feeds cause a problem
  • nulls seem to confuse PostgreSQL’s import process

Notes: next time around I’ll use sed instead of perl, but I was too lazy to check the syntax of recode for stream ops; you’ll see that the null characters inserted by bcp to represent empty fields are being stripped - could be that we don’t need the nulls in the exports, or that we should keep them in the export and convince PostgreSQL that they are significant.

create PostgreSQL schema

I used brute force. It would be nice to build a schema.sql script with ant and makedata.

import into PostgreSQL

The COPY command allows table data to be imported from a local file. The only tricky part is the interpretation of null field values; following Ian’s lead I’ve specified the empty string:

 COPY tablename FROM ’filename’ WITH NULL AS ’’;

This approach worked for all tables except those that contained empty strings in columns defined as ’NOT NULL’. I kluged these tables by altering the schema: "... ALTER COLUMN xxx DROP NOT NULL".

update

It’s probably also a good idea to run the maintenance.sql script to clean up some tables before extracting. Smaller is better.

Posted in ,  | no comments | no trackbacks

Comments

Trackbacks

Use the following link to trackback from your own site:
/articles/trackback/134

Comments are disabled