Posted by mop
Wed, 11 Aug 2004 16:46:00 GMT
Herewith some notes from my attempt to migrate a database instance from Microsoft SQL Server 7 to PostgreSQL 7.3. My journey began with Ian Harding’s how-to, and it’s a good place to start.
export from SQL Server
The bcp utility is a quick and flexible command line utility that extracts raw table data (or query result) to a file. It works pretty much as advertised, with the only tricky parts being the treatment of nulls and character encoding. Ian suggested using the -k parameter which forces bcp to use a null character (x00) to represent an empty field, it’s probably the right thing to do. Unfortunately bcp does not distinguish between empty fields(i.e. value is null) and fields containing an empty string. Character encoding can be dealt with in two ways: the -c parameter will force all text data into ASCII text, or the -w parameter will encode text as UTF-16. The two-byte representation would be a no-brainer, except that the data swells to (almost) twice the original size.
Here’s what I used for each table in the database:
bcp dbname..tablename out ’filename’ -w -k -t "<f-end>" -r "<record-end>" -b 1000
where -b is the number of rows per transaction, and the -t and -r parameters indicate the field and record delimiters. The key when choosing delimiters is to avoid conflicts with field values.
mangle the exported data
Here’s what I did to the export of each table (after moving files to a Linux box):
# transform to 8 bit encoding
recode utf-16..utf-8 $1
# TODO check for literal ’
’, ’ ’
# replace back slash with forward slash
perl -pi -e ’s!\!/!g’ $1
# replace tabs with literal ’ ’
perl -pi -e ’s/ /\t/g’ $1
# replace line breaks with literal ’
’
perl -pi -e ’s/
/\n/g’ $1
# replace field delimiter with tabs
perl -pi -e ’s/<f-end>/ /g’ $1
# replace record delimiters with line break
perl -pi -e ’s/<record-end>/
/g’ $1
# remove Windoze line feeds
perl -pi -e ’s/
//g’ $1
# remove nulls
perl -pi -e ’s/x00//g’ $1
Here’s the step by step explanation:
- bcp exports Unicode using UTF-16, PostgreSQL expects UTF-8; and UTF-8 is easier to move around via SCP
- the backslash character is significant when importing into PostgreSQL, and I couldn’t think of a reason to keep them in a field value
- PostgreSQL uses the backslash to encode tabs and line breaks within fields values
- obviously tabs and line breaks are used as delimiters
- just housekeeping, I don’t think the line feeds cause a problem
- nulls seem to confuse PostgreSQL’s import process
Notes: next time around I’ll use sed instead of perl, but I was too lazy to check the syntax of recode for stream ops; you’ll see that the null characters inserted by bcp to represent empty fields are being stripped - could be that we don’t need the nulls in the exports, or that we should keep them in the export and convince PostgreSQL that they are significant.
create PostgreSQL schema
I used brute force. It would be nice to build a schema.sql script with ant and makedata.
import into PostgreSQL
The COPY command allows table data to be imported from a local file. The only tricky part is the interpretation of null field values; following Ian’s lead I’ve specified the empty string:
COPY tablename FROM ’filename’ WITH NULL AS ’’;
This approach worked for all tables except those that contained empty strings in columns defined as ’NOT NULL’. I kluged these tables by altering the schema: "... ALTER COLUMN xxx DROP NOT NULL".
update
It’s probably also a good idea to run the maintenance.sql script to clean up some tables before extracting. Smaller is better.
Posted in Linux, programming | no comments | no trackbacks
Posted by mop
Thu, 29 Jul 2004 03:47:00 GMT
Herewith a record of my reaction to four Java IDEs.
I’ve used Eclipse at various times to do some code quality analysis (managing import statements mostly). In the past it has been cumbersome to use, especially for our Ant/CVS based projects. It’s getting better. Of course my hardware is better than it used to be; IDEs certainly benefit from a beefy CPU and big monitor. The CVS integration is pretty slick even when dealing with branches and remote repositories. The integration with Ant is still a bit tricky, although I’m not sure there is a nice way to mesh Ant builds with a built-in compiler. The built-in compiler appears similar to Jikes (maybe same code base?), perhaps there are plug-ins that offer a more in-depth analysis. I’ve yet to give the debugger in Eclipse a run-through.
Overall rating: 7/10.
Note that I was prejudiced against JBuilder because it’s a commercial product without providing an obvious advantage over the open source alternatives. After filling out several pledge-my-first-born registration forms and downloading 100MB of files, the installation process was painless. Creating projects is intuitive and Borland is the only IDE I know of that is explicit with respect to character encoding -- smart. Unfortunately I could not get JBuilder to check out the example project from CVS, so I did limited testing. The web site implied that the IDE would integrate with Borland’s OptimizeIt tool -- I couldn’t verify this because of the CVS issue, but I did try OptimizeIt outside the IDE. Profiling is probably a good idea if you have oodles of time and a lot of patience (I have neither). The standard version of JBuilder is $500/seat, the enterprise version (EJBs etc) is $3500.
Overall rating: 5/10.
I’ve heard IntelliJ referred to as the Cadillac of Java IDEs. It has the nicest visuals of any of the other candidates -- nice widgets, decent sized icons, and pleasant colours are important if you’re staring at them all day. IntelliJ integrates will with Ant and CVS, although there is a bug in the CVS client that corrupts jar files on checkout. The big selling feature of IntelliJ are the code inspection tools: detection of unused methods and members, complexity measurements, scope analysis, etc.. Most of the issues identified by the inspection tools are presented with suggested fixes. Impressive but overwhelming. IntelliJ IDEA is $500/seat.
Overall rating: 8/10
The IDE backed by Sun, and I hear there are some mind-altering plug-ins on the way from Tim Bray, but it’s not ready for prime time yet. The file system paradigm used for projects was not a good fit with our CVS project, and the compiler was unable to build for lack of memory. I have heard that the NetBeans debugger is head and shoulders above the rest.
Overall rating: 5/10.
Posted in programming | no comments | no trackbacks
Posted by mop
Fri, 23 Jul 2004 19:28:00 GMT
I recall one of my stats profs saying once: ’not all random phenomena can be modelled with a normal distribution, but if your measurements are not normal then you are measuring the wrong thing’. He said it in jest, me thinks, but he had a point.
What happened was, I have been measuring response times for a busy web application. Forgetting most of what I learned in school, I’ve been recording the mean average and peak response times, i.e. average time and peak time it takes to service HTTP requests. (I use a fixed size sample, not samples from a fixed period, see Calculating a moving average.)
Recognizing that peak and average values are not entirely representative of server behaviour, I started by examining the distribution of response times for an arbitrary time period (1000 requests, about 10 minutes worth). It’s obviously not a traditional normal distribution, eh? If it were a normal distribution, we could estimate the mean average and variation at be confident that we’ve accurately characterized the distribution of response times. But it’s not normal so the fun begins.
Like any good engineer, I know just enough to bluff my way through a cocktail party, and enough to know where to look. The NIST has a great online resource: NIST/SEMATECH e-Handbook of Statistical Methods. Skip to the Tools and Aids section and you’ll find a gallery of probability distributions. If I squint a little, any one of Weibull, Lognormal, Power Lognormal could be an appropriate match for our distribution. The Power Lognormal curve is probably the closest, so on to the next step.
Remember, we’re trying to model our response times in a way that will allow us to characterize the distribution at any time with simple metrics like mean average and variance. The characteristic formula for Power Lognormal distributions is

which is expressed in terms of the normal distribution. So, describing our distribution would involve mean, variance plus the power and shape parameters. Four metrics, hmmm.
Here’s where I realize that I’m probably measuring the wrong thing.
Posted in programming | no comments | no trackbacks
Posted by mop
Sun, 18 Jul 2004 17:41:00 GMT
What I need is an efficient generalized approach to calculating a moving average. I hope to measure the time required to complete an operation (many ops per second) and record a moving average of these times for at least two periods. I expect that the periods may be sufficiently large to cause storage issues if all measurements for the period are stored.
As an example, lets say that 100 operations are performed per second, and that I’ll be calculating a 30 minute moving average. That’s 180,000 data points. Of course, I don’t need to store all the measurements in order to calculate a simple mean average, just keep a running sum and a count. Now let’s say I want to record this moving average once every ten minutes -- storing a sum and a count is no longer enough.
There are a few ways I can think of to deal with the problem:
- change the definition of the moving average to include a fixed number of measurements instead of a fixed period
- make an assumption about the number of measurements in a thirty minute period, then the average could be approximated for each new data point using μ = (μ⋅n + x)/(n + 1)
- again, make an assumption about the number of measurements in a period, but adjust the assumption as the rate of operations changes
- store all the data points for at least thirty minutes ;-)
The first option is appealing for a number of reasons: it is simple, accurate, and allows for more more in-depth statistical measurements. It does however call for a judicious choice of the sample size. In our example, we likely would not choose a sample size equivalent to 30 minutes of measurements. Assuming that 30 minutes was an arbitrary measure (it is) I’ll use an equally arbitrary sample size.
Posted in programming | no comments | no trackbacks
Posted by mop
Thu, 08 Jul 2004 03:01:00 GMT
Not sure where I’d like this type of documentation to exist, but it should be preserved for posterity. So... here’s some background on In-Touch’s object-relational framework: Makedata.
Makedata is a Java app that creates *.java files that encapsulate the RDBMS <--> Java mapping. Each object/table is defined in a simple text file () which Makedata transforms into three files:
- table.sql
- a SQL script that creates the table and indices
- tableRawData.java
- a java object representing the raw data (members plus accessor methods, very bean-like)
- tableRawRowBuffer.java
- a java object that knows how to read and write the raw data to/from the database
The generated classes are generally found in the ./target/generated-datafiles/ directory. The Ant task ’datac’ invokes makedata.
Each *RawData class is a sub-class of intouch.data.Data. Each *RawRowBuffer class is a sub-class of intouch.sql.RowBuffer. Generally we extend each of the generated classes and these sub-classes of *RawData and *RawRowBuffer become containers for business logic.
Another important class is intouch.data.View which defines a bit-field representing each fields in a data object. A view of the table/data is defined by turning bits on and off.
We also use these classes to represent table joins. There are two abstract classes intouch.data.ViewData and intouch.data.ViewDataRowBuffer which are extended to define joins. Again, for each object-rdbms map there are two classes, one for data and one for db read/write.
Posted in In-Touch, programming | no comments | no trackbacks
Posted by mop
Sun, 04 Jul 2004 13:46:00 GMT
I’m thinking a lot about measuring performance these days. Not benchmarks, stress testing or profiling, but real time performance. The largest web site we support expects to grow substantially in the coming months and predicting the hardware and software requirements is tough without real time metrics. What I’ve quickly realized is that measuring is the easy part (relative, eh).
We’re running multiple application servers (Apache + Java) behind a single load balancer and in front of a single instance of SQL Server. It’s not difficult to extract metrics from each component in the stack: Apache stats are available via mod_status, the load balancer uses Linux Virtual Server so ipvsadm -l yields lots of info, and database metrics are as easy as select count(*) from .... For now, I’m using MRTG to manage these metrics.
There are still some important metrics missing (running average of response time, SQL TPS) but already it’s apparent that we’ll need a way to distill all the data. ISPs must have the same problem, many servers, many graphs. And what Google must have to do with all their servers, yikes. MRTG does have the ability to trigger an alarm when a metric exceeds a pre-defined threshold, might be worth pursuing.
Posted in programming | no comments | no trackbacks
Posted by mop
Thu, 01 Jul 2004 22:42:00 GMT
I’ve had my first in-depth look at Python and the Java implementation of the Python syntax: Jython. The goal was to create some scripts to generate simple reports on the status of a Java web app and back-end database.
A good thing: Python supports explicit character encoding of the program source. A bad thing: still not sure if there are any Python boolean primitives (Update: found the grammar, no boolean primitives yet).
Python’s syntax comes reasonably natural -- significant white space is less awkward than I expected. I have not explored the object model yet; instead I’ve been relying on (existing) Java classes to do the hard work.
Combining a well-defined scripting language with Java is a powerful idea. The scripts are short and readable, and the process was very efficient. There was re-use within the scripts themselves and they are not particularly fast, but that’s ok when writing quick and dirty reports.
If you’re interested, there is a lengthy comparison of the Java apple and Python orange, probably out of date. I stumbled on this while looking for the Python equivalent of javadoc (G: python + javadoc). I think the answer is pydoc.
Posted in programming | no comments | no trackbacks
Posted by mop
Tue, 29 Jun 2004 03:45:00 GMT
More fun with character encoding... this time with Perl. I’ve been down the encoding path with Java, discovering on the way some of the flaws in the IO libraries (see esp. FileWriter and FileReader).
Similar to Java, Perl does support Unicode in it’s string representation, UTF-8 actually. It’s nice, ’cause Perl regular expressions can act on Unicode strings -- as long as you’re consistent in using UTF-8 characters. The problems start when using modules that don’t handle anything other than 7-bit ASCII. Same mistakes as you’ll find in Java libraries. To be fair, the Perl modules are centrally organized via CPAN, but not designed/written/QA’d by the owners of Java, so we should cut the Perl module authors some slack.
But you’d think that a Perl module related to XML would be different. The XML:CSV module is a handy tool that uses the Text::CSV_XS library to parse comma delimited (whatever that is) records and spit out proper XML. Unfortunately the CSV_XS module silently fails when it encounters non-ascii data.
Perl itself does a decent job of explicit handling of character ecoding when doing IO. Perl started as a text processing engine, and text based IO is still the bread and butter for the Perl sandwich. There is even an extensive description of Perl’s Unicode and encoding if you go looking. Specify the encoding for any file or stream, before or after it’s been opened, and Perl transparently converts as required.
Ok, so we know that Perl uses Unicode internally, and that some modules expect ASCII only. My immediate problem involves XML so I’m quite happy to use 7-bit ASCII plus entities for the extended characters. Turns out this is easily done with the Encode module. Problem solved, I think.
Update: of course, all this applies to recent versions of Java/Perl. In particular, the clever conversion from ISO-8859-1 to ASCII doesn’t work with the version of Perl in Debian woody. yuck.
Posted in web, programming | no comments | no trackbacks