Sun Jan 1 20:23:00 PST 2006
- Next message: [Slony1-general] LATIN1 -> UTF8 conversion
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Hello:
As were getting closer to moving to 8.1 and UTF8 i decided to do some
testing. I'm running slony 1.1.5-rc2. The provider database is
encoded in LATIN1/postgresql 8.0. The slave is UTF8/postgresql
8.1. Out of the box, slony fails to transfer the data. When I
first start slon, the initial copy fails with this error:
postmaster log:
ERROR: invalid UTF-8 byte sequence detected near byte 0xae
CONTEXT: COPY developer_converted_to_entity, line 71, column
developer_name: "Spatializer? "
slony log:
2006-01-01 13:51:25 MST ERROR remoteWorkerThread_1: copy from stdin
on local node - PGRES_FATAL_ERROR ERROR: invalid UTF-8 byte sequence
detected near byte 0xae
What's happening is that the data is coming from the provider in
LATIN1 and that raw data is given to the subscriber which is
expecting UTF8 -- the LATIN1 binary data is obviously not UTF8 and
the subscriber node query fails during conversion.
I've come up with a simple solution that is working well, but
probably not desired for typical users.
What I've done is change slon_connectdb() in dbutils.c to always set
the client encoding to UTF8. This pushes any encoding issues to the
postmaster on the provider and subscriber. The provider database
will output in UTF8; if the provider database isn't UTF8, results
will be converted to UTF8. The reverse happens on the subscriber.
It's a pretty simple solution, but it causes additional overhead to
be added to every slony event and a certain trust that postgresql
knows how to properly convert to and from different encodings. For
normal environments that are running the same encoding on the
provider and subscriber it's an extra conversion that's not
required. For situations where the provider and subscriber are both
UTF8, the overhead is a non issue.
This obviously will only work if the subscriber's encoding is a
superset of the provider. If you have a provider that's UTF8 and a
subscriber that's LATIN1 then things will still fail when a non
latin1 character is in the provider database.
If there is interest in having this be some configurable option, I'm
happy to help out. One idea i had was to have a flag for the slon
process called something like "-e <transferencoding>" that would tell
slon what encoding to use for transfer of the data. If it wasn't
supplied, then the no client_encoding would be set and it would
default to the previous behavior.
Here's a patch if anyone else is interested in using slony this way:
--- /usr/local/src/slony1-1.1.5-rc2/src/slon/dbutils.c 2005-03-08
15:52:37.000000000 -0700
+++ dbutils.c 2006-01-01 20:50:57.000000000 -0700
@@ -99,6 +99,26 @@
conn = slon_make_dummyconn(symname);
conn->dbconn = dbconn;
+ /* set the client encoding to UTF8 as a common ground for
+ transfering data. The postmaster will be responsible
+ for dealing with encoding issues */
+ PGresult *res;
+ SlonDString query;
+
+ dstring_init(&query);
+
+ slon_mkquery(&query, "set client_encoding to utf8");
+ res = PQexec(dbconn, dstring_data(&query));
+ if ( ! ((PQresultStatus(res) == PGRES_TUPLES_OK) ||
+ (PQresultStatus(res) == PGRES_COMMAND_OK)) )
+ {
+ perror("slon_connectdb: failed to set encoding to
utf8");
+ slon_abort();
+ }
+ PQclear(res);
+
+ slon_log(SLON_DEBUG1, "client encoding set to utf8\n");
+
return conn;
}
On Nov 19, 2005, at 10:18 PM, cbbrowne at ca.afilias.info wrote:
>> I've used slony in the past to migrate from 7.4 to 8.0, and i plan on
>> using it to migrate from 8.0 to 8.1 once the 8.1 issues with slony
>> get hammered out. Additionally, though, I'm curious if slony can be
>> used to migrate from one database encoding to another. We are
>> currently ready to migrate our database and applications from Latin1
>> encoding to UTF8 encoding. Is this something slony supports? Are
>> there caveats that I should be aware of? I can run pg_dump with --
>> encoding=UTF8 on my Latin1 database and load it successfully into my
>> UTF8 database and have the 8bit Latin1 characters correctly converted
>> into ther UTF8 counterparts.
>
> This falls into the category of "things not explicitly tried yet."
>
> With the fixes in 1.1.2, we now have reasonably proper support for
> UTF8,
> though that has only really been tested via copying from the same
> encoding
> into the same encoding.
>
> I'd suggest trying it out; if it works well, that would be worth
> adding to
> the documentation. If there are wrinkles, that's also something
> others
> might care to know.
>
> I'd suggest testing it out with a reasonably diverse set of tests;
> make
> sure you have:
>
> a) Examples of each special character so you know each character is
> translated properly;
>
> b) Be sure you test both with data that is present at subscription
> time
> (which is added via a bulk COPY) and with additional data inserted/
> updated
> *after* the subscription is set up (which is introduced via INSERT/
> UPDATE,
> which is a quite distinct mechanism from COPY).
>
> Actually, if you can pass on some relevant sample of data, I'd be more
> than happy to see if I can turn it into a test in the "Testbed: The
> New
> Generation" ;-) Assuming it all works, that approach would make it
> part
> of the regression tests!
>
>
--------------------------------------------
MobyGames
http://www.mobygames.com
The world's largest and most comprehensive
gaming database project
- Next message: [Slony1-general] LATIN1 -> UTF8 conversion
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the Slony1-general mailing list