[Slony1-general] LATIN1 -> UTF8 conversion

Sun Jan 1 20:23:00 PST 2006

Hello:

As were getting closer to moving to 8.1 and UTF8 i decided to do some  
testing.  I'm running slony 1.1.5-rc2.  The provider database is  
encoded in LATIN1/postgresql 8.0.  The slave is UTF8/postgresql  
8.1.   Out of the box, slony fails to transfer the data.  When I  
first start slon, the initial copy fails with this error:

postmaster log:
ERROR:  invalid UTF-8 byte sequence detected near byte 0xae
CONTEXT:  COPY developer_converted_to_entity, line 71, column  
developer_name: "Spatializer? "

slony log:
2006-01-01 13:51:25 MST ERROR  remoteWorkerThread_1: copy from stdin  
on local node - PGRES_FATAL_ERROR ERROR:  invalid UTF-8 byte sequence  
detected near byte 0xae

What's happening is that the data is coming from the provider in  
LATIN1 and that raw data is given to the subscriber which is  
expecting UTF8 -- the LATIN1 binary data is obviously not UTF8 and  
the subscriber node query fails during conversion.

I've come up with a simple solution that is working well, but  
probably not desired for typical users.

What I've done is change slon_connectdb() in dbutils.c to always set  
the client encoding to UTF8.  This pushes any encoding issues to the  
postmaster on the provider and subscriber.   The provider database  
will output in UTF8; if the provider database isn't UTF8, results  
will be converted to UTF8.   The reverse happens on the subscriber.    
It's a pretty simple solution, but it causes additional overhead to  
be added to every slony event and a certain trust that postgresql  
knows how to properly convert to and from different encodings.  For  
normal environments that are running the same encoding on the  
provider and subscriber it's an extra conversion that's not  
required.  For situations where the provider and subscriber are both  
UTF8, the overhead is a non issue.

This obviously will only work if the subscriber's encoding is a  
superset of the provider.  If you have a provider that's UTF8 and a  
subscriber that's LATIN1 then things will still fail when a non  
latin1 character is in the provider database.

If there is interest in having this be some configurable option, I'm  
happy to help out.  One idea i had was to have a flag for the slon  
process called something like "-e <transferencoding>" that would tell  
slon what encoding to use for transfer of the data.   If it wasn't  
supplied, then the no client_encoding would be set and it would  
default to the previous behavior.

Here's a patch if anyone else is interested in using slony this way:

--- /usr/local/src/slony1-1.1.5-rc2/src/slon/dbutils.c  2005-03-08  
15:52:37.000000000 -0700
+++ dbutils.c        2006-01-01 20:50:57.000000000 -0700
@@ -99,6 +99,26 @@
         conn = slon_make_dummyconn(symname);
         conn->dbconn = dbconn;
+       /* set the client encoding to UTF8 as a common ground for
+          transfering data.  The postmaster will be responsible
+          for dealing with encoding issues */
+       PGresult *res;
+       SlonDString query;
+
+       dstring_init(&query);
+
+       slon_mkquery(&query, "set client_encoding to utf8");
+       res = PQexec(dbconn, dstring_data(&query));
+       if ( ! ((PQresultStatus(res) == PGRES_TUPLES_OK) ||
+              (PQresultStatus(res)  == PGRES_COMMAND_OK)) )
+       {
+               perror("slon_connectdb: failed to set encoding to  
utf8");
+               slon_abort();
+       }
+        PQclear(res);
+
+       slon_log(SLON_DEBUG1, "client encoding set to utf8\n");
+
         return conn;
}


On Nov 19, 2005, at 10:18 PM, cbbrowne at ca.afilias.info wrote:

>> I've used slony in the past to migrate from 7.4 to 8.0, and i plan on
>> using it to migrate from 8.0 to 8.1 once the 8.1 issues with slony
>> get hammered out.   Additionally, though, I'm curious if slony can be
>> used to migrate from one database encoding to another.  We are
>> currently ready to migrate our database and applications from Latin1
>> encoding to UTF8 encoding.  Is this something slony supports?  Are
>> there caveats that I should be aware of?  I can run pg_dump with --
>> encoding=UTF8 on my Latin1 database and load it successfully into my
>> UTF8 database and have the 8bit Latin1 characters correctly converted
>> into ther UTF8 counterparts.
>
> This falls into the category of "things not explicitly tried yet."
>
> With the fixes in 1.1.2, we now have reasonably proper support for  
> UTF8,
> though that has only really been tested via copying from the same  
> encoding
> into the same encoding.
>
> I'd suggest trying it out; if it works well, that would be worth  
> adding to
> the documentation.  If there are wrinkles, that's also something  
> others
> might care to know.
>
> I'd suggest testing it out with a reasonably diverse set of tests;  
> make
> sure you have:
>
> a) Examples of each special character so you know each character is
> translated properly;
>
> b) Be sure you test both with data that is present at subscription  
> time
> (which is added via a bulk COPY) and with additional data inserted/ 
> updated
> *after* the subscription is set up (which is introduced via INSERT/ 
> UPDATE,
> which is a quite distinct mechanism from COPY).
>
> Actually, if you can pass on some relevant sample of data, I'd be more
> than happy to see if I can turn it into a test in the "Testbed: The  
> New
> Generation" ;-)  Assuming it all works, that approach would make it  
> part
> of the regression tests!
>
>

--------------------------------------------
MobyGames
http://www.mobygames.com
The world's largest and most comprehensive
gaming database project