Christopher Browne cbbrowne
Mon Jun 6 17:18:39 PDT 2005
Chris Newland wrote:

>Hi All,
>
>I have Slony 1.1.0 beta 5 working fine in a 3 node configuration with nodes
>2 and 3 subscribing directly to node 1.
>  
>
What platform?  I'm kind of thinking NOT Linux; see my commentary way
further down on that...

>When I try to enable log shipping on node 3 using the following slon command
>the slon process quits unexpectedly.
>
>/usr/local/pgsql/bin/slon -a/usr/local/pgsql/data my_cluster
>"dbname=mydbslave user=postgres host=<node 3 ip>"
>
>The slon command is run as the postgres user who has write permission on
>/usr/local/pgsql/data.
>
>Some log files are created but they all seem to be empty of any SQL
>statements:
>
>-- Slony-I log shipping archive
>-- Node 2, Event 416
>-- at... Fri Jun  3 11:51:12 2005
>
>start transaction;
>
>------------------------------------------------------------------
>-- End Of Archive Log
>------------------------------------------------------------------
>commit;
>
>  
>
It is not odd for there to be SOME empty log files.  Indeed, supposing
there are no updates going into the system, there will still
periodically be some SYNC events.  The default for sync_interval_timeout
is 60000ms, which means that there will be a timeout, and hence a SYNC,
once per minute even if there is no activity going on at all. 
<http://linuxdatabases.info/info/slon.html>
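
If the idle SYNCs (and hence the empty archive files) bother you, you can
stretch that timeout when starting the slon.  A sketch, assuming the -s
(sync check interval) and -t (sync interval timeout) options behave as
documented at the URL above -- verify against the usage output of your
slon build:

/usr/local/pgsql/bin/slon -s 2000 -t 120000 -a /usr/local/pgsql/data \
    my_cluster "dbname=mydbslave user=postgres host=<node 3 ip>"

That would cut the no-activity SYNCs to roughly one every two minutes.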

>I'm using PostgreSQL 8.0.3 built from source.
>
>Here is the output from slon:
>  
>
Summarizing a bit...

The log indicates that the node collects up a whole ream of SYNCs from
nodes 1 and 2, and starts work on node #2's events...

I take it that node #3 is subscribing to node #2?  The fact that the
work is on remoteWorkerThread_2 points in that direction...

>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_1: queue event 1,594 SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_1: queue event 1,595 SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2: queue event 2,414 SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2: queue event 2,415 SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_2: Received event 2,414
>SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2: queue event 2,416 SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2: queue event 2,417 SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2: queue event 2,418 SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2: queue event 2,419 SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2: queue event 2,420 SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2: queue event 2,421 SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2: queue event 2,422 SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2: queue event 2,423 SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2: queue event 2,424 SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2: queue event 2,425 SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2: queue event 2,426 SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_2: SYNC 414 processing
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2: queue event 2,427 SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2: queue event 2,428 SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteListenThread_2: queue event 2,429 SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_2: no sets need syncing
>for this event
>2005-06-03 11:51:12 BST DEBUG1 remoteWorkerThread_1: connected to data
>provider 1 on 'dbname=mydb
>host=<node 1 ip> user=postgres'
>2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_2: Received event 2,415
>SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_2: SYNC 415 processing
>2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_2: no sets need syncing
>for this event
>2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_1: syncing set 1 with 15
>table(s) from provider 1
>2005-06-03 11:51:12 BST DEBUG3 remoteWorkerThread_1: activate helper 1
>2005-06-03 11:51:12 BST DEBUG4 remoteWorkerThread_1: waiting for log data
>2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_2: Received event 2,416
>SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_2: SYNC 416 processing
>2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_2: no sets need syncing
>for this event
>2005-06-03 11:51:12 BST DEBUG4 remoteHelperThread_1_1: allocate lines
>2005-06-03 11:51:12 BST DEBUG4 remoteHelperThread_1_1: have 10 line buffers
>2005-06-03 11:51:12 BST DEBUG2 remoteHelperThread_1_1: 0.005 seconds delay
>for first row
>2005-06-03 11:51:12 BST DEBUG3 remoteHelperThread_1_1: got 1 log rows
>2005-06-03 11:51:12 BST DEBUG3 remoteHelperThread_1_1: 1 log buffers
>delivered
>2005-06-03 11:51:12 BST DEBUG4 remoteHelperThread_1_1: no more log rows
>2005-06-03 11:51:12 BST DEBUG2 remoteHelperThread_1_1: 0.005 seconds until
>close cursor
>2005-06-03 11:51:12 BST DEBUG4 remoteHelperThread_1_1: change helper thread
>status
>2005-06-03 11:51:12 BST DEBUG4 remoteHelperThread_1_1: send DONE/ERROR line
>to worker
>2005-06-03 11:51:12 BST DEBUG3 remoteHelperThread_1_1: waiting for workgroup
>to finish
>2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_2: Received event 2,417
>SYNC
>2005-06-03 11:51:12 BST DEBUG2 remoteWorkerThread_2: SYNC 417 processing
>2005-06-03 11:51:12 BST DEBUG1 slon: done
>2005-06-03 11:51:12 BST DEBUG2 slon: exit(0)
>
>Nothing unusual in data/serverlog
>
>Does anybody have any ideas?
>  
>
You should have files in the spool directory for syncs 415 and 416;
evidently something "died" when working on #417.

Are 415/416 there?  Do they have any data in them?

If they're empty, that raises the question:  Is that right or wrong?  I'd
expect slon to get that right...
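
A quick way to check which archives actually carry data statements (this
assumes the archives land as .sql files directly under the -a directory;
adjust the glob to whatever filenames you actually see there):

cd /usr/local/pgsql/data
for f in slony1_log_*.sql; do
    echo "$f: $(grep -c -i -E '^(insert|update|delete)' "$f") data statements"
done

A genuinely empty SYNC should report 0; the archive for the SYNC that blew
up may be missing or truncated.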

You should be able to look at the table sl_event on node 2...

select * from _my_cluster.sl_event where ev_origin = 2 and ev_seqno
between 415 and 417;

Supposing 415 and 416 are 'empty' SYNCs, I'd expect ev_xip to be blank,
which indicates that there were no XIDs associated with those SYNCs.
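
To look at just the transaction columns (assuming the 1.1 sl_event layout
with ev_minxid/ev_maxxid/ev_xip), something along these lines:

select ev_seqno, ev_minxid, ev_maxxid, ev_xip
  from _my_cluster.sl_event
 where ev_origin = 2 and ev_seqno between 415 and 417;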

I expect that there's *something* associated with SYNC #417.

The above sequence of logging is consistent with the notion that during
SYNC #417, one of the queries being constructed (generally using
slon_mkquery() and/or slon_appendquery()) blew up due to some mismatch
between the values passed in and what was expected.  At that point, worker
thread #2 dies, likely with something akin to a segmentation fault, and
the failure is hidden because the main thread catches it and terminates
without giving terribly much notice.

What I'd expect to see, at this point, is that if you restart that slon,
you'll find that it similarly logs a whole bunch of SYNC events from
nodes 1 and 2, starts processing SYNC #417 again, and then falls down
much the same way.
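
When you restart it, it's worth turning up the log level and capturing
everything, so we can see exactly where it stops.  Something along these
lines (the -d debug level is per the slon docs at the URL above):

/usr/local/pgsql/bin/slon -d 4 -a /usr/local/pgsql/data my_cluster \
    "dbname=mydbslave user=postgres host=<node 3 ip>" > slon.node3.log 2>&1

If the last thing logged is again "SYNC 417 processing", that pretty much
confirms it is something in that particular event's data.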

Finding which call to slon_mkquery()/slon_appendquery() is at fault may be
a bit of a trial and error thing; I'd go through sync_event() and add some
logging around each of the calls to those functions, as sketched below, in
order to find which one is blowing up.  Essentially, you'd see which was
the LAST one that succeeded, which identifies the one that failed.
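
The pattern I have in mind is just bracketing each query construction with
a slon_log() call, numbering each call so the messages are distinguishable.
This is purely illustrative -- the actual queries, variables, and positions
in sync_event() will differ, so treat it as the shape of the change rather
than as a patch:

    /* in src/slon/remote_worker.c, inside sync_event() */
    slon_log(SLON_DEBUG2,
             "remoteWorkerThread_%d: about to build query 12\n",
             node->no_id);
    slon_mkquery(&query,
                 "select ... from %s.sl_log_1 where ...",  /* existing query text */
                 rtcfg_namespace);
    slon_log(SLON_DEBUG2,
             "remoteWorkerThread_%d: built query 12 OK\n",
             node->no_id);

Whichever "about to build" message is the last one you see before the slon
exits points at the call that is blowing up.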

You may want to pull the latest CVS edition; it is worth noting that
remote_worker.c is now at revision 1.85, whereas beta 5 ships 1.84.

There *ARE* changes significant to log shipping in v1.85; Neil Conway
did some cleanup of initialization of some of the log shipping
functions.  I'd particularly expect that to be relevant if you're on a
platform other than Linux, such as FreeBSD, as Linux has often been a
bit easier-going about variable initialization than other platforms.
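
You can check which revision you are actually running from the CVS keyword
at the top of the file (assuming your tarball or checkout kept the $Id$
expansion intact):

grep '\$Id:' src/slon/remote_worker.c

If that reports 1.84, grabbing the current CVS tree and rebuilding the slon
would be the first thing I'd try before digging into the debugging above.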

