[Slony1-general] Slony-I in strange state after attempted unsubscribe/resubscribe

Tue Jan 3 08:17:00 PST 2012

On 12-01-03 05:49 AM, Martijn van Oosterhout wrote:
> Hoi,

What version of Slony are you using?
What version of Postgresql are you using?

>
> We have a Slony-I setup in a slightly weird situation. What happened
> was that the server did a huge delete (millions of rows) in a single
> transaction which caused the replication to start to run behind. For
> some reason in this state it takes forever to apply the change because
> the query to find out what it needs to apply does a sort or something
> because it doesn't want to apply the whole set at once. A single SYNC
> takes 10 minutes.
>

I suspect you hit bug #167 (fixed in 2.1.0) that makes sync operations 
slow when sl_log_x gets big.

> In any case, the way we fixed it before was to unsubscribe and
> resubscribe the set, because resyncing the whole database is quicker
> than waiting for the deletes to complete. However, this time it broke
> in a new way. The result is that slony thinks it is properly
> subscribed, but the database data has not been resynced, so you get
> some bastard combination of old and new data. Logs below.
>
> Two questions:
> 1. Is there a way we could have detected the unsubscribe failed
> (slonik gave no error, but we didn't ask). If so, we can add that to
> the procedure as something to check

The proper procedure should be

unsubscribe(id=1, receiver=2);
wait for event(origin=2, confirmed=all, wait on=2);
subscribe set(set id=1, provider=1,receiver=2,forward=yes);

In 2.1 and above slonik should automatically perform the 'wait for event'.

The slonik command 'unsubscribe set' gets submitted to the node being 
unsubscribed (node 2).   If that command failed then slonik should have 
caught the error.

> 2. Seems like a bug to me, but pilot error is not impossible.
>
> Thanks in advance,
>
> 2012-01-01 14:24:27 CETINFO   cleanupThread:    0.002 seconds for cleanupEvent()
> 2012-01-01 14:24:27 CETINFO   cleanupThread:    0.006 seconds for vacuuming
> 2012-01-01 14:30:55 CETINFO   remoteWorkerThread_1: SYNC 5000990612
> done in 733.067 seconds
> 2012-01-01 14:30:55 CETINFO   remoteWorkerThread_1: syncing set 1 with
> 50 table(s) from provider 1
> NOTICE:  Slony-I: log switch to sl_log_2 complete - truncate sl_log_1
> CONTEXT:  PL/pgSQL function "cleanupevent" line 99 at assignment
> 2012-01-01 14:35:22 CETINFO   cleanupThread:    0.066 seconds for cleanupEvent()
>
> --- unsubscribe happens here ---
> 2012-01-01 14:37:58 CETCONFIG unsubscribeSet: sub_set=1
> 2012-01-01 14:37:58 CETCONFIG storeListen: li_origin=1 li_receiver=2
> li_provider=1
> --- this is the weird error ---
> 2012-01-01 14:44:09 CETERROR  remoteWorkerThread_1: "update
> "_pp_config_rep".sl_setsync set     ssy_seqno = '5000990613',
> ssy_snapshot =  '20587190:20590885:20587190',     ssy_action_list = ''
> where ssy_setid in (1) and ssy_seqno<  '5000990613'; " ERROR:  could
> not serialize access due to concurrent update
> 2012-01-01 14:44:09 CETERROR  remoteWorkerThread_1: SYNC aborted

If the receivers remoteWorkerThread_1 is aborted because of the 
unsubscribe then I don't see this as an issue.

> 2012-01-01 14:44:21 CETCONFIG remoteWorkerThread_1: update provider
> configuration
> 2012-01-01 14:44:21 CETCONFIG remoteWorkerThread_1: helper thread for
> provider 1 terminated
> 2012-01-01 14:44:21 CETCONFIG remoteWorkerThread_1: disconnecting from
> data provider 1
> 2012-01-01 14:44:21 CETCONFIG storeSubscribe: sub_set=1 sub_provider=1
> sub_forward='t'
> 2012-01-01 14:44:21 CETCONFIG storeListen: li_origin=1 li_receiver=2
> li_provider=1
> 2012-01-01 14:44:21 CETCONFIG remoteWorkerThread_1: update provider
> configuration
> --- and here it appears slony thinks it is properly subscribed, but
> the database has not been resynced ---

What do you base this statement on?   I do not see anything in the logs 
you pasted showing me that slon2 has received the SUBSCRIBE_SET or 
ENABLE_SUBSCRIPTION event.

What is sl_subscribe on node 2 at this point in time?
Does node 2 have a record in sl_event of it having processed those two 
events?

> NOTICE:  Slony-I: Logswitch to sl_log_1 initiated
> CONTEXT:  SQL statement "SELECT "_pp_config_rep".logswitch_start()"
> PL/pgSQL function "cleanupevent" line 101 at PERFORM
> 2012-01-01 14:46:29 CETINFO   cleanupThread:    0.015 seconds for cleanupEvent()
>
>