Tue Oct 12 19:04:05 PDT 2004
- Previous message: [Slony1-general] Slony stops replicating during nightly periodic + small patch
- Next message: [Slony1-general] .cleanuplistener() does not exist
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On 10/12/2004 12:57 PM, Jacques Caron wrote:
> Hi,
>
> First of all, many thanks for the great work on slony!
>
> I use slony 1.0.2 to replicate two PostgreSQL 7.4.3 databases running on
> FreeBSD 5.2.1-p9, and see that slony stops replicating every night (with a
> couple minor exceptions) during the periodic process that does the backups,
> vacuuming, etc. I use the standard 502.pgsql script that comes with the
> postgresql port on FreeBSD (not quite sure whether it's part of the port or
> the original source tree of PostgreSQL), which basically does a pg_dump and
> a vacuum analyze.
>
> Every night, I get this on stdout from slon:
> ERROR remoteListenThread_1: timeout for event selection
> And this on stderr:
> sched_mainloop: select(): Bad file descriptor
This problem will be fixed in 1.0.3. The function that stores the events
tries to take too strong a lock on the sl_event table, so it has to wait
for the pg_dump to finish, and everyone else using that table ends up
waiting for that one ... and so the whole thing goes kablooy.
Jan
>
> Setting debug level to 4 does not give much more information, just says
> after the timeout that the remoteListenThread is done.
>
> Trying to figure out the whole scheduling mechanism, I found this little
> issue: in scheduler.c, a temporary copy of the fdsets for select is made
> first, and then some checks are done to remove some FDs which may not be
> needed any more from the global fdsets. I believe this must be an
> oversight, and is the reason for the select error, which in turn sets
> sched_status to an error value, and causes sched_msleep to return with an
> error value and the remote listener thread to stop.
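The select(2) failure described above can be reproduced in isolation. The following is a minimal sketch, not Slony code: global_read, global_numfd, copy_fdset, poll_once and the two demo functions are hypothetical stand-ins for the scheduler's global sets and main loop, and it assumes the descriptor is closed once it has been removed from the global set. It only shows that a private copy taken before the removal still names the dead descriptor, while a copy taken afterwards does not.

```c
#include <errno.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical stand-ins for the scheduler's global descriptor state. */
static fd_set global_read;
static int    global_numfd;

/* Build a private copy of the global read set for select(2). */
static void copy_fdset(fd_set *dst)
{
    int i;

    FD_ZERO(dst);
    for (i = 0; i < global_numfd; i++)
        if (FD_ISSET(i, &global_read))
            FD_SET(i, dst);
}

/* Poll once with a zero timeout; return 0 on success, errno on failure. */
static int poll_once(fd_set *rfds)
{
    struct timeval tv = {0, 0};

    if (select(global_numfd, rfds, NULL, NULL, &tv) < 0)
        return errno;
    return 0;
}

/* Register the read end of a fresh pipe in the global set. */
static int setup(int p[2])
{
    if (pipe(p) != 0)
        return -1;
    FD_ZERO(&global_read);
    FD_SET(p[0], &global_read);
    global_numfd = p[0] + 1;
    return 0;
}

/* Old order: copy first, then remove and close the descriptor. */
int demo_stale_copy(void)
{
    int    p[2], err;
    fd_set rfds;

    if (setup(p) != 0)
        return -1;
    copy_fdset(&rfds);              /* copy still contains p[0] */
    FD_CLR(p[0], &global_read);
    close(p[0]);
    err = poll_once(&rfds);         /* select() sees a closed descriptor */
    close(p[1]);
    return err;
}

/* Patched order: remove and close first, then copy just before select. */
int demo_fresh_copy(void)
{
    int    p[2], err;
    fd_set rfds;

    if (setup(p) != 0)
        return -1;
    FD_CLR(p[0], &global_read);
    close(p[0]);
    copy_fdset(&rfds);              /* copy no longer contains p[0] */
    err = poll_once(&rfds);
    close(p[1]);
    return err;
}
```

Called from a small main(), demo_stale_copy() should return EBADF ("Bad file descriptor") and demo_fresh_copy() should return 0, which matches the effect of moving the copy down to just before the select.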
>
> I moved the copy further down (just before the select) and last night slony
> did not stop replicating even though it logged several of the "timeout for
> event selection" errors. Probably should wait a couple more periodic runs
> to claim victory, but I believe the patch should at the very least not
> cause any problems and solve a few, so here it is (including a couple of
> typo fixes):
>
> %diff -u scheduler.c.orig scheduler.c
> --- scheduler.c.orig Mon Oct 11 17:00:30 2004
> +++ scheduler.c Tue Oct 12 18:54:09 2004
> @@ -452,21 +452,8 @@
> struct timeval timeout;
>
> /*
> - * Make copies of the file descriptor sets for select(2)
> - */
> - FD_ZERO(&rfds);
> - FD_ZERO(&wfds);
> - for (i = 0; i < sched_numfd; i++)
> - {
> - if (FD_ISSET(i, &sched_fdset_read))
> - FD_SET(i, &rfds);
> - if (FD_ISSET(i, &sched_fdset_write))
> - FD_SET(i, &wfds);
> - }
> -
> - /*
> * Check if any of the connections in the wait queue
> - * have reached there timeout. While doing so, we also
> + * have reached their timeout. While doing so, we also
> * remember the closest timeout in the future.
> */
> tv = NULL;
> @@ -560,6 +547,19 @@
> }
>
> /*
> + * Make copies of the file descriptor sets for select(2)
> + */
> + FD_ZERO(&rfds);
> + FD_ZERO(&wfds);
> + for (i = 0; i < sched_numfd; i++)
> + {
> + if (FD_ISSET(i, &sched_fdset_read))
> + FD_SET(i, &rfds);
> + if (FD_ISSET(i, &sched_fdset_write))
> + FD_SET(i, &wfds);
> + }
> +
> + /*
> * Do the select(2) while unlocking the master lock.
> */
> pthread_mutex_unlock(&sched_master_lock);
> @@ -776,7 +776,7 @@
>
>
> /* ----------
> - * sched_add_fdset
> + * sched_remove_fdset
> *
> * Remove a file descriptor from one of the global scheduler sets and
> * adjust sched_numfd accordingly.
>
> Hope that helps,
>
> Jacques.
>
>
> _______________________________________________
> Slony1-general mailing list
> Slony1-general at gborg.postgresql.org
> http://gborg.postgresql.org/mailman/listinfo/slony1-general
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck at Yahoo.com #