Greg Ward wrote:
This whole conversation is getting way too complicated for my tastes.
Can't we find a simpler solution? The whole client/server model sounds
really nasty --
At this time I'm still convinced that it's the most reliable and
portable way to solve the problem at hand.
who starts the server?
Who starts simulations on remote systems as it is now?
As long as your processes all run on the same machine, no ambient
server is necessary, and e.g. rpiece will continue to work just
fine. After all, we're not going to take the current file sharing
functionality away. We'll just experiment with additional options
that are less vulnerable to OS bugs and other platform issues.
Once you run jobs on more than one machine, you need to start
them manually (or through scripts/other tools) anyway, even if
you use rpiece. So on that front, nothing will really change.
Come to think of it, the server might also be useful to coordinate
the rpiece processes, removing yet another NFS lock dependency.
What happens if the server dies or gets overwhelmed?
Probably the same thing that happens now when the NFS server reboots
or the network clogs. The individual simulation processes will
stall, until the server is available again.
How portable will it be between architectures?
All these things make me nervous.
Sockets are fully portable across all platforms that Radiance
currently supports, and then some.
What if we try to stay closer to the current model, just modifying it
so it doesn't depend on an NFS lock manager. Here's what I suggest:
1) Instead of calling fcntl with F_SETLKW, each ambient process
periodically checks for the existence of a lock file on the NFS
filesystem (named after the ambient file perhaps with an added suffix
".lok").
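For reference, the blocking call that this proposal would replace looks roughly like the sketch below. It is a minimal illustration of fcntl with F_SETLKW on a whole file, not the actual Radiance code; the function names lock_whole_file and unlock_whole_file are made up for this example.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: block until we hold an exclusive (write) lock
 * on the whole file. Returns 0 on success, -1 on error (errno is set
 * by fcntl). */
int lock_whole_file(int fd)
{
    struct flock fl = {0};

    fl.l_type = F_WRLCK;     /* exclusive lock */
    fl.l_whence = SEEK_SET;  /* offsets are relative to start of file */
    fl.l_start = 0;
    fl.l_len = 0;            /* 0 means "through end of file" */
    return fcntl(fd, F_SETLKW, &fl);  /* W = wait until available */
}

/* Hypothetical helper: release the lock taken above. */
int unlock_whole_file(int fd)
{
    struct flock fl = {0};

    fl.l_type = F_UNLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;
    return fcntl(fd, F_SETLK, &fl);
}
```

On an NFS mount, this call is the one that relies on a working lock manager on both client and server.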
And you think sockets are nasty?
One of the few things that I know about lockfiles is that they're
not exactly simple to get right.
- What happens if the lockfile owner gets killed?
Will the others be able to figure out that the lock file is
stale and override it? Or will that require human intervention?
I think that NFS locks get purged when the owner dies, so we
might lose quite a bit of convenience here.
- What happens if process b checks for the file after process a
does, but before process a actually creates the new file?
There's no way to make checking and creating an atomic operation,
so this situation must be handled explicitly and gracefully.
In a big simulation, it may well be that a dozen processes are
competing for the lock file several times a second for hours or
even days. Race conditions *will* happen.
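To make the race concrete, here is a minimal sketch of a lock file attempt. On a local filesystem, open() with O_CREAT | O_EXCL does the check and the creation in a single call. Whether that guarantee holds across NFS depends on the NFS version and the client implementation, so it may not be available on every platform in question. The names try_lockfile and release_lockfile are hypothetical, and the sketch does nothing about stale lock files left behind by a killed owner.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: try to take the lock by creating the lock file.
 * Returns 1 if we created it (lock acquired), 0 if it already exists
 * (someone else holds the lock), -1 on any other error. */
int try_lockfile(const char *path)
{
    /* O_EXCL makes open() fail with EEXIST if the file is already
     * there, instead of a separate "check, then create" sequence. */
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0644);

    if (fd < 0)
        return (errno == EEXIST) ? 0 : -1;
    close(fd);
    return 1;
}

/* Hypothetical helper: give up the lock by removing the file. */
int release_lockfile(const char *path)
{
    return unlink(path);
}
```

A caller that gets 0 back would have to sleep and retry, which is the periodic polling described in the proposal.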
I'm not saying that it can't be done. But in the best case, I'd
expect a foolproof solution to be similarly involved and complex,
but less portable than using a server process.
Since I'm familiar with client/server concepts, I'm willing to
implement one. If anyone volunteers to design a reliable file
based solution, I'm certainly not objecting to having that
available as an alternative option. In fact, I'd be happy to see
as many synchronization methods as possible implemented, so we
can test them all. And after we know what works, we can leave the
best two or three in the core distribution, for the user to
select the one that is most appropriate in their environment.
P.S. I took some flak for my apparent lack of knowledge regarding C
pointer arithmetic after the last post. So I screwed up! At least I
eventually figured out that I screwed up -- don't I get some credit for
that? I didn't think the error was so obvious, myself.
It was certainly one of a rare breed. Not many people dare to
juggle with pointers as you do... 
I didn't go into all
the various modules and programs and make sure that the local functions
all had prototypes as well.
That's probably the way I would have started as well. I didn't
really expect you to declare victory so soon. Good to know that
you're not superhuman either... 
If someone wants to go in and add all the correct
parameter lists and casts everywhere, they're welcome to do so once we
get CVS up and running.
Compiling on Windows (at least with VC) is a pain without
prototypes, as the real errors get drowned in all the warnings.
So I may end up converting them along the way when trying to
establish cross-platform compatibility.
I remember now, I did find one bug during ANSI-fication
One down, n to go... 
If I understand you correctly, then you only converted a few
percent of the code, so that finding more than one or two would
have been quite a surprise.
-schorsch
--
Georg Mischler -- simulations developer -- schorsch at schorsch com
+schorsch.com+ -- lighting design tools -- http://www.schorsch.com/