Hi all interested,
Cross posting to dev as this is probably a more appropriate space for this conversation.
OK, let's see who I can irritate the most...
As a refresher, there have been numerous threads on this topic on Radiance Dev (in no order other than my searching through my mail):
* Before we give up on lock files...
* multiprocessor systems, Radiance and you
* as well as others, if you want to delve into the depths of the
pre-radiance-online mailing list archives
In general, I recall that there are a couple of directions to go:
* network filesystem locking - such as NFS or Samba, where we are
dependent on either the locking mechanism actually working (e.g.
NFS) or the filesystem (Samba) being installed
* client/server - probably more hairy from an implementation
standpoint as well as from a porting point of view. Although,
perhaps guaranteeing the best performance for selected OSes?
Not to rehash old stuff, but could one of the more knowledgeable developers (Greg, Georg, Peter, Carsten...?) give us a refresher on what the options are, and perhaps some idea of the time that would be needed to implement a working solution? Locking is a recurring problem. It would be nice to reach a consensus on the solution (i.e. what direction to pursue) and then on a strategy for implementation (i.e. resources, person(s), money...), so that we as a community could figure out how to move this forward (if, as always, there is enough interest).
I must admit to having run into this wall on a variety of occasions. NFS (v3) on Linux is "supposed" to lock correctly (sync mode on the mount/fstab). There is a test suite from Sun (www.connectathon.org) that is supposed to test the NFS server; I remember running it in the past and getting positive results on Linux. Nevertheless, I have found it extremely difficult to get working results with a networked image render (e.g. rpiece distributed over multiple CPU nodes). There end up being problems either with ambient values between image cells, or with locking of the syncfile that distributes image cells to the different machines, or both. I even implemented a client/server in Perl at one point to try to fight this problem with the syncfile (with partial success as I recall, and perhaps more if my time would allow). Not to cause offense... but is it possible that the locking code in Radiance itself needs to be checked?
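To make the "does locking actually work on this mount?" question testable, here is a minimal sketch of the kind of check I have in mind. It is not Radiance's locking code and not the Connectathon suite; the names `bump_counter` and `run_lock_test` are made up for illustration. Several processes increment a counter stored in one file, each taking a POSIX advisory lock (via Python's `fcntl.lockf`) around the read-modify-write. If locking works, no increments are lost.

```python
import fcntl
import multiprocessing
import os
import tempfile


def bump_counter(path, times):
    """Increment the integer stored in `path` `times` times, under a lock."""
    for _ in range(times):
        fd = os.open(path, os.O_RDWR)
        try:
            # Exclusive POSIX advisory lock on the whole file; blocks until granted.
            fcntl.lockf(fd, fcntl.LOCK_EX)
            value = int(os.pread(fd, 32, 0).decode() or "0")
            os.pwrite(fd, str(value + 1).encode(), 0)
            os.fsync(fd)
            fcntl.lockf(fd, fcntl.LOCK_UN)
        finally:
            os.close(fd)  # closing the descriptor also drops any lock still held


def run_lock_test(directory, workers=4, times=50):
    """Return (expected, observed) counter values for a file in `directory`."""
    fd, path = tempfile.mkstemp(dir=directory)
    os.write(fd, b"0")
    os.close(fd)
    ctx = multiprocessing.get_context("fork")  # Unix-only, like fcntl itself
    procs = [ctx.Process(target=bump_counter, args=(path, times))
             for _ in range(workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    with open(path) as f:
        observed = int(f.read())
    os.remove(path)
    return workers * times, observed
```

Pointing `run_lock_test` at a directory on the shared mount only exercises locking through one client. To test across machines, `bump_counter` would have to be started on each node against the same file, and the final value compared with the total number of increments.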
In brief follow-up to Lars's comments about openmosix/mosix: as I understand it, the mfs filesystem is supposed to implement locking correctly. There are also other, more sophisticated network filesystems such as GFS (Sistina, I think, and commercial), OpenGFS and many others. However, these all require a separate special install and perhaps modification of the kernel or installation of a modified kernel, and there is serious question as to whether they are portable to other OSes such as MS version whatever (as the main offender of portability).
Note also that I tried openmosix at one point. One problem that I found is that if you start multiple large (i.e. memory-hungry) jobs on the master node, this can lead to excessive paging, since the master node tries to start all the jobs at the same time in its own memory space before migrating them off to other nodes in the cluster. So if your job requires 1 Gig of memory to hold the scene and you want to run 10 jobs on 5 dual processor nodes, each node having 2 Gig of memory, then starting all the jobs on one node means you are hosed. If you start them on individual nodes, then you should be using a different clustering solution, since this completely negates the value of the migration algorithms in openmosix. Now it has been a while since I used OpenMosix, so perhaps things are different...
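The arithmetic behind that example, as a small sketch (the helper names are made up, and this ignores the OS and everything else that also needs memory):

```python
def peak_memory_gb(jobs, scene_gb):
    """Memory needed if `jobs` jobs each hold their own copy of the scene."""
    return jobs * scene_gb


def oversubscription(jobs, scene_gb, node_ram_gb):
    """Ratio of memory demanded to physical RAM on the node running the jobs."""
    return peak_memory_gb(jobs, scene_gb) / node_ram_gb


# All 10 jobs started on the master node (2 Gig RAM, 1 Gig per job):
all_on_master = oversubscription(10, 1.0, 2.0)   # 10 Gig wanted on a 2 Gig node

# Jobs started by hand, 2 per dual processor node:
spread_by_hand = oversubscription(2, 1.0, 2.0)   # 2 Gig wanted on a 2 Gig node
```

So the master node would be asked for five times its physical memory before any migration happens, while two jobs per node just fits on paper.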
Note also that named pipes do not work on OpenMosix (at least they did not back in mid 2003; you can see my brief inquiry to the openmosix list and Moshe Bar's even briefer reply back in April of 2003). So if you want to do memory sharing on multiprocessor nodes, you have to roll your own batch job distributor.
-Jack de Valpine
Georg Mischler wrote:
Lars O. Grobe wrote:
The most straightforward solution to our problem would probably
be to use lock files, as Greg suggested in earlier discussions.
Unfortunately nobody has found the time yet to actually implement
that. If anyone wants to volunteer, please move the discussion of
your proposal to the dev-list.
as I won't be able to help on the implementation, I won't bring this to
the dev-list for now. However, I guess the only needed feature of
the shared fs used is working byte range locking, right? So I will
find out if the fs provided by openmosix (mfs) has this feature, which
would make a set of mosix nodes a great radiance installation.
Ambient files are only written at the end, so file locking
and byte range locking have the same effect.
We also need a solution that works on all platforms and on
all file systems. Requiring third party software just to get
reliable file sharing is clearly out of the question.
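For reference, the lock-file idea from the quoted message can be sketched roughly as follows. This is only an illustration in Python, not a proposal for the actual Radiance code; the name `lock_file` and the `.lock` suffix are assumptions. It relies on exclusive file creation (`O_CREAT | O_EXCL`) being atomic, which holds on local filesystems, but on NFS depends on the protocol version and the client implementation, so it would need the same kind of testing as any other scheme.

```python
import contextlib
import os
import time


@contextlib.contextmanager
def lock_file(path, timeout=10.0, poll=0.05):
    """Hold an exclusive lock represented by the existence of `path`."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Fails with FileExistsError if another process holds the lock.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError("could not acquire " + path)
            time.sleep(poll)
    try:
        os.write(fd, str(os.getpid()).encode())  # record the owner for debugging
        os.close(fd)
        yield
    finally:
        os.remove(path)  # release the lock


# Hypothetical usage around an append to a shared ambient file:
#     with lock_file("scene.amb.lock"):
#         with open("scene.amb", "ab") as amb:
#             amb.write(new_values)
```

One known weakness of this approach is that a process that dies while holding the lock leaves the lock file behind, so a real implementation would need some policy for detecting and clearing stale locks.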