NFS and rpiece

Hello everybody,

I'm really sorry to raise this topic again, but I've been doing
some tests recently with a cluster, and got into this strange problem.

Not using OpenMosix this time, just Red Hat (after nearly 7 years of
Debian), NFS mounts and pdsh distributing rpiece through ssh.
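
Just to give an idea, the launch looks more or less like this (the node
names and the wrapper script are simplified placeholders, the real one
sets up a few environment variables first):

   # one rpiece process per node, all working on the same shared directory
   pdsh -R ssh -w node[01-08] "cd /mnt/scenes && ./run_rpiece.sh"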

I launch the big script, wait, get the image done, it looks good,
and the ambient file is good too, but ...

... but some of the picture tiles, randomly, get rendered twice,
three times, four times, so I lose quite a bit of processing power.
To explain better, the synchronisation file, after the picture
is completed, looks something like this:

   1 12
   0 0

   0 11
   0 10
   0 9
   0 9
   0 8
   0 8
   0 8
   0 8
   0 8
   0 8
   0 8
   0 8
   0 0
   0 7
   0 1
   0 6
   0 4
   0 5
   0 2
   0 3

So tile 0.9 has been done twice, and tile 0.8 has even been done 8
times!!!
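
In case it is useful, this is how I counted the repeated tiles
(assuming the synchronisation file is simply called syncfile here):

   # count how many times each tile index appears in the sync file
   grep -v '^ *$' syncfile | sort | uniq -c | sort -rn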

I haven't found an explanation yet. Has anybody come across a similar
issue?

I've also tried directly with ssh, but the problem is still the same.

Thanks in advance for any input on this.

Have a nice week-end.


--
Francesco

Hi Anselmo,

Can you tell us how you have NFS configured, that is, what options are on the server and what options are on the clients?

-Jack


--
# Jack de Valpine
# president
#
# visarc incorporated
# http://www.visarc.com
#
# channeling technology for superior design and construction

Sorry, I meant to say Hi Francesco!!!


--
# Jack de Valpine
# president
#
# visarc incorporated
# http://www.visarc.com
#
# channeling technology for superior design and construction

Hi Jack,

> Sorry, I meant to say Hi Francesco!!!

No problem about the name swap, I'm quite used to it :)
I've been called Anselmo for at least 13 years at school (!)

> Can you tell us how you have NFS configured, that is, what options are
> on the server and what options are on the clients?

Details of the Red Hat 9 NFS system are here:
http://www.redhat.com/docs/manuals/linux/RHL-9-Manual/ref-guide/ch-nfs.html

NFSv3 is the default, so I guess that is the one being used (I'm not the
administrator of the cluster), but it is also possible to use NFSv2.

I need to check what's in the /etc/exports file, now that I think about
it. Changing sync/async and wdelay/no_wdelay could have some effect.
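
If I remember the syntax correctly, the line to look for should be
something like this (the path and host pattern are just placeholders
until I can see the real file):

   # /etc/exports on the server: sync vs async, wdelay vs no_wdelay
   /export/scenes   node*(rw,sync,no_wdelay)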

From the client side, initially the options were:

   rw,soft,bg

I think now they are more like:

   rsize=8192,wsize=8192,auto,user,rw
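
In fstab terms I think that corresponds to something like this (server
name and mount point are from memory, so don't take them literally):

   fileserver:/export/scenes  /mnt/scenes  nfs  rw,rsize=8192,wsize=8192,auto,user  0 0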

Thanks,

Francesco

Hi Francesco,

> Details of the Red Hat 9 NFS system are here:
> http://www.redhat.com/docs/manuals/linux/RHL-9-Manual/ref-guide/ch-nfs.html
>
> NFSv3 is the default, so I guess that is the one being used (I'm not the
> administrator of the cluster), but it is also possible to use NFSv2.

I am quite sure that you must use NFSv3.

> I need to check what's in the /etc/exports file, now that I think about
> it. Changing sync/async and wdelay/no_wdelay could have some effect.

I have struggled with NFS-based file locking (on Linux) on and off for quite some time. Most of our work usually uses SMP nodes, so this is normally not an issue. But I have been looking back into this more recently for some cluster-based rendering. I would seriously suggest taking a look at:

    http://nfs.sourceforge.net/

which has a fairly extensive set of information on NFS.

For locking to work correctly I am pretty sure that you need to specify the (sync) option on the server and/or the client side. However, NFSv3 is supposed to do sync by default, as opposed to v2 which did async; the latter is faster but leads to a wide variety of issues. It is quite likely that you also need to specify the (noac) option, which means that the client does not cache file attribute information and instead always goes to the server to check the file status. Both of these options together make things slow... To improve performance I believe that you can do a couple of things: 1) increase the number of nfsd processes available to handle requests, and 2) specify larger rsize and wsize (assuming a fast network transport).
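
On Red Hat the thread count normally lives in /etc/sysconfig/nfs (worth
double-checking against your init script); roughly this is what I have
in mind:

   # /etc/sysconfig/nfs -- more nfsd threads to serve the render nodes
   RPCNFSDCOUNT=16

The sync option goes in /etc/exports on the server, and noac in the
client mount options (see the fstab sketch a little further down).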

> From the client side, initially the options were rw,soft,bg.
> I think now they are more like rsize=8192,wsize=8192,auto,user,rw.

I read somewhere that (hard,intr) is preferable, but I will have to look this up.
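
Putting those together, the client entry in /etc/fstab might end up
looking something like this (server name and paths made up, and
untested on my side):

   server:/export/scenes  /mnt/scenes  nfs  rw,hard,intr,noac,rsize=8192,wsize=8192  0 0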

Please note that I have been trying to read up on this as well as other shared filesystems, but I have not had a chance to test a lot of this out. I would be interested to hear what you learn.

-Jack

Hi Jack,

Thanks for your extremely useful input on this topic.

> I have struggled with NFS-based file locking (on Linux) on and off for
> quite some time. Most of our work usually uses SMP nodes, so this is
> normally not an issue. But I have been looking back into this more
> recently for some cluster-based rendering. I would seriously suggest
> taking a look at:
>         http://nfs.sourceforge.net/
> which has a fairly extensive set of information on NFS.

Thanks for pointing me to this URL, I should have checked it before.
It really is the single most useful resource about NFS on Linux :)
so much better than the man pages.

> For locking to work correctly I am pretty sure that you need to
> specify the (sync) option on the server and/or the client side.
> However, NFSv3 is supposed to do sync by default, as opposed to v2
> which did async; the latter is faster but leads to a wide variety of
> issues. It is quite likely that you also need to specify the (noac)
> option, which means that the client does not cache file attribute
> information and instead always goes to the server to check the file
> status. Both of these options together make things slow... To improve
> performance I believe that you can do a couple of things: 1) increase
> the number of nfsd processes available to handle requests, and 2)
> specify larger rsize and wsize (assuming a fast network transport).

I will ask to change the configuration next Monday, and hopefully
things should improve.

> Please note that I have been trying to read up on this as well as
> other shared filesystems, but I have not had a chance to test a lot of
> this out. I would be interested to hear what you learn.

Yes, I'll keep you updated on my trial and error.

I have been using OpenMosix with oMFS so far, and it has been working
properly, especially since DFSA was introduced (for more info about
oMFS and DFSA there is a short section in
http://howto.x-tend.be/openMosixWiki/index.php/FAQ).

I understand that the OpenMosix development is moving towards using
GFS (Global File System) now:
http://www.redhat.com/whitepapers/rha/gfs/GFS_INS0032US.pdf

Thanks again, and have a nice week-end.

Francesco

Hi Francesco,

Francesco Anselmo wrote:

> Hi Jack,
>
> Thanks for your extremely useful input on this topic.

Well, let's reserve judgment and see how things work out... ;->

> Thanks for pointing me to this URL, I should have checked it before.
> It really is the single most useful resource about NFS on Linux :)
> so much better than the man pages.

Yes, but I am not sure that it makes it that much easier to figure out how to do what we need to do.


> I have been using OpenMosix with oMFS so far, and it has been working
> properly, especially since DFSA was introduced (for more info about
> oMFS and DFSA there is a short section in
> http://howto.x-tend.be/openMosixWiki/index.php/FAQ).
>
> I understand that the OpenMosix development is moving towards using
> GFS (Global File System) now:
> http://www.redhat.com/whitepapers/rha/gfs/GFS_INS0032US.pdf

I have experimented with OpenMosix (3 or 4 years ago, it seems). I have had some reasonable success with another cluster system called OpenSSI (www.openssi.org), which has a cluster-wide filesystem called CFS. I am currently looking at something called Warewulf (www.warewulf.org) and thinking about running diskless clients. I have had some discussions with the people on the list there, and some of the things that have been suggested are Lustre (www.lustre.org) and PVFS2 (www.pvfs.org). These are cluster-oriented distributed filesystems, though I do not know how they manage file locking and cache coherence across all nodes. People on this group have also suggested that NFS might be OK on a small cluster, assuming a well-configured server.

When you are using OpenMosix and Radiance, are you letting OpenMosix migrate jobs automatically, or are you specifying which node you want a given job to run on? Do you find it distributes rpiece jobs correctly?


--
# Jack de Valpine
# president
#
# visarc incorporated
# http://www.visarc.com
#
# channeling technology for superior design and construction