Optimizing Radiance for cluster rendering

Hi All,

Nobody has responded to the mail below yet. In the meantime I tried to build Radiance with Intel's icc compiler. Binaries compiled with icc (many warnings) work for a simple scene, but fail with a segmentation fault when I render complex scenes. So I went back to the gcc-compiled binaries, but I still wonder if somebody can give me hints on which compiler to use, which flags, etc.

The cluster support engineers suggested that I make local copies of my files on the scratch disks of the nodes where I start Radiance processes, because otherwise networked I/O would slow down the process, and the cluster in general. Concerning the output of the rpict/ranimate process, I understand what to do. But concerning the input files, I always thought that Radiance loads each input file (geometry, image patterns, etc.) into memory only once. If that is true, I think it makes little difference whether I load the scene data from my home directory over the network, or first copy the input files (about 3 GB) to the scratch disk and load them into rpict/ranimate from there: the input files have to be copied over the network anyway. Or am I wrong here?
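For what it's worth, if you want to time the staging variant against rendering straight from NFS, the usual pattern is a small job script along these lines (every path and filename below is made up for illustration):

    #!/bin/sh
    # stage the inputs to node-local scratch once, render there,
    # and copy only the result back to the shared filesystem
    SCRATCH=/scratch/$USER/$$          # hypothetical scratch location
    mkdir -p "$SCRATCH"
    cp ~/scene/scene.oct ~/scene/view.vf "$SCRATCH"/
    cd "$SCRATCH"
    rpict -vf view.vf scene.oct > frame.hdr
    cp frame.hdr ~/results/
    rm -rf "$SCRATCH"

Timing both variants on a single node should settle the question quickly.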

Concerning the binaries, I have a similar question: would it be better to make a local install of the binaries on each node?

Any hints are most appreciated

Iebele

One example of the many icc remarks, for reference:

oconv.c(322): (col. 5) remark: PARTIAL LOOP WAS VECTORIZED.


On 7 Apr 2012, at 01:20, Iebele wrote:

Hi group,

I'm setting up Radiance on a computer cluster with lots of nodes. Tweaking gcc flags is not my strong suit (understatement), so I bring it here.
Playing a bit with options I found on Marc's benchmark page, I got per-core rendering times over twice as long as on my 2.2 GHz MacBook. That doesn't make me happy :-)
Below I've pasted the cpuinfo from a node of the cluster. Does anyone in the group have an idea which flags I should give gcc to optimize Radiance?
The flags I've used so far are: -march=native -m64 -msse -msse2 -funroll-loops -ffast-math -O3 -Dlinux -D_FILE_OFFSET_BITS=64

Cheers,

iebele

cat /proc/cpuinfo:

processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU L5420 @ 2.50GHz
stepping : 10
cpu MHz : 2493.445
cache size : 6144 KB
physical id : 0
siblings : 4
core id : 2
cpu cores : 4
apicid : 2
initial apicid : 2
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 xsave lahf_lm tpr_shadow vnmi flexpriority
bogomips : 4987.90
clflush size : 64
cache_alignment : 64
address sizes : 38 bits physical, 48 bits virtual
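
For reference, the sse4_1 entry in the flags line above means this is a Penryn-class Xeon, so if -march=native misbehaves it may be worth timing an explicit set against it, along these lines (illustrative only, not a tested recommendation):

    # explicit alternative to -march=native for the L5420:
    # Core 2 scheduling plus SSE4.1, floating point through SSE
    OPT="-O3 -m64 -march=core2 -msse4.1 -mfpmath=sse \
         -funroll-loops -ffast-math -Dlinux -D_FILE_OFFSET_BITS=64"

One caveat worth checking: -march=native only picks the right instruction set if you compile on the compute node itself; binaries built on a login node of a different CPU generation may be mistuned.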

Hi,

I have never compiled with icc, but I do have some comments on the cluster topic.

First, I would make sure that you use frozen octrees; compiling the same scene from scratch on each node does not make sense.
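
For anyone following along: oconv's -f option is what produces such a frozen octree, i.e. one that contains all the scene data rather than references to the source files. A minimal sketch, with placeholder filenames:

    # freeze everything into one self-contained octree
    oconv -f materials.rad scene.rad > scene.oct
    # the nodes then only need scene.oct, not the .rad sources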

As you say, copying the scene to the compute nodes before running the processes does not make much difference, though that may depend on the cluster architecture you are building on. I had an openMosix cluster where I could start all processes on one node, and they were then automatically "migrated away" to other available nodes (no shared memory; each process had a full set of scene data in that case). As rpict/rtrace processes typically load the scene data once before spawning sub-processes (by copying the process data), there is not really more I/O involved than in copying the data separately before starting the processes. If you start processes over ssh, sharing data via NFS mounts, it may help to add some delay so that not all nodes try to fetch the data at the same moment.
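
A minimal sketch of that staggering, assuming plain ssh and an NFS-mounted scene (hostnames, paths, and the delay are placeholders):

    #!/bin/sh
    # one rpict per node, with a pause between launches so the NFS
    # server is not asked for the whole scene by every node at once
    i=0
    for node in node01 node02 node03 node04; do
        ssh "$node" "rpict -vf /shared/frame$i.vf /shared/scene.oct \
            > /shared/out/frame$i.hdr" &
        sleep 30                       # arbitrary stagger
        i=$((i+1))
    done
    wait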

My guess is that the critical point in sharing data is access to the ambient cache and image data. You are probably aware of this.

Cheers, Lars.

On 2012-04-15 10:45:18 +0000, Iebele said:

Hi All,

Nobody has responded to the mail below yet. In the meantime I tried to build Radiance with Intel's icc compiler. Binaries compiled with icc (many warnings) work for a simple scene, but fail with a segmentation fault when I render complex scenes. So I went back to the gcc-compiled binaries, but I still wonder if somebody can give me hints on which compiler to use, which flags, etc.

There's some interaction between the code and the compiler, which no one has yet dug into, that makes icc builds of Radiance fail under some compilation settings. For the moment, stick with gcc.

Concerning the binaries, I have a similar question: would it be better to make a local install of the binaries on each node?

It depends on the cluster, but I suspect it won't matter much; the program binaries are going to be cached in RAM on the nodes anyway.


--
Randolph M. Fritz

Hi,

I think the binaries should be centrally located. Loading them over the network is not that big a problem.

Likewise, I think the scene files (octrees, etc.) should be centrally located. It is just easier to manage.

I think a main problem is each process/node/core writing image data back. This can really slow things down on the NFS server if image pieces are being written into a single image. It is better to have each process write its output to its own file, which can still live on the NFS server, and then assemble the pieces as a post-process. Ambient cache data can still be shared and written to by multiple processes, although I would recommend using one process to pre-populate the cache first. And if, as Andy mentions, you are using something like rtcontrib, then the shared ambient cache is a non-issue.
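
A sketch of the pre-population Jack describes, sometimes called an overture run; the view file, resolutions, and ambient settings are placeholders:

    # seed the shared ambient file with one cheap, low-resolution pass
    rpict -vf view.vf -x 64 -y 64 -ab 2 -af shared.amb scene.oct > /dev/null
    # the full-resolution renders then mostly read shared.amb
    rpict -vf view.vf -x 2048 -y 2048 -ab 2 -af shared.amb scene.oct > piece1.hdr

For the assembly step, pcompos can paste the separately rendered pieces back into one picture, e.g. pcompos -a 1 piece1.hdr piece2.hdr > whole.hdr for vertically stacked slices.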

As far as compiler optimizations go: while they could be nice, I think the real benefit lies in spreading things out over the cluster. That is where you have the most opportunity to speed things up, as opposed to some incremental gain from compiler optimization.

-Jack


--
# Jack de Valpine
# president
#
# visarc incorporated
# http://www.visarc.com
#
# channeling technology for superior design and construction
