Radiance on 64-bit

Hello again,

We are considering getting a new G5! I am wondering if there is any work happening on creating a 64-bit version of Radiance? If not, what are people's thoughts on how extensive this conversion would be?

Thanks!
Zack

···

--
Zack Rogers
Staff Engineer
Architectural Energy Corporation
2540 Frontier Avenue, Suite 201
Boulder, CO 80301 USA

tel (303)444-4149 ext.235
fax (303)444-4304

Zack Rogers wrote:

Hello again,

We are considering getting a new G5! I am wondering if there is any
work happening on creating a 64-bit version of Radiance? If not, what
are people's thoughts on how extensive this conversion would be?

Doesn't the G5 run 32 bit binaries as well?
I'm not sure what the benefit would be, btw. Are you thinking of
some aspect in particular?

There are many elements in a program that can be either 32 or
64 bits wide (or any other size). Among others those are the
sizes of ints (increasing their range), floats (increasing their
precision), and pointers (increasing the address space).
But very often, those have no direct influence on performance.
The most noticeable thing to change might be the required amount
of memory. In the worst case, you'd have to install twice as much
RAM to get the same result.
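
[A minimal sketch (not from the Radiance source) of why memory footprint,
rather than arithmetic speed, is usually what changes in a 64-bit build:
pointers double in size, doubles don't, so pointer-heavy structures grow.
The `node_sketch` struct below is hypothetical.]

```c
#include <stddef.h>

/* On an ILP32 build, pointers are 4 bytes; on an LP64 build they are 8.
 * Doubles are 8 bytes either way.  So a record full of pointers nearly
 * doubles in size, while one full of doubles stays the same. */
typedef struct {
    void  *kids[8];     /* 8 child pointers: 32 bytes on ILP32, 64 on LP64 */
    double org[3];      /* 24 bytes on either platform */
} node_sketch;          /* hypothetical, just for illustration */

size_t pointer_bytes(void) { return sizeof(void *); }
size_t node_bytes(void)    { return sizeof(node_sketch); }
```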

What *would* help for the recent G4/G5 Macs is the use of the
AltiVec unit in the CPU. If someone finds a way to e.g. unroll the
64 multiplications in multmat4 in src/common/mat4.c, so that they
get executed in parallel, then that might make a real difference.
Anyone want to dig into PowerPC assembler?
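
[As a rough sketch of the loop nest in question, assuming
multmat4(m4a, m4b, m4c) computes m4a = m4b x m4c as the names suggest;
the real code in src/common/mat4.c may differ. The inner sum is already
unrolled over k, exposing independent multiply-adds that a SIMD unit
such as AltiVec could execute in parallel:]

```c
#include <string.h>

typedef double MAT4[4][4];

/* Sketch of a 4x4 matrix multiply in the style of Radiance's
 * multmat4(): res = m4b * m4c.  The 64 multiplies are independent
 * across (i, j), which is what makes vectorization attractive. */
void
multmat4_sketch(MAT4 m4a, MAT4 m4b, MAT4 m4c)
{
    MAT4 res;
    int i, j;
    for (i = 0; i < 4; i++)
        for (j = 0; j < 4; j++)
            res[i][j] = m4b[i][0]*m4c[0][j] + m4b[i][1]*m4c[1][j]
                      + m4b[i][2]*m4c[2][j] + m4b[i][3]*m4c[3][j];
    memcpy(m4a, res, sizeof(MAT4));   /* lets m4a alias an input safely */
}
```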

-schorsch

···

--
Georg Mischler -- simulations developer -- schorsch at schorsch com
+schorsch.com+ -- lighting design tools -- http://www.schorsch.com/

<snip>

Doesn't the G5 run 32 bit binaries as well?
I'm not sure what the benefit would be, btw. Are you thinking of
some aspect in particular?

Yes, the G5 is binary compatible with 32-bit apps. What Zack was hoping
for is increased math performance for long ints and floats. Not knowing
the source code, I can't tell what types of variables/math Radiance
uses. But if you can combine a two-step 32-bit operation into a single
64-bit operation, then you can gain significant performance
improvements.
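
[For instance (a generic illustration, not Radiance code): the full
64-bit product of two 32-bit integers is a single multiply on a 64-bit
datapath, while a 32-bit-only path has to assemble it from partial
products and adds:]

```c
#include <stdint.h>

/* One 64-bit operation on a 64-bit CPU. */
uint64_t mul32_wide(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}

/* What a 32-bit-only path effectively does: schoolbook long
 * multiplication from four 16x16 partial products plus shifts/adds. */
uint64_t mul32_split(uint32_t a, uint32_t b)
{
    uint32_t al = a & 0xFFFF, ah = a >> 16;
    uint32_t bl = b & 0xFFFF, bh = b >> 16;
    uint64_t lo   = (uint64_t)al * bl;
    uint64_t mid1 = (uint64_t)al * bh;
    uint64_t mid2 = (uint64_t)ah * bl;
    uint64_t hi   = (uint64_t)ah * bh;
    return lo + (mid1 << 16) + (mid2 << 16) + (hi << 32);
}
```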

There are many elements in a program that can be either 32 or
64 bits wide (or any other size). Among others those are the
sizes of ints (increasing their range), floats (increasing their
precision), and pointers (increasing the address space).
But very often, those have no direct influence on performance.
The most noticeable thing to change might be the required amount
of memory. In the worst case, you'd have to install twice as much
RAM to get the same result.

What *would* help for the recent G4/G5 Macs is the use of the
AltiVec unit in the CPU. If someone finds a way to e.g. unroll the
64 multiplications in multmat4 in src/common/mat4.c, so that they
get executed in parallel, then that might make a real difference.
Anyone want to dig into PowerPC assembler?

I think this answers Zack's and my question. Radiance isn't using
64-bit words and won't benefit from 64-bit math.

Thanks,
Ferdinand

···

--On Thursday, January 08, 2004 02:48:38 PM -0500 Georg Mischler <[email protected]> wrote:

--
Ferdinand Schmid
Architectural Energy Corporation
Celebrating over 20 Years of Improving Building Energy Performance

Ferdinand Schmid wrote:

I think this answers Zack's and my question. Radiance isn't using
64bit words and won't benefit from 64bit math.

Well, it's entirely possible that making explicit use of wider
data paths *somewhere* in the G5 system architecture might result
in a performance advantage. But obviously, such an optimisation
would be highly platform specific. If you feel like researching
compiler switches and their potential benefits and dangers, be my
guest!

In the meantime I found that Apple provides an AltiVec-optimized
BLAS implementation, which includes all common matrix and vector
operations. It looks like we could just replace the contents of
mat4.c and invmat4.c with macros invoking the equivalent functions
in that library to make optimal use of the special CPU capabilities.
I don't have a Mac myself (although I'm open to donations... ;),
but the conversion looks quite straightforward if anyone wants to
give it a try:

  http://developer.apple.com/hardware/ve/vector_libraries.html

-schorsch

···

--
Georg Mischler -- simulations developer -- schorsch at schorsch com
+schorsch.com+ -- lighting design tools -- http://www.schorsch.com/

From: Georg Mischler <[email protected]>
Date: January 9, 2004 6:13:23 AM PST

Ferdinand Schmid wrote:

I think this answers Zack's and my question. Radiance isn't using
64bit words and won't benefit from 64bit math.

Well, it's entirely possible that making explicit use of wider
data paths *somewhere* in the G5 system architecture might result
in a performance advantage. But obviously, such an optimisation
would be highly platform specific. If you feel like researching
compiler switches and their potential benefits and dangers, be my
guest!

I spent a little time on compiler options myself, since I have a dual-processor G5, and found that the best I could get was about a 10% improvement over the -O2 default setting without causing some failure in the system. The -fast option that does all the most aggressive optimizations for the G5 works on Mark Stock's benchmark with good effect (30% speed improvement or so), but causes an infinite loop somewhere in the code for other scenes. Regrettably, I don't have a good recommendation for a set of options to use with the G5 in lieu of -fast -- I tried a bunch of them and never got anything close to the performance improvement of this one option, and it isn't reliable.

In the meantime I found that Apple provides an AltiVec-optimized
BLAS implementation, which includes all common matrix and vector
operations. It looks like we could just replace the contents of
mat4.c and invmat4.c with macros invoking the equivalent functions
in that library to make optimal use of the special CPU capabilities.
I don't have a Mac myself (although I'm open to donations... ;),
but the conversion looks quite straightforward if anyone wants to
give it a try:

  http://developer.apple.com/hardware/ve/vector_libraries.html

Even easier, Apple offers a library of their own routines to get the job done. Just run "man Accelerate" to learn all about it. However, I'm not sure how much speedup you'll get by reimplementing the matrix routines -- they're pretty short vectors, and they're not called all that often in the code. I'd recommend looking in the ray tracing routines in "rt/raytrace.c" for places to optimize first, then perhaps "source.c" at the direct calculations. Radiance doesn't deal with a lot of long vectors, though, and the set-up costs for vectors of length 4 or less usually cancel the savings, from what I've heard.

-Greg

Thanks Greg and Georg,

This is the type of information I was hoping to receive. Many of Zack's
simulations take several days to complete (on our current dual AthlonMP
1.8 GHz systems). I wanted to identify options to improve performance.
For some types of analyses (e.g. some of our CFD work) it makes sense
to use 64-bit code to more efficiently fit the problems into memory and
to improve the math.

Apparently Radiance problems are not the best fit for a 64-bit platform
at this time. We may be better off with a small cluster of high clock
speed 32-bit systems.

Thanks again for taking the time to guide us with your expert insight,
Ferdinand

···

--On Friday, January 09, 2004 09:06:20 AM -0800 Greg Ward <[email protected]> wrote:


--
Ferdinand Schmid
Architectural Energy Corporation
Celebrating over 20 Years of Improving Building Energy Performance

Ferdinand Schmid wrote:

Apparently Radiance problems are not the best fit for a 64-bit platform
at this time. We may be better off with a small cluster of high clock
speed 32-bit systems.

The only problem with that approach is that NFS locking is problematic. I believe Visarc Jack uses dual CPU machines and keeps all his rpiece simulations limited to the two CPUs in any one box, after trying many times to get good results across a cluster. John An recently reported some difficulty with this as well.

I'm not saying it cannot be done, simply that a lot of very experienced people on this list have seemingly given up on making big clusters run rpiece without error. But if you're bringing your dual CPU machine to its knees already, perhaps it's worth a look.

Your mileage may vary on this advice, since I only have a rudimentary understanding of rpiece and the NFS problems. I just thought I'd bring it up. I'd love to hear success stories with this issue...

···

----

      Rob Guglielmetti

e. [email protected]
w. www.rumblestrip.org

Greg Ward wrote:

  The -fast option that does all the most aggressive
optimizations for the G5 works on Mark Stock's benchmark with good
effect (30% speed improvement or so), but causes an infinite loop
somewhere in the code for other scenes.

Yes, certain types of optimization tend to have that effect
sometimes. Do you have any idea where exactly it was hanging?
On one hand, I hate to suggest platform specific optimizations,
but on the other hand this problem might point us to places
where we'd want to simplify our code anyway...

> In the mean time I found that Apple provides an altivec optimized
> BLAS implementation, which includes all common matrix and vector
> operations.
> ...
>
> http://developer.apple.com/hardware/ve/vector_libraries.html

Even easier, Apple offers a library of their own routines to get the
job done. Just run "man Accelerate" to learn all about it.

I suspect we're talking about the same thing.

However,
I'm not sure how much speedup you'll get by reimplementing the matrix
routines -- they're pretty short vectors, and they're not called all
that often in the code. I'd recommend looking in the ray tracing
routines in "rt/raytrace.c" for places to optimize first, then perhaps
"source.c" at the direct calculations.

Taking this a bit further, has anyone ever profiled Radiance?
I think this might give us interesting information that could be
useful for all platforms.

-schorsch

···

--
Georg Mischler -- simulations developer -- schorsch at schorsch com
+schorsch.com+ -- lighting design tools -- http://www.schorsch.com/

From: Georg Mischler <[email protected]>
Date: January 9, 2004 10:54:31 AM PST

Greg Ward wrote:

  The -fast option that does all the most aggressive
optimizations for the G5 works on Mark Stock's benchmark with good
effect (30% speed improvement or so), but causes an infinite loop
somewhere in the code for other scenes.

Yes, certain types of optimization tend to have that effect
sometimes. Do you have any idea where exactly it was hanging?
On one hand, I hate to suggest platform specific optimizations,
but on the other hand this problem might point us to places
where we'd want to simplify our code anyway...

Unfortunately, I have no idea where it's hanging. I didn't try forcing a quit, but I could do that and take a look at the traceback. I'm not sure it would provide the needed information with such a level of optimization, but it might.

In the meantime I found that Apple provides an AltiVec-optimized
BLAS implementation, which includes all common matrix and vector
operations.
...

  http://developer.apple.com/hardware/ve/vector_libraries.html

Even easier, Apple offers a library of their own routines to get the
job done. Just run "man Accelerate" to learn all about it.

I suspect we're talking about the same thing.

"Doh!" as Homer would say. That's what I get for not looking at your link.

However,
I'm not sure how much speedup you'll get by reimplementing the matrix
routines -- they're pretty short vectors, and they're not called all
that often in the code. I'd recommend looking in the ray tracing
routines in "rt/raytrace.c" for places to optimize first, then perhaps
"source.c" at the direct calculations.

Taking this a bit further, has anyone ever profiled Radiance?
I think this might give us interesting information that could be
useful for all platforms.

I haven't profiled Radiance for years and years. I did a lot of profiling during early development, but sort of fell out of the habit. The time spent by the code varies tremendously with the scene input -- number of light sources, data lookup, .cal files, etc. I've noticed that it can bottleneck in about a dozen places, depending on what's cooking.

-Greg

Greg Ward wrote:

I haven't profiled Radiance for years and years. I did a lot of
profiling during early development, but sort of fell out of the habit.
The time spent by the code varies tremendously with the scene input --
number of light sources, data lookup, .cal files, etc. I've noticed
that it can bottleneck in about a dozen places, depending on what's
cooking.

"grep '^{' src/*/*.c" suggests that there are more than 4000
function definitions in Radiance. If we can pick out only a dozen
of those to know where to look for bottlenecks, then that sounds
very promising!

Of course, we need to find a collection of scenes where each one
triggers one of those bottlenecks. That might actually be
something to put into the standard test suite...

-schorsch

···

--
Georg Mischler -- simulations developer -- schorsch at schorsch com
+schorsch.com+ -- lighting design tools -- http://www.schorsch.com/