Surprising behavior of RMTXOP

Hello all,

I am working on a research project that requires a lot of matrix-matrix
operations (i.e., add, multiply, subtract). Since I am working with the
outputs of RFLUXMTX, I am using RMTXOP.

Since I have to do this many times, I designed my scripts to be
efficient. Accordingly, avoiding matrix multiplication was a priority... so
I rearranged the equation:

Results = DC1*sky + DC2*sky + DC3*sky + DC4*sky

into

Results = (DC1+DC2+DC3+DC4)*sky

I thought the latter would be much faster... but it takes double the
time! Is there any reason for that? I am somewhat surprised since, as I
understand it, matrix multiplication is an O(n^3) algorithm while addition
is O(n^2).

The DC matrices are the contributions from different windows to the
workplane (using the 3-phase method... VTD), and the sky is a vector, not a
matrix.

Best,

Germán

German,

I don't know the details of RFLUXMTX or RMTXOP; however, if they were compiled with an SSE-aware compiler, that might explain it. SSE-aware compilers can perform a multiply and an add in the same cycle. This overlapping of operations can be a big speed-up. That is also why the dot product produces the highest operation rates in BLAS.

Doug

Can you reduce this to the actual commands, noting the matrix dimensions for us?

Thanks!
-Greg

Hello All,

Thanks, Douglas... I have no idea, really, how it is done internally. I do
not really mind this 'bug', but maybe something could be done to make it
better?

Greg,

The sky vector is MF4 (2305x1); the DC matrices are all (2048x2305). What I
was doing before was:

Results = DC1*sky + DC2*sky + DC3*sky + DC4*sky:

rmtxop DC1 sky_vec > lux1
rmtxop DC2 sky_vec > lux2
rmtxop DC3 sky_vec > lux3
rmtxop DC4 sky_vec > lux4

rmtxop lux1 + lux2 + lux3 + lux4 > total_lux

Results = (DC1+DC2+DC3+DC4)*sky:

rmtxop DC1 + DC2 + DC3 + DC4 > TotalDC
rmtxop TotalDC sky_vec > total_lux
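
Each pipeline can be timed with the shell's time built-in, roughly like
this (same file names as above):

time ( rmtxop DC1 sky_vec > lux1 ; rmtxop DC2 sky_vec > lux2 ; \
       rmtxop DC3 sky_vec > lux3 ; rmtxop DC4 sky_vec > lux4 ; \
       rmtxop lux1 + lux2 + lux3 + lux4 > total_lux )

time ( rmtxop DC1 + DC2 + DC3 + DC4 > TotalDC ; \
       rmtxop TotalDC sky_vec > total_lux )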

When replacing rmtxop with dctimestep in the first case, I got slightly
faster results.

Best,

Germán

Hi Germán,

You could be suffering from memory allocation costs, not so much the operations themselves. The rmtxop program uses doubles and 3 components per matrix entry, so that's (3x2048x2305 x 8 bytes) or 108 MBytes for each of your matrices. When you multiply one such matrix by a sky vector, you only have the additional memory needed by the vector (54 KBytes). When you add matrices, rmtxop keeps two of them in memory at a time, or 216 MBytes of memory. That's not a lot for most PCs these days, but the allocation and freeing of that much space may take some time if malloc is not efficient.
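
Quick arithmetic check of that per-matrix figure (3 components x 2048 x 2305 entries x 8 bytes each):

echo '3 * 2048 * 2305 * 8' | bc
113295360

which is the 108 MBytes quoted above.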

I should have asked if your matrices are stored as binary data. If they are ASCII (text), then for sure the strtod()/atof() calls will dominate calculation time completely. Switching to a binary double matrix representation would be faster and a better measure of the time these operations actually take.
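
If they are ASCII at the moment, a one-time conversion with rmtxop will do it; something like this, using the matrix names from your message:

rmtxop -fd DC1 > DC1_d

where -fd selects binary doubles for output (-ff would give floats).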

You would probably need to profile your code to get a clear idea of where it's spending its time.

Cheers,
-Greg

P.S. The dctimestep program uses floats instead of doubles, so it takes half the memory. It is also optimized for matrix multiplications where there are zero rows or columns, so it is substantially faster for annual sky-matrix calculations.
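
For the simple matrix-times-vector case here, the usage is along these lines (a sketch; the sky vector may also be read from the standard input):

dctimestep DC1 sky_vec > lux1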

> You could be suffering from memory allocation costs, not so much the
> operations themselves. [...] The allocation and freeing of that much
> space may take some time if malloc is not efficient.

The problem here is not the amount of memory, but the CPU's access to it.
When the CPU accesses the arrays, the data passes through a hierarchy of
caches. For a modern Intel Core i7, for example, there are typically four
L2 caches of 256 KB each and a slower L3 cache of 8 MB that is shared by the
CPU cores.

If the arrays can be stored in the L2 cache, the processor can usually run
at full speed. If not, then the CPU will typically have to wait while the
data is retrieved from the slower L3 cache.

The caches are on-chip. If the arrays exceed the L3 cache capacity, then the
data will need to be retrieved from the much slower main memory and
transferred over the memory bus. With 216 MB of array data to contend with,
this is most likely the culprit.

Performance optimization typically involves arranging the array data in
memory such that it can be loaded in cache lines, organizing the array
stride, splitting the arrays into subarrays for multithreaded processing,
and so on. However, these are mostly processor-family specific. (For Intel
CPUs, SSE and AVX instructions are also available to improve parallelism.
Optimizing compilers can help here, but hand coding may be needed for
optimal performance on specific processors.)

If the arrays are stored in ASCII rather than binary, you will typically see
a performance hit of several hundred times as the CPU spends most of its
time parsing the strings into floating-point data.
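
This is easy to demonstrate with the Radiance tools themselves; an
illustrative (untested) comparison, using the matrix names from earlier in
the thread:

rmtxop -fd DC1 > DC1_d
time rmtxop DC1 sky_vec > /dev/null
time rmtxop DC1_d sky_vec > /dev/null

The first run is dominated by string parsing; the second is much closer to
the true cost of the multiplication.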

Ian Ashdown, P. Eng. (Ret.), FIES
Senior Scientist
SunTracker Technologies Ltd.
www.suntrackertech.com

Thanks to all!

I am not sure I fully understand why this behaves the way it does, but I
did change the format from ASCII to float, and the time was reduced
considerably (by perhaps 90%).
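
In case it helps others, the conversion is just along these lines (rmtxop's
-ff selects binary float output; matrix names as before):

rmtxop -ff DC1 > DC1_f

I believe rfluxmtx can also write float matrices directly through its
output-format option, which would skip the conversion step entirely; check
the man page for the exact flag.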

I may do some more tests in the future. Thanks to all!

Germán
