A high-performance dctimestep

I have implemented a highly optimized version of dctimestep, achieving a significant performance increase (an 8x to 34x speedup). The primary modifications are as follows:

  1. Memory-Mapped I/O: The daylight coefficient HDR images were combined into a single binary file and accessed via memory mapping (lazy loading) to reduce I/O overhead.
  2. Matrix Multiplication: The sum_images function was refactored to perform matrix multiplication, replacing the traditional sequential weighted accumulation of multiple images.
  3. Hardware Acceleration & Parallelism: The matrix multiplication was heavily optimized using:
  • SIMD: Processes 8 pixels (24 floating-point values) per loop iteration using AVX2 vector instructions, rather than one value at a time.
  • Multi-threading: Leverages OpenMP to distribute computations across multiple CPU cores.
  • Zero-Skipping (Sparsity Optimization): Bypasses calculations for completely black (zero-value) pixels, drastically reducing processing time for sparse images.
  4. Sun Matrix Fallback: If the weather data is a sun matrix, the program dynamically falls back to the original method of weighted image accumulation.
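
To make item 1 concrete, here is a minimal sketch of read-only memory mapping on a POSIX system. It is not the code from the modified dctimestep: the `coef_map` type, the function names, and the assumption that the combined file is a plain array of floats are all hypothetical. Windows would need a different API (file mapping objects) for the same idea.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical handle for a read-only mapping of a raw float file. */
typedef struct {
    const float *data;    /* mapped values, NULL if mapping failed */
    size_t       n_bytes; /* length of the mapping in bytes */
} coef_map;

/* Map the whole file read-only; pages are loaded lazily on first access. */
coef_map coef_map_open(const char *path)
{
    coef_map m = { NULL, 0 };
    struct stat st;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return m;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            m.data = (const float *)p;
            m.n_bytes = (size_t)st.st_size;
        }
    }
    close(fd);  /* the mapping remains valid after the descriptor is closed */
    return m;
}

/* Release the mapping and reset the handle. */
void coef_map_close(coef_map *m)
{
    assert(m != NULL);
    if (m->data != NULL)
        munmap((void *)m->data, m->n_bytes);
    m->data = NULL;
    m->n_bytes = 0;
}
```

Because the kernel only reads pages that are actually touched, regions of the file that are never accessed cost no I/O.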
    The compiled binaries and source files have been uploaded here (dctimestep files - Google Drive).
    When compiling the files with gcc, the compiler flags -mavx2, -mfma, and -fopenmp must be included. For example: gcc -O3 -mavx2 -mfma -fopenmp cmatrix.c cmbsdf.c dctimestep.c -I../../src/common -L../../build/src/lib -lrtrad -lm -o dctimestep
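
As a rough illustration of items 2 and 3, the sketch below writes the weighted image sum as a matrix product with zero-skipping and optional OpenMP threading. It is an assumption-laden sketch, not the actual `sum_images` code: the function name, the flat row-major float layout, and the choice to parallelize over pixel values are all hypothetical, and explicit SIMD intrinsics are left out to keep it portable.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical layout (row-major):
 *   coef    : n_src   x n_vals  one row per coefficient image (RGB flattened)
 *   weights : n_steps x n_src   one row per time step
 *   out     : n_steps x n_vals  out = weights * coef
 */
void weighted_image_sum(const float *coef, const float *weights, float *out,
                        size_t n_src, size_t n_vals, size_t n_steps)
{
    long p;

    assert(coef != NULL && weights != NULL && out != NULL);

    /* Each iteration writes only column p of out, so threads never collide. */
#ifdef _OPENMP
#pragma omp parallel for schedule(static)
#endif
    for (p = 0; p < (long)n_vals; p++) {
        size_t k, t;

        for (t = 0; t < n_steps; t++)
            out[t * n_vals + (size_t)p] = 0.0f;

        for (k = 0; k < n_src; k++) {
            const float c = coef[k * n_vals + (size_t)p];

            if (c == 0.0f)  /* zero-skipping: a black value adds nothing */
                continue;
            for (t = 0; t < n_steps; t++)
                out[t * n_vals + (size_t)p] += weights[t * n_src + k] * c;
        }
    }
}
```

Built without `-fopenmp`, the pragma is compiled out and the loop runs on a single thread with the same result.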

Hi Yongqing,

This is very cool! Which operating system(s) is this running under?

Best,
-Greg

Hi Greg,

I’ve tested the performance on Windows and Linux (Ubuntu 22.04), but I haven’t had the chance to test it on macOS yet.

Great! I’d like to give it a try on macOS, probably this week. I noticed that you must have worked from an older version of cmatrix.c, because there were a bunch of unrelated differences due to updates and changes I made over the past year or two.

Cheers,
-Greg

Yes, I implemented the modifications on top of version 5.4a. I’ll sync with the latest codebase shortly.


If you have time at a later point, it would also be interesting to compare performance to the new pvsum command, which overlaps somewhat with dctimestep but adds some optimizations under Unix (not Windows). If I get this working under macOS, I will definitely be comparing the two.

Cheers,
-Greg

Hi Greg,

I’ve synced my code with the latest updates (Link: dctimestep files - Google Drive).
I’m currently working on the performance comparison with pvsum and will share the results soon.

A performance comparison between pvsum and the optimized dctimestep is presented below:


Fig.1 pvsum


Fig.2 optimized dctimestep

Additionally, I noticed that the hasFormat function cannot recognize formats like %04d, which causes an error (see Fig. 3). Therefore, I have replaced it with hasNumberFormat.


Fig.3 run error

Fig.4 hasFormat function

Fig.5 hasNumberFormat function
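
For readers without access to the figures, the following is a hypothetical sketch of this kind of check. It is not the `hasFormat` or `hasNumberFormat` code shown above; the function name and the exact acceptance rule (exactly one integer conversion, with optional flags, width, and precision) are assumptions made for illustration.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Return 1 if s contains exactly one printf-style integer conversion
 * (such as %d or %04d) and no other conversions; otherwise return 0.
 * A doubled "%%" is treated as a literal percent sign.
 */
int has_int_format_sketch(const char *s)
{
    int count = 0;

    if (s == NULL)
        return 0;
    while ((s = strchr(s, '%')) != NULL) {
        s++;                                   /* step past '%' */
        if (*s == '%') {                       /* literal percent sign */
            s++;
            continue;
        }
        while (*s != '\0' && strchr("-+ 0#", *s) != NULL)
            s++;                               /* flags */
        while (isdigit((unsigned char)*s))
            s++;                               /* field width */
        if (*s == '.') {                       /* optional precision */
            s++;
            while (isdigit((unsigned char)*s))
                s++;
        }
        if (*s == '\0' || strchr("diouxX", *s) == NULL)
            return 0;                          /* non-integer or truncated spec */
        s++;
        count++;
    }
    return count == 1;
}
```

With this rule, a pattern such as `out_%04d.hdr` is accepted, while a name with no conversion or with a `%s` in it is rejected.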

Oops – I had made a change just yesterday to the format detection routine and introduced a bug while attempting to fix another (more minor) one. I just checked in a fix to my fix – sorry about that.

Thanks for updating your code to the latest. I will see if I can run a comparison on my Mac mini m4 pro. If you can make your files available someplace, we could make sure it’s an apples-to-apples comparison. Otherwise, I’ll use something I have lying around.

Cheers,
-Greg

OK, I guess I meant ‘apples-to-intels comparison,’ which isn’t going to work as your changes don’t seem to compile on a non-Intel machine. I get:

immintrin.h:14:2: error: "This header is only meant to be used on x86 and x64 architecture"

and:

hresetintrin.h:42:27: error: invalid input constraint 'a' in asm
42 | __asm__ ("hreset $0" :: "a"(__eax));

as well as a whole bunch of type errors.

I would like to include your code in the distribution, since it seems to be a huge performance improvement, but it may need to be an optional compile to avoid breaking the build on Apple Silicon and other incompatible systems.
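
One common way to make such code an optional compile is to guard the intrinsics behind the compiler's own feature macros and keep a scalar fallback. The sketch below assumes GCC or Clang, which define `__AVX2__` and `__FMA__` when `-mavx2 -mfma` are in effect; the `HAVE_AVX2_FMA` macro and the `scaled_accumulate` helper are hypothetical names, not part of the submitted code.

```c
#include <stddef.h>

/* Include x86 intrinsics only when the target actually supports them. */
#if defined(__AVX2__) && defined(__FMA__)
#include <immintrin.h>
#define HAVE_AVX2_FMA 1
#else
#define HAVE_AVX2_FMA 0
#endif

/* dst[i] += w * src[i] for i in [0, n) */
void scaled_accumulate(float *dst, const float *src, float w, size_t n)
{
    size_t i = 0;

#if HAVE_AVX2_FMA
    const __m256 vw = _mm256_set1_ps(w);

    /* 8 floats per iteration with a fused multiply-add */
    for (; i + 8 <= n; i += 8) {
        __m256 d = _mm256_loadu_ps(dst + i);
        __m256 s = _mm256_loadu_ps(src + i);
        _mm256_storeu_ps(dst + i, _mm256_fmadd_ps(vw, s, d));
    }
#endif
    /* Scalar tail on x86, and the entire loop on Apple Silicon and others. */
    for (; i < n; i++)
        dst[i] += w * src[i];
}
```

On a non-x86 target the header is never included, so the build falls back to the plain loop instead of failing.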

Regarding your tests against pvsum, could you also try using the -m option set to your available RAM (in GBytes) and -N set to the number of physical cores? I’m wondering if the performance would be any better.

Cheers,
-Greg

Hi Greg,
I have modified the code to make it work on Apple computers and sent it to you by email. I also reran the pvsum tests with the -m option and with -N set to the number of physical cores. The test results are below: