A high performance dctmestep

1112 · April 20, 2026, 1:28pm

I have implemented a highly optimized version of dctimestep, achieving a significant performance increase (an 8x to 34x speedup). The primary modifications are as follows:

Memory-Mapped I/O: The daylight coefficient HDR images were combined into a single binary file and accessed via memory mapping (lazy loading) to reduce I/O overhead.
Matrix Multiplication: The sum_images function was refactored to perform matrix multiplication, replacing the traditional sequential weighted accumulation of multiple images.
Hardware Acceleration & Parallelism: The matrix multiplication was heavily optimized using:

SIMD: Processes 8 pixels (24 floating-point values) simultaneously in a single CPU cycle, rather than one by one.
Multi-threading: Leverages OpenMP to distribute computations across multiple CPU cores.
Zero-Skipping (Sparsity Optimization): Bypasses calculations for completely black (zero-value) pixels, drastically reducing processing time for sparse images.

Sun Matrix Fallback: If the weather data is a sun matrix, the program dynamically falls back to the original method of weighted image accumulation.
The complied binary files and source files was uploaded here.(dctimestep files - Google Drive).
When compiling the files with gcc, the compiler flags -mavx2, -mfma, and -fopenmp must be included. For example: gcc -O3 -mavx2 -mfma -fopenmp cmatrix.c cmbsdf.c dctimestep.c -I../../src/common -L../../build/src/lib -lrtrad -lm -o dctimestep

Greg_Ward · April 20, 2026, 8:02pm

Hi Yongqing,

This is very cool! Which operating system(s) is this running under?

Best,
-Greg

1112 · April 21, 2026, 2:42am

Hi Greg,

I’ve tested the performance on Windows and Linux (Ubuntu 22.05), but I haven’t had the chance to test it on MacOS yet.

Greg_Ward · April 21, 2026, 2:58am

Great! I’d like to give it a try on MacOS, probably this week. I noticed that you must have worked from an older version of cmatrix.c, because there were a bunch of unrelated differences due to updates and changes I made over the past year or two.

Cheers,
-Greg

1112 · April 21, 2026, 3:12am

Yes, I implemented the modifications on top of version 5.4a. I’ll sync with the latest codebase shortly.

Greg_Ward · April 21, 2026, 4:09pm

If you have time at a later point, it would also be interesting to compare performance to the new pvsum command, which overlaps somewhat with dctimestep but adds some optimizations under Unix (not Windows). If I get this working under macOS, I will definitely be comparing the two.

Cheers,
-Greg

1112 · April 22, 2026, 2:39pm

Hi Greg,

I’ve synced my code with the latest updates (Link: dctimestep files - Google Drive).
I’m currently working on the performance comparison with pvsum and will share the results soon.

1112 · April 22, 2026, 3:28pm

A performance comparison between pvsum and the optimized dctimestep is presented below:

Fig.1 pvsum

Fig.2 optimized dctimestep

Additionally, I noticed that the hasFormat function cannot recognize formats like %04d, which causes an error (see Fig. 3). Therefore, I have replaced it with hasNumberFormat.

Fig.3 run error

Fig.4 hasFormat function

Fig.5 hasNumberFormat function

Greg_Ward · April 22, 2026, 4:21pm

Oops – I had made a change just yesterday to the format detection routine and introduced a bug while attempting to fix another (more minor) one. I just checked in a fix to my fix – sorry about that.

Thanks for updating your code to the latest. I will see if I can run a comparison on my Mac mini m4 pro. If you can make your files available someplace, we could make sure it’s an apples-to-apples comparison. Otherwise, I’ll use something I have lying around.

Cheers,
-Greg

Greg_Ward · April 22, 2026, 5:24pm

OK, I guess I meant ‘apples-to-intels comparison,’ which isn’t going to work as your changes don’t seem to compile on a non-Intel machine. I get:

immintrin.h:14:2: error: “This header is only meant to be used on x86 and x64 architecture”

and:

hresetintrin.h:42:27: error: invalid input constraint ‘a’ in asm
42 | asm (“hreset $0” :: “a”(__eax));

as well as a whole bunch of type errors.

I would like to include your code in the distribution, since it seems to be a huge performance improvement, but it may need to be an optional compile to avoid breaking the build on Apple Silicon and other incompatible systems.

Regarding your tests against pvsum, could you also try using the -m option set to your available RAM (in GBytes) and -N set to the number of physical cores? I’m wondering if the performance would be any better.

Cheers,
-Greg

1112 · April 23, 2026, 1:10pm

Hi Greg,
I have modifide the code to make it work on Apple computer and sent it to you through email. Additionally, I also have used -m option and -N set to the number of physical cores, below is the test results: