I have implemented a highly optimized version of dctimestep, achieving a significant performance increase (an 8x to 34x speedup). The primary modifications are as follows:
- Memory-Mapped I/O: The daylight coefficient HDR images were combined into a single binary file and accessed via memory mapping (lazy loading) to reduce I/O overhead.
- Matrix Multiplication: The `sum_images` function was refactored to perform matrix multiplication, replacing the traditional sequential weighted accumulation of multiple images.
- Hardware Acceleration & Parallelism: The matrix multiplication was heavily optimized using:
- SIMD: Processes 8 pixels (24 floating-point values) at a time with AVX2 vector instructions, rather than one value at a time.
- Multi-threading: Leverages OpenMP to distribute computations across multiple CPU cores.
- Zero-Skipping (Sparsity Optimization): Bypasses calculations for completely black (zero-value) pixels, drastically reducing processing time for sparse images.
- Sun Matrix Fallback: If the weather data is a sun matrix, the program dynamically falls back to the original method of weighted image accumulation.
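For readers unfamiliar with the memory-mapping approach, here is a minimal sketch of the idea, not the code from the uploaded files. It assumes a POSIX system and a hypothetical headerless file of raw `float` values; the names `CoeffMap`, `coeff_map_open`, and `coeff_map_close` are made up for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical read-only view of a combined coefficient file.
 * Assumed layout: raw floats with no header (an illustration only). */
typedef struct {
    const float *data;  /* start of the mapped region */
    size_t n_bytes;     /* size of the mapped region in bytes */
} CoeffMap;

/* Map the whole file read-only. Pages are loaded lazily by the OS the
 * first time they are touched. Returns 0 on success, -1 on failure. */
int coeff_map_open(const char *path, CoeffMap *out)
{
    assert(path != NULL && out != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return -1;
    }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping remains valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return -1;

    out->data = (const float *)p;
    out->n_bytes = (size_t)st.st_size;
    return 0;
}

/* Release the mapping and reset the struct. */
void coeff_map_close(CoeffMap *m)
{
    if (m->data != NULL)
        munmap((void *)m->data, m->n_bytes);
    m->data = NULL;
    m->n_bytes = 0;
}
```

With this pattern the file is never read in full up front; only the pages the multiplication actually touches are brought into memory.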
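The matrix-multiplication view with zero-skipping and OpenMP can be sketched roughly as follows. This is an illustrative stand-in, not the refactored `sum_images` itself: the function name `sum_images_matmul`, the image-major RGB layout, and the per-channel sky weights are all assumptions, and vectorization is left to the compiler rather than written with explicit AVX2 intrinsics.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: out[p][c] = sum over k of coeffs[k][p][c] * sky[k][c].
 *
 * coeffs: n_images * n_pixels * 3 floats, image-major (assumed layout)
 * sky:    n_images * 3 floats, one RGB weight per coefficient image
 * out:    n_pixels * 3 floats, overwritten with the result
 */
void sum_images_matmul(const float *coeffs, const float *sky, float *out,
                       size_t n_images, size_t n_pixels)
{
    assert(coeffs != NULL && sky != NULL && out != NULL);

    /* Each thread owns a disjoint set of output pixels, so no locking is
     * needed. Without -fopenmp the pragma is ignored and the loop runs
     * serially. */
    #pragma omp parallel for schedule(static)
    for (long p = 0; p < (long)n_pixels; p++) {
        float acc[3] = {0.0f, 0.0f, 0.0f};

        for (size_t k = 0; k < n_images; k++) {
            const float *px = coeffs + (k * n_pixels + (size_t)p) * 3;

            /* Zero-skipping: a completely black pixel contributes nothing. */
            if (px[0] == 0.0f && px[1] == 0.0f && px[2] == 0.0f)
                continue;

            const float *w = sky + k * 3;
            acc[0] += px[0] * w[0];
            acc[1] += px[1] * w[1];
            acc[2] += px[2] * w[2];
        }

        out[p * 3 + 0] = acc[0];
        out[p * 3 + 1] = acc[1];
        out[p * 3 + 2] = acc[2];
    }
}
```

Parallelizing over output pixels rather than over images avoids write conflicts between threads; a hand-tuned version would likely also reorder the data so the inner loop reads contiguous memory.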
The compiled binaries and source files have been uploaded here (dctimestep files - Google Drive).
When compiling the files with gcc, the compiler flags `-mavx2`, `-mfma`, and `-fopenmp` must be included. For example: `gcc -O3 -mavx2 -mfma -fopenmp cmatrix.c cmbsdf.c dctimestep.c -I../../src/common -L../../build/src/lib -lrtrad -lm -o dctimestep`
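If you are unsure whether the flags actually reached the compiler, a small check like the one below may help. It relies on the predefined macros `__AVX2__`, `__FMA__`, and `_OPENMP`, which gcc sets when the corresponding flags are given; the helper function names are hypothetical and not part of the uploaded sources.

```c
#include <stdio.h>

/* Each helper returns 1 if the feature was enabled at compile time. */
int built_with_avx2(void)
{
#ifdef __AVX2__
    return 1;
#else
    return 0;
#endif
}

int built_with_fma(void)
{
#ifdef __FMA__
    return 1;
#else
    return 0;
#endif
}

int built_with_openmp(void)
{
#ifdef _OPENMP
    return 1;
#else
    return 0;
#endif
}

/* Print a one-line summary of the build configuration. */
void print_build_features(void)
{
    printf("AVX2=%d FMA=%d OpenMP=%d\n",
           built_with_avx2(), built_with_fma(), built_with_openmp());
}
```

Calling `print_build_features()` from a scratch program built with the same flags shows which features are enabled. Note that this only reports how the code was compiled, not whether the CPU running it supports AVX2.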