A high-performance dctimestep

I have implemented a highly optimized version of dctimestep, achieving a significant performance increase (an 8x to 34x speedup). The primary modifications are as follows:

  1. Memory-Mapped I/O: The daylight coefficient HDR images were combined into a single binary file and accessed via memory mapping (lazy loading) to reduce I/O overhead.
  2. Matrix Multiplication: The sum_images function was refactored to perform matrix multiplication, replacing the traditional sequential weighted accumulation of multiple images.
  3. Hardware Acceleration & Parallelism: The matrix multiplication was heavily optimized using:
  • SIMD: Uses AVX2/FMA instructions to process 8 pixels (24 floating-point values, i.e. three 256-bit registers) per vectorized step, rather than one value at a time.
  • Multi-threading: Leverages OpenMP to distribute computations across multiple CPU cores.
  • Zero-Skipping (Sparsity Optimization): Bypasses calculations for completely black (zero-value) pixels, drastically reducing processing time for sparse images.
  4. Sun Matrix Fallback: If the weather data is a sun matrix, the program dynamically falls back to the original method of weighted image accumulation.
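
To illustrate items 2 and 3, here is a minimal sketch of the idea, not the uploaded implementation. It treats the weighted accumulation as a product of a coefficient matrix and a sky matrix, skips black coefficient pixels, and splits the pixel range across threads with OpenMP. The function name dc_multiply and the array layouts are assumptions made for this example. It uses no explicit AVX2 intrinsics and leaves vectorization to the compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout (an assumption, not taken from the posted source):
 *   coef[(k*npix + p)*3 + c]   daylight coefficient: patch k, pixel p, channel c
 *   sky[(k*nsteps + t)*3 + c]  sky value: patch k, time step t, channel c
 *   out[(t*npix + p)*3 + c]    result image for time step t
 */
void dc_multiply(const float *coef, const float *sky, float *out,
                 size_t npix, size_t npatch, size_t nsteps)
{
    assert(coef != NULL && sky != NULL && out != NULL);
    memset(out, 0, nsteps * npix * 3 * sizeof(float));

    /* Each thread owns a disjoint set of pixels, so no locking is needed.
     * Without -fopenmp the pragma is ignored and the loop runs serially. */
    #pragma omp parallel for schedule(static)
    for (long p = 0; p < (long)npix; p++) {
        for (size_t k = 0; k < npatch; k++) {
            const float *cpx = coef + (k * npix + (size_t)p) * 3;

            /* Zero-skipping: a black coefficient pixel contributes nothing
             * to any time step, so the whole inner loop can be bypassed. */
            if (cpx[0] == 0.0f && cpx[1] == 0.0f && cpx[2] == 0.0f)
                continue;

            for (size_t t = 0; t < nsteps; t++) {
                const float *s = sky + (k * nsteps + t) * 3;
                float *o = out + (t * npix + (size_t)p) * 3;
                o[0] += cpx[0] * s[0];
                o[1] += cpx[1] * s[1];
                o[2] += cpx[2] * s[2];
            }
        }
    }
}
```

The zero test is done once per coefficient pixel and amortized over all time steps, which is where a sparse coefficient set would save the most work. A production kernel would likely also reorder the loops or block the data for better cache behavior.
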
    The compiled binaries and source files have been uploaded here (dctimestep files - Google Drive).
    When compiling the files with gcc, the compiler flags -mavx2, -mfma, and -fopenmp must be included. For example:
        gcc -O3 -mavx2 -mfma -fopenmp cmatrix.c cmbsdf.c dctimestep.c -I../../src/common -L../../build/src/lib -lrtrad -lm -o dctimestep
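
For readers unfamiliar with the memory-mapped I/O mentioned in item 1, here is a minimal POSIX sketch of the technique. It assumes a hypothetical headerless file of raw 32-bit floats; the layout of the combined file used by the uploaded program may well differ, and the names map_coefficients and unmap_coefficients are invented for this example. Pages are read from disk only when first touched, which is the lazy loading referred to above.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a file of raw float values read-only (hypothetical file layout).
 * Returns NULL on failure; on success *nfloats holds the value count. */
const float *map_coefficients(const char *path, size_t *nfloats)
{
    assert(path != NULL && nfloats != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }

    void *base = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    if (base == MAP_FAILED)
        return NULL;

    *nfloats = (size_t)st.st_size / sizeof(float);
    return (const float *)base;
}

/* Release a mapping obtained from map_coefficients(). */
void unmap_coefficients(const float *base, size_t nfloats)
{
    if (base != NULL)
        munmap((void *)base, nfloats * sizeof(float));
}
```

The returned pointer can be indexed like an ordinary array, so a multiplication kernel can read coefficients straight from the mapping without a separate read pass. On Windows the equivalent would go through the file-mapping API rather than mmap.
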