memory usage rtrace -n

Hi Greg,

While doing the renderings for the VR of Kynthia we encountered "sometimes" (meaning not 100% reproducible) problems with memory.

We rendered 4 images at the same time, sharing an ambient file; each rtrace was using the -n 2 or -n 3 option.

I made a screenshot of top and of some of the processes. If you look at PID 88263, it seems the "mother process" uses 41 GB (VIRT) in total!! Since some of our machines don't have a large swap space, some of these processes failed with "cannot allocate memory". I know that virtual memory is not a real indicator of what is actually used, but of our 400 jobs, around 10 failed with this issue.

The "children" use around 800-900mb, so this is fine and what we expected. But we dont know how to estimate to total memory usage (lets say a single rtrace would need 500mb, I would have expected running -n 2 uses 1GB, but at least there is also the mother process, which size a bit unpredictable and sometimes exploding.

This "growth" of the mother process happens always at the end of the images (lets say 90% finished).

Interestingly, after restarting the processes the failure never happened again (though I have to admit I didn't explicitly restart the simulation on the same machine, since I had a fully automated process in which the failed jobs were automatically restarted on one of the 50 machines we had available).

In the end we finished all 400(!) renderings with very good quality.

So this is not an urgent issue, but we wanted to report it. Maybe you have some rule of thumb for estimating the memory usage with the -n option when the usage of a single process is known?

best

Jan

top - 19:16:33 up 57 days, 6:55, 24 users, load average: 88.11, 81.49, 70.31
Tasks: 891 total, 51 running, 840 sleeping, 0 stopped, 0 zombie
%Cpu(s): 20.7 us, 0.8 sy, 72.3 ni, 1.2 id, 5.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 13191659+total, 11070876+used, 21207820 free, 42656 buffers
KiB Swap: 13411430+total, 9556232 used, 12455806+free, 23473420 cached Mem

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM TIME+ COMMAND

88260 lipid4 20 0 21.316g 0.021t 2496 R 12.6 16.8 12:45.60 rtrace
88263 lipid4 20 0 41.816g 0.021t 2424 S 0.0 16.8 15:28.52 rtrace
88254 lipid4 20 0 2198728 1.913g 2452 S 0.0 1.5 8:58.91 rtrace
88257 lipid4 20 0 884028 871708 2720 R 99.5 0.7 840:05.96 rtrace
88292 lipid4 20 0 884300 868240 2212 R 99.5 0.7 824:08.50 rtrace
88287 lipid4 20 0 883484 856984 2176 R 22.4 0.6 825:37.93 rtrace
88286 lipid4 20 0 883988 839512 2176 R 32.3 0.6 825:06.91 rtrace
88291 lipid4 20 0 884352 806224 2192 D 7.1 0.6 821:36.40 rtrace
88289 lipid4 20 0 884280 796720 2160 D 9.8 0.6 817:12.97 rtrace
88288 lipid4 20 0 883532 783388 2160 D 1.1 0.6 816:19.17 rtrace

lipid4 88262 1.1 0.0 7548 1964 ? S 03:38 10:52 rcalc -f /home/lipid4/finalrun/files/view360stereo.cal -e XD:12960;YD:12960;X:-2.69452;Y:-31.0606;Z:2.098;IPD:0.06;EX:0;EZ:0
lipid4 88263 1.6 16.7 43846856 22123884 ? S 03:38 15:28 rtrace -w -n 3 -dj 0.02 -ds 0.05 -dt .05 -dc .5 -dp 256 -st 0.5 -ab 4 -aa 0.02 -ar 32 -ad 50000 -as 25000 -lr 4 -lw 0.000003 -af /home/lipid4/finalrun/p5_social_overcast_ntnu_largewin_simu/p5_social_overcast_ntnu_largewin_simu.af -x 12960 -y 12960 -fac /home/lipid4/finalrun/p5_social_overcast_ntnu_largewin_simu/p5_social_overcast_ntnu_largewin_simu.oct
lipid4 88286 87.9 0.6 883988 861640 ? R 03:38 825:20 rtrace -w -n 2 -dj 0.02 -ds 0.05 -dt .05 -dc .5 -dp 256 -st 0.5 -ab 4 -aa 0.02 -ar 32 -ad 50000 -as 25000 -lr 4 -lw 0.000003 -af /home/lipid4/finalrun/p5_social_overcast_ntnu_largewin_simu/p5_social_overcast_ntnu_largewin_simu.af -x 12960 -y 12960 -fac /home/lipid4/finalrun/p5_social_overcast_ntnu_largewin_simu/p5_social_overcast_ntnu_largewin_simu.oct
lipid4 88287 87.9 0.6 883484 863988 ? R 03:38 826:02 rtrace -w -n 2 -dj 0.02 -ds 0.05 -dt .05 -dc .5 -dp 256 -st 0.5 -ab 4 -aa 0.02 -ar 32 -ad 50000 -as 25000 -lr 4 -lw 0.000003 -af /home/lipid4/finalrun/p5_social_overcast_ntnu_largewin_simu/p5_social_overcast_ntnu_largewin_simu.af -x 12960 -y 12960 -fac /home/lipid4/finalrun/p5_social_overcast_ntnu_largewin_simu/p5_social_overcast_ntnu_largewin_simu.oct
lipid4 88288 86.9 0.6 883532 807460 ? D 03:38 816:20 rtrace -w -n 3 -dj 0.02 -ds 0.05 -dt .05 -dc .5 -dp 256 -st 0.5 -ab 4 -aa 0.02 -ar 32 -ad 50000 -as 25000 -lr 4 -lw 0.000003 -af /home/lipid4/finalrun/p5_social_overcast_ntnu_largewin_simu/p5_social_overcast_ntnu_largewin_simu.af -x 12960 -y 12960 -fac /home/lipid4/finalrun/p5_social_overcast_ntnu_largewin_simu/p5_social_overcast_ntnu_largewin_simu.oct
lipid4 88289 87.0 0.6 884280 815772 ? D 03:38 817:18 rtrace -w -n 3 -dj 0.02 -ds 0.05 -dt .05 -dc .5 -dp 256 -st 0.5 -ab 4 -aa 0.02 -ar 32 -ad 50000 -as 25000 -lr 4 -lw 0.000003 -af /home/lipid4/finalrun/p5_social_overcast_ntnu_largewin_simu/p5_social_overcast_ntnu_largewin_simu.af -x 12960 -y 12960 -fac /home/lipid4/finalrun/p5_social_overcast_ntnu_largewin_simu/p5_social_overcast_ntnu_largewin_simu.oct
lipid4 88290 88.5 0.5 884308 784652 ? R 03:38 831:32 rtrace -w -n 3 -dj 0.02 -ds 0.05 -dt .05 -dc .5 -dp 256 -st 0.5 -ab 4 -aa 0.02 -ar 32 -ad 50000 -as 25000 -lr 4 -lw 0.000003 -af /home/lipid4/finalrun/p5_social_overcast_ntnu_largewin_simu/p5_social_overcast_ntnu_largewin_simu.af -x 12960 -y 12960 -fac /home/lipid4/finalrun/p5_social_overcast_ntnu_largewin_simu/p5_social_overcast_ntnu_largewin_simu.oct
lipid4 88291 87.5 0.6 884352 822328 ? D 03:38 821:40 rtrace -w -n 2 -dj 0.02 -ds 0.05 -dt .05 -dc .5 -dp 256 -st 0.5 -ab 4 -aa 0.02 -ar 32 -ad 50000 -as 25000 -lr 4 -lw 0.000003 -af /home/lipid4/finalrun/p5_social_overcast_ntnu_largewin_simu/p5_social_overcast_ntnu_largewin_simu.af -x 12960 -y 12960 -fac /home/lipid4/finalrun/p5_social_overcast_ntnu_largewin_simu/p5_social_overcast_ntnu_largewin_simu.oct
lipid4 88292 87.8 0.6 884300 869016 ? R 03:38 824:47 rtrace -w -n 2 -dj 0.02 -ds 0.05 -dt .05 -dc .5 -dp 256 -st 0.5 -ab 4 -aa 0.02 -ar 32 -ad 50000 -as 25000 -lr 4 -lw 0.000003 -af /home/lipid4/finalrun/p5_social_overcast_ntnu_largewin_simu/p5_social_overcast_ntnu_largewin_simu.af -x 12960 -y 12960 -fac /home/lipid4/finalrun/p5_social_overcast_ntnu_largewin_simu/p5_social_overcast_ntnu_largewin_simu.oct

···

--
Dr.-Ing. Jan Wienold
Ecole Polytechnique Fédérale de Lausanne (EPFL)
EPFL ENAC IA LIPID

http://people.epfl.ch/jan.wienold
LE 1 111 (Office)
Phone +41 21 69 30849

Hi Jan,

This could be an unexpected "hang" condition with one of the rtrace processes, where a single ray evaluation is blocked waiting for access to the ambient file, while the other processes continue computing away, filling up the queue with results after the blocked one. I could see this becoming a runaway memory condition, but I don't know why a process would be blocked. NFS file locking is used on the ambient file, and this has been known to fail on some Linux builds, but I haven't seen it fail by refusing to unlock. (The problem in the past has been unlocking when it shouldn't.)

If you can monitor your processes, watch for when the parent becomes large, then stop all the child processes (kill -stop pid1 pid2 ...) and restart them one by one (kill -cont pidN). If the parent process starts to shrink after that, or at least doesn't continue to grow, then this would support my hypothesis.

The other thing to look for is a child process with 0% CPU time. If none of the child processes are hung, then I'm not sure why memory would be growing in the parent.
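As a rough sketch of that check (88263 is the parent PID from your top listing; the child PID in the last line is a placeholder):

    pgrep -P 88263                            # list the child PIDs of the big parent (watch their %CPU in top)
    kill -STOP $(pgrep -P 88263)              # freeze all of the children
    watch -n 30 'ps -o pid,vsz,rss -p 88263'  # does the parent stop growing, or even shrink?
    kill -CONT <child-pid>                    # then resume the children one at a time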

There's no sense trying to fix such an unusual problem until we have a firmer idea of the cause.

Cheers,
-Greg

Hi Greg,

Thanks for the explanations - now it is much clearer to me, especially why it does not happen after a restart (I guess the locking "situation" is different then). We always use a local file system only, so NFS should not be involved. Some of the machines that failed were virtual machines, some were nodes of a cluster. The screenshot I showed was from our big server, which has 128 GB of RAM, so the growth didn't cause a crash there.
Next time I will monitor this more closely and will try stopping the child processes, as you suggested, to get more insight. At the moment we have finished all the renderings (and they look great in the Oculus), but I'm sure more will come in a few weeks.
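One simple way to do that (just a sketch; the log file name is arbitrary) would be to record the memory of all rtrace processes periodically, e.g.

    while sleep 60; do date; ps -o pid,ppid,vsz,rss,pcpu,etime,comm -C rtrace; done >> rtrace-mem.log

so the moment when the mother process starts to grow is captured in the log.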

Thanks for the help!

best
Jan

--
Dr.-Ing. Jan Wienold
Ecole Polytechnique Fédérale de Lausanne (EPFL)
EPFL ENAC IA LIPID

http://people.epfl.ch/jan.wienold
LE 1 111 (Office)
Phone +41 21 69 30849

Hi Greg, Jan,

I just observed a similar problem with rcontrib. I am running a chain of vwrays, rtrace, awk, and rfluxmtx to calculate daylight coefficients in an image region (rtrace returns view origin, direction, and modifier, and awk filters the rays so that they are passed into rfluxmtx only if a defined modifier is hit). In general this works pretty well, even with 38 processes in parallel, but I just had one rcontrib process stuck at 100% CPU (no memory effects, though). Issuing a kill -stop PID; kill -cont PID sequence on the rcontrib process made it immediately complete the task. The ambient file can be excluded as a cause here, since rcontrib does not use the ambient cache. This is all on non-networked filesystems, Ubuntu Linux.
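A simplified sketch of the chain (file names, the view, and the modifier name "win_mat" are placeholders; the rfluxmtx options are abbreviated, and receivers.rad carries the usual #@rfluxmtx control comments):

    vwrays -vf view.vf -x 1024 -y 1024 \
        | rtrace -h -oodm scene.oct \
        | awk '$7 == "win_mat" { print $1,$2,$3,$4,$5,$6; next } { print $1,$2,$3,0,0,0 }' \
        | rfluxmtx -v -fac -x 1024 -y 1024 -c 32 - receivers.rad scene.rad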

Cheers, Lars.

Hi,

a quick follow-up just to clarify - I cannot prove that the stop / cont signals caused the completion of the task; it may have been coincidence... Right now I have a never-ending rcontrib process again, and stop / cont does not help this time.

Cheers,
Lars.

Hi Lars,

Did you check to make sure that none of your surfaces has 100% or greater reflection? This can throw ray processing into a loop.

If you compile with "-g" in place of "-O" and terminate the stuck process with a "kill -QUIT" signal, you should be able to get a backtrace to find out where the process was hung.
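For reference, that would look something like this (the PID is a placeholder; adjust the paths for wherever the -g binary and the core file end up on your system):

    ulimit -c unlimited                   # allow a core dump to be written
    kill -QUIT <pid-of-stuck-rcontrib>    # SIGQUIT terminates the process and dumps core
    gdb rcontrib core                     # load the binary and the core file
    (gdb) bt                              # print the backtrace of the hung call stack

Attaching to the live process with "gdb -p <pid>" and typing "bt" gives the same information without waiting for a core file.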

Cheers,
-Greg

P.S. I did introduce a change to the ray queuing code used by both rtrace and rcontrib that should prevent runaway memory growth, but it won't prevent an infinite loop. Those are the worst.

Hi Greg,

Unfortunately, the currently running process was started from a build without the -g switch. I will recompile and test to try to get the backtrace. I am pretty sure that there is no >100% reflection.

The one thing I suspected to be the culprit is how I mask the rendering. Does rfluxmtx properly digest zero direction vectors, as rtrace does? I have observed that the rcontrib process gets stuck once the last visible pixel has been rendered. The remaining part is all "out of view", i.e. a 0 0 0 direction vector. So it might be that rcontrib is just busy computing the zero-length rays (most of my image is masked) but makes no visible progress. I expected these rays to result in little load, assuming that oversampling would not apply to zero-length vectors - and I have a pretty high "oversampling" set with the -c N parameter. Is it possible that oversampling (accumulating) collides with my use of the "dummy rays" to mask the image?

Cheers, Lars.

Hi Lars,

If you're running rfluxmtx in pass-through mode, then it's rcontrib that is actually handling the zero direction vectors. It would help me to know exactly the parameters being used in that case, as the logic in rcontrib is pretty complicated, especially when it comes to multiprocessing. What rcontrib command is reported by rfluxmtx using the '-v' option?

Under some circumstances, the zero rays are interpreted as "flush requests", which can slow things down a bit. There shouldn't be any infinite loops, however, and I don't think flushing happens if you specify both -x and -y > 0 to rcontrib.

Oversampling should work fine with dummy rays. In any case, you should get a result for each N input rays, even if their directions are 0 0 0. The results will just be zero for those records. (I assume you know that to have gotten this far.)

Cheers,
-Greg

Hi Greg,

this is how the rcontrib process was started by rfluxmtx (grepped from ps ax):

rcontrib -fo+ -n 38 -ab 2 -ad 128 -lw .008 -ss 16 -st .01 -o /tmp/Cellular_fisheye_celWg01XVBP_d_02_sys_Klems_celWg02XVBP_r_02_sys_Klems_celWg03XVBP_r_02_sys_Klems_TUR_Izmir.172180_IWEC.glazingdcs/%04d.hdr -x 1024 -y 1024 -ld- -fac -c 32 -bn 1 -b if(-Dx*0-Dy*0-Dz*1,0,-1) -m ground_glow -f reinhartb.cal -p MF=1,rNx=0,rNy=0,rNz=-1,Ux=0,Uy=1,Uz=0,RHS=+1 -bn Nrbins -b rbin -m sky_glow !oconv -f offices/cellularOffice/Cellular.rad offices/cellularOffice/Cellular_wg01XVBPd02o_wg02XVBPr02o_wg03XVBPr02o.rad uniformSky.rad

Maybe it is just the constant flushing (after every pixel...), and by coincidence I had managed to issue the kill -stop; kill -cont right before the process was done. Is there a better way to get just parts of a view rendered, if the zero rays affect performance that drastically?

Cheers, Lars.

Hi Lars,

Thanks for sending this. Your delays are not caused by ray flushing. Your settings for -x and -y override the flushing behavior, so performance should not be affected by your zero rays. This does leave me puzzled as to the cause of your delays, however. You say the rcontrib process is at 100%? If that's the case, then kill -QUIT on a binary compiled with "-g" should give us a clue where it's getting stuck. If the process is at 0%, then a backtrace from kill -QUIT may or may not be helpful.

Cheers,
-Greg
