Why does render process stall?


#1

Good morning all! I’m experiencing an issue where a render will not finish. Here’s the summary:

  • I am rendering with rad with an ambient cache, using -N 12
  • all tiles finish rendering except for one tile
  • if I use kill -CONT pid, I see the rays increasing, but the percentage complete stays stuck at, say, 7% and does not change even after another 12 hours
  • that pid is using 100% CPU as expected

I have tried killing the rad process, leaving the unf file and amb file, and rerunning rad. This goes back up to 7% faster but does not complete.

(Note a potential bug: if I start a render with -N 12 it uses 12 cores, but if I kill it and rerun it, it only uses 1 core. Why is this?)

I have tried killing the rad process and deleting the unf file. It’ll re-render all tiles but again get stuck on that tile.

I have tried rendering that specific tile in rvu with frame. The tile renders successfully without issue.

I have not tried rerunning rad for that final tile with the rendering settings changed to something lower quality.

How can I debug this?


#2

Sounds like a frustrating problem. Which OS are you on, and which version of Radiance? (I.e., what does “rpict -version” return?)

Re-running rad only starts one process even with -N 12 because there is only one tile left to render. In fact, rad will start 12 rpiece processes, and all but one will exit with nothing to do. Only the first process, whose job is specifically rerendering any unfinished tiles, continues grinding away.

You should be able to look at the unfinished output by running:

pfilt -x /3 -y /3 blah.unf > blah_unf.hdr

and viewing the HDR. If you can figure out what is in the single black tile, it might help.

The other thing I would suggest is removing the ambient file and trying to render the unfinished tile again. Maybe it got corrupted and is causing an infinite loop, though it seems unlikely as I’m pretty thorough in checking ambient values taken from these files. The only other time I’ve seen infinite loops is with -lr < 0 and greater-than-unity reflectance values.

Some systems (like Mac OS X) allow you to sample active processes, and this can give a clue as to where it’s getting stuck. If repeated samples show the same suspicious routine(s) taking up inordinate amounts of time, that’s a big clue. Of course, you would have to know which routines are suspicious, so best to send that info to me if you can get it.
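
On Linux, a rough equivalent of sampling is to take several backtraces with gdb and count which routine is innermost each time. The following is only a sketch, assuming gdb is installed, you are allowed to attach to the process, and the Radiance binaries carry enough symbols for function names to show up. The helper names sample_process and tally_top_frames and the PID in the usage comment are made up for illustration.

```shell
# Tally the innermost (#0) frame across several gdb backtraces read from stdin.
# Handles both "#0  0xADDR in name (...)" and "#0  name (...)" line forms.
tally_top_frames() {
    awk '$1 == "#0" {
             name = ($3 == "in") ? $4 : $2
             count[name]++
         }
         END { for (n in count) print count[n], n }' | sort -k1,1nr -k2,2
}

# Take N backtraces of process PID, one every SECS seconds.
sample_process() {
    pid=$1; n=$2; secs=$3
    i=0
    while [ "$i" -lt "$n" ]; do
        gdb -p "$pid" -batch -ex bt 2>/dev/null
        sleep "$secs"
        i=$((i + 1))
    done
}

# Usage (hypothetical PID of the stuck process):
#   sample_process 12345 10 2 | tally_top_frames
```

A routine that tops the tally in nearly every sample is the one worth reporting.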


#3

Thanks for your feedback, Greg. You are right that only one core is used because there is only one tile left. I thought I had tested for this scenario, but I was mistaken!

I managed to solve the stalled render by killing it when it stalled on the last tile, then editing the .rif file so that VARIABILITY=M instead of the VARIABILITY=H setting I had used previously. Maybe that offers some clues as to why it stalled.

In addition:

  1. I had checked the unf file to see what is in the single black tile. It is a keyboard and cloth. I have successfully rendered this from another angle as shown in my other thread (admittedly, with VARIABILITY=M), so I don’t know why it is problematic this time.

     [image: 2018-10-02-193338_192x141_scrot]

  2. I don’t think it is a mesh issue, as I can render the tile interactively with frame in rvu. Only in rpict do I get this black tile. I suspect it is due to the higher quality settings that rpict applies.

  3. rpict -version says RADIANCE 5.1.0 NREL/googs 2017.08.21 (based on RADIANCE 5.1 Official Release by G. Ward). This is run on Gentoo Linux.

  4. Interestingly, after removing the ambient file and trying to re-render it again, it then stalled at 29% instead of 7%.

The issue is now resolved for me, but Greg, if you’d like me to continue investigating in case it is a symptom of a deeper issue, I’d be happy to :)


#4

Well, I wouldn’t really call it “resolved,” since the latest Radiance HEAD is stalling on your scene for some unknown reason. If you send me your input files and the command to re-run, I can give it a go on my system to see if the problem is reproducible on another build.


#5

Hi Greg, thanks very much for your help! I have placed all the files here: https://gitlab.com/dionmoult/living-room-radiance-demo

I apologise that the repo is a little large - I have tried to cull a bunch of unnecessary non-source files but the reality is that 3D files take space. Everything is CC-BY-SA, but if you have another license preference please let me know.

Everything is controlled using the Makefile, so just run make render and that should do the trick. But first, change scene.rif so that VARIABILITY=H, because that is the setting I was using when the render would stall.

Hopefully you can reproduce the error. Also, I will try to organise the files a bit more neatly, as I am finding my current organisation does not scale well. If you spot any other ways I can improve things, please let me know, as I’m still a bit new to all this :)


#6

Thanks, Dion.

Unfortunately, I’m a bit of a “git” when it comes to using git. What command will download everything I need in one go? The web interface only seems to give me things one at a time…


#7

https://gitlab.com/dionmoult/living-room-radiance-demo/-/archive/master/living-room-radiance-demo-master.zip


#8

Ah, silly me. Now I see the pull-down that gets to a download thingy. Every interface is different!


#9

I seem to be missing the following file:

rpict: system - cannot find data file “textures/white_concrete_stucco_normal_x.dat”: No such file or directory


#10

Despite the missing data file, I think I’ve identified the problem. The cloth is laid on the desk, leaving small gaps in between that sample rays get into. These then bounce around like crazy in the ambient (interreflection) calculation, causing the ambient file to grow unnecessarily. You can either modify the -ar parameter to something smaller (which is what VARIABILITY=M does among other things), or exclude the cloth material from your ambient calculation using the -ae option.

The good news is, it isn’t an infinite loop. It just takes a really, really long time computing something you don’t care about. (Which isn’t much better, I suppose.)
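
To make the exclusion option concrete, here is a minimal shell sketch that turns a list of modifier names into repeated -ae options for rpict. The helper ae_flags and the modifier and file names in the usage comment are hypothetical; only the -ae option itself comes from the post above.

```shell
# Print one "-ae NAME" pair per line for every modifier name given.
# ae_flags is a hypothetical helper, not part of Radiance.
ae_flags() {
    for mod in "$@"; do
        printf '%s\n' "-ae $mod"
    done
}

# Usage (hypothetical modifier and file names); the unquoted $(...) is
# deliberate so the shell splits the output into separate arguments:
#   rpict $(ae_flags cloth_mat) -vf view.vf scene.oct > scene.hdr
```

When rendering through rad, extra rpict options like these would normally go into the render variable of the .rif file rather than onto a hand-written command line.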


#11

Thanks Greg, I think you’re right that this is the problem. The data file is unused. It is referenced in scene.mat in the definition of normal_map, but normal_map is never used anywhere. It is a remnant of me testing the ability of normal maps to displace the Z normal of meshes. I didn’t upload the dat file to the repo because I knew it was unused and it would add heavily to the file size. I have removed the reference to it in the repo.

However, when I tested I found that I could not see an impact by changing -ar. In fact, I checked the settings used with VARIABILITY=M and VARIABILITY=H and found that the setting for -ar is the same (128) for both.

The two ambient calculation settings which differ between M and H are -ad and -as. I found that decreasing -ad helps significantly. I did not test -as. I wonder why this is, as from reading the man page it does seem as though -ar should be more helpful.
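
One way to repeat this comparison is to have rad print the commands it would run without executing them (-n) and pull out just the ambient options. This is a sketch under assumptions: that rad prints the rendering options on the command line rather than writing them to an options file, and that a picture still needs rendering so a render command is printed at all. The helper amb_opts is hypothetical.

```shell
# Read command lines on stdin and print each -aa/-ab/-ad/-ar/-as option
# with its value, one per line, sorted and de-duplicated.
amb_opts() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i ~ /^-a[abdrs]$/)
                print $i, $(i + 1)
    }' | sort -u
}

# Usage: override VARIABILITY on the rad command line and diff the results.
#   rad -n scene.rif VARIABILITY=Medium | amb_opts > amb_M.txt
#   rad -n scene.rif VARIABILITY=High   | amb_opts > amb_H.txt
#   diff amb_M.txt amb_H.txt
```

Any line that diff reports is an ambient option that changes between the two settings.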


#12

OK, this should be my final post on this topic… The problem was related to an ambient calculation run amok, but the cloth wasn’t the culprit; it was the keyboard. If you look at the first attached picture, I’ve run:

genambpos -l 0 -r .01 scene.amb | oconv -i scene.amb - > scene_amb.oct

to show where the final ambient values are getting crowded:


The curtains are one place, but I think you need them there to get the right kind of shading. The bush outside the window seems silly, and you’ll notice I’ve already fixed the bush on the right by excluding its modifier from the ambient calcs. However, neither of these places is where we’re stalling, and although I’ve excluded the cloth from the ambient calcs, the keyboard is a mess of values.

I ended up with one tile stuck as you did, and had to restart it, this time excluding the keyboard material rather than the cloth. You can see in this tile how it affects the appearance of both:

As you can see, the keyboard doesn’t suffer much from losing the ambient calculation, and the cloth looks better, so that’s a good trade. The final exclude file I’m using with -aE exclude_mats.txt contains:
pb1
black_plastic
large_bush
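
For anyone reproducing this, a minimal sketch of creating that exclude file from the shell. The three modifier names are the ones listed above; the way the file is hooked into the render (the render= line in the comment) is an assumption about how scene.rif would be set up, not something taken from the repo.

```shell
# Write one modifier name per line into the exclude file.
printf '%s\n' pb1 black_plastic large_bush > exclude_mats.txt

# Assumed hookup: pass the file to rpict through the "render" variable
# in scene.rif, with a line such as
#   render= -aE exclude_mats.txt
```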

That’s it!


#13

Thanks Greg for the excellent breakdown! I’m sure that’s it!