ranimate, recovering from broken rpicts?

Hi,

it is the first time I have run into trouble with ranimate... I use it not to render animations, but to distribute single-picture renderings over an openMosix cluster (I simply use three host lines pointing to localhost). Now my rpicts died from lack of memory (they all tried to migrate to one node at once), and I restarted ranimate in the hope that it would continue rendering (the pictures have been rendering for almost a week so far, and I do not want to start from zero again). I was surprised by the following:

- When rpict failed, the frames were still filtered by pfilt, so I now have frame001-003.pic in the output directory.
- Still, ranimate found that something went wrong, as it started with -ro. I hope it will not get confused by the fact that there is already a target file frame001.pic.
- Strangely enough, ranimate recovered, but it started just one process, though there are three broken (unfinished) frames and more waiting in the viewfile. Will it try to recover the three frames one by one instead of starting processes on all nodes now? That will take a very long time to complete, and my machines are getting lazy :wink:
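
For reference, a control file for such a setup might look roughly like this (variable names per the ranimate man page; the values and file names here are only illustrative, not my actual file):

```
# stills.ran -- sketch of a three-localhost ranimate setup
DIRECTORY= stills
OCTREE= scene_illum.oct
VIEWFILE= views.vf
START= 1
END= 10
host= localhost
host= localhost
host= localhost
```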

So, many questions from a ranimate-beginner (and mis-user, but I don't want to imagine how much time it will take to render an animation of that scene...)

CU Lars.

Hi Lars,

I am not sure that I can answer everything here, but I will offer what I can. I believe that the recovery mode operates as one process, not multiple. It would be too difficult to keep track of what is being recovered if there were multiple processes; with one process, it just steps through and picks up anything that is unfinished. I believe that it is checking for .unf frames, not filtered .pic frames.

Hope this helps a bit.

-Jack

------------------------------------------------------------------------

_______________________________________________
Radiance-general mailing list
[email protected]
http://www.radiance-online.org/mailman/listinfo/radiance-general
  
--
# Jack de Valpine
# president
#
# visarc incorporated
# http://www.visarc.com
#
# channeling technology for superior design and construction

Hi Lars,

Jack is correct. Ranimate recovers frames using the following method:

1) Checks "STATUS" file to see where it left off with filtered frames.
2) Starts batch of filtering processes (in parallel according to your ranimate input file).
3) For each filtering process that fails (presumably due to unfinished rpict output), ranimate starts serial rpict recovery processes, one after the other, on the local node so it can be sure of the new rpict exit status.
4) If these rpict runs fail, or pfilt fails afterwards, ranimate gives it up as a bad job.

In your case, it might be better to manually start rpict -ro on each of the failed frames on separate nodes in your cluster, so the processes are not run in serial.
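
A small script along these lines could launch the restarts in parallel (a sketch only; the option file and octree names are taken from Lars's example command, and the echo is there so you can inspect the commands first, e.g. piping the output through sh or prefixing each with mosrun or runon):

```shell
#!/bin/sh
# Build the recovery command for one unfinished frame; option-file and
# octree names are taken from Lars's example and may differ for you.
recover_cmd() {
    echo "nohup rpict @stills/render.opt -w0 -ro $1 scene_illum.oct &"
}

# Print one restart command per broken frame; pipe through sh (or wrap
# each in mosrun/runon) to actually launch them in parallel.
for unf in stills/frame001.unf stills/frame002.unf stills/frame003.unf
do
    recover_cmd "$unf"
done
```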

Using ranimate rather than rpiece to break up the rendering of large images is a clever idea I had not heard before. Rpiece has the same problem as ranimate with recovering aborted processes, forcing it to go through and find and redo the pieces one by one in serial fashion.

-Greg

From: Jack de Valpine <[email protected]>
Date: February 18, 2006 3:59:53 PM PST


Hello,

In your case, it might be better to manually start rpict -ro on each
of the failed frames on separate nodes in your cluster, so the
processes are not run in serial.

I agree with Greg, and I think you can launch rpict -ro directly
on your "master" node and wait for automatic migration,
or use the "runon" or "migrate" scripts to move the jobs to your
preferred nodes, if necessary.

In my experience, for separate single-frame renderings with Radiance and openMosix, I launch the rendering jobs from the same machine with shell scripts or rad -N.

I use a sequence of rpiece commands like this:
mosrun -t 1 -d 700 rpiece @args &
mosrun -t 1 -d 700 rpiece @args &
...

to distribute a large image rendering over my cluster.
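
The @args file in those commands might contain something like the following (a sketch; the view file, resolution, grid split, and output names are invented, and the -X/-Y/-F flags are as I recall them from the rpiece man page):

```shell
# Write a hypothetical rpiece argument file; rpiece reads these lines
# as if they were given on the command line.
cat > args <<'EOF'
-vf scene.vf
-x 4096 -y 4096
-X 4 -Y 4
-F sync
-o frame.unf
scene_illum.oct
EOF
```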

HTH,


--
Francesco

Hi!

I agree with Greg, and I think you can launch rpict -ro directly
on your "master" node and wait for automatic migration,
or use the "runon" or "migrate" scripts to move the jobs to your
preferred nodes, if necessary.

I usually start all jobs on my local node and let them get migrated. This will usually take some minutes, but as the rendering time for a picture is currently better measured in weeks than days, the startup time is not important as long as the nodes do not run out of memory. Also, I think I should write a small how-to, as this way of distributing renderings works really nicely (as long as the network is stable; otherwise the ssh, or formerly rsh, approach is more fault tolerant).

I started the rpict processes just the same way ranimate would. As far as I know, rpict -ro will find out which view (from the viewfile) to use, as this should be contained in the image header, right? So the command is

nohup /opt/openmosix/bin/mosrun -c /opt/radiance/bin/rpict @stills/render.opt -w0 -ro stills/frame003.unf scene_illum.oct &

for the third frame. In fact, it is amazing again and again how powerful these little Radiance tools are, e.g. that I can recover this easily...

One other question: did anyone use OpenSSI with Radiance? It should even allow using rpict with shared memory (-PP), but I could not install it so far, because it likes to live in its own network, and my machines have to integrate into an existing network.

CU Lars.

Lars O. Grobe wrote:

I usually start all jobs on my local node and let them get migrated. This will usually take some minutes, but as the rendering time for a picture is currently better measured in weeks than days, the startup time is not important as long as the nodes do not run out of memory. Also, I think I should write a small how-to, as this way of distributing renderings works really nicely (as long as the network is stable; otherwise the ssh, or formerly rsh, approach is more fault tolerant).

Yes Lars, I was fascinated to read of your use of ranimate, and would love to hear more about the process. Francesco, you had posted a lengthy analysis of your use of openmosix a while back, but I thought your conclusion was that the performance was not worth the hassle. I just re-read it and I see that my recollection is a little off. It seems that if you have enough (fast) machines to dedicate to the effort, you can set up a really killer rendering cluster. I will have to delve into this a bit more. Lars & Francesco, I'd love to hear more about your process(es) for doing distributed renderings. I recently started using rad -N on multiprocessor machines at work, but of course now I want more. =)

Hi Lars,

In brief again: I did some preliminary experimentation with OpenSSI a couple of years ago. I think that it shows a lot of promise and has a very supportive user group. I believe a couple of the developers are from HP and are thus really interested in developing a robust and stable platform. They have integrated some of HP's clustering technology as well as tools from other systems, and I believe they have utilized some elements of the openMosix load-balancing algorithm.

As you know, I experimented with openMosix as well, prior to learning about OpenSSI. At the time, my experiments led me to recognize a problem with openMosix when jobs requiring large allocations of memory start up (e.g., multiple jobs get started on a given node; the jobs need to load into memory prior to getting migrated, but the memory requirement exceeds that of the local node, and thus swapping occurs). So my instinct, if I were to use openMosix, would be to forgo the automated migration mechanism and develop a simple scheduler that would send jobs out to specified nodes.
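
Such a "simple scheduler" could be as little as a round-robin assignment of frames to a fixed node list. This sketch only prints the assignments; the node names are placeholders, and the actual launch would wrap each job in mosrun, runon, or ssh:

```shell
#!/bin/sh
# Round-robin assignment of render jobs to explicit nodes, bypassing
# automatic migration. Node names are placeholders.
nodes="node1 node2 node3"
i=0

assign_node() {
    # Pick the (i mod N)-th entry from $nodes.
    set -- $nodes
    shift $(( i % $# ))
    echo "$1"
}

for frame in 1 2 3 4 5 6
do
    # Replace this echo with the real launcher, e.g. runon + rpict.
    echo "frame $frame -> $(assign_node)"
    i=$(( i + 1 ))
done
```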

If I decide to move forward with a clustering solution, my instinct at this time would be to go with OpenSSI, from the standpoint of stability, robustness, shared filesystems, and the development team. But this is just my bias, without anything well documented at this time to support it. There was an excellent paper written on single system images comparing openMosix, OpenSSI, and Kerrighed:

    http://www.irisa.fr/paris/Biblio/Papers/Lottiaux/LotBoiGalValMor05CCGrid.pdf

The latter is a clustering solution out of a French research group, though I am not sure if it is available in any kind of stable release (I have not checked).

I am happy to see that you and Francesco at least have made some real efforts to use OpenMosix in a production setting. It would be great to hear about some of your experiences thus far. I think that the real opportunity of these clustering systems is as follows:

   1. single process space across nodes
   2. shared filesystems that do locking/caching correctly (i.e., more stable than NFS)

There are many others, but these are the main features that come to mind.

Best,

-Jack de Valpine
