"Broken pipe" message from rpiece on multi-core Linux system

This problem is back for a sequel, and it would really help my work if I could get it going.

It's been a few months since I last asked about this. Has anyone else experienced this in a Linux environment? Anyone have any ideas what to do about it or how to debug it?

/proc/version reports:
  Linux version 2.6.18-274.18.1.el5 ([email protected]) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-51)) #1 SMP Thu Feb 9 12:45:44 EST 2012

Randolph

···

On 2011-07-08 01:13:01 +0000, Randolph M. Fritz said:

On 2011-07-07 16:54:06 -0700, Greg Ward said:

Hi Randolph,

This shouldn't happen, unless one of the rpict processes died
unexpectedly. Even then, I would expect some other kind of error to be
reported as well.

-Greg

Thanks, Greg. I think that's what happened; in fact, seven of the
eight died in two cases. Weirdly, the third succeeded. If I run it as
a single-processor job, it works. Here's a piece of the log:

rpiece -F bl_blinds_rpsync.txt -PP pfLF5M90 -vtv -vp 60.0 -2.0 66.0 -vd
12.0 0.0 0.0 -vu 0 0 1 -vh 60 -x 1024 -y 1024 -dp 512 -ar 42 -ms 3.6
-ds .3 -dt .1 -dc .5 -dr 1 -ss 1 -st .1 -af bl.amb -aa .1 -ad 1536 -as
392 -av 10 10 10 -lr 8 -lw 1e-4 -ps 6 -pt .08 -o bl_blinds.unf bl.oct

rpict: warning - no output produced

rpict: system - write error in io_process: Broken pipe
rpict: 0 rays, 0.00% after 0.000u 0.000s 0.001r hours on n0065.lr1
rad: error rendering view blinds

--
Randolph M. Fritz

If it is a startup issue as Jack suggests, you might try inserting a few seconds of delay between the spawning of each new rpiece process using "sleep 5" or similar. This allows time for the sync file to be updated without contention between processes. This is what I do in rad with the -N option. I actually wait 10 seconds between each new rpiece process.
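A minimal sketch of that kind of staggered startup outside of rad, with all eight rpiece processes sharing one sync file, ambient file, and output picture (the scene, view, and file names here are hypothetical):

#!/bin/bash
# launch 8 cooperating rpiece processes on one picture, pausing between starts
# so each can register itself in the shared sync file before the next begins
for i in {1..8}; do
    rpiece -F scene_rpsync.txt -X 4 -Y 4 -x 1024 -y 1024 \
           -vf scene.vf -af scene.amb -o scene.unf scene.oct &
    sleep 10    # roughly the delay rad -N inserts between rpiece processes
done
wait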

This isn't to say that I understand the source of your error, which still puzzles me.

-Greg

···

From: Jack de Valpine <[email protected]>
Date: April 9, 2012 1:46:03 PM PDT

Hey Randolph,

I have run into this before. Unfortunately I have had limited success in tracking down the issue and also have not really looked at it for some time. If I recall correctly, a couple of things that I have noticed:
possible problem if a piece finishes before the first set of pieces is parceled out by rpiece - so if 8 pieces are being distributed at startup and piece 2 (for example) finishes before one of pieces 1, 3, 4, 5, 6, 7, 8 has even been processed by rpiece, or while rpiece is still forking off the initial jobs.
Sorry I cannot offer more; I have spent some time in the code on this one, and it is not for the faint of heart, to say the least.
-Jack
--
# Jack de Valpine
# president
#
# visarc incorporated
# http://www.visarc.com
#
# channeling technology for superior design and construction


Thanks Jack, Greg.

Jack, what kernel were you using? Was it also Linux?

Greg, I was using rad, so those delays are already in there, alas. I wonder if there is some subtle difference between the Mac OS Mach kernel and the Linux kernel that's causing the problem, or if it occurs on all platforms, just more frequently in the very fast cluster nodes.

Or, it could be an NFS locking problem, bah.
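If the NFS-locking theory is worth checking, two quick, non-invasive looks on a compute node (standard Linux tools):

rpcinfo -p localhost | grep -E 'nlockmgr|status'   # is the NFS lock manager registered?
grep nfs /proc/mounts                              # watch for a "nolock" mount option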

If I find time, maybe I can dig into it some more. Right now, I may just finesse it by running multiple *different* simulations on the same cluster node.

Randolph


Hi Randolph,

For what it's worth, I don't use rpiece when I render on the cluster. I have a script that takes a view file, a tile number, and the number of rows and columns, and renders the assigned tile (run_render.csh). In the job submit script I distribute these tile rendering tasks to multiple cores on multiple nodes. I can't use the ambient cache with this method, but I typically use rtcontrib, so I wouldn't be able to use it regardless. There is also the problem that some processors sit idle after they've finished their tile while other processes are still running, but I don't worry about it because computing time on Lawrencium is cheap and available.

Snippets from my scripts are below.

Andy

### job_submitt.bsh #####

#!/bin/bash
# specify the queue: lr_debug, lr_batch
#PBS -q lr_batch
#PBS -A ac_rad71t
#PBS -l nodes=16:ppn=8:lr1
#PBS -l walltime=24:00:00
#PBS -m be
#PBS -M [email protected]
#PBS -e run_v4a.err
#PBS -o run_v4a.out

# change to working directory & run the program
cd ~/models/wwr60

for i in {0..127}; do
pbsdsh -n $i $PBS_O_WORKDIR/run_render.csh views/v4a.vf $(printf "%03d" $i) 8 16 &
done

wait

### run_render.csh ######
#! /bin/csh

cd $PBS_O_WORKDIR
set path=($path ~/applications/Radiance/bin/ )

set oxres = 512
set oyres = 512

set view = $argv[1]
set thispiece = $argv[2]
set numcols = $argv[3]
set numrows = $argv[4]
set numpieces = `ev "$numcols * $numrows"`

# per-tile resolution: full image dimensions reported by vwrays -d, divided by the grid
set pxres = `vwrays -vf $view -x $oxres -y $oyres -d | awk '{print int($2/'$numcols'+.5)}'`
set pyres = `vwrays -vf $view -x $oxres -y $oyres -d | awk '{print int($4/'$numrows'+.5)}'`

# view type letter from the -vt option, plus -vs/-vl offsets (in view widths/heights
# measured from the image center) that select this tile's window within the full view
set vtype = `awk '{for(i=1;i<NF;i++) if(match($i,"-vt")==1) split($i,vt,"")} END { print vt[4] }' $view`
set vshift = `ev "$thispiece - $numcols * floor( $thispiece / $numcols) - $numcols / 2 + .5"`
set vlift = `ev "floor( $thispiece / $numcols ) - $numrows / 2 + .5"`

# for perspective (-vt v) views, shrink -vh/-vv to the angle subtended by a single tile
if ($vtype == "v") then
set vhoriz = `awk 'BEGIN{PI=3.14159265} \
      {for(i=1;i<NF;i++) if($i=="-vh") vh=$(i+1)*PI/180 } \
      END{print atan2(sin(vh/2)/'$numcols',cos(vh/2))*180/PI*2}' $view`
set vvert = `awk 'BEGIN{PI=3.14159265} \
      {for(i=1;i<NF;i++) if($i=="-vv") vv=$(i+1)*PI/180 } \
      END{print atan2(sin(vv/2)/'$numrows',cos(vv/2))*180/PI*2}' $view`
endif

vwrays -ff -vf $view -vv $vvert -vh $vhoriz -vs $vshift -vl $vlift -x $pxres -y $pyres \
| rtcontrib -n 1 `vwrays -vf $view -vv $vvert -vh $vhoriz -vs $vshift -vl $vlift -x $pxres -y $pyres -d` \
    -ffc -fo \
    -o binpics/wwr60/${view:t:r}/${view:t:r}_wwr60_%s_%04d_${thispiece}.hdr \
    -f klems_horiz.cal -bn Nkbins \
    -b 'kbin(0,1,0,0,0,1)' -m GlDay -b 'kbin(0,1,0,0,0,1)' -m GlView \
    -w -ab 6 -ad 6000 -lw 1e-7 -ds .07 -dc 1 oct/vmx.oct

···

On Apr 11, 2012, at 5:54 AM, Jack de Valpine wrote:

Hi Randolph,

All I have is Linux. Not sure what kernels at this point. But I have noticed this over multiple kernels and distributions. Although I have not run anything on the most recent kernels.

I know that one thing I did was to disable the fork-and-wait functionality in rpiece that waits for a job to finish. I do not recall, though, whether this was related to this problem, NFS locking, or running on a cluster with job-distribution queuing. Sorry I do not remember more right now.

Just thinking out loud here, but if you are running on a cluster then could network latency also be an issue?

Here is my suspicion/theory, which I have not been able to test. I think that somehow there is a race condition in the way jobs get forked off and status of pieces gets recorded in the syncfile...

For testing/debugging purposes, a few things to look at and compare might be:
  • big scene - slow load time
  • small scene - fast load time
  • "fast" parameters - small image size with lots of divisions
  • "slow" parameters - small image size with lots of divisions
On my cluster, I ended up setting up things so that any initial small image run for building the ambient cache would actually just run as a single rpict process and then large images would get distributed across nodes/cores.

As an aside, perhaps Rob G. has some thoughts on Radiance/Clusters as I think they have a large one also. What is the cluster set up at LBNL? I believe that at one point they were using a provisioning system called Warewulf which has now evolved to Perceus. I have the former setup and have not gotten around to the latter. LBNL may also be using a job queuing system called Slurm which they developed (or maybe that was at LLNL)?

Hopefully this is not leading you off on the wrong track though. Probably would be useful to figure out if the problem is indeed rpiece related or something else entirely.

-Jack
--
# Jack de Valpine
# president
#
# visarc incorporated
# http://www.visarc.com
#
# channeling technology for superior design and construction


I've gathered a little more information Thursday and Friday.

First off, so far I am only using single nodes on our cluster for renderings, and instead taking advantage of multiple nodes to run more renderings at once. Each node in the Lawrencium cluster has two six-core Xeons, and they are very capable.

I ran a rendering on a different model in the same cluster environment, and it worked perfectly. My experience so far leads me to the following conclusion: whatever the problem is, it depends on the model and the simulation parameters. The model where it worked used mkillum; the two where it fails don't. This may or may not be significant.

I think adding mkillum surfaces to the models that failed would be an interesting experiment. Maybe I can find some time to do it next week.
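A rough sketch of that experiment, assuming mkillum's usual pattern of reading the surfaces to convert on standard input against a prepared octree; every file name here is hypothetical:

#!/bin/bash
# turn the window polygons into illum sources, then rebuild the octree
# used by the failing renderings; if the broken-pipe error goes away,
# the mkillum difference matters
oconv materials.rad room.rad blinds.rad windows.rad > bl_direct.oct
mkillum -ab 1 -ad 512 bl_direct.oct < windows.rad > windows_illum.rad
oconv materials.rad room.rad blinds.rad windows_illum.rad > bl.oct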

Randolph

···

On 2012-04-11 17:22:52 +0000, Jack de Valpine said:

Hey Andy,

This jogs my memory a bit. Perhaps this is a different topic at this point, not sure, as it is more about clusters, Radiance, and rpiece. Another problem that I encountered with stock rpiece on my cluster was that tiles/pieces would sometimes get written into the image in the wrong place. My solution, if I remember correctly, was to customize rpiece so that each running instance would write out its pieces to its own image file. These would all then get assembled as a post process. I think the idea behind this was to still take advantage of the functionality that rpiece does offer.
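A rough sketch of that kind of post-process assembly with pcompos, assuming tiles written as tile_<col>_<row>.hdr on a regular grid (names and resolutions are hypothetical):

#!/bin/bash
# paste the per-process tiles back into one picture; pcompos anchors each
# input at the lower-left corner given by the x y pair that follows it
NCOLS=4; NROWS=4; PX=256; PY=256
args=""
for ((r = 0; r < NROWS; r++)); do
    for ((c = 0; c < NCOLS; c++)); do
        args="$args tile_${c}_${r}.hdr $((c * PX)) $((r * PY))"
    done
done
pcompos -x $((NCOLS * PX)) -y $((NROWS * PY)) $args > assembled.hdr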

-Jack
--
# Jack de Valpine
# president
#
# visarc incorporated
# http://www.visarc.com
#
# channeling technology for superior design and construction


Hi @R_Fritz1 and @Andrew_McNeil2. I’m reviving this thread from the dead because it’s happening to me, too, and I’m curious if anyone has found a solution.

The problem comes up when I run rpiece via rad using Torque PBS on a Linux cluster. The first time I run rad with -N 8, I get this error:

rpict: system - write error in io_process: Broken pipe
rpict: 0 rays, 0.00% after 0.000u 0.000s 0.001r hours on xxxxxx (PID 18652)
rad: error rendering view v3

So I guess one of my processes terminated. If I run again without deleting the ambient file, I get the error 7 times. If I set -aa 0, then everything runs fine. I guess this means something is going wrong when the processes try to access the ambient file. However, I’ve never had an issue sharing ambient files across a large number of cores using rtrace in this environment, so I’m not sure why this is different.

If I run the script directly instead of submitting it to PBS, it runs without error but appears to run sequentially rather than in parallel. This must be what is meant when the rad documentation says "The -N option instructs rad to run as many as npr rendering processes in parallel," but I'm not clear on how it actually decides on the number of cores to use.

If I remove -N 8 from the command, everything runs fine on PBS but in series.
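For context, the submission in question amounts to a single-node job script along these lines; the queue, walltime, and rif file name are hypothetical, while rad -N 8 and view v3 come from the messages above:

#!/bin/bash
#PBS -q batch
#PBS -l nodes=1:ppn=8
#PBS -l walltime=12:00:00
#PBS -e rad_v3.err
#PBS -o rad_v3.out

# run rad in the submission directory, driving 8 rpiece processes on one node
cd $PBS_O_WORKDIR
rad -v v3 -N 8 scene.rif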

Any ideas what’s going on here? The goal is to create a production environment to run arbitrary scripts, and the script has worked on other machines, so I’d rather change settings in the environment than edit the script.

Nathaniel

Hi Nathaniel,

I don’t know what Torque PBS does, exactly. I assume it doesn’t start multiple of the same process, does it? You shouldn’t run multiple copies of rad with the same target.

Assuming that’s not the problem, and that you’re just running “rad -N 8” on one machine with 8 cores available, then your error is most likely due to an rpict process dying, though you should get an error saying the reason it died, somewhere. The “write error in io_process” comes from a forked feeder or reader process, not from a rendering process.

The feeder process in this case is only passing view parameters, one per tile, but will die with this message if the rendering process dies. The reader process passes computed pixels back to a child of rpiece, which writes them to the output picture. This process could also conceivably be killed by the system for exceeding the file size limit. Are you rendering something super high-resolution? Did you check on your shell's file size limit?
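A quick way to check those limits from the same shell (or PBS job) that launches rad, using standard bash ulimit flags:

ulimit -f    # maximum file size ("unlimited" is what you want)
ulimit -l    # maximum locked memory
ulimit -a    # everything at once, handy for comparing login shell vs. PBS job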

Sharing ambient files across network file systems can be problematic, especially if the lock manager is misbehaving, causing conflicting writes. It’s hard to know when that is happening, but rpict should tell you the next time on startup if there are bad values in an ambient file.
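Two quick sanity checks on a shared ambient file, reusing the bl.amb name from the log above:

getinfo bl.amb            # header: which program and options wrote the file
lookamb bl.amb | head     # dump the first few ambient records as text;
                          # obvious garbage here points at conflicting NFS writes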

Hi Greg,

Stepping through this:

  • There’s only one instance of rad and no more than eight instances of rpict running.
  • There is no other error message to indicate that an rpict process died. If one is dying, it is either not showing an error message, or the message is not going to stderr.
  • The image is 512x512, so it shouldn't be anywhere close to the file size limit. The unf file (which does get produced) is 4064 kB, and the hdr file produced successfully by the single-threaded run ends up being 625 kB.
  • I can use the ambient file for subsequent regular rpict runs, so it doesn’t seem to be corrupt.

Also, looking back at Randolph's thread: this model also contains lights, though they come from ies2rad rather than mkillum. I'm not sure if that's a factor, but maybe I can try out some other models.

Nathaniel

Hmmm. I’m stumped. You should see more than 8 rpict processes, but only 8 using significant CPU. The others just pass i/o, so you would have to search for them or sort by process name.

It’s almost impossible to debug these things, especially without access to the machine running everything. Does it only happen under PBS? Can you run “rad -N 8” on the machine manually?

Does the process die immediately, or only after some tiles have been computed? Is the output readable at all?

Hi Greg,

You’re right, there are other rpict processes that I hadn’t noticed before because they weren’t taking up CPU time. I’m using pgrep to grab the process PIDs now, and it turn out there are at least 8 instances of rad and 21 instances of rpict. I observe this both when running through PBS and directly through the command line.

Also, I had previously thought that the unf file output was corrupt, but this turns out not to be the case. I now think that the program I’m using to read it has a problem with the unf extension, because when I change it to hdr, I can open the file and it looks correct.

This leads me to think that the process is dying only after it’s finished whatever rendering task it was doing, or perhaps it’s not dead but just failing to close the pipe properly. In any case, rad thinks things have gone badly so it’s not calling pfilt to make the final hdr file, and again, only when running through PBS.

Nathaniel

OK, well it sounds like it has something to do with PBS, but really hard to say what, exactly. Some systems don’t handle named pipes very well, and rpiece relies on them with rpict running in parallel. I have seen this problem on some FreeBSD systems. Specifically, there was a version of FreeBSD that didn’t return an EOF after closing a named pipe, which rpict relies on to know that it’s finished. It was causing a hang for me, which is different than what you’re seeing, so it’s probably something else at fault.

At this point, I’m just searching to figure out what could be different when PBS is running. In theory, it starts up using exactly the same environment that I have when I log in. Searching the web for “broken pipe” and “PBS”, I found a few discussion forums for FDS and other tools that use MPI that indicated that the max locked memory needs to be increased when using PBS, but that didn’t help in my case.

I did find out that the number of cores I use has something to do with it. When I have -N 4 or lower, rad runs without issue, but with -N 5 or higher, I get the broken pipe error. I have 30 cores available, so I’d like to split up the work more if possible.