"Broken pipe" message from rpiece on multi-core Linux system

Hey Randolph,

I have run into this before. Unfortunately, I have had limited success in tracking down the issue, and I have not really looked at it for some time. If I recall correctly, here are a couple of things I have noticed:

  * possible problem if a piece finishes before the first set of pieces
    is parcelled out by rpiece - so if 8 pieces are being distributed at
    startup and piece 2 (for example) finishes before one of the other
    pieces (1, 3, 4, 5, 6, 7, 8) has even been processed by rpiece, or
    while rpiece is still forking off the initial jobs.

Sorry I cannot offer more; I have spent some time in the code on this one, and it is not for the faint of heart, to say the least.

-Jack


--
# Jack de Valpine
# president
#
# visarc incorporated
# http://www.visarc.com
#
# channeling technology for superior design and construction

On 4/9/2012 3:29 PM, Randolph M. Fritz wrote:

This problem is back for a sequel, and it would really help my work if I could get it going.

It's been a few months since I last asked about this. Has anyone else experienced this in a Linux environment? Anyone have any ideas what to do about it or how to debug it?

/proc/version reports:
Linux version 2.6.18-274.18.1.el5 ([email protected]) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-51)) #1 SMP Thu Feb 9 12:45:44 EST 2012

Randolph

On 2011-07-08 01:13:01 +0000, Randolph M. Fritz said:

On 2011-07-07 16:54:06 -0700, Greg Ward said:

Hi Randolph,

This shouldn't happen, unless one of the rpict processes died
unexpectedly. Even then, I would expect some other kind of error to be
reported as well.

-Greg

Thanks, Greg. I think that's what happened; in fact seven of the
eight died in two cases. Weirdly, the third succeeded. If I run it as
a single-processor job, it works. Here's a piece of the log:

rpiece -F bl_blinds_rpsync.txt -PP pfLF5M90 -vtv -vp 60.0 -2.0 66.0 -vd
12.0 0.0 0.0 -vu 0 0 1 -vh 60 -x 1024 -y 1024 -dp 512 -ar 42 -ms 3.6
-ds .3 -dt .1 -dc .5 -dr 1 -ss 1 -st .1 -af bl.amb -aa .1 -ad 1536 -as
392 -av 10 10 10 -lr 8 -lw 1e-4 -ps 6 -pt .08 -o bl_blinds.unf bl.oct

rpict: warning - no output produced

rpict: system - write error in io_process: Broken pipe
rpict: 0 rays, 0.00% after 0.000u 0.000s 0.001r hours on n0065.lr1
rad: error rendering view blinds

On Apr 11, 2012, at 5:54 AM, Jack de Valpine wrote:

Hi Randolph,

All I have is Linux; I am not sure which kernels at this point, but I have noticed this over multiple kernels and distributions, although I have not run anything on the most recent kernels.

I know that one thing I did was to disable the fork-and-wait functionality in rpiece that waits for a job to finish. I do not recall, though, whether this was related to this problem, NFS locking, or running on a cluster with job-distribution queuing. Sorry I do not remember more right now.

Just thinking out loud here, but if you are running on a cluster then could network latency also be an issue?

Here is my suspicion/theory, which I have not been able to test: I think that somehow there is a race condition in the way jobs get forked off and the status of pieces gets recorded in the sync file...
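Just to make the kind of race I have in mind concrete, here is a minimal shell sketch - this is emphatically not rpiece's actual code, just an illustration with made-up file names, using flock for the locking:

### syncfile_race_sketch.sh - illustration only, NOT rpiece source ###
#!/bin/bash
# A shared sync file holds one line per piece: "<n>" (free), "<n> busy", or
# "<n> done". Every read-modify-write of it happens under an exclusive lock;
# if the lock is skipped, or NFS locking misbehaves while the parent is still
# forking the first batch, two workers can claim the same piece or an update
# can be lost.
SYNCFILE=rpsync.txt                 # stands in for the rpiece -F file
LOCKFILE=$SYNCFILE.lock

update_sync() {                     # run "$@" with the sync file locked
    (
        flock -x 9                  # drop this line and the race appears
        "$@"
    ) 9>"$LOCKFILE"
}

claim_piece() {                     # print the first free piece, mark it busy
    local p
    p=$(awk 'NF==1 {print $1; exit}' "$SYNCFILE")
    [ -n "$p" ] && sed -i "s/^$p\$/$p busy/" "$SYNCFILE"
    echo "$p"
}

finish_piece() {                    # record a finished piece
    sed -i "s/^$1 busy\$/$1 done/" "$SYNCFILE"
}

seq 1 8 > "$SYNCFILE"               # eight pieces, as in the example above
for w in 1 2 3 4; do                # four concurrent workers
    (
        while :; do
            p=$(update_sync claim_piece)
            [ -n "$p" ] || break
            sleep $((RANDOM % 2))   # stand-in for the real rendering work
            update_sync finish_piece "$p"
        done
    ) &
done
wait
cat "$SYNCFILE"                     # with locking intact, every piece ends up "done" exactly once

Take out the flock call, or let NFS locking misbehave, and two workers can grab the same piece or clobber each other's updates - that is the sort of collision I suspect.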

For testing/debugging purposes, a few things to look at and compare might be:

  * big scene - slow load time
  * small scene - fast load time
  * "fast" parameters - small image size with lots of divisions
  * "slow" parameters - small image size with lots of divisions

On my cluster, I ended up setting things up so that any initial small-image run for building the ambient cache would just run as a single rpict process, and then large images would get distributed across nodes/cores.
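Roughly, the two stages look like this - only a sketch, with placeholder file names and parameters; the commands themselves are ordinary rpict/rad usage:

### two_stage_sketch.sh - placeholder names, not a recommendation ###
#!/bin/sh
# Stage 1: one small single-process rpict run, just to populate the shared
# ambient cache.
rpict -vf view.vf -x 128 -y 128 \
      -af scene.amb -aa .1 -ad 1536 -as 392 -av 10 10 10 \
      scene.oct > overture.hdr

# Stage 2: the full-size rendering, distributed over 8 rpiece processes by
# rad (-N 8), all reusing the now-warm ambient file named by AMBFILE in
# scene.rif. rad staggers the rpiece startups itself.
rad -N 8 scene.rif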

As an aside, perhaps Rob G. has some thoughts on Radiance and clusters, as I think they have a large one also. What is the cluster setup at LBNL? I believe that at one point they were using a provisioning system called Warewulf, which has since evolved into Perceus; I have the former set up and have not gotten around to the latter. LBNL may also be using a job queuing system called Slurm, which they developed (or maybe that was at LLNL)?

Hopefully this is not leading you off on the wrong track, though. It would probably be useful to figure out whether the problem is indeed rpiece-related or something else entirely.

-Jack


--
# Jack de Valpine
# president
#
# visarc incorporated
# http://www.visarc.com
#
# channeling technology for superior design and construction

On 4/11/2012 1:27 AM, Randolph M. Fritz wrote:

Thanks Jack, Greg.

Jack, what kernel were you using? Was it also Linux?

Greg, I was using rad, so those delays are already in there, alas. I wonder if there is some subtle difference between the Mac OS Mach kernel and the Linux kernel that's causing the problem, or if it occurs on all platforms, just more frequently on the very fast cluster nodes.

Or, it could be an NFS locking problem, bah.

If I find time, maybe I can dig into it some more. Right now, I may just finesse it by running multiple *different* simulations on the same cluster node.
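Something like this, I suppose (just a sketch, with placeholder .rif names):

# Fill one multi-core node with independent single-process renders instead
# of one multi-process rpiece job, so no sync file is shared at all.
for rif in case_*.rif; do
    rad "$rif" > "${rif%.rif}.log" 2>&1 &
done
wait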

Randolph

On 2012-04-09 21:52:47 +0000, Greg Ward said:

If it is a startup issue as Jack suggests, you might try inserting a few seconds of delay between the spawning of each new rpiece process using "sleep 5" or similar. This allows time for the sync file to be updated without contention between processes. This is what I do in rad with the -N option. I actually wait 10 seconds between each new rpiece process.
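Outside of rad, the same staggering might look roughly like this - only a sketch, with the rendering options abbreviated and the sync file assumed to be set up as in the rpiece command quoted earlier in this thread:

# Staggered startup of 8 cooperating rpiece processes (sketch).
RPOPTS="-vtv -vp 60.0 -2.0 66.0 -vd 12.0 0.0 0.0 -vu 0 0 1 -vh 60 -x 1024 -y 1024 -af bl.amb"
for i in 1 2 3 4 5 6 7 8; do
    rpiece -F bl_blinds_rpsync.txt $RPOPTS -o bl_blinds.unf bl.oct &
    sleep 10        # give each process time to record itself in the sync file
done
wait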

This isn't to say that I understand the source of your error, which still puzzles me.

-Greg


Hey Andy,

Thanks as always for sharing this. I tried to do something similar a couple years ago, and Greg even pointed me to the bit of code in rpiece that does the tiling, but I could never get it to work for me. This is great.

- Rob

On 4/11/2012 1:04 PM, Andy McNeil wrote:

Hi Randolph,

For what it's worth, I don't use rpiece when I render on the cluster. I have a script that takes a view file, a tile number, and the number of rows and columns, and renders the assigned tile (run_render.csh). In the job submission script I distribute these tile-rendering tasks to multiple cores on multiple nodes. I can't use the ambient cache with this method, but I typically use rtcontrib, so I wouldn't be able to use it regardless. There is also the problem that some processors sit idle after they've finished their tile while other processes are still running, but I don't worry about it because computing time on Lawrencium is cheap and available.

Snippets from my scripts are below.

Andy

### job_submitt.bsh #####

#!/bin/bash
# specify the queue: lr_debug, lr_batch
#PBS -q lr_batch
#PBS -A ac_rad71t
#PBS -l nodes=16:ppn=8:lr1
#PBS -l walltime=24:00:00
#PBS -m be
#PBS -M [email protected]
#PBS -e run_v4a.err
#PBS -o run_v4a.out

# change to working directory & run the program
cd ~/models/wwr60

for i in {0..127}; do
pbsdsh -n $i $PBS_O_WORKDIR/run_render.csh views/v4a.vf $(printf "%03d" $i) 8 16 &
done

wait

### run_render.csh ######
#! /bin/csh

cd $PBS_O_WORKDIR
set path=($path ~/applications/Radiance/bin/ )

set oxres = 512
set oyres = 512

set view = $argv[1]
set thispiece = $argv[2]
set numcols = $argv[3]
set numrows = $argv[4]
set numpieces = `ev "$numcols * $numrows"`

set pxres = `vwrays -vf $view -x $oxres -y $oyres -d | awk '{print int($2/'$numcols'+.5)}'`
set pyres = `vwrays -vf $view -x $oxres -y $oyres -d | awk '{print int($4/'$numrows'+.5)}'`

set vtype = `awk '{for(i=1;i<NF;i++) if(match($i,"-vt")==1) split($i,vt,"")} END { print vt[4] }' $view`
set vshift = `ev "$thispiece - $numcols * floor( $thispiece / $numcols) - $numcols / 2 + .5"`
set vlift = `ev "floor( $thispiece / $numcols ) - $numrows / 2 + .5"`

if ($vtype == "v") then
set vhoriz = `awk 'BEGIN{PI=3.14159265} \
{for(i=1;i<NF;i++) if($i=="-vh") vh=$(i+1)*PI/180 } \
END{print atan2(sin(vh/2)/'$numcols',cos(vh/2))*180/PI*2}' $view`
set vvert = `awk 'BEGIN{PI=3.14159265} \
{for(i=1;i<NF;i++) if($i=="-vv") vv=$(i+1)*PI/180 } \
END{print atan2(sin(vv/2)/'$numrows',cos(vv/2))*180/PI*2}' $view`
endif

vwrays -ff -vf $view -vv $vvert -vh $vhoriz -vs $vshift -vl $vlift -x $pxres -y $pyres \
| rtcontrib -n 1 `vwrays -vf $view -vv $vvert -vh $vhoriz -vs $vshift -vl $vlift -x $pxres -y $pyres -d` \
-ffc -fo \
-o binpics/wwr60/${view:t:r}/${view:t:r}_wwr60_%s_%04d_${thispiece}.hdr \
-f klems_horiz.cal -bn Nkbins \
-b 'kbin(0,1,0,0,0,1)' -m GlDay -b 'kbin(0,1,0,0,0,1)' -m GlView \
-w -ab 6 -ad 6000 -lw 1e-7 -ds .07 -dc 1 oct/vmx.oct
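To sanity-check the -vs/-vl arithmetic, here is a quick worked example for the 8x16 grid used in the job script above (ev is the Radiance calculator the script already uses):

# Tile 0 sits at column 0, row 0 (bottom-left of the mosaic):
ev "0 - 8*floor(0/8) - 8/2 + .5"         # vshift -> -3.5
ev "floor(0/8) - 16/2 + .5"              # vlift  -> -7.5
# Tile 127 sits at column 7, row 15 (top-right):
ev "127 - 8*floor(127/8) - 8/2 + .5"     # vshift ->  3.5
ev "floor(127/8) - 16/2 + .5"            # vlift  ->  7.5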


Hey Andy,

This jogs my memory a bit. Perhaps this is a different topic at this point, not sure, as this is more about clusters, Radiance, and rpiece. Another problem that I encountered with using rpiece on my cluster was that sometimes tiles/pieces would get written into the image in the wrong place when using stock rpiece. My solution, if I remember correctly, was to customize rpiece so that each running instance would write out its pieces to its own image file. These would all then get assembled as a post process. I think my idea behind this was to take advantage of the functionality that rpiece does offer.
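For the assembly step, pcompos can paste the per-piece pictures back together at their pixel offsets. Just as a sketch - the tile names, counts, and sizes here are made up:

### assemble_tiles.sh - hypothetical post-process assembly ###
#!/bin/sh
# pcompos anchors each input picture's lower-left corner at the given x y
# pixel offset, so tiles written as tile_<col>_<row>.hdr can be recombined
# into one image.
NCOLS=8;   NROWS=16
TILEW=128; TILEH=64            # pixel size of each tile
args=""
row=0
while [ $row -lt $NROWS ]; do
    col=0
    while [ $col -lt $NCOLS ]; do
        args="$args tile_${col}_${row}.hdr $((col*TILEW)) $((row*TILEH))"
        col=$((col+1))
    done
    row=$((row+1))
done
pcompos $args > assembled.hdr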

-Jack


--
# Jack de Valpine
# president
#
# visarc incorporated
# http://www.visarc.com
#
# channeling technology for superior design and construction
