Before we give up on lock files...

Schorsch wrote:

> And you think sockets are nasty?
>
> One of the few things that I know about lockfiles is that they're
> not exactly simple to get right.
>
> - What happens if the lockfile owner gets killed?
>   Will the others be able to figure out that the lock file is
>   stale and override it? Or will that require human intervention?
>   I think that NFS locks get purged when the owner dies, so we
>   might lose quite some convenience here.
>
> - What happens if process B checks for the file after process A
>   does, but before process A actually creates the new file?
>
> There's no way to make checking and creating an atomic operation,
> so this situation must be handled explicitly and gracefully.
> In a big simulation, it may well be that a dozen processes are
> competing for the lock file several times a second for hours or
> even days. Race conditions *will* happen.

The open() call with the O_EXCL flag is supposed to be atomic within the file system -- i.e., if the file exists, the call fails, so if competing processes race, only one can win. If the open() call fails, then our process moves on as if the lock file had been there all along. To avoid race problems, we compute a random number of ambient values between checks, and this interval gradually increases each time a lock check fails. This is the same back-off method used to reduce packet collisions on a network, and it works quite well at avoiding perpetual races.
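
A minimal sketch of this scheme, assuming POSIX open(). The names and delay constants are made up for illustration; in rpict the "work" between checks would be computing more ambient values rather than sleeping:

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Try to create the lock file atomically; returns fd >= 0 on success,
 * -1 if someone else already holds the lock. */
int try_lock(const char *lockname)
{
	int	fd = open(lockname, O_CREAT|O_EXCL|O_WRONLY, 0644);
	if (fd < 0 && errno != EEXIST)
		perror(lockname);	/* real error, not just contention */
	return fd;
}

/* Acquire the lock, backing off by a random, gradually growing
 * amount between attempts, like collision avoidance on a network. */
int acquire_lock(const char *lockname)
{
	int	fd, tries = 0;
	while ((fd = try_lock(lockname)) < 0) {
		/* random backoff in [1 .. 2^tries] "work units" */
		long	units = 1 + random() % (1L << (tries < 10 ? ++tries : tries));
		usleep(units * 1000);	/* stand-in for useful work */
	}
	return fd;
}

void release_lock(const char *lockname, int fd)
{
	close(fd);
	unlink(lockname);
}
```
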

As for processes that die unexpectedly in the middle of an ambient sync operation, we could disregard this possibility as neither likely nor disastrous -- it would just untie all the processes so their values wouldn't get written out -- not a disaster. Or, we could employ some check on the file creation time: if a lock has been around for more than a certain period (a minute would be extremely generous), any of the processes that have been failing to obtain it can remove it after their Nth failure.

The total code for the above procedure would not be much longer than what is currently in ambient.c, and wouldn't require any network transactions, which I'd rather avoid: there's no networking code currently in Radiance, and it would add a whole set of dependencies I don't want to deal with.

Having said that, I'm willing to go along with Schorsch's suggestion of implementing multiple methods. I'll go ahead and give this one a shot, and you guys can try yours as well if you like.

-Greg

Greg Ward wrote:

> The open() call with the O_EXCL flag is supposed to be atomic within
> the file system -- i.e., if the file exists, the call fails, so if
> competing processes race, only one can win. If the open()
> call fails, then our process moves on as if the lock file had been
> there all along.

You almost convinced me with this one, until I checked the
open(2) man page on my Linux box:

  O_EXCL is broken on
  NFS file systems, programs which rely on it for performing
  locking tasks will contain a race condition. The solution for
  performing atomic file locking using a lockfile is to create a
  unique file on the same fs (e.g., incorporating hostname and
  pid), use link(2) to make a link to the lockfile. If link()
  returns 0, the lock is successful. Otherwise, use stat(2) on
  the unique file to check if its link count has increased to 2,
  in which case the lock is also successful.

I don't know if this is specific to Linux, or if it is just the only
system where the documentation spells out the problem. The method
with the hard link is what Mailman uses, btw., making reference to
the above paragraph of the Linux man page in the code.

There's no link() on Windows, so we'll have to use open() there.
Does anyone know if open() on Windows is supposed to be atomic?
The worst case scenario I see there is someone running a Samba
server on a Linux box, serving a file system that it has mounted
from another Unix host via NFS, which opens us up to the above race
condition again...

> As for processes that die unexpectedly in the middle of an ambient sync
> operation, we could disregard this possibility as neither likely
> nor disastrous -- it would just untie all the processes so their
> values wouldn't get written out -- not a disaster.

For a high quality production rendering that is taking lots of
time, this *can* be a disaster. If we can figure out a reliable
way to avoid this, then I think we should do so.

> Or, we could employ
> some check on the file creation time: if a lock has been around for more
> than a certain period (a minute would be extremely generous), any of
> the processes that have been failing to obtain it can remove it after
> their Nth failure.

Relying on the ctime of the file makes us dependent on the system
clocks being exactly synchronized. It is not uncommon for
machines in the same network to be out of sync by several
minutes. Maybe we just have to watch the file for at least a
minute to decide it's expired.

The other problem is that removing the file will create new race
conditions. If two processes try this at nearly the same time,
and a third one succeeds at creating a new lockfile in between,
then we'll remove that one as well. There may have to be a
secondary lockfile protecting the lock breaking process, with
another minute of waiting after it is created. Any process trying
to acquire the primary lock must check first to make sure that no
secondary lock exists. Waiting a few minutes shouldn't be a
problem, as it will only happen in an exceptional situation
anyway. To make individual lock files identifiable, they can
contain the hostname and PID of the creating process, and the
creation time.

Bad scenario:
- A creates lock and dies
- B and C both fail to create a lock several times,
- B and C check and remember the ID of the lock
- B and C make sure that they are still looking at the same
  file after at least a minute
- B removes the lockfile
- D succeeds in creating a new lock
- C removes the new lock, thinking it is still the old one

Better (but still not perfect) scenario:
- A creates lock and dies (or gets suspended)
- B and C fail to create a lock several times,
- B and C check and remember the ID of the lock
- B and C make sure that they are still looking at the same
  file after at least a minute
- B creates a secondary lockfile
- C tries to create a secondary lockfile, fails, and steps back
- D wants to create a new lock, but sees the secondary lock
  and steps back (similarly during all following steps)
- B waits for another minute, to make sure that all other
  processes interested in acquiring or breaking the original lock
  have had time to notice the secondary lock
- B finds that it still owns the secondary lockfile (this
  eliminates the last chance for A to wake up and intercept the
  breakage by removing the secondary lock)
- B removes the original lock
- B waits yet another minute to ensure against A removing
  someone else's lock if it wakes up again later
- B removes the secondary lock
- any process can try to acquire the lock again
- If A wakes up now, we need to make sure that it doesn't
  remove someone else's lock. This means that it will have to
  check that it really owns the lockfile immediately before
  removing it (hence the third wait above). If it still owns
  the primary lock, and notices a secondary lock, it may first
  remove the secondary lock.
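
A condensed sketch of what B does in this protocol, with the wait time as a parameter (it would be on the order of a minute in practice) and "same lock file" reduced to comparing inode and ctime. All names are hypothetical:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Try to break a suspected-stale primary lock under protection of a
 * secondary lockfile.  Returns 1 if the primary lock was broken,
 * 0 if we stepped back.  wait_secs would be ~60 in real use. */
int break_stale_lock(const char *lock, const char *lock2, int wait_secs)
{
	struct stat	before, after;
	int		fd;

	if (stat(lock, &before) < 0)
		return 0;		/* lock already gone */
	sleep(wait_secs);		/* watch the suspect lock */
	if (stat(lock, &after) < 0 ||
			after.st_ino != before.st_ino ||
			after.st_ctime != before.st_ctime)
		return 0;		/* it changed: owner is alive */

	/* claim the secondary lock protecting the breaking process */
	fd = open(lock2, O_CREAT|O_EXCL|O_WRONLY, 0644);
	if (fd < 0)
		return 0;		/* someone else is breaking it */
	close(fd);

	sleep(wait_secs);		/* let others notice the secondary lock */
	/* still looking at the same stale primary?  then remove it */
	if (stat(lock, &after) == 0 &&
			after.st_ino == before.st_ino &&
			after.st_ctime == before.st_ctime)
		unlink(lock);
	sleep(wait_secs);		/* guard against a waking owner */
	unlink(lock2);
	return 1;
}
```
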

Well yeah, all this assuming that we have a reliable way to
create the original lock in the first place...
And if the file system (or network) is really bogged down, e.g.
because of excessive swapping or other disk activity, then
writing a few ambient records *can* take more than a minute.

To put this all into perspective: Do we have any data on the
typical frequency of an ambient file getting locked/unlocked when
a dozen processes share it in a largish simulation?

-schorsch


--
Georg Mischler -- simulations developer -- schorsch at schorsch com
+schorsch.com+ -- lighting design tools -- http://www.schorsch.com/

Schorsch wrote:

> You almost convinced me with this one, until I checked the
> open(2) man page on my Linux box:
>
>   O_EXCL is broken on
>   NFS file systems, programs which rely on it for performing
>   locking tasks will contain a race condition. The solution for
>   performing atomic file locking using a lockfile is to create a
>   unique file on the same fs (e.g., incorporating hostname and
>   pid), use link(2) to make a link to the lockfile. If link()
>   returns 0, the lock is successful. Otherwise, use stat(2) on
>   the unique file to check if its link count has increased to 2,
>   in which case the lock is also successful.
>
> I don't know if this is specific to Linux, or if it is just the only
> system where the documentation spells out the problem. The method
> with the hard link is what Mailman uses, btw., making reference to
> the above paragraph of the Linux man page in the code.

Hmmm... Is everything broken on Linux, or does it just seem that way? The NFS lock manager doesn't work, O_EXCL doesn't work -- what are they going to throw in our path next? I found no such warning on my OS X machine, though as you say, the bug could be there and just not be documented... Anyway, it's very nice that they included a workaround, which isn't too awful, though I don't get the part about the link() call failing but still succeeding. Very strange.

> Relying on the ctime of the file makes us dependent on the system
> clocks being exactly synchronized. It is not uncommon for
> machines in the same network to be out of sync by several
> minutes. Maybe we just have to watch the file for at least a
> minute to decide it's expired.

I thought about this, and that's why I think it's best for the process to simply keep track of when the lock file says it was created, and if the date hasn't changed between checks that are a few minutes apart by the local clock, it's safe to assume that the process died during an update.

I think your other scenarios are getting a bit far-fetched, especially if we follow the strategy I recommended of removing the lock file in one process, but not assuming then that we can claim the lock ourselves -- just continuing to render until the next normal checkpoint. Then, your second-tier race condition would require the simultaneous lock removal by two processes coincident with the lock-assertion of a third process. Unless we are really stupid about how often we check the lock file, I don't think this scenario will play out in any of our lifetimes. The chance of a process dying in the middle of an update alone is probably small, and you're multiplying this by the probability of a three-way tie. At some point, we have to say, "that's good enough." I don't want to implement a second lock file to remove the first one in the event of a process dying mid-write.

-Greg

Greg Ward wrote:

> Is everything broken on Linux, or does it just seem that way?

The problem appears to come from the specification of the NFS
protocol version 2, so it's not necessarily limited to Linux:
http://www.linux.se/doc/HOWTO/Secure-Programs-HOWTO/avoid-race.html

  > if the lock file may be on an NFS-mounted filesystem, then
  > you have the problem that NFS version 2 doesn't completely
  > support normal file semantics.
  > ...
  > NFS version 3 added support for O_EXCL mode in open(2); see
  > IETF RFC 1813, in particular the "EXCLUSIVE" value to the
  > "mode" argument of "CREATE". Sadly, not everyone has switched
  > to NFS version 3 or higher at the time of this writing, so
  > you can't depend on this yet in portable programs. Still,
  > in the long run there's hope that this issue will go away.

There's another idea that I ran into on my searches (as a reply
to someone with a similar problem to ours):
http://lists.linux.org.au/archives/linuxcprogramming/2002-July/msg00054.html

  > Surely the simple solution is to write a network lock server
  > (which would take a couple of hours) or grab one from
  > somewhere (which might take even less time). Then you can
  > forget about special cases for different platforms and file
  > systems.

This is a less involved variation of what I was planning to try.
I thought that once I have a server running, I can just as well
feed all the data through it. The more lightweight implementation
suggested here just establishes the lock, while the ambient data
still goes directly to the file.

An example implementation that we might even be able to use
and/or modify can be found here:
  http://sourceforge.net/projects/lockserver/
This does a lot more than we need, as it also queues requests,
stores locks to a file to survive a restart, and manages
priorities and TTLs for both queued requests and granted locks.
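
The core of such a server is tiny, since its whole job is to serialize requests. Here is a sketch of the state machine, with a made-up LOCK/UNLOCK/GRANT/DENY protocol; a real server would speak it over a TCP or Unix-domain socket, and could drop the lock automatically when the holding client's connection closes -- which would also solve the dead-process problem:

```c
#include <string.h>

/* Handle one request from 'client' (any unique client id, e.g. a
 * connection fd).  '*holder' tracks who holds the lock, -1 = free.
 * Returns the reply to send back. */
const char *lock_request(const char *req, int client, int *holder)
{
	if (strcmp(req, "LOCK") == 0) {
		if (*holder == -1 || *holder == client) {
			*holder = client;
			return "GRANT";
		}
		return "DENY";	/* client retries later with backoff */
	}
	if (strcmp(req, "UNLOCK") == 0) {
		if (*holder == client)
			*holder = -1;
		return "OK";
	}
	return "ERROR";
}
```
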

> > Maybe we just have to watch the file for at least a minute to
> > decide it's expired.

> I thought about this, and that's why I think it's best for the
> process to simply keep track of when the lock file says it was
> created, and if the date hasn't changed between checks that are
> a few minutes apart by the local clock, it's safe to assume
> that the process died during an update.

Same idea. Potential pitfall: We'll need to check out which of
the timestamps we can rely on under Windows. I *think* ctime should
be fine. One of the problems here is that Windows file systems
don't use inodes, but store this kind of information with the
file. If I remember correctly, this means that e.g. the atime
gets modified just by looking at it (kind of logically inevitable,
but still weird!).

> Unless we are really stupid about how often we check the lock
> file, I don't think this scenario will play out in any of our
> lifetimes.

Maybe, maybe not. All I know for sure is that it *can* happen.
That was the reason I asked for data (or even just a reasonable
estimation) on the frequency that the lock might typically get
acquired during a large simulation. Maybe we can add a way to log
this information, to get a better idea about the risks involved.

I agree that there will be a point of being "good enough" (my
convoluted scenario isn't 100% safe either), but I would feel
much better if we could decide on this point based on solid data.
If the data shows that it might happen once during my lifetime,
then I'd gladly accept that. If I were to see it twice, I might
have second thoughts...

Ok, independently of the specific implementation details, I see
the following strategies that we could test:

System lock (as implemented for unix)
* direct read access to ambient file
* direct write access to ambient file
* locking through fcntl() (or the standard Windows locking
   mechanisms, translated through Samba when needed)

Simple lock file
* direct read access to ambient file
* direct write access to ambient file
* locking through lock file, broken after a TTL of n minutes

Complex lock file
* direct read access to ambient file
* direct write access to ambient file
* locking through lock file, broken under protection of a
   secondary lock

Locking server
* direct read access to ambient file
* direct write access to ambient file
* locking through a separate server process

Unidirectional data server
* direct read access to ambient file
* ambient data written through server process
* server may use one of the above locking mechanisms if file
   writing is still shared with other processes, or none if it
   has exclusive access

Bidirectional data server
* ambient data read through server process
* ambient data written through server process
* server may use one of the above locking mechanisms if file
   writing is still shared with other processes, or none if it
   has exclusive access

Lots of toys to play with...
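
For reference, the fcntl() variant in the first strategy looks roughly like this -- a sketch of standard POSIX advisory record locking, not the actual ambient.c code:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Lock the whole file for writing; blocks until available.
 * Returns 0 on success, -1 on error. */
int ambient_lock(int fd)
{
	struct flock	fl;
	fl.l_type = F_WRLCK;	/* exclusive (write) lock */
	fl.l_whence = SEEK_SET;
	fl.l_start = 0;
	fl.l_len = 0;		/* 0 = to end of file, whatever its size */
	return fcntl(fd, F_SETLKW, &fl);
}

int ambient_unlock(int fd)
{
	struct flock	fl;
	fl.l_type = F_UNLCK;
	fl.l_whence = SEEK_SET;
	fl.l_start = 0;
	fl.l_len = 0;
	return fcntl(fd, F_SETLKW, &fl);
}
```

This is the variant that depends on the NFS lock manager working, which is exactly the trouble Randolph describes below.
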

-schorsch


--
Georg Mischler -- simulations developer -- schorsch at schorsch com
+schorsch.com+ -- lighting design tools -- http://www.schorsch.com/

The problem occurred in Solaris when I was in Sun tech support, lo these many years ago. I don't know if it's cleared up in Solaris yet or not -- it has been fixed in Linux.

Thing is, NFS file locking just wasn't very important to the NFS designers, so it didn't work very well. I don't think that's changed, unfortunately.

Randolph


On Monday, February 3, 2003, at 04:10 PM, Georg Mischler wrote:

> Greg Ward wrote:
>
> > Is everything broken on Linux, or does it just seem that way?
>
> The problem appears to come from the specification of the NFS
> protocol version 2, so it's not necessarily limited to Linux:
> http://www.linux.se/doc/HOWTO/Secure-Programs-HOWTO/avoid-race.html

What about a process that works with the ambient file and, either in real time or as a batch process, distills the ambient data points? At a minimum, it should cull duplicate points, and preferably it could do some Gaussian (or other) smoothing on the data. The culling would reduce child process startup times for large datasets, and the smoothing would possibly reduce weirdness in the ambient data, or at least make the weirdness less obvious.

I mention it now not only because I've been frustrated with long startup times in the past, but also because if this mechanism is to work in real time, then the design of the ambient data file server should keep this capability in mind.

-Chas