Wednesday, June 19, 2013

Side note on ELF notes

An ELF file may contain a .note section, which holds any number of additional implementation-defined values. Each value is opaque from the point of view of the ELF specification, which describes only the structure of the note header. That is because notes are considered optional, and an implementation may ignore any note it cannot understand.

However, the note header is a case that resembles the export keyword: introduced in C++98, ignored by most compilers and eventually removed in C++11. The problem with ELF notes is that all implementations use a note header format slightly different from the one described by the specification.

According to the ELF specification, a note contains 5 fields:
  • namesz and name: The latter is the name of the entry owner (a null-terminated string), used mainly to avoid conflicts between custom, implementation-defined notes. The former is the size of that name.
  • descsz and desc: The actual note value and the size of that value.
  • type: Note type used to correctly interpret its value.

The ELF specification clearly states that all these fields should be 4-byte aligned (8-byte on 64-bit processors) and padded accordingly. That is what the standard says. In reality, the BSD family, Linux, GNU tools, illumos, Solaris, etc. use 4-byte alignment on both 32-bit and 64-bit architectures. A comment in the illumos source code claims that this is due to a mistake in the initial 64-bit port, which was never corrected in order to maintain compatibility. Nothing serious, but it may confuse anyone who is not aware of it.
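To make the layout concrete, here is a minimal sketch of walking the raw bytes of a note section using the 4-byte alignment that implementations actually use. The on-disk order of the three header words (namesz, descsz, type) is the usual one; the function names and the little-endian default are my own choices for illustration, not part of any library.

```python
import struct

def align4(n):
    """Round n up to the next multiple of 4."""
    return (n + 3) & ~3

def parse_notes(data, byteorder="<"):
    """Yield (name, type, desc) for each note in the raw section bytes.

    Assumes three 4-byte header words and 4-byte padding of name and desc,
    on both 32-bit and 64-bit targets.
    """
    header = struct.Struct(byteorder + "III")  # namesz, descsz, type
    offset = 0
    while offset + header.size <= len(data):
        namesz, descsz, ntype = header.unpack_from(data, offset)
        offset += header.size
        name = data[offset:offset + namesz].rstrip(b"\0").decode("ascii")
        offset += align4(namesz)  # skip the name and its padding
        desc = data[offset:offset + descsz]
        offset += align4(descsz)  # skip the value and its padding
        yield name, ntype, desc

def build_note(name, ntype, desc, byteorder="<"):
    """Serialize one note; hypothetical helper used to produce sample input."""
    raw_name = name.encode("ascii") + b"\0"
    return (struct.pack(byteorder + "III", len(raw_name), len(desc), ntype)
            + raw_name.ljust(align4(len(raw_name)), b"\0")
            + desc.ljust(align4(len(desc)), b"\0"))
```

A parser that followed the specification literally and padded to 8 bytes on a 64-bit target would lose track of the note boundaries as soon as a name or value had a size that is not a multiple of 8.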

The actual note owners, the note types and the way their values are interpreted are not documented by the ELF specification in any way. Moreover, since they are all implementation-defined, there is no documentation at all. This is another thing that makes working with ELF notes a bit more unpleasant.

In addition to all this, one would expect the pair name:type to determine how a note is interpreted. That is a correct assumption, but care must be taken when introducing new note types, since some note types are interpreted in a certain way regardless of the note owner. Basically, there is a set of notes used in core files whose owner is usually CORE. However, some implementations (e.g. FreeBSD) store them with different owners. As a result, most tools that read ELF notes assume that the note owner is CORE if they do not recognize the actual one. This means that the only way to reliably avoid conflicts is to use a custom note owner together with note types that are not used by the owner CORE.
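The fallback described above can be sketched roughly as follows. The table is purely illustrative and far from complete; the four numeric values match the common Linux/GNU definitions, while describe_note and the MYAPP owner are hypothetical names.

```python
# Illustrative subset of (owner, type) pairs; real tools know many more.
KNOWN_NOTES = {
    ("CORE", 1): "NT_PRSTATUS",
    ("CORE", 3): "NT_PRPSINFO",
    ("GNU", 1): "NT_GNU_ABI_TAG",
    ("GNU", 3): "NT_GNU_BUILD_ID",
}

def describe_note(owner, ntype):
    """Label a note the way a typical reader might, assuming an
    unrecognized owner is treated as if it were CORE."""
    known_owners = {name for name, _ in KNOWN_NOTES}
    effective_owner = owner if owner in known_owners else "CORE"
    return KNOWN_NOTES.get((effective_owner, ntype), "unknown")
```

With this logic a custom note stored as MYAPP with type 3 would be reported as NT_PRPSINFO, which is exactly the kind of conflict that picking a type unused by CORE avoids.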


Monday, August 27, 2012

How NFS4 became stateful

The Network File System has been a stateful protocol since version 4. To avoid regressions relative to the earlier, stateless versions, a number of methods for recovering from either client or server failure had to be adopted.

During client operation the server may store up to three different types of state. Creating any of them involves the client sending an opaque value called an owner, which should be generated according to rules that depend on the state type. Usually sequence numbers are also used to ensure at-most-once semantics, as described in more detail in my previous post. If state creation succeeds, the server returns a stateid value (or a clientid in the case of client state), which acts as a shorthand reference for the original, usually long, owner value.
  • client state - the "root" state; obtaining it is necessary in order to create or reclaim any other state on the server. It is used to identify a client instance. The owner value has to remain unchanged after a client reboot. How client state is used to recover after a client reboot is described more thoroughly in the following section of this post.
  • open state - obtained when a client opens a file. The same open owner may be used simultaneously for many open files. The NFS4 specification compares an open owner to a file descriptor that may be shared among multiple processes.
  • lock state - obtained when a client creates a lock. The same lock owner may be able to upgrade its own locks, if the server supports such an operation. To achieve POSIX-like behavior, the lock owner should be generated from the process identifier.

Client crash recovery

Recovery from a client crash is usually quite straightforward. Clients are obliged to periodically renew all leases (i.e. states) they hold on the server. This can be accomplished by issuing either any request that contains a stateid value or a special RENEW request, which renews all leases held by the client at once.

The server decides how often leases are to be renewed. It is a trade-off between more network traffic and slower recovery from either client or server failure. Apparently, 90 seconds is quite a common default value.

When a client reboots and reconnects, it sends a SETCLIENTID request using the same owner value as its previous instance and a new verifier. Such a request informs the server that the client has rebooted and that all leases it held can be released immediately.
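A minimal sketch of how a server might react to SETCLIENTID, under simplifying assumptions: the ClientTable class and its methods are hypothetical, leases are plain strings, and the separate confirmation step that the real protocol requires (SETCLIENTID_CONFIRM) is left out.

```python
import itertools

class ClientTable:
    """Hypothetical server-side table of client states, keyed by owner."""

    def __init__(self):
        self._by_owner = {}
        self._next_id = itertools.count(1)

    def setclientid(self, owner, verifier):
        record = self._by_owner.get(owner)
        if record is not None and record["verifier"] == verifier:
            # Same instance asking again: keep its state untouched.
            return record["clientid"]
        # Unknown owner, or a known owner with a new verifier (a reboot):
        # start from a fresh record, which drops any leases held before.
        record = {"verifier": verifier,
                  "clientid": next(self._next_id),
                  "leases": set()}
        self._by_owner[owner] = record
        return record["clientid"]

    def add_lease(self, owner, stateid):
        self._by_owner[owner]["leases"].add(stateid)

    def leases(self, owner):
        return set(self._by_owner[owner]["leases"])
```

The verifier is what lets the server tell a retransmitted request from a genuinely new client instance without waiting for the old leases to expire.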

Server crash recovery

When the server reboots, the client will learn about it no later than one lease time afterwards. Either a request containing a stateid value or a RENEW operation will return an error code indicating that the server has rebooted and that clients need to reclaim their leases.

To prevent conflicts between clients reclaiming old leases and clients trying to acquire new ones, the server enters a so-called grace period after reboot, during which no new leases can be acquired. The grace period is no shorter than the lease time, so every client will attempt to renew its leases at least once during this period and will eventually be notified about the reboot. Moreover, old leases may be reclaimed only during the grace period, which avoids possible race conditions.
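The grace period rule boils down to a small decision table. The sketch below assumes a grace period exactly equal to the lease time and takes the current time as an argument so that it stays deterministic; GraceServer and request_state are hypothetical names, while the two error names are the ones RFC 3530 defines for these cases.

```python
class GraceServer:
    """Hypothetical model of a server that has just rebooted."""

    def __init__(self, boot_time, lease_time=90.0):
        # Assumption: the grace period lasts exactly one lease time.
        self.grace_end = boot_time + lease_time

    def request_state(self, now, reclaim):
        """Decide the fate of a request that creates state at time `now`.

        `reclaim` is True when the client is reclaiming a lease it held
        before the server rebooted.
        """
        in_grace = now < self.grace_end
        if in_grace and not reclaim:
            return "NFS4ERR_GRACE"      # new state must wait
        if not in_grace and reclaim:
            return "NFS4ERR_NO_GRACE"   # too late to reclaim
        return "NFS4_OK"
```

Because the two kinds of request are never both accepted at the same moment, a new lock cannot slip in ahead of a client that still has a legitimate claim from before the reboot.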

Crash recovery and open delegations

Lease reclamation also happens when a client reboots while holding an open delegation. In that case, issuing a SETCLIENTID does not release the state tied to open delegations, since there may be cached writes that need to be synchronized with the server. The client then reclaims its previously held delegation just as it would reclaim any other lease after a server reboot.

Open delegations also introduce another problem. Since they rely on RPC callbacks, it is possible for the callback path to break. In that case, the server waits for a RENEW operation and responds with the error NFS4ERR_CB_PATH_DOWN. This error code tells the client that, although all leases were successfully renewed, the callback path is broken and all delegations have to be returned as soon as possible.


Useful links

  • RFC 3530 - Network File System Protocol version 4

Tuesday, August 7, 2012

How NFS4 improved RPC

Since its very beginning, the Network File System (NFS) has relied on a lower-level protocol to perform remote procedure calls. Thus NFS deals with files, and Open Network Computing Remote Procedure Call (ONC RPC) takes care of sending requests and replies over the network.

However, more than 20 years have passed since the first NFS and RPC specifications, and the former has evolved a great deal while the latter has remained virtually unchanged. NFS has changed from a stateless protocol to a stateful one, which made helper protocols like Network Lock Manager (NLM) and Network Status Monitor (NSM) unnecessary. Consequently, NFS started to require more guarantees about the way in which client requests are processed. Some of them RPC could not provide.

Since version 4, NFS supports file locks on its own, without using any external protocol as the earlier versions did. There are also share reservations, which are another kind of file lock. Acquiring and releasing both file locks and share reservations are operations that have to be ordered and executed at most once.

Unfortunately, RPC does not guarantee this: it uses at-least-once semantics, and messages are not ordered. The RPC transaction identifier (XID) does not help either, since the specification forbids the server from treating it as a sequence number. In addition, since RPC is independent of the transport layer protocol, NFS cannot take advantage of any TCP or SCTP guarantees.

Ordering and at-most-once semantics are achieved by adding sequence numbers and state owners to the requests that require them. For each state owner the server stores the last received sequence number L and the response that was returned to that request. Then, when the server receives another request with sequence number r, one of the following will happen:

  • r < L - the request is rejected
  • r == L - received request is a duplicate and server returns the cached response
  • r == L + 1 - new request received, server performs any necessary action, then updates L and response cache
  • r > L + 1 - the request is rejected
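The four cases above translate almost directly into code. This is only a sketch of the idea: SeqOwner and handle are hypothetical names, the wrap-around of a fixed-width sequence number is ignored, and the rejection is modelled by returning the error name NFS4ERR_BAD_SEQID.

```python
class SeqOwner:
    """Per-owner replay cache: last sequence number and its response."""

    def __init__(self, last_seq, last_response=None):
        self.last_seq = last_seq            # L in the text
        self.last_response = last_response

    def handle(self, seq, action):
        """Process a request with sequence number `seq` (r in the text).

        `action` is a zero-argument callable that performs the operation
        and returns the response to send back.
        """
        if seq == self.last_seq:
            # Duplicate: answer from the cache, do not execute again.
            return self.last_response
        if seq == self.last_seq + 1:
            # Next request in order: execute it, then update L and the cache.
            response = action()
            self.last_seq = seq
            self.last_response = response
            return response
        # r < L or r > L + 1: out of order, reject.
        return "NFS4ERR_BAD_SEQID"
```

Note that the operation runs only in the r == L + 1 branch, so a retransmitted lock or unlock request can never be applied twice.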

Following this behavior ensures that requests are performed in the correct order and at most once, provided that L was correctly initialized. The server also needs a way to deal with the first use of a state owner and its corresponding sequence number. The NFS4 specification states that:

The first request issued for any given lock_owner is issued with a sequence number of zero.

This guarantee is too weak, though. The server is allowed to dispose of any state owner that has not been used for a prolonged period of time. Hence, there may be a valid request that carries a non-zero sequence number together with a state owner that is, from the server's point of view, new.

To deal with such situations, the first use of an open owner needs to be additionally confirmed. A correct confirm request has a sequence number one greater than the request it is confirming. Once the request is confirmed, a proper state is established. However, if the client fails to confirm the request in a timely manner, or sends another request whose sequence number is incorrect for the one pending confirmation, the server disposes of the unconfirmed state.

Lock owners are dealt with in a somewhat less complicated way. Since it is impossible to lock a file that is not already open, it can be safely assumed that a confirmed open owner already exists whenever a new lock owner is used. Each time a new lock owner is used, the client sends the open owner sequence number in the same request. Thus the first use of a lock owner sequence number is also sequenced and does not need to be confirmed.

Earlier versions

It is worth mentioning that the earlier versions also had non-idempotent operations, namely create, rename and remove. Nevertheless, they did not require such special treatment as locking does. In the case of a duplicated rename or remove request, the client received an error and assumed that someone else had already removed the file.

The exclusive create operation in version 3 of the protocol uses a verifier to ensure at-most-once semantics. The verifier is a random value provided by the client. When a file is created, the server stores the verifier. When another exclusive create request for the same file arrives, the server compares the verifiers: if they are the same, the request is a duplicate and a success reply is returned again. Otherwise, the server informs the client that the file already exists.
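The verifier check can be sketched as below. The Directory class is hypothetical and keeps verifiers in memory, whereas a real server has to store them somewhere that survives a restart; the two status names are the NFS version 3 ones.

```python
class Directory:
    """Hypothetical directory that remembers the create verifier per file."""

    def __init__(self):
        self._verifiers = {}  # file name -> verifier stored at creation

    def create_exclusive(self, name, verifier):
        stored = self._verifiers.get(name)
        if stored is None:
            # File does not exist yet: create it and remember the verifier.
            self._verifiers[name] = verifier
            return "NFS3_OK"
        if stored == verifier:
            # Same verifier: a retransmission of the request that created it.
            return "NFS3_OK"
        # Different verifier: someone else created the file first.
        return "NFS3ERR_EXIST"
```

The client that lost the first reply therefore still sees success on retry, while a competing client with a different verifier correctly learns that the file exists.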

Useful links

  • RFC 1050 - Remote Procedure Call version 1
  • RFC 1094 - Network File System Protocol version 2
  • RFC 1813 - Network File System Protocol version 3
  • RFC 3530 - Network File System Protocol version 4
  • RFC 5531 - Remote Procedure Call version 2