Athshe preliminary design

Raph Levien, 4 Dec 1999

Athshe is an attempt to synthesize a bunch of current ideas in distributed documents into a unified, fairly simple system. It is more an exploration of what is technically possible than a response to a precise set of application needs.

A major focus of Athshe is performance. If you try to run DOM over CORBA to access XML, be prepared for an unpleasant surprise.

Athshe is a little hard to pin down precisely, as it isn't just one thing. It incorporates a simple set of abstract data structures, a network protocol for accessing documents represented in these data structures remotely, an api for accessing documents either locally or remotely, a namespace for tag metadata, and doubtless other stuff.

The data structures

A node is one of four things:

A byte sequence.
A list of nodes.
A mapping from byte sequence to node.
A content-type tag attached to a node.

The list of nodes and byte sequence correspond quite closely to the "list" and "atom" (or string) concepts in S-expressions. S-expressions induce a tree structure that was undoubtedly a major inspiration for XML (via SGML, of course).

The mapping is similar to a Perl hashtable, and also to a directory in a filesystem.

Finally, the content type is similar to MIME content types, or Macintosh TYPE informtation stored in the resource fork, or (to a certain extent) XML namespaces.

Thus, documents represented in Athshe have quite a bit of similarity to XML files. Mapping nodes, however, significantly help in the management of very large documents. Also, note that the inclusion of byte sequences makes Athshe suitable for binary data such as images, another XML limitation.

Binary file format

Athshe needs to define a binary file format for representing its data structures. Here's a dumb proposal, likely to be replaced.

An Athshe data structure is encoded as a file as a header, 0xA7, 0x45, 0x4E, 0x0A, 0x00, followed by the encoding of the toplevel node.
A byte sequence is encoded as its length encoded as a type byte, a b128 variable size int, and the binary content. The type byte is 0x01 for UTF-8 encoded Unicode text, 0x0F for unknown binary data. 0x02-0x0E are reserved for other basic binary types, such as integers and so on.

A list is encoded as the byte 0x10, followed by the representation of each of the nodes in the list, followed by the byte 0x11.

A mapping is encoded as the byte 0x20, followed by a list of entries, followed by the byte 0x21. Each entry consists of a b128 int encoding of the key's length, the contents of the key, and the encoding of the node. In general, the key will be a UTF-8 encoded Unicode string, but this is not required. (have a bit in the opening byte?)

A tagged node is encoded as the byte 0x30, the b128 int encoding of the tag's length, the tag itself (a UTF-8 encoded Unicode string), and the encoding of the tagged node.

The b128 encoding of an integer is as follows: represent the integer in base 128. Embed each "digit" in the 7 low order bits of a byte. For all digits but the least significant, set the high order bit to "1", for the least significant "0". The encoding is the resulting sequence of bytes in most-significant-first order.

Example:

The Athshe data structure loosely specified by ("hello" "world") is represented as follows:
0xA7 0x45 0x4E 0x0A 0x00       Athshe header
0x10                           begin list
     0x01 0x05 h e l l o       "hello"
     0x01 0x05 w o r l d       "world"
0x11                           end list
This encoding is not exceptionally compact, so it may benefit from gzip compression. Alternatively, it might be interesting to define another encoding that's more compact, almost as simple, and without the speed penalties of gzip.

In any case, it's a hell of a lot more compact than XML.

Note to self: probably want to combine type byte with size for small strings. Shouldn't be a problem, there's lots of extra bits in there :)

Network access method

This is the heart of the Athshe concept. It's inspired by running DOM over CORBA, but is designed to fix the technical flaws of that approach, mostly the distributed garbage collection, latency, and transaction/locking issues.

The details are still a bit vague, so this is an outline. You have a fairly traditional model of a client, a server, and a connection between the two. The client sends requests to the server, which responds with a response. In addition, the server might send the client notifications of events, either requested by the client or speculative.

The connection is designed to be asynchronous, but ordered. Specifically, the client can issue any number of requests without waiting for confirmation, but the requests are considered to be in order, so one requests can depend on previous ones without waiting for the roundtrip latency. (Damn, I wish CORBA had a clean way of specifying these semantics). The responses, however, may come back out of order.

Within a connection, you have the concept of node id's. These can be allocated by either the client or server (thus, client-allocated and server-allocated id's need to be distinguished, to avoid the possibility of collision). You have to get an initial node id from somewhere, probably a naming service that also operates over the protocol.

Node id lifecycle

Node id's are explicitly allocated. The exact protocol is a little subtle, however.

Either the server or the client can allocate a new node id. When the server allocates a new node id, it sends a "node id allocated" notification to the client. Alternatively, it may allocate a new node as part of the return value from a query operation. Clients generally allocate new node id's by including the id in a node creation request.

Deallocation is always initiated by the client. First, the client sends a "deallocate node id" request to the server. The server then acknowledges it, at which point the client can reuse the id.

Waiting for the acknowlegement is essential because otherwise there may be race conditions.

Finally, if the connection is terminated, all node id's are deallocated.

When a node id is deallocated, sometimes its corresponding node can also be deallocated. It depends on whether the node is referenced by another allocated node, or by a persistent node. Thus, node data is effectively refcounted.

I think that within a connection, node id's are unique. That is, a single node cannot be referenced by two different node id's.
Client requests

Here's a brief summary of client requests.

Querying

query_type (Node_id): Node_type
Queries the type of a node, returning an indication whether it's a byte sequence, a list, a mapping, or a tagged node.

query_n_children (Node_id): int
Returns the number of children of a list (or mapping) node.

query_size (Node_id): int
Returns the size in bytes of a byte sequence node.

query_child (Node_id, int child_num): Node_id
Returns child number child_num of a list node.

query_map (Node_id, Byte_seq key): Node_id
Looks the key up in the map referenced by the Node_id, returning the value as a new node_id.

query_substring (Node_id, int start, int length): Byte_sequence
Selects a substring of a byte sequence node.

query_map_child_key (Node_id, int child_num): Byte_seq
Returns the key for child child_num of the mapping node (what's the order? sorted by key? arbitrary?).

query_map_child_value (Node_id, int child_num): Node_id
The corresponding value for child child_num of the mapping node.

query_tag (Node_id): Byte_seq
Returns the tag for a tagged node.

query_tagged_val (Node_id): Node_id
Returns the value for a tagged node.

query_parent (Node_id): Node_id
Returns the parent node of the given node.

Node creation

Each of these functions takes a Node_id argument referring to an unallocated node id in the client space. The call allocates the node id.

create_byte_seq (Byte_seq, Node_type, Node_id): void
Creates a new byte sequence node with the specified Node_id.

create_list (Node_id): void
Creates a new empty list with the specified Node_id.

create_mapping (Node_id): void
Creates a new empty mapping with the specified Node_id.

create_tagged (Byte_seq tag, Node_id val, Node_id): void Creates a new tagged node wrapping the node val.

Mutation

insert_child (Node_id parent, int child_num, Node_id child): void
Inserts the child before child_num in the parent list.

remove_child (Node_id parent, int child_num): void
Removes child child_num from the parent list.

remove_map (Node_id parent, Byte_seq key): void
Removes a key/value pair from the map.

put_map (Node_id parent, Byte_seq key, Node_id val): void
Inserts the val in the parent mapping under the given key.

replace_substr (Node_id, int start, int length, Byte_seq): void
Removes length bytes starting at start from the byte sequence node, and inserts the given byte sequence at that location.

Probably more, like replace_child, append_child, as needed...

Events

The server sends events to the client, either to fulfill a client request, or to help out with caching (see below). These events have varying levels of precision. At most precise level, there is basically an event corresponding to each mutation request. In addition, the server can send more vague events, all the way to "some descendant of node_id X changed". This prevents a large event volume from saturating a low-bandwidth net connection.

When a client requests to listen to events on a particular node, it also specifies its preference as to the level of precision.

Transactions

Athshe is most useful in a truly distributed environment, when there may be more than one client connected to a document. In this context, some kind of locking or transaction control is mandatory.

Here's a simple transaction protocol I've come up with:

The client initiates a transaction by sending a request, associated with a node. The transaction comes in one of three flavors: read-only, read-write, and write-locked.

The server eventually grants the transaction. Note that it may send a number of events first.

Within a transaction, the client has a view of the document that is consistent with it being the sole client. Thus, if it sends the same query, it will get the same answer (assuming that it has not itself modified the data).

At the end, the client has the option of committing or aborting the transaction. Again, the server eventually responds saying that the transaction is actually committed.

If a connection terminates, it aborts all open transactions.

A read-write transaction has the special property that it may be aborted by the server. This happens if the server uses optimistic concurrency control and a deadlock happens (see your favorite database text for an explanation for now). If you don't feel like writing your client to deal with this, use write-locked transactions instead and be prepared to wait a lot for the lock when there is high contention.

Caching

It's very important for the client to be able to cache parts of the document, otherwise you incur roundtrip latency and network bandwidth every time it needs to access the document. The Athshe caching model also has a prefetching mechanism that should avoid read-traversal of a document tree from being latency bound as it is in the DOM over CORBA case.

Basically, when the client stores a node in its cache, it sends the server a listen request at low precision (note to self: have to be careful about race conditions, maybe make listen request atomic with query?). Then, if the data ever changes, the server sends a change event. This can be a precise event, in which case the client updates the cache, or a vague one, in which case the client simply invalidates the cache entry.

The prefetching mechanism is based on the fact that events can be sent by the server not in response to any particular listen request. Thus, if the client starts querying the document in some traversal order, the server can speculatively send events as if in response to future queries.

Client API

The client API is actually not a core part of Athshe, in that it's easy to imagine multiple client API's, and have everything interoperate just fine. Even so, for clients to talk bare network protocol is a pain in the ass, so there has to be a layer to hide it.

I see the API as supplying wrapper functions for most of the net functions. In C (at least), these functions should be available in blocking and callback flavors. In Java, making them all blocking would not be unreasonable, although providing callback variants might make sense just for reducing thread overhead.

The client side API implementation is typically going to include caching and other fun stuff too. So it's not exactly trivial.

Other stuff, in outline

Nameservice mapping "well known names" to node ids.

Starting from a node id and traversing mapping keys is just like a filesystem.

Protocol above is the client-server one. There should also be a server-server one that helps with redundant storage. The server-server protocol needs a more robust node naming strategy than the one mentioned above with per-connection-only node id's. There also needs to be a set of quick, reliable ways to map these node names to servers that can serve them.

Security needs to be baked in, using public key encryption and related stuff. One intriguing possibility: a server can store data for clients, without the server knowing the public key. In this setting, the server essentially acts as a dumb disk drive, ie storing encrypted snapshots of the document, as well as encrypted log entries of mutations. A client can reconstruct the document by retrieving and decrypting the snapshot and log. Any client with write access can create a new snapshot. Perhaps intermediate-grained operations also make sense as a performance optimization.

Access control interacts with the operations supported. For example, appending to a list is a reasonable operation (it corresponds, for example, to sending mail to a mail spool). Without access control, this can be done by querying the list length, and doing an insert operation. However, if there is no read access to the list (as would be reasonable for a mail spool), this doesn't work. Also worth noting that an append operation can improve performance by avoiding roundtrip latencies.

Interaction with XML

Define an embedding of XML into Athshe. For simple XML files, a simple embedding can be defined: the element

<a b="c">child1<child2/></a>

becomes:

{name -> "a", attrs -> {b -> "c"}, children -> ( "child1", {name -> "child2"} ) }

This obviously becomes a fuckload more complex once we add notations, dtd's, entities, namespaces, and other cruft. Yet, it should be reasonably straightforward to define a DOM on top of the Athshe API (make sure to define Athshe API so as to not make this gratuitously hard). This would gain many of the performance advantages, without exporting any XML complexity into either the backend (ie, networking, persistent storage, transactions) or into applications that don't need all the XML junk. To me personally, this is a clean separation.

Athshe home
levien.com home