вторник, 22 июля 2014 г.

Double-typed relations for partial data representation

During enterprise system's development it is often the case that the data instances are represented as ValueObjects (or case classes). For example:

   case class Person(name: String, address: Address)

This representation has advantages:
  •  strongly typed data access,
  •  meta information binding is available through annotations,
and drawbacks:
  • if there are many entities then the data handling becomes routine (copy-paste);
  • annotations are rather inconvenient as a means of metainformation representation;
  • partial data cannot be represented easily - we need either to use empty values for unavailable fields, or create additional classes with required attributes;
  • it is difficult to represent the change of a property.
We would like to have a framework that allows us to represent partial information about entities irredundantly and typesafe. In this post we propose such an approach that has been implemented in the library synapse-frames.

Double-typed relations

We would like to create a model of object-oriented data structure. In our model every class, an instance, a property and any other aspects of the structure are represented as objects. To abstract the property declaration as an object (rather than a method or a field definition) we need to use some type. As soon as the property declaration contains the data type T, it is necessary to include it as a generic argument to the property type. Thus a type of an object that can represent the property itself could be something like Slot[T]. The single type is used only to constrain the data that the property can contain. However, the property is usually defined within the object class and in fact it is implicitly bound to the containing class.
Let's model the property as a relation parametrized with two types - the type L of the container and the type R of the value:

   sealed trait Relation[-L,R]
   case class Rel[-L, R](name: String) extends Relation[L, R]
 
(-L means "contravariance", i.e. the property will also be available for descendants of L. The property is invariant on type R because for the property getters and setters are available.)

The Rel class is the way to declare simple ordinary attributes of type L. For example:

   class Box

   val width = Rel[Box, Int]("width")
   val height = Rel[Box, Int]("height")

(Thanks to the contravariance, these properties are also available for descendents of the Box). (Note also that the class itself needs no methods nor properties.)

The property can also contain any application specific metainformation - database domain, property description, serializer/deserializer, constraint, column width, data format, etc. It is also easy to bind additional layer-specific metainformation using Maps.

For the type parameter L we need some real type. Any valid Scala type can be used here - a primitive type, a type alias, a trait, an abstract/final class, an object.type, etc. Contravariance of the type parameter L allows us to leverage inheritance relation of the types used. It seems to be convenient to reflect relations between domain entities directly by a hierarchy of classes and traits:

   abstract class Shape
   trait BoundingRectangle

   final class Rectangle extends Shape with BoundingRectangle
   final class Circle extends Shape with BoundingRectangle

   val width = Rel[BoundingRectangle, Int]("width")
   val height = Rel[BoundingRectangle, Int]("height")

   val radius = Rel[Circle, Int]("radius")
 
An attribute can be considered as a path component for navigation from parent to child instance. If the child instance has it's own attributes, we can navigate further. A pair of adjacent attributes can be combined to a single relation between the grandparent and the grandchild - Rel2(attr1, attr2).

   case class Rel2[-L, M, R](_1: Relation[L, M], _2: Relation[M, R])
      extends Relation[L, R]
 
DSL has an extension method `/` for convenient composition of relations (it constructs Rel2).

The defined Relation fits very well into RDF/OWL ontology. It is the middle component of the triple:
(identifier of an instance of type L, identifier of property Relation[L,R], identifier of an instance of type R).

Let's see what can we do with identifiers of instances.

Strongly-typed identifiers

When we have partial object description with a set of attributes, it is very important to match different set of attributes with the same instance. We need some way to model the authenticity of the instance. When we use ordinary classes all attributes are localized within the same instance and that is enough. In databases it is possible to obtain partial set of attributes and so the necessity to relate attributes to a particular object is obvious. Often a surrogate ID is used. The object obtains a special attribute that has unique value throughout all available instances.

Here we can also employ the database approach and use some attribute as an identifier. As far as our attributes are bound to the containing class, it is also necessary to bind the type of the identifier to the same entity type. This will enable compile-time safety when dealing with an instance identifier and ascribed attributes.

A simple way to declare identifier type is to use single type of entity as a generic type parameter:

   trait Id[T]

However, this is not really general. Firstly, many objects do not have global identifiers. They can only be found within some scope, relative to parent. Secondly, some objects can be identified by different attributes.
 What if we use some Relation[?,?] as a way of navigating from parent to child?

Let's have a collection of children as an attribute of some Parent:

   val children = Rel[Parent, Seq[Child]]("children")

and let's introduce a relation between the collection and it's elements:
 
   case class IntId[T](id: Int) extends Relation[Seq[T], T]

Note that unlike attributes where the relation is differentiated with name:String, the relation between the collection and it's elements can be "named" by position of the element within the collection.

These two relations combined together provide us with the necessary tool to identify a particular child element:

  val child123 = children / IntId(123)

child123 - is a composite identifier of an object [Child] relative to some instance of the type Parent. It is 123-rd element of the children's collection. (Here again the DSL composition method `/` is used). This way of identification allows us to navigate from the given parent to the required child.

What if we need to identify children with an arbitrary attribute instead of the position within a collection? For example if we know that the "name" attribute is unique, we could refer to children by names. In database management systems unique indexes are the typical solution. Let's implement the similar approach here.

Let's have

  trait IndexedCollection[TId, T]

- a special type (without implementation), that simply denotes some collection of elements T that can be accessed by some index value of type TId. Let's introduce some attribute that will "contain" the index of some collection. The attribute's L type should be the type of collection - Seq[T], the attribute's R type should be the IndexedCollection[TId, T]:

  case class Index[TId, T](keyProperty: Relation[T, TId])
    extends Relation[Seq[T],IndexedCollection[TId, T]]

Note that the relation identifies the index that is based on the keyProperty. The only thing to define is an index value that can be used as a key to find elements:
 
  case class IndexValue[TId, T](value:TId)
    extends Relation[IndexedCollection[TId, T], T]

Example:
 
   val name = Rel[Child, String]("name") // declare a simple property
  
   val childByName = name.index          // create an index identifier. The index is based on the property values.

   val childJohn = parent / children / childByName / IndexValue("John")

Hence, the type Relation[-L,R] with a couple of descendents, allows us to navigate hierarchical data structures. To identify top level objects that do not have any parent, we introduce an auxiliary type Global that will be the source of references to top level objects:
 
   final class Global

   val persons = Rel[Global, Seq[Person]]("persons")
   val otherTopLevelObjects =
        Rel[Global, Seq[OtherTopLevelObject]]("otherTopLevelObjects")

The scheme of data

The double-typed relations are the building blocks that can be used for both the data structures and the scheme construction. The scheme can be either relational or object-oriented. The object-oriented scheme is based on a few descendents of Scheme[T]:  
   SimpleScheme[T] - for simple types without attributes;  
   RecordScheme[T] - for composite types with attributes;
   CollectionScheme[T] - for attributes of type Seq[T] in order to bind element scheme.

Data representation

The metainformation cannot hold any data. To use it we need some kind of storage. The storage depends on application requirements and can be any of the following:
  • ordinary classes with getters/setters. The access to the data can be performed either via reflection by property name, or with lenses;
  • special universal classes (descendents of Instance[T]) that resembles maps. SimpleInstance, RecordInstance, CollectionInstance are used to store data that satisfies corresponding scheme. These types simplify generic data handling because the data is stored directly according to the scheme;
  • linear tuple (List[Any]), and a "list of lists". Hierarchical records (objects) can be flattened to a linear structure - the sequence of primitive types. Inner collections can be represented as a list of lists of inner types. This representation can be used to send data over network and to save data to database tables. To convert Instances to and from flat lists a pair of operations is used - align/unalign (flatten);
  • database tables, RecordSet's;
  • JSON-objects;
  • XML-nodes.

DSL for data construction

While constructing records from pieces it is important to have the compiler checking type compatibility. We should only be able to use properties on those types on which they have been defined. (It is the main reason to have the left generic-type on Relation definition). That's why we need special tools for building instances. For example:

  val b1 = empty[Box]
   .set(width, simple(10))
   .set(height, simple(20))

Here an immutable type Instance[Box] is used. The instance is recreated each time when we add a pair (property, value). When we have more data, it is more efficient to have a mutable Builder that is converted to Instance afterwards:
 
   val boxBuilder = new Builder(boxSchema)
   boxBuilder.set(width, simple(10))
   boxBuilder.set(height, simple(20))
   val b1 = boxBuilder.toInstance
 
The builder also performs runtime checks:
  1. Denying the use of properties that are not available in the scheme;
  2. Final check for completeness of the instance.

Conclusion

The approach allows
  • to define complex domains with explicit and convenient metainformation;
  • to manipulate properties with compile-time check;
  • to represent hierarchical data structures (like json or xml) with strong types at every layer;
  • to represent incomplete information and to check the level of completeness (for example we may test an instance against a smallSchema[T] and a fullSchema[T]).
The main advantage of this approach is that we can deal with infinitely many data tree structures in a typesafe way. We get the convenience of maps combined with the compiler support as strong as in ordinary classes.