вторник, 22 июля 2014 г.

Double-typed relations for partial data representation

During enterprise system's development it is often the case that the data instances are represented as ValueObjects (or case classes). For example:

   case class Person(name: String, address: Address)

This representation has advantages:
  •  strongly typed data access,
  •  meta information binding is available through annotations,
and drawbacks:
  • if there are many entities then the data handling becomes routine (copy-paste);
  • annotations are rather inconvenient as a means of metainformation representation;
  • partial data cannot be represented easily - we need either to use empty values for unavailable fields, or create additional classes with required attributes;
  • it is difficult to represent the change of a property.
We would like to have a framework that allows us to represent partial information about entities irredundantly and typesafe. In this post we propose such an approach that has been implemented in the library synapse-frames.

Double-typed relations

We would like to create a model of object-oriented data structure. In our model every class, an instance, a property and any other aspects of the structure are represented as objects. To abstract the property declaration as an object (rather than a method or a field definition) we need to use some type. As soon as the property declaration contains the data type T, it is necessary to include it as a generic argument to the property type. Thus a type of an object that can represent the property itself could be something like Slot[T]. The single type is used only to constrain the data that the property can contain. However, the property is usually defined within the object class and in fact it is implicitly bound to the containing class.
Let's model the property as a relation parametrized with two types - the type L of the container and the type R of the value:

   sealed trait Relation[-L,R]
   case class Rel[-L, R](name: String) extends Relation[L, R]
 
(-L means "contravariance", i.e. the property will also be available for descendants of L. The property is invariant on type R because for the property getters and setters are available.)

The Rel class is the way to declare simple ordinary attributes of type L. For example:

   class Box

   val width = Rel[Box, Int]("width")
   val height = Rel[Box, Int]("height")

(Thanks to the contravariance, these properties are also available for descendents of the Box). (Note also that the class itself needs no methods nor properties.)

The property can also contain any application specific metainformation - database domain, property description, serializer/deserializer, constraint, column width, data format, etc. It is also easy to bind additional layer-specific metainformation using Maps.

For the type parameter L we need some real type. Any valid Scala type can be used here - a primitive type, a type alias, a trait, an abstract/final class, an object.type, etc. Contravariance of the type parameter L allows us to leverage inheritance relation of the types used. It seems to be convenient to reflect relations between domain entities directly by a hierarchy of classes and traits:

   abstract class Shape
   trait BoundingRectangle

   final class Rectangle extends Shape with BoundingRectangle
   final class Circle extends Shape with BoundingRectangle

   val width = Rel[BoundingRectangle, Int]("width")
   val height = Rel[BoundingRectangle, Int]("height")

   val radius = Rel[Circle, Int]("radius")
 
An attribute can be considered as a path component for navigation from parent to child instance. If the child instance has it's own attributes, we can navigate further. A pair of adjacent attributes can be combined to a single relation between the grandparent and the grandchild - Rel2(attr1, attr2).

   case class Rel2[-L, M, R](_1: Relation[L, M], _2: Relation[M, R])
      extends Relation[L, R]
 
DSL has an extension method `/` for convenient composition of relations (it constructs Rel2).

The defined Relation fits very well into RDF/OWL ontology. It is the middle component of the triple:
(identifier of an instance of type L, identifier of property Relation[L,R], identifier of an instance of type R).

Let's see what can we do with identifiers of instances.

Strongly-typed identifiers

When we have partial object description with a set of attributes, it is very important to match different set of attributes with the same instance. We need some way to model the authenticity of the instance. When we use ordinary classes all attributes are localized within the same instance and that is enough. In databases it is possible to obtain partial set of attributes and so the necessity to relate attributes to a particular object is obvious. Often a surrogate ID is used. The object obtains a special attribute that has unique value throughout all available instances.

Here we can also employ the database approach and use some attribute as an identifier. As far as our attributes are bound to the containing class, it is also necessary to bind the type of the identifier to the same entity type. This will enable compile-time safety when dealing with an instance identifier and ascribed attributes.

A simple way to declare identifier type is to use single type of entity as a generic type parameter:

   trait Id[T]

However, this is not really general. Firstly, many objects do not have global identifiers. They can only be found within some scope, relative to parent. Secondly, some objects can be identified by different attributes.
 What if we use some Relation[?,?] as a way of navigating from parent to child?

Let's have a collection of children as an attribute of some Parent:

   val children = Rel[Parent, Seq[Child]]("children")

and let's introduce a relation between the collection and it's elements:
 
   case class IntId[T](id: Int) extends Relation[Seq[T], T]

Note that unlike attributes where the relation is differentiated with name:String, the relation between the collection and it's elements can be "named" by position of the element within the collection.

These two relations combined together provide us with the necessary tool to identify a particular child element:

  val child123 = children / IntId(123)

child123 - is a composite identifier of an object [Child] relative to some instance of the type Parent. It is 123-rd element of the children's collection. (Here again the DSL composition method `/` is used). This way of identification allows us to navigate from the given parent to the required child.

What if we need to identify children with an arbitrary attribute instead of the position within a collection? For example if we know that the "name" attribute is unique, we could refer to children by names. In database management systems unique indexes are the typical solution. Let's implement the similar approach here.

Let's have

  trait IndexedCollection[TId, T]

- a special type (without implementation), that simply denotes some collection of elements T that can be accessed by some index value of type TId. Let's introduce some attribute that will "contain" the index of some collection. The attribute's L type should be the type of collection - Seq[T], the attribute's R type should be the IndexedCollection[TId, T]:

  case class Index[TId, T](keyProperty: Relation[T, TId])
    extends Relation[Seq[T],IndexedCollection[TId, T]]

Note that the relation identifies the index that is based on the keyProperty. The only thing to define is an index value that can be used as a key to find elements:
 
  case class IndexValue[TId, T](value:TId)
    extends Relation[IndexedCollection[TId, T], T]

Example:
 
   val name = Rel[Child, String]("name") // declare a simple property
  
   val childByName = name.index          // create an index identifier. The index is based on the property values.

   val childJohn = parent / children / childByName / IndexValue("John")

Hence, the type Relation[-L,R] with a couple of descendents, allows us to navigate hierarchical data structures. To identify top level objects that do not have any parent, we introduce an auxiliary type Global that will be the source of references to top level objects:
 
   final class Global

   val persons = Rel[Global, Seq[Person]]("persons")
   val otherTopLevelObjects =
        Rel[Global, Seq[OtherTopLevelObject]]("otherTopLevelObjects")

The scheme of data

The double-typed relations are the building blocks that can be used for both the data structures and the scheme construction. The scheme can be either relational or object-oriented. The object-oriented scheme is based on a few descendents of Scheme[T]:  
   SimpleScheme[T] - for simple types without attributes;  
   RecordScheme[T] - for composite types with attributes;
   CollectionScheme[T] - for attributes of type Seq[T] in order to bind element scheme.

Data representation

The metainformation cannot hold any data. To use it we need some kind of storage. The storage depends on application requirements and can be any of the following:
  • ordinary classes with getters/setters. The access to the data can be performed either via reflection by property name, or with lenses;
  • special universal classes (descendents of Instance[T]) that resembles maps. SimpleInstance, RecordInstance, CollectionInstance are used to store data that satisfies corresponding scheme. These types simplify generic data handling because the data is stored directly according to the scheme;
  • linear tuple (List[Any]), and a "list of lists". Hierarchical records (objects) can be flattened to a linear structure - the sequence of primitive types. Inner collections can be represented as a list of lists of inner types. This representation can be used to send data over network and to save data to database tables. To convert Instances to and from flat lists a pair of operations is used - align/unalign (flatten);
  • database tables, RecordSet's;
  • JSON-objects;
  • XML-nodes.

DSL for data construction

While constructing records from pieces it is important to have the compiler checking type compatibility. We should only be able to use properties on those types on which they have been defined. (It is the main reason to have the left generic-type on Relation definition). That's why we need special tools for building instances. For example:

  val b1 = empty[Box]
   .set(width, simple(10))
   .set(height, simple(20))

Here an immutable type Instance[Box] is used. The instance is recreated each time when we add a pair (property, value). When we have more data, it is more efficient to have a mutable Builder that is converted to Instance afterwards:
 
   val boxBuilder = new Builder(boxSchema)
   boxBuilder.set(width, simple(10))
   boxBuilder.set(height, simple(20))
   val b1 = boxBuilder.toInstance
 
The builder also performs runtime checks:
  1. Denying the use of properties that are not available in the scheme;
  2. Final check for completeness of the instance.

Conclusion

The approach allows
  • to define complex domains with explicit and convenient metainformation;
  • to manipulate properties with compile-time check;
  • to represent hierarchical data structures (like json or xml) with strong types at every layer;
  • to represent incomplete information and to check the level of completeness (for example we may test an instance against a smallSchema[T] and a fullSchema[T]).
The main advantage of this approach is that we can deal with infinitely many data tree structures in a typesafe way. We get the convenience of maps combined with the compiler support as strong as in ordinary classes.

среда, 2 октября 2013 г.

DataFlow native concurrency has been implemented in SynapseGrid

As already have been discussed there are a few possible ways to run the StaticSystem (synaptic grid) on a multicore processor. The one that was implemented was running subsystems inside Akka actors.

However, this is not the finest grained concurrency possible. The whole subsystem will run inside a single thread (the current actor's thread) and it can become a bottle neck for other subsystems.

What is the least element that can run in parallel? For SynapseGrid it is a function that connects two contacts. If it is pure function then it can safely run in a separate thread.

Special care should be taken when dealing with states. The access to states should be serialized according to the principle "happen-before". We cannot read the state value before all computations that could influence it have completed. And no other computation that requires this state can start until we complete this computation. However, all the other computations can run in parallel. In particular if two computations use different states then they can run in parallel.

Recently this fine-grained approach has been implemented in SynapseGrid (synapse-grid-concurrent module). To run a system in some ExecutionContext one converts it to a parallel SimpleSignalProcessor:

   import scala.concurrent.ExecutionContext. Implicits.global
   val f = system.toStaticSystem. toParallelSimpleSignalProcessor. toMapTransducer(input, output)
   val y = f(x)

That's it. Now the whole system runs with fine-grained parallelism. To compare that the results are absolutely the same, the same system can also be converted to a traditional single-threaded version:

   val g = system.toStaticSystem. toSimpleSignalProcessor. toMapTransducer(input, output)
   val y2 = g(x)
   assert(y === y2)

The concurrency implementation guarantees that the results will be the same.

The tests for the module contains examples of how it works.

The dependency for the synapse-grid-concurrent module:

"ru.primetalk" % "synapse-grid-concurrent" % "1.3.0"

The source code is published on GitHub.

Ярлыки: , , , , ,

четверг, 19 сентября 2013 г.

Functional Reactive Programming: Introduction to SynapseGrid

Functional programming paradigm opens new ways of building complex software systems. We wish to keep all operations side-effect-free and have all benefits of immutability. Also we want a way to compose functional building blocks into bigger systems that retain the same properties.

Traditional functional composition

Constructing a complex system from functions in a traditional way looks something like:
    def f3(x) = f2(f1(x))
or
    def g3(xs:List) = xs.map(f1).flatMap(f2)
or
    def h3(xs:List) = 
      for { x <- xs
            ys = f1(x)
            y <- ys
      }
         yield f2(y)

What's wrong with it? Well, not too much. This approach works pretty good in many situations. However, what to do if we want
  1. to split the flow of data into two chains?
  2. to implement an arbitrary DataFlow processing?
  3. to modularise the construction of processing function?
The construction code becomes a bit uglier.

Builder / Runtime system separation

During the construction phase of a complex system we may tolerate mutable state, because the construction is usually single-threaded, but during runtime we want to avoid it as much as possible.
The steps looks as follows:
  1. We construct a system using an advanced DSL.
  2. Convert the system into runtime representation.
  3. Run the system on some data.
What can advanced DSL give us?
  1. Arbitrary DataFlow graph of a system can be constructed incrementally.
  2. The construction can be done in separate (/nested) modules.

System construction DSL

Of course for simple cases we wish to retain the usual DSL with maps and flatMaps. But in order to have builder/runtime separation the map and flatMap methods are replaced. Now they do not immediately execute their operation but instead defer execution to runtime.
This system can be constructed with
    val len = myContact.map( _.length )
and:
    Input >> myContact
    len >> Output

Runtime system

The system is constructed within a mutable Builder and can be converted to static immutable system definition with the method toStaticSystem. The system definition can the be statically analyzed (converted to the above picture for instance). Or further converted to a simple function.
    val s = sb.toStaticSystem
    val fun = s.toDynamicSystem.toMapTransducer(Input, Output)The fun has type of a function and can be immediately used in other parts of the program:
    println("fun(hello)="+fun("hello"))

More info

More examples on GitHub.


Ярлыки: , , ,

SynapseGrid

This blog is about SynapseGrid library. SynapseGrid is a framework for constructing reactive real-time immutable data flow systems (see Data flow programming).
The library allows construction of complex systems from building blocks as simple as functions.

Ярлыки: , , ,