SPARQLMotion - Core Vocabulary

Version 2.1.0, February 01, 2010
Contact: Holger Knublauch <holger@topquadrant.com>

Abstract

This document contains a technical description of the SPARQLMotion language, an RDF-based scripting language with a graphical notation to describe data processing pipelines. It introduces the core classes and properties that are used to represent SPARQLMotion scripts, and defines how SPARQLMotion engines will interpret them. Note that this document alone may not be a good starting point to learn the actual use of SPARQLMotion. Instead, it acts as a reference for users who need to fully understand the details and internals of a SPARQLMotion engine.

This document is part of the SPARQLMotion Specification.



1 Introduction

The SPARQLMotion Core Vocabulary is part of the SPARQLMotion specification, which is outlined in the SPARQLMotion Overview page. This vocabulary does not describe the various types of modules, which are included in the SPARQLMotion Standard Module Library.

The SPARQLMotion language itself is a fairly light-weight collection of classes and properties used to represent SPARQLMotion scripts in RDF.

The SPARQLMotion system vocabulary is found in the namespace http://topbraid.org/sparqlmotion#, which is typically abbreviated with the prefix sm.

This system vocabulary is associated with semantics, which instruct the execution engine how to process (or display) SPARQLMotion scripts. The remainder of this document provides details on the various parts of the system vocabulary and their semantics.

 

2 Modules and Scripts

SPARQLMotion scripts consist of modules. Each module represents a single processing step. Modules can be linked together with various relationships, as shown in the figure below.

The figure above displays a visual rendering of five SPARQLMotion modules, represented as rectangular nodes in the diagram and linked with the relationships sm:next and sm:body. The visual rendering above is, however, just one way of interpreting SPARQLMotion scripts: the ultimate storage format of scripts is entirely as RDF models. The following sub-sections provide details on those general concepts.

2.1 Modules

A SPARQLMotion module is an instance of a module type. Module types are (RDFS) classes that have the metaclass sm:Module. Module types define properties that the SPARQLMotion user needs to fill in at the instance level to control the module's behavior.

The class sm:Modules (note the 's' at the end!) serves as an "abstract" base class of the various module types. It can be used as range or domain of properties, but has no other formal meaning.

A SPARQLMotion execution engine has a registry of known module types. For example, a collection of Standard Module Types may be coded into the engine in Java. The executing engine will call the appropriate implementations at each step. SPARQLMotion module libraries may also define sub-classes of existing module types. Unless a more specific implementation exists, the SPARQLMotion engine should in this case execute the implementation for the superclass. This makes it possible to specialize existing module types without having to add a low-level implementation to the engine. The subclasses may set default values of properties expected by the superclass, e.g. using SPIN constructors. Some modules, including sml:PostRequest, will even walk through the properties defined by their subclasses to build a list of request arguments.
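The superclass fallback described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the dictionaries SUBCLASS_OF and REGISTRY, the function resolve_implementation and the subclass ex:PostToMyService are hypothetical names, not part of any SPARQLMotion engine API, and the sketch assumes each module type has at most one direct superclass.

```python
# Hypothetical sketch: find the implementation for a module type by walking
# up rdfs:subClassOf links until a registered implementation is found.

# Assumed class hierarchy (child -> parent); ex:PostToMyService is made up.
SUBCLASS_OF = {
    "ex:PostToMyService": "sml:PostRequest",
}

# Assumed registry of module types that have a low-level implementation.
REGISTRY = {
    "sml:PostRequest": "PostRequestImpl",
}


def resolve_implementation(module_type):
    """Return the most specific registered implementation, or None."""
    current = module_type
    seen = set()  # guards against accidental loops in the hierarchy
    while current is not None and current not in seen:
        if current in REGISTRY:
            return REGISTRY[current]
        seen.add(current)
        current = SUBCLASS_OF.get(current)
    return None
```

With these assumed tables, a module of type ex:PostToMyService would be executed by the implementation registered for sml:PostRequest, because no more specific implementation exists.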

In addition to SPARQLMotion module types (sm:Module), scripts can also instantiate SPIN Functions as modules. SPIN functions are another kind of class, and function calls (in SPIN) are instances of those classes. The arguments of the functions (sp:arg1, sp:arg2, etc) can be passed into the function using the same mechanisms as with other module types.

2.2 Module Relationships

SPARQLMotion modules can be chained together in various ways, to instruct the engine that the output of one module is the input to another module. The core vocabulary defines a collection of RDF properties that are used to link modules (instances) with each other.

Note: SPARQLMotion scripts form a directed acyclic graph, i.e. may not contain cycles.
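One way to check the acyclicity requirement is a depth-first search over the module links. The following Python sketch assumes that the links of a script have already been extracted as (subject, object) pairs; the function name has_cycle and this edge-list representation are illustrative assumptions, not something the specification prescribes.

```python
def has_cycle(edges):
    """Return True if the (subject, object) module links contain a cycle."""
    successors = {}
    for subject, obj in edges:
        successors.setdefault(subject, []).append(obj)
        successors.setdefault(obj, [])

    # WHITE = unvisited, GREY = on the current DFS path, BLACK = finished
    WHITE, GREY, BLACK = 0, 1, 2
    color = {node: WHITE for node in successors}

    def visit(node):
        color[node] = GREY
        for nxt in successors[node]:
            if color[nxt] == GREY:
                return True  # back edge: nxt is already on the current path
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in successors)
```

A script editor could run such a check before saving and reject any link that would close a loop.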

2.2.1 sm:next

The most frequently used relationship property is sm:next, which indicates that the first module (subject) is producing input for the second module (object).

The presence of an sm:next triple does not necessarily mean that the first module is executed before the second module. The technical execution order is left to the engine and the module implementations.

2.2.2 sm:child, sm:body, sm:if and sm:else

Some SPARQLMotion modules can spawn off sub-scripts. For example, an iteration module such as sml:IterateOverSelect will repeat a "body" script in each iteration, before it continues the execution of its successors via sm:next. In the case of iterations, the property sm:body should be used. In the case of IF-THEN-ELSE branches, the properties sm:if and sm:else should be used. However, SPARQLMotion does not prescribe specific meaning to any of those properties, apart from the fact that they are sub-properties of sm:child, which helps the engine identify that they describe a parent-child relationship between scripts.

Any of the sm:child properties link a module with a child script by pointing to any module of the child script. This means that it is possible to point to the head or tail or anything in between - for display purposes it is common practice to point to the "start" of the child script. In any case, the child script must be self-contained (i.e. have no backward references into the parent script), and must have a single target module, i.e. exactly one module that does not have any sm:next value.

2.3 Scripts

A SPARQLMotion script is a collection of modules that are connected using any of the module relationships mentioned above. Scripts are usually stored in a single RDF file or graph, but multiple scripts may be stored in the same graph.

Given the acyclic nature of SPARQLMotion scripts, any well-formed script will have at least one module without any successors (sm:next). Those modules are called target modules. A script may have multiple target modules, and users can invoke any one of those target modules separately. In those cases, the execution engine only needs to traverse a sub-set of the modules to create the results, as those branches leading to other target modules can be ignored.
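The notions of target modules and of the sub-set of modules needed for one target can be illustrated with a small Python sketch. It assumes the sm:next triples are given as (subject, object) pairs and ignores sm:child links; the function names target_modules and modules_needed_for are hypothetical.

```python
def target_modules(modules, next_edges):
    """Modules that have no outgoing sm:next triple."""
    with_successor = {subject for subject, _ in next_edges}
    return sorted(m for m in modules if m not in with_successor)


def modules_needed_for(target, next_edges):
    """All modules that can reach `target` via sm:next, plus the target."""
    predecessors = {}
    for subject, obj in next_edges:
        predecessors.setdefault(obj, set()).add(subject)
    needed, stack = set(), [target]
    while stack:
        node = stack.pop()
        if node not in needed:
            needed.add(node)
            stack.extend(predecessors.get(node, ()))
    return needed
```

For a script with the links A to B, A to C and D to C, the target modules are B and C, and invoking B only requires A and B, so the branch through D can be skipped.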

One of the features of SPARQLMotion is that scripts can be displayed and edited visually. The suggested rendering of scripts is using directed graphs, so that modules are represented as nodes, and relationships as edges, as shown in the example above. Each node should uniquely identify the module (e.g. with a label) and also indicate the type of module (e.g. with an icon). Furthermore, input and output variables should be displayed so that users can recognize the data flow between modules. The edges should be labeled to distinguish the various kinds of relationships. In this kind of graphical visualization, the properties sm:nodeX and sm:nodeY should be used to store the coordinates of nodes. The property sm:icon should be used to link a module type (class) with the URL of a display icon.

 

3 Data Flow

SPARQLMotion scripts define a processing pipeline in which data is being produced, processed and consumed. Individual modules are free to do whatever they like in each step: They can produce side effects (such as writing files), modify RDF triples in a graph, change variable bindings or invoke sub-scripts. Changes to the RDF graphs and variable bindings are relevant to the engine, and are covered by the following sub-sections.

3.1 RDF Graphs

Most SPARQLMotion modules operate on RDF graphs. They can query RDF graphs and may write to them. These RDF graphs might be derived from files, point to a database, or be entirely virtual, in the sense that they only exist during the execution of a script. In terms of an implementation, it only matters that those graphs implement the usual triple-level graph functions, e.g. as defined by the Jena Graph interface.

Each SPARQLMotion module also represents an RDF graph. For example, the module sml:ImportRDFFromURL represents the graph loaded from a given URL. When invoked, the module may load the file from the web and then pass this graph to the modules specified by sm:next. These next modules may take the loaded graph as input and run SPARQL queries over it. In those SPARQL queries, the input graph is the default named graph, i.e. it will be queried in the WHERE clause if no other graph has been specified (e.g. using FROM or SERVICE keywords).

Many SPARQLMotion modules do not manipulate their input graph, and simply pass it on to their successors unmodified. Other modules may completely replace the input graph with some other graph that is then passed to downstream modules. Some modules may not even produce any graph and simply represent the empty graph.

If a module has multiple incoming sm:next triples, then the input graphs will be merged (logically), forming a union graph. Engines may optimize this step by merging multiple in-memory graphs into a single graph, or pruning empty sub-graphs.
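As a rough illustration of the logical union, the following Python sketch models each graph as a set of (subject, predicate, object) tuples. This in-memory representation and the function name merge_input_graphs are assumptions made for the example; a real engine would more likely work with a graph interface such as the one mentioned in section 3.1.

```python
def merge_input_graphs(graphs):
    """Union of several graphs, each modelled as a set of (s, p, o) triples."""
    union = set()
    for graph in graphs:
        if graph:  # skipping empty graphs mirrors the pruning optimization
            union |= graph
    return union
```

Because sets discard duplicates, a triple that occurs in several input graphs appears only once in the merged result.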

3.2 Variables

In addition to RDF graphs, which are implicitly passed from module to module, modules can also communicate by passing variable bindings. A variable binding is a name/value pair, in which the variable name follows the usual SPARQL variable naming rules. The values of those variable bindings can be anything, but the officially supported default types are:

Many other data types such as file names can often be represented by means of RDF literals or URI references. According to the contract, any SPARQLMotion engine must be able to convert variable values to RDF nodes, so that they can be part of SPARQL queries. For example, in the case of XML nodes, a suitable string rendering must be derived by serializing the nodes.

SPARQLMotion modules can create variable bindings and thus pass new values to their successors. In a typical case such as sml:BindWithConstant, modules only create a single new variable binding as the "result" of their execution. This result variable is typically represented using the property sm:outputVariable. However, modules do not have to declare the variables that they bind. For example, sml:BindBySelect may bind any number of variables, as specified by the variables appearing in its SELECT clause.

When a SPARQLMotion module executes a SPARQL query, then the current variable bindings from all its predecessors will be pre-bound as query variables. In the example figure above, the module Set initial text binds the variable text with some value, and this value could be queried as ?text in the WHERE clause of the query in Iterate over persons.

 

4 Module Properties

SPARQLMotion modules are instances of module classes. These classes should formally define the properties that script designers should use to configure the behavior of the module. Most modules have at least one property, e.g. sml:ImportRDFFromURL has a property sml:url containing the URL to load from. Those property values can be specified as triples in the module instance, as illustrated by the following example module (in Turtle notation):

:ImportKennedys
      a       sml:ImportRDFFromURL ;
      rdfs:label "Import kennedys"^^xsd:string ;
      sm:next :IterateOverPersons ;
      sm:nodeX 5 ;
      sm:nodeY 2 ;
      sml:url "http://topbraid.org/examples/kennedys"^^xsd:string .

Some modules may not have any properties, and simply operate on the RDF input, with a pre-defined (fixed) behavior.

Since the access to the property values is done by the implementation, modules can query any property of the module that they like. However, it is strongly recommended that module types explicitly declare the properties that they expect. The property spin:constraint is used to link a module class with a property. The values of spin:constraint are typically the SPIN templates spl:Attribute or spl:Argument, both of which are described in the following sub-sections. Note that spl:Argument properties carry special semantics that are hard-coded in the SPARQLMotion engine.

4.1 Attributes

The SPIN template spl:Attribute is used to declare properties that are filled in by the script designer. Attributes are typically used for SPARQL queries, child relationships and any multi-valued property. For example, sml:IterateOverSelect defines two attributes as shown in the following Turtle snippet:

sml:IterateOverSelect
      a       sm:Module ;
      rdfs:comment "..."^^xsd:string ;
      rdfs:label "Iterate over select"^^xsd:string ;
      rdfs:subClassOf sml:ControlFlowModules ;
      spin:constraint
              [ a       spl:Attribute ;
                rdfs:comment "The body of the iteration loop."^^xsd:string ;
                spl:maxCount 1 ;
                spl:minCount 1 ;
                spl:predicate sm:body
              ] ;
      spin:constraint
              [ a       spl:Attribute ;
                rdfs:comment "A SPARQL Select query that ...."^^xsd:string ;
                spl:maxCount 1 ;
                spl:minCount 1 ;
                spl:predicate sml:selectQuery
              ] ;
      ...

For readers familiar with OWL, this is comparable to OWL Restrictions, using rdfs:subClassOf instead of spin:constraint, spl:predicate instead of owl:onProperty and spl:maxCount instead of owl:maxCardinality. In contrast to OWL though, the spl:Attribute template carries strict closed-world semantics, suitable for specifications.

The value type of those attributes can be either specified locally, using spl:valueType, or using global rdfs:range statements (used but not shown above).
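The closed-world reading of spl:minCount and spl:maxCount can be sketched as a simple cardinality check in Python. The sketch assumes the property values of one module instance are available as a list of (predicate, value) pairs; the function check_attribute and this representation are hypothetical and only meant to show the idea.

```python
def check_attribute(module_values, predicate, min_count=0, max_count=None):
    """Return a list of violation messages for one spl:Attribute declaration."""
    # Closed world: only the values actually present on the module count.
    count = sum(1 for p, _ in module_values if p == predicate)
    violations = []
    if count < min_count:
        violations.append(
            f"{predicate}: expected at least {min_count}, found {count}"
        )
    if max_count is not None and count > max_count:
        violations.append(
            f"{predicate}: expected at most {max_count}, found {count}"
        )
    return violations
```

Under this reading, an sml:IterateOverSelect instance without any sml:selectQuery value is reported as a violation, whereas an open-world OWL reasoner would merely assume that an unknown value exists.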

4.2 Arguments

Most module properties in SPARQLMotion are declared using the SPIN Template spl:Argument. Arguments can take at most one value, and the boolean field spl:optional indicates whether that value may be omitted. The following Turtle snippet shows the declaration of the module sml:ImportRDFFromURL.

sml:ImportRDFFromURL
      a       sm:Module ;
      rdfs:comment "Gets RDF data from a given URL. The URL..." ;
      rdfs:label "Import RDF from URL"^^xsd:string ;
      rdfs:subClassOf sml:ImportFromRemoteModules ;
      spin:constraint
              [ a       spl:Argument ;
                rdfs:comment "The URL of the RDF source..."^^xsd:string ;
                spl:predicate sml:url
              ] .

As with attributes, the above is similar to OWL restrictions. Unlike attributes, module instances do not need to specify the actual value of the property as an explicit triple at the instance. Instead, the value can be computed dynamically at execution time, as explained in the following sub-sections.

4.2.1 Blank Arguments

If a module instance has no declared value for a property, then the execution engine will check if there is a bound variable with the same name as the local name of the property in the current scope. For example, if a module expects a value for sml:url and a predecessor of the module has created a binding for ?url, then this binding will be inserted into the module at run time. Using this mechanism, it is possible to link modules easily and conveniently.
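The lookup rule for blank arguments can be sketched in Python as follows. The sketch assumes property names are written as prefixed names such as sml:url, that the declared values of a module are held in a dictionary, and that the current variable bindings are a dictionary keyed by variable name without the leading question mark. The helper names local_name and resolve_argument are hypothetical.

```python
def local_name(qname):
    """Local name of a prefixed name such as 'sml:url' -> 'url'."""
    return qname.split(":", 1)[-1]


def resolve_argument(module_values, predicate, bindings):
    """Declared value if present, else the binding named like the property."""
    if predicate in module_values:
        return module_values[predicate]
    # No explicit triple: fall back to a variable with the same local name.
    return bindings.get(local_name(predicate))
```

An explicitly declared value always wins in this sketch; the variable binding is only consulted when the module instance leaves the property blank.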

Most modules follow some naming conventions on the declared properties to make it likely that the output variable of one module matches the expected input properties of another module. For example, the property sml:text is typically used as a property on modules that process text. Modules that produce text, such as sml:ImportText, have ?text as their default value for sm:outputVariable.

Although convenient, the disadvantage of this approach is that the link between modules and their variable bindings is neither very transparent nor flexible. In particular, in many use cases the names of output and input variables do not match, so that the following alternatives are often a better choice.

4.2.2 String Template Arguments

Many modules operate on string arguments. For example, modules of type sml:ImportRDFFromURL use the property sml:url to retrieve the URL, stored as xsd:string. In SPARQLMotion, those strings may be String Templates, with inline variable names. For example, the value of sml:url could be http://example.org/{?fileName}.rdf where ?fileName is the name of a bound variable. The SPARQLMotion engine will interpret those string arguments and apply string substitutions based on the current bindings. For example, if the variable ?fileName has the value "test", then the URL above becomes http://example.org/test.rdf. Names of unbound variables will be substituted by empty strings.

It is up to the SPARQLMotion module implementation to decide how to substitute string templates. In particular, many modules do not insert the variable bindings verbatim but instead escape URL characters etc.
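A minimal Python sketch of the basic substitution rule is shown below. It inserts values verbatim and therefore does not model the module-specific escaping mentioned above. The function name expand_template is hypothetical, and the regular expression accepts only a simplified subset of the SPARQL variable naming rules.

```python
import re

# Matches {?name}; the name pattern is a simplification of SPARQL's rules.
TEMPLATE_VARIABLE = re.compile(r"\{\?([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_template(template, bindings):
    """Replace each {?name} with its bound value; unbound names become ''."""
    return TEMPLATE_VARIABLE.sub(
        lambda match: str(bindings.get(match.group(1), "")), template
    )
```

With the binding fileName set to "test", the template http://example.org/{?fileName}.rdf expands to http://example.org/test.rdf, matching the example in the text; without that binding the placeholder is replaced by an empty string.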

4.2.3 SPARQL Expression Arguments

Instead of a direct property value as an RDF resource, literal or string template, SPARQLMotion allows script designers to assign the property through SPARQL expressions. Those SPARQL expressions must be stored as blank nodes using the SPIN RDF Syntax. For example, the following Turtle snippet shows the use of a SPARQL function call to dynamically compute the value of sml:mimeType:

:ReturnTheText
      a       sml:ReturnText ;
      rdfs:label "Return the text"^^xsd:string ;
      sml:mimeType
              [ a       smf:if ;
                sp:arg1 [ sp:varName "xml"^^xsd:string
                        ] ;
                sp:arg2 "xml" ;
                sp:arg3 "text/html"
              ] .

In a more readable form, the module above can be rendered as a form:

At execution time, the engine will invoke the example SPARQL function smf:if and use its result as value for sml:mimeType.

Any SPARQL expression, including variables, function calls and built-in mathematical operations can be used in those expressions. User interfaces should render those SPARQL expressions between { and } so that they can be readily distinguished from constant values.

4.2.4 SPARQL Query Arguments

Taking the idea of SPARQL Expression Arguments further, SPARQLMotion also allows users to insert arbitrary SPARQL SELECT queries into arguments. Again the SPIN RDF Syntax is used to represent those queries, although it is also possible to use SPIN Templates. If the value of an argument property is a SPARQL query, then the engine will evaluate this query when the module executes, and use the first binding of the first result variable in the SELECT clause as value of the argument.

The following example module shows an equivalent SPARQL query to the example from above (for brevity as a form):

In this example, the first binding for ?result will be used as sml:mimeType when the module is executed. Any other results or variables produced by the query will be ignored.
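The selection rule can be sketched in Python, assuming the query has already been evaluated elsewhere and its result is available as a list of SELECT variable names plus a list of rows (dictionaries from variable name to value). The function name argument_value_from_select is hypothetical, and skipping rows that leave the first variable unbound is an assumption of this sketch rather than something the text specifies.

```python
def argument_value_from_select(result_vars, rows):
    """First binding of the first SELECT variable, or None if there is none."""
    if not result_vars:
        return None
    first_var = result_vars[0]
    for row in rows:
        # Assumption: rows that do not bind the first variable are skipped.
        if first_var in row:
            return row[first_var]
    return None
```

All other variables and all later rows are ignored, which matches the statement that any other results produced by the query are discarded.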

 

5 User-Defined SPARQLMotion Functions

SPARQLMotion scripts are typically driven by SPARQL queries. This enables script designers to exploit the full range of SPARQL features and extensions at execution time. In particular it is possible to call user-defined functions, including SPIN Functions. Such SPIN functions are typically based on a spin:body - a nested SPARQL query that is executed whenever the function is invoked. However, SPARQLMotion also provides a mechanism for defining new SPARQL/SPIN functions that are backed by a complete SPARQLMotion script instead.

User-Defined SPARQLMotion Functions are declared like other SPIN Functions. The difference is that instead of a spin:body, the function must point to the target module of a SPARQLMotion script, using sm:returnModule. When the function is invoked, the return module will be launched as a SPARQLMotion script, and the result of the target module will be used as function call result value. How this result value is being computed is not defined by the SPARQLMotion Core specification and not all kinds of return modules are permitted. A popular choice is to use sml:ReturnNode as an end point.

Since SPIN functions are instances of the function class, and SPIN functions can also be used as SPARQLMotion modules, it is also possible to directly insert user-defined SPARQLMotion functions into scripts.

 

Appendix: Reference

The URL of the SPARQLMotion Core Schema is http://topbraid.org/sparqlmotion