wiki:Database Schema

Some thoughts on the database schema

The user-modifiable data is stored in a graph database, containing "nodes" and "relationships". Each node and relationship has properties. Relationships have types.

In the Java code that deals with the graph, objects that represent nodes are called "Vertex", and objects that represent relationships are called "Edge".

This database model, at the moment, only uses string properties in neo4j.

Structure nodes

The root node of the neo4j graph is connected to three structure nodes using "has-structure" relationships. The three structure nodes are:

  • a node with property "_" : "concepts". All concepts will be linked to this special node using the "is-a" relationship.
  • a node with property "_" : "tags". All concepts that can be used as tags will be linked to this special node using the "is-a" relationship.
  • a node with property "_" : "predicates". All concepts that can be used as a predicate in a triple will be linked to this special node using the "is-a" relationship.
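
The layout above can be sketched with a small in-memory model. The Vertex and Edge classes below are hypothetical stand-ins, not the project's actual classes and not the neo4j API; only the property key "_" and the relationship names are taken from the text. The direction of the "is-a" relationship (from the concept to the structure node) is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Hypothetical in-memory stand-ins; not the project's real Vertex/Edge classes.
class Vertex {
    final Map<String, String> properties = new HashMap<>(); // string properties only
    final List<Edge> outgoing = new ArrayList<>();

    Edge connectTo(Vertex target, String type) {
        Edge edge = new Edge(type, this, target);
        outgoing.add(edge);
        return edge;
    }
}

class Edge {
    final String type;
    final Vertex from;
    final Vertex to;
    final Map<String, String> properties = new HashMap<>();

    Edge(String type, Vertex from, Vertex to) {
        this.type = type;
        this.from = from;
        this.to = to;
    }
}

class StructureNodes {
    static final String[] NAMES = {"concepts", "tags", "predicates"};

    // Creates the three structure nodes and links each to the root node.
    static Map<String, Vertex> bootstrap(Vertex root) {
        Map<String, Vertex> byName = new HashMap<>();
        for (String name : NAMES) {
            Vertex node = new Vertex();
            node.properties.put("_", name);
            root.connectTo(node, "has-structure");
            byName.put(name, node);
        }
        return byName;
    }

    public static void main(String[] args) {
        Vertex root = new Vertex();
        Map<String, Vertex> structure = bootstrap(root);

        // A concept that is also usable as a tag gets two "is-a" relationships.
        Vertex concept = new Vertex();
        concept.properties.put("uuid", UUID.randomUUID().toString());
        concept.connectTo(structure.get("concepts"), "is-a");
        concept.connectTo(structure.get("tags"), "is-a");

        System.out.println(root.outgoing.size() + " structure nodes, "
                + concept.outgoing.size() + " is-a links");
    }
}
```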

Concepts

The concepts in the conceptwiki are stored as "contentless" nodes in a graph database, identified by a UUID (stored as node property "uuid"). A concept has relationships to "content" nodes in the same graph database. The "authorized-by" property can be present to identify that the concept represents a person that corresponds to a valid user of the system.

In the future, if it appears useful, SKOS-like properties may be added to concept nodes. Preferably, this would be done in a way that does not require "modifying" the concept node.

All concepts are connected to the "concepts" structure node using the "is-a" relationship. This relationship does not have properties. If a concept is also usable as a tag, it is connected using the "is-a" relationship to the "tags" structure node. If a concept is also usable as a predicate, it is connected using the "is-a" relationship to the "predicates" structure node.

Relationships between concepts

Triple relationships between concepts are stored as "subject-object" relationships that point from the subject concept to the object concept. The "subject-object" relationship has the following properties:

  • "branch". The branch identifier.
  • "predicate-uuid". The "uuid" of a third concept that functions as predicate.

If the relationship is to be used as a concept, it needs to get a "triple-uuid" property. A concept node with that uuid must then exist too.

Concepts that are actually triples

A concept that is actually a triple is a contentless node with a uuid property like any other concept. In the graph, this concept has exactly three relationships with other concept nodes via "has-subject", "has-predicate" and "has-object" relations (one each). Those relations do not have any properties. The concepts linked via "has-subject" and via "has-object" are connected using a "subject-object" relationship, which has the "triple-uuid" property set to the uuid of the triple concept. A triple-concept can be linked to other content nodes and to other concepts as any other normal concept.
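
As an illustration, the relationships needed to store a triple that is itself a concept could be assembled as follows. This is only a sketch: the Rel class and the method name are hypothetical, and the direction of the "has-subject", "has-predicate" and "has-object" relations (from the triple concept to the other concepts) is an assumption. The relationship types and property names are the ones described above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

class TripleSketch {
    // Hypothetical relationship record: a type, two endpoint uuids, string properties.
    static final class Rel {
        final String type;
        final String fromUuid;
        final String toUuid;
        final Map<String, String> properties = new LinkedHashMap<>();

        Rel(String type, String fromUuid, String toUuid) {
            this.type = type;
            this.fromUuid = fromUuid;
            this.toUuid = toUuid;
        }
    }

    // Returns the four relationships that describe a triple that is also a concept.
    static List<Rel> storeTripleAsConcept(String subject, String predicate, String object,
                                          String branch, String tripleUuid) {
        List<Rel> rels = new ArrayList<>();

        // The triple itself: subject -> object, with the predicate as a property.
        Rel subjectObject = new Rel("subject-object", subject, object);
        subjectObject.properties.put("branch", branch);
        subjectObject.properties.put("predicate-uuid", predicate);
        subjectObject.properties.put("triple-uuid", tripleUuid);
        rels.add(subjectObject);

        // The triple concept points at its three parts; these carry no properties.
        rels.add(new Rel("has-subject", tripleUuid, subject));
        rels.add(new Rel("has-predicate", tripleUuid, predicate));
        rels.add(new Rel("has-object", tripleUuid, object));
        return rels;
    }

    public static void main(String[] args) {
        String subject = UUID.randomUUID().toString();
        String predicate = UUID.randomUUID().toString();
        String object = UUID.randomUUID().toString();
        String triple = UUID.randomUUID().toString();

        for (Rel rel : storeTripleAsConcept(subject, predicate, object, "community", triple)) {
            System.out.println(rel.type + " " + rel.properties);
        }
    }
}
```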

Content

Three kinds of content node are linked to concept nodes in the graph database: notation, label and definition. This may be extended later if desired. The concept links to the content node with a relationship that identifies the role of the content node: 'has-notation', 'has-label' or 'has-definition'. Each relationship is repeated for every branch that supports it (see below to learn more about branches), and the branch is stored as the property "branch" on the relationship.

Content nodes are designed like mail messages (see RFC5322) or HTTP responses. Header fields are stored as properties of the content node. The property names are stored as lower-case ascii strings. The actual content is stored as the special property "content". Content nodes are singletons: no two different content nodes with the same sets of properties should exist.

Supported properties are:

  • content-type. This is the MIME type of the content. If not specified in a node, the default is "text/plain". Content types will be used to look up handlers that are plugins in the framework. If a content type cannot be handled, a user interface that should represent the content will render an error message showing the content-type instead.
  • content-transfer-encoding. If not specified in a node, the default is "utf-8" if the content-type is "text/plain", and "8bit" for other content-types.
  • content-language. The language(s) of the content, as specified in RFC 3282. This should be empty for language-independent content (such as an image without embedded text).
  • content. This is the actual content. It is not allowed to be empty, except when "content-type" is "multipart/*", in which case it must be empty.
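
The defaulting rules in this list can be expressed as a few small lookups. This is a minimal sketch that treats a content node as a plain map of string properties; the class and method names are hypothetical, and it assumes that "content-type" holds a bare MIME type without parameters.

```java
import java.util.HashMap;
import java.util.Map;

class ContentDefaults {
    // The MIME type of the node, defaulting to "text/plain".
    static String contentType(Map<String, String> props) {
        return props.getOrDefault("content-type", "text/plain");
    }

    // An explicit encoding wins; otherwise the default depends on the content type.
    static String transferEncoding(Map<String, String> props) {
        String explicit = props.get("content-transfer-encoding");
        if (explicit != null) {
            return explicit;
        }
        return "text/plain".equals(contentType(props)) ? "utf-8" : "8bit";
    }

    // "content" must be empty for multipart nodes and non-empty for all others.
    static boolean hasValidContent(Map<String, String> props) {
        String content = props.get("content");
        boolean empty = content == null || content.isEmpty();
        boolean multipart = contentType(props).startsWith("multipart/");
        return multipart ? empty : !empty;
    }

    public static void main(String[] args) {
        Map<String, String> label = new HashMap<>();
        label.put("content", "heart");
        label.put("content-language", "en");

        System.out.println(contentType(label));
        System.out.println(transferEncoding(label));
        System.out.println(hasValidContent(label));
    }
}
```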

Content nodes that have a content-type of "multipart/*" have "has-part" relationships with other content nodes that provide the actual parts.

Content nodes are copy-on-write. If the content, or any of the other properties, of a content node that has more than one incoming relation must be modified, a new content node must be created with the new content (or an existing node with the new content must be found to prevent duplication).
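
One way to combine the singleton rule with copy-on-write is an interning pool keyed by the full property set. The sketch below is an in-memory illustration under that assumption, not the actual storage code; ContentPool and its methods are hypothetical names. For simplicity it never modifies a node in place, even when the node has only one incoming relation.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class ContentPool {
    // Canonical content nodes, keyed by their complete property set.
    private final Map<Map<String, String>, Map<String, String>> nodes = new HashMap<>();

    // Returns the unique node with exactly these properties, creating it if needed.
    Map<String, String> intern(Map<String, String> properties) {
        Map<String, String> key = Collections.unmodifiableMap(new TreeMap<>(properties));
        return nodes.computeIfAbsent(key, k -> k);
    }

    // Copy-on-write: the original node is left untouched; the result is either
    // an existing node with the new property set or a newly created one.
    Map<String, String> withProperty(Map<String, String> node, String name, String value) {
        Map<String, String> changed = new TreeMap<>(node);
        changed.put(name, value);
        return intern(changed);
    }

    public static void main(String[] args) {
        ContentPool pool = new ContentPool();
        Map<String, String> first = pool.intern(Map.of("content", "heart"));
        Map<String, String> second = pool.intern(Map.of("content", "heart"));
        Map<String, String> tagged = pool.withProperty(first, "content-language", "en");

        System.out.println(first == second); // same node is reused
        System.out.println(tagged == first); // a changed node is a different node
    }
}
```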

All unrecognized node properties are ignored by design to allow extension of the framework.

If functionality is extended in the future, it will be based on existing RFC 822-like headers and known content types where possible.

Notations

Notations are content nodes that are associated with language-free labels, such as database-identifiers, that (hopefully uniquely) define a concept. Notation nodes should not have a 'content-language' property.

Notation nodes have the following additional property:

  • uri. This specifies the "official" URI for the notation, if such a URI exists.

The "has-notation" relationship does not have any additional properties.

Notation nodes can be further linked to the concept that represents the source of the data. This is done using a "has-source" relationship, which does not have any properties(?), not even a 'branch'.

Labels

Labels are content nodes that are associated with human language terms that can be used to refer to the concept.

Label nodes have no additional properties. The "has-label" relationship has the following additional property:

  • label-state: One of "hidden", "preferred", "alternate", "hyponym" or "hypernym". If not specified, "alternate" is assumed.
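
The "label-state" values and their default map naturally onto an enum. The following is a sketch only; the enum and its method names are hypothetical, and rejecting unknown values with an exception is an assumption rather than documented behaviour.

```java
import java.util.Locale;

enum LabelState {
    HIDDEN, PREFERRED, ALTERNATE, HYPONYM, HYPERNYM;

    // Parses the "label-state" relationship property; a missing value means "alternate".
    // Unknown values cause an IllegalArgumentException (an assumption of this sketch).
    static LabelState fromProperty(String value) {
        if (value == null) {
            return ALTERNATE;
        }
        return valueOf(value.toUpperCase(Locale.ROOT));
    }

    // The lower-case string as it would be stored on the relationship.
    String toProperty() {
        return name().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(fromProperty(null).toProperty());
        System.out.println(fromProperty("preferred").toProperty());
    }
}
```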

Hyponyms and hypernyms should only be included if people actually mistake them for synonyms. "Hidden" terms are ?TODO?.

TODO: Do we need literal/normalizable distinction here?

Definitions

Definitions are content nodes that define the concept in a way that is understandable for humans. This can be in the form of text, but also in any other multimedia way (audio, image, video).

Definition nodes have no additional properties. The "has-definition" relationship does not have additional properties.

History

The history of the data is preserved, but this happens outside of the graph, probably in a relational database table. The graph database always represents the latest state. Older states can be reconstructed by following the history table back in time. For each modification, the following information is stored:

  • The UUID of the concept to which it relates.
  • A time stamp
  • Transaction ID
  • User ID
  • Action type: "add node", "add relationship", "remove node" or "remove relationship"
  • If a node was removed, a unique identifier of the old node
  • If a relationship was removed, a unique identifier of the old relationship
  • If a content node or relationship was added, a serialization of the new node or relationship is stored in an object table, and its unique identifier is added here.

A modification of a node or relationship is stored as a pair formed by a remove and an add operation.
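
A history record and the remove/add pair for a modification might look like the sketch below. All class, field and method names are hypothetical, as is the choice of Java types for the columns; the text only says that the history probably lives in a relational table.

```java
import java.util.List;

class HistorySketch {
    enum Action { ADD_NODE, ADD_RELATIONSHIP, REMOVE_NODE, REMOVE_RELATIONSHIP }

    // One row of the (hypothetical) history table.
    static final class Entry {
        final String conceptUuid;
        final long timestamp;
        final long transactionId;
        final String userId;
        final Action action;
        final String objectId; // removed node/relationship id, or id in the object table

        Entry(String conceptUuid, long timestamp, long transactionId,
              String userId, Action action, String objectId) {
            this.conceptUuid = conceptUuid;
            this.timestamp = timestamp;
            this.transactionId = transactionId;
            this.userId = userId;
            this.action = action;
            this.objectId = objectId;
        }
    }

    // A modification is recorded as a remove followed by an add in the same transaction.
    static List<Entry> modifyRelationship(String conceptUuid, long timestamp, long transactionId,
                                          String userId, String oldRelationshipId,
                                          String newObjectId) {
        return List.of(
            new Entry(conceptUuid, timestamp, transactionId, userId,
                      Action.REMOVE_RELATIONSHIP, oldRelationshipId),
            new Entry(conceptUuid, timestamp, transactionId, userId,
                      Action.ADD_RELATIONSHIP, newObjectId));
    }

    public static void main(String[] args) {
        List<Entry> entries = modifyRelationship(
            "concept-uuid", System.currentTimeMillis(), 1L, "user-1", "old-rel", "new-obj");
        for (Entry entry : entries) {
            System.out.println(entry.action + " " + entry.objectId);
        }
    }
}
```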

Changes to relationships between concepts are stored in the history table twice. The history table and object table together make it possible to reconstruct the graph from scratch (with some help of the user table for the authorized pages).

Branches

Each relationship between concepts, and each relationship between a concept and a content node, is labeled with a "branch". Branches represent the authorities from which the information is coming: each imported database is associated with a branch. There are two special branches: "community", which represents the users of the graph that have made changes, and "user", which represents a single user of the system (which user becomes clear when the history is examined). A subset of branches can be selected by someone interacting with the graph, so that different views of the data can be obtained.

A branch identifier is an "ascii" string.

A missing branch identifier on a relationship identifies it as the "community" branch.
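
Selecting a view by branch then amounts to filtering relationships on their "branch" property, with "community" as the fallback. This is a minimal sketch that represents each relationship as a map of its string properties; BranchFilter and the branch name "imported-db" are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

class BranchFilter {
    static final String DEFAULT_BRANCH = "community";

    // A relationship without a "branch" property belongs to the "community" branch.
    static String branchOf(Map<String, String> relationshipProperties) {
        return relationshipProperties.getOrDefault("branch", DEFAULT_BRANCH);
    }

    // Keeps only the relationships whose branch is in the selected subset.
    static List<Map<String, String>> visible(List<Map<String, String>> relationships,
                                             Set<String> selectedBranches) {
        List<Map<String, String>> result = new ArrayList<>();
        for (Map<String, String> relationship : relationships) {
            if (selectedBranches.contains(branchOf(relationship))) {
                result.add(relationship);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> imported = Map.of("branch", "imported-db"); // hypothetical branch
        Map<String, String> unlabeled = Map.of();                       // implicit "community"

        List<Map<String, String>> view =
            visible(List.of(imported, unlabeled), Set.of("community"));
        System.out.println(view.size() + " relationship(s) visible");
    }
}
```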

A list of branches, with some essential properties(?) and multilingual representations of their labels, is stored in an external database. This information is not user-modifiable.

Languages

A table of all supported languages, their identifiers, and their names in different languages, is stored in a database outside of the graph. The information is not user-modifiable.

Users

User handling is SAML compatible, maybe using OpenAM. Information is stored external to the graph.

Information about the users that made changes to the graph is exclusively stored in the history table.

"Authorization" of a page (identification of an "equals" relationship between a user and a concept representing a person) is handled by storing the uuid of the person-concept in the user table, and adding the "authorized-by" property to the concept node with the user-id as its value.

Last modified on Jul 13, 2011, 1:47:42 PM