5/31/11

Deductive Databases

By staying with the simple data types, but adding recursion to this database language, one gets a language called (positive?) Datalog, which is the language underlying deductive databases. Deductive databases are an extension of relational databases which support more complex data modeling. In this section we will see how simple examples of deductive databases can be represented in Prolog, and we will see more of the limitations of Prolog.

A standard example in Prolog is a geneology database. An extensional relation stores the parent relation: parent(X,Y) succeeds with X and Y if X has parent Y. (Maybe do an example consisting of some English monarchs?) Given this parent relation, we can define the ancestor relation as follows:

    ancestor(X,Y) :- parent(X,Y).     ancestor(X,Y) :- parent(X,Z), ancestor(Z,Y). 
This says that X has ancestor Y if X has parent Y; and X has ancestor Y if there is a Z such that X has parent Z and Z has ancestor Y. Given a definition for the parent relation as follows:
parent(elizabeth_II, charles_??). etc. 
we can query about ancestors, such as:
:- ancestor(elizabeth_II,X). 
and find all of Queen Elizabeth's ancestors.

This works very nicely in Prolog, since the parent graph is (essentially) a tree. However, if we try the same definition of transitive closure for a graph that is not acyclic like a tree, we can be in trouble. Say we have a relation owes(X,Y) which indicates that X owes money to Y, and we want to define a predicate avoids(X,Y), meaning that X tries to avoid running into Y. The definition is that people avoid someone to whom they owe money, and they avoid anyone that someone to whom they owe money avoids:

avoids(X,Y) :- owes(X,Y). avoids(X,Y) :- owes(X,Z), avoids(Z,Y). 
This definition has the same form as the ancestor definition. The problem here is that the owes relation may be cyclic. It is possible for Andy to owe money to Bill, Bill to own money to Carl and Carl to owe money to Bill:
owes(andy,bill). owes(bill,carl). owes(carl,bill). 
and if we ask who Andy avoids:
| ?- avoids(andy,X). 
we get:
| ?- avoids(andy,X).  X = bill;  X = carl;  X = bill;  X = carl;  X = bill;  X = carl;  X = bill; .... 
an infinite loop.

If we would like to use Prolog as an engine for deductive databases, this shows up a serious problem: that a user can write a simple specification (using only atoms and variablse) and yet Prolog won't give an answer, but will wander off to (or toward) infinity. One couldn't afford to give such a system to a naive database user. There it is important that any query come back with some answer. Think of an SQL system that sometimes went into an infinite loop. It wouldn't be much used, and certainly not by naive users.

This problem of infinite looping is a well-known problem in Prolog and Prolog programmers learn to program around it. The usual fix is to add an extra argument to the avoids/2 predicate that keeps the list of people encountered in the process of finding the avoidees. Then if one of them is encountered again, the search is made to fail at that point, since it is known that all avoidees from that one have already been found. I.e.,

avoids(X,Y,L) :- owes(X,Y), \+ member(Y,L). avoids(X,Y,L) :- owes(X,Z), \+ member(Z,L), avoids(Z,Y,[Z|L]). 
Here we've used the Prolog predicate member/2, and the Prolog builtin, \+, which implements not. \+ member(Y,L) succeeds just in case member(Y,L) fails, and fails if it succeeds. Now with this program and the corresponding query, we get:
| ?- avoids(andy,X,[]).  X = bill;  X = carl;  no | ?- 
This fix works to avoid the infinite loop, but it sometimes has some undesirable properties. There are graphs for which the computation will be exponential in the number of arcs. Consider the case in which Andy owes money to Bill and Bob, and Bill and Bob owe money to Carl; Carl owes money to Dan and Dave, and Dan and Dave owe money to Evan; Evan owes money to Fred and Frank, and Fred and Frank owe money to George; and so on. The graph of the owes relation can be pictured as in Figure 2.2.
Figure 2.2: Graph on which Prolog is exponential
\begin{figure}\centerline{\epsfbox{figures/expgraph.eps}} \end{figure}

5 graph database examples

This post is part of our ReadWriteCloud channel, which is dedicated to covering virtualization and cloud computing. The channel is sponsored by Intel and VMware. Read the white paper about how Intel Xeon processors help organizations get unprecedented levels of performance.

Matrix graph thumbnail 150x150 Of the major categories of NoSQL databases - document-oriented databases, key-value stores and graph databases - we've given the least attention to graph databases on this blog. That's a shame, because as many have pointed out it may become the most significant category.

Graph databases apply graph theory to the storage of information about the relationships between entries. The relationships between people in social networks is the most obvious example. The relationships between items and attributes in recommendation engines is another. Yes, it has been noted by many that it's ironic that relational databases aren't good for storing relationship data. Adam Wiggins from Heroku has a lucid explanation of why that is here. Short version: among other things, relationship queries in RDBSes can be complex, slow and unpredictable. Since graph databases are designed for this sort of thing, the queries are more reliable.

Google has its own graph computing system called Pregel (you can find the paper on the subject here), but there are several commercial and open source graph databases available. Let's look at a few.

Neo4j

This is one of the most popular databases in the category, and one of the only open source options. It's the product of the company Neo Technologies, which recently moved the community edition of Neo4j from the AGPL license to the GPL license (see our coverage here). However, its enterprise edition is still proprietary AGPL. Neo4j is ACID compliant. It's Java based but has bindings for other languages, including Ruby and Python.

Neo Technologies cites several customers, though none of them are household names.

Here's a fun illustration of how relationship data in graph databases works, from an InfoQ article by Neo Technologies COO Peter Neubauer:

Neo4j

FlockDB

FlockDB was created by Twitter for relationship related analytics. Twitter's Kevin Weil talked about the creation of the database, along with Twitter's use of other NoSQL databses, at Strange Loop last year. You can find our coverage here.

There is no stable release of FlockDB, and there's some controversy as to whether it can be truly referred to as a graph database. In a DevWebPro article Michael Marr wrote:

The biggest difference between FlockDB and other graph databases like Neo4j and OrientDB is graph traversal. Twitter's model has no need for traversing the social graph. Instead, Twitter is only concerned about the direct edges (relationships) on a given node (account). For example, Twitter doesn't want to know who follows a person you follow. Instead, it is only interested in the people you follow. By trimming off graph traversal functions, FlockDB is able to allocate resources elsewhere.
This lead MyNoSQL blogger Alex Popescu to write: "Without traversals it is only a persisted graph. But not a graph database."

However, because it's in use at one of the largest sites in the world, and because it may be simpler than other graph DBs, it's worth a look.

AllegroGraph

AllegroGraph is a graph database built around the W3C spec for the Resource Description Framework. It's designed for handling Linked Data and the Semantic Web, subjects we've written about often. It supports SPARQL, RDFS++, and Prolog.

AllegroGraph is a proprietary product of Franz Inc., which markets a number of Semantic Web products - including its flagship set of LISP-based development tools. The company claims Pfizer, Ford, Kodak, NASA and the Department of Defense among its AllegroGraph customers.

GraphDB

GraphDB is graph database built in .NET by the German company sones. sones was founded in 2007 and received a new round of funding earlier this year, said to be a "couple million" Euros. The community edition is available under an APL 2 license, while the enterprise edition is commercial and proprietary. It's available as a cloud-service through Amazon S3 or Microsoft Azure.

InfiniteGraph

InfiniteGraph is a proprietary graph database from Objectivity, the company behind the object database of the same name. Its goal is to create a graph database with "virtually unlimited scalability."

According to Gavin Clarke at The Register: "InfiniteGraph map is already being used by the CIA and Department of Defense running on top of the existing Objectivity/DB database and analysis engine."

Others

There are many more graph databases, including OrientDB, InfoGrid and HypergraphDB. Ravel is working on an open source implementation of Pregel. Microsoft is getting into the game with the Microsoft Reasearch project Trinity.