Why native is the best graph database bet.
But not all graph databases are created the same. Developers are going to have to be clear on not just why graph databases, but which graph database. What makes a good technology choice in the context of enterprise graph? Primary is the native or non-native design of the database management system. That is whether a system is built ground-up for graphs, or whether graphs are added as an extension to some other model.
In this article, we'll explore some fundamental computer science and system design themes pertaining to database management system (DBMS) design. We'll see how each DBMS is designed to support its primary or native model and consider the pitfalls when we try to stretch that native model to other non-native models.
Many of us have studied Computer Science or similar degrees and been exposed to algorithms and data structures. When evaluating an algorithm or data structure we consider its complexity in processing time or memory space requirements. You may recall the 'Big O' notation used as shorthand for expressing complexity.
* O(l) or constant if accessing an element in a structure has constant cost even as the number of elements grow.
* O(log n) if accessing an element in a structure takes one more operation when the number of elements doubles.
* O(n) when the number of operations required to access an element grows proportionally with the number of elements.
* O(m log n) an operation costs an additional multiple on top O(log n).
* O(nm) when the number of operations required to access an element grows exponentially (or other polynomial growth) with the number of elements.
In general, we prefer cheaper characteristics unless the number of elements is small and bounded, in which case a modern computer can execute them rapidly.
Using theory as a guide, we can see how this plays out in practice, in the context of building a database management system. Let's begin with a humble unordered singly-linked list.
The linked-list is very cheap for uncontended inserts at cost O(l). But for reads it's O(n) meaning the performance degrades as the size of the list grows.
Does this mean we shouldn't use linked lists? Not at all. But we must be aware that they're appropriate for uncontended write scenarios, and less good for contended writes and less good for reads.
Tree structures can help redress the imbalance between reading and writing that characterises our list. Both reads and writes to a tree are typically 0(log n). Write contention is also reduced because there are more attachment points to a tree (even a simple binary tree) than a list.
B-tree variants have long been popular with database engineers, with the non-leaf nodes providing an index over the actual records contained in the leaf nodes. Cleverly this allows the more numerous index records to be held in fast memory for speedy processing while relegating far less numerous data records to disk.
Tree structures are fine for one-off lookups or writes at cost O(log n). Even for larger data sets of billions and beyond the number of operations required does is manageable for single lookups.
However, if we think about trees in the context of a graph database where many traversals are required to make sense of connected data we'd ideally prefer lower cost complexity access. The difference between O(n), which is the cost for n hops at O(l) natively, and 0(m log n) in terms of IO operations to disk rapidly adds up and degrades query performance.
Of course, in any DBMS there will be a mixture of structures in its design to suit its native workload. It is through careful choice of such data structures in the context of the native data model and workload that a coherent DBMS design emerges. Each DBMS duly has its strengths: the model for which it is native.
The native graph database advantage
From query language through to the database management engine and file system, and from clustering to backup and monitoring, a native graph database is designed from the ground-up to serve graph workloads safely and efficiently.
Native graph technology tends to perform queries faster, scale better (retaining their hallmark query speed even as the dataset grows), and run far more efficiently upon less hardware. The affordances for graph queries are better and the operational surface is highly affined to graph workloads.
A native query language (e.g. openCypher) is a key element of getting value from your graph data. A native query planner atop a native store can perform optimisations that aren't available to non-native stores. We have seen results for simple graph analytics where the native solution is hundreds of thousands of times faster than a non-native store on their own benchmarks.
Underpinning the stack, native graph storage is designed to use the file system in a way that is expressly sympathetic towards graphs. Traversing across a relationship costs O(l) irrespective of the size of the graph because of the way the graph is laid out on disk and in-memory (using very few indexes) therefore traversing multiple hops carries a predictable cost of O(n). Query latency is proportional to how much of the graph we choose to search, not to the overall size of the data--a big win.
Mechanically, the cost of each traversal is tiny because of mechanical sympathy between the software and hardware. Native graph databases optimise for the fastest thing a computer can do: fetch and dereference pointers to navigate the graph efficiently and at extremely high speeds. By contrast, to reify columnar, relational, document, or keyvalue data as a graph database, the database management system must perform costly translations to and from the primary internal model of the database.
Native graph databases include transactional mechanisms to ensure that data is stored safely and is impervious to network blips, server failures, and contention from concurrent transactions. Consensus protocols like Raft and patterns like transaction log shipping allow the clusters of native graph databases to offer strong consistency guarantees (e.g. Neo4j provides causal consistency).
Furthermore, a native storage stack allows adaption of the implementation to the evolving hardware architectures of tomorrow. As memory and disk technology evolves, native graph implementation evolves to take advantage of those to support graph workloads. In the future, for example, we fully expect to adapt Neo4j's native storage model to emergent novel disk storage platforms and memory architectures such as non-volatile RAM.
But let's see what that means when we adapt other DBMS designs to run graph workloads.
Graph queries on non-native graph gatabases
Our brief reflection of classic data structures is a highly simplified but useful reminder that we choose data structures for a purpose and there is no single data structure that is universally applicable. When designing a DBMS we select the data structures to support the data model and query workload. We may choose quite different structures for column stores, document stores and relational databases. Then we optimise the system design to be as efficient as possible for our chosen data model. The resulting software is native: for columns, for documents, or for tables. But it is not native for graphs.
To support graph workloads on these non-native graph databases, we must extend them in some way, piggybacking graph features on the native model. Non-native graph databases typically take one of two approaches: adding a graph API on top of an existing database management system, or adding graph features into a non-graph query language.
We should not dismiss these as mere implementation details. The implementation of a DBMS empowers it for some workloads (those for which it is native) and disempowers it for other workloads (those for which it is not native). We can see this clearly when extending two popular non-graph models as graph databases.
Graph API atop a column store
Popular column stores provide a nested hashmap data model natively. Usually based on a hash ring structure, they provide a simple way of spreading data around a cluster, and providing opportunities for fault tolerance.
When converting a column store to act as a graph database we deploy a graph. This library provides a graph API and query language and is then mapped onto the underlying database through some glue code.
Generally a hashmap has excellent O(l) semantics for reads and writes. However, a good hash function also tries to spread similar data widely to avoid clashes. Unfortunately, this reduces locality.
In a graph locality is key: we traverse along relationships to neighbouring nodes and it is highly beneficial if those nodes have spatial locality for efficient memory access.
Each O(l) operation lookup in a hash ring column store becomes an over-the-network operation and thus mechanically expensive in practice. This means graph traversal performance is often poor and requires large hardware investment for even modest traversal performance. Even where clever mappings between the API and underlying database radically denormalises data to retain locality, we still hit a cliff when we reach the limits of that denormalisation.
We at Neo4j have seen this first hand: our own early implementations of a graph database (back in 2000-2003) were non-native, with a graph API atop a relational database. When the queries exceeded around depth three, they degraded substantially in performance and spawned our efforts to build a truly graph-native engine.
Finally, since the underlying fault tolerance infrastructure is designed to store and retrieve single data values rather than graphs, it is easy to corrupt data in such systems under normal operation. The typical eventual consistency model simply isn't strong enough for graphs.
Document store with graph query operator
Popular document stores use B-Trees to index documents in the DBMS rather like an advanced filing cabinet. As a tree-based system, looking up a document costs O(log n). This is probably fine for one-off lookups or writes, but it becomes O(m log n) when using indexes to imitate traversing a graph of m hops. Traversal performance suffers greatly at such cost as early as depths two or three.
Some document stores offer a graph lookup operator to try to help users link data in some basic way. When examined, we discover that the implementation creates the equivalent of an RDBMS join table to link graph elements. Clearly, the affordance is limited in functionality. It does not provide the kind of rich vocabulary necessary to extract true value from graphs, simply syntactic sugar over a user-level convention for linked documents.
Perhaps worst of all when trying to subvert a document DBMS for other purposes, the underlying DBMS engine simply doesn't know about relationships (never mind graphs). Since it's native for documents, a user can corrupt the graph (e.g. leave dangling links) and the engine is unable to intervene to maintain, or even check for, consistency. Over time the graph becomes islands of orphans and dangling links that decimate its value.
'Good enough' is not good enough
It's often convenient to think that non-native graph technology may be 'good enough', particularly if you already have the non-native technology installed for its native use-case.
But our data grows over time. Our datasets are more variably structured, interconnected and interrelated than ever before. If you want to make sense of that data, you should integrate native graph technology into your systems.
We also know that the value is in the connections, and a non-native approach hamstrings that value. A native graph database will serve you better over the long-term and won't require extraordinary hardware investments.
Knowing which graph architecture is appropriate is not always clear, but where business needs to connect and make connections work native graph nine times out of ten, due to its integrity, performance, efficiency and scaling benefits, emerges as really the only option.
Dr. Jim Webber is Chief Scientist of Neo4j, the world's leading graph database (http://neo4j.com/). Jim is co-author of O'Reilly's Graph Databases (http://graphdatabases.com/)
Neo4j's Chief Scientist, Dr. Jim Webber, says that for developers to get the most out of the connections in their data it's native, with the right integrity, performance, efficiency and scaling that will deliver graph project success.
(l.) Using conflict-free replicated data types (CRDTs) can help alleviate write contention as can lock-free programming strategies, so it's not all bad news. But it does show that when theory meets practice, the big O notation provides design guidance not an ironclad guarantee.
Caption: An unordered singly-linked list.
Caption: A binary tree
|Printer friendly Cite/link Email Feedback|
|Date:||Jul 1, 2017|
|Previous Article:||Broadband Forum makes rapid progress.|
|Next Article:||Could WannaCry, AdylKuzz and UIWIX have been prevented?|