JanusGraph
JanusGraph is a scalable graph database optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster. It is a transactional database that can support thousands of concurrent users executing complex graph traversals in real time.
Official website
Docs
Schema
Config reference
Eventual Consistency & ConsistencyModifier.LOCK
Storage Backends: Bigtable, HBase, Cassandra, ScyllaDB, BerkeleyDB, DynamoDB, InMemory
Index Backends: Elasticsearch, Solr, Lucene
Transactions
Bulk Loading
Optimizing Reads & Writes
TinkerPop3 docs
SQL2Gremlin
Practical Gremlin Tutorial and Book
Gremlin Recipes
JanusGraph Tutorial
Janusgraph Utils
Testing
Write Performance
Loading data from file
docs improvement
JanusGraph can run with an embedded Gremlin Server, and it can also connect to a remote standalone Gremlin Server.
It is possible to use embedded JanusGraph with multiple graphs, each with its own configuration file.
It is strongly encouraged to explicitly define all schema elements and to disable automatic schema creation by setting schema.default=none in the JanusGraph graph configuration.
query.force-index=false/true — whether JanusGraph should throw an exception if a graph query cannot be answered using an index. Doing so limits the functionality of JanusGraph's graph queries but ensures that slow graph queries are avoided on large graphs. Setting it to true is recommended for production use of JanusGraph.
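The two recommendations above can be combined in the graph configuration; a minimal properties fragment might look like this (the option names come from the config reference, the file name is just an example):

```properties
# janusgraph.properties (example)

# Disable automatic schema creation; define all schema elements explicitly.
schema.default=none

# Reject graph queries that cannot be answered by an index,
# so full scans never sneak into production workloads.
query.force-index=true
```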
Enabling the storage.batch-loading configuration option will have the biggest positive impact on bulk loading times for most applications. Enabling batch loading disables JanusGraph's internal consistency checks in a number of places. Most importantly, it disables locking: JanusGraph assumes that the data being loaded is consistent with the graph and hence disables its own checks in the interest of performance. Important: Enabling storage.batch-loading requires the user to ensure that the loaded data is internally consistent and consistent with any data already in the graph. In particular, concurrent type creation can lead to severe data integrity issues when batch loading is enabled. Hence, we strongly encourage disabling automatic type creation by setting schema.default=none in the graph configuration.
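A sketch of a bulk-loading configuration, assuming the schema has already been defined up front (ids.block-size is a real option from the bulk loading docs; the value shown is only an illustrative starting point to tune):

```properties
# Bulk loading profile (example values)

# Skip internal consistency checks and locking during the load.
storage.batch-loading=true

# Batch loading must not create types implicitly; define the schema first.
schema.default=none

# Larger id blocks reduce id-allocation round trips under heavy insert load.
ids.block-size=1000000
```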
PermanentLockingExceptions appear only on properties marked with ConsistencyModifier.LOCK.
Without calling graph.tx().commit(), reads in the same thread may return stale, transaction-cached results.
When updating an element that is guarded by a uniqueness constraint, JanusGraph uses the following protocol at the end of a transaction when calling tx.commit():
1. Acquire a lock on all elements that have a consistency constraint.
2. Re-read those elements from the storage backend and verify that they match the state of the element in the current transaction prior to modification. If not, the element was concurrently modified and a PermanentLockingException is thrown.
3. Persist the state of the transaction against the storage backend.
4. Release all locks.
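This locking protocol only applies to schema elements that have been given ConsistencyModifier.LOCK through the management API. A Gremlin console sketch, following the pattern from the eventual-consistency docs (the key name `name` and index name `byNameUnique` are just examples):

```groovy
// Open a management transaction against an already-opened graph.
mgmt = graph.openManagement()

// A property key and a unique composite index over it.
name = mgmt.makePropertyKey('name').dataType(String.class).make()
index = mgmt.buildIndex('byNameUnique', Vertex.class).
             addKey(name).unique().buildCompositeIndex()

// Opt in to locking; without this, the uniqueness constraint is
// only eventually consistent and concurrent writers can violate it.
mgmt.setConsistency(name, ConsistencyModifier.LOCK)
mgmt.setConsistency(index, ConsistencyModifier.LOCK)

mgmt.commit()
```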
Batching - Looking at our script, you’ll notice that I’m including more than one addV per call. The exact number you’ll want to send over at once may vary, but the basic idea holds that there are performance benefits to be gained from batching. In addition to batching, note that I chained all of the mutations into a single traversal. This amortizes the cost of traversal compilation, which can be non-trivial when you’re going for as high of throughput as possible. Note that Gremlin is quite powerful and you can mix reads and writes into the same traversal, extending way beyond my simple insert example. So keep that in mind as you write your mutating traversals. The chosen batch size 10 is rather arbitrary so plan to test a few different sizes when you’re doing performance tuning.
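A minimal sketch of the chained-mutation idea described above — several addV steps folded into one traversal so the compilation cost is paid once per batch, not once per vertex (labels and property names are made up for illustration):

```groovy
// One traversal, one round trip, three vertices.
g.addV('person').property('name', 'alice').
  addV('person').property('name', 'bob').
  addV('person').property('name', 'carol').
  iterate()   // iterate() executes the traversal without returning results
```

In a loader you would build such a traversal in a loop over each chunk of input records, submit it, then commit, tuning the chunk size experimentally as the text suggests.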
"Florian Hockmann, Nov 14 2017, 09:27
@alimuzaffarkhan I would try it without any locks, as they can slow down parallel inserts a lot. That's why it's often a good idea to handle de-duplication in your client application, or at least make it robust against duplicate data. See also Chapter 29 of the docs for this topic. Even if you really need the locks, it might be interesting to benchmark performance without locks, as that will tell you whether locks are responsible for the low performance or whether there's another issue.
Thijs, Nov 14 2017, 10:19
@alimuzaffarkhan I have also struggled with performance due to unique constraints, which caused a lot of locking. After a while I removed all constraints and built a de-duplication mechanism to merge duplicate nodes (when detected), and now I can insert thousands of vertices and edges a second. I built a linked data platform, and hence when I query for some URI I perform a dedup action if multiple nodes are found. I use additional timestamp properties and some other arbitrary rules to determine which elements to keep. This way I can have a reliable, eventually consistent linked data platform. My setup: 1 Cassandra node (non-unique indexes on my URI identifiers), 1 ES node (indexes on all properties), 1 Kafka node, ZooKeeper, Scala data listeners importing data from multiple sources and pushing it to Kafka, and a Scala graph importer listening to Kafka and importing incoming records to the graph. I stream the Kafka topics in batches of 100 and also commit to the graph by batch (committing single records is slow). I also stream the batches concurrently, and this is where a little optimization/tuning can be done; I currently have the parallelism set to 30 (so processing 30x100 records/sec, where a record results in committing one or more vertices and zero or more edges). Currently I am running this on a single machine. So my dedup mechanism is executed ad hoc, but you could also scan for duplicate ids after a certain delay since the moment a new id was last inserted (perhaps do this in batches)."
Transactions are started automatically with the first operation executed against the graph; one does NOT have to start a transaction manually. The method newTransaction is used only when a transaction needs to be opened explicitly, for example a thread-independent (multi-threaded) transaction.
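A small console sketch of the explicit form, assuming `graph` is an open JanusGraph instance (the label `person` is an example):

```groovy
// Explicitly opened transaction, independent of the current thread's
// automatic transaction; every read/write goes through the tx handle.
tx = graph.newTransaction()
v = tx.addVertex('person')
tx.commit()   // or tx.rollback() to discard the changes
```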
fold()/coalesce()/unfold() — the "get or create" (upsert) pattern.
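The pattern in Gremlin, as described in the TinkerPop recipes (label and property values are examples): fold() turns the match into a list so the traversal continues even when empty, and coalesce() either unfolds the existing vertex or creates a new one.

```groovy
// Return the 'alice' vertex if it exists, otherwise create it.
g.V().has('person', 'name', 'alice').
  fold().
  coalesce(unfold(),
           addV('person').property('name', 'alice')).
  next()
```

Note that without a ConsistencyModifier.LOCK on the key this upsert is not race-free under concurrent writers; it only avoids duplicates within a single traversal.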
"I generally wouldn't recommend embedding Janus in your app though unless you have a really good reason to." - by Ted Wilmes
"By default, communication with Gremlin Server occurs over WebSockets and exposes a custom sub-protocol for interacting with the server."
Airline reservations and routing: a graph use case
JanusGraph Journey
Scylla Summit 2017: Performance Evaluation of Scylla as a Database Backend for JanusGraph
Ted Wilmes on the state of JanusGraph 2018