JanusGraph

Introduction

JanusGraph is a scalable graph database optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster. JanusGraph is a transactional database that can support thousands of concurrent users executing complex graph traversals in real time.

Official website: http://janusgraph.org/

Traversal promises

http://tinkerpop.apache.org/docs/3.3.0/upgrade/#_traversal_promises

gremlin> promise = g.V().out().promise{it.next()}
==>java.util.concurrent.CompletableFuture@4aa3d36[Completed normally]
gremlin> promise.join()
==>v[3]
gremlin> promise.isDone()
==>true
gremlin> g.V().out().promise{it.toList()}.thenApply{it.size()}.get()
==>6

Important

  • JanusGraph can run with an embedded Gremlin Server, and it is also able to connect to remote standalone Gremlin Servers

  • It is possible to use embedded JanusGraph with multiple graphs, each opened from its own configuration file (see the sketch after this list)

  • It is strongly encouraged to explicitly define all schema elements and to disable automatic schema creation by setting schema.default=none in the JanusGraph graph configuration. A configuration sketch follows this list.

  • query.force-index=true/false. When true, JanusGraph throws an exception if a graph query cannot be answered using an index. Doing so limits the functionality of JanusGraph’s graph queries but ensures that slow graph queries are avoided on large graphs. Recommended for production use of JanusGraph.

  • Enabling the storage.batch-loading configuration option will have the biggest positive impact on bulk loading times for most applications. Enabling batch loading disables JanusGraph’s internal consistency checks in a number of places; most importantly, it disables locking. In other words, JanusGraph assumes that the data to be loaded is consistent with the graph and hence disables its own checks in the interest of performance. Important: enabling storage.batch-loading requires the user to ensure that the loaded data is internally consistent and consistent with any data already in the graph. In particular, concurrent type creation can lead to severe data integrity issues when batch loading is enabled. Hence, we strongly encourage disabling automatic type creation by setting schema.default=none in the graph configuration.

  • PermanentLockingExceptions will appear only on schema elements marked with ConsistencyModifier.LOCK (see the schema sketch after this list)

  • Without committing the transaction, reads performed in the same thread can return cached results from the still-open transaction

  • Transactions are started automatically with the first operation executed against the graph; one does NOT have to start a transaction manually. The method newTransaction is only used to start multi-threaded transactions (see the transaction sketch after this list)

  • The fold()/coalesce()/unfold() pattern implements a get-or-create ("upsert"): https://stackoverflow.com/a/46053115/7120456 (see the sketch after this list)

  • When updating an element that is guarded by a uniqueness constraint, JanusGraph uses the following protocol at the end of a transaction when calling tx.commit():

    1. Acquire a lock on all elements that have a consistency constraint

    2. Re-read those elements from the storage backend and verify that they match the state of the element in the current transaction prior to modification. If not, the element was concurrently modified and a PermanentLocking exception is thrown.

    3. Persist the state of the transaction against the storage backend.

    4. Release all locks.
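
For the embedded case, each graph is simply opened from its own configuration file. A minimal Gremlin Console sketch (the file names are made up):

graph1 = JanusGraphFactory.open('conf/graph-one.properties')
graph2 = JanusGraphFactory.open('conf/graph-two.properties')
g1 = graph1.traversal()
g2 = graph2.traversal()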
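
A sketch of a properties file combining the options discussed above; the storage settings are placeholders, and only schema.default, query.force-index and storage.batch-loading are the options from these notes:

# storage settings are placeholders, adjust to your backend
storage.backend=cql
storage.hostname=127.0.0.1

# disable automatic schema/type creation
schema.default=none

# reject graph queries that cannot be answered by an index (recommended for production)
query.force-index=true

# enable only when bulk loading data that is already known to be consistent
storage.batch-loading=true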
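
A sketch of a uniqueness constraint guarded by ConsistencyModifier.LOCK, defined through the management API in the Gremlin Console (property and index names are made up):

mgmt = graph.openManagement()
name = mgmt.makePropertyKey('name').dataType(String.class).make()
byName = mgmt.buildIndex('byNameUnique', Vertex.class).addKey(name).unique().buildCompositeIndex()
// without this, the constraint is not guarded by a lock and no PermanentLocking exception is raised
mgmt.setConsistency(byName, ConsistencyModifier.LOCK)
mgmt.commit()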
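
A sketch of the two transaction styles in the Gremlin Console (labels and property values are examples): the first traversal opens a transaction implicitly and commit ends it, while newTransaction() returns an independent transaction object that is safe to hand to its own thread.

// implicit, thread-bound transaction: opened by the first operation
g = graph.traversal()
g.addV('person').property('name', 'alice').iterate()
graph.tx().commit()

// explicit transaction for multi-threaded use
tx = graph.newTransaction()
tx.addVertex(T.label, 'person', 'name', 'bob')
tx.commit()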
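
The get-or-create upsert from the Stack Overflow answer above, roughly: fold() turns the (possibly empty) match into a list, and coalesce() either unfolds the existing vertex or creates a new one (label and property values are examples):

g.V().has('person', 'name', 'alice').
  fold().
  coalesce(unfold(),
           addV('person').property('name', 'alice')).
  next()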

Bulk/Batch loading

https://docs.janusgraph.org/latest/bulk-loading.html

https://docs.janusgraph.org/latest/limitations.html#_batch_loading_speed

Important: Enabling storage.batch-loading requires the user to ensure that the loaded data is internally consistent and consistent with any data already in the graph. In particular, concurrent type creation can lead to severe data integrity issues when batch loading is enabled.

https://github.com/IBM/janusgraph-utils/blob/master/doc/users_guide.md#import-csv-file-to-janusgraph

https://docs.datastax.com/en/dse/6.0/dse-dev/datastax_enterprise/graph/dgl/dglOverview.html

"Batching - Looking at our script, you’ll notice that I’m including more than one addV per call. The exact number you’ll want to send over at once may vary, but the basic idea holds that there are performance benefits to be gained from batching. In addition to batching, note that I chained all of the mutations into a single traversal. This amortizes the cost of traversal compilation, which can be non-trivial when you’re going for as high of throughput as possible. Note that Gremlin is quite powerful and you can mix reads and writes into the same traversal, extending way beyond my simple insert example. So keep that in mind as you write your mutating traversals. The chosen batch size 10 is rather arbitrary so plan to test a few different sizes when you’re doing performance tuning."

https://www.experoinc.com/post/janusgraph-nuts-and-bolts-part-1-write-performance
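
A sketch of the batching idea from the quote: several addV steps chained into one traversal, so a whole batch is compiled and sent as a single request (labels and values are placeholders, and the batch size should be tuned):

g.addV('person').property('name', 'v1').
  addV('person').property('name', 'v2').
  addV('person').property('name', 'v3').
  iterate()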

"Florian Hockmann lis 14 2017 09:27 @alimuzaffarkhan I would try it without any locks as they can slow down parallel inserts a lot. That's why it's often a good idea to handle de-duplication in your client application or at least make it robust against duplicate data. See also Chapter 29 of the docs for this topic. Even if you really need the locks, it might be interesting to benchmark performance without locks as that will tell whether locks are responsible for the low performance or whether there's another issue.

Thijs Nov 14 2017 10:19 @alimuzaffarkhan I have also struggled with the performance due to unique constraints which caused a lot of locking. After a while I removed all constraints and built a de-duplication mechanism to merge duplicate nodes (when detected), and now I can insert thousands of vertices and edges a second. I built a linked data platform, and hence when I query for some URI I perform a dedup action if multiple nodes are found. I use additional timestamp properties and some other arbitrary rules to determine which elements to keep. This way I can have a reliable, eventually consistent linked data platform. My setup: 1 Cassandra node (non-unique indexes on my URI identifiers), 1 ES node (indexes on all properties), 1 Kafka node, ZooKeeper, scala-data-listeners importing data from multiple sources and pushing it to Kafka, scala-graph-importer listening to Kafka and importing incoming records to the graph. I stream the Kafka topics in batches of 100 and am also committing to the graph by batch (committing single records is slow). I also stream the batches concurrently, and this is where a little optimization/tuning can be done; I currently have set the parallelism to 30 (so processing 30x100 records/sec, where a record results in committing one or more vertices and zero or more edges). Currently I am running this on a single machine. So my dedup mechanism is executed ad hoc, but you could also scan for duplicate ids after a certain delay since the moment a new id was inserted last (perhaps do this in batches)."

Standalone Gremlin Server

"I generally wouldn't recommend embedding Janus in your app though unless you have a really good reason to." - by Ted Wilmes https://www.experoinc.com/post/janusgraph-nuts-and-bolts-part-1-write-performance

https://github.com/JanusGraph/janusgraph/issues/1108

"By default, communication with Gremlin Server occurs over WebSockets and exposes a custom sub-protocol for interacting with the server." http://tinkerpop.apache.org/docs/3.2.3/reference/#starting-gremlin-server http://tinkerpop.apache.org/docs/3.2.3/reference/#connecting-via-console http://tinkerpop.apache.org/docs/3.2.3/reference/#_connecting_via_rest http://docs.janusgraph.org/0.2.0/server.html#_websocket_versus_http http://docs.janusgraph.org/0.2.0/server.html#_janusgraph_server_as_a_websocket_endpoint http://docs.janusgraph.org/0.2.0/server.html#_advanced_janusgraph_server_configurations

Presentations about JanusGraph
