Home
The design of Related is highly opinionated and the intention is not for Related to be a hammer for all nails. The goal is specifically to be an easy-to-use yet powerful tool for creating social applications. Quality over quantity! If you're doing DNA sequencing or trying to figure out who is a terrorist, you should probably look into another tool. If you're building the next Facebook or a cool new semantic web product, you've probably come to the right place.
- Simple and lightweight. If you already have Redis installed the only step to get started with Related is "gem install related". If you are familiar with Active Record then Related will probably feel very natural to you.
- Extremely fast JSON output. The main use case for Related is building web APIs that output large amounts of JSON data efficiently. Timestamps, for example, are stored as ISO 8601 strings, the same format used when serializing timestamps to JSON, which makes a costly post-processing step unnecessary. Related can easily return several thousand nodes as JSON in less than 100ms and is easy to integrate with a cache layer like memcached. For most use cases, bandwidth to the API consumer will be the limiting factor rather than the performance of Related.
- Distributed architecture and easy to shard. All graph walking is currently done on the client side, and IDs are Base62-encoded MD5 strings rather than auto-incrementing integers, which makes it fairly simple to use Related in a distributed setup. Since Redis is so damn fast, the graph walking is not really a performance problem, and in a sharded setup where nodes are stored in different Redis instances Related can potentially visit many nodes in parallel. The only limiting factor will once again be bandwidth.
- Powerful and easy-to-use querying. Querying the graph should not require learning a new programming language, or even understanding much about graph algorithms at all. Related provides ready-made features like the Related::Follower module and query methods like shortest_path_to.
- Real-time stream processing. It should be easy to process the data that goes into Related in real time, in an efficient and easy-to-understand way. Most non-trivial applications will need to aggregate the graph data in one way or another, and the closer to real time that aggregation can happen, the better. Related should provide the primitives needed to do that. Implementing a real-time version of Google's PageRank algorithm or Twitter's Trending Topics algorithm on a large-scale graph should be trivial using Related (or at least, that's the goal).
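To make the ID scheme above concrete, here is a minimal pure-Ruby sketch of deriving a Base62-encoded ID from an MD5 digest. The alphabet ordering and the key format are assumptions for illustration, not Related's actual implementation:

```ruby
require 'digest/md5'

# 0-9, a-z, A-Z: 62 characters (this ordering is an assumption).
BASE62 = [*'0'..'9', *'a'..'z', *'A'..'Z'].freeze

# Turn an arbitrary string into a compact, shard-friendly ID by
# MD5-hashing it and re-encoding the 128-bit digest in Base62.
def base62_md5(input)
  n = Digest::MD5.hexdigest(input).to_i(16)
  return BASE62[0] if n.zero?
  out = ''
  while n > 0
    n, rem = n.divmod(62)
    out.prepend(BASE62[rem])
  end
  out
end

puts base62_md5('node:42')  # deterministic, at most 22 alphanumeric characters
```

Because the ID is derived by hashing rather than handed out by a central counter, any shard can mint IDs independently without coordination.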
Graphs and graph processing have been the defining technical feature behind many of the most successful internet companies of the last 10 years (Google, Facebook, Twitter, etc.). In short, the vision behind Related is to bring large-scale graphs and graph processing to the masses and allow anyone to play and experiment with graph technology.
The intention is for Related to be fairly easy to shard efficiently. It can't shard at all right now, but the ambition is to support sharding in the 1.0 release. To be able to shard in a useful way, any system design needs to make trade-offs. Related relies heavily on the very efficient set operations that Redis provides, with the obvious downside that those operations can't be efficiently supported in a sharded environment in a general way.
There are currently two proposed solutions to this:
- Do set operations on the server side in Redis when possible (when the involved keys happen to exist on the same server) and emulate the set operation on the client side when that's not possible. For many applications, most social networks for example, this will be more than efficient enough. The drawback is of course that it becomes less efficient the more shards you add, and for networks with a very large number of relationships per node (a Twitter clone, for example) it might not work very well at all at scale.
- Store all node and relationship properties ("entities" in Related parlance) in a partitioned key space (using Redis::Distributed), which allows you to store as many nodes and relationships as you want and scale out to an unlimited number of servers, but store the set keys that define the "links" between nodes on a single master server (or a replicated master-slave setup). In such a setup the only limitation is how much data you can store on the master server, and since the sets are fairly compact compared to the entity properties in most applications, that should work fine most of the time. All set-operation queries and graph traversals go to the master server, and everything else hits the sharded servers.
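The client-side emulation in the first strategy can be sketched in plain Ruby. The two hashes below are stand-ins for separate Redis instances; a real implementation would fetch each set over the network with SMEMBERS before intersecting locally:

```ruby
require 'set'

# Two stand-in "shards": in a real setup each hash would be a separate
# Redis instance and each set would be fetched with SMEMBERS.
shard_a = { 'friends:alice' => Set['bob', 'carol', 'dave'] }
shard_b = { 'friends:bob'   => Set['carol', 'dave', 'erin'] }

# When the keys live on different shards, SINTER can't run server-side;
# the client fetches all the sets and intersects them locally instead.
def client_side_sinter(*sets)
  sets.reduce(:&)
end

mutual = client_side_sinter(shard_a['friends:alice'], shard_b['friends:bob'])
puts mutual.to_a.sort.inspect  # => ["carol", "dave"]
```

This is also where the scaling cost shows up: every emulated operation ships whole sets over the network, so very large relationship sets make the emulation expensive.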
The intention is to implement both strategies to allow you to select the one that makes the most sense for your application.
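The routing logic behind the second strategy can be sketched as follows. The key prefix and server URLs are invented for illustration, and in practice Redis::Distributed would handle the hash-based entity partitioning:

```ruby
require 'digest/md5'

MASTER = 'redis://master:6379'
SHARDS = %w[redis://shard1:6379 redis://shard2:6379 redis://shard3:6379]

# Set keys (the links between nodes) always go to the master so that
# set operations and graph traversal can run server-side; entity keys
# are spread over the shards by hashing the key.
def route(key)
  return MASTER if key.start_with?('rel:')  # hypothetical set-key prefix
  SHARDS[Digest::MD5.hexdigest(key).to_i(16) % SHARDS.size]
end

puts route('rel:alice:friends')  # always the master
puts route('entity:abc123')      # a deterministic shard
```

The trade-off is visible in the routing itself: adding shards grows entity capacity, while everything under the set-key prefix stays pinned to one machine.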
Related does not implement any kind of indexing of the data you store in a node or relationship, which means the only way to access that data is by knowing the ID of a node or relationship and then querying the graph to get to it. There are no plans to implement any index functionality in Related.
The reason is that most useful indexing is either too application-specific or too heavy, complex, and inefficient to be added to Related. Some applications need full-text search, some need geospatial indices, some don't need any indices at all. Some need to index the data in real time, others would rather index it in offline background jobs. Supporting all those use cases in an optimal way is not realistic. Related is a graph database, not a full-text search solution. Some people might not like it, and I know it might be controversial, but my opinion is that indexing does not belong in the database.

Another strong reason is that indexing does not play very well with caching. If you want to cache an object in memcached, for example, each time you retrieve that object you need its ID to look it up in memcached. If the only way to get that ID is to query an index, and that index is part of your database, then some of the benefit of having a caching layer in front of your database is lost. So the recommendation is: "Use the right tool for the right job and everything will work much better!".
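The caching argument can be sketched as a cache-aside lookup, where the caller must already hold the ID. The hash-based cache and the loader below are stand-ins for memcached and a real by-ID lookup, not Related's API:

```ruby
CACHE = {}  # stand-in for memcached
LOADS = []  # records which IDs actually hit the "database"

# Cache-aside: the caller supplies the ID directly, so a cache hit
# never touches the database. If the ID could only come from a
# database-side index, every lookup would hit the database anyway.
def fetch_node(id)
  CACHE[id] ||= load_from_graph(id)
end

def load_from_graph(id)
  LOADS << id  # stand-in for a by-ID lookup against the graph
  { id: id, name: "node-#{id}" }
end

fetch_node('abc123')  # miss: loads from the graph, fills the cache
fetch_node('abc123')  # hit: served from memory, no database access
puts LOADS.length     # => 1
```

Keeping IDs obtainable without an index query is what lets the second call skip the database entirely.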