A fascinating story about how British Gas uses Cassandra to analyze sensor data from boilers in UK homes and predict their failures appeared here.
The design of Cassandra is intuitively clear to me in its use of a single primary index to distribute the query load among a set of nodes that can be scaled out linearly. It uses a ring architecture based on consistent hashing. It emphasizes Availability and Partition-Tolerance over Consistency in the CAP theorem.
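To make the ring idea concrete, here is a minimal sketch of consistent hashing in Python. It is an illustration under simplifying assumptions, not Cassandra's implementation: the `Ring` class, the `token` helper, the node names, and the choice of MD5 are all made up for the example, and virtual nodes and replication are not modeled.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a string to a position on the ring (MD5 is an illustrative choice)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Hypothetical ring where each node owns exactly one token."""

    def __init__(self, nodes):
        # Sort nodes by their position on the ring.
        self._entries = sorted((token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._entries]

    def node_for(self, row_key: str) -> str:
        # Walk clockwise to the first node at or after the key's token,
        # wrapping around to the start of the ring if necessary.
        i = bisect.bisect_left(self._tokens, token(row_key)) % len(self._tokens)
        return self._entries[i][1]


small = Ring(["node-a", "node-b", "node-c"])
big = Ring(["node-a", "node-b", "node-c", "node-d"])
owner_before = small.node_for("boiler-42")
owner_after = big.node_for("boiler-42")
```

The property that matters for scaling is that adding `node-d` only takes over keys from one neighbor on the ring: every key either keeps its old owner or moves to the new node, so most data stays where it is.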
The data structure is a two-level hash table, with the first-level key being the row key and the second-level key being the column key.
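As a rough mental model, the two-level structure can be sketched with nested dictionaries in Python. The `put` and `get` helpers, the boiler ids, and the timestamp column keys are hypothetical names chosen for illustration; the real storage engine is far more involved.

```python
from collections import defaultdict

# First level: row key -> row. Second level: column key -> value.
table = defaultdict(dict)


def put(row_key, column_key, value):
    """Store a value under (row key, column key)."""
    table[row_key][column_key] = value


def get(row_key, column_key, default=None):
    """Look up a value; missing rows or columns return the default."""
    return table.get(row_key, {}).get(column_key, default)


# Hypothetical sensor readings: one row per boiler, one column per timestamp.
put("boiler-42", "2015-11-01T10:00", 61.5)
put("boiler-42", "2015-11-01T10:05", 62.0)
put("boiler-7", "2015-11-01T10:00", 58.3)
```

Reading all columns of one row, for example `table["boiler-42"]`, touches a single first-level key, which is why time series for one device fit this model well.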
Where Cassandra differs from a SQL db is in the flexibility of the data model. In SQL one can model complex relationships, which allow complex queries using joins. Cassandra supports CQL (Cassandra Query Language), which is like SQL but does not support joins or general multi-statement transactions. The impact is that queries in CQL cannot be as flexible (or ad hoc) as those in SQL. The kinds of queries that will be run have to be planned in advance, and the tables designed around them. Other queries would be inefficient. However, this drawback is mitigated by using Spark along with Cassandra. In my understanding, the Spark cluster runs in parallel with the Cassandra cluster.
Why are joins important? It goes back to relationships in an E-R diagram. Can’t we just model entities? When we store Employees in one table and Departments in another in a SQL db, each row has an id which is a shorthand for the employee or the department. This simplification forces us to look up both tables again via a join in a query, say when asking for all employees belonging to (only) the finance department. But tables like departments may be small, so they could be replicated in memory for quickly recovering associations. And tables like employees can be naturally partitioned by the employee id, which is unique. This means that SQL and complex relationships may not be needed for a number of use cases. If ACID compliance is also not a requirement, then NoSQL is a good bet. Cassandra differs from MongoDB in that it can scale much better.
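The idea of replacing a join with an in-memory lookup can be sketched in Python as follows. The sample data, `partition_for`, and `employees_in` are hypothetical, and the modulo partitioning is only a stand-in for whatever scheme the datastore actually uses.

```python
# Small table, cheap to replicate in memory on every node: department id -> name.
departments = {1: "Finance", 2: "Engineering"}

# Large table, partitioned by the unique employee id (sample rows are made up).
employees = [
    {"id": 101, "name": "Asha", "dept_id": 1},
    {"id": 102, "name": "Bram", "dept_id": 2},
    {"id": 103, "name": "Chen", "dept_id": 1},
]


def partition_for(employee_id, num_partitions):
    """Pick a partition from the employee id (illustrative modulo scheme)."""
    return employee_id % num_partitions


def employees_in(dept_name, rows, dept_lookup):
    """Resolve each row's department via the in-memory map instead of a join."""
    return [row["name"] for row in rows if dept_lookup.get(row["dept_id"]) == dept_name]


finance = employees_in("Finance", employees, departments)
```

Each partition can run `employees_in` over its own rows independently, since the department map is available locally; the partial results are then simply concatenated.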
Quote from British Gas: “We’re dealing largely with time series data, and Spark is 10 to 100 times quicker as it is operating on data in-memory…Cassandra delivers what we need today and if you look at the Internet of Things space; that is what is really useful right now.”
Here’s a blog that triggered this thought along with a talk by Rachel@datastax, who also assured me that Cassandra has been hardened for security and has Kerberos support in the free version.
British Gas operates Hive, a competitor to Nest for thermostats. Note that a couple of months back British Gas reported that 2,200 of its accounts were compromised.