Category: big data

Apache Druid – horizontally scalable time series database

Machine generated data such as clickstreams or database update streams, consists of many rows, which consist of 3 parts

  • timestamp
  • text columns or attributes, called dimensions
  • numerical values such as counts of hits, words, characters etc, called metrics

The desire to rapidly aggregate over such data with a low latency gave rise to Druid.

The data is append heavy and ingestion is a problem, as is querying the data, especially at low-latency.

Druid has two subsystems –

  • a write-optimized subsystem in the real-time nodes
  • a read-optimized subsystem in the historical nodes

The data is stored in S3 or HDFS in a column-oriented format.

A good explanatory article that goes into the Druid internal architecture –

Druid is different from Flink and Spark streaming in that it is not a streaming system. Flink can apply real-time data transformations on the data, which can then be ingested into Druid via Kafka, to power real-time dashboards.