Apache Druid – horizontally scalable time series database

Machine generated data such as clickstreams or database update streams, consists of many rows, which consist of 3 parts

timestamp
text columns or attributes, called dimensions
numerical values such as counts of hits, words, characters etc, called metrics

The desire to rapidly aggregate over such data with a low latency gave rise to Druid.

The data is append heavy and ingestion is a problem, as is querying the data, especially at low-latency.

Druid has two subsystems –

a write-optimized subsystem in the real-time nodes
a read-optimized subsystem in the historical nodes

The data is stored in S3 or HDFS in a column-oriented format.

A good explanatory article that goes into the Druid internal architecture – https://towardsdatascience.com/apache-druid-part-1-a-scalable-timeseries-olap-database-system-af8c18fc766d

Druid is different from Flink and Spark streaming in that it is not a streaming system. Flink can apply real-time data transformations on the data, which can then be ingested into Druid via Kafka, to power real-time dashboards.

Secure Machinery

On the evolution of security and intelligent machinery

Apache Druid – horizontally scalable time series database

Leave a comment Cancel reply

Share this:

Leave a comment Cancel reply