Machine-generated data, such as clickstreams or database update streams, consists of many rows, each made up of 3 parts (a sample row is sketched below the list):
- timestamp
- text columns or attributes, called dimensions
- numerical values, such as counts of hits, words, or characters, called metrics
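For instance, a single clickstream row might look like the following. This is a minimal sketch; the field names are illustrative, not a fixed Druid schema:

```python
# A hypothetical clickstream row: one timestamp, some dimensions, some metrics.
row = {
    "timestamp": "2023-01-15T10:30:00Z",  # when the event occurred
    # dimensions: string-valued attributes used for filtering and grouping
    "page": "/products/widget",
    "country": "US",
    "user_agent": "Mozilla/5.0",
    # metrics: numeric values that get aggregated
    "hits": 1,
    "latency_ms": 42,
}
```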
Druid arose from the need to aggregate over such data rapidly and at low latency. The data is append-heavy, so ingesting it is hard, and so is querying it with low latency.
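As a sketch of the kind of low-latency aggregation Druid targets, the query below sums a metric per hour, grouped by a dimension, and posts it to Druid's SQL endpoint (`/druid/v2/sql/`). The datasource and column names (`clickstream`, `country`, `hits`) and the router address are assumptions for illustration:

```python
import requests  # assumes the requests library is installed

# Hypothetical datasource and columns; __time is Druid's timestamp column.
query = """
SELECT TIME_FLOOR(__time, 'PT1H') AS hour, country, SUM(hits) AS total_hits
FROM clickstream
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY 1, 2
ORDER BY hour
"""

resp = requests.post(
    "http://localhost:8888/druid/v2/sql/",  # assumed router host/port
    json={"query": query},
)
for row in resp.json():
    print(row)
```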
Druid has two subsystems –
- a write-optimized subsystem in the real-time nodes
- a read-optimized subsystem in the historical nodes
The data is stored as segments in deep storage such as S3 or HDFS, in a column-oriented format.
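A toy illustration of why column orientation helps aggregation (this is not Druid's actual segment format): to sum one metric, only that column's values need to be scanned, not entire rows.

```python
# Row-oriented: summing one metric touches every field of every row.
rows = [
    {"country": "US", "hits": 3, "latency_ms": 120},
    {"country": "DE", "hits": 1, "latency_ms": 80},
]
total_row_oriented = sum(r["hits"] for r in rows)

# Column-oriented: each column is stored contiguously, so an aggregation
# reads just the one array it needs (and it compresses better, too).
columns = {
    "country": ["US", "DE"],
    "hits": [3, 1],
    "latency_ms": [120, 80],
}
total_column_oriented = sum(columns["hits"])

assert total_row_oriented == total_column_oriented == 4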
A good explanatory article that goes into Druid's internal architecture – https://towardsdatascience.com/apache-druid-part-1-a-scalable-timeseries-olap-database-system-af8c18fc766d
Druid differs from Flink and Spark Streaming in that it is not a stream-processing system. Flink can apply real-time transformations to the data, and the results can then be ingested into Druid via Kafka to power real-time dashboards.
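A minimal sketch of the Flink-to-Druid handoff via Kafka, with a plain Python producer standing in for the Flink job's output stage. The broker address and topic name are assumptions; Druid would consume the topic through its Kafka ingestion supervisor:

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

# Stand-in for a Flink sink: publish transformed events to a Kafka topic
# that a Druid Kafka ingestion supervisor is configured to consume.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {
    "timestamp": "2023-01-15T10:30:00Z",
    "page": "/products/widget",  # dimension
    "country": "US",             # dimension
    "hits": 1,                   # metric
}
producer.send("clickstream", value=event)  # assumed topic name
producer.flush()
```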
