Month: October 2020

Delta Lake and Spark for threat detection and response at scale

Notes on a talk about the data platform for threat detection and response at scale, given at Spark+AI Summit 2018.

The threat detection problem, use-cases and scale.

  • It’s important to focus on and build the data platform first, or else one can get siloed into a narrow set of addressable use-cases.
  • We want to detect attacks
  • contextualize the attacks
  • determine the root cause of an attack
  • determine what the scope of the incident might be
  • determine what we need to contain it
  • Diverse threats require diverse data sets
  • The threat signal can be concentrated or spread out in time
  • The Keylines visualization library is used to build a visualization of detection, contextualization and containment

Streaming is a simple pattern that takes us very far for detection

  • Streams are left-joined with context and filtered or inner-joined with indicators
  • Can do a lot with this but not everything
  • Graphs are key. Graphs at scale are super hard.
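The join pattern above can be sketched in plain Python. This is only a toy illustration and not the code from the talk: the field names (`src_ip`, `domain`, `host`, `owner`) and the function `enrich_and_match` are hypothetical, and a real deployment would express the same joins in Spark Structured Streaming.

```python
from typing import Dict, Iterable, Iterator, Set


def enrich_and_match(
    events: Iterable[Dict[str, str]],
    context_by_ip: Dict[str, Dict[str, str]],
    bad_domains: Set[str],
) -> Iterator[Dict[str, object]]:
    """Left-join events with context, then inner-join with indicators."""
    for event in events:
        # Left join: every event is kept; context is attached when known,
        # otherwise the context fields are None.
        context = context_by_ip.get(event["src_ip"], {})
        enriched = {
            **event,
            "host": context.get("host"),
            "owner": context.get("owner"),
        }
        # Inner join with indicators: only events that match an
        # indicator (here, a known-bad domain) are emitted.
        if enriched["domain"] in bad_domains:
            yield enriched


# Hypothetical sample data, for illustration only.
events = [
    {"src_ip": "10.0.0.5", "domain": "example.org"},
    {"src_ip": "10.0.0.7", "domain": "bad.example"},
    {"src_ip": "10.0.0.9", "domain": "bad.example"},
]
context_by_ip = {"10.0.0.7": {"host": "laptop-17", "owner": "alice"}}
bad_domains = {"bad.example"}

detections = list(enrich_and_match(events, context_by_ip, bad_domains))
```

Both matching events come out as detections; the one without a context entry simply carries `None` for `host` and `owner`, which is the point of using a left join for context.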

Enabling triage and containment with search and query

  • Triaging a detection comes down to search and query.
  • ETM does 3.5 million records/sec: 100 TB of data a day, 300B events a day.
  • 11 trillion rows, 0.5 PB of data.
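As a quick sanity check on these figures, the per-second rate and the daily totals are consistent with each other. The sketch below assumes decimal terabytes (1 TB = 10^12 bytes), which the talk did not specify.

```python
SECONDS_PER_DAY = 24 * 60 * 60          # 86,400

records_per_sec = 3_500_000             # quoted rate
events_per_day = records_per_sec * SECONDS_PER_DAY   # 302,400,000,000

bytes_per_day = 100 * 10**12            # 100 TB/day, decimal TB assumed
avg_event_bytes = bytes_per_day / events_per_day     # roughly 331 bytes

print(f"{events_per_day:,} events/day")
print(f"{avg_event_bytes:.0f} bytes/event on average")
```

So 3.5 million records/sec works out to about 302 billion events a day, matching the quoted 300B, with an average event size of roughly 330 bytes.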

Ingestion architecture – tries to solve all these problems and balance issues.

  • Data comes into S3 in a consistent JSON wrapper.
  • A single ETL job takes all the data and writes it into a single staging table, which is partitioned by date and event type and has a long retention.
  • The table is optimized for streaming new data in and streaming data out, but it can be queried as well. You can actually go and query it using SQL.
  • For the highest-value data we write parsers: discrete parsing streams put the data into a common schema and write it into separate Delta tables. Well parsed, well structured.
  • We use optimizations from Delta, such as z-ordering, to index over columns that are common predicates. We search by IP address and domain name, so those are what we order by.
  • Indexing and z-ordering let queries take advantage of data skipping.
  • Sometimes the parser code gets messed up.
  • This is where the single staging table is great. We just let the fixed parser run forward so that new data is parsed correctly, and then the older data gets back-corrected as well. We don’t have to repackage the code and run it as a batch job; we literally just fix the code and run it in the same model.
  • Detection runs off of these refined tables of parsed data.
  • We have a number of detection streams in batches that do the logic and analysis, facet-faced or statistical.
  • Alerts that come out of this go to their own alert table, again in Delta, with long retention and a consistent schema. Another streaming job then does de-duplication and whitelisting and writes the alerts out to our alert management system. We durably store all the alerts, whether or not they are de-duped or whitelisted.
  • This allows us to go back and fix things if something turns out to be not quite correct.
  • All this gives us operational sanity and a nice feedback loop.
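The alert path described above (store every alert durably, then de-duplicate and whitelist before forwarding) can be sketched as follows. This is a minimal in-memory sketch, not the implementation from the talk: the `Alert` fields, the de-duplication key of (rule, entity) and the function `route_alerts` are all assumptions made for illustration, and in the real pipeline both outputs would be Delta tables fed by streaming jobs.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Alert:
    rule: str        # hypothetical: name of the detection that fired
    entity: str      # hypothetical: host, user or IP the alert is about
    event_time: str


def route_alerts(
    alerts: Iterable[Alert], whitelist: Set[str]
) -> Tuple[List[Alert], List[Alert]]:
    """Return (alert_table, forwarded).

    alert_table keeps every alert, so that de-duplication or whitelist
    mistakes can be corrected later. forwarded holds only what would be
    sent on to the alert management system.
    """
    alert_table: List[Alert] = []
    forwarded: List[Alert] = []
    seen: Set[Tuple[str, str]] = set()
    for alert in alerts:
        alert_table.append(alert)            # durable store: everything
        key = (alert.rule, alert.entity)     # assumed de-duplication key
        if alert.entity in whitelist or key in seen:
            continue                         # suppressed, but still stored
        seen.add(key)
        forwarded.append(alert)
    return alert_table, forwarded


# Hypothetical sample alerts, for illustration only.
alerts = [
    Alert("beaconing", "laptop-17", "2018-06-05T10:00:00Z"),
    Alert("beaconing", "laptop-17", "2018-06-05T10:01:00Z"),  # duplicate
    Alert("port-scan", "scanner-01", "2018-06-05T10:02:00Z"),  # whitelisted
    Alert("port-scan", "laptop-22", "2018-06-05T10:03:00Z"),
]
alert_table, forwarded = route_alerts(alerts, whitelist={"scanner-01"})
```

Because suppression happens after the durable write, a whitelist entry that turns out to be wrong can be removed and the stored alerts replayed.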

Thanks to z-ordering, a search can go from scanning 500 TB to scanning 36 TB.

  • The average case is searching over weeks or months of data. This makes it usable for ad-hoc refinements.
  • Simple, unified platform.
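To see why z-ordering reduces the amount of data scanned, here is a toy model of data skipping. It is not Delta's implementation: it assumes two small integer columns standing in for IP address and domain, files of 16 rows each, and per-file min/max statistics used to decide whether a file can be skipped. All function names are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Tuple[int, int]  # (column 0, column 1), e.g. encoded IP and domain


def interleave_bits(x: int, y: int, bits: int = 4) -> int:
    """Z-order (Morton) value: interleave the bits of x and y."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def write_files(rows: List[Row], rows_per_file: int) -> List[Dict[str, Row]]:
    """Split rows into files, keeping per-column min/max for each file."""
    files = []
    for start in range(0, len(rows), rows_per_file):
        chunk = rows[start:start + rows_per_file]
        files.append({
            "min": (min(r[0] for r in chunk), min(r[1] for r in chunk)),
            "max": (max(r[0] for r in chunk), max(r[1] for r in chunk)),
        })
    return files


def files_to_scan(files: List[Dict[str, Row]], column: int, value: int) -> int:
    """Count files whose min/max range could contain column == value."""
    return sum(1 for f in files if f["min"][column] <= value <= f["max"][column])


# Every combination of 16 x 16 values: 256 rows, 16 files of 16 rows.
rows = [(x, y) for x in range(16) for y in range(16)]

linear_files = write_files(sorted(rows), 16)  # sorted by column 0 only
z_files = write_files(sorted(rows, key=lambda r: interleave_bits(*r)), 16)

for name, files in (("linear sort", linear_files), ("z-order", z_files)):
    print(name,
          "| col0 == 5 scans", files_to_scan(files, 0, 5), "of", len(files),
          "| col1 == 9 scans", files_to_scan(files, 1, 9), "of", len(files))
```

In this toy layout a plain sort on the first column needs 1 file for a predicate on that column but all 16 files for a predicate on the second column, while the z-ordered layout needs 4 files for either predicate. That trade-off is what makes z-ordering useful when several columns are common predicates.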

Michael: Demo on interactive queries over months of data

  • The first attempt is a SQL SELECT on the raw data: it takes too long and is cancelled. The second attempt uses HMS: still too long, cancelled. Why is this so hard?
  • Observation: every data problem is actually two problems, 1) data engineering and 2) data science. Most projects fail on step 1.
  • Doing it with Delta: the following command takes 17s and fires off a Spark job to put the data in a common schema.

CREATE TABLE connections USING delta AS SELECT * FROM json.`/data/connections`


SELECT * FROM connections WHERE dest_port = 666

This is great for querying historical data quickly. However, batch alone is not going to cut it, as we may have attacks going on right now. But Delta plugs into streaming as well:

INSERT INTO connections SELECT * from kafkaStream

Now we’ve indexed both batch and streaming data.

We can run a python netviz command to visualize the connections.

Here’s a paper on the Databricks Delta Lake approach – .