Nathan Danneman

Data Machines

Spill Trees at Scale with Hierarchical Divisive Clustering: Catching Domain Squatting, Credential Misuse, and Other Attacks

Cyber analysts often trust outputs more when they come from intuitable models that generate clear comparisons. In various settings, kNN-based methods have satisfied the need for understandable results; however, brute-force solutions are O(n^2), while modern solutions are complicated to implement, do not scale, or are very sensitive to tuning parameter specification. In this talk, I discuss ongoing work on an approach that pre-processes data into a spill tree-like structure using clustering, and then post-processes with a neighbors-of-neighbors strategy. Overall, this method gives strong accuracy across a wide range of parameter settings, is simple to implement, and is suitable for cloud-scale data. The discussion ends with real-world (obfuscated) example applications to identifying domain squatting and credential misuse in big cyber data.
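
The two-stage idea in the abstract can be illustrated with a minimal single-machine sketch. It is not the speaker's implementation: the abstract does not specify the clustering step, so this sketch swaps in a simple two-pivot median split with an overlap band (in the style of a classic spill tree), and the function names (`split_with_spill`, `top_k`, `approx_knn`) and parameters (`leaf_size`, `spill`, `rounds`) are hypothetical. Each point first searches only the leaves it landed in, and then refines its neighbor list by also checking its neighbors' neighbors.

```python
import numpy as np

def split_with_spill(X, idx, leaf_size, spill, rng):
    """Recursively bisect the points in idx; points near the split go to both sides."""
    if len(idx) <= leaf_size:
        return [idx]
    # Assumption: two far-apart pivots stand in for a real divisive clustering step.
    a = X[rng.choice(idx)]
    b = X[idx[np.argmax(np.linalg.norm(X[idx] - a, axis=1))]]
    a = X[idx[np.argmax(np.linalg.norm(X[idx] - b, axis=1))]]
    direction = b - a
    norm = np.linalg.norm(direction)
    if norm == 0:  # all points identical, nothing to split
        return [idx]
    proj = (X[idx] - a) @ direction / norm  # position along the a -> b axis
    mid = np.median(proj)
    width = spill * (proj.max() - proj.min())  # half-width of the overlap band
    left = idx[proj <= mid + width]
    right = idx[proj >= mid - width]
    if len(left) == len(idx) or len(right) == len(idx):
        return [idx]  # overlap covers everything; stop splitting here
    return (split_with_spill(X, left, leaf_size, spill, rng)
            + split_with_spill(X, right, leaf_size, spill, rng))

def top_k(X, i, candidates, k):
    """Return the k candidates closest to point i (excluding i itself)."""
    c = np.array(sorted(candidates - {i}))
    d = np.linalg.norm(X[c] - X[i], axis=1)
    return c[np.argsort(d)[:k]]

def approx_knn(X, k, leaf_size=50, spill=0.05, rounds=1, seed=0):
    """Approximate kNN: leaf-local search, then neighbors-of-neighbors refinement."""
    # Median splits keep every leaf at roughly half of leaf_size or more,
    # so this bound guarantees at least k candidates per point.
    assert leaf_size >= 2 * (k + 1)
    rng = np.random.default_rng(seed)
    n = len(X)
    leaves = split_with_spill(X, np.arange(n), leaf_size, spill, rng)
    # Stage 1: candidates for a point are all points sharing a leaf with it.
    cand = [set() for _ in range(n)]
    for leaf in leaves:
        members = leaf.tolist()
        for i in members:
            cand[i].update(members)
    nbrs = np.array([top_k(X, i, cand[i], k) for i in range(n)])
    # Stage 2: also consider each current neighbor's neighbors.
    for _ in range(rounds):
        refined = []
        for i in range(n):
            c = set(nbrs[i].tolist())
            for j in nbrs[i]:
                c.update(nbrs[j].tolist())
            refined.append(top_k(X, i, c, k))
        nbrs = np.array(refined)
    return nbrs
```

Calling `approx_knn(X, k=5)` on an `(n, d)` NumPy array returns an `(n, 5)` array of neighbor indices. Because every leaf is searched independently, the first stage is the part that would plausibly parallelize across workers; the refinement stage can only improve a point's list, since the current neighbors always remain among the candidates.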