A Data Pipeline for Behavioral Clustering and Classification in Enterprise Networks

David Pekarek

Enterprise networks are typically both large and noisy, with large numbers of distinct users and assets performing widely varying actions. In such networks, identifying the relevant subpopulations is crucial, particularly to avoid the perils of Simpson's paradox, in which a trend seen in pooled data weakens or reverses within each subpopulation. In this talk I present a modular data pipeline for determining subpopulations of network assets. These subpopulations are identified according to behavioral classes defined by configurable, custom feature sets. The classification results can be used as prefilters for follow-on analyses, as input data for anomaly detection algorithms, and as enrichment during hunt operations. The value of this approach is supported by results from real customer enterprises.
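
To make the Simpson's paradox concern concrete, the following is a minimal sketch and not part of the pipeline described in the talk. It assumes a hypothetical comparison of two sites whose assets fall into two behavioral classes; the site names, class names, and counts are toy values invented purely for illustration. Pooled over all assets, one site shows the lower flagged-event rate, yet within each class the same site shows the higher rate.

```python
from collections import defaultdict

# Hypothetical toy counts (illustrative only, not real customer data):
# (site, asset_class) -> (flagged_events, total_events)
COUNTS = {
    ("site_a", "workstation"): (81, 87),
    ("site_a", "server"): (192, 263),
    ("site_b", "workstation"): (234, 270),
    ("site_b", "server"): (55, 80),
}


def pooled_rates(counts):
    """Flagged-event rate per site, ignoring asset class."""
    totals = defaultdict(lambda: [0, 0])
    for (site, _asset_class), (flagged, total) in counts.items():
        totals[site][0] += flagged
        totals[site][1] += total
    return {site: flagged / total for site, (flagged, total) in totals.items()}


def stratified_rates(counts):
    """Flagged-event rate per site within each asset class."""
    rates = defaultdict(dict)
    for (site, asset_class), (flagged, total) in counts.items():
        rates[asset_class][site] = flagged / total
    return dict(rates)


pooled = pooled_rates(COUNTS)
by_class = stratified_rates(COUNTS)

print("pooled:", pooled)
for asset_class, rates in by_class.items():
    print(asset_class, rates)
```

In this toy data the reversal arises because the two sites have very different mixes of workstations and servers, which is the kind of effect that grouping assets into behavioral classes before comparison is meant to guard against.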