Tadesse Zemichael and Rachel Allen

Heterogenous Graph Embedding for Malicious Azure Sign-in Detection

Azure Active Directory (Azure-AD) is an identity and access management service that helps users access external and internal resources such as Office365 and SaaS applications. The Azure-AD sign-in logs identify who the user is, which application is used for the access, and which target is accessed by the identity [1]. At a given time t, a user u requests a service s from a device d using an authentication mechanism a, and the request is either allowed or blocked. Previous work on anomalous authentication detection applies either black-box ML models to handcrafted features extracted from authentication logs or rule-based models [8]. The closest graph-based work on malicious authentication detection is [9], which builds a graph from each user's login log and then extracts graph features for use in similarity metrics.

Our work builds on the success of heterogeneous GNN embeddings in cyber applications such as fraud detection [2,7] and cyber-attack detection on prevalence datasets. Unlike earlier models, this work uses heterogeneous graphs to model authentications and relational GNN embeddings to capture relations among different entities. This allows us to take advantage of relations among users and services while avoiding the feature-extraction phase [8]. As a result, the model learns from both the structural identity and the unique feature identity of individual users. The drawback of a rule-based or feature-based system is that it fails to generalize to new attacks, and its rules need frequent maintenance. Evolving attacks and connected malicious users across the network are hard to detect with feature-based or rule-based methods. This work presents a heterogeneous relational convolutional graph embedding approach for malicious Azure-AD sign-in detection. First, to overcome node feature sparsity and capture activity, sign-in records are aggregated by time window t and by node tuple (User, Device, Service).
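
The windowed aggregation step can be illustrated with a minimal sketch. This is not the authors' implementation: the record layout, the one-hour window, and the two counters (`attempts`, `failures`) are assumptions chosen only to show how grouping by (time window, User, Device, Service) turns sparse per-event records into denser per-tuple features.

```python
from collections import defaultdict

# Hypothetical sign-in records: (timestamp_seconds, user, device, service, failed).
# The field names and values are illustrative, not taken from real Azure-AD logs.
SIGN_INS = [
    (10, "alice", "laptop-1", "Office365", False),
    (50, "alice", "laptop-1", "Office365", True),
    (3700, "alice", "laptop-1", "Office365", False),
    (20, "bob", "phone-7", "crm-app", False),
]


def aggregate(records, window_seconds=3600):
    """Group records by (window index, user, device, service) and count activity."""
    buckets = defaultdict(lambda: {"attempts": 0, "failures": 0})
    for ts, user, device, service, failed in records:
        # Integer division maps a timestamp to the index of its time window.
        key = (ts // window_seconds, user, device, service)
        buckets[key]["attempts"] += 1
        buckets[key]["failures"] += int(failed)
    return dict(buckets)


agg = aggregate(SIGN_INS)
for key, feats in sorted(agg.items()):
    print(key, feats)
```

Each resulting key would correspond to one aggregated sign-in observation whose counters serve as its time-dependent features.
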
The nodes are split into a target node type, “authentication”, which captures dynamic sign-in behavior, and static node types (user, device, and service). This allows us to associate all time-changing features with authentication nodes and removes the need to model the dynamically evolving nature of the graph, since every authentication is distinct in the time domain. Finally, a heterogeneous relational graph convolutional network (R-GCN) [5] is trained to output embeddings of the “authentication” nodes, which are then fed into a binary classifier or an anomaly detection algorithm for scoring. We report a comparison of the model's performance on data extracted from real-world Azure authentication logs.
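
To make the graph layout concrete, the following sketch links each “authentication” node to its static user, device, and service nodes and applies one R-GCN-style update: a self-connection plus, for each relation, the mean of the transformed neighbor features, followed by ReLU. It is a toy illustration under stated assumptions, not the model described above: the node names, the two-dimensional features, and the fixed identity weight matrices are hypothetical, whereas a trained model would learn one weight matrix per relation.

```python
# Hypothetical authentication nodes: id -> (user, device, service, features).
AUTHS = {
    "auth-0": ("alice", "laptop-1", "Office365", [2.0, 1.0]),
    "auth-1": ("bob", "phone-7", "crm-app", [1.0, 0.0]),
}

# Hypothetical static-node features (for example, a learned or one-hot identity).
STATIC = {
    "alice": [1.0, 0.0], "bob": [0.0, 1.0],
    "laptop-1": [1.0, 0.0], "phone-7": [0.0, 1.0],
    "Office365": [1.0, 0.0], "crm-app": [0.0, 1.0],
}


def build_edges(auths):
    """Return typed edge lists: relation name -> list of (static_node, auth_node)."""
    edges = {"user_to_auth": [], "device_to_auth": [], "service_to_auth": []}
    for auth_id, (user, device, service, _) in auths.items():
        edges["user_to_auth"].append((user, auth_id))
        edges["device_to_auth"].append((device, auth_id))
        edges["service_to_auth"].append((service, auth_id))
    return edges


def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def rgcn_layer(auth_feats, static_feats, edges, rel_weights, self_weight):
    """One R-GCN-style update of authentication nodes from their typed neighbors."""
    out = {}
    for node, h in auth_feats.items():
        total = matvec(self_weight, h)  # self-connection term
        for rel, pairs in edges.items():
            nbrs = [src for src, dst in pairs if dst == node]
            for src in nbrs:
                msg = matvec(rel_weights[rel], static_feats[src])
                # Normalize by the number of neighbors under this relation.
                total = [t + m / len(nbrs) for t, m in zip(total, msg)]
        out[node] = [max(0.0, t) for t in total]  # ReLU
    return out


IDENTITY = [[1.0, 0.0], [0.0, 1.0]]  # placeholder for learned weights
edges = build_edges(AUTHS)
auth_feats = {k: v[3] for k, v in AUTHS.items()}
emb = rgcn_layer(auth_feats, STATIC, edges, {r: IDENTITY for r in edges}, IDENTITY)
print(emb)
```

Because only authentication nodes carry time-dependent features, the static nodes act as shared context that connects sign-ins by the same user, device, or service.
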