Shobha Venkataraman Traffic Analysis for Network Security using Learning Theory and Streaming Algorithms Degree Type: Ph.D. in Computer Science Advisor(s): Avrim Blum, Dawn Song Graduated: December 2008 Abstract: A recurring problem in network traffic analysis is to automatically distinguish legitimate traffic from malicious or spurious traffic. This problem arises in several guises in network security (e.g., spam mitigation, worm detection), and is, at core, a machine learning or data mining problem. However, traffic analysis for network security has many fundamental challenges that are not present in typical machine learning or data mining problems, and a blackbox application of classical algorithms may not address these challenges adequately. For example, many standard machine learning algorithms may not scale to the volume and diversity of network traffic, or perform well in the presence of a malicious adversary who aims to evade detection. It is, therefore, necessary to design algorithms that meet these challenges, and provide formal guarantees on how well they have been met by the algorithms and the extent to which they can be met by any algorithm. In this thesis, we consider four problems in network security with these challenges, and we use tools from computational learning theory and streaming algorithm design to address them. In each of these four problems, the difference between the malicious traffic and normal traffic is characterized by a specific structure of the traffic distributions: temporal structure, structure in content, structure in communication patterns of hosts and network structure given by host IP addresses. We present both efficient algorithms as well as fundamental lower bounds for these problems: In the stepping-stones problem, we use the temporal structure of the traffic – in particular, the inter-packet timing delays – to identify pairs of streams that are likely to be stepping-stones. We provide algorithms with strong upper bounds on the number of packets they need to observe, to detect attacks with given false positive and false negative rates. We also present lower bounds showing how an adversary, with sufficient chaff, could evade any detection mechanism that is based only on the timing delays between packets. When generating signatures for exploits with pattern-extraction techniques, the content structure of network traffic is used to identify packets that are likely to be worms. A sequence of prior work has alternately developed pattern-extraction algorithms for signature-generation, and attacks on these pattern-extraction algorithms. We present lower bounds showing how any pattern-extraction algorithm could be misled, in the presence of an adversary with sufficient control over the malicious data. We present efficient streaming algorithms to identify superspreaders, which are sources that contact many distinct destinations in a short time period. The communication structure of most hosts on Internet makes finding superspreaders of interest to security applications, as they are likely indicators of worms, scanning, or other malicious activity. Our experimental results on real network traces show that our algorithms are substantially more efficient than earlier approaches. Finally, we study the problem of tracking regions of the IP address space that send malicious traffic. In the first part of our work, we focus on spam traffic, and explore whether the history and structure of IP addresses could be used to distinguish spammers from senders of legitimate mail. In the second part, we design online algorithms that, with low space requirements, can dynamically track IP prefixes that originate the malicious traffic, and provide a near-optimal prediction of IP addresses that send malicious traffic and normal traffic. Our experimental results demonstrate that our algorithm finds prefixes that are orders of magnitude more accurate than fixed commonly-used IP prefixes. Thesis Committee: Avrim Blum (Co-Chair) Dawn Song (Co-Chair) Bruce Maggs Phillip B. Gibbons (Intel Research – Pittsburgh) Subhabrata Sen (AT&T Labs – Research) Oliver Spatscheck (AT&T Labs – Research) Peter Lee, Head, Computer Science Department Randy Bryant, Dean, School of Computer Science Keywords: Network security, machine learning, learning theory, traffic analysis, streaming algorithms CMU-CS-08-157.pdf (881.13 KB) ( 148 pages) Copyright Notice