Jimeng Sun
Incremental Pattern Discovery on Streams, Graphs and Tensors
Degree Type: Ph.D. in Computer Science
Advisor(s): Christos Faloutsos
Graduated: December 2007

Abstract:
Incremental pattern discovery targets streaming applications in which data arrive continuously and incrementally. The questions are how to find patterns (main trends) incrementally, how to efficiently update old patterns when new data arrive, and how to use the patterns to solve other problems such as anomaly detection. As examples, 1) a sensor network monitors a large number of distributed streams (such as temperature and humidity); 2) network forensics monitors Internet communication patterns to identify attacks; 3) cluster monitoring examines the system behaviors of a number of machines for potential failures; 4) social network analysis monitors a dynamic graph for communities and abnormal individuals; 5) financial fraud detection tries to find fraudulent activities in a large number of transactions.

We first investigate a powerful data model, the tensor stream (TS), in which there is one tensor per timestamp. To capture diverse data formats, we have a zero-order TS for a single time-series (e.g., the stock price for Google over time), a first-order TS for multiple time-series (sensor measurement streams), a second-order TS for a matrix (graphs), and a high-order TS for a multiarray (Internet communication network, source-destination-port).

Second, we develop different online algorithms on TS: 1) the centralized and distributed SPIRIT for mining a 1st-order TS, as well as its extensions for local correlation function and privacy preservation; 2) the compact matrix decomposition (CMD) and GraphScope for a 2nd-order TS; 3) the dynamic tensor analysis (DTA), streaming tensor analysis (STA) and window-based tensor analysis (WTA) for a high-order TS. All the techniques are extensively evaluated on real applications such as network forensics and cluster monitoring.
In particular, CMD achieves orders of magnitude improvements in space and time over the previous state of the art, and identifies interesting anomalies. GraphScope detects interesting communities and change-points on several time-evolving graphs, such as the Enron email graph and a network traffic flow graph. DTA, STA and WTA are all online methods for higher-order data that scale well with time and offer fundamental tradeoffs with one another; they have also been applied to a number of applications, such as social network community tracking, anomaly detection in data centers and network traffic monitoring.

Thesis Committee:
Christos Faloutsos (Chair)
Tom Mitchell
Hui Zhang
David Steier (PWC)
Philip Yu (IBM)

Peter Lee, Head, Computer Science Department
Randy Bryant, Dean, School of Computer Science

Keywords: Data mining, stream mining, incremental learning, clustering, tensor

CMU-CS-07-149.pdf (5.77 MB) (201 pages)