My research has centered on statistical classification methods and their applications to a variety of challenging real-world problems, including automated text categorization and clustering, corpus-based learning for cross-language information retrieval, novel-event detection and tracking from sequential data, information extraction from Internet/Web environments, intelligent email filtering and prioritization, and statistical reasoning based on protein/gene expressions. My research areas are described in more detail below:
- Adaptive Information Filtering: Automatically monitoring a stream of documents (e.g., news stories or newsgroup postings) to find just those stories that are interesting to you, and learning from examples what kinds of documents you find interesting.
- Cross-Language Information Retrieval: Using queries in one language (such as English) to search for documents in other languages (such as German, Arabic, Spanish, or Chinese). We use statistical techniques to learn mappings between language pairs, with bilingual parallel text as training data.
- Discovery from Protein Sequences: Pattern recognition from protein sequences and automated mapping among sequences, folding structures, and biological functions form a new line of research in which we actively collaborate with biologists.
- Scalable Text Categorization: Using machine learning algorithms (regression models, nearest neighbor methods, support vector machines, hidden Markov models, and so on) to classify documents into a pre-defined taxonomy of categories (such as the Yahoo! hierarchy) is an open challenge. We are currently focusing on large-scale hierarchical categorization and on personalized email filtering and routing.
- Topic Detection & Tracking: Adapting supervised and unsupervised learning techniques to temporally ordered data streams (such as newswire data or radio and television broadcasts) to automatically detect novel events, track new trends in events of interest to the user, and filter important information.
- Web Mining for Question Answering: The web provides a rich source of statistical information (e.g., tables and graphs). Our challenge is to extract, aggregate