- Mining and modeling of attributed networks
- Detecting opinion spam in online review sites
- Spam detection in online media
- Analysis of large-scale brain networks
- Finding and visualizing anomalies in temporal data
- Visualization of communities in massive graphs
- Anomaly detection in complex graphs
- Fraud detection in social security
- Vulnerability and resilience in large dynamic graphs
- Study and analysis of online question answering sites
- Behavioral analysis and prediction in online media
Funding and Projects
We are thankful to the following funding agencies for their support of our research.
Ongoing projects of our group include:
- NSF CAREER 1452425
Anomaly mining is the task of finding irregularities in data. It finds applications in a plethora of domains, such as security, finance, astronomy, and medicine. Despite its immense popularity, however, it remains an extremely challenging task for many real-world applications. For many practitioners, the task is poorly defined and under-specified, as existing definitions and solutions are often too simplistic and do not directly correspond to the needs of modern applications. This project takes the essential steps to bridge the gap between research and practice, to dramatically improve the usability, effectiveness, and interpretability of anomaly mining techniques, and ultimately to mature the field into a more valuable contributor to the larger world. It promises significant impact on many concrete problems, such as insider threat, tax evasion, and health-care fraud detection, that are important for government, industry, and society. Collaborations with industry and hospital partners aim to shepherd innovations into deployed technology, with tangible impact on security and healthcare.
The primary agenda to achieve these goals involves developing a new framework for anomaly mining that utilizes multiple heterogeneous data sources and techniques in a corroborative fashion to fundamentally reframe our understanding of, and ability to define, detect, and describe, real-world anomalies. The project formalizes novel definitions of complex anomalies that fuse multiple data sources, and invents complex anomaly detection algorithms that also present descriptions providing the rationale for the detected anomalies. Research also explores and models anomaly ensembles that systematically harness evidence from multiple detection techniques. Ultimately, this project strives to push the boundaries of anomaly mining as a field through this quest for principled foundations and practices.
- NSF IIS 1408287
Given user reviews on Web sites such as Yelp, Amazon, and TripAdvisor, which ones should one trust? Online reviews have become an important resource for public opinion sharing. They influence our decisions over an extremely wide spectrum of daily and professional activities: e.g., where to eat, where to stay, which products to purchase, which doctors to see, which books to read, which universities to attend, and so on. However, the credibility and trustworthiness of online reviews are at stake. It is well known that a large body of reviews is fabricated (by owners, competitors, or entities paid by them) to create a false perception of the actual quality of products and services. What is more, opinion fraud is prevalent; while credit card fraud is as rare as 0.2% or less, it is estimated that 20-30% of the reviews on well-known service sites could be fake. This poses a serious risk to businesses and the public, from investing in a low-quality product to consulting an incompetent doctor for diagnosis and treatment. Like other kinds of fraud, opinion fraud is a serious legal offense. In fact, it is currently being recognized as a serious issue in law enforcement by policymakers. Thus solving this problem is of great importance to businesses and the general public alike. Accurately spotting opinion fraud will enable site owners to provide trustworthy content, maintain the integrity of their service, and protect online citizens from unfair (or potentially harmful) products and services. Businesses will also benefit from reviews with reliable feedback. Honest businesses will be indirectly rewarded, as it will no longer be easy for unscrupulous businesses to benefit from fake reviews. The research outcomes will thus contribute significantly to the healthy growth of Internet commerce.
Educational activities include incorporating research findings in graduate-level courses, educating the public on fraudulent behavior and misinformation, and providing publicly available educational materials including lectures and manuscripts.
Given the critical issues of opinion fraud in online communities, how can one identify fake reviews and attribute them to the culprits responsible? By conjoining the expertise of the PIs over various modalities of deception footprints, ranging over language, user behavior, and relational information, this project presents a research program that will result in much-needed solutions to this emergent, prevalent, and socially impactful problem. The ultimate goal is to create a unified detection framework via synergistic integration of multiple information sources, from linguistics, user behavior, and network effects, to obtain the best of all worlds. The main idea is to formulate the problem as a relational inference task on composite heterogeneous networks, providing a principled, extensible approach that can blend and reinforce all the above cues towards effective and robust detection of fraud. From a scientific point of view, the research brings together three disciplines: natural language analysis, behavioral modeling, and graph mining. The outcome is a suite of novel, principled, and scalable techniques and models that will enhance our understanding of the creation and dissemination of opinion fraud and misinformation in general at a large scale. The PIs will collaborate with industry partners such as Yelp, Google, and Amazon, directly solicit online fake reviews, and conduct well-designed user studies for testing and validation of their techniques.
- DARPA TC Project (Contract No. FA8650-15-C-7561)
The goal of this project is to explore and create a suite of technologies called MARPLE that can radically harden enterprise security through large-scale automation of the task of detecting sophisticated cyber threats, as a first step to remediating and preventing subsequent cyber exploits.
Enterprises today largely use perimeter-based controls for their defense. These tools typically come from different vendors, are fragmented, and provide a narrow view of the activities in one part of the enterprise infrastructure. In doing so, they ignore three crucial observations: (a) first, the activities in an enterprise are not all independent of one another but are sometimes causally related; (b) second, normal day-to-day enterprise activities take place in relatively small and stable interaction neighborhoods; (c) third, in contrast, cyber threats often cross such neighborhoods. So, while the tools in use today generate a multitude of event alert streams, logs, and audit records that contain potentially actionable intelligence, the inability to consolidate and correlate these events and data automatically at line speeds, and to present them to the security analyst in a semantically meaningful manner, robs security analysts and administrators of a valuable tool to defend enterprise networks.
To this end, MARPLE aims to combine ideas from four disparate areas to explore a radical and game-changing approach to cyber security: Causality Tracking from Distributed Systems, Heterogeneous Information Networks (HINs) from Data Mining and Knowledge Discovery, Efficient Implementations of Large Graphs and Graph Analytics, and Policy Learning and Enforcement. Specifically:
- Causality Tracking: Based on the motivation above, our basic hypothesis is that tracking causal linkages and provenance across enterprise activities, at a very fine level of granularity, can reveal and identify the common communication/computation structures in an enterprise while at the same time forcing malicious actors to expose themselves.
- Heterogeneous Information Networks (HINs): Modeling such causal linkages as a heterogeneous information network, which leverages the rich semantics of typed nodes and links, uncovers surprisingly rich knowledge from the network. Such modeling enables the application of new principles and powerful methodologies for mining interconnected data, including (1) rank-based clustering and classification; (2) meta-path-based similarity search and mining; and (3) relation-strength-aware mining.
- Big Data Implementations of HINs and Graph Analytics for Security: We propose new algorithms for heterogeneous graphs, extending the current state of the art, for connectivity anomaly detection, clustering and role discovery, pattern discovery, and attack similarity detection through graph matching, all of which apply directly to many security use cases.
- Policy Learning and Enforcement: We propose new and scalable policy learning algorithms, and efficient enforcement mechanisms for local and global provenance-based information-flow policies at various granularities.
If our research is successful, enterprises will be able to (a) detect attacks that they are missing today; (b) detect existing attacks faster and earlier in the kill-chain life cycle; (c) formulate and implement a scalable enterprise-wide defense, so that intelligence from one part of the network can be used enterprise-wide; and (d) share actionable threat and vulnerability intelligence with peer enterprises.
- ARO Young Investigator Program (Contract No. W911NF-14-1-0029)
- PNC (Financial Services Innovation Center)
- PwC (Risk and Regulatory Services Innovation Center)
In the past, our research was also funded by:
- ONR SBIR grant (Contract No. N00014-14-P-1155)
- Stony Brook University Office of Vice President for Research
- Northrop Grumman Aerospace Systems (NGAS)
- Facebook (faculty gift)