Researchers from various research areas (e.g., defect prediction, sentiment mining, developer social networks) analyze software projects to develop new ideas or to test their assumptions by performing case studies. However, analyzing a software project involves two distinct steps:

  1. collecting the project data, including, e.g., pre-processing steps and the synthesis of intermediate results, and
  2. performing the analysis on the basis of this data.

Currently, the tooling for these steps is highly heterogeneous, which raises the problem that the performed studies are often not replicable. As a consequence, meta-analyses, which are needed, e.g., to create benchmarks for approaches, are often not possible. Hence, we developed our platform SmartSHARK, which can help to improve the validity and replicability of software mining studies. SmartSHARK combines the two essential steps into one platform: on the one hand, it enables researchers to easily collect data from various repositories; on the other hand, it uses Apache Spark as an analytical backend to analyze the collected data.
SmartSHARK is able to collect project-level data from:

  • Version Control Systems (VCSs)
  • Issue Tracking Systems (ITSs)
  • Mailing Lists

Furthermore, it collects:

  • Abstract Syntax Tree (AST) statistics
  • Product metrics, on different levels of abstraction (e.g., class-level, method-level)
  • Clone data (detection of Type-2 clones)
  • Clone metrics
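To illustrate what such collected artifacts could look like once stored, the following sketch builds a MongoDB-style document for class-level product metrics as a plain Python dict. All field names and values are hypothetical assumptions made for illustration; they are not the actual SmartSHARK schema.

```python
# Hypothetical sketch of a MongoDB-style document holding class-level
# product metrics; field names are illustrative assumptions only.
from datetime import datetime

metrics_document = {
    "project": "example-project",          # project the entity belongs to
    "commit_hash": "a3f5c9d",              # revision the metrics were computed for
    "entity": "org/example/Parser.java",   # the measured class-level entity
    "abstraction_level": "class",          # e.g., class-level or method-level
    "metrics": {
        "LOC": 412,   # lines of code
        "WMC": 17,    # weighted methods per class
        "CBO": 9,     # coupling between objects
    },
    "collected_at": datetime(2016, 1, 1).isoformat(),
}

# A simple consistency check a pre-processing step might perform:
assert metrics_document["abstraction_level"] in {"class", "method"}
print(sorted(metrics_document["metrics"]))  # → ['CBO', 'LOC', 'WMC']
```

Storing one document per entity and revision like this keeps metrics at different abstraction levels in a uniform shape, so later analysis steps can filter by `abstraction_level` instead of handling separate formats.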

This data is stored in a MongoDB. Furthermore, the different types of data are linked with each other, which simplifies the analysis. On the analysis side, Apache Spark provides us with the efficiency and the algorithms needed to analyze data of this size.
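As a minimal sketch of why linked data simplifies analysis, the snippet below joins commit documents to the issues they reference via a shared issue id. Plain Python dicts stand in for the MongoDB collections, and all field names are assumptions made for illustration; in the deployed platform, a query like this would run at scale on the Apache Spark backend.

```python
# Sketch: linking commits to the issues they reference.
# Plain dicts stand in for MongoDB collections; the field names are
# illustrative assumptions, not the actual SmartSHARK schema.

commits = [
    {"hash": "a3f5c9d", "author": "alice", "linked_issue": "PROJ-12"},
    {"hash": "b7e2f10", "author": "bob",   "linked_issue": "PROJ-12"},
    {"hash": "c9d4e55", "author": "alice", "linked_issue": None},
]

issues = [
    {"id": "PROJ-12", "type": "bug",     "status": "closed"},
    {"id": "PROJ-13", "type": "feature", "status": "open"},
]

# Because the collections are connected via the issue id, a question
# like "which commits fixed bugs?" reduces to a simple join.
issues_by_id = {issue["id"]: issue for issue in issues}
bug_fixing = [
    c["hash"]
    for c in commits
    if c["linked_issue"] is not None
    and issues_by_id[c["linked_issue"]]["type"] == "bug"
]

print(bug_fixing)  # → ['a3f5c9d', 'b7e2f10']
```

Without the explicit links between VCS and ITS data, each study would have to re-implement its own heuristics for mapping commits to issues, which is one of the sources of non-replicability mentioned above.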
An instance of SmartSHARK is currently deployed and can be reached via:
An old version of SmartSHARK can be found at:


2011 © Software Engineering For Distributed Systems Group