
To assess the usefulness of the labelled part of the dataset in practice, we show three applications of different nature and apply them to enhance the MAR search engine: detection of dummy models, single-label classification to infer model categories, and multi-label classification to infer relevant tags. We contribute the first version of ModelSet, a large dataset of labelled models which comprises 5,466 Ecore meta-models and 5,120 UML models. It is accompanied by a supporting tool implemented as an Eclipse plug-in.

In the software engineering domain, many approaches (either ML-based or not) rely on public datasets designed for concrete applications. A well-known example is the Defects4J dataset, which has fueled research in program repair. In this paper, we tackle the creation of labelled datasets of software models. The main difficulty is that labelling a single model can be hard and time-consuming due to the domain expertise required to explore and understand the model and assign a proper label. To address this shortcoming, we have devised an interactive, semi-automatic labelling method based on grouping similar models using a search engine. We have created an Eclipse plug-in as a concrete instantiation of the method, including features like automatic model grouping by similarity, visualizations, and label review. We have used this tool to label Ecore meta-models and UML models with their category (e.g., DSLs for Petri nets, UML for modelling an ATM). Moreover, we provide a number of additional labels, like tags to describe the main topics of a model. Altogether, this paper makes the following contributions: We propose a methodology to speed up the process of labelling software models.

We showcase the usefulness of the dataset by applying it in a real scenario: enhancing the MAR search engine. We use ModelSet to train models able to infer useful metadata to navigate search results. The dataset and the tooling are available online, together with a live version.

Model-driven engineering (MDE) is a software development paradigm that advocates the use of models as active elements in the development cycle. Such models can be created with general-purpose modelling languages (e.g., UML) or using a domain-specific language (DSL). At the same time, artificial intelligence (AI) and machine learning (ML), its most current branch, have shown their potential to enhance software engineering approaches in many areas, but their application to tasks in the modelling domain is still relatively recent. For instance, a neural network has been used to classify meta-models into application domains, clustering techniques have shown their usefulness in organizing collections of models, and graph kernels have been proposed as a means to characterize similar models.

An important limitation of current applications of ML for MDE is the lack of large datasets (either labelled or unlabelled) from which rich ML models can be trained. While a few model datasets are freely available, their quality is not adequate. Most datasets are small (e.g., 555 labelled meta-models), while others are not curated (e.g., one previously proposed UML dataset contains 90,000 models, but neither availability nor navigability is guaranteed: only 23,000 models can be downloaded, and only 3,000 of them are EMF-valid, while the rest either crash or cannot easily be matched to a tool that can edit them). This scenario contrasts with the situation in other application areas. For instance, one of the milestones of the ML community was the creation of large datasets, like ImageNet, which contains thousands of manually labelled images.
We have evaluated the ability of our labelling method to create meaningful groups of models in order to speed up the process, improving the effectiveness of classical clustering methods.

We have built an Eclipse plug-in to support the labelling process, which we have used to label 5,466 Ecore meta-models and 5,120 UML models with their category as the main label, plus additional secondary labels of interest.

The application of machine learning (ML) algorithms to address problems related to model-driven engineering (MDE) is currently hindered by the lack of curated datasets of software models. There are several reasons for this, including the lack of large collections of good-quality models, the difficulty of labelling models due to the required domain expertise, and the relative immaturity of the application of ML to MDE. In this work, we present ModelSet, a labelled dataset of software models intended to enable the application of ML to address software modelling problems. To create it, we have devised a method designed to facilitate the exploration and labelling of model datasets by interactively grouping similar models using off-the-shelf technologies like a search engine.
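
To make the idea of labelling by grouping similar models more concrete, the following is a minimal sketch under strong assumptions, not the implementation used for ModelSet. It replaces the search engine with a simple Jaccard similarity over the names of model elements, and every identifier in it (`jaccard`, `propose_groups`, `label_groups`, `EXAMPLE_MODELS`, the file names, and the 0.5 threshold) is hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Set

# Hypothetical toy data: each model is reduced to the set of its element names.
EXAMPLE_MODELS: Dict[str, Set[str]] = {
    "atm.uml": {"ATM", "Account", "Card", "Transaction"},
    "bank.uml": {"Account", "Card", "Transaction", "Customer"},
    "petrinet.ecore": {"PetriNet", "Place", "Transition", "Arc"},
    "pn2.ecore": {"Place", "Transition", "Arc", "Token"},
}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two name sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def propose_groups(models: Dict[str, Set[str]],
                   threshold: float = 0.5) -> List[List[str]]:
    """Greedy grouping: pick an ungrouped seed model, then pull in every
    remaining model whose similarity to the seed reaches the threshold.
    The similarity function is a stand-in for a real search engine query."""
    remaining = sorted(models)  # sorted for a deterministic order
    groups: List[List[str]] = []
    while remaining:
        seed = remaining.pop(0)
        group = [seed]
        rest = []
        for other in remaining:
            if jaccard(models[seed], models[other]) >= threshold:
                group.append(other)
            else:
                rest.append(other)
        remaining = rest
        groups.append(group)
    return groups


def label_groups(groups: List[List[str]],
                 ask_label: Callable[[List[str]], str]) -> Dict[str, str]:
    """Ask the labeller once per group and apply the answer to all members."""
    labels: Dict[str, str] = {}
    for group in groups:
        label = ask_label(group)
        for name in group:
            labels[name] = label
    return labels


if __name__ == "__main__":
    groups = propose_groups(EXAMPLE_MODELS)
    # A real tool would show each group to a human; here the "labeller"
    # is a stub that derives a label from the file extension.
    labels = label_groups(groups, lambda g: g[0].rsplit(".", 1)[-1])
    print(groups)
    print(labels)
```

The point of the sketch is the cost model: the labeller is consulted once per group rather than once per model, so the benefit depends on how well the groups match the intended categories. A practical tool would also let the labeller review and correct group membership before a label is applied.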
