A small EMF story

I am currently working on a project where I am trying to use machine learning and natural language processing techniques to analyze forum-style conversations between software designers. Having focused my time in grad school on software modelling and transformations, this is an entirely new field for me and I’m learning a whole bunch of new tools and techniques. I am also programming heavily, something I missed a lot during the later years of my degree. So far I’ve been using Java, but I expect to have to do a lot of scripting as well.

Ultimately, I need to pose my question in a format such as ARFF, used by the Weka tool suite. As a task, this is similar in spirit to creating, e.g., a query in DIMACS for a SAT solver, or an SMT-LIB spec for Z3. Basically, the game is: start from raw data and structure it in a machine-usable format. For machine learning, there is an additional, highly non-trivial step, where the structured data needs to be described in terms of “features”. That’s where the magic really happens. But this is not a story about me learning machine learning. This is a story about me programming a project that (as of now) has about 10k lines of code and will only get bigger.

My research pedigree being in software modelling, I got familiar very early on (back in NTUA when I was working with Kostas) with the Eclipse Modeling Framework. EMF allows creating Ecore models (a simple version of UML class diagrams) and generating code from them. It does all sorts of other lovely things (such as supporting persistence, model transformations, etc.) but for now, code generation is all I’m using it for. Specifically, I have found it extremely useful for describing and structuring the data in a consistent, clean way.

Here is the metamodel I was originally using to store the information I parse from a discussion:


Basically: a Discussion consists of an ordered set of Comments, each one with a unique author, a Participant. Each Comment consists of a set of Utterances, representing the unit of analysis for machine learning: a specific thing someone said that could be a DesignPoint or not. A DesignPoint (DP) is a particular decision that people in the discussion have identified that needs to be made. Cal, a summer intern at UofT, and I have spent some time doing manual open coding to characterize such DPs in conversations in terms of the enumerations on the right of the model, e.g., what Theme the DP is about, what Form it takes, etc.

From this metamodel, I’ve generated a whole lot of code, which I then use to parse discussions into Discussion models. Once a discussion is parsed, I generate an ARFF file, where each Utterance is represented by a line of metadata. I then use Weka to look for patterns that characterize specifically those Utterances that are DesignPoints.
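To make the ARFF step concrete, here is a minimal sketch in plain Java of writing one data line per Utterance. It is not the project's actual code and does not use the Weka API. The class `UtteranceRow` and the three features (`wordCount`, `containsQuestion`, `isDesignPoint`) are hypothetical and chosen purely for illustration. Only the ARFF layout itself (`@relation`, `@attribute`, `@data`) is standard.

```java
import java.util.List;

// Hypothetical, simplified stand-in for a parsed Utterance and its features.
final class UtteranceRow {
    final int wordCount;            // hypothetical numeric feature
    final boolean containsQuestion; // hypothetical nominal feature
    final boolean isDesignPoint;    // the class label to be learned

    UtteranceRow(int wordCount, boolean containsQuestion, boolean isDesignPoint) {
        this.wordCount = wordCount;
        this.containsQuestion = containsQuestion;
        this.isDesignPoint = isDesignPoint;
    }
}

final class ArffSketch {
    // Renders the rows as ARFF text: a header that declares the attributes,
    // followed by one comma-separated data line per utterance.
    static String toArff(String relation, List<UtteranceRow> rows) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        sb.append("@attribute wordCount numeric\n");
        sb.append("@attribute containsQuestion {true,false}\n");
        sb.append("@attribute isDesignPoint {yes,no}\n\n");
        sb.append("@data\n");
        for (UtteranceRow r : rows) {
            sb.append(r.wordCount).append(',')
              .append(r.containsQuestion).append(',')
              .append(r.isDesignPoint ? "yes" : "no")
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(toArff("discussion",
                List.of(new UtteranceRow(12, true, true),
                        new UtteranceRow(3, false, false))));
    }
}
```

The last attribute plays the role of the class label, which is the conventional position Weka's tools look at by default.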

This is the first point where EMF has been extremely useful to me. It has allowed me to develop my application in the spirit of MVC. I use diagrams to conceptualize and represent the data and I let EMF generate code to manage the consistency and storage of the data. I don’t have to worry about, e.g. making sure that a Participant object has a reference to exactly that Comment object that has a reference back to it. EMF takes care of generating that code. I can then focus exclusively on the “controller” (doing the parsing) and the “view” (generating ARFF, logs, etc.).
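For readers who have not seen this kind of bookkeeping, the following is a hand-written sketch of what keeping a bidirectional reference consistent involves. It is an illustration only: the code EMF actually generates is built on its own runtime classes and looks different. The method names `internalAdd` and `internalRemove` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class Participant {
    private final String name;
    private final List<Comment> comments = new ArrayList<>();

    Participant(String name) { this.name = name; }

    String getName() { return name; }

    // Read-only view: the list may only change through Comment.setAuthor.
    List<Comment> getComments() { return Collections.unmodifiableList(comments); }

    // Hypothetical hooks, meant to be called only from Comment.setAuthor.
    void internalAdd(Comment c) { comments.add(c); }
    void internalRemove(Comment c) { comments.remove(c); }
}

final class Comment {
    private Participant author;

    Participant getAuthor() { return author; }

    // Updates both ends of the reference so they can never disagree.
    void setAuthor(Participant newAuthor) {
        if (author == newAuthor) {
            return;
        }
        if (author != null) {
            author.internalRemove(this); // detach from the previous author
        }
        author = newAuthor;
        if (newAuthor != null) {
            newAuthor.internalAdd(this); // attach to the new author
        }
    }
}
```

Writing this by hand for every pair of opposite references is tedious and easy to get wrong, which is the chore that generating the code from the model removes.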

Now, did I mention I’m a newcomer to machine learning? Well, the obvious problem with my metamodel above is that it does not say anything about Utterances that are not DesignPoints. In fact, the metadata that I am storing for DesignPoints are not useful for distinguishing Utterances that are DesignPoints from those that are not. (What they are arguably good for is characterizing the different kinds of DesignPoints, once you’ve discovered them.)

Having realized this, I had to change the structure of my code to introduce a way to capture information about all Utterances. I played around with different designs for doing this. My biggest concern was that I didn’t want to break my existing code for parsing discussion files into Discussion objects. This was the second point where EMF shone. Using the model, I was able to toy with different designs at a high level of abstraction, without the code getting in the way. Ultimately I decided on the design below:


Basically, I added a superclass to DesignPoint called UtteranceDescription and moved to it those characteristics that can be used to describe Utterances that are not DesignPoints. All hail Barbara Liskov!
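In plain Java terms, the shape of this change can be sketched roughly as follows. The attribute names `question` and `wordCount`, the literals of `Theme`, and the way an `Utterance` holds its description are all assumptions made for illustration, since the real attributes live in the diagram. This is hand-written code, not what EMF generates.

```java
// Literals are hypothetical; the real enumeration comes from the open coding.
enum Theme { ARCHITECTURE, API, OTHER }

// Characteristics that apply to every Utterance, DesignPoint or not.
class UtteranceDescription {
    private final boolean question; // hypothetical attribute
    private final int wordCount;    // hypothetical attribute

    UtteranceDescription(boolean question, int wordCount) {
        this.question = question;
        this.wordCount = wordCount;
    }

    boolean isQuestion() { return question; }
    int getWordCount() { return wordCount; }
}

// Characteristics that only make sense once a DesignPoint has been identified.
class DesignPoint extends UtteranceDescription {
    private final Theme theme;

    DesignPoint(boolean question, int wordCount, Theme theme) {
        super(question, wordCount);
        this.theme = theme;
    }

    Theme getTheme() { return theme; }
}

final class Utterance {
    private final String text;
    private final UtteranceDescription description;

    Utterance(String text, UtteranceDescription description) {
        this.text = text;
        this.description = description;
    }

    String getText() { return text; }
    UtteranceDescription getDescription() { return description; }

    // Existing code that asks about DesignPoints keeps working unchanged.
    boolean isDesignPoint() { return description instanceof DesignPoint; }
}
```

Because a `DesignPoint` can stand in wherever an `UtteranceDescription` is expected, code that extracts the general features treats both kinds of Utterance uniformly, which is the substitutability the Liskov reference is about.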

Editing the model and generating the new code took me less than 5 minutes. There was only one hiccup after code generation:


The reason was that I had dragged and dropped the two attributes from the DesignPoint class to the UtteranceDescription class. The problem went away once I deleted them from the one and added them to the other. I could have fixed the problem by editing the generated code but hey, I’m a software modelling person. We don’t like that.

And that was the third point where EMF was awesome. I was able to make the change very fast and with (almost) no compile errors. For such a change, the git diff (excluding the models) was about 2k lines long, all generated code.

So to summarize, three major things EMF did for me:

  1. It steered me into thinking clearly about the “M” part of MVC, avoiding muddling my code.
  2. It allowed me to conceptualize the data at a high level of abstraction, making it easy to model and change its structure when necessary.
  3. It helped me execute a complex change to my codebase very fast without messing up the underlying consistency of my data representation.

Further reading: What every Eclipse developer should know about EMF

Gail Murphy’s keynote at ICSE’16

During ICSE’16, Gail gave a keynote talk. I collected tweets from the #icse16 hashtag and put them together in Storify.

You can find the result here:


Gail’s slides are here:


Webinar: MMINT — A Graphical Tool for Interactive Model Management

Model Management addresses the accidental complexity caused by the proliferation of models in software engineering. It provides a high-level view in which entire models and their relationships (i.e., mappings between models) can be manipulated using transformations, such as match and merge, to implement management tasks. Other model management frameworks focus on support for the programming required to prepare for model management activities, at the expense of the user environment needed to facilitate these activities. In this demo, we show MMINT, a model management tool that provides a graphical and interactive environment for model management. MMINT has many features that help manage complexity while reducing the effort of carrying out model management tasks.

Code available on GitHub:

Additional information about MMINT:

Additional information about MU-MMINT:

Additional information about Design-Time Uncertainty:


On publishing the “obvious”

As a requirement for the course “Advanced Propositional Reasoning” that I’m taking, today I presented a very interesting paper titled Empirical study of the anatomy of modern SAT solvers. You can find my slides here.

Earlier in the day, I almost caught the end of Alicia’s practice talk for her paper On the Perceived Interdependence and Information Sharing Inhibitions of Enterprise Software Engineers that will appear at CSCW 2012. Part of her most recent results is apparently some data that substantiates certain things which are supposedly “common knowledge”. (I won’t elaborate, wait for her paper in CSCW.)

It’s interesting how that closely parallels the situation of the findings of the SAT solvers paper. That paper also substantiates with empirical data some things assumed to be “common knowledge”. What I want to say is that basically, that’s a very good thing! Sure, if your empirical investigation finds something that refutes commonly held perceptions, that’s more exciting (and I guess more publishable). But confirming anecdotal knowledge with hard scientific investigation is just as important!

This points to the often-talked-about(*) “problem” with empirical research: it takes a great deal of effort to conduct and there is a good chance the results will be considered unimpressive. But empirical research is of paramount importance to Software Engineering! I won’t argue for that, I will just point to this page.

(*) See? Anecdotal knowledge! 🙂