I am currently working on a project where I am trying to use machine learning and natural language processing techniques to analyze forum-style conversations between software designers. Having focused my grad-school years on software modelling and transformations, I find this an entirely new field for me, and I’m learning a whole bunch of new tools and techniques. I am also programming heavily, something I missed a lot during the later years of my degree. So far I’ve been using Java, but I expect to have to do a lot of scripting as well.
Ultimately, I need to pose my question in a format such as ARFF, used by the Weka tool suite. As a task, this is similar in spirit to creating, e.g., a DIMACS query for a SAT solver, or an SMT-LIB spec for Z3. Basically, the game is: start from raw data and structure it in a machine-usable format. For machine learning, there is an additional, highly non-trivial step, where the structured data needs to be described in terms of “features”. That’s where the magic really happens. But this is not a story about me learning machine learning. This is a story about me programming a project that (as of now) has about 10k lines of code and will only get bigger.
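To make the target concrete: an ARFF file declares the features up front and then lists one data row per instance. The attributes below are purely hypothetical placeholders, not the features I actually use:

```
@relation utterances

@attribute length        numeric
@attribute hasQuestion   {yes,no}
@attribute isDesignPoint {yes,no}

@data
31,yes,yes
22,no,no
```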
My research pedigree being in software modelling, I got familiar very early on (back at NTUA, when I was working with Kostas) with the Eclipse Modelling Framework. EMF allows creating Ecore models (a simple version of UML class diagrams) and generating code from them. It does all sorts of other lovely things (such as supporting persistence, model transformations, etc.) but for now, that’s all I’m using it for. Specifically, I have found it extremely useful for describing and structuring the data in a consistent, clean way.
Here is the metamodel I’m using to store the information I parse from a discussion:
Basically: a Discussion consists of an ordered set of Comments, each one with a unique author, a Participant. Each Comment consists of a set of Utterances, representing the unit of analysis for machine learning: a specific thing someone said that could be a DesignPoint or not. A DesignPoint (DP) is a particular decision that people in the discussion have identified that needs to be made. Cal, a summer intern at UofT, and I have spent some time doing manual open coding to characterize such DPs in conversations in terms of the enumerations on the right of the model, e.g., what Theme the DP is about, what Form it takes, etc.
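As a rough illustration of that containment structure, here is a plain-Java sketch. The real classes are generated by EMF from the Ecore model; the fields below only mirror their shape, and the Participant reference is reduced to a name for brevity:

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written stand-ins for the EMF-generated model classes.
class Discussion {
    final List<Comment> comments = new ArrayList<>(); // ordered set of Comments
}

class Comment {
    String authorName; // simplification of the reference to a Participant
    final List<Utterance> utterances = new ArrayList<>();
}

class Utterance {
    String text;
    boolean isDesignPoint; // was this utterance coded as a DesignPoint?
}
```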
From this metamodel, I’ve generated a whole lot of code, which I then use to parse discussions into Discussion models. Once a discussion is parsed, I generate an ARFF file, where each Utterance is represented by a line of metadata. I then use Weka to look for patterns that characterize specifically those Utterances that are DesignPoints.
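The ARFF-generation step can be sketched in plain Java. This is not the project’s actual code, which works on the EMF-generated classes; the `Utterance` stand-in and the single length feature here are hypothetical:

```java
import java.util.List;

// A minimal sketch of turning utterances into ARFF lines. The feature
// set (just text length) is a hypothetical placeholder.
public class ArffWriter {

    // Stand-in for the generated Utterance interface.
    record Utterance(String text, boolean isDesignPoint) {}

    static String header() {
        return "@relation utterances\n"
             + "@attribute length numeric\n"
             + "@attribute isDesignPoint {yes,no}\n"
             + "@data";
    }

    // One ARFF data line per utterance: feature values, then the class label.
    static String toArffLine(Utterance u) {
        return u.text().length() + "," + (u.isDesignPoint() ? "yes" : "no");
    }

    public static void main(String[] args) {
        List<Utterance> us = List.of(
            new Utterance("Should we use REST or RPC here?", true),
            new Utterance("Good morning everyone.", false));
        System.out.println(header());
        us.forEach(u -> System.out.println(toArffLine(u)));
    }
}
```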
This is the first point where EMF has been extremely useful to me. It has allowed me to develop my application in the spirit of MVC. I use diagrams to conceptualize and represent the data and I let EMF generate code to manage the consistency and storage of the data. I don’t have to worry about, e.g. making sure that a Participant object has a reference to exactly that Comment object that has a reference back to it. EMF takes care of generating that code. I can then focus exclusively on the “controller” (doing the parsing) and the “view” (generating ARFF, logs, etc.).
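To show what that generated bookkeeping amounts to, here is a hand-written sketch of a bidirectional author/comments reference. The class names follow the metamodel, but the method body is my illustration, not EMF’s actual generated code:

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written sketch of the consistency code EMF generates for
// bidirectional (eOpposite) references.
class Participant {
    final String name;
    final List<Comment> comments = new ArrayList<>();
    Participant(String name) { this.name = name; }
}

class Comment {
    private Participant author;

    // Keep both ends of the author/comments reference in sync,
    // including unlinking the previous author on reassignment.
    void setAuthor(Participant newAuthor) {
        if (author != null) author.comments.remove(this);
        author = newAuthor;
        if (newAuthor != null) newAuthor.comments.add(this);
    }

    Participant getAuthor() { return author; }
}
```

After `c.setAuthor(p)`, `p.comments` contains `c` automatically, and reassigning the author also unlinks the old Participant: exactly the kind of consistency code I no longer have to write by hand.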
Now, did I mention I’m a newcomer to machine learning? Well, the obvious problem with my metamodel above is that it does not say anything about Utterances that are not DesignPoints. In fact, the metadata that I am storing for DesignPoints is not useful for distinguishing Utterances that are DesignPoints from those that are not. (What it is arguably good for is characterizing the different kinds of DesignPoints, once you’ve discovered them.)
Having realized this, I had to change the structure of my code to introduce a way to capture information about all Utterances. I played around with different designs for how to do this. My biggest concern was that I didn’t want to break my existing code for parsing discussion files into Discussion objects. This was the second point where EMF shone. Using the model, I was able to toy with different designs at a high level of abstraction, without the code getting in the way. Ultimately I decided on the design below:
Basically, I added a superclass to DesignPoint called UtteranceDescription and moved to it those characteristics that can be used to describe Utterances that are not DesignPoints. All hail Barbara Liskov!
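In code terms, the refactoring looks roughly like this. It is a plain-Java sketch: the field names are hypothetical, and the real Theme and Form are enumerations in the model rather than strings:

```java
// Attributes that describe any Utterance move up into the superclass;
// DesignPoint keeps only what is specific to design points.
class UtteranceDescription {
    boolean hasQuestion; // hypothetical per-utterance characteristic
    int wordCount;       // hypothetical per-utterance characteristic
}

class DesignPoint extends UtteranceDescription {
    String theme; // stands in for the Theme enumeration
    String form;  // stands in for the Form enumeration
}
```

Any code that works on an `UtteranceDescription` now handles DesignPoints and non-DesignPoints alike, which is Liskov substitution doing its job.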
Editing the model and generating the new code took me less than 5 minutes. There was only one hiccup after code generation:
The reason was that I had dragged and dropped the two attributes from the DesignPoint class to the UtteranceDescription class. The problem went away once I deleted them from the one and added them to the other instead. I could have fixed the problem by editing the generated code but hey, I’m a software modelling person. We don’t like that.
And that was the third point where EMF was awesome. I was able to do the change very fast and with (almost) no compile errors. For such a change, the git diff (excluding the models) was about 2k lines long, all generated code.
So to summarize, three major things EMF did for me:
- It steered me into thinking clearly about the “M” part of MVC, which kept my code from getting muddled.
- It allowed me to conceptualize the data at a high level of abstraction, making it easy to model and change its structure when necessary.
- It helped me execute a complex change to my codebase very fast without messing up the underlying consistency of my data representation.
Further reading: What every Eclipse developer should know about EMF