Property connecting entities like process to their performance measures with DMOP

Submitted by AgnieszkaLawrynowicz on Fri, 14/10/2011 - 10:51


Hi All,

We are currently writing a parser that will generate an RDF representation of a data mining process from the *.xml representation of a RapidMiner process. We would like to somehow connect the information on the performance of the process to the process itself, but I am currently not sure which DMOP property to use to represent it. There is no property like hasPerformance.

I would be happy to receive any ideas.

BR,

Agnieszka 

PS. I am also creating an issue for this in Cicero, connected to the PerformanceMeasure class.

Property connecting entities like process to their performance

I am tempted to say that we don't need such a specific property. The main properties we use for processes such as experiments or operator executions are hasInput and hasOutput, both of which can take multiple values. A DM experiment typically takes (at least) a data set as input, and has (at least) two outputs: the induced hypothesis (e.g. a classification model) and an estimate of the quality of this hypothesis (e.g. predictive accuracy of the learned classifier). We can say: p1 hasOutput h1 and p1 hasOutput m1, where p1 is an instance of DM-Process, h1 an instance of DM-Hypothesis, and m1 an instance of HypothesisEvaluationMeasure (which subsumes your PerformanceMeasure).
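For illustration, here is a minimal Turtle sketch of such a record. The dmop: namespace URI, the ex: namespace and the instance names are placeholders I am assuming for the example; only the classes and the hasInput/hasOutput properties are taken from DMOP as described above:

    @prefix dmop: <http://www.e-lico.eu/ontologies/dmo/DMOP/DMOP.owl#> .  # assumed URI, adjust to DMOP's actual namespace
    @prefix ex:   <http://example.org/experiment#> .

    # A DM process that consumes a data set and outputs both the induced
    # hypothesis and an estimate of its quality.
    ex:p1 a dmop:DM-Process ;
          dmop:hasInput  ex:d1 ;    # the input data set
          dmop:hasOutput ex:h1 ,    # the induced hypothesis (e.g. a classifier)
                         ex:m1 .    # the estimated quality of that hypothesis

    ex:h1 a dmop:DM-Hypothesis .
    ex:m1 a dmop:HypothesisEvaluationMeasure .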

Now that brings us to one of the many gaps in DMOP that need to be filled if it is to be at all useful -- in this case, the branch rooted in the concept of HypothesisEvaluationMeasure. For the moment, we have two subclasses, ModelEvaluationMeasure and PatternSetEvaluationMeasure, both of which need to be elaborated more fully before we can assign a specific performance measure to a learned model. Depending on the task (descriptive or predictive, and in the latter case classification, regression, ranking, etc.) we would use a different measurement protocol, and this has to be specified in our experimental record. Thus the measure m1 will be an instance of some subclass of HypothesisEvaluationMeasure, which is characterized by what is being measured (e.g. predictive accuracy, interestingness), the criterion and formula used (e.g. mean-squared error, F1), and of course the actual computed value.

This is closely related to another issue I raised elsewhere in the forum on modeling dataset characteristics, most of which are also measures of some kind. In short, characterizing DM experiment inputs or outputs boils down to modeling observational meta-data exactly as other scientists do with base-level (e.g. physical or ecological) data. With this in mind, we should consider reusing and adapting existing data models in ontologies like OBOE and Seronto. That would leave us free to focus on the substance of the HypothesisEvaluationMeasure sub-hierarchy.
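To make that characterization concrete, here is a hedged Turtle sketch of what such an elaborated record might look like. The properties measuresQuantity, usesFormula, obtainedVia and hasValue, and the value itself, are purely illustrative assumptions, not existing DMOP terms; only ModelEvaluationMeasure and its place under HypothesisEvaluationMeasure come from the discussion above:

    @prefix dmop: <http://www.e-lico.eu/ontologies/dmo/DMOP/DMOP.owl#> .  # assumed URI, as before
    @prefix ex:   <http://example.org/experiment#> .
    @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

    # Hypothetical record of a quality estimate for h1.
    # All four ex: properties below are illustrative, not DMOP terms.
    ex:m1 a dmop:ModelEvaluationMeasure ;                   # subclass of HypothesisEvaluationMeasure
          ex:measuresQuantity "predictive accuracy" ;       # what is being measured
          ex:usesFormula      "F1" ;                        # the criterion/formula used
          ex:obtainedVia      "10-fold cross-validation" ;  # the measurement protocol
          ex:hasValue         "0.87"^^xsd:decimal .         # the actual computed value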

At this point I cannot resist the temptation to call for volunteers from the data mining community. We are all familiar with subsets of this broad class of evaluation measures, depending on our favorite DM tasks, algorithms and models; this is an excellent opportunity to pool together our diverse experiences in order to build a comprehensive and coherent model of (induced) hypothesis evaluation measures into DMOP.
