DMOP: a sampling of open issues

Submitted by hilario on Sun, 02/10/2011 - 22:27

 

To kickstart the DMO-Foundry forum, here is a (far from exhaustive) list of open issues concerning DMOP. You are welcome to initiate, or contribute to, a discussion thread on any of these issues (as well as any others you find worth discussing):

  • Modeling data characteristics and algorithm characteristics: The key issue here is not the content, i.e. which characteristics to model, but how to model them. Very briefly stated, the issue is: should dataset or algorithm characteristics be represented as properties or as entities in themselves? DMOP currently represents them as properties but a case could be made for reifying them.
  • DM-Hypothesis hierarchy: The hypotheses induced in data mining can be classified along different dimensions: breadth of scope (global models or local pattern sets), the goal or task addressed (e.g., classification, association), or the hypothesis structure (e.g., decision tree, linear combination of kernels). Since ontologies support multiple inheritance, each hypothesis can belong to several classes; however, the primitive hierarchy should be a pure tree. Which of these dimensions should be used to define the primitive hierarchy?
  • Modeling processes and workflows: In its current state, DMOP has a model of processes (= ground workflows). This was a priority requirement for meta-mining, which consists in analysing past DM experiments (DM processes together with their input data, results, performance measures and other meta-data) in order to extract workflow patterns that correlate with good or bad performance. To represent more abstract process specifications (workflows and workflow templates), we need a way of representing workflow parameters and local variables in OWL, plus a way of mapping them to their instantiations in DM processes. We've started exploring OWL-S for this but are looking for motivated volunteers to give us a hand.
  • Provenance meta-data: A DMOP reviewer has deplored the lack of provenance meta-data and human-readable documentation in DMOP entries. We need to agree on a strategy for keeping track of changes in the content as well as the documentation of each concept. This issue will gain importance as more and more volunteers participate in ontology development.
  • DM-Algorithm hierarchy: Ideally, this hierarchy should be both comprehensive and in-depth. At this stage, we chose to emphasize depth rather than breadth. Among data processing algorithms, we focused mainly on feature extraction, feature weighting and feature selection; under induction algorithms, we gave priority to classification and pattern discovery. Consequently, many nodes under the DM-Algorithm concept are empty placeholders that we hope will be filled with voluntary contributions. A few examples are: FeatureDiscretizationAlgorithm, MissingValueImputationAlgorithm, ClusteringAlgorithm, RegressionModelingAlgorithm, and HypothesisEvaluationAlgorithm. For that matter, any node or branch that you think could stand improvement, as well as any missing node, is a potential object of collaborative authoring.
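
To make the first issue in the list (properties versus reified entities) a little more concrete, here is a minimal plain-Python sketch of the two modeling options. It is only an analogy: it is not OWL, it does not use DMOP's actual vocabulary, and every class name, field name and value in it (DatasetWithProperties, Characteristic, computed_by, the "toy" dataset, and so on) is made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List


# Option A: characteristics as properties, i.e. plain values attached to the dataset.
@dataclass
class DatasetWithProperties:
    name: str
    num_instances: int
    class_entropy: float


# Option B: each characteristic is reified as an entity in its own right,
# so it can carry further statements (unit, how it was computed, ...).
@dataclass
class Characteristic:
    kind: str              # hypothetical characteristic class, e.g. "ClassEntropy"
    value: float
    unit: str = ""
    computed_by: str = ""  # hypothetical provenance slot


@dataclass
class DatasetWithEntities:
    name: str
    characteristics: List[Characteristic] = field(default_factory=list)

    def get(self, kind: str) -> Characteristic:
        """Return the first characteristic of the given kind (KeyError if absent)."""
        for c in self.characteristics:
            if c.kind == kind:
                return c
        raise KeyError(kind)


# Illustrative, made-up values for a toy dataset.
flat = DatasetWithProperties("toy", num_instances=100, class_entropy=1.0)
reified = DatasetWithEntities("toy", [
    Characteristic("NumberOfInstances", 100),
    Characteristic("ClassEntropy", 1.0, unit="bits", computed_by="some-tool"),
])

print(flat.class_entropy)                # option A: direct attribute access
print(reified.get("ClassEntropy").unit)  # option B: extra information on the entity
```

The trade-off the sketch tries to show: option A is simpler to read and query, while option B lets each characteristic be classified and described further, at the cost of an extra level of indirection.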

 
