Metamodeling in DMOP
To explain metamodeling in DMOP, let's recall four of its core concepts: DM-Task, DM-Algorithm (an implementation of which is called a DM-Operator), DM-Data and DM-Hypothesis (i.e., either a (global) model or a (local) pattern). Any DM process carries out a sequence or graph of tasks, each of which is achieved by executing a set of operators (algorithm implementations). The two endpoints of a DM process are DM-Data (input) and DM-Hypothesis (output); however, within the process, an instance of DM-Data can be an output of a subprocess (e.g. data processing) and an instance of DM-Hypothesis can be an input of a subprocess (e.g. model evaluation). For this reason, we group DM-Data and DM-Hypothesis (and their many subclasses) under a derived concept called IO-Object.
Now, from an ontological point of view, only a process -- the execution of an operator or a workflow -- can actually take inputs and produce outputs; thus the properties hasInput and hasOutput are reserved for DM processes. Entities such as tasks, algorithms, operators, and workflows (which are continuants, not occurrents) do not actually consume inputs or generate outputs; they only specify type or class constraints on the expected inputs and outputs. For these entities, we use the properties specifiesInputType and specifiesOutputType.
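In Turtle, this division of labor might be sketched as follows (the empty prefix and the exact class and property names are illustrative, not DMOP's actual IRIs):

```turtle
@prefix :     <http://example.org/dmop#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# hasInput/hasOutput: only processes (occurrents) actually consume and
# produce data; their range is the class of concrete I/O objects.
:hasInput a owl:ObjectProperty ;
    rdfs:domain :DM-Process ;
    rdfs:range  :IO-Object .

# specifiesInputType: tasks, algorithms, operators and workflows
# (continuants) only constrain the *type* of their expected inputs.
:specifiesInputType a owl:ObjectProperty .

# A concrete (running or terminated) execution has a concrete input:
:P001 a :DM-Process ;
    :hasInput :Iris .
```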
The properties specifiesInputType and specifiesOutputType play a major role in the organization of DMOP. Everything starts with the definition of data mining tasks. Each DM task is defined by the types of input/output it specifies, and the whole DM-Task hierarchy is structured by these I/O specifications, from the topmost level that encompasses all DM tasks down to the leaf level of specialized tasks like FeatureDiscretization or ModelPruning. These I/O specifications are then propagated by means of property chains to the algorithms that address these tasks, and by extension to the operators that implement these algorithms. As a result, the DM-Task concept induces its hierarchical structure on the upper levels of the DM-Algorithm and DM-Operator hierarchies.
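In OWL 2, this propagation can be expressed with property chain axioms. A sketch, assuming (as illustration) an addresses property linking algorithms to the tasks they solve and an implements property linking operators to algorithms:

```turtle
@prefix :    <http://example.org/dmop#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

# If an algorithm addresses a task, and the task specifies an input type,
# then the algorithm specifies that same input type:
#   addresses o specifiesInputType  ->  specifiesInputType
:specifiesInputType owl:propertyChainAxiom ( :addresses  :specifiesInputType ) .

# Likewise, operators inherit the specifications of the algorithms
# they implement:
#   implements o specifiesInputType  ->  specifiesInputType
:specifiesInputType owl:propertyChainAxiom ( :implements :specifiesInputType ) .
```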
In the initial design of DMOP, this property chain led to the issue of whether to model DM algorithms as classes or individuals. We all know that a DM algorithm (e.g. Quinlan's C4.5 algorithm) is an individual algorithm, not a collection of distinct and independent algorithms. However, we resorted to representing algorithms also as classes because we wanted to be able to say: C4.5 specifiesInputType CategoricalLabeledDataSet (the class of datasets with categorical labels). The same goes for the property specifiesOutputType, whose value for C4.5 will be some subclass of ClassificationModel (instances of which are the individual models generated by executing any implementation of C4.5 with any valid dataset type and hyperparameter settings). In both cases, the value of the specifiesInput/OutputType property is a class. Since classes cannot be assigned as property values to individuals, we created an artificial C4.5Algorithm class (with a single instance, C4.5) to which we can properly attach this specification in OWL.
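Concretely, the workaround might have looked something like this in Turtle (names illustrative; whether the specification was attached via an allValuesFrom restriction, as shown, or some other class-level axiom is an assumption here):

```turtle
@prefix :    <http://example.org/dmop#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# The hack: an artificial singleton class for the algorithm, so that the
# class-valued I/O specification can be stated as a class-level restriction.
:C4.5Algorithm a owl:Class ;
    rdfs:subClassOf :ClassificationAlgorithm ,
        [ a owl:Restriction ;
          owl:onProperty    :specifiesInputType ;
          owl:allValuesFrom :CategoricalLabeledDataSet ] .

# ... with C4.5 as its single intended instance.
:C4.5 a :C4.5Algorithm .
```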
This ontological hack has led to a number of other technical problems that I will not delve into here, because over and above these, it is now clear to me that our "algorithms as classes and instances" representation was due to an unfortunate confusion about the semantic intent of the properties specifiesInputType and specifiesOutputType. As we saw above, an algorithm is not a process and cannot consume an input or produce an output in reality; it can only specify a type of input or output. For instance, while we can say that a running or terminated process P001 hasInput Iris (an instance of the CategoricalLabeledDataSet class), we can only say that the algorithm C4.5 specifiesInputType CategoricalLabeledDataSet: it makes no sense to say that C4.5 specifiesInputType Iris. From this it becomes clear that hasInput/Output and specifiesInput/OutputType cannot have the same range, IO-Object, that we assigned to them. As the range of hasInput/Output, CategoricalLabeledDataSet (a descendant class of IO-Object) is meant to be instantiated by datasets like Iris; as the value of specifiesInputType, CategoricalLabeledDataSet is no longer a set of datasets, but rather the class CategoricalLabeledDataSet viewed as an instance of a metaclass -- the class of classes that represent input and output objects.
This is where metamodeling comes in, at least in the weak form of punning that is available in OWL 2. Punning allows us to use the same term to designate both an individual and a class. We can thus use names referring to the IO-Object class and all its subclasses in two different senses: on the one hand as classes, and on the other as instances of a metaclass that we shall call IO-Class. It is this metaclass that is the intended and semantically proper range of specifiesInput/OutputType, as opposed to IO-Object, the range of hasInput/Output.
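A Turtle sketch of the punning (again with illustrative names): the same IRI is declared both as a class and as an individual, and OWL 2 treats the two declarations as logically independent.

```turtle
@prefix :     <http://example.org/dmop#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# As a class, CategoricalLabeledDataSet classifies concrete datasets:
:CategoricalLabeledDataSet a owl:Class ;
    rdfs:subClassOf :LabeledDataSet .
:Iris a :CategoricalLabeledDataSet .

# Punned as an individual, the same IRI is an instance of the
# metaclass IO-Class:
:IO-Class a owl:Class .
:CategoricalLabeledDataSet a :IO-Class .

# IO-Class, not IO-Object, is then the range of the specification properties:
:specifiesInputType a owl:ObjectProperty ;
    rdfs:range :IO-Class .
```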
I have started to remodel DMOP based on this correction, with the metaclass IO-Class composed of "punned" individuals representing classes: IO-Object, DM-Data and all its subclasses, DM-Hypothesis, its subclasses DM-Model and DM-PatternSet, and all their subclasses. Now we can say: C4.5 (an instance of DM-Algorithm) specifiesInputType CategoricalLabeledDataSet (an instance of IO-Class). We can do the same for algorithm or task classes, e.g. DataProcessingTask (a class) specifiesInputType value DataSet. The hidden catch here is that the individual DataSet is not meant to stand for itself alone but also for the individuals representing all the subclasses of DataSet in the base-level hierarchy. In short, when we say that a data processing task specifies DataSet as its input type, we mean this specification to also include the individuals representing the classes UnlabeledDataSet, LabeledDataSet, CategoricalLabeledDataSet, ContinuousDataSet, and so on. Should we therefore reintroduce some kind of hierarchical structure into this metaclass? I have not encountered this in the rare real-world use cases of metamodeling in OWL 2 that I have seen. One example is the paper by B. Glimm, S. Rudolph and J. Völker, "Integrated Metamodeling and Diagnosis in OWL 2", presented at ISWC 2010, which presents an approach to metamodeling that is far more principled and thorough than the one I have in mind. However, the efficiency issues raised in the paper (Section 7) make "the approach not really feasible in practice", as the authors admit.
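For concreteness, the remodeled assertions above could be written as follows (names illustrative). The last axioms sketch one possible answer to the hierarchy question -- mirroring rdfs:subClassOf at the meta level with an ordinary transitive property between punned individuals -- offered here as a candidate, not as DMOP's settled design:

```turtle
@prefix :     <http://example.org/dmop#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# Algorithm individual pointing at a punned class-as-individual:
:C4.5 a :DM-Algorithm ;
    :specifiesInputType :CategoricalLabeledDataSet .

# Class-level specification via a hasValue restriction:
:DataProcessingTask rdfs:subClassOf
    [ a owl:Restriction ;
      owl:onProperty :specifiesInputType ;
      owl:hasValue   :DataSet ] .

# Candidate meta-level hierarchy: a transitive property between
# punned individuals, shadowing the base-level subclass relation.
:subTypeOf a owl:ObjectProperty , owl:TransitiveProperty ;
    rdfs:domain :IO-Class ;
    rdfs:range  :IO-Class .
:CategoricalLabeledDataSet :subTypeOf :LabeledDataSet .
:LabeledDataSet            :subTypeOf :DataSet .
```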
All suggestions, comments or references to related work on this issue will be most welcome.