Metamodeling in DMOP

Submitted by hilario on Mon, 28/11/2011 - 10:58

 

To explain metamodeling in DMOP, let's recall four of its core concepts: DM-Task, DM-Algorithm (an implementation of which is called a DM-Operator), DM-Data and DM-Hypothesis (i.e., either a (global) model or a (local) pattern). Any DM process is the achievement of a sequence or graph of tasks, each of which is achieved by executing a set of operators (algortihm implementations). The two endpoints of a DM-Process are DM-Data (input) and DM-Hypothesis (output); however, within the process, an instance of DM-Data can be an output of a subprocess (e.g. data processing) and an instance of DM-Hypothesis can be an input of a subprocess (e.g. model evaluation). For this reason, we group DM-Data and DM-Hypothesis (and their many subclasses) under a derived concept called IO-Object.

Now, from an ontological point of view, only a process -- the execution of an operator or a workflow -- can actually take inputs and produce outputs; thus the properties hasInput and hasOutput are reserved for DM processes. Entities such as tasks, algorithms, operators, and workflows (which are continuants, not occurrents) do not actually consume inputs or generate outputs, they only specify type or class constraints on the expected inputs and outputs. For these entities, we use the properties specifiesInputType and specifiesOutputType.

The properties specifiesInputType and specifiesOutputType play a major role in the organization of DMOP. Everything starts with the definition of data mining tasks. Each DM task is defined by the types of input/output it specifies, and the whole DM-Task hierarchy is structured by these I/O specifications, from the topmost level that encompasses all DM tasks down to the leaf level of specialized tasks like FeatureDiscretization or ModelPruning. These I/O specifications are then propagated by means of property chains to the algorithms that address these tasks, and by extension to the operators that implement these algorithms. As a result, the DM-Task concept induces its hierarchical structure on the upper levels of the DM-Algorithm and DM-Operator hierarchies.

In the initial design of DMOP, this property chain led to the issue of whether to model DM algorithms as classes or individuals. We all know that a DM algorithm (e.g. Quinlan's C4.5 algorithm) is an individual algorithm, not a collection of distinct and independent algorithms. However, we resorted to representing algorithms also as classes because we want to be able to say: C4.5 specifiesInputClass CategoricalLabeledDataSet (the class of datasets with categorical labels). The same goes for the property specifiesOutputType, which for C4.5 will be some subclass of ClassificationModel (instances of which are the individual models generated by executing any implementation of C4.5 using any valid dataset type and  hyperparameter settings).  In both cases, the value of the specifiesI/OType property is a class. Since classes cannot be assigned as property values to individuals, we created an artificial C4.5Algorithm class (with a single instance C4.5) to which we can properly make this assignment in OWL.

This ontological hack  has led to a number of other technical problems that I will not delve into here, because over and above these, it is now clear to me that our "algorithms as classes and instances" representation was due to an unfortunate confusion on the the semantic intent of the properties specifiesInputType and specifiesOutputType.  As we saw above, an algorithm is not a process and cannot  consume an input or produce an output in reality; it can only specify a type of input or output.  For instance, while  we can say that a running or terminated process P001 hasInput Iris (an instance of the CategoricalLabeledDataSet class), we can only say that the algorithm C4.5 specifiesInputClass  CategoricalLabeledDataSet: it makes no sense to say that C4.5 specifiesInputClass Iris. From this it becomes clear that hasInput/Output and specifiesInput/OutputType cannot have the same range, IO-Object, that we assigned to them. As the range of hasInput/Output, CategoricalLabeledDataset (a descendant class of IO-Object) is meant to be instantiated by datasets like Iris; as the value of specifiesInputType, CategoricalLabeledDataset is no longer a set of datasets, but rather the class CategoricalLabeledDataSet viewed as an instance of a metaclass -- the class of classes that represent input and output objects. 

This is where metamodeling comes in, at least in the weak form of punning that is available in OWL2. Punning allows us to use the same term to designate both an individual and a class. We can thus use names referring to the IO-Object class and all its subclasses in two different senses: on the one hand as classes, and on the other as instances of a metaclass that we shall call IO-Class. It is this class that is the intended and semantically proper range of specifiesInput/OutputType as opposed to IO-Object,the range of hasInput/Output.

I have started to remodel DMOP based on this correction, with the metaclass IO-Class composed of "punned instances" representing classes like IO-Class, DM-Data and all its subclasses, DM-Hypothesis, its subclasses DM-Model and DM-PatternSet and all their subclasses. Now we can say: C4.5 (an instance of DM-Algorithm) specifiesInputType CategoricalLabeledDataSet (an instance of IO-Class). We can do the same for algorithm or task classes, e.g. DataProcessingTask (a class) specifiesInputType value DataSet. The hidden worm here is that the individual DataSet is not meant to represent itself alone (the individual value DataSet) but also the individuals representing all the subclasses of DataSet in the baselevel hierarchy. In short when we say that a data processing task specifies DataSet as its input type, we mean this specification to also include individuals representing the classes UnlabeledDataSet, LabeledDataSet, CategoricalLabeledDataSet, ContinuousDataSet, and so on. Should we therefore reintroduce some kind of hierarchical structure into this metaclass? I haven't seen this done in the rare real-world use cases I have seen of metamodeling in OWL2. One example is the paper by B. Glimm, S. Rudolph and J. Voelker presented in ISWC-2010: Integrated Modeling and Diagnosis in OWL2, ISWC-2010, which presents an approach to metamodeling that is far more principled and thorough than the one I have in mind. However, efficiency issues raised in the paper (Section 7) make "the approach not really feasible in practice", as the authors admit.

All suggestions, comments or references to related work on this issue will be most welcome.

on what is an algorithm (and then metamodelling, or not)

There are actually several issues at hand in your description. (i) philosophically, it is not clear at all what the ontological status of algorithm and a particular algorithm is, hence, it is unclear what Quinlan's C4.5 is, (ii) ontologically, instances are those things that cannot be instantiated (e.g., my MacBook is an instance, but MacBook is a type of computer that can have instances, thus it is a class/universal), (iii) there is a difference between instance-class vs. part-whole, and (iv) on metamodelling. Some detail:

 

- (i) it has been proposed that an algorithm is a mathematical object, an idea, a concept, and a concrete abstraction, and to date, there is no convergence toward one or the other view or understanding of it (see e.g., [1,2,3]). What people do agree on is that there is a distinction between algorithm, program code, and the execution of the program code on a computer. So we could have an algorithm, say, Quinlan's C4.5, either as class that is instantiated or it is an instance that is implemented, in, say, some Java code or in some C++ code (written text, possibly an instance of 'information artifact', but both are an implementation of Quinlan's C4.5), and executed on the computer with an individual process for each time it's being executed. Perhaps an analogy may yield some insight: draw a parallel between an algorithm and a conceptual data model, the SQL statements that create the relational database (cf. Java/C++ code), and the running database in the DBMS (cf. program execution), and then one could consider Quinlan's C4.5 just as much an instance as the conceptual data model for, say, my bacteriocins database or your university’s database. But the problem with that is that I can rewrite my conceptual data model represented in EER into a semantically equivalent one in ORM, and thus have two distinct models for the 'same thing'. So, is there then a 'metathing' and the EER and ORM models are the instances of that elusive object? And, likewise, is there a 'metathing' to Quinlan's C4.5? (After all, there are different notations for writing down an algorithm.) This brings us to algorithms like Quinlan's C4.5 as ideas or concepts (hence, OWL classes), not mathematical objects or concrete abstractions (OWL individual). However, as said, there are several views on its ontological status, and for the time being, you will not be able to please everyone regardless which one you choose.

 

- (ii) There's a lot more to the previous point that can/ought to be investigated to obtain a conclusive answer, and depending on the ontological status of Algorithm and a particular algorithm, and their relation to program code, and the execution, something like Quinlan's C4.5 is either an instance or a class, but, by ontology, it cannot be both. If we loosen that a bit for practically obtaining something in DMOP within a deadline: it cannot be both at the same time in the same ontology. If we keep both--i.e., e.g., Quinlan's C4.5 as instance and as class--in the ontology because we don't know what it is, we're encoding our confusion, which sooner or later will have to be cleaned up one way or the other and in the meantime bound to introduce some errors by those who are not well-versed in such a double representation. On the other hand, if we know it is a class/instances but we explicitly communicate we're fiddling with the actual encoding at the back end, why, and how, perhaps that might work.

 

- (iii) you mention "not a collection of distinct and independent algorithms" in the same breath at the notion of algorithm as instance. This has nothing to do with the ontological nature of an algorithm being an instance or a class. Sure, a DM algorithm is not an example of what is generally denoted with member-collection or object-aggregate; that is, it is unlike a musician who is a member of an orchestra (or, at the instance level: e.g., pianist#1 member of the Royal Philharmonic Orchestra) and a ship member of a fleet, or, roughly, object-collective noun.

Are you saying that every DM algorithm is a monolithic thing without sub-algorithms? I can imagine that some algorithm X may be incorporating algorithm Y and Z (they do in optimization, but I'm not familiar with DM algorithms, so please correct me if I'm wrong). Assuming this, then the next aspect is the inclusion of part/sub-algorithm, possibly with a sequential structure among the algorithms Y and Z as part of X. Either way, this is orthogonal to the class-or-instance issue, as one can assert it both at the class-level 'X hasPart some Y', or, if algorithms are instances, 'x hasPart y'--which one to use depends on (i).

 

- (iv) Now, with respect to punning and Glimm et al's metamodeling [4], or the more straight-forward approach by Welty [5]: this is exceedingly suitable for OntoClean (let's leave the species example be), because it has a set of metarules, which, thanks to the re-coding, now can be pushed into the OWL ontology and be subjected to the implemented reasoning algorithms for DL, as opposed to the alternative of having to spend time on designing a new reasoner on top of the existing ones solely to handle OntoClean's rules.

Do you have a set of 'meta rules' for DMOP classes? If so, are they intrinsically there, alike in OntoClean "an anti-rigid class cannot subsume a rigid class", or are they an artifact of the class/instance w.r.t. the algorithms?

Let me assume for the moment we need it and are willing to put up with Glimm et al's metamodelling "admin baggage axioms" we have to add to the ontology (cf. the real axioms about the subject domain). Then we have to come up with a method to do this in a consistent way to add them, manage updates, and track those 'admin axioms', as the paper is a bit sketchy on that aspect. Also, it definitely needs SROIQ features, so eventually, some scalability assessments will have to be made.

 

 

References:

[1] http://www.cse.buffalo.edu/~rapaport/584/whatisanalg.html (a long list of papers on “what is an algorithm” of Rapaport’s course on the Philosophy of Computer Science at SUNY Buffalo, USA)

[2] Turner, Raymond and Eden, Amnon, "The Philosophy of Computer Science", The Stanford Encyclopedia of Philosophy (Summer 2009 Edition), Edward N. Zalta (ed.), URL = <http://plato.stanford.edu/archives/sum2009/entries/computer-science/>.

[3] Lando, P, Lapujade, A., Kassel, G, Fuerst, F. Towards a general ontology of computer programs. In: Proceedings of the 2nd International Conference on Software and Data Technologies (ICSOFT’07). Barcelona, Spain, July 25-27, 2007.

[4] Glimm, B, Rudolph, S., Voelker, J. Integrated metamodeling and diagnosis in OWL 2. Proc. of ISWC’10. Springer LNCS 6496, 257-272

[5] Welty, Chris. 2006. OntOWLClean: Cleaning OWL Ontologies with OWL. In, Proceedings of FOIS-2006. Pp. 347-359. Baltimore:IOS Press. November, 2006.

 

Hopefully this breakdown gives some more insight and arguments (I'm aware, it's not an answer, yet),
Regards,
Maria

Metamodeling in DMOP

Maria, thanks for the very instructive comments.

- (i) [...] What people do agree on is that there is a distinction between algorithm, program code, and the execution of the program code on a computer.

DMOP's concepts reflect faithfully this distinction between algorithm (DM-Algorithm), program code (DM-Operator) and the execution of the program code (DM-Operation).

So we could have an algorithm, say, Quinlan's C4.5, either as class that is instantiated or it is an instance that is implemented, in, say, some Java code or in some C++ code (written text, possibly an instance of 'information artifact', but both are an implementation of Quinlan's C4.5), and executed on the computer with an individual process for each time it's being executed.

I agree that there is no single right answer, but we prefer to view an algorithm as an instance rather than a class; C4.5, for example, is an individual artifact that was designed by Quinlan some 20 years ago, has had different implementations (not instantiations) in different programming languages, etc. C4.5 in a different algorithmic notation would still be C4.5, and we would still be able to distinguish it from very similar algorithms (e.g. its predecessor, ID3) based on their properties as described in DMOP. Sure there are algorithm classes; C4.5 is just one instance of the broad class of decision trees, or the narrower class of univariate decision trees, or of even finer-grained classes we can imagine such as "Quinlan-Algorithm", of which ID3, C4.5 and C5 would be distinct instances.

- (ii) There's a lot more to the previous point that can/ought to be investigated to obtain a conclusive answer, and depending on the ontological status of Algorithm and a particular algorithm, and their relation to program code, and the execution, something like Quinlan's C4.5 is either an instance or a class, but, by ontology, it cannot be both.

As explained in my previous post, with the algorithm as instance option, we could not say, e.g. "C4.5 (an individual) specifiesInputClass CategoricalLabeledDataSet (a class) - at least not in OWL DL. Following Noy et al.'s "Approach 2" [1], we created a class, C4.5Algorithm, with a single instance, C4.5, to solve this "classes as property values" problem. But we were aware that it was nothing more than an ontological hack, which we've always tried to redesign away. With a bit of luck, the metamodeling approach will allow us to do so.

- (iii) you mention "not a collection of distinct and independent algorithms" in the same breath at the notion of algorithm as instance. This has nothing to do with the ontological nature of an algorithm being an instance or a class. [...]

Actually, I was still referring to the instance vs class problem. When I said that an algorithm is an instance and not a collection of distinct and independent algorithms, I was only relying on the common view of OWL classes as sets or collections of individuals.  Nothing to do with the member-collection type of part-whole relations. We do come to grips with part-whole problems though, and not just wrt algorithms -- but as you say, that is completely orthogonal to the class vs instance problem. (By the way, there's an issue in the Cicero platform concerning subprocess and part-of that could make use of your expertise :-)).

- (iv) Now, with respect to punning and Glimm et al's metamodeling [...]
Do you have a set of 'meta rules' for DMOP classes? If so, are they intrinsically there, alike in OntoClean "an anti-rigid class cannot subsume a rigid class", or are they an artifact of the class/instance w.r.t. the algorithms?

I found Glimm et al.'s paper while searching for a solution to the inefficiency of our initial DMOP metamodel. Our goals and use cases are clearly different, and we have no meta-rules of the kind they have. They built a full meta-ontology because their goal was diagnosis and repair of ontologies. DMOP's metamodel focuses exclusively on the IO-Object class, because we need to be able to talk not only of instances of IO-Object classes (e.g. Iris is an instance of the class CategoricalLabeledDataSet) but also of IO-Object classes as instances of a metaclass (e.g. CategoricalLabeledDataSet is an instance of the metaclass  IO-Class). For all the reasons you give, and due to the efficiency problems they mention in their paper, I don't think their approach is suitable for our purposes, though I still think they did a remarkable job.

[1] N. Noy, M. Uschold, C. Welty (2005). Representing Classes as Property Values on the Semantic Web. http://www.w3.org/TR/2005/NOTE-swbp-classes-as-values-20050405.

metamodelling?

Melanie, I had another look at the "we want to be able to say: C4.5 (an individual) specifiesInputClass CategoricalLabeledDataSet (the class of datasets with categorical labels)." , and I think what you really want to express, which this axiom does not, however.
Tuples (relational instances) have only instances participating in it, classes only classes. If that is not convenient practically in the ontology development exercise, then for the former the punning is a hack and for the latter there are the nominals as a hack. But, ontologically, it's not right and, depending ont the motivation, may be considered modelling 'laziness'/shortcuts/collapsing a few mental steps into one and/or a limitation in Protege.
So, with C4.5 as individual, then we'd have an assertion C4.5 specifiesInput (IrisDataSet or ... or ... ), which is a bit cumbersome to represent in the ontology, so you want to refer to those datasets as a single entity, being the set of the individual data sets that can go with C4.5, i.e., CategoricalLabeledDataSet. But this does not mean that inherently an individual 'relates to a class through an [arbitrary] object property'. What we actually have is a regular instantiation relation. Or, at least, that may be the case.

This brings me to the second point. In Description logic notation (and latex), role/relationship/objectproperty R and class C, then  "\exists R.C" denotes the class of objects that all have a relation to the objects of C through R---there does not have to be a separate named class for that, but we could introduce one, e.g., call that class "D", and let us have object "a" as instance of that. Then, with
1. (\exists R.C)(a)
2. D \equiv \exists R.C
3. D(a)
line 1 is semantically the same as 2+3.
Or, given OWL 2 functional syntax for class assertions, ClassAssertion( CE a ), where the "CE" stands for any class expression, then
ClassAssertion( ObjectSomeValuesFrom(R C) a ) 
is a valid OWL expression. So, we also can say
ClassAssertion( ObjectSomeValuesFrom(specifiesInputClass CategoricalLabeledDataSet) C4.5 ) 
or: 'C4.5 is an instance of the class of algorithms that specify the input class CategoricalLabeledDataSet', whichever the name of that class is, if any.
The Protege interface doesn't allow you to assert a complex expression like that in the class hierarchy, though you can do that as a specification of a domain or range, as well as for instances in the 'class expression editor' of the popup in the individuals tab. The only way to do it in the class hierarchy is to introduce D, along the line of line 2 above. So, possibly that's where the perceived need came from to have the algorithms both as a class and an instance, because the tool forces you to invent a name for it, and what is more natural than to give it the same name as the instance you wanted to model (i.e., the "artificial C4.5Algorithm class," as you mentioned in the initial post).

If that is the case, then from an ontologyical viewpoint, you don't need the punning and all the admin baggage that goes with it. (This does not imply you'd have to throw it out, as there still my be good practical reasons to represent it this way.)

Regards, Maria

metamodeling and parts

Melanie, thank you for the clarifications.

As long as yourselves as domain experts agree on the algorithm as instance, that's fine with me.

When you say that "(e.g. Iris is an instance of the class CategoricalLabeledDataSet) but also of IO-Object classes as instances of a metaclass (e.g. CategoricalLabeledDataSet is an instance of the metaclass  IO-Class).", yes, sure the Iris Data Set is an instance of the class CategoricalLabeledDataSet, but I don't follow why the class CategoricalLabeledDataSet really must be an instance of IO-Class, but that it cannot be a subclass.

Regarding the parthood and 'subprocess', and the explanation behind the link, being "Indeed, hasSubprocess shouldn't be a subproperty of hasPart whose domain should be Continuant (though it is blank in DMOP). I can't think of a suitable superproperty right now but topObjectProperty. hasParticipant is a property whose domain is Occurrent but whose range is Continuant. What we want is a property whose domain and range are both Occurrents.": hasPart (and partOf) in mereology is not constrained to continuants, but can be any two entities--whichever category they are. So you ver well can introduce a sub-property of haspart/partof that has its domain and range restricted to occurrents only. In [1], we used the term "involved in" (largely motivated by its usage in conceptual data modelling and cognitive science) and put it as a sub-property of part-of. Whether one prefers to call the relation between a process and its part-processes involved-in or sub-process-of doesn't really matter then (though some might confuse the label of sub-process with a subsumption than parthood initially).

[1] Keet, C.M. and Artale, A. Representing and Reasoning over a Taxonomy of Part-Whole Relations. Applied Ontology, 2008, 3(1-2): 91-110.

Re: metamodeling and parts

"... yes, sure the Iris Data Set is an instance of the class CategoricalLabeledDataSet, but I don't follow why the class CategoricalLabeledDataSet really must be an instance of IO-Class, but that it cannot be a subclass."

Maybe CategoricalLabeledDataSet can be a subclass of the metaclass IO-Class, but the reason we resorted to punning is that we want to be able to assign the value CategoricalLabeledDataSet to the propery specifiesInputObject of an algorithm instance -- and this is accepted by OWL only if CategoricalLabeledDataSet is itself an instance and not a class.

" ... hasPart (and partOf) in mereology is not constrained to continuants, but can be any two entities--whichever category they are."

Thanks, that's exactly the clear answer we needed. So I'll keep hasSubprocess as a subproperty of hasPart with no qualms.

User login