Modeling Data Characteristics

Submitted by hilario on Fri, 07/10/2011 - 18:37


Dataset characteristics are meta-features (e.g., number of instances, number of features) used to describe a dataset. They play an important role in the meta-analysis of data mining experimental results, as they serve to determine regions of the instance space where specific algorithms are locally superior to others. Dataset characteristics have been a constant focus of meta-learning research since the 1990s.

DMOP integrates many of the dataset characteristics used in previous projects like Statlog and Metal, as well as the geometrical measures of data complexity introduced by Ho and Basu [2006]. In its current state, DMOP represents most dataset characteristics as simple object or datatype properties; e.g., the Iris dataset can be described as follows (in Manchester syntax):

  • Iris hasNumberOfInstances value 150
  • Iris hasNumberOfClasses value 3
  • Iris hasPointsPerDimension value 37.5
  • Iris hasFeatureEfficiency value 0.087
  • etc.
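As an illustrative sketch (not DMOP code), the simplest of these meta-features can be computed directly from a dataset's shape. The function and property names below mirror the bullets above; the toy data merely reproduces Iris's dimensions (150 instances, 4 features, 3 classes, hence 150/4 = 37.5 points per dimension):

```python
# Hypothetical sketch: computing a few simple dataset meta-features
# from a dataset given as feature rows plus class labels.

def simple_meta_features(rows, labels):
    """Return a dict of elementary dataset characteristics."""
    n_instances = len(rows)
    n_features = len(rows[0]) if rows else 0
    n_classes = len(set(labels))
    return {
        "hasNumberOfInstances": n_instances,
        "hasNumberOfFeatures": n_features,
        "hasNumberOfClasses": n_classes,
        # ratio of sample size to dimensionality
        "hasPointsPerDimension": n_instances / n_features if n_features else 0.0,
    }

# Toy stand-in with Iris-like shape: 150 rows of 4 features, 3 classes.
rows = [[0.0] * 4 for _ in range(150)]
labels = [i % 3 for i in range(150)]
mf = simple_meta_features(rows, labels)
print(mf["hasPointsPerDimension"])  # 37.5
```

Measures like FeatureEfficiency would of course need the actual feature values, not just the dataset's shape.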

This is the simplest and most straightforward representation one can think of, and this simplicity is its strength. However, it has obvious limits: it is impossible to specify exactly how a particular meta-feature value was measured, what units were used, and so on. If we need this added expressivity, then (meta-)feature reification is a better alternative.

In fact the above issue is no different from the general question of how to model experimental or observational data, whether at the meta-level or in any base-level domain (biology, astronomy, finance, etc.). For instance, the Extensible Observation Ontology (OBOE, http://ecoinformatics.org/oboe/oboe.1.0/oboe-core.owl) models each empirical observation as shown in the figure below.

"An observation is made of an entity (e.g., biological organisms, geographic locations, environmental features) and serves to group a set of measurements together to form a single “observation event”. A measurement assigns a value to a characteristic of the observed entity (e.g., the height of a tree), and can also include standards (e.g., units) and collection protocols. An observation can occur within the surrounding context of other observations, where context can be viewed as a form of dependency [1], and context often includes a named relationship (e.g., “partOf”, “within”) between observed entities."
[Fig. 1 and quoted text from Saunders et al., Protégé Extensions for Scientist-Oriented Modeling of Observation and Measurement Semantics, OWLED-2011.]

In many cases the above representation, which was designed for ecological data, seems overly complex for abstract entities like datasets. While it makes perfect sense to record multiple observations of an evolving entity (e.g., to measure a tree's height or diameter over time), it is hardly justifiable for a dataset, which is generally perceived as a static entity. As soon as we add a new instance, it is no longer the same dataset --- it becomes a different sample, albeit of the same population. The same can be said of a mined hypothesis or model: a decision tree becomes a different model if we add a single new leaf. On the other hand, the same model characteristic can yield two measurements (and therefore two observations) depending on which standards or criteria are used. For instance, the characteristic HypothesisComplexity of the same decision tree can have a value of 3 using TreeDepth as the criterion and a value of 17 using NumberOfLeaves. In such cases the proposed representation exhibits the adequate level of expressivity. I'm not sure, though, whether the expressive power provided for these (relatively rare) cases justifies the additional complexity. However, a clear advantage is the resulting representational homogeneity for all observed data, whether domain data or meta-data.
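To make the TreeDepth/NumberOfLeaves case concrete, here is a minimal sketch of the OBOE-style pattern in plain Python (the class and field names are mine, loosely modeled on OBOE's Observation/Measurement, not its actual OWL terms). One observed entity groups two measurements of the same characteristic that differ only in the standard used:

```python
# Illustrative OBOE-like reification: an observation of one entity
# grouping measurements that each record characteristic, value, and
# the standard/criterion under which the value was obtained.

from dataclasses import dataclass, field

@dataclass
class Measurement:
    characteristic: str   # what is measured, e.g. "HypothesisComplexity"
    value: float          # the measured value
    standard: str         # criterion or unit, e.g. "TreeDepth"

@dataclass
class Observation:
    entity: str                                   # the observed entity
    measurements: list = field(default_factory=list)

# Same decision tree, same characteristic, two criteria.
obs = Observation(entity="DecisionTree_1")
obs.measurements.append(Measurement("HypothesisComplexity", 3, "TreeDepth"))
obs.measurements.append(Measurement("HypothesisComplexity", 17, "NumberOfLeaves"))

for m in obs.measurements:
    print(m.characteristic, m.value, m.standard)
```

The point of the reification is visible in the last two lines: without a Measurement object there would be nowhere to attach the standard that distinguishes the two values.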

I see a slight problem with the hasValue property of Measurement. Its given range is Entity, which is not always appropriate: in the majority of cases, observed values are datatype values such as integers, reals, or strings. Unless I am mistaken, it is not possible to declare hasValue as both an object property and a datatype property in OWL. Or is this kind of punning now legal in OWL 2? If not, what is the workaround?

Any comments on the pros and cons of reifying dataset characteristics (and perhaps algorithm and hypothesis characteristics) based on the OBOE model?

on meta-features

The issue of how to represent meta-features is more important and complex than it may seem. A previous message posed the question of whether meta-features should be treated in a different manner. My view is that there should be a clear understanding (and explanation) of the reasons why a meta-feature is relevant when it comes to learning at the meta-level. Most research until now has generated meta-features "hoping" that the corresponding information will help to differentiate the performance of different models. Hence, a meta-feature should be built as a more complex object that has not only a definition (i.e., a formula) to compute it, but also an indication of its scope, according to the data distribution. In other words, we need to know when the meta-feature is useful or not.

Re: on meta-features

You're absolutely right. The proposed meta-features for datasets, algorithms and models or pattern sets should be taken as working hypotheses in the current state of the art. Despite the number of meta-feature evaluation and selection experiments done so far, we still cannot say for certain which meta-features are definitely correlated (positively or negatively) with learning performance. For this reason, I agree completely that a meta-feature should be represented as a self-standing entity that can itself be annotated with properties that can explain or qualify/quantify in some way its meta-analytical utility.

This is exactly what we did in the latest update of DMOP (see it on the online browser or download it here):  meta-features have been reified and are now descendant classes of the high-level DOLCE concept Quality; their instances are the different values they take in DM objects (e.g. class entropy of the Iris dataset). With this representation, we can annotate each meta-feature with the formula used to quantify it, and each meta-feature value with its putative or observed impact on data mining performance. We are still looking for an exploitable representation of mathematical formulas; these are currently stored as strings in an annotation property called latexFormula and are useful only for documentation purposes. The observed behavior of meta-features should not be recorded directly in the ontology but in DM experiment databases built on DMOP's conceptual framework. When queries on such databases allow us to detect statistically significant trends, this kind of experimentally derived knowledge can be used to refine the ontology (e.g., by revising an empirical measure of meta-feature utility or by deprecating meta-features that have been proven worthless).
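A hedged sketch of the reification described above, again in plain Python rather than OWL: a meta-feature becomes a first-class object carrying its own annotations (the latexFormula annotation property is mentioned in the text; the ClassEntropy formula and the helper names here are illustrative), and its instances bind a value to a particular dataset, as in the class-entropy-of-Iris example:

```python
# Sketch of a reified meta-feature: the feature itself is an object
# with annotations; a separate object ties a measured value to a dataset.

from dataclasses import dataclass
from math import log2
from collections import Counter

@dataclass
class MetaFeature:
    name: str
    latexFormula: str      # stored as a string, documentation-only (as in DMOP)

@dataclass
class MetaFeatureValue:
    feature: MetaFeature
    dataset: str
    value: float

class_entropy = MetaFeature(
    name="ClassEntropy",
    latexFormula=r"-\sum_i p_i \log_2 p_i",
)

def measure_class_entropy(labels, dataset_name):
    """Compute Shannon entropy of the class distribution."""
    counts = Counter(labels)
    n = len(labels)
    h = -sum((c / n) * log2(c / n) for c in counts.values())
    return MetaFeatureValue(class_entropy, dataset_name, h)

# Iris has three equally sized classes, so entropy is log2(3) ~ 1.585.
v = measure_class_entropy([0] * 50 + [1] * 50 + [2] * 50, "Iris")
print(round(v.value, 3))
```

The annotation of observed impact on data mining performance would live outside such objects, in the experiment databases mentioned above.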

Data Cubes

An interesting and relevant piece of work (pointed out by Aldo Gangemi) is the RDF Data Cube vocabulary (http://publishing-statistical-data.googlecode.com/svn/trunk/specs/src/ma...).
