
Data Mining Design and Systematic Modelling

Yannic Kropp, Bernhard Thalheim
Christian Albrechts University Kiel, Department of Computer Science, D-24098 Kiel, Germany

Abstract. Data mining is currently a well-established technique and supported by many algorithms. It is dependent on the data on hand, on properties of the algorithms, on the technology developed so far, and on the expectations and limits to be applied. It must thus be matured, predictable, optimisable, evolving, adaptable and well-founded, similar to mathematics and SPICE/CMM-based software engineering. Data mining must therefore be systematic if the results are to fit their purpose. One basis of this systematic approach is model management and model reasoning. We claim that systematic data mining is nothing else than systematic modelling. The main notion is the notion of the model in a variety of forms, abstraction and associations among models.

Keywords: data mining, modelling, models, framework, deep model, normal model, modelling matrix

1 Introduction

Data mining and analysis is nowadays well understood from the algorithms side. There are thousands of algorithms that have been proposed. The number of success stories is overwhelming and has caused the big data hype. At the same time, brute-force application of algorithms is still the standard. Nowadays data analysis and data mining algorithms are still taken for granted. They transform data sets and hypotheses into conclusions. For instance, cluster algorithms check on given data sets and for a clustering requirements portfolio whether this portfolio can be supported, and provide a set of clusters as output in the positive case. The Hopkins index is one of the criteria that allow one to judge whether clusters exist within a data set.
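
As an aside, and not part of the original presentation, the idea behind the Hopkins index can be sketched in a few lines of Python. The sketch below is only a rough illustration under our own assumptions (all names are invented): it compares nearest-neighbour distances of artificial uniform points with those of sampled data points; values close to 1 suggest a clustering tendency, values near 0.5 suggest essentially random data.

    # Minimal, illustrative sketch of a Hopkins-style statistic (not from the paper).
    import numpy as np

    def hopkins(X, m=None, rng=None):
        rng = np.random.default_rng(rng)
        X = np.asarray(X, dtype=float)
        n, d = X.shape
        m = m or max(1, n // 10)                  # size of the probe sample
        lo, hi = X.min(axis=0), X.max(axis=0)

        def nn_dist(points, exclude_self=False):
            # nearest-neighbour distance from each probe point to the data set
            dists = []
            for p in points:
                dd = np.sqrt(((X - p) ** 2).sum(axis=1))
                dists.append(np.sort(dd)[1] if exclude_self else dd.min())
            return np.array(dists)

        u = nn_dist(rng.uniform(lo, hi, size=(m, d)))           # artificial points
        w = nn_dist(X[rng.choice(n, m, replace=False)], True)   # sampled data points
        return u.sum() / (u.sum() + w.sum())

For a data set X of shape (n, d), hopkins(X) returns a value between 0 and 1 that can serve as one of several acceptance criteria before clustering is attempted.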

A systematic approach to data mining has already been proposed in [3, 17]. It is based on mathematics and mathematical statistics and is thus able to handle errors, biases and the configuration of data mining as well. Our experience in large data mining projects in archaeology, ecology, climate research, medical research etc. has however shown that ad-hoc and brute-force mining is still the main approach. The results are taken for granted and believed despite the modelling, understanding, flow-of-work and data handling pitfalls. So, the results often become dubious.

Data are the main source of information in data mining and analysis. Their quality properties have been neglected for a long time. At the same time, modern data management allows to handle these problems. In [16] we compare the critical findings or pitfalls of [21] with resolution techniques that can be applied to overcome the crucial pitfalls of data mining in environmental sciences reported there. The algorithms that are typically used for the solution of data mining and analysis tasks are themselves another source of pitfalls. It is neglected that an algorithm also has an application area, application restrictions, data requirements, and results at a certain granularity and precision. These problems must be systematically tackled if we want to rely on the results of mining and analysis. Otherwise analysis may become misleading, biased, or not possible. Therefore, we explicitly treat properties of mining and analysis. A similar observation can be made for data handling.

Data mining is often considered to be a separate sub-discipline of computer engineering and science. The statistics basis of data mining is well accepted. We typically start with a general (or better generic) model and use the data that are on hand and that seem to be appropriate for refinement or improvement of the model. This technique is known in the sciences under several names such as inverse modelling, generic modelling, pattern-based reasoning, (inductive) learning, universal application, and systematic modelling.

Data mining is typically not only based on one model but rather on a model ensemble or model suite. The association among models in a model suite is explicitly specified. These associations are given an explicit form via model suites. Reasoning techniques combine methods from logics (deductive, inductive, abductive, counter-inductive, etc.), from artificial intelligence (hypothetic, qualitative, concept-based, adductive, etc.), from computational methods (algorithmics [6], topology, geometry, reduction, etc.), and from cognition (problem representation and solving, causal reasoning, etc.).

These choices and handling approaches need a systematic underpinning. Techniques from artificial intelligence, statistics, and engineering are bundled within the CRISP framework (e.g. [3]). They can be enhanced by techniques that have originally been developed for modelling, design science, business informatics, learning theory, action theory etc.

We combine and generalize the CRISP, heuristics, modelling theory, design science, business informatics, statistics, and learning approaches in this paper. First, we introduce our notion of the model. Next we show how data mining can be designed. We apply this investigation to systematic modelling and later to systematic data mining. It is our goal to develop a holistic and systematic framework for data mining and analysis. Many issues are left out of the scope of this paper, such as a literature review, a formal introduction of the approach, and a detailed discussion of data mining application cases.

2 Models and Modelling

Models are principal instruments in mathematics, data analysis, modern computer engineering (CE), teaching any kind of computer technology, and also modern computer science (CS). They are built, applied, revised and manufactured in many CE&CS sub-disciplines in a large variety of application cases with different purposes and context for different communities of practice. It is now well understood that models are something different from theories. They are often intuitive, visualizable, and ideally capture the essence of an understanding within some community of practice and some context. At the same time, they are limited in scope, context and applicability.

2.1 The Notion of the Model

There is however a general notion of a model and of a conception of the model:

A model is a well-formed, adequate, and dependable instrument that represents origins [9, 29, 30].

Its criteria of well-formedness, adequacy, and dependability must be commonly accepted by its community of practice within some context and correspond to the functions that a model fulfills in utilization scenarios.

A well-formed instrument is adequate for a collection of origins if it is analogous to the origins to be represented according to some analogy criterion, it is more focused (e.g. simpler, truncated, more abstract or reduced) than the origins being modelled, and it sufficiently satisfies its purpose.

Well-formedness enables an instrument to be justified by an empirical corroboration according to its objectives, by rational coherence and conformity explicitly stated through conformity formulas or statements, by falsifiability or validation, and by stability and plasticity within a collection of origins.

The instrument is sufficient by its quality characterization for internal quality, external quality and quality in use, or through quality characteristics such as comprehensibility, parsimony, robustness, novelty etc. Sufficiency is typically combined with some assurance evaluation (tolerance, modality, confidence, and restrictions).

2.2 Generic and Specific Models

The general notion of a model covers all aspects of adequateness, dependability, well-formedness, scenario, functions and purposes, backgrounds (grounding and basis), and outer directives (context and community of practice). It covers all notions known so far in agriculture, archaeology, arts, biology, computer science, environmental sciences, farming, geosciences, historical sciences, languages, mathematics, medicine, ocean sciences, pedagogical science, philosophy, physics, political sciences, sociology, and sports. The models used in these disciplines are instruments used in certain scenarios.

Sciences distinguish between general, particular and specific things. Particular things are specific for general things and general for specific things. The same abstraction may be used for modelling. We may start with a general model. So far, nobody knows how to define general models for most utilization scenarios.

Models function as instruments or tools. Typically, instruments come in a variety of forms and fulfill many different functions. Instruments are partially independent or autonomous of the thing they operate on. Models are however special instruments. They are used with a specific intention within a utilization scenario. The quality of a model becomes apparent in the context of this scenario.

It might thus be better to start with generic models. A generic model [4, 26, 31, 32] is a model which broadly satisfies the purpose and broadly functions in the given utilization scenario. It is later tailored to suit the particular purpose and function. It generally represents origins of interest, provides means to establish adequacy and dependability of the model, and establishes focus and scope of the model. Generic models should satisfy at least five properties: (i) they must be accurate; (ii) the quality of generic models allows that they are used consciously; (iii) they should be descriptive, not evaluative; (iv) they should be flexible so that they can be modified from time to time; (v) they can be used as a first "best guess".

2.3 Model Suites

Most disciplines integrate a variety of models or a society of models, e.g. [7, 14]. Models used in CE&CS are mainly at the same level of abstraction. It is already well-known for threescore years that they form a model ensemble (e.g. [10, 23]) or horizontal model suite (e.g. [8, 27]). Developed models vary in their scopes, in the aspects and facets they represent, and in their abstraction.

A model suite consists of a set of models {M1, ..., Mn}, of an association or collaboration schema among the models, of controllers that maintain consistency or coherence of the model suite, of application schemata for explicit maintenance and evolution of the model suite, and of tracers for the establishment of the coherence.
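
As an illustration that is not part of the original paper, the constituents of a model suite can be written down as a plain data structure; the class and field names below are our own invention, not the authors' notation.

    # Illustrative sketch of a model suite as defined above.
    from dataclasses import dataclass, field
    from typing import Any, Callable, Dict, List

    Model = Any                                    # a model of any kind (schema, regression model, ...)

    @dataclass
    class Association:
        source: str                                # name of the source model
        target: str                                # name of the target model
        mapping: Callable[[Model], Model]          # how changes propagate between the two

    @dataclass
    class ModelSuite:
        models: Dict[str, Model]                                   # {M1, ..., Mn}
        associations: List[Association]                            # collaboration schema
        controllers: List[Callable[["ModelSuite"], bool]]          # maintain consistency/coherence
        application_schemata: List[Callable[["ModelSuite"], None]] # maintenance and evolution
        tracers: List[Callable[["ModelSuite"], Dict[str, Any]]] = field(default_factory=list)

        def coherent(self) -> bool:
            # the suite is coherent if every controller accepts it
            return all(check(self) for check in self.controllers)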

Multi-modelling [11, 19, 24] has become a culture in CE&CS. Maintenance of coherence, co-evolution, and consistency among models has become a bottleneck in development. Moreover, different languages with different capabilities have become an obstacle, similar to multi-language retrieval [20] and impedance mismatches. Models are often loosely coupled. Their dependence and relationship is often not explicitly expressed. This problem becomes more complex if models are used for different purposes such as construction of systems, verification, optimization, explanation, and documentation.

2.4 Stepwise Refinement of Models

Refinement of a model to a particular or special model provides mechanisms for model transformation along the adequacy, the justification and the sufficiency of a model. Refinement is based on specialization for better suitability of a model, on removal of unessential elements, on combination of models to provide a more holistic view, on integration that is based on binding of model components to other components, and on enhancement that typically improves a model to become more adequate or dependable.

Control of correctness of refinement [33] for information systems takes into account (A) a focus on the refined structure and refined vocabulary, (B) a focus on information systems structures of interest, (C) abstract information systems computation segments, (D) a description of database segments of interest, and (E) an equivalence relation among those data of interest.

2.5 Deep Models and the Modelling Matrix

Model development is typically based on an explicit and rather quick description of the 'surface' or normal model and on the mostly unconditional acceptance of a deep model. The latter directs the modelling process and the surface or normal model. Modelling itself is often understood as development and design of the normal model. The deep model is taken for granted and accepted for a number of normal models.

The deep model can be understood as the common basis for a number of models. It consists of the grounding for modelling (paradigms, postulates, restrictions, theories, culture, foundations, conventions, authorities), the outer directives (context and community of practice), and the basis (assumptions, general concept space, practices, language as carrier, thought community and thought style, methodology, pattern, routines, commonsense) of modelling. It uses a collection of undisputable elements of the background as grounding and additionally a disputable and adjustable basis which is commonly accepted in the given context by the community of practice. Education on modelling starts, for instance, directly with the deep model. In this case, the deep model has to be accepted and is thus hidden and latent.

A (modelling) matrix is something within or from which something else originates, develops, or takes form. The matrix is assumed to be correct for normal models. It consists of the deep model and the modelling scenarios. The modelling agenda is derived from the modelling scenario and the utilization scenarios. The modelling scenario and the deep model serve as a part of the definitional frame within a model development process. They also define the capacity and potential of a model whenever it is utilized.

Deep models and the modelling matrix also define some frame for adequacy and dependability. This frame is enhanced for specific normal models. It is then used for a statement in which cases a normal model represents the origins under consideration.
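
To fix the terminology, the constituents named above can be sketched as plain records. The sketch is our own illustration only; the names are invented and no claim is made that the authors formalize the notions this way.

    # Illustrative sketch: deep model, matrix and normal model as plain records.
    from dataclasses import dataclass
    from typing import Any, List

    @dataclass
    class DeepModel:
        grounding: List[str]         # paradigms, postulates, theories, conventions (undisputable)
        outer_directives: List[str]  # context and community of practice
        basis: List[str]             # assumptions, concept space, language, methodology (adjustable)

    @dataclass
    class Matrix:
        deep_model: DeepModel
        modelling_scenarios: List[str]   # from these, together with the utilization
                                         # scenarios, the modelling agenda is derived

    @dataclass
    class NormalModel:
        matrix: Matrix                   # accepted, mostly unconditionally
        surface: Any                     # the 'surface' model that is actually designed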

2.6 Deep Models and Matrices in Archaeology

Let us consider an application case. The CRC 1266 "Scales of Transformation – Human-Environmental Interaction in Prehistoric and Archaic Societies" investigates processes of transformation from 15,000 BCE to 1 BCE, including crisis and collapse, on different scales and dimensions, and as involving different types of groups, societies, and social formations. It is based on the matrix and a deep model as sketched in Figure 1. This matrix determines which normal models can still be considered and which cannot. The initial model for any normal model accepts this matrix.

Figure 1 Modelling in archaeology with a matrix

We base our consideration of the matrix and the deep model on [19] and the discussions in the CRC. Whether the deep model or the model matrix is appropriate has already been discussed. The final version presented in this paper illustrates de/en

2.7 Stereotyping of a Data Mining Process

Typical modelling (and data mining) processes follow some kind of ritual or typical guideline, i.e. they are stereotyped. The stereotype of a modelling process is based on a general modelling situation. Most modelling methodologies are bound to one stereotype and one kind of model within one model utilization scenario. Stereotypes govern, condition, steer and guide the model development. They determine the model kind, the background and the way of modelling activities. They pervade the activities of modelling. They provide a means for considering the economics of modelling. Often, stereotypes use a definitional frame that primes and orients the processes and that considers the community of practice or actors within the model development and utilization processes, the deep model or the matrix with its specific language and model basis, and the agenda for model development. It might be enhanced by initial models which are derived from generic models in accordance with the matrix.

The model utilization scenario determines the function that a model might have and therefore also the goals and purposes of a model.

2.8 The Agenda

The agenda is something like a guideline for modelling activities and for model associations within a model suite. It improves the quality of model outcomes by spending some effort on deciding what and how much reasoning to do as opposed to what activities to do. It balances resources between the data-level actions and the reasoning actions. E.g. [17] uses an agent approach with preparation agents, exploration agents, descriptive agents, and predictive agents. The agenda for a model suite thus uses decision points that require agenda control according to performance, introspective monitoring of performance for the data mining process, coordinated control of the entire mining process, and coordinated refinement of the models. Such kind of control is already necessary due to the problem space, the limitations of resources, and the amount of uncertainty in knowledge, concepts, data, and the environment.

3 Data Mining Design

3.1 Conceptualization of Data Mining and Analysis

The data mining and analysis task must be enhanced by an explicit treatment of the languages used for concepts and hypotheses, and by an explicit description of knowledge that can be used. The algorithmic solution of the task is based on knowledge about the algorithms that are used and about the data that are available and required for the application of the algorithms. Typically, analysis algorithms are iterative and can run forever. We are interested only in convergent ones and thus need termination criteria. Therefore, conceptualization of the data mining and analysis task consists of a detailed description of six main parameters (e.g. for inductive learning [34]):

(a) The data analysis algorithm: Algorithm development is the main activity in data mining research. Each of these algorithms transfers data and some specific parameters of the algorithm into a result.

(b) The concept space: The concept space defines the concepts under consideration for analysis based on a certain language and common understanding.

(c) The data space: The data space typically consists of a multi-layered data set of different granularity. Data sets may be enhanced by metadata that characterize the data sets and associate the data sets to other data sets.

(d) The hypotheses space: An algorithm is supposed to map evidence on the concepts to be supported or rejected into hypotheses about it.

(e) The prior knowledge space: Specifying the hypothesis space already provides some prior knowledge. In particular, the analysis task starts with the assumption that the target concept is representable in a certain way.

(f) The acceptability and success criteria: Criteria for successful analysis allow to derive termination criteria for the data analysis.

Each instantiation and refinement of the six parameters leads to specific data mining tasks. The result of data mining and data analysis is described within the knowledge space. The data mining and analysis task may thus be considered to be a transformation of data sets, concept sets and hypothesis sets into chunks of knowledge through the application of algorithms.
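
Read this way, the six parameters form the signature of a data mining task. The following sketch is our own illustration of that reading; the field names are invented and merely mirror items (a)-(f).

    # Sketch of a data mining task over the six parameter spaces (a)-(f); illustrative only.
    from dataclasses import dataclass
    from typing import Any, Callable, Iterable, List

    @dataclass
    class MiningTask:
        algorithm: Callable[..., Any]             # (a) the data analysis algorithm
        concept_space: List[str]                  # (b) concepts under consideration
        data_space: Any                           # (c) layered data sets plus metadata
        hypothesis_space: Iterable[Any]           # (d) hypotheses to be supported or rejected
        prior_knowledge: List[Any]                # (e) e.g. assumed representability of the target concept
        success_criteria: Callable[[Any], bool]   # (f) acceptability and termination criteria

        def run(self) -> Any:
            # a task transforms data, concepts and hypotheses into chunks of knowledge
            result = self.algorithm(self.data_space, self.hypothesis_space)
            return result if self.success_criteria(result) else None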

Problem solving and modelling, however, typically considers six aspects [16]:

(1) Application, problems, and users: The domain consists of a model of the application, a specification of the problems under consideration, of the tasks that are issued, and of profiles of users.

(2) Context: The context of a problem is anything that could support the problem solution, e.g. the sciences' background, theories, knowledge, foundations, and concepts to be used for problem specification, problem background, and solutions.

(3) Technology: Technology is the enabler and defines the methodology. It provides [23] means for the flow of problem solving steps, the flow of activities, the distribution, the collaboration, and the exchange.

(4) Techniques and methods: Techniques and methods can be given as algorithms. Specific algorithms are data improvers and cleaners, data aggregators, data miners, and termination algorithms.

(5) Data: Data have their own structuring, their quality and their life span. They are typically enhanced by metadata. Data management is a central element of most problem solving processes.

(6) Solutions: The solutions to problem solving can be formally given, illustrated by visual means, and presented by models. Models are typically only normal models. The deep model and the matrix are already provided by the context and accepted by the community of practice in dependence on the needs of this community for the given application scenario. Therefore, models may be the final result of a data mining and analysis process beside other means.

Comparing these six spaces with the six parameters, we discover that only four spaces are considered so far in data mining. We miss the user and application space as well as the representation space. Figure 2 shows the difference.

Figure 2 Parameters of Data Mining and the Problem Solving Aspects

3.2 Meta-models of Data Mining

An abstraction layer approach separates the application domain, the model domain and the data domain [17]. This separation is illustrated in Figure 3.

Figure 3 The V meta-model of Data Mining Design

The data mining design framework uses the inverse modelling approach. It starts with the consideration of the application domain and develops models as mediators between the data world and the application domain world. In the sequel we are going to combine the three approaches of this section. The meta-model corresponds to other meta-models such as inductive modelling or hypothetical reasoning (hypotheses development, experimenting and testing, analysis of results, interim conclusions, reappraisal against the real world).

4 Data Mining: A Systematic Model-Based Approach

Our approach presented so far allows us to revise and to reformulate the model-oriented data mining process on the basis of well-defined engineering [15, 25] or, alternatively, of systematic mathematical problem solving [22]. Figure 4 displays this revision. We realize that the first two phases are typically implicitly assumed and not considered. We concentrate on the non-iterative form. Iterative processes can be handled in a similar form.

Figure 4 The Phases in Data Mining Design (non-iterative form)

4.1 Setting the Deep Model and the Matrix

The problem to be tackled must be clearly stated in dependence on the utilization scenario, the tasks to be solved, the community of practice involved, and the given context. The result of this step is the deep model and its matrix. The first one is based on the background, the specific context parameters such as infrastructure and environment, and candidates for deep models.

The data mining tasks can now be formulated based on the matrix and the deep model. We set up the context, the environment, the general goal of the problem and also criteria for adequateness and dependability of the solution, e.g. invariance properties for the problem description and for the task setting and its mathematical formulation, and solution faithfulness properties for the later application of the solution in the given environment. What exactly is the problem, the expected benefit? What should a solution look like? What is known about the application?

Deep models already use a background consisting of an undisputable grounding and a selectable basis. The explicit statement of the background provides notions, conceptions, practices, etc. Without the background, the results of the analysis cannot be properly understood. Models have their profile, i.e. goals, purposes and functions. These must be explicitly given. The parameters of a generic model can be either order or slave parameters [12], either primary or secondary or tertiary (also called genotypes or phenotypes or observables) [1, 5], and either ruling (or order) or driven parameters [12]. Data mining can be enhanced by knowledge management techniques. Additionally, the concept space into which the data mining task is embedded must be specified. This concept space is enhanced during data analysis.
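
As a purely hypothetical illustration of this first phase, a deep model, a matrix and a model profile could be fixed for a clustering problem as below. The structure mirrors the sketch given after Section 2.5; every concrete value is invented and serves only to show what "setting the deep model and the matrix" might produce.

    # Hypothetical instantiation for an invented clustering task (illustration only).
    deep_model = {
        "grounding": ["statistical learning theory", "CRISP-like process postulates"],
        "outer_directives": ["archaeology domain experts", "project context of the CRC"],
        "basis": ["settlement sites are point events", "Euclidean distance is meaningful"],
    }
    matrix = {"deep_model": deep_model,
              "modelling_scenarios": ["detect spatial concentrations of settlement sites"]}
    profile = {                      # goals, purposes and functions of the intended model
        "goal": "group settlement sites by spatial proximity",
        "adequacy": "clusters remain stable under small perturbations of the data",
        "dependability": "results are reproducible on the published data set",
    }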

The result of the entire data mining process heavily depends on the appropriateness of the data sets, their properties and quality, and more generally on the data schemata with essentially three components: the application data schema with a detailed description of data types, the metadata schema [18], and generated and auxiliary data schemata. The first component is well investigated in data mining and data management monographs. The second and third components inherit research results from database management, from data marts or warehouses, and from the layering of data. An essential element is the explicit specification of the quality of data. It allows to derive algorithms for data improvement and to derive limitations for the applicability of algorithms. Auxiliary data support the performance of the algorithms. Therefore typical data-oriented questions are: What data do we have available? Is the data relevant to the problem? Is it valid? Does it reflect our expectations? Is the data quality, quantity, recency sufficient? Which data should we concentrate on? How is the data transformed for modelling? How may we increase the quality of data?

4.2 Stereotyping the Process

The general flow of data mining activities is typically implicitly assumed on the basis of stereotypes which form a set of tasks, e.g. tasks of proof in whatever system, transformation tasks, description tasks, and investigation tasks. Proofs can follow the classical deductive or inductive setting. Also, abductive, adductive, hypothetical and other reasoning techniques are applicable. Stereotypes typically use model suites as a collection of associated models, are already biased by priming and orientation, and follow policies, data mining design constraints, and framing.

Data mining and analysis is rather stereotyped. For instance, mathematical culture has already developed a good number of stereotypes for problem formulation. It is based on a mathematical language for the formulation of analysis tasks, and on selection and instantiation of the best fitting variable space and the space of opportunities provided by mathematics.

Data mining uses generic models which are the basis of normal models. Models are based on a separation of concern according to the problem setting: dependence-indicating, dependence-describing, separation or partition spaces, pattern kinds, reasoning kinds, etc. This separation of concern governs the classical data mining algorithmic classes: association analysis, cluster analysis, data grouping with or without classification, classifiers and rules, dependences among parameters and data subsets, predictor analysis, synergetics, blind or informed or heuristic investigation of the search space, and pattern learning.

4.3 Initialization of the Normal Data Models

Data mining algorithms have their capacity and potential [2]. Potential and capacity can be based on SWOT (strengths, weaknesses, opportunities, and threats), SCOPE (situation, core competencies, obstacles, prospects, expectation), and SMART (how simple, meaningful, adequate, realistic, and trackable) analysis of methods and algorithms. Each of the algorithm classes has its strengths and weaknesses, its satisfaction of the tasks and the purpose, and its limits of applicability. Algorithm selection also includes an explicit specification of the order of application of these algorithms and of the mapping of parameters that are derived by means of one algorithm to those that are an input for the others, i.e. an explicit association within the model suite. Additionally, evaluation algorithms for the success criteria are selected. Algorithms have their own obstinacy, their hypotheses and assumptions that must be taken into consideration. Whether an algorithm can be considered depends on acceptance criteria derived in the previous two steps.

So, we ask: What kind of model suite architecture suits the problem best? What are applicable development approaches for modelling? What is the best modelling technique to get the right model suite? What kind of reasoning is supported? What is not? What are the limitations? Which pitfalls should be avoided?
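
A rough sketch of how such an initialization might look in code is given below. It is not taken from the paper: the candidate classes, the acceptance rules and all thresholds are invented, and only illustrate the idea that algorithm classes are screened against acceptance criteria before an explicit order of application is fixed.

    # Illustrative screening and ordering of candidate algorithm classes.
    candidates = {
        "k-means clustering":      {"strengths": {"fast", "scales well"},
                                    "weaknesses": {"assumes convex clusters"}},
        "hierarchical clustering": {"strengths": {"no fixed number of clusters"},
                                    "weaknesses": {"quadratic memory"}},
    }

    def acceptable(profile, data_size, weaknesses):
        # acceptance criteria derived in the two previous steps (matrix, stereotype)
        if "assumes convex clusters" in weaknesses and profile["cluster_shape"] == "arbitrary":
            return False
        if "quadratic memory" in weaknesses and data_size > 10**6:
            return False
        return True

    task_profile = {"cluster_shape": "convex"}      # invented example profile
    pipeline = [name for name, props in candidates.items()
                if acceptable(task_profile, data_size=50_000, weaknesses=props["weaknesses"])]
    # an explicit order of application and the mapping of output parameters of one
    # algorithm to the inputs of the next would complete the association in the suite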

4.4 The Data Mining Process Itself

The data mining process can be understood as a coherent and stepwise refinement of the given model suite. The model refinement may use an explicit transformation or an extract-transform-load process among models within the model suite. Evaluation and termination algorithms are an essential element of any data mining algorithm. They can be based on quality criteria for the finalized models in the model suite, e.g. generality, error-proneness, stability, selection-proneness, validation, understandability, repeatability, usability, usefulness, and novelty.

Typical questions to answer within this process are: How good is the model suite in terms of the task setting? What have we really learned about the application domain? What is the real adequacy and dependability of the models in the model suite? How can these models be deployed best? How do we know that the models in the model suite are still valid? Which data are supporting which model in the model suite? Which kind of errors of data is inherited by which part of which model?

The final result of the data mining process is then a combination of the deep model and the normal model, whereas the first one is a latent or hidden component in most cases. If we want, however, to reason about the results then the deep model must be understood as well. Otherwise, the results may become surprising and may not be convincing.
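
The view of mining as stepwise refinement with explicit evaluation and termination can be condensed into a small sketch; it is our own illustration, and refine, coherent and quality stand for whatever algorithms were selected in the previous phase.

    # Sketch of the mining process as stepwise refinement of a model suite.
    def mine(suite, refine, coherent, quality, threshold=0.95, max_steps=100):
        for _ in range(max_steps):            # analysis algorithms may run forever,
            suite = refine(suite)             # so an explicit bound is needed
            if not coherent(suite):           # controllers keep the suite coherent
                raise ValueError("refinement broke the coherence of the model suite")
            if quality(suite) >= threshold:   # e.g. stability, validity, usefulness, novelty
                return suite
        return suite                          # best effort if the criteria are not reached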

4.5 Controllers and Selectors

Algorithmics [6] treats algorithms as general solution patterns that have parameters for their instantiation, handling mechanisms for their specialization to a given environment, and enhancers for context injection. So, an algorithm can be derived based on explicit selectors and control rules [4] if we neglect context injection. We can use this approach for data mining design (DMD). For instance, an algorithm pattern such as regression uses a generic model of parameter dependence, is based on blind search, has parameters for si
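
A minimal sketch of what an algorithm pattern with selectors and control rules might look like is given below. It is not the authors' formalization: the regression example, the selectors and all parameter values are invented for illustration.

    # Sketch of an algorithm pattern with parameters, selectors and control rules.
    import numpy as np

    def regression_pattern(X, y, basis=lambda x: x, regulariser=0.0):
        # generic model of parameter dependence: y ~ Phi(X) @ w
        Phi = np.column_stack([np.ones(len(X)), basis(np.asarray(X, dtype=float))])
        w = np.linalg.solve(Phi.T @ Phi + regulariser * np.eye(Phi.shape[1]), Phi.T @ y)
        return w

    # selectors and control rules pick an instantiation for the environment at hand
    selectors = {"small noisy sample": {"regulariser": 1.0},
                 "large clean sample": {"regulariser": 0.0}}
    w = regression_pattern(np.arange(10.0), 2 * np.arange(10.0) + 1,
                           **selectors["large clean sample"])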
