Mechanizing Hypothesis Formation: Principles and Case Studies

by David Chudán, Jan Rauch, Milan Šimůnek, Petr Máša

Length: 346 pages
Edition: 1
Language: English
Publisher: CRC Press
Publication Date: 2022-10-20
ISBN-10: 0367549808
ISBN-13: 9780367549800

Mechanizing hypothesis formation is an approach to exploratory data analysis. Its development started in the 1960s, inspired by the question “Can computers formulate and verify scientific hypotheses?” The development resulted in a general theory of the logic of discovery. It comprises theoretical calculi dealing with theoretical statements as well as observational calculi dealing with observational statements concerning finite results of observation. Both calculi are related through statistical hypothesis tests. A GUHA method is a tool of the logic of discovery. It uses a one-to-one relation between theoretical and observational statements to get all interesting theoretical statements. A GUHA procedure generates all interesting observational statements and verifies them in given observational data. The output of the procedure consists of all observational statements true in the given data. Several GUHA procedures dealing with association rules, couples of association rules, action rules, histograms, couples of histograms, and patterns based on general contingency tables are included in the LISp-Miner system developed at the Prague University of Economics and Business. Various results about observational calculi were achieved and applied together with the LISp-Miner system.

The book gives a brief overview of the logic of discovery. It presents many examples of applying the GUHA procedures to real problems in data mining and business intelligence. It also provides an overview of recent research results on dealing with domain knowledge in data mining and on automating data mining, and it reports firsthand experience with implementing the GUHA method in Python.
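
The generate-and-verify principle of a GUHA procedure described above can be illustrated with a minimal Python sketch. It is not code from the book, LISp-Miner, or CleverMiner; the toy data matrix, the function names, and the parameter defaults are all assumptions made for illustration. The sketch enumerates candidate association rules (a conjunction of attribute=value literals implying a target value), computes the fourfold frequencies a, b, c, d for each, and keeps the rules satisfying a founded-implication style condition: a / (a + b) >= p and a >= base.

```python
from itertools import combinations

# Hypothetical toy data matrix: each row is one observed object.
ROWS = [
    {"age": "young", "smoker": "yes", "bp": "high"},
    {"age": "young", "smoker": "no", "bp": "normal"},
    {"age": "old", "smoker": "yes", "bp": "high"},
    {"age": "old", "smoker": "yes", "bp": "high"},
    {"age": "old", "smoker": "no", "bp": "normal"},
    {"age": "young", "smoker": "yes", "bp": "normal"},
]


def holds(row, literals):
    """True if the row satisfies every (attribute, value) literal."""
    return all(row[attr] == value for attr, value in literals)


def fourfold_table(rows, antecedent, succedent):
    """Return the frequencies (a, b, c, d) of antecedent vs. succedent."""
    a = b = c = d = 0
    for row in rows:
        ant = holds(row, antecedent)
        suc = holds(row, [succedent])
        if ant and suc:
            a += 1
        elif ant:
            b += 1
        elif suc:
            c += 1
        else:
            d += 1
    return a, b, c, d


def mine_rules(rows, target_attr, p=0.9, base=2, max_len=2):
    """Generate every candidate rule and keep those true in the data.

    Illustrative only: the parameter names and defaults are assumptions.
    """
    attrs = sorted(k for k in rows[0] if k != target_attr)
    literals = sorted({(k, row[k]) for row in rows for k in attrs})
    targets = sorted({(target_attr, row[target_attr]) for row in rows})
    found = []
    for size in range(1, max_len + 1):
        for antecedent in combinations(literals, size):
            # Skip conjunctions that use the same attribute twice.
            if len({attr for attr, _ in antecedent}) < size:
                continue
            for succedent in targets:
                a, b, _, _ = fourfold_table(rows, antecedent, succedent)
                # Verification step: a / (a + b) >= p and a >= base.
                if a >= base and a + b > 0 and a / (a + b) >= p:
                    found.append((antecedent, succedent, a, b))
    return found


for antecedent, succedent, a, b in mine_rules(ROWS, "bp"):
    print(antecedent, "=>", succedent, f"(a={a}, b={b})")
```

The output is the full list of candidate rules that hold in the data, mirroring the idea that a GUHA procedure returns all observational statements true in the given data. Exhaustive enumeration like this only scales to small inputs; it is meant to show the principle, not an efficient implementation.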

Cover
Title Page
Copyright Page
Dedication
Preface
Table of Contents
1. Introduction
	1.1 Mechanizing Hypothesis Formation
		1.1.1 Questions of logic of discovery
		1.1.2 Logic of discovery and observational calculi
		1.1.3 GUHA method—tool of logic of discovery
		1.1.4 Notes to history and overview of results
	1.2 Data Mining
		1.2.1 Discipline of informatics
		1.2.2 CRISP-DM
		1.2.3 Rules discovery
		1.2.4 Exception rules and action rules
		1.2.5 Subgroup discovery
		1.2.6 Presented GUHA procedures and data mining
	1.3 Business Intelligence and Data Science
		1.3.1 Business Intelligence and GUHA procedures
		1.3.2 Data science and mechanizing hypothesis formation
	1.4 Data Matrix
		1.4.1 Data matrix—an example
		1.4.2 Data matrix—definition
		1.4.3 Boolean attributes
		1.4.4 Data sub-matrix
	1.5 Data Matrix and Items of Domain Knowledge
		1.5.1 Groups of attributes
		1.5.2 Transformations of attributes
		1.5.3 Global properties of attributes
		1.5.4 Mutual dependence of attributes
	1.6 Goals, Structure, and Using the Book
		1.6.1 Goals and structure
		1.6.2 Using the book
2. Datasets
	2.1 Which Datasets and Where are they Used
	2.2 Adult Dataset
		2.2.1 Adult Dataset—basic info
		2.2.2 Adult Dataset—derived attributes
		2.2.3 Adult Dataset—items of domain knowledge
	2.3 UK Car Accidents Dataset
		2.3.1 Accidents data matrix and groups of attributes
		2.3.2 Group Date_Time
		2.3.3 Group Driver
		2.3.4 Group Conditions
		2.3.5 Group Vehicle
		2.3.6 Group Authorities
		2.3.7 Group Consequences
	2.4 STULONG Dataset
		2.4.1 Entry data matrix and groups of attributes
		2.4.2 Group Personal
		2.4.3 Group Anamnesis
		2.4.4 Group Risks
		2.4.5 Group Measurement
		2.4.6 Group Alcohol consumption
		2.4.7 Group Blood pressure
		2.4.8 Group Biochemical examination
	2.5 Fictive Hotel Dataset
		2.5.1 HotelPlusExternal data matrix and groups of attributes
		2.5.2 Group Guest
		2.5.3 Group Domicile
		2.5.4 Group Meteo
		2.5.5 Group Questionnaire
		2.5.6 Group Stay
		2.5.7 Group Check-in
		2.5.8 Group Price
Section I: The GUHA Procedures
	3. Principle and Simple Examples
		3.1 GUHA Procedures Principle
		3.2 Association Rules and 4ft-Miner
		3.3 Histograms and CF-Miner
		3.4 Pairs of Attributes and KL-Miner
		3.5 Couples of Association Rules and SD4ft-Miner
		3.6 Couples of Histograms and SDCF-Miner
		3.7 Couples of Pairs of Attributes and SDKL-Miner
		3.8 Action Rules and Ac4ft-Miner
	4. Common Features
		4.1 Overview of Procedures and Patterns
		4.2 Contingency Tables
		4.3 Principles of Patterns Evaluation
		4.4 Set of Relevant Boolean Attributes
			4.4.1 Literals and types of coefficients
			4.4.2 Set of relevant literals
			4.4.3 Example of partial cedents
			4.4.4 Set of relevant partial cedents
			4.4.5 Set of relevant cedents
		4.5 Missing Information
			4.5.1 Data matrices with missing information
			4.5.2 Secured completion of missings and Boolean attributes
	5. LISp-Miner System
		5.1 Overview of LISp-Miner
			5.1.1 Teaching and research tool
			5.1.2 Home page
		5.2 Requirements and Prerequisites
		5.3 Main Concept
			5.3.1 Context diagram
			5.3.2 Analysed data
			5.3.3 Metabase
			5.3.4 Knowledgebase
			5.3.5 Context diagram of GUHA-procedure
			5.3.6 LM Workspace module
			5.3.7 Data-mining automation module
		5.4 EverMinerSimple Demo
		5.5 System Design and Implementation
			5.5.1 Programming language and environment
			5.5.2 Implementation layers
			5.5.3 Bitstrings
Section II: Applying the GUHA Procedures
	6. Examples Overview
		6.1 Overview of 4ft-Miner Application Examples
			6.1.1 4ft-Miner and arules
			6.1.2 Applying important features of GUHA association rules
			6.1.3 Mining for exception GUHA association rules
		6.2 Overview of CF-Miner Application Examples
			6.2.1 Subgroup discovery in Adult dataset
			6.2.2 Subgroup discovery in Accidents dataset
		6.3 Overview of KL-Miner Application Examples
			6.3.1 Blood pressure—ordinal dependence and independence
			6.3.2 Subgroup discovery using range of quantifiers
		6.4 Overview of SD4ft-Miner Application Examples
			6.4.1 Comparing districts
			6.4.2 Comparing female and male drivers
		6.5 Overview of SDCF-Miner Application Examples
			6.5.1 Exceptional histograms and authorities
			6.5.2 Trends of the number of accidents and police forces
		6.6 Overview of SDKL-Miner Applications
		6.7 Overview of Ac4ft-Miner Application Examples
			6.7.1 Action rules and blood pressure
			6.7.2 Action rules and guest satisfaction
		6.8 GUHA and Business Intelligence—Overview
		6.9 GUHA and Python—CleverMiner Project
		6.10 Examples Summary
			6.10.1 Applying coefficients
			6.10.2 Applying partial cedents
		6.11 Important Notes
	7. 4ft-Miner—GUHA Association Rules
		7.1 GUHA Association Rules and 4ft-Miner Procedure
			7.1.1 GUHA association rules and related notions
			7.1.2 4ft-quantifiers for classical mode of 4ft-Miner
			7.1.3 4ft-quantifiers for histogram mode of 4ft-Miner
			7.1.4 Association rules and missing information
			7.1.5 Secured completion and association rules
			7.1.6 Ignoring missing information
			7.1.7 Prime association rules
			7.1.8 4ft-Miner input and output
		7.2 Comparing 4ft-Miner and Arules
			7.2.1 Principles of comparison
			7.2.2 Performance
			7.2.3 Comparing ignoring missings and secured completion
			7.2.4 Loss of some interesting rules
			7.2.5 Applying GUHA features
			7.2.6 Summary of comparison
		7.3 Applying 4ft-Miner in Adult Dataset
			7.3.1 Applying sequences and right cuts—extreme gain
			7.3.2 Conjunctions in succedent—very rich persons
			7.3.3 Disjunctions in succedent—rich persons
			7.3.4 Applying logical deduction—prime rules
		7.4 Applying 4ft-Miner in Accidents Dataset
			7.4.1 Exception rules—increasing columns of histogram
			7.4.2 Exception rules—lowering columns of histogram
			7.4.3 Exception from exception—increasing confidence
	8. CF-Miner—Histograms
		8.1 CF-Miner and Related Notions
			8.1.1 Conditional histogram, CF-table and CF-pattern
			8.1.2 CF-Miner input and output
			8.1.3 Range of CF-quantifiers
			8.1.4 Simple frequencies CF-quantifiers
			8.1.5 CF-quantifiers concerning steps in histogram
		8.2 Applying CF-Miner to Adult Dataset
			8.2.1 Increasing histograms
			8.2.2 Decreasing histograms
			8.2.3 First decreasing and then increasing histograms
		8.3 Applying CF-Miner to Accidents Dataset
			8.3.1 Large segments of accidents with decreasing trend
			8.3.2 Exceptions to generally decreasing trend
			8.3.3 Exceptions to a concrete decreasing trend
			8.3.4 Exceptions to exception to generally decreasing trend
	9. KL-Miner—Pairs of Categorical Attributes
		9.1 KL-Miner and Related Notions
			9.1.1 KL-Miner input and output
			9.1.2 Four types of frequencies
			9.1.3 Range of KL-quantifiers
			9.1.4 Simple frequencies KL-quantifiers
			9.1.5 Advanced KL-quantifiers
		9.2 Applying KL-Miner in STULONG Dataset
			9.2.1 Conditions indicating high ordinal dependence
			9.2.2 Conditions indicating almost ordinal independence
		9.3 Applying KL-Miner in Hotel Dataset
			9.3.1 Applying range of KL-quantifier
	10. SD4ft-Miner—Couples of GUHA Association Rules
		10.1 SD4ft-Miner and Related Notions
			10.1.1 SD4ft-Miner input and output
			10.1.2 SD4ft-quantifiers
		10.2 Applying SD4ft-Miner in Accidents Dataset
			10.2.1 Differences among districts
			10.2.2 Confidence in districts higher than in the whole dataset
			10.2.3 Confidence in districts lower than in the whole dataset
			10.2.4 Relative frequency of accidents higher for male drivers
			10.2.5 Relative frequency of accidents higher for female drivers
			10.2.6 Similarities between male and female
	11. SDCF-Miner—Couples of Histograms
		11.1 SDCF-Miner and Related Notions
			11.1.1 SDCF-Miner input and output
			11.1.2 Modes of SDCF-Miner and SDCF-tables
			11.1.3 Simple frequencies SDCF-quantifiers
			11.1.4 SDCF-quantifiers concerning steps in histogram
		11.2 Applying SDCF-Miner in Accidents Dataset
			11.2.1 Exceptions to increasing trends and authorities
			11.2.2 Differences between police forces
	12. SDKL-Miner—Couples of Pairs of Categorical Attributes
		12.1 SDKL-Miner and Related Notions
			12.1.1 SDKL-Miner input and output
			12.1.2 SDKL-quantifiers
		12.2 Applying SDKL-Miner in STULONG Dataset
			12.2.1 Drinking liquors—groups with the highest τB difference
			12.2.2 Drinking wine—groups with the highest τB difference
			12.2.3 Drinking beer—groups with the highest τB difference
	13. Ac4ft-Miner—Action Rules
		13.1 Ac4ft-Miner and Related Notions
			13.1.1 Flexible and stable attributes
			13.1.2 Changes of Boolean attributes
			13.1.3 Action rules
			13.1.4 Ac4ft-quantifiers and truthfulness of action rules
			13.1.5 Relevant changes of Boolean attribute
			13.1.6 Ac4ft-Miner procedure input and output
		13.2 Applying Ac4ft-Miner in STULONG Dataset
			13.2.1 Two analytical questions–common features
			13.2.2 BMI and decreasing probability of high blood pressure
			13.2.3 Increasing probability of average blood pressure
		13.3 Applying Ac4ft-Miner in Hotel Dataset
			13.3.1 Increasing guest satisfaction
			13.3.2 Increasing guest satisfaction and consequences
	14. GUHA Procedures and Business Intelligence
		14.1 Business Intelligence and Self Service BI
		14.2 Comparing Analysis Performed by Self Service BI and GUHA
		14.3 Scenarios of Complementary Usage of BI and GUHA
			14.3.1 Gaining insight into specific (interesting) part of the dataset
			14.3.2 Automatic BI analysis using GUHA data mining
		14.4 Examples on Accidents Dataset
			14.4.1 Automatic BI analysis using GUHA data mining
			14.4.2 Gaining insight into specific parts of dataset
		14.5 Possible Extension of the Work
	15. CleverMiner—GUHA and Python
		15.1 Why GUHA in Python
		15.2 Goals of Python Implementation of GUHA
		15.3 Data Requirements and Representation on Analyzed Data Matrix
			15.3.1 Requirements on input matrix and how to achieve it
			15.3.2 Internal representation of data matrix in CleverMiner
		15.4 CleverMiner Procedures
			15.4.1 General parameters and calling
			15.4.2 Quantifiers for individual GUHA procedures
		15.5 Calling CleverMiner Procedures
		15.6 Future Plans with CleverMiner
Section III: Related Research and Theory
	16. Artificial Data Generation and LM ReverseMiner Module
		16.1 Evolutionary Approach
		16.2 Evolutionary Operations
			16.2.1 Evolutionary Fitness
		16.3 ReverseMiner Module
			16.3.1 Evolutionary Task Definition
			16.3.2 Evolutionary process
			16.3.3 Repeatability of evolution
		16.4 Evolution Helpers
		16.5 Artificial Data Hotel
			16.5.1 Data Specifications
			16.5.2 Requirements Checklist
			16.5.3 Evolution setup
			16.5.4 Data Generation
			16.5.5 Experiences
			16.5.6 Advantages and limitations
	17. Applying Domain Knowledge
		17.1 Expert Deduction Rules and Association Rules
			17.1.1 Informal considerations
			17.1.2 Applying expert deduction rules to association rules
		17.2 Items of Domain Knowledge and Association Rules
			17.2.1 BMI ↑↑ Diastolic—principle of application
			17.2.2 Applying 4ft-Miner
			17.2.3 Atomic consequences of BMI ↑↑ Diastolic
			17.2.4 Logical consequences of atomic consequence
			17.2.5 Consequences of BMI ↑↑ Diastolic
			17.2.6 Interpreting results of 4ft-Miner
		17.3 Expert Deduction Rules and Histograms
			17.3.1 Expert deduction and histograms—considerations
			17.3.2 Applying expert deduction rules to histograms
	18. Observational Calculi
		18.1 Definition and Overview of Results
			18.1.1 Logical calculus of association rules
			18.1.2 Classes of association rules
			18.1.3 Missing information in calculus of association rules
			18.1.4 Deduction rules in calculus of association rules
			18.1.5 Logical calculus of histograms
			18.1.6 Research challenges to observational calculi
		18.2 Expert Deduction Rules
			18.2.1 Informally on expert deduction and association rules
			18.2.2 Expert deduction rules for association rules
			18.2.3 Results on expert deduction rules for association rules
			18.2.4 Open problems and challenges
References
Index