Introduction to Data Mining

Data mining is a collection of techniques used to find previously undiscovered patterns in large volumes of data. It is a process of mining, or discovering, new information. It is used in conjunction with data warehousing to support certain types of decisions, and it is applied to operational databases that hold individual transactions.

Data mining involves the use of sophisticated data analysis tools to discover previously unknown, valid patterns and actionable information in vast amounts of data, and the use of that information to make crucial business decisions. It is a powerful tool that allows end users to access and manipulate data directly within a data warehousing environment without the need for any other tool.

In simple words, data mining is a step-by-step process in which we first extract a large amount of data from a database and then refine it in a way that leads to the discovery of new facts. In short, data mining is a process that turns data into knowledge.

1. Data Mining Process as a Part of Knowledge Discovery Process

Data mining helps to determine previously unknown patterns and actionable information in large databases. The data mining process therefore helps in discovering new facts from data, which is also called knowledge discovery in databases (KDD). KDD is a six-step process. The steps are as follows:

Data selection : In this step, a large amount of data is extracted from the database and categorized to identify the target datasets and relevant attributes.
Data cleansing or data pre-processing : As the name suggests, in this step all corrupted and unwanted data is removed to avoid inconsistencies.
Data enrichment or data transformation : In this step, you can add additional information, transform fields (for example, you extract a field “Name” that is 40 characters long but you want it to be 25 characters long, so it needs to be transformed), or combine existing fields to generate new fields. In simple words, here you make the data suitable for your purpose. This data is used as input for data mining.
Data mining : In this step, the process tries to explore new, previously unknown patterns and presents them to the end user in an understandable form.
Human interpretation : In this step, human involvement is required to study the patterns produced by the previous step.
Knowledge discovery : It is the last step, in which the actual knowledge is discovered.

During the data mining process you can undo any of its steps and use new knowledge or data to redo the step. So, the data mining process also tries to optimize the data processing. The data mining process is shown in Figure 13.2.
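The selection, cleansing, transformation, and mining steps described above can be sketched in a few lines of Python. This is only an illustrative sketch: the records, the field names, and the functions `select`, `cleanse`, `transform`, and `mine` are hypothetical, and the "pattern" found here is simply the most frequent city/item pair rather than the output of a real mining algorithm.

```python
from collections import Counter

# Hypothetical raw records standing in for rows extracted from a database.
RAW_RECORDS = [
    {"name": "Alice Johnson", "city": "Pune", "item": "shirt", "amount": 40.0},
    {"name": "Bob Singh", "city": "Delhi", "item": "trouser", "amount": None},
    {"name": "A very long customer name that exceeds the limit",
     "city": "Pune", "item": "shirt", "amount": 55.0},
    {"name": "Carol Mehta", "city": "Delhi", "item": "shirt", "amount": -5.0},
    {"name": "Dev Rao", "city": "Pune", "item": "trouser", "amount": 30.0},
]


def select(records, attributes):
    """Data selection: keep only the relevant attributes of each record."""
    return [{key: row[key] for key in attributes} for row in records]


def cleanse(records):
    """Data cleansing: drop rows whose amount is missing or not positive."""
    return [row for row in records
            if row["amount"] is not None and row["amount"] > 0]


def transform(records, name_width=25):
    """Data transformation: shorten names and combine two fields into one."""
    result = []
    for row in records:
        new_row = dict(row)
        new_row["name"] = row["name"][:name_width]
        new_row["city_item"] = f'{row["city"]}:{row["item"]}'
        result.append(new_row)
    return result


def mine(records):
    """Data mining (toy version): find the most frequent city/item pair."""
    counts = Counter(row["city_item"] for row in records)
    return counts.most_common(1)[0]


selected = select(RAW_RECORDS, ["name", "city", "item", "amount"])
cleaned = cleanse(selected)
prepared = transform(cleaned)
pattern = mine(prepared)
print(pattern)
```

Human interpretation and knowledge discovery are not automated in this sketch; a person would still have to decide whether the reported pair is meaningful.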

2. Goals of Data Mining

The goals of data mining fall into the following classes:

Prediction : Data mining helps in predicting how certain data attributes will behave in the future, which is very helpful in complex data scenarios. Take the example of a sales department: by analyzing buying transactions, it can predict the behaviour and buying capacity of customers. Based on this analysis and prediction, it can organize its sales activity by area, customer needs, customers' standard of living, and so on, to increase sales. In a scientific context, certain seismic wave patterns may predict an earthquake with high probability.

Identification : One major goal of data mining is to identify the existence of an item, an event, or an activity on the basis of analysis of different data. For example, scientists are trying to identify life on Mars by analyzing different soil patterns. Authentication is also a form of identification.
Classification : Data mining is helpful in classifying data into different categories on the basis of certain parameters. For example, the HR department of a company can grade employees on the basis of performance parameters such as technical knowledge, functional knowledge, and discipline.
Optimization : The last and most important goal of data mining is to optimize the use of limited resources such as time, cost, space, manpower, and machine power so as to improve outputs, for example higher profits, increased sales, and reduced expenditure.
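As a small illustration of the prediction goal, the sketch below fits a straight line to past sales by ordinary least squares and extrapolates one month ahead. The monthly figures and the function name `fit_line` are hypothetical, and a linear trend is only one of many possible prediction models, chosen here for brevity.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical sales (in thousands) for months 1 to 5.
MONTHS = [1, 2, 3, 4, 5]
SALES = [10, 13, 14, 17, 18]

slope, intercept = fit_line(MONTHS, SALES)
forecast = intercept + slope * 6  # extrapolate to month 6
print(round(forecast, 2))
```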

3. Elements of Data Mining

Data mining consists of five major elements:

Extract, transform, and load transaction data onto the data warehouse system.
Store and manage the data in a multi-dimensional database system.
Provide data access to business analysts and information technology professionals.
Analyze the data by application software.
Present the data in a useful format, such as graphs or tables.
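The elements above can be illustrated with a toy example, assuming an in-memory SQLite database as a stand-in for the data warehouse. SQLite is a relational database, not a multi-dimensional one, so this is only a rough sketch; the table name `sales`, the sample rows, and the helper functions are all hypothetical.

```python
import sqlite3

# Hypothetical transaction rows: (store, product, quantity, unit price),
# with the numeric fields still stored as text, as they might be in a raw export.
TRANSACTIONS = [
    ("north", "shirt", "2", "20.00"),
    ("north", "trouser", "1", "35.50"),
    ("south", "shirt", "3", "20.00"),
    ("south", "shirt", "1", "20.00"),
]


def load_warehouse(rows):
    """Transform the raw text fields and load them into an in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales "
        "(store TEXT, product TEXT, quantity INTEGER, revenue REAL)"
    )
    transformed = [
        (store, product, int(qty), int(qty) * float(price))
        for store, product, qty, price in rows
    ]
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", transformed)
    conn.commit()
    return conn


def revenue_by_store(conn):
    """Analyze: total revenue per store."""
    cursor = conn.execute(
        "SELECT store, SUM(revenue) FROM sales GROUP BY store ORDER BY store"
    )
    return cursor.fetchall()


def format_table(rows):
    """Present: render the result as a simple text table."""
    lines = [f"{'store':<8}{'revenue':>10}"]
    lines += [f"{store:<8}{revenue:>10.2f}" for store, revenue in rows]
    return "\n".join(lines)


conn = load_warehouse(TRANSACTIONS)
summary = revenue_by_store(conn)
print(format_table(summary))
```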

4. Types of Knowledge Discovered during Data Mining

The data mining process results in the discovery of new knowledge. The discovered knowledge can be categorized into the following forms:

Association rules : Data can be mined to find relations or associations. Association rules correlate the presence of a set of items with another range of values for another set of variables, e.g. (a) If a customer buys a shirt, he or she may also buy matching trousers. (b) If a customer buys a PC, he or she may also buy some CDs.
Classification hierarchies : The classification process is used to create a hierarchy of classes on the basis of an existing set of events, e.g. (a) Customers may be divided into several ranges of creditworthiness based on the history of their previous credit transactions. (b) Stocks may be given priority in the stock market on the basis of their past performance, such as growth, income, and stability. (c) Employees can be graded according to their past performance.
Sequential patterns : Data can be mined to anticipate behaviour from a sequence of events, e.g. if a patient underwent cardiac bypass surgery for blocked arteries and an aneurysm and later developed high blood urea within a year of surgery, then the patient is likely to suffer from kidney failure within the next 17 months.

Pattern within time series : If we analyze data taken at regular intervals, we may find similarities within positions of a time series of data, e.g. sales of woollen clothes increase in winter.
Clustering : We can group data items together according to logical relationships; such groups are known as clusters, e.g. we can categorize websites into groups from “most likely to access” to “least likely to access”. Records within a cluster are similar to each other, while different clusters are dissimilar.
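To make the idea of an association rule concrete, the sketch below computes two measures commonly used to evaluate such rules: support (the fraction of transactions that contain an itemset) and confidence (how often the consequent appears in transactions that contain the antecedent). The baskets and function names are hypothetical and only loosely mirror the shirt/trouser and PC/CD examples above.

```python
# Hypothetical market baskets; each set is one customer transaction.
BASKETS = [
    {"shirt", "trouser"},
    {"shirt", "trouser", "tie"},
    {"shirt"},
    {"pc", "cd"},
    {"pc", "cd", "shirt"},
]


def count_containing(items, baskets):
    """Number of baskets that contain every item in `items`."""
    wanted = set(items)
    return sum(1 for basket in baskets if wanted <= basket)


def support(items, baskets):
    """Fraction of baskets that contain the whole itemset."""
    return count_containing(items, baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Among baskets with the antecedent, the fraction that also has the consequent."""
    both = count_containing(set(antecedent) | set(consequent), baskets)
    base = count_containing(antecedent, baskets)
    return both / base if base else 0.0


print(support({"shirt", "trouser"}, BASKETS))
print(confidence({"shirt"}, {"trouser"}, BASKETS))
print(confidence({"pc"}, {"cd"}, BASKETS))
```

In this toy data the rule "PC implies CD" holds in every basket that contains a PC, while "shirt implies trouser" holds in only half of the baskets that contain a shirt.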

5. Models of Data Mining

The three proposed models of data mining are as follows:

CRISP-DM (Cross-Industry Standard Process for Data Mining) : It was proposed in the mid-1990s by a European consortium of companies to serve as a non-proprietary standard process model for data mining. The sequence of steps of data mining in this model is shown in Figure 13.6.

Six-sigma methodology : It is a well-structured, data-driven methodology for eliminating defects, waste, and quality control problems in all kinds of business. The sequence of steps of data mining in this model is shown in Figure 13.7.

SEMMA (Sample, Explore, Modify, Model, Assess) : It is somewhat similar to the six-sigma methodology. It was proposed by the SAS Institute and focuses more on the technical activities involved in data mining. The sequence of steps of data mining in this model is shown in Figure 13.8.

6. Data Mining Techniques

The most commonly used data mining techniques are as follows:

Genetic algorithms : Optimization techniques that use processes such as genetic mutation, combination, and natural selection in a design based on the concepts of evolution.
Artificial neural networks : Non-linear predictive models that learn through training and resemble biological neural networks in structure.
Rule induction : The extraction of useful if-then rules from data based on statistical significance.
Decision trees : Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset.
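The simplest possible decision tree has a single split, often called a decision stump, and it already yields an if-then rule of the kind mentioned under rule induction. The sketch below searches for the threshold on one numeric attribute that misclassifies the fewest training samples. The scores, labels, and function names are hypothetical, and real decision-tree learners typically use criteria such as information gain or Gini impurity over many attributes rather than a raw error count on one.

```python
# Hypothetical training data: (performance score, grade label).
SAMPLES = [
    (35, "low"), (42, "low"), (50, "low"),
    (61, "high"), (70, "high"), (88, "high"),
]


def predict(score, threshold):
    """Apply the rule: IF score > threshold THEN 'high' ELSE 'low'."""
    return "high" if score > threshold else "low"


def fit_stump(samples):
    """Return the (threshold, errors) pair with the fewest misclassifications."""
    values = sorted({value for value, _ in samples})
    best_threshold, best_errors = None, len(samples) + 1
    # Candidate thresholds are the midpoints between adjacent distinct values.
    for left, right in zip(values, values[1:]):
        threshold = (left + right) / 2
        errors = sum(
            1 for value, label in samples
            if predict(value, threshold) != label
        )
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold, best_errors


threshold, errors = fit_stump(SAMPLES)
print(f"IF score > {threshold} THEN high ELSE low ({errors} training errors)")
```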

7. Data Mining Tools

Various data mining tools are as follows:

Text term searching tools.
Sequence similarity searching tools.
Sequence submission tools.
Computer assisted passenger prescreening system (CAPPS II).
Terrorism information awareness program (TIA).

Query managers and report writers.
Multidimensional database tools.
Exploration and discovery tools.

8. Data Mining Applications

Data mining technologies can be applied in many domains. Some of them are as follows:

1. Retail/Marketing:

Identify buying patterns of customers.
Market basket analysis.
Predict response of mailing campaigns.
Find association among customer demographic characteristics.
Design of catalogs, store layouts and advertising campaigns.

2. Banking/Finance:

Detect patterns of fraudulent credit card use.
Predict customers likely to change their credit card affiliation.
Performance analysis of finance investments.
Identify hidden correlations between different financial indicators.
Identify stock trading rules from historical market data.

3. Insurance:

Claims analysis, i.e., which medical procedures are claimed together.
Identify behaviour patterns of risky customers.
Identify fraudulent behaviours.
Identify customers who are likely to buy new insurance policies.

4. Health care:

Identify the side effects of drugs.
Identify the effectiveness of a particular medical treatment.
Identify the level of experience a doctor needs to handle patients.
Optimization of processes within a hospital.
Discovering patterns in radiological images.

5. Transportation:

Analyze loading patterns.
Determine the distribution schedules among outlets.
Identify shortest routes.

6. Manufacturing:

Optimization of resources like machines, manpower, energy, space, time etc.
Identify the methods to reduce cost of manufacturing and reduce wastage.
Identify the cause of production problems like machine failure.
Optimal design of manufacturing processes.

9. Advantages of Data Mining

The advantages of data mining are as follows:

Marketing/Retailing : Data mining provides marketers and retailers with useful and accurate trends in their customers' purchasing behaviour.
Banking/Finance : Data mining can assist financial institutions in areas such as credit reporting and loan information.
Law enforcement : Data mining helps law enforcement agencies identify and capture criminal suspects by examining trends in location, crime type, and so on.
Research : It helps researchers by speeding up their data analysis, which lets them do more work in a limited time.

10. Disadvantages of Data Mining

Privacy issues : With the widespread use of technology, personal privacy has always been a major concern. It is possible that a fraudulent company may sell its customers' information.
Security issues : Data mining gives different users direct access to the database, which may cause leakage of secured data.
Inaccurate information : Data mining is not 100% accurate. Its results may contain inaccurate information that leads to inconsistency.

11. Scope of Improvement in Data Mining

Scaling up for high-dimensional data or high-speed streams.
Developing a unifying theory of data mining.
Mining complex knowledge from complex data.
Mining sequence data and time series data.
Data mining in a network setting.
Distributed data mining and mining multi-agent data.
Security, privacy and data integrity.
Data mining for biological and environmental problems.
Dealing with non-static, unbalanced and cost-sensitive data.
Data mining process related problems.

Source: Gupta Satinder Bal, Mittal Aditya (2017), Introduction to Basic Database Management System, 2nd Edition-University Science Press (2017)