Tasks and Functionalities of Data Mining

Samundeeswari


Data mining tasks are designed to be semi-automatic or fully automatic processes applied to large datasets to uncover patterns such as groups or clusters, anomalies (outliers), and dependencies like associations or sequential patterns. Once these patterns are discovered, they serve as summaries of the input data and can be further analyzed using machine learning or predictive analytics. For instance, data mining can identify multiple groups within a dataset, which can then be used by a decision support system. However, it's important to note that tasks like data collection, preparation, and reporting are not considered part of data mining.

There is often confusion between data mining and data analysis. Data mining focuses on identifying trends, correlations, or patterns using machine learning, mathematical models, and statistical techniques. On the other hand, data analysis is concerned with testing and validating statistical models for specific purposes, such as evaluating the success of a marketing campaign.

Data mining activities can generally be categorized into two main types:

  1. Descriptive Data Mining:
    This type aims to understand the current state of the data without any prior assumptions. It highlights common features or characteristics within the dataset, such as counts, averages, or other summary statistics.

  2. Predictive Data Mining:
    This type involves using historical or labeled data to predict future outcomes or trends. Predictive data mining leverages patterns within the data to forecast key metrics or events. For example, it could predict next quarter's business performance based on trends observed in previous quarters or determine if a patient is likely suffering from a particular disease based on medical examination results.


1. Class/Concept Descriptions

A class or concept refers to a dataset or group of features that define a specific category or idea. For example:

  • A class could represent categories like items on a store's shelf.
  • A concept might represent ideas such as products marked for clearance sale versus regular items.

Classes help in grouping data, while concepts aid in differentiation.

  • Data Characterization:
    Summarizes general features of a class and generates specific rules defining it. A technique called Attribute-Oriented Induction is used to achieve this.

  • Data Discrimination:
    Separates data into distinct groups based on differences in attribute values. It compares features of one class with others and often uses visualizations like bar charts, curves, or pie charts.
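Both descriptive operations above can be sketched in a few lines of Python. The product records, class labels, and price attribute here are invented for illustration; real Attribute-Oriented Induction would additionally generalize attribute values up a concept hierarchy.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (class label, price). Characterization summarizes
# each class; discrimination compares one class's summary against another's.
items = [
    ("clearance", 4.99), ("clearance", 2.50), ("clearance", 3.25),
    ("regular", 12.00), ("regular", 9.75), ("regular", 14.50),
]

def characterize(records):
    """Return count and average price per class label."""
    by_class = defaultdict(list)
    for label, price in records:
        by_class[label].append(price)
    return {label: {"count": len(prices), "avg_price": round(mean(prices), 2)}
            for label, prices in by_class.items()}

summary = characterize(items)
# Discrimination: contrast the two summaries, e.g. clearance items
# average 3.58 while regular items average 12.08.
```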

2. Mining Frequent Patterns

This involves identifying commonly occurring patterns within a dataset. These patterns reveal frequently co-occurring items or sequences.

  • Frequent Item Sets:
    Groups of items that are often found together, such as milk and sugar purchased together in a store.

  • Frequent Subsequences:
    Regular patterns in sequences, such as a customer buying a phone and then purchasing a phone case afterward.

  • Frequent Substructures:
    Data structures like trees or graphs that frequently appear within the dataset, often in combination with item sets or subsequences.
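A minimal sketch of frequent item-set mining, counting how often each pair of items co-occurs across transactions. The basket data is invented, and real miners such as Apriori prune the search over larger item sets rather than enumerating pairs directly:

```python
from itertools import combinations
from collections import Counter

# Hypothetical market-basket data; each set is one transaction.
transactions = [
    {"milk", "sugar", "bread"},
    {"milk", "sugar"},
    {"bread", "butter"},
    {"milk", "sugar", "butter"},
]

def frequent_pairs(baskets, min_support):
    """Return item pairs whose support (fraction of baskets) meets min_support."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    n = len(baskets)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

pairs = frequent_pairs(transactions, 0.5)
# ("milk", "sugar") appears in 3 of 4 baskets, so its support is 0.75.
```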

3. Association Analysis

Association analysis examines groups of items that frequently occur together in a transactional dataset. This method is often referred to as Market Basket Analysis because of its common use in retail sales to understand purchasing behavior.

Two key parameters are used to determine association rules:

  • Support: Indicates how often a specific item set appears in the database, helping to identify commonly occurring item combinations.

  • Confidence: The conditional probability that a transaction containing one item set (the antecedent) also contains another (the consequent).

For example, if customers frequently buy bread and butter together, the association rule may suggest that purchasing bread increases the likelihood of buying butter.
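The bread-and-butter example can be made concrete by computing both parameters directly. The transactions below are invented; the formulas are the standard definitions of support and confidence:

```python
def support(baskets, itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

def confidence(baskets, antecedent, consequent):
    """P(consequent | antecedent) = support(A ∪ C) / support(A)."""
    return support(baskets, antecedent | consequent) / support(baskets, antecedent)

baskets = [
    {"bread", "butter"}, {"bread", "butter", "jam"},
    {"bread"}, {"butter", "milk"},
]
conf = confidence(baskets, {"bread"}, {"butter"})
# bread appears in 3 baskets, bread+butter in 2, so confidence is 2/3.
```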

4. Classification

Classification is a data mining technique used to organize items into predefined categories based on their properties. Methods such as if-then rules, decision trees, or neural networks are applied to predict the class of items.

  • A training set, containing items with known properties and categories, is used to train the system.
  • Once trained, the system can classify new, unknown items based on their attributes.
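The train-then-classify workflow can be illustrated with a nearest-neighbor classifier, one of the simplest methods that learns from a labeled training set. The feature vectors and class labels here are invented:

```python
def nearest_neighbor(train, query):
    """Classify query by the label of the closest training point (1-NN)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(train, key=lambda item: sq_dist(item[0], query))[1]

# Hypothetical training set: (features, known class).
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.0), "B")]

# A new, unknown item is assigned the class of its nearest neighbor.
label = nearest_neighbor(train, (0.9, 1.1))  # closest to the "A" points
```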

5. Prediction

Prediction involves forecasting missing data values or identifying future trends. Predictions are made by analyzing an object's attributes and comparing them to the attributes of known classes.

There are two main types of predictions:

  • Numeric Predictions:
    Predicts numerical values, often using linear regression models based on historical data. For example, predicting sales trends to help businesses prepare for future events.
  • Class Predictions:
    Predicts missing class labels for items. This uses a training dataset where the classes of similar items are already known.
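A numeric prediction of the kind described above can be sketched with ordinary least-squares regression. The quarterly sales figures are invented and deliberately follow a clean trend so the forecast is easy to verify:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Hypothetical quarterly sales: quarter index -> units sold.
quarters = [1, 2, 3, 4]
sales = [100, 110, 120, 130]
slope, intercept = fit_line(quarters, sales)
forecast = slope * 5 + intercept  # numeric prediction for quarter 5: 140
```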

6. Cluster Analysis

Cluster analysis is a data mining technique used in fields like image processing, pattern recognition, and bioinformatics. It groups similar data items together, but unlike classification, the classes or groups are not predefined.

  • Clusters are formed based on similarities and differences in data attributes.
  • Clustering algorithms analyze features to group data without requiring prior knowledge of class labels.

For example, clustering can group customers with similar purchasing habits or identify patterns in biological data.
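A compact sketch of k-means (Lloyd's algorithm) on one-dimensional data shows how groups emerge without predefined labels. The spending amounts and starting centers are invented:

```python
def kmeans(points, centers, rounds=10):
    """Lloyd's algorithm: assign points to the nearest center, then recenter."""
    clusters = [[] for _ in centers]
    for _ in range(rounds):
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical customer spending amounts with two natural groups.
spend = [10, 12, 11, 90, 95, 88]
centers, clusters = kmeans(spend, centers=[0.0, 100.0])
# The centers converge near 11 and 91, separating low and high spenders.
```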

7. Outlier Analysis

Outlier analysis helps assess the quality of a dataset by identifying data points that deviate significantly from the norm. Too many outliers can compromise the reliability of the data and hinder pattern discovery.

  • Purpose: To detect unusual or unexpected data points that may indicate anomalies requiring further investigation or corrective measures.
  • Application: Outliers that do not fit into any group or class are flagged by the algorithms for further review.
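One common way to flag such points is a z-score test: any value more than a chosen number of standard deviations from the mean is reported. The sensor readings and the injected anomaly below are invented:

```python
from statistics import mean, stdev

def outliers(values, threshold=2.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) / s > threshold]

readings = [10, 11, 9, 10, 12, 11, 50]  # 50 is the injected anomaly
flagged = outliers(readings)
```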

8. Evolution and Deviation Analysis

Evolution analysis focuses on studying datasets that change over time to uncover trends and patterns.

  • Purpose: To identify evolving trends and changes in data, aiding in tasks like classification, clustering, and discrimination of time-based data.
  • Application: Evolution models track data changes, helping businesses or researchers understand how certain attributes develop or deviate over time.
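One simple way to expose an evolving trend in time-based data is a moving average, which smooths short-term noise. The monthly metric below is invented:

```python
def moving_average(series, window):
    """Smooth a time series to expose its underlying trend."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

# Hypothetical monthly metric drifting upward with some noise.
monthly = [10, 12, 11, 14, 13, 16, 15, 18]
trend = moving_average(monthly, window=3)
# Each smoothed point is higher than the last, revealing the upward evolution.
```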

9. Correlation Analysis

Correlation analysis is a statistical method to measure the strength and direction of the relationship between two variables.

  • Purpose: To determine how closely two attributes are related, helping to uncover meaningful connections in the data.
  • Applications:
    • Used to analyze numerical relationships, such as sales and advertising budgets.
    • Identifies connections in data structures like trees or graphs.
    • Helps researchers discover whether variables in their study influence one another.
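The strength and direction of such a relationship is typically measured with the Pearson correlation coefficient, which ranges from -1 to +1. The advertising-budget and sales figures below are invented:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Hypothetical ad budget vs. sales: a strong positive relationship.
budget = [1, 2, 3, 4, 5]
sales = [12, 15, 19, 24, 30]
r = pearson(budget, sales)  # close to +1, indicating a strong positive link
```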