Numerosity Reduction in Data Mining
The data reduction process shrinks a dataset into a more manageable form that remains suitable for analysis while maintaining data integrity: the goal is to reduce the volume of data without losing essential information. There are various techniques for data reduction, with two primary methods being Dimensionality Reduction and Numerosity Reduction.
Types of Numerosity Reduction
Numerosity reduction decreases data volume by using alternative, more compact forms of data representation. There are two main types of numerosity reduction:
1. Parametric
This method assumes that the data fits a specific model. The parameters of the model are estimated, stored, and used to represent the data, while the rest is discarded. Common techniques include Regression and Log-Linear Models:
- Regression: Regression models establish a relationship between attributes (a sketch of this appears after this list).
  - Simple Linear Regression: Models the relationship between one dependent variable y and one independent variable x, represented by y = wx + b, where w and b are the regression coefficients.
  - Multiple Linear Regression: Extends this to multiple independent variables, expressing y as a function of several predictors.
- Log-Linear Model: Estimates the probability of data points in a multidimensional space based on subsets of discretized attributes. It identifies relationships between two or more discrete attributes, helping to derive probabilities in high-dimensional data from lower-dimensional components (see the log-linear sketch after this list).
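To illustrate the parametric idea with regression, here is a minimal sketch, assuming NumPy and an invented set of (x, y) pairs: the model is fit once, and only the two coefficients w and b are kept in place of the raw values.

```python
import numpy as np

# Hypothetical raw data: 10,000 (x, y) pairs that roughly follow a line.
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=10_000)
y = 3.2 * x + 7.5 + rng.normal(0, 2.0, size=10_000)

# Fit y = w*x + b by least squares and keep only the two parameters.
w, b = np.polyfit(x, y, deg=1)
print(f"stored parameters: w={w:.3f}, b={b:.3f}")

# The 10,000 pairs can now be discarded; values are approximated from the
# model whenever they are needed.
y_estimate = w * 42.0 + b
```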
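For the log-linear model, a minimal sketch of the simplest case, an independence model over two invented discretized attributes: only the lower-dimensional marginal probabilities are stored, and the full joint table is approximated from them.

```python
import numpy as np

# Hypothetical contingency table of two discretized attributes:
# rows = age group (3 bins), columns = income bracket (4 bins).
counts = np.array([[120,  80,  40,  10],
                   [200, 150,  90,  30],
                   [ 60,  70,  50,  20]])
n = counts.sum()

# Keep only the one-dimensional marginals (3 + 4 numbers instead of 12 cells).
p_age = counts.sum(axis=1) / n      # P(age group)
p_income = counts.sum(axis=0) / n   # P(income bracket)

# Independence (log-linear) model: log P(a, i) = log P(a) + log P(i),
# so each joint probability is approximated by the product of the marginals.
p_joint_model = np.outer(p_age, p_income)
print(p_joint_model.round(3))
```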
2. Non-Parametric
Non-parametric numerosity reduction techniques do not rely on any predefined model. These methods provide consistent reduction across different data sizes but may not achieve as much reduction as parametric methods. The key non-parametric techniques include Histograms, Clustering, Sampling, Data Cube Aggregation, and Data Compression:
- Histograms: Represent data by its frequency distribution using bins. A histogram divides the data into disjoint subsets (buckets) and approximates its distribution. If each bucket contains only a single attribute-value pair or frequency, the buckets are called singleton buckets (a sketch appears after this list).
- Clustering: Groups data objects (tuples) into clusters so that objects in the same cluster are similar and objects in different clusters are dissimilar. Similarity is measured using a distance function (see the clustering sketch after this list).
  - Cluster quality can be evaluated by diameter (the maximum distance between any two objects in the cluster) or centroid distance (the average distance of objects from the cluster's centroid).
- Sampling: Reduces data volume by selecting a smaller subset of the data that represents the entire dataset (see the sampling sketch after this list). Common sampling methods include:
  - Simple Random Sample Without Replacement
  - Simple Random Sample With Replacement
  - Cluster Sampling
  - Stratified Sampling
- Data Cube Aggregation: Summarizes data by reducing it to fewer dimensions through aggregation, retaining the information needed for analysis. This multidimensional aggregation organizes data in a data cube, which facilitates efficient storage and faster aggregation operations (see the aggregation sketch after this list).
- Data Compression: Reduces data size by encoding, modifying, or restructuring it to occupy less space (see the compression sketch after this list).
  - Lossless Compression: Data can be fully restored to its original form.
  - Lossy Compression: Data cannot be perfectly restored; some detail is sacrificed for higher compression.
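Histogram sketch: a minimal example with NumPy, assuming a single numeric attribute and equal-width buckets; the values and bucket count are invented. Only the bucket boundaries and frequencies are kept.

```python
import numpy as np

# Hypothetical attribute values (e.g., item prices).
prices = np.array([1, 1, 5, 5, 5, 8, 8, 10, 10, 10, 12, 14, 14, 15, 15,
                   18, 18, 18, 18, 20, 20, 20, 21, 21, 25, 25, 25, 28, 30])

# Approximate the distribution with 4 equal-width buckets; the individual
# values can then be discarded.
frequencies, edges = np.histogram(prices, bins=4)

for count, lo, hi in zip(frequencies, edges[:-1], edges[1:]):
    print(f"bucket {lo:.1f} to {hi:.1f}: {count} values")
```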
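Clustering sketch: a minimal example assuming 2-D numeric data and k-means from scikit-learn (one possible clustering algorithm); the tuples are replaced by three centroids, and the quality measures mentioned above are computed per cluster.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans

# Hypothetical 2-D data: 3,000 tuples drawn around three invented centers.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(loc=c, scale=1.0, size=(1000, 2))
                  for c in ([0, 0], [10, 0], [5, 8])])

# Replace the 3,000 tuples with 3 cluster representatives (the centroids).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)

for k, centroid in enumerate(kmeans.cluster_centers_):
    members = data[kmeans.labels_ == k]
    diameter = pdist(members).max()  # maximum pairwise distance in the cluster
    avg_centroid_dist = np.linalg.norm(members - centroid, axis=1).mean()
    print(f"cluster {k}: diameter={diameter:.2f}, "
          f"avg centroid distance={avg_centroid_dist:.2f}")
```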
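Sampling sketch: a minimal example with pandas showing simple random sampling with and without replacement, plus a stratified sample; the DataFrame and its column names are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: 100,000 customer records.
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=100_000),
    "region": rng.choice(["north", "south", "east", "west"], size=100_000),
})

# Simple random sample without replacement (1% of the rows).
srswor = df.sample(frac=0.01, replace=False, random_state=0)

# Simple random sample with replacement.
srswr = df.sample(frac=0.01, replace=True, random_state=0)

# Stratified sample: 1% from each region, preserving the regional mix.
stratified = (df.groupby("region", group_keys=False)
                .sample(frac=0.01, random_state=0))

print(len(srswor), len(srswr), len(stratified))
```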
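Data cube aggregation sketch: a minimal example with pandas that rolls invented quarterly sales up to yearly totals per branch, so only the aggregated view needs to be kept for a yearly analysis.

```python
import pandas as pd

# Hypothetical quarterly sales per branch.
sales = pd.DataFrame({
    "branch":  ["A", "A", "A", "A", "B", "B", "B", "B"],
    "year":    [2023] * 8,
    "quarter": ["Q1", "Q2", "Q3", "Q4"] * 2,
    "amount":  [224, 408, 350, 586, 190, 310, 270, 402],
})

# Aggregate away the quarter dimension: one row per (branch, year)
# instead of four, which is all a yearly analysis needs.
yearly = sales.groupby(["branch", "year"], as_index=False)["amount"].sum()
print(yearly)
```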
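Data compression sketch: a minimal example of lossless compression using Python's standard zlib module on invented, highly repetitive text; the round trip restores the original bytes exactly, whereas a lossy scheme would trade that guarantee for a smaller size.

```python
import zlib

# Hypothetical text data with plenty of repetition.
raw = ("timestamp,sensor,value\n" + "2024-01-01,temp,21.5\n" * 5_000).encode()

compressed = zlib.compress(raw, level=9)
restored = zlib.decompress(compressed)

print(f"original: {len(raw)} bytes, compressed: {len(compressed)} bytes")
assert restored == raw  # lossless: the data is fully recoverable
```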
Difference Between Numerosity Reduction and Dimensionality Reduction
Data reduction primarily involves two approaches: Dimensionality Reduction and Numerosity Reduction.
- Dimensionality Reduction:
  - Focuses on reducing the number of attributes or features in the dataset.
  - Achieved through data encoding or transformation techniques to create a simplified or "compressed" representation of the data.
  - Data reduction can be:
    - Lossless: The original data is perfectly reconstructed from the compressed representation.
    - Lossy: The reconstructed data approximates the original, potentially losing some details while retaining essential information.
- Numerosity Reduction:
  - Aims to reduce the volume of data by summarizing or representing it in a compact form, such as models, clusters, or samples.
  - Retains all original attributes but compresses the data through abstraction, leading to a smaller yet representative dataset.
  - Unlike dimensionality reduction, it does not focus on reducing the number of features but instead simplifies the data representation (a small contrast sketch follows this list).
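To make the contrast concrete, here is a small pandas sketch with invented column names: dropping attributes reduces the dataset's dimensionality, while a technique such as sampling keeps every attribute but represents the data with fewer tuples.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
df = pd.DataFrame(rng.normal(size=(10_000, 5)),
                  columns=["a", "b", "c", "d", "e"])

# Dimensionality reduction: fewer attributes, same number of tuples.
fewer_columns = df[["a", "b"]]                      # 10,000 rows x 2 columns

# Numerosity reduction: all attributes, fewer (representative) tuples.
fewer_rows = df.sample(frac=0.05, random_state=0)   # 500 rows x 5 columns

print(fewer_columns.shape, fewer_rows.shape)
```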