Redundancy and Correlation in Data Mining

Samundeeswari

Redundancy and Correlation

During data integration in data mining, data from multiple sources is combined, which often introduces redundancy. An attribute is considered redundant if it can be derived from a combination of other attributes. For instance, if a dataset contains 20 attributes and one of them can be calculated or inferred from a subset of the others, that attribute is redundant: it adds no new information to the dataset. Inconsistencies in naming conventions for attributes or dimensions can also contribute to redundancy, because the same information may end up stored under different labels.

Here’s a simple example of a table to illustrate redundancy in attributes (the values shown are illustrative):

  Attribute 1: ID | Attribute 2: First Name | Attribute 3: Last Name | Attribute 4: Full Name
  101             | John                    | Doe                    | John Doe
  102             | Jane                    | Smith                  | Jane Smith

Explanation:

  1. Redundant Attribute:

    • Attribute 4: Full Name is redundant because it can be derived by concatenating Attribute 2: First Name and Attribute 3: Last Name.
    • Since the information is already available in First Name and Last Name, storing Full Name creates unnecessary redundancy.
  2. Solution:

    • To reduce redundancy, remove Attribute 4: Full Name from the dataset and compute it dynamically when needed. This will save storage space and reduce the risk of inconsistencies in the data.
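
A minimal pandas sketch of this idea; the DataFrame and its column names are illustrative, not taken from any specific dataset:

```python
import pandas as pd

# Illustrative data: "Full Name" duplicates information already present
# in "First Name" and "Last Name".
people = pd.DataFrame({
    "ID": [101, 102],
    "First Name": ["John", "Jane"],
    "Last Name": ["Doe", "Smith"],
    "Full Name": ["John Doe", "Jane Smith"],
})

# Remove the redundant attribute before storing or integrating the data.
people = people.drop(columns=["Full Name"])

# Recompute the value dynamically only when it is actually needed.
full_name = people["First Name"] + " " + people["Last Name"]
print(full_name.tolist())  # ['John Doe', 'Jane Smith']
```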

Detection of data redundancy involves identifying attributes or data entries that are duplicated or derived from other data. This is crucial in data integration, as redundant data increases storage requirements, decreases efficiency, and can lead to inconsistencies.

Steps to Detect Data Redundancy

  1. Analyze Dependencies Among Attributes:

    • Use functional dependency analysis to check if one attribute can be derived from others.
    • Example: If Full Name is always a combination of First Name and Last Name, Full Name is redundant.
  2. Examine Correlations:

    • Statistical methods like correlation analysis can reveal whether attributes are highly correlated and therefore potentially redundant (see the code sketch after this list).
    • Example: Two attributes, Temperature in Celsius and Temperature in Fahrenheit, are directly correlated and one can be derived from the other.
  3. Identify Duplicate Entries:

    • Look for duplicate rows or records across datasets.
    • Example: Two rows with identical employee details but different IDs could indicate redundancy.
  4. Check for Overlapping Data:

    • When integrating multiple data sources, overlapping datasets may cause redundancy.
    • Example: Two datasets both contain customer information (one with a few additional fields), so the fields they share end up stored twice after integration.
  5. Analyze Naming and Format Consistency:

    • Inconsistent naming of attributes (e.g., DOB vs. Date of Birth) might lead to storing the same information under different labels.
    • Example: Revenue in one dataset and Sales Income in another might refer to the same attribute.
  6. Use Automated Tools:

    • Employ data profiling and integration tools to detect redundancies.
    • Tools like Talend, Informatica, and Apache NiFi can help identify and resolve redundancies in large datasets.
  7. Inspect Derived Attributes:

    • Identify attributes that are calculations or transformations of other attributes.
    • Example: Profit can be derived from Revenue and Cost.
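
Several of these checks can be sketched with pandas: duplicate rows (step 3), strongly correlated numeric columns (step 2), and derived attributes such as Profit (step 7). The column names and values below are assumptions chosen to echo the examples above, not a prescribed schema:

```python
import pandas as pd

# Illustrative integrated dataset; column names and values are assumptions.
df = pd.DataFrame({
    "Temp_C": [10.0, 20.0, 30.0, 20.0],
    "Temp_F": [50.0, 68.0, 86.0, 68.0],
    "Revenue": [100.0, 250.0, 400.0, 250.0],
    "Cost": [60.0, 150.0, 220.0, 150.0],
    "Profit": [40.0, 100.0, 180.0, 100.0],
})

# Step 3: duplicated rows often signal redundant records from overlapping sources.
print("Duplicate rows:\n", df[df.duplicated()])

# Step 2: a correlation matrix highlights attributes that move together;
# |r| close to 1 (here Temp_C vs. Temp_F) marks candidates for removal.
print("Correlation matrix:\n", df.corr().round(2))

# Step 7: check whether an attribute is exactly derivable from others.
print("Profit = Revenue - Cost:", (df["Profit"] == df["Revenue"] - df["Cost"]).all())
```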

The Correlation Coefficient for Numerical Data

The correlation coefficient is a statistical measure used to quantify the degree to which two numeric variables are linearly related. It indicates both the strength and direction of the relationship. For numeric data, the most commonly used correlation coefficient is Pearson’s correlation coefficient (denoted as r).


Key Characteristics of the Correlation Coefficient:

  1. Range:

    • The correlation coefficient r ranges from -1 to 1:
      • r = 1: Perfect positive linear relationship.
      • r = -1: Perfect negative linear relationship.
      • r = 0: No linear relationship.
  2. Direction:

    • Positive Correlation (r > 0):
      • As one variable increases, the other variable tends to increase.
      • Example: Height and weight.
    • Negative Correlation (r < 0):
      • As one variable increases, the other variable tends to decrease.
      • Example: Speed and travel time.
  3. Strength:

    • The closer r is to 1 or -1, the stronger the linear relationship.
    • r ≈ 0 indicates a weak or no linear relationship.
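
A quick illustration of direction and strength using NumPy's corrcoef; the data points below are invented solely to show the sign and rough magnitude of r:

```python
import numpy as np

height = np.array([150, 160, 170, 180, 190])        # cm
weight = np.array([55, 62, 70, 78, 85])             # kg, rises with height
speed = np.array([40, 60, 80, 100, 120])            # km/h
travel_time = np.array([3.0, 2.0, 1.5, 1.2, 1.0])   # hours, falls as speed rises

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson's r.
r_pos = np.corrcoef(height, weight)[0, 1]
r_neg = np.corrcoef(speed, travel_time)[0, 1]

print(f"height vs. weight: r = {r_pos:.2f}")  # near +1: strong positive correlation
print(f"speed vs. time:    r = {r_neg:.2f}")  # near -1: strong negative correlation
```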

Formula for Pearson’s Correlation Coefficient:

r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \cdot \sum (y_i - \bar{y})^2}}

Where:

  • x_i and y_i: Data points of variables X and Y.
  • \bar{x} and \bar{y}: Means of variables X and Y.

Steps to Calculate r:

  1. Compute the mean of X (\bar{x}) and the mean of Y (\bar{y}).
  2. For each pair of data points, calculate:
    • (x_i - \bar{x}) and (y_i - \bar{y}).
  3. Multiply these deviations for each pair and sum them up.
  4. Compute the sum of squares for each variable:
    • \sum (x_i - \bar{x})^2 and \sum (y_i - \bar{y})^2.
  5. Substitute the values into the formula.
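
The calculation can be mirrored in a short Python function that follows the steps above; the sample data at the end is only for illustration:

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient, computed step by step."""
    n = len(x)
    x_bar = sum(x) / n                                 # Step 1: mean of X
    y_bar = sum(y) / n                                 # Step 1: mean of Y
    dx = [xi - x_bar for xi in x]                      # Step 2: deviations of X
    dy = [yi - y_bar for yi in y]                      # Step 2: deviations of Y
    numerator = sum(a * b for a, b in zip(dx, dy))     # Step 3: sum of products
    ss_x = sum(a * a for a in dx)                      # Step 4: sum of squares of X
    ss_y = sum(b * b for b in dy)                      # Step 4: sum of squares of Y
    return numerator / math.sqrt(ss_x * ss_y)          # Step 5: substitute into the formula

# An exactly linear relationship (y = 2x) gives r = 1.
print(pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))    # 1.0
```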