Class Comparison Methods in Data Mining

Samundeeswari


In many situations, users might not want just a description of one class or group. Instead, they may want to compare one class (called the target class) with others (called contrasting classes) to find differences between them. This process is called class comparison.

For class comparison to work, the target and contrasting classes must be comparable: they need to share common attributes or dimensions. For instance, comparing "sales in 2003" with "sales in 2004" makes sense because both share dimensions such as location, time, and product. However, comparing unrelated classes, like "person," "address," and "item," is not meaningful because they don't share comparable attributes.

How Class Comparison Works

In previous discussions about class characterization, we focused on summarizing and describing a single class. Class comparison takes it a step further by comparing multiple classes side by side.

Example:

Imagine we want to compare sales data from 2003 and 2004. To do this effectively:

  1. Generalization of Attributes:
    Each class is generalized to the same level of abstraction. For example, for the location dimension, both the 2003 and the 2004 sales data are generalized to the same one of these levels:

    • City level (e.g., "Vancouver")
    • State/province level (e.g., "British Columbia")
    • Country level (e.g., "Canada")

    Comparing sales in Vancouver in 2003 with sales in the United States in 2004 wouldn't be helpful because the data is generalized at different levels (city vs. country). Synchronous generalization ensures consistency and makes comparisons meaningful.

  2. Flexibility for Users:
    While automated tools can synchronize the generalization process, users should have the option to adjust the levels of abstraction if they need a custom comparison.
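The idea of synchronous generalization can be sketched in Python. This is only a minimal illustration, not a reference implementation: the hierarchy tables, the helper names (generalize_location, generalize_class), and the sales figures are all made up for the example.

```python
# Hypothetical concept hierarchy for the location dimension:
# city -> state/province -> country.
CITY_TO_PROVINCE = {
    "Vancouver": "British Columbia",
    "Victoria": "British Columbia",
    "Seattle": "Washington",
}
PROVINCE_TO_COUNTRY = {
    "British Columbia": "Canada",
    "Washington": "United States",
}
LEVELS = ("city", "province", "country")


def generalize_location(city, level):
    """Climb the hierarchy from a city up to the requested level."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level}")
    if level == "city":
        return city
    province = CITY_TO_PROVINCE[city]
    if level == "province":
        return province
    return PROVINCE_TO_COUNTRY[province]


def generalize_class(rows, level):
    """Sum sales per generalized location for one class."""
    totals = {}
    for city, amount in rows:
        key = generalize_location(city, level)
        totals[key] = totals.get(key, 0) + amount
    return totals


# Invented figures, for illustration only.
sales_2003 = [("Vancouver", 100), ("Victoria", 40), ("Seattle", 80)]
sales_2004 = [("Vancouver", 120), ("Victoria", 50), ("Seattle", 70)]

level = "province"  # the SAME level is applied to both classes
target = generalize_class(sales_2003, level)
contrast = generalize_class(sales_2004, level)
```

Because both classes pass through the same level, their keys line up and can be compared directly. Changing level to "city" or "country" is how a user would adjust the abstraction for a custom comparison.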



The class comparison procedure consists of four steps:

1. Data Collection
Relevant data is gathered from the database or data warehouse using queries. The data is divided into a target class (the main group we want to analyze) and one or more contrasting classes (groups we want to compare it with).

2. Dimension Relevance Analysis
If there are many features or dimensions in the data, a relevance analysis is performed. This ensures only the most important dimensions are included in the analysis, making the comparisons more focused and meaningful.
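The text does not say which relevance measure is used. One option is information gain, which scores how well a dimension separates the classes; the sketch below assumes that measure, and the rows, attribute names, and class labels are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, attribute, class_key="cls"):
    """Entropy of the class labels minus the entropy left after
    splitting the rows on the given attribute."""
    labels = [row[class_key] for row in rows]
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[class_key])
    remainder = sum(len(g) / len(rows) * entropy(g)
                    for g in groups.values())
    return entropy(labels) - remainder


# Invented sample: "gpa" separates the classes perfectly, "major" not at all.
rows = [
    {"cls": "graduate", "gpa": "high", "major": "CS"},
    {"cls": "graduate", "gpa": "high", "major": "Humanities"},
    {"cls": "undergraduate", "gpa": "low", "major": "CS"},
    {"cls": "undergraduate", "gpa": "low", "major": "Humanities"},
]

gains = {attr: information_gain(rows, attr) for attr in ("gpa", "major")}
# Keep only dimensions whose gain exceeds a (hypothetical) threshold.
relevant = [attr for attr, gain in gains.items() if gain > 0.1]
```

Dimensions with a gain close to zero say little about class membership and can be dropped before generalization.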

3. Synchronous Generalization
Generalization is done on both the target and contrasting classes to bring their data to the same level of detail.

  • For the target class, this is controlled by a threshold set by the user or an expert.
  • The contrasting classes are generalized to the same level as the target class.

The result is a summarized and comparable dataset (called a "relation" or "cuboid") for each class.

4. Presentation of Comparison Results
The final comparison is shown using tables, charts, or rules. These visualizations often include a contrasting measure (like percentage differences) that highlights the distinctions between the target and contrasting classes. Users can refine the comparison by applying OLAP operations like drill-down (more detailed view) or roll-up (higher-level summary).
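As a rough sketch of a contrasting measure, the snippet below converts the generalized counts of each class into percentages and takes the difference per value. The count_percent helper and all figures are assumptions made for illustration.

```python
def count_percent(counts):
    """Express each generalized count as a percentage of its class total."""
    total = sum(counts.values())
    return {key: 100.0 * value / total for key, value in counts.items()}


# Invented generalized counts at the province level.
target_counts = {"British Columbia": 140, "Ontario": 60}
contrast_counts = {"British Columbia": 100, "Ontario": 100}

target_pct = count_percent(target_counts)
contrast_pct = count_percent(contrast_counts)

# Percentage-point difference between target and contrasting class.
difference = {key: target_pct[key] - contrast_pct.get(key, 0.0)
              for key in target_pct}
```

A positive difference marks a value that is over-represented in the target class, a negative one a value that is more typical of the contrasting class.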

Example: Comparing Graduate and Undergraduate Students

If the task is to compare graduate students (target class) with undergraduate students (contrasting class):

  • Data on both groups is collected (e.g., age, GPA, major).
  • Only the most relevant dimensions (e.g., GPA and age) are kept for analysis.
  • Both groups are summarized to the same level (e.g., average GPA by major).
  • The results are displayed as a table, chart, or a set of rules (e.g., "Graduate students have higher GPAs in Computer Science, while undergraduates have higher GPAs in Humanities").

This approach helps identify meaningful differences between the two groups.
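The student example could be sketched as follows. The records, the average_gpa_by_major helper, and the wording of the generated rules are hypothetical; only the overall flow (collect, summarize to the same level, compare) comes from the steps above.

```python
from statistics import mean


def average_gpa_by_major(students):
    """Summarize one class to the 'average GPA by major' level."""
    by_major = {}
    for major, gpa in students:
        by_major.setdefault(major, []).append(gpa)
    return {major: mean(gpas) for major, gpas in by_major.items()}


# Invented (major, GPA) records for each class.
graduates = [("Computer Science", 3.8), ("Computer Science", 3.6),
             ("Humanities", 3.2)]
undergraduates = [("Computer Science", 3.2), ("Computer Science", 3.4),
                  ("Humanities", 3.6)]

grad_avg = average_gpa_by_major(graduates)
undergrad_avg = average_gpa_by_major(undergraduates)

# Turn the side-by-side summary into simple comparison rules.
rules = []
for major in sorted(grad_avg.keys() & undergrad_avg.keys()):
    if grad_avg[major] == undergrad_avg[major]:
        continue  # no difference to report
    higher = "Graduate" if grad_avg[major] > undergrad_avg[major] else "Undergraduate"
    rules.append(f"{higher} students have the higher average GPA in {major}")
```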


Class comparisons can be presented to users in various ways, similar to class characterizations. These include:
  • Generalized relations
  • Crosstabs
  • Bar charts
  • Pie charts
  • Curves
  • Rules

The presentation methods (except for logic rules) are the same for both class characterizations and comparisons.

This section focuses on showing class comparisons using discriminant rules.

What Are Discriminant Rules?

Discriminant rules highlight the differences between the target class and contrasting classes. They use a statistical measure, called a d-weight, to indicate how strongly a generalized combination of feature values distinguishes the target class from the contrasting classes. This makes it easier to understand the unique characteristics of each group in a quantitative way.
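The text does not give a formula for the d-weight. Under the usual definition, the d-weight of a generalized tuple for the target class is the number of target-class records covered by that tuple divided by the number of records it covers across all classes. The sketch below assumes that definition; the function name and the counts are illustrative.

```python
def d_weight(counts_by_class, target_class):
    """Share of the records covered by one generalized tuple
    that belong to the target class (a value between 0 and 1)."""
    total = sum(counts_by_class.values())
    if total == 0:
        raise ValueError("the tuple does not occur in any class")
    return counts_by_class[target_class] / total


# Invented counts for one generalized tuple, e.g. (major = CS, gpa = high).
counts = {"graduate": 90, "undergraduate": 210}

grad_weight = d_weight(counts, "graduate")
undergrad_weight = d_weight(counts, "undergraduate")
```

A d-weight near 100% means the tuple occurs almost only in the target class, so it is a strong discriminant; a value near 0% means the tuple mostly describes the contrasting classes.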
