Data Selection in Data Mining

Samundeeswari

Data selection is the process of deciding the type of data, where to get it, and the tools needed to collect it. This happens before collecting the data itself.

This process is different from:

  • Selective reporting: Leaving out data that doesn’t support a hypothesis.
  • Interactive selection: Using already-collected data for monitoring activities or for secondary analysis.

Purpose of Data Selection:
The main goal is to choose the right type, source, and tools to collect data that answer the research question effectively. The choice depends on the subject area, existing research, and available data sources.

Challenges:
Problems can arise if decisions are made based on cost or convenience rather than the data's ability to answer the research question. While cost and convenience are important, they shouldn’t compromise the reliability or integrity of the research.

Issues in Data Selection

Researchers need to consider the following when selecting data:

  1. Correct Type and Sources of Data: The data must help answer the research question effectively.
  2. Representative Sample: The data should be collected in a way that represents the population or phenomenon being studied.
  3. Right Tools for Data Collection: The tools used to collect data must be compatible with the chosen type and source of data; the tools and sources need to fit together so the data can be collected accurately.

Types and Sources of Data

Data comes in two main types:

  1. Quantitative Data: This includes numbers and measurable values, such as interval and ratio-level measurements.
  2. Qualitative Data: This includes non-numerical information like text, images, audio, or video.

Different scientific fields may prefer one type of data over another. However, some researchers use both types together to gain a deeper understanding of a subject.

Examples of Data Collection:

  • Qualitative: Observing behaviors like child-rearing practices.
  • Quantitative: Measuring physical attributes like biochemical markers or body dimensions.

Sources of Data:

  • Field notes, journals, and laboratory records.
  • Specimens or direct observations of humans, animals, or plants.

Often, the type of data and its source are closely related and may influence each other.

Feature Selection in Data Analysis

Feature selection is an important area of research in fields like pattern recognition, statistics, and data mining. Its primary goal is to select a subset of input variables by removing features that provide little or no predictive value. This process enhances the clarity of the resulting models and often creates models that perform better on new, unseen data. Identifying the right set of predictive features is often a critical task in itself.

For instance, a doctor might rely on selected features to decide whether a risky surgery is necessary for a patient. In supervised learning, where data is labeled, feature selection aims to find subsets of features that improve classification accuracy.
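As a concrete illustration of supervised feature selection, the short sketch below keeps only the columns that score highest against the label. The library (scikit-learn), the synthetic dataset, and the choice of k are assumptions made for this example, not tools named above.

```python
# A minimal supervised feature-selection sketch (scikit-learn and the
# synthetic dataset are illustrative assumptions).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic labeled data: 20 columns, only a few of which carry signal.
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, n_redundant=5, random_state=0)

# Score every feature against the label and keep the 5 highest-scoring ones.
selector = SelectKBest(score_func=mutual_info_classif, k=5)
X_reduced = selector.fit_transform(X, y)

print("kept feature indices:", selector.get_support(indices=True))
print("reduced shape:", X_reduced.shape)   # (500, 5)
```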

In unsupervised learning, where there are no labels, the goal is to identify features that group the data into meaningful clusters, so feature selection focuses on finding subsets of features that improve the quality of those clusters. Recent work has explored combining feature selection with clustering in a single, unified approach.

Traditional feature selection methods, which rely on a single evaluation criterion, often struggle to support knowledge discovery and decision-making effectively. This is because decision-makers must consider multiple, sometimes conflicting, objectives. No single criterion works best for all applications, and only the decision-maker can assign the appropriate importance to each criterion based on the specific needs of their project.

Importance of Feature Selection

Feature selection plays a key role in building effective models for several reasons. It helps reduce the number of features by eliminating irrelevant or redundant data, ensuring the model focuses only on the most important attributes. This is essential because datasets often contain more information than needed, or the wrong type of information.

For example, a dataset with 500 columns describing customer characteristics may include sparse or duplicate data that adds little value. Including such data can negatively impact the model’s performance.
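That kind of clean-up step might look like the following pandas sketch; the column names and the sparsity threshold are hypothetical, since the text does not prescribe a specific tool.

```python
import pandas as pd

# Hypothetical customer table with one mostly-empty and one duplicated column.
df = pd.DataFrame({
    "age":        [34, 45, 29, 52],
    "income":     [40_000, 85_000, 32_000, 91_000],
    "income_usd": [40_000, 85_000, 32_000, 91_000],   # duplicate of "income"
    "fax_number": [None, None, None, "555-0134"],      # sparse column
})

# Drop columns that are mostly missing (sparse) ...
sparse_cols = [c for c in df.columns if df[c].isna().mean() > 0.7]
# ... and columns that duplicate another column's values.
dup_cols = [c for c in df.columns
            if any(df[c].equals(df[o]) for o in df.columns if o < c)]

reduced = df.drop(columns=sparse_cols + dup_cols)
print(reduced.columns.tolist())   # ['age', 'income']
```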

Here’s why feature selection is beneficial:

  • Improved Model Quality: Removing noisy or redundant features makes it easier to identify meaningful patterns.
  • Efficiency: Using unnecessary columns increases CPU and memory usage during model training and requires more storage space for the final model. Eliminating unneeded features speeds up the process.
  • High-dimensional Challenges: Most algorithms struggle with high-dimensional data, often requiring larger datasets for accurate training.

Feature selection can be done manually by the analyst or automatically by the modeling tool or algorithm. Analysts may engineer features by adding or modifying data, while algorithms score features and select only the most relevant ones for the model.

In essence, feature selection addresses two major issues:

  1. Having too much irrelevant data.
  2. Having too little valuable data.

The goal is to identify the smallest set of important features that contribute to building an accurate and efficient model.

How Feature Selection Works in SQL Server

Feature selection is an essential step that happens before training a model. Some algorithms in SQL Server come with built-in feature selection techniques, automatically identifying and excluding irrelevant features. These algorithms use default methods to reduce features intelligently. However, users can also manually adjust parameters to influence the selection process.

During feature selection, each attribute is scored based on its importance. Only attributes with the highest scores are included in the model. SQL Server Data Mining offers multiple methods to calculate these scores. The specific method used depends on several factors:

  1. The algorithm used in the model
  2. The data type of the attribute
  3. Any custom parameters set by the user

Feature selection is applied to input features, target attributes, or even states within a column. Once scoring is complete, only the selected attributes are used for building the model and making predictions.

If a target attribute doesn’t meet the selection threshold, it can still be used for predictions. However, the prediction will rely on the overall statistics in the model, rather than on specific feature interactions.

By focusing on the most important features, SQL Server ensures efficient model building and accurate predictions.

Feature Selection Scores

SQL Server Data Mining uses several well-known and reliable methods to score attributes. The choice of scoring method for a specific algorithm or dataset depends on the data types and how the columns are utilized.


1. Interestingness Score

The interestingness score ranks attributes in columns with continuous numeric data. It measures how informative or relevant an attribute is for a specific task. While novelty might be helpful for identifying outliers, attributes that help classify or differentiate items are often more valuable.

SQL Server uses an entropy-based approach to measure interestingness. Attributes with random distributions (higher entropy) provide less information and are considered less interesting. The formula for interestingness is:

Interestingness(Attribute) = - (m - Entropy(Attribute))²

Here, m represents the central entropy of the entire feature set. Comparing an attribute's entropy with this central entropy indicates how much information the attribute contributes.
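The formula can be sketched in a few lines of Python. The discretization of the numeric columns and the use of the mean entropy as the central entropy m are simplifying assumptions made here for illustration; this is not a description of SQL Server's internal code.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy of a sequence of (discretized) values."""
    counts = Counter(values)
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def interestingness_scores(columns):
    """Score each column as -(m - H(column))^2, where m is the central
    entropy of the feature set (taken here as the mean entropy)."""
    entropies = {name: entropy(vals) for name, vals in columns.items()}
    m = sum(entropies.values()) / len(entropies)
    return {name: -(m - h) ** 2 for name, h in entropies.items()}

# Toy example: two discretized numeric columns.
cols = {"age_bucket": [1, 1, 2, 2, 3, 3], "noise": [1, 2, 3, 4, 5, 6]}
print(interestingness_scores(cols))
```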

2. Shannon's Entropy

Shannon’s entropy quantifies the uncertainty of a random variable for a specific outcome. For instance, the uncertainty of a coin toss can be expressed as a function of the probability of landing heads. SQL Server uses this formula to calculate entropy:

H(X) = -∑ᵢ P(xᵢ) log P(xᵢ)

This method is suitable for discrete and discretized attributes.
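The coin-toss example works out as follows; the probabilities below are illustrative, and base-2 logarithms are used so the result is in bits.

```python
import math

def coin_entropy(p_heads):
    """H(X) = -sum_i P(x_i) * log2(P(x_i)) for a two-outcome coin toss."""
    probs = [p_heads, 1 - p_heads]
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(coin_entropy(0.5))   # fair coin: 1.0 bit, maximum uncertainty
print(coin_entropy(0.9))   # biased coin: ~0.469 bits, less uncertainty
```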

3. Bayesian with K2 Prior

SQL Server provides two scoring methods based on Bayesian networks, which represent states and the transitions between them as a directed acyclic graph. Bayesian networks can incorporate prior knowledge, and the K2 prior is a popular choice for feature selection.

The K2 algorithm, developed by Cooper and Herskovits, is scalable and can analyze multiple variables, but it requires an ordering of the input variables. It evaluates the relationships between attributes to calculate probabilities, focusing on the attributes that are most predictive.

This method is also available for discrete and discretized attributes.
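The core of the K2 score can be written down compactly. The sketch below is a textbook, log-space formulation of the Cooper-Herskovits score for one attribute given a set of parent attributes; the toy data is invented, and this is an illustration of the idea rather than SQL Server's internal implementation.

```python
import math
from collections import defaultdict

def k2_log_score(rows, child, parents):
    """Log of the Cooper-Herskovits K2 score for `child` given `parents`.

    rows    : list of dicts with discrete attribute values
    child   : name of the attribute being scored
    parents : list of parent attribute names (may be empty)
    """
    r = len({row[child] for row in rows})              # number of child states
    counts = defaultdict(lambda: defaultdict(int))     # parent config -> child state -> count
    for row in rows:
        config = tuple(row[p] for p in parents)
        counts[config][row[child]] += 1

    score = 0.0
    for child_counts in counts.values():
        n_ij = sum(child_counts.values())
        # log[(r-1)! / (N_ij + r - 1)!] + sum_k log(N_ijk!)
        score += math.lgamma(r) - math.lgamma(n_ij + r)
        score += sum(math.lgamma(n + 1) for n in child_counts.values())
    return score

# Toy usage: is "buys" better predicted by "income" or by "region"?
data = [
    {"income": "high", "region": "n", "buys": "yes"},
    {"income": "high", "region": "s", "buys": "yes"},
    {"income": "low",  "region": "n", "buys": "no"},
    {"income": "low",  "region": "s", "buys": "no"},
]
print(k2_log_score(data, "buys", ["income"]))   # higher (less negative) score
print(k2_log_score(data, "buys", ["region"]))
```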

4. Bayesian Dirichlet Equivalent with Uniform Prior (BDEU)

The BDEU scoring method is another Bayesian-based technique, using the Dirichlet distribution to evaluate conditional probabilities in a dataset. It assumes a uniform prior distribution and applies likelihood equivalence, meaning that equivalent structures (e.g., "If A Then B" and "If B Then A") cannot be distinguished based on the data alone.

The BDEU method, based on work by Heckerman, assumes that causation cannot be inferred from the data alone unless clear evidence exists. It is well suited to evaluating relationships between discrete and discretized attributes.
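For comparison with the K2 sketch above, a textbook log-space BDeu score can be written as follows. The equivalent-sample-size parameter and the toy data are assumptions for illustration; this is not a description of SQL Server's internal code.

```python
import math
from collections import defaultdict
from itertools import product

def bdeu_log_score(rows, child, parents, ess=1.0):
    """Log BDeu score (uniform Dirichlet prior with equivalent sample size
    `ess`) for `child` given `parents` -- a textbook formulation."""
    child_states = sorted({row[child] for row in rows})
    r = len(child_states)
    parent_states = [sorted({row[p] for row in rows}) for p in parents]
    q = 1
    for states in parent_states:
        q *= len(states)

    counts = defaultdict(lambda: defaultdict(int))
    for row in rows:
        counts[tuple(row[p] for p in parents)][row[child]] += 1

    alpha_j, alpha_jk = ess / q, ess / (q * r)
    score = 0.0
    for config in product(*parent_states):
        child_counts = counts.get(config, {})
        n_ij = sum(child_counts.values())
        score += math.lgamma(alpha_j) - math.lgamma(alpha_j + n_ij)
        for state in child_states:
            n_ijk = child_counts.get(state, 0)
            score += math.lgamma(alpha_jk + n_ijk) - math.lgamma(alpha_jk)
    return score

data = [{"a": 0, "b": 0}, {"a": 0, "b": 0}, {"a": 1, "b": 1}, {"a": 1, "b": 1}]
print(bdeu_log_score(data, "b", ["a"]), bdeu_log_score(data, "b", []))
```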

Feature Selection Parameters

When using feature selection in algorithms, you can control its behavior by adjusting specific parameters. These parameters let you decide how many features are considered during the modeling process. Here’s a simplified explanation of the key parameters:

1. MAXIMUM_INPUT_ATTRIBUTES

This parameter limits the number of input columns (attributes) used in the model. If the total number of columns exceeds the limit, the algorithm automatically excludes those it considers unimportant.

2. MAXIMUM_OUTPUT_ATTRIBUTES

This parameter works similarly but applies to predictable columns. If the number of predictable attributes exceeds the specified limit, the algorithm ignores the less important ones.

3. MAXIMUM_STATES

This parameter sets a limit on the number of distinct values (states) allowed in a column. If a column has too many states, the least frequent ones are grouped together and treated as missing.
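The effect of these caps can be pictured with a short sketch. The parameter names mirror the SQL Server settings, but the logic below is a simplified emulation for illustration, not the product's implementation, and the default values shown are also assumptions.

```python
from collections import Counter

def cap_input_attributes(scores, maximum_input_attributes=255):
    """Keep only the highest-scoring input columns once the cap is exceeded.
    `scores` maps column name -> feature-selection score; 0 disables the cap."""
    if maximum_input_attributes == 0 or len(scores) <= maximum_input_attributes:
        return list(scores)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:maximum_input_attributes]

def cap_states(values, maximum_states=100):
    """Group the least frequent states into a single 'Missing' bucket when a
    column has more distinct states than allowed; 0 disables the cap."""
    freq = Counter(values)
    if maximum_states == 0 or len(freq) <= maximum_states:
        return list(values)
    kept = {state for state, _ in freq.most_common(maximum_states)}
    return [v if v in kept else "Missing" for v in values]

# Example: keep the top 2 of 4 scored columns, and cap a column at 2 states.
print(cap_input_attributes({"age": 0.9, "zip": 0.1, "income": 0.8, "fax": 0.0},
                           maximum_input_attributes=2))         # ['age', 'income']
print(cap_states(["a", "a", "b", "b", "c"], maximum_states=2))  # 'c' -> 'Missing'
```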

Turning Off Feature Selection

If any of these parameters are set to 0, feature selection is disabled. This means all columns and values are included in the model, but it could increase processing time and impact performance.

Optimizing Feature Selection

You can further refine the feature selection process by:

  • Setting modeling flags: These flags guide the algorithm to prioritize meaningful attributes.
  • Using distribution flags: These help the algorithm better understand the structure of the data.

Adjusting these parameters helps you focus on the most relevant features, improving your model's efficiency and accuracy.

