Data Mining Task Primitives
A data mining task can be defined by a data mining query, which is entered into the data mining system. A data mining query is made up of task primitives, which are the basic components that help users communicate with the system during the data mining process. These primitives allow users to guide the system in discovering patterns or to examine the findings from different perspectives. The data mining primitives define the following:
- Set of data that is relevant to the task.
- Type of knowledge the user wants to extract.
- Background knowledge that should be considered during the process.
- Evaluation measures for judging the usefulness of the discovered patterns.
- Visual representation for displaying the patterns found.
A data mining query language can be developed to include these primitives, giving users the ability to interact with the data mining system more easily. Such a language provides the foundation for building user-friendly graphical interfaces.
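As a rough illustration of how the five primitives fit together, a query can be pictured as a single record with one field per primitive. The sketch below is only an assumption for illustration: the class name MiningQuery, its field names, and the sample values are hypothetical and do not belong to any real data mining query language.

```python
from dataclasses import dataclass, field


@dataclass
class MiningQuery:
    """Hypothetical container holding the five task primitives."""
    task_relevant_data: str   # e.g. a relational query that selects the data
    knowledge_kind: str       # e.g. "association", "classification"
    background_knowledge: dict = field(default_factory=dict)  # e.g. concept hierarchies
    interestingness: dict = field(default_factory=dict)       # measure name -> threshold
    presentation: list = field(default_factory=list)          # e.g. ["rules", "table"]


# Illustrative values only; the table and column names are made up.
query = MiningQuery(
    task_relevant_data="SELECT age, item FROM purchases WHERE city = 'Pune'",
    knowledge_kind="association",
    background_knowledge={"age": ["age_range", "age_group"]},
    interestingness={"support": 0.05, "confidence": 0.7},
    presentation=["rules", "table"],
)
print(query.knowledge_kind, query.interestingness)
```

A graphical interface could fill in such a record from form fields and pass it to the mining engine, which is one way a query language can serve as the layer beneath a user interface.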
Designing a comprehensive data mining query language is challenging because data mining covers a wide range of tasks, from describing data to analyzing changes over time, and each task has its own requirements. An effective design therefore depends on a sound understanding of the capabilities and limitations of each kind of task. A well-designed language also helps the data mining system communicate with other information systems and fit into the broader information processing environment.
List of Data Mining Task Primitives
1. The Set of Task-Relevant Data to be Mined
This is the portion of the database or data set that the user wants to analyze. It includes the database attributes, or the data warehouse dimensions, that are of interest for the task.
In a relational database, this data can be collected using a relational query, which might involve operations like selecting, projecting, joining, or aggregating data.
Collecting the data produces a new relation, known as the initial data relation. It can be ordered or grouped according to the conditions specified in the query, and it may be cleaned or transformed before mining begins. This retrieval step is regarded as part of the data mining process.
The initial data relation may or may not correspond to a physical table in the database. Because virtual tables are called views in database terminology, the set of task-relevant data for data mining is referred to as a minable view.
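The idea of a minable view can be sketched with Python's built-in sqlite3 module. The tables customer and purchase, their columns, and the sample rows are assumptions made up for this example; the view combines the join, selection, projection, and aggregation operations described above.

```python
import sqlite3

# In-memory database with two small, made-up tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, age INTEGER, city TEXT);
    CREATE TABLE purchase (cust_id INTEGER, item TEXT, amount REAL);
    INSERT INTO customer VALUES (1, 23, 'Pune'), (2, 41, 'Delhi'), (3, 35, 'Pune');
    INSERT INTO purchase VALUES (1, 'laptop', 900.0), (1, 'mouse', 20.0),
                                (2, 'phone', 500.0), (3, 'laptop', 950.0);

    -- The minable view is a virtual table, not a physical one.
    -- It joins both tables, selects rows (WHERE), projects columns,
    -- and aggregates purchase amounts per customer (GROUP BY + SUM).
    CREATE VIEW minable_view AS
    SELECT c.cust_id, c.age, SUM(p.amount) AS total_spent
    FROM customer AS c
    JOIN purchase AS p ON p.cust_id = c.cust_id
    WHERE c.city = 'Pune'
    GROUP BY c.cust_id, c.age;
""")

# The initial data relation, ordered by customer id.
rows = conn.execute("SELECT * FROM minable_view ORDER BY cust_id").fetchall()
print(rows)
```

The mining step would then read from minable_view rather than from the base tables, so the task-relevant data stays defined in one place.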
2. The Kind of Knowledge to be Mined
This defines the data mining tasks to be carried out, such as characterization, discrimination, association or correlation analysis, classification, prediction, clustering, outlier detection, or trend analysis.
3. The Background Knowledge to be Used in the Discovery Process
Background knowledge about the domain being mined helps guide the discovery process and evaluate the patterns that are found. One common form of background knowledge is the concept hierarchy, which allows data to be mined at different levels of abstraction.
A concept hierarchy is a mapping from low-level concepts to higher-level, more general concepts. It organizes data so that it can be examined at a level of detail that yields meaningful insights. Concept hierarchies support two operations:
- Rolling Up (Generalization): Data is viewed at higher, more general levels of abstraction. The generalized data is more compact and easier to understand, and mining a smaller data set requires fewer input/output operations.
- Drilling Down (Specialization): This is the opposite of rolling up: higher-level concepts are replaced with more detailed, lower-level ones.
Depending on the user’s perspective, there may be more than one concept hierarchy for a given attribute or dimension.
For example, a concept hierarchy for the "age" attribute might go from a broad category like "adult" to more specific ranges such as "18-25 years" or "26-35 years."
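The "age" hierarchy can be sketched as two mappings, one from raw ages to ranges and one from ranges to broad categories. The exact cut-offs, the "under 18" and "36+ years" ranges, and the "minor" category are assumptions added to make the example complete; only "adult", "18-25 years", and "26-35 years" come from the text.

```python
from collections import Counter


def age_to_range(age: int) -> str:
    """Lowest level of the hierarchy: raw age -> age range (assumed cut-offs)."""
    if age < 18:
        return "under 18"
    if age <= 25:
        return "18-25 years"
    if age <= 35:
        return "26-35 years"
    return "36+ years"


# Next level up: age range -> broad category (assumed mapping).
RANGE_TO_GROUP = {
    "under 18": "minor",
    "18-25 years": "adult",
    "26-35 years": "adult",
    "36+ years": "adult",
}

ages = [16, 19, 22, 30, 34, 52]

# Roll up once: raw ages -> ranges.
by_range = Counter(age_to_range(a) for a in ages)
# Roll up again: ranges -> broad categories.
by_group = Counter(RANGE_TO_GROUP[age_to_range(a)] for a in ages)

print(dict(by_range))
print(dict(by_group))
```

Reading the two summaries from by_group back to by_range corresponds to drilling down: the single "adult" count is split into its more specific ranges.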
Another form of background knowledge involves user beliefs about relationships in the data, which can further guide the mining process.
4. The Interestingness Measures and Thresholds for Pattern Evaluation
Different types of knowledge may require different measures to assess their relevance or "interestingness." These measures help guide the mining process or, after patterns are discovered, evaluate their significance. For example, in association rule mining, common interestingness measures include support and confidence.
Support is the fraction of transactions in the dataset that contain all the items in a rule, while confidence is the fraction of transactions containing the antecedent of the rule that also contain its consequent, that is, how often the rule holds when it applies. If the support or the confidence of a rule falls below its user-defined threshold, the rule is considered uninteresting and may be discarded.
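A small worked example may make the two measures concrete. The transactions, the rule bread => milk, and the threshold values below are made up for illustration, and the helper names are hypothetical.

```python
# Each transaction is the set of items bought together (made-up data).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of all transactions that contain itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing antecedent, the fraction that also contain consequent."""
    return count(antecedent | consequent, transactions) / count(antecedent, transactions)


MIN_SUPPORT = 0.4      # assumed user-defined thresholds
MIN_CONFIDENCE = 0.7

# Rule under test: bread => milk
rule_support = support({"bread", "milk"}, transactions)          # 3 of 5 transactions
rule_confidence = confidence({"bread"}, {"milk"}, transactions)  # 3 of the 4 with bread

# A rule is kept only if it meets both thresholds.
is_interesting = rule_support >= MIN_SUPPORT and rule_confidence >= MIN_CONFIDENCE
print(rule_support, rule_confidence, is_interesting)
```

Raising MIN_CONFIDENCE to 0.8 would cause this rule to be rejected even though its support is unchanged, which shows that failing either threshold is enough to discard a rule.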
5. The Expected Representation for Visualizing the Discovered Patterns
This refers to how the discovered patterns will be displayed. The representation can include various formats such as rules, tables, cross-tabulations, charts, graphs, decision trees, cubes, or other visual formats.
Users should be able to specify which forms of representation to use when displaying the patterns. Some forms of representation may be more effective than others, depending on the type of knowledge being presented.