Module 2 Association Rules - Monash University

Chapter 16: Parallel Data Mining

Slides based on: D. Taniar, C.H.C. Leung, W. Rahayu, S. Goel, High-Performance Parallel Database Processing and Grid Databases, John Wiley & Sons, 2008.

Outline:
16.1 From DB to DW to DM
16.2 Data Mining: A Brief Overview
16.3 Parallel Association Rules
16.4 Parallel Sequential Patterns
16.5 Summary
16.6 Bibliographical Notes
16.7 Exercises

16.1. Data-Intensive Applications

All three (databases, data warehouses, and data mining) deal with data.

Databases are commonly deployed in almost every organization; in their simplest form, databases are data repositories. Database processing consists of queries and transactions, and the data contained in a database is normally operational data.

A data warehouse provides information from a historical perspective, whereas an operational database keeps data of current value. Building a data warehouse involves extracting, filtering, transforming, and integrating data from various sources, classifying the data, and aggregating and summarizing it. The result is a data warehouse in which the data is integrated, time-variant, non-volatile, and commonly subject-oriented.

Data mining analyzes large amounts of data stored in databases to discover interesting knowledge in the form of patterns, associations, changes, anomalies, significant structures, and so on. Data mining is also known as knowledge discovery, or more precisely, knowledge discovery in databases.

16.2. Data Mining: A Brief Overview

Data mining is a process for discovering useful, interesting, and sometimes surprising knowledge from a large collection of data.

Data Mining Tasks
- Descriptive data mining describes the data set in a concise manner and presents interesting general properties of the data; it summarizes the data in terms of its properties and its correlations with other data.
- Predictive data mining builds a prediction model that makes inferences from the available data and attempts to predict the behaviour of new data sets.

Data Mining Techniques
- Class description or characterization summarizes a set of data in a concise way that distinguishes this class from others.
- Association rules discover association relationships or correlations among a set of items.

- Classification analyzes a set of training data and constructs a model for each class based on the features in the data.
- Prediction predicts the possible values of some missing data, or the value distribution of certain attributes in a set of objects.
- Clustering divides the data into clusters, where each cluster contains a collection of data items that are similar to one another.
- Time-series analysis analyzes a large set of time-series data to find regularities and interesting characteristics.

Querying vs. Mining

Although the stated purpose of data mining is to discover knowledge, it should be differentiated from database querying, which simply retrieves data.

In some cases this is easier said than done; consequently, highlighting the differences is critical in studying both database querying and data mining. The differences can generally be categorized in terms of unsupervised and supervised learning.

Unsupervised Learning

In unsupervised learning, the learning process is not guided, or even dictated, by the expected results; in other words, unsupervised learning does not require a hypothesis. Describing it as exploring the entire possible space in a jungle of data may be overstating it, but the analogy is apt.

- Association rule mining vs. database querying: given a database D, association rule mining produces an association rule Ar(D) = X → Y, where X, Y ⊆ D. A query Q(D, X) = Y produces the records Y matching the predicate specified by X. The pattern X → Y may be based on certain criteria, such as majority, minority, absence, or exception.
- Sequential patterns vs. database querying: given a database D, a sequential pattern Sp(D) = O: X → Y, where O indicates the owner of a transaction and X, Y ⊆ D. The corresponding queries are Q(D, X, Y) = O, or Q(D, aggr) = O, where aggr indicates some aggregate function.
- Clustering vs. database querying: given a database D, clustering Cl(D) = ∪_{i=1}^{n} {Xi1, Xi2, ...} produces n clusters, each consisting of a number of items X. A query Q(D, X1) = {X2, X3, X4, ...} produces the list of items {X2, X3, X4, ...} in the same cluster as the given item X1.

Supervised Learning

Supervised learning is naturally the opposite of unsupervised learning, since supervised learning starts with a direction pointing to the target.

- Decision tree classification vs. database querying: given a database D, a decision tree Dt(D, C) = P, where C is the given category and P is the resulting properties. In a query Q(D, P) = R, the property P is known in advance and is used to retrieve records R.

Parallelism in Data Mining

Why parallelism is needed:
- Large volume of data
- High dimensionality (large number of attributes)
- High degree of complexity (not previously found in, or applicable to, databases or even data warehousing)

- Even a simple data mining technique requires a number of iterations, each of which refines the results until the ultimate results are generated.

There are two forms of parallelism in data mining: data parallelism and result parallelism.

Data Parallelism

Data parallelism arises because the data is partitioned across a number of processors and each processor focuses on its own partition of the data set. After each processor completes its local processing and produces its local results, the final results are formed, basically, by combining all local results. Since data mining processes normally run over several iterations, data parallelism raises some complexities not commonly found in database query processing.

Result Parallelism

Result parallelism focuses on how the target results can be parallelized during processing, without any results or temporary results having been produced yet.

Result parallelism works by partitioning the target results: each processor focuses on its own partition of the target results. Each processor will do whatever it takes to produce the results within its given range, and will take any input data necessary to produce its portion of the result space.

16.3. Parallel Association Rules

The goal is to discover rules based on correlations between different attributes/items found in the dataset. There are two phases: (i) phase one discovers frequent itemsets from the given dataset, and (ii) phase two generates rules from those frequent itemsets. The first phase is widely recognized as the most critical and computationally intensive task; because frequent itemset generation is so expensive, most work on association rules, including parallel association rules, has focused on this phase alone. Improving the performance of this phase is critical to overall performance.

Some Measurements

- Support and minimum support
- Confidence and minimum confidence
- Frequent itemset: an itemset in a dataset is frequent if its support is equal to or greater than the minimum support threshold specified by the user.
- Candidate itemset: given a database D, a minimum support threshold minsup, and an algorithm that computes F(D, minsup), an itemset I is a candidate if the algorithm must evaluate whether or not I is frequent.
- Association rules: at a given user-specified minimum confidence threshold minconf, find all association rules R from a set of frequent itemsets F such that each rule has confidence equal to or greater than minconf.
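To make these measures concrete, here is a minimal sketch in Python; the helper names support and confidence and the toy transaction list are ours, not the book's:

```python
# Illustrative sketch, not from the textbook.
# Each transaction is the set of items purchased together.
D = [
    {"bread", "milk"},
    {"bread", "cheese", "milk"},
    {"cheese", "coffee", "milk"},
]

def support(itemset):
    """support(I): fraction of transactions containing every item of I."""
    return sum(itemset <= t for t in D) / len(D)

def confidence(X, Y):
    """confidence(X -> Y) = support(X union Y) / support(X)."""
    return support(X | Y) / support(X)

# An itemset is frequent if support(I) >= minsup, e.g. minsup = 0.5:
print(support({"bread", "milk"}))        # 0.667 -> frequent
# A rule holds if confidence(X -> Y) >= minconf, e.g. minconf = 0.8:
print(confidence({"cheese"}, {"milk"}))  # 1.0   -> holds
```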

Example

Figure 16.5. Example dataset:

Transaction ID | Items Purchased
100 | bread, cereal, milk
200 | bread, cheese, coffee, milk
300 | cereal, cheese, coffee, milk
400 | cheese, coffee, milk
500 | bread, sugar, tea

Figure 16.6. Frequent itemsets (minimum support = 40% in this example):

Frequent Itemset | Support
bread | 60%
cereal | 40%
cheese | 60%
coffee | 60%
milk | 80%
bread, milk | 40%
cereal, milk | 40%
cheese, coffee | 60%
cheese, milk | 60%
coffee, milk | 60%
cheese, coffee, milk | 60%

Figure 16.7. Association rules:

Association Rule | Confidence
bread → milk | 67%
cereal → milk | 100%
cheese → coffee | 100%
cheese → milk | 100%
coffee → milk | 100%
coffee → cheese | 100%
milk → cheese | 75%
milk → coffee | 75%
cheese, coffee → milk | 100%
cheese, milk → coffee | 100%
coffee, milk → cheese | 100%
cheese → coffee, milk | 100%
coffee → cheese, milk | 100%
milk → cheese, coffee | 75%

Frequent Itemset Process
- Iteration 1: scan the dataset and find all frequent 1-itemsets.
- Iteration 2: join the frequent 1-itemsets to generate candidate 2-itemsets; then scan the dataset again, count the exact support of each candidate, and prune all infrequent candidate 2-itemsets.
- Iteration 3: join the frequent 2-itemsets to generate candidate 3-itemsets, pruning any candidate 3-itemset that has a 2-subset not in F2. Scan the dataset to find the exact support of the remaining candidate; in this example the candidate 3-itemset {cheese, coffee, milk} turns out to be frequent.
- Iteration 4: the joining phase cannot produce any candidate itemsets for the next iteration, so the process terminates.
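The iterations above can be written as a small level-wise (Apriori-style) loop. The sketch below is illustrative (the function names are ours, not the book's); run on the Figure 16.5 dataset, it reproduces the itemsets of Figure 16.6:

```python
# Illustrative sketch of the level-wise frequent itemset process.
from itertools import combinations

# Figure 16.5 dataset: transaction ID -> set of items purchased.
transactions = {
    100: {"bread", "cereal", "milk"},
    200: {"bread", "cheese", "coffee", "milk"},
    300: {"cereal", "cheese", "coffee", "milk"},
    400: {"cheese", "coffee", "milk"},
    500: {"bread", "sugar", "tea"},
}

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions.values() if itemset <= t)
    return hits / len(transactions)

def apriori(transactions, minsup=0.4):
    """Level-wise frequent itemset mining (phase one of association rules)."""
    items = set().union(*transactions.values())
    # Iteration 1: frequent 1-itemsets.
    level = [frozenset([i]) for i in items
             if support(frozenset([i]), transactions) >= minsup]
    frequent = list(level)
    k = 2
    while level:
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level
                             for s in combinations(c, k - 1))}
        # Count step: scan the dataset and keep candidates meeting minsup.
        level = [c for c in candidates
                 if support(c, transactions) >= minsup]
        frequent.extend(level)
        k += 1
    return frequent

for f in apriori(transactions):
    print(set(f), f"{support(f, transactions):.0%}")
```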
Rules Generation

Using the frequent itemset {cheese, coffee, milk}, the following three rules hold, since their confidence is 100%:
- cheese, coffee → milk
- cheese, milk → coffee
- coffee, milk → cheese

Then the apriori_gen() function is used to generate the candidate 2-itemset consequents, resulting in {cheese, milk} and {coffee, milk}. After confidence calculation, the following two rules also hold:
- coffee → cheese, milk (confidence = 100%)
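A sketch of this rule-generation phase for a single frequent itemset follows. The names are illustrative, and for brevity it enumerates every non-empty proper subset as an antecedent instead of building consequents level-wise with apriori_gen():

```python
# Illustrative sketch; a simplification of level-wise rule generation.
from itertools import combinations

def rules_from(itemset, support, minconf):
    """All rules X -> Y with X | Y == itemset and confidence >= minconf,
    where confidence(X -> Y) = support(itemset) / support(X)."""
    items = frozenset(itemset)
    rules = []
    for r in range(1, len(items)):           # antecedent sizes 1 .. |I|-1
        for ante in combinations(sorted(items), r):
            X = frozenset(ante)
            conf = support(items) / support(X)
            if conf >= minconf:
                rules.append((set(X), set(items - X), conf))
    return rules

# Supports taken from Figure 16.6; with minconf = 100% this yields
# exactly the five rules discussed in the text.
supports = {frozenset({"cheese"}): 0.6, frozenset({"coffee"}): 0.6,
            frozenset({"milk"}): 0.8,
            frozenset({"cheese", "coffee"}): 0.6,
            frozenset({"cheese", "milk"}): 0.6,
            frozenset({"coffee", "milk"}): 0.6,
            frozenset({"cheese", "coffee", "milk"}): 0.6}
for rule in rules_from({"cheese", "coffee", "milk"}, supports.get, 1.0):
    print(rule)
```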
- cheese → coffee, milk (confidence = 100%)

(The remaining candidate, milk → cheese, coffee, has confidence 75% only; see Figure 16.7.) Therefore, from the single frequent itemset {cheese, coffee, milk} alone, five association rules have been generated.

Parallel Association Rules

Data parallelism for association rule mining is often referred to as count distribution, while result parallelism is widely known as data distribution.

Data Parallelism (or Count Distribution)
- Each processor has a disjoint data partition to work with.
- Each processor, however, holds the complete candidate itemset, although with only partial support counts.
- At the end of each iteration, since the support count of each candidate itemset at each processor is incomplete, each processor must redistribute its counts to all processors; hence the term count distribution. This global result reassembling stage is essentially a global reduction of the support counts to obtain global counts.
- The process at each processor is then repeated until the complete frequent itemset is ultimately generated.
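A minimal single-machine simulation of one count-distribution iteration, with a Python process pool standing in for the processors; the partitioning, candidate set, and names are all illustrative assumptions, not the book's code:

```python
# Illustrative simulation of count distribution (data parallelism).
from collections import Counter
from itertools import combinations
from multiprocessing import Pool

def local_counts(args):
    """One 'processor': count the FULL candidate set over its own partition."""
    partition, candidates = args
    counts = Counter()
    for t in partition:
        for c in candidates:
            if c <= t:
                counts[c] += 1
    return counts

if __name__ == "__main__":
    transactions = [
        {"bread", "cereal", "milk"},
        {"bread", "cheese", "coffee", "milk"},
        {"cereal", "cheese", "coffee", "milk"},
        {"cheese", "coffee", "milk"},
        {"bread", "sugar", "tea"},
    ]
    # Candidate 2-itemsets for this iteration (assumed given).
    candidates = [frozenset(c) for c in
                  combinations({"bread", "cheese", "coffee", "milk"}, 2)]
    # Disjoint data partitions, one per simulated processor.
    partitions = [transactions[:2], transactions[2:4], transactions[4:]]
    with Pool(len(partitions)) as pool:
        partials = pool.map(local_counts, [(p, candidates) for p in partitions])
    # Count distribution: global reduction of the partial support counts.
    global_counts = sum(partials, Counter())
    minsup_count = 0.4 * len(transactions)
    frequent = [set(c) for c in candidates if global_counts[c] >= minsup_count]
    print(frequent)
```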
Result Parallelism (or Data Distribution)
- Data distribution parallelism is based on result parallelism: parallelism is created by partitioning the results rather than the data. The term data distribution should therefore not be confused with data parallelism (count distribution).
- Initially the dataset is partitioned; however, each processor needs not only its local partition but also all the other partitions from the other processors.
- At the end of each iteration, each processor produces its own local frequent itemsets and must send them to all other processors, so that every processor can generate its own candidate itemsets for the next iteration.
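By contrast, a sketch of data distribution (again a single-machine simulation with illustrative names): the candidate itemsets, i.e. the target results, are partitioned across processors, and each processor counts only its share, but over the entire dataset:

```python
# Illustrative simulation of data distribution (result parallelism).
from itertools import combinations
from multiprocessing import Pool

def count_my_candidates(args):
    """One 'processor': counts only its own share of the candidates,
    but needs the whole dataset (its partition plus all the others)."""
    my_candidates, all_transactions = args
    return {c: sum(c <= t for t in all_transactions) for c in my_candidates}

if __name__ == "__main__":
    transactions = [
        {"bread", "cereal", "milk"},
        {"bread", "cheese", "coffee", "milk"},
        {"cereal", "cheese", "coffee", "milk"},
        {"cheese", "coffee", "milk"},
        {"bread", "sugar", "tea"},
    ]
    candidates = [frozenset(c) for c in
                  combinations({"bread", "cheese", "coffee", "milk"}, 2)]
    # Result parallelism: partition the candidate itemsets, not the data.
    shares = [candidates[i::3] for i in range(3)]
    with Pool(len(shares)) as pool:
        partial_maps = pool.map(count_my_candidates,
                                [(s, transactions) for s in shares])
    # Each processor's counts are already exact; merge the disjoint maps.
    counts = {c: n for m in partial_maps for c, n in m.items()}
    frequent = [set(c) for c, n in counts.items()
                if n >= 0.4 * len(transactions)]
    print(frequent)
```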

16.4. Parallel Sequential Patterns

Sequential patterns, also known as sequential rules, are very similar to association rules: they form a causal relationship between two itemsets, in the form X → Y, meaning that because X occurs, Y occurs with a high probability.

Association rules are intra-transaction patterns, where the rule X → Y indicates that items X and Y exist in the same transaction. Sequential patterns are inter-transaction patterns: the same rule indicates that because item X exists in one transaction, item Y will exist in a near-future transaction of the same customer.

Concepts

Given a set of transactions D, each of which contains the following fields: customer ID, transaction time, and the items purchased in the transaction, mining sequential patterns finds the inter-transaction patterns/sequences that satisfy the minimum support (minsup), minimum gap (mingap), maximum gap (maxgap), and window size (wsize) specified by the user.

- A sequence s is an ordered list of itemsets i.
- Containment: <(5 6) (7)> is contained in <(4 5) (4 5 6 7) (7 9 10)>, because (5 6) ⊆ (4 5 6 7) and (7) ⊆ (7 9 10); whereas <(3 5)> is not contained in <(3) (5)>.
- Four important parameters govern mining of sequential patterns: support, window size, minimum gap, and maximum gap.
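The containment test can be sketched directly from this definition (illustrative code; it checks plain containment only, with no gap or window constraints):

```python
# Illustrative sketch of sequence containment.
def contains(big, small):
    """True if sequence `small` is contained in sequence `big`:
    each itemset of `small` is a subset of some itemset of `big`,
    with the itemsets matched in order."""
    i = 0
    for itemset in big:
        if i < len(small) and small[i] <= itemset:
            i += 1
    return i == len(small)

# <(5 6) (7)> is contained in <(4 5) (4 5 6 7) (7 9 10)>:
print(contains([{4, 5}, {4, 5, 6, 7}, {7, 9, 10}], [{5, 6}, {7}]))  # True
# <(3 5)> is not contained in <(3) (5)>:
print(contains([{3}, {5}], [{3, 5}]))  # False
```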

Sequential Patterns Process
- Phase 1: k = 1
- Phase 2: k > 1

Parallel Processing
- Data parallelism (or count distribution)
- Result parallelism (or data distribution)

16.5. Summary

Parallelism in data mining:
- Data parallelism: in association rules and sequential patterns, data parallelism is often known as count distribution, where the counts of the candidate itemsets in each iteration are shared and distributed to all processors; hence there is a synchronization phase.
- Result parallelism: result parallelism, on the other hand, parallelizes the results (i.e., the frequent itemsets and sequence itemsets). This model is often known as data distribution, where the dataset and the frequent itemsets are distributed and moved from one processor to another at the end of each iteration.
- Parallel association rules and parallel sequential patterns can be built using either data or result parallelism.

Continue to Chapter 17
