CONFIDENCE AND SUPPORT


E-COMMERCE IT430

Lesson 35


Two measures are used in association mining: support and confidence. Confidence measures how often the relationship holds true, e.g., what percentage of the people who bought milk also bought eggs. Support is the percentage of all transactions in which the two items occur together.

Taking eggs and milk as the example, the two measures can be expressed mathematically as follows:

Confidence = Transactions (eggs and milk) / Transactions (milk) * 100

For example, if 25 transactions involve both eggs and milk and 75 transactions involve milk, then confidence is 25/75*100 = 33.3%.

Support = Transactions (eggs and milk) / Total no. of transactions * 100

In a separate example, if 10 transactions involve both eggs and milk and the total no. of transactions in a day is 50, then support is 10/50*100 = 20%.
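The two measures can be checked with a few lines of Python. This is only a minimal sketch: the basket data and the function names `support` and `confidence` are made up for illustration. Confidence is taken in the sense of the prose definition above, i.e. the share of milk transactions that also contain eggs.

```python
# Hypothetical basket data: each set is one transaction.
transactions = [
    {"milk", "eggs"},
    {"milk", "bread"},
    {"milk", "eggs", "bread"},
    {"bread"},
    {"eggs"},
]


def support(items, transactions):
    """Percentage of all transactions that contain every item in `items`."""
    hits = sum(1 for t in transactions if items <= t)
    return 100.0 * hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Percentage of transactions containing `antecedent` that also contain `consequent`."""
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        return 0.0
    both = sum(1 for t in with_antecedent if consequent <= t)
    return 100.0 * both / len(with_antecedent)


milk_eggs_support = support({"milk", "eggs"}, transactions)
milk_to_eggs_confidence = confidence({"milk"}, {"eggs"}, transactions)
print(f"support = {milk_eggs_support:.1f}%")
print(f"confidence = {milk_to_eggs_confidence:.1f}%")
```

Here milk and eggs appear together in 2 of the 5 transactions, and 2 of the 3 milk transactions also contain eggs.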

Suppose confidence is 90% but support is 5%. We can gather from this that the two items have a very strong affinity with each other, such that when one item is sold the other is usually sold with it; however, the chance of this pair being purchased out of the total no. of transactions is slim, just 5%. One can adjust these measures to discover items having a corresponding level of association and set marketing strategy accordingly. If the data is fed to an association mining tool along with the required percentages of confidence and support, the tool lists the items whose association corresponds to these percentages.
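What such a tool does with the two percentages can be sketched roughly as below. This is an illustrative brute-force pass over item pairs, not how any particular product works. The data, the function name `find_pairs`, and the treatment of the percentages as minimum thresholds are all assumptions.

```python
from itertools import combinations

# Hypothetical transactions for illustration only.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
    {"bread", "milk"},
]


def find_pairs(transactions, min_support, min_confidence):
    """Return (x, y, support%, confidence%) for rules x -> y meeting both thresholds."""
    total = len(transactions)
    items = sorted(set().union(*transactions))
    results = []
    for a, b in combinations(items, 2):
        both = sum(1 for t in transactions if a in t and b in t)
        supp = 100.0 * both / total
        # Check the rule in both directions: a -> b and b -> a.
        for x, y in ((a, b), (b, a)):
            with_x = sum(1 for t in transactions if x in t)
            conf = 100.0 * both / with_x if with_x else 0.0
            if supp >= min_support and conf >= min_confidence:
                results.append((x, y, supp, conf))
    return results


rules = find_pairs(transactions, min_support=50, min_confidence=70)
for x, y, supp, conf in rules:
    print(f"{x} -> {y}: support {supp:.0f}%, confidence {conf:.0f}%")
```

Raising or lowering the two thresholds changes which pairs are reported, which is the adjustment described above.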

Results of association mining are shown with the help of double arrows as indicated below:

Bread <--> Butter

Computer <--> Furniture

Clothes <--> Shoes

Using the results of association mining, a marketer can take a number of useful steps to set or modify marketing strategy. For example, items that have an affinity with each other can be shelved together to improve customer service, and promotional schemes can be introduced in view of the association mining results.

Characterization

Characterization discovers interesting concepts, expressed in concise and succinct terms at generalized levels, for examining the general behavior of the data. For example, in a database of graduate students of a university, students of different nationalities may be enrolled in different departments such as music, history, physics etc. We can apply the characterization technique to find a generalized answer to the question of how many students of a particular country are studying science or arts. See the following example:

Student name    Department     City of residence
Imran           History        Karachi
Alice           Physics        London
Ali             Literature     Lahore
Bob             Mathematics    Toronto
...

In the above example, a characterization tool can tell us that two Pakistani students are studying arts. Note that the concepts of location and field of education are generalized to Pakistan and arts, respectively.

The two algorithms used in characterization are Version Space Search and Attribute-Oriented Induction.
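The generalization step in the student example can be sketched as follows. This is a simplified illustration, loosely in the spirit of attribute-oriented induction rather than a full implementation of either algorithm. The two lookup tables (city to country, department to field) are assumed concept hierarchies that do not appear in the handout.

```python
from collections import Counter

# The four rows shown in the example table.
students = [
    ("Imran", "History", "Karachi"),
    ("Alice", "Physics", "London"),
    ("Ali", "Literature", "Lahore"),
    ("Bob", "Mathematics", "Toronto"),
]

# Assumed concept hierarchies used to generalize each attribute.
city_to_country = {
    "Karachi": "Pakistan",
    "Lahore": "Pakistan",
    "London": "United Kingdom",
    "Toronto": "Canada",
}
department_to_field = {
    "History": "arts",
    "Literature": "arts",
    "Physics": "science",
    "Mathematics": "science",
}


def characterize(rows):
    """Replace city and department by their general concepts, then count each combination."""
    counts = Counter()
    for _name, department, city in rows:
        counts[(city_to_country[city], department_to_field[department])] += 1
    return counts


summary = characterize(students)
for (country, field), n in sorted(summary.items()):
    print(f"{n} student(s) from {country} studying {field}")
```

The student name is dropped because it has no useful higher-level concept, while city and department are replaced by country and field before counting.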


Clustering

A cluster is a group of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. Examples include clusters of distinct groups of customers, categories of emails in a mailing list database, and different categories of web usage from log files. Clustering serves as a preprocessing step for other algorithms such as classification and characterization. The K-means algorithm is normally used in clustering. In the example below you can see four clusters of customers based on their income level. The K-means algorithm displays the result in the format shown in Fig. 1 below:

[Figure: four customer clusters labeled Income < 1,00,000; Income >= 1,00,000 and <= 2,00,000; Income > 2,00,000 and <= 3,50,000; Income > 3,50,000]

Fig. 1
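How K-means arrives at such groups can be sketched for a single attribute such as income. This is a minimal one-dimensional version with a simple deterministic starting point. The income values and the function name `k_means_1d` are hypothetical, and real tools use more careful initialization.

```python
def k_means_1d(values, k, iterations=20):
    """Tiny 1-D k-means for k >= 2: returns (centroids, clusters)."""
    ordered = sorted(values)
    # Deterministic start: pick k values spread evenly across the sorted data.
    centroids = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = new_centroids
    return centroids, clusters


# Hypothetical customer incomes.
incomes = [
    50_000, 60_000, 70_000,
    150_000, 160_000, 170_000,
    250_000, 260_000, 270_000,
    450_000, 460_000, 470_000,
]
centroids, clusters = k_means_1d(incomes, k=4)
for centre, members in zip(centroids, clusters):
    print(f"centre {centre:.0f}: {members}")
```

The algorithm groups values by closeness to a cluster centre. Income ranges like those in the figure are a way of describing the resulting groups, not an input to the algorithm.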

Online Analytical Processing (OLAP)

OLAP makes use of background knowledge regarding the domain of the data being studied in order to allow the presentation of data at different levels of abstraction. It is different from data mining in the sense that it does not provide any patterns for making predictions; rather, the information stored in databases can be presented and viewed in a convenient format at different levels, which helps decision makers or managers. The result of OLAP is displayed in the form of a data cube, as shown in Fig. 2 below:

[Figure: Data Cube in OLAP. Dimensions: Location (cities: Karachi, Lahore), Time (quarters), Item types (grocery, furniture, phone, computer). Cell values shown: 440, 345, 605, 825, 400.]

Fig. 2


Note that in the above diagram, time, item type and location are the three dimensions. The OLAP data cube indicates the sale of 605 units of furniture and 825 computers in Lahore in the first quarter of the year, 440 units of furniture and 345 phone sets in Karachi in the first quarter, and 400 grocery items in Lahore during the second quarter. Results can also be displayed through a data cube against more than three dimensions. For instance, the variables 'warehouse' and 'customer type' may be added as dimensions to view the sale results.

An OLAP tool allows the use of different processes, namely drill-down, roll-up, slice, dice etc. Using drill-down we can dig further into the data to obtain more specific information, for example the sale of furniture in a specific month of the first quarter, say, February. Roll-up is the reverse of drill-down: it sums up or integrates the information in a particular dimension to show the result. For example, the sale of furniture or computers in a particular year (rather than a specific quarter) can be viewed using roll-up. Similarly, through slice and dice, information can be presented that is specific to certain dimensions of the data cube.
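Roll-up and slice can be sketched on the five sales figures quoted above. This is an illustrative toy, not how an OLAP tool stores data: the dictionary layout and the function names `roll_up` and `slice_cube` are assumptions. Drill-down is not shown because the example has no monthly figures to drill into.

```python
from collections import defaultdict

# (city, quarter, item) -> units sold, using the figures quoted in the text.
cube = {
    ("Lahore", "Q1", "furniture"): 605,
    ("Lahore", "Q1", "computer"): 825,
    ("Karachi", "Q1", "furniture"): 440,
    ("Karachi", "Q1", "phone"): 345,
    ("Lahore", "Q2", "grocery"): 400,
}
DIMENSIONS = ("city", "quarter", "item")


def roll_up(cube, keep):
    """Sum the cube over every dimension that is not listed in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(int)
    for key, units in cube.items():
        totals[tuple(key[i] for i in idx)] += units
    return dict(totals)


def slice_cube(cube, dimension, value):
    """Keep only the cells where `dimension` equals `value`."""
    i = DIMENSIONS.index(dimension)
    return {key: units for key, units in cube.items() if key[i] == value}


by_item = roll_up(cube, keep=("item",))      # totals per item across cities and quarters
lahore = slice_cube(cube, "city", "Lahore")  # only the Lahore cells
print(by_item)
print(lahore)
```

Adding a dimension such as warehouse would mean extending the key tuple and `DIMENSIONS`; the two functions would work unchanged.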

SAS (Enterprise Miner) and DBMiner are two commonly used tools for data mining and OLAP. Note that characterization can be used with any data type, whereas OLAP is generally used for numeric data alone.
