Artificial Intelligence (CS607)
7 Introduction to learning
7.1 Motivation
Artificial Intelligence (AI) is concerned with programming computers to perform
tasks that are presently done better by humans. AI is about human behavior, the
discovery of techniques that will allow computers to learn from humans. One of
the most often heard criticisms of AI is that machines cannot be called Intelligent
until they are able to learn to do new things and adapt to new situations, rather
than simply doing as they are told to do. There can be little question that the
ability to adapt to new surroundings and to solve new problems is an important
characteristic of intelligent entities. Can we expect such abilities in programs?
Ada Augusta, one of the earliest philosophers of computing, wrote: "The
Analytical Engine has no pretensions whatever to originate anything. It can do
whatever we know how to order it to perform." This remark has been interpreted
by several AI critics as saying that computers cannot learn. In fact, it does not say
that at all. Nothing prevents us from telling a computer how to interpret its inputs
in such a way that its performance gradually improves. Rather than asking in
advance whether it is possible for computers to "learn", it is much more
enlightening to try to describe exactly what activities we mean when we say
"learning" and what mechanisms could be used to enable us to perform those
activities. [Simon, 1993] stated "changes in the system that are adaptive in the
sense that they enable the system to do the same task or tasks drawn from the
same population more efficiently and more effectively the next time".
7.2 What is learning?
Learning can be described as normally a relatively permanent change that occurs
in behavior as a result of experience. Learning occurs in various regimes. For
example, it is possible to learn to open a lock as a result of trial and error;
possible to learn how to use a word processor as a result of following particular
instructions.
Once the internal model of what ought to happen is set, it is possible to learn by
practicing the skill until the performance converges on the desired model. One
begins by paying attention to what needs to be done, but with more practice, one
will need to monitor only the trickier parts of the performance.
Automatic performance of some skills by the brain points out that the brain is
capable of doing things in parallel i.e. one part is devoted to the skill whilst
another part mediates conscious experience.
There's no decisive definition of learning but here are some that do justice:
·  "Learning denotes changes in a system that ... enables a system to do the
same task more efficiently the next time." --Herbert Simon
·  "Learning is constructing or modifying representations of what is being
experienced." --Ryszard Michalski
·  "Learning is making useful changes in our minds." --Marvin Minsky
7.3 What is machine learning?
It is very difficult to define precisely what machine learning is. We can best
enlighten ourselves by exactly describing the activities that we want a machine to
do when we say learning and by deciding on the best possible mechanism to
enable us to perform those activities. Generally speaking, the goal of machine
learning is to build computer systems that can learn from their experience and
adapt to their environments. Obviously, learning is an important aspect or
component of intelligence. There are both theoretical and practical reasons to
support such a claim. Some people even think intelligence is nothing but the
ability to learn, though other people think an intelligent system has a separate
"learning mechanism" which improves the performance of other mechanisms of
the system.
7.4 Why do we want machine learning
One response to the idea of AI is to say that computers cannot think because
they only do what their programmers tell them to do. However, it is not always
easy to tell what a particular program will do, but given the same inputs and
conditions it will always produce the same outputs. If the program gets something
right once it will always get it right. If it makes a mistake once it will always make
the same mistake every time it runs. In contrast to computers, humans learn from
their mistakes; attempt to work out why things went wrong and try alternative
solutions. Also, we are able to notice similarities between things, and therefore
can generate new ideas about the world we live in. Any intelligence, however
artificial or alien, that did not learn would not be much of an intelligence. So,
machine learning is a prerequisite for any mature programme of artificial
intelligence.
7.5 What are the three phases in machine learning?
Machine learning typically follows three phases, according to Finlay [Janet Finlay,
1996]. They are as follows:
1. Training: a training set of examples of correct behavior is analyzed and
some representation of the newly learnt knowledge is stored. This is often some
form of rules.
2. Validation: the rules are checked and, if necessary, additional training is
given. Sometimes additional test data are used, but instead of using a human to
validate the rules, some other automatic knowledge based component may be
used. The role of tester is often called the critic.
3. Application: the rules are used in responding to some new situations.
These phases may not be distinct. For example, there may not be an explicit
validation phase; instead, the learning algorithm guarantees some form of
correctness. Also in some circumstances, systems learn "on the job", that is, the
training and application phases overlap.
7.5.1 Inputs to training
There is a continuum between knowledge-rich methods that use extensive
domain knowledge and those that use only simple domain-independent
knowledge. The domain-independent knowledge is often implicit in the
algorithms; e.g. inductive learning is based on the knowledge that if something
happens a lot it is likely to be generally true. Where examples are provided, it is
important to know the source. The examples may be simply measurements from
the world, for example, transcripts of grand master tournaments. If so, do they
represent "typical" sets of behavior or have they been filtered to be
"representative"? If the former is true then it is possible to infer information about
the relative probability from the frequency in the training set. However, unfiltered
data may also be noisy, have errors, etc., and examples from the world may not
be complete, since infrequent situations may simply not be in the training set.
Alternatively, the examples may have been generated by a teacher. In this case,
it can be assumed that they are a helpful set which cover all the important cases.
Also, it is advisable to assume that the teacher will not be ambiguous.
Finally the system itself may be able to generate examples by performing
experiments on the world, asking an expert, or even using the internal model of
the world.
Some form of representation of the examples also has to be decided. This may
partly be determined by the context, but more often than not there will be a
choice. Often the choice of representation embodies quite a lot of the domain
knowledge.
7.5.2 Outputs of training
Outputs of learning are determined by the application. The question that arises is
'What is it that we want to do with our knowledge?'. Many machine learning
systems are classifiers. The examples they are given are from two or more
classes, and the purpose of learning is to determine the common features in each
class. When a new unseen example is presented, the system uses the common
features to determine which class the new example belongs to. For example:
If example satisfies condition
Then assign it to class X
This sort of classification job is often termed concept learning. The simplest
case is when there are only two classes, of which one is seen as the desired
"concept" to be learnt and the other is everything else. The "then" part of the rules
is always the same and so the learnt rule is just a predicate describing the
concept.
Not all learning is simple classification. In applications such as robotics one wants
to learn appropriate actions. In such a case, the knowledge may be in terms of
production rules or some similar representation.
An important consideration for both the content and representation of learnt
knowledge is the extent to which explanation may be required for future actions.
Because of this, the learnt rules must often be restricted to a form that is
comprehensible to humans.
7.5.3 The training process
Real learning involves some generalization from past experience and usually
some coding of memories into a more compact form. Achieving this
generalization needs some form of reasoning. The difference between deductive
reasoning and inductive reasoning is often used as the primary distinction
between machine learning algorithms. Deductive learning works on existing
facts and knowledge and deduces new knowledge from the old. In contrast,
inductive learning uses examples and generates hypotheses based on the
similarities between them.
One way of looking at the learning process is as a search process. One has a set
of examples and a set of possible rules. The job of the learning algorithm is to
find suitable rules that are correct with respect to the examples and existing
knowledge.
7.6 Learning techniques available
7.6.1 Rote learning
In this kind of learning there is no prior knowledge. When a computer stores a
piece of data, it is performing an elementary form of learning. This act of storage
presumably allows the program to perform better in the future. Examples of
correct behavior are stored and when a new situation arises it is matched with the
learnt examples. The values are stored so that they are not re-computed later.
One of the earliest game-playing programs is Samuel's checkers program
[Samuel, 1963]. This program learned to play checkers well enough to beat its
creator/designer.
7.6.2 Deductive learning
Deductive learning works on existing facts and knowledge and deduces new
knowledge from the old. This is best illustrated by giving an example. For
example, assume:
A=B
B=C
Then we can deduce with full confidence that:
C=A
Arguably, deductive learning does not generate "new" knowledge at all, it simply
memorizes the logical consequences of what is known already. This implies that
virtually all mathematical research would not be classified as learning "new"
things. However, regardless of whether this is termed as new knowledge or not, it
certainly makes the reasoning system more efficient.
7.6.3 Inductive learning
Inductive learning takes examples and generalizes rather than starting with
existing knowledge. For example, having seen many cats, all of which have tails,
one might conclude that all cats have tails. This is an unsound step of reasoning
but it would be impossible to function without using induction to some extent. In
many areas it is an explicit assumption. There is scope for error in inductive
reasoning, but it is still a useful technique that has been used as the basis of
several successful systems.
One major subclass of inductive learning is concept learning. This takes
examples of a concept and tries to build a general description of the concept.
Very often, the examples are described using attribute-value pairs. The example
of inductive learning given here is that of a fish. Look at the table below:
             herring   cat   dog   cod   whale
Swims        yes       no    no    yes   yes
has fins     yes       no    no    yes   yes
has lungs    no        yes   yes   no    yes
is a fish    yes       no    no    yes   no
In the above example, there are various ways of generalizing from examples of
fish and non-fish. The simplest description can be that a fish is something that
does not have lungs. No other single attribute would serve to differentiate the
fish.
The two very common inductive learning algorithms are version spaces and ID3.
These will be discussed in detail, later.
7.7 How is it different from the AI we've studied so far?
Many practical applications of AI do not make use of machine learning; the
relevant knowledge is built in at the start. Even though such programs are
fundamentally limited, they are useful and do their job. However, even where we
do not require a system to learn "on the job", machine learning has a part to play.
7.7.1 Machine learning in developing expert systems?
Many AI applications are built with rich domain knowledge and hence do not
make use of machine learning. To build such expert systems, it is critical to
capture knowledge from experts. However, the fundamental problem remains
unresolved, in the sense that things that are normally implicit inside the expert's
head must be made explicit. This is not always easy as the experts may find it
hard to say what rules they use to assess a situation but they can always tell you
what factors they take into account. This is where machine learning mechanism
could help. A machine learning program can take descriptions of situations
couched in terms of these factors and then infer rules that match expert's
behavior.
7.8 Applied learning
7.8.1 Solving real world problems by learning
We do not yet know how to make computers learn nearly as well as people learn.
However, algorithms have been developed that are effective for certain types of
learning tasks, and many significant commercial applications have begun to
appear. For problems such as speech recognition, algorithms based on machine
learning outperform all other approaches that have been attempted to date. In
other emergent fields like computer vision and data mining, machine learning
algorithms are being used to recognize faces and to extract valuable information
and knowledge from large commercial databases respectively. Some of the
applications that use learning algorithms include:
·  Spoken digits and word recognition
·  Handwriting recognition
·  Driving autonomous vehicles
·  Path finders
·  Intelligent homes
·  Intrusion detectors
·  Intelligent refrigerators, TVs, vacuum cleaners
·  Computer games
·  Humanoid robotics
This is just a glimpse of the applications that use some intelligent learning
component. The current era has applied learning in domains ranging from
agriculture to astronomy to medical sciences.
7.8.2 A general model of learning agents, pattern recognition
Any given learning problem is primarily composed of three things:
·  Input
·  Processing unit
·  Output
The input is composed of examples that can help the learner learn the underlying
problem concept. Suppose we were to build the learner for recognizing spoken
digits. We would ask some of our friends to record their sounds for each digit [0
to 9]. Positive examples of digit '1' would be the spoken digit '1', by the speakers.
Negative examples for digit '1' would be all the rest of the digits. For our learner
to learn the digit '1', it would need positive and negative examples of digit '1' in
order to truly learn the difference between digit '1' and the rest.
The processing unit is the learning agent in our focus of study. Any learning
agent or algorithm should in turn have at least the following three characteristics:
7.8.2.1 Feature representation
The input is usually broken down into a number of features. This is not a
rule, but sometimes the real world problems have inputs that cannot be fed
to a learning system directly, for instance, if the learner is to tell the
difference between a good and a not-good student, how do you suppose it
would take the input? And for that matter, what would be an appropriate
input to the system? It would be very interesting if the input were an entire
student named Ali or Umar etc. So the student goes into the machine and
it tells if the student it consumed was a good student or not. But that
seems like a far-fetched idea right now. In reality, we usually associate
some attributes or features to every input, for instance, two features that
can define a student can be: grade and class participation. So these
become the feature set of the learning system. Based on these features,
the learner processes each input.
7.8.2.2 Distance measure
Given two different inputs, the learner should be able to tell them apart.
The distance measure is the procedure that the learner uses to calculate
the difference between the two inputs.
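To make the idea concrete, here is a minimal Python sketch (ours, not from the original lecture; the feature values are hypothetical) of one common distance measure, the Euclidean distance, over two inputs described by numeric features:

import math

def euclidean_distance(x, y):
    # Sum the squared differences over all features, then take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Two hypothetical students described by the features (grade, class participation).
student_a = (3.8, 0.9)
student_b = (2.1, 0.4)
print(euclidean_distance(student_a, student_b))  # roughly 1.77: how far apart the inputs are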
7.8.2.3 Generalization
In the training phase, the learner is presented with some positive and
negative examples from which it learns. In the testing phase, when the
learner comes across new but similar inputs, it should be able to classify
them similarly. This is called generalization. Humans are exceptionally
good at generalization. A small child learns to differentiate between birds
and cats in the early days of his/her life. Later when he/she sees a new
bird, never seen before, he/she can easily tell that it's a bird and not a cat.
7.9 LEARNING: Symbol-based
Ours is a world of symbols. We use symbolic interpretations to understand the
world around us. For instance, if we saw a ship, and were to tell a friend about its
size, we will not say that we saw a 254.756-meter-long ship; instead we'd say
that we saw a 'huge' ship about the size of the 'Eiffel tower'. And our friend would
understand the relationship between the size of the ship and its hugeness with
the analogies of the symbolic information associated with the two words used:
'huge' and 'Eiffel tower'.
Similarly, the techniques we are to learn now use symbols to represent
knowledge and information. Let us consider a small example to help us see
where we're headed. What if we were to learn the concept of a GOOD
STUDENT? We would need to define, first of all, some attributes of a student on
the basis of which we could tell apart the good student from the average. Then
we would require some examples of good students and average students. To
keep the problem simple we can label all the students who are "not good"
(average, below average, satisfactory, bad) as NOT GOOD STUDENT. Let's say
we choose two attributes to define a student: grade and class participation. Both
the attributes can have either of the two values: High, Low. Our learner program
will require some examples from the concept of a student, for instance:
1. Student (GOOD STUDENT): Grade (High) ^ Class Participation (High)
2. Student (GOOD STUDENT): Grade (High) ^ Class Participation (Low)
3. Student (NOT GOOD STUDENT): Grade (Low) ^ Class Participation
(High)
4. Student (NOT GOOD STUDENT): Grade (Low) ^ Class Participation (Low)
As you can see the system is composed of symbolic information, based on which
the learner can even generalize that a student is a GOOD STUDENT if his/her
grade is high, even if the class participation is low:
Student (GOOD STUDENT): Grade (High) ^ Class Participation (?)
This is the final rule that the learner has learnt from the enumerated examples.
Here the '?' means that the attribute class participation can have any value, as
long as the grade is high. In this section we will see all the steps the learner has
to go through to actually come up with a final conclusion like this.
7.10 Problem and problem spaces
Before we get down to solving a problem, the first task is to understand the
problem itself. There are various kinds of problems that require solutions. In
theoretical computer science there are two main branches of problems:
·  Tractable
·  Intractable
165
img
Artificial Intelligence (CS607)
Those problems that can be solved in polynomial time are termed tractable; the
rest are called intractable. The tractable problems are further divided into
structured and complex problems. Structured problems are those which have
defined steps through which the solution to the problem is reached. Complex
problems usually don't have well-defined steps. Machine learning algorithms are
particularly more useful in solving the complex problems like recognition of
patterns in images or speech, for which it's hard to come up with procedural
algorithms otherwise.
The solution to any problem is a function that converts its inputs to corresponding
outputs. The domain of a problem or the problem space is defined by the
elements explained in the following paragraphs. These new concepts will be best
understood if we take one example and exhaustively use it to justify each
construct.
Example:
Let us consider the domain of HEALTH. The problem in this case is to distinguish
between a sick and a healthy person. Suppose we have some domain
knowledge; keeping a simplistic approach, we say that two attributes are
necessary and sufficient to declare a person as healthy or sick. These two
attributes are: Temperature (T) and Blood Pressure (BP). Any patient coming into
the hospital can have three values for T and BP: High (H), Normal (N) and Low
(L). Based on these values, the person is to be classified as Sick (SK). SK is a
Boolean concept, SK = 1 means the person is sick, and SK = 0 means person is
healthy. So the concept to be learnt by the system is of Sick, i.e., SK=1.
7.10.1 Instance space
How many distinct instances can the concept sick have? Since there are two
attributes: T and BP, each having 3 values, there can be a total of 9 possible
distinct instances in all. If we were to enumerate these, we'll get the following
table:
X     T   BP   SK
x1    L   L    -
x2    L   N    -
x3    L   H    -
x4    N   L    -
x5    N   N    -
x6    N   H    -
x7    H   L    -
x8    H   N    -
x9    H   H    -
This is the entire instance space, denoted by X, and the individual instances are
denoted by xi. |X| gives us the size of the instance space, which in this case is 9.
|X| = 9
The set X is the entire data possibly available for any concept. However,
sometimes in real world problems, we don't have the liberty to have access to the
entire set X, instead we have a subset of X, known as training data, denoted by
D, available to us, on the basis of which we make our learner learn the concept.
7.10.2 Concept space
A concept is the representation of the problem with respect to the given
attributes, for example, if we're talking about the problem scenario of concept
SICK defined over the attributes T and BP, then the concept space is defined by
all the combinations of values of SK for every instance x. One of the possible
concepts for the concept SICK might be enumerated in the following table:
X     T   BP   SK
x1    L   L    0
x2    L   N    0
x3    L   H    1
x4    N   L    0
x5    N   N    0
x6    N   H    1
x7    H   L    1
x8    H   N    1
x9    H   H    1
But there are a lot of other possibilities besides this one. The question is: how
many total concepts can be generated out of this given situation? The answer is:
2^|X|. To see this intuitively, we'll make small tables for each concept and see them
graphically if they come up to the number 2^9, since |X| = 9.
The representation used here is that every box in the following diagram is
populated using C(xi), i.e. the value that the concept C gives as output when xi is
given to it as input.
C(x3) C(x6) C(x9)
C(x2) C(x5) C(x8)
C(x1) C(x4) C(x7)
Since we don't know the concept yet, there might be concepts which can
produce 2^9 different outputs, such as:
[Figure: six of the 2^9 = 512 possible concepts (C1, C2, C3, C4, ..., C512), each drawn as a 3 x 3 grid of 0/1 outputs over the nine instances]
Each of these is a different concept, only one of which is the true concept (that
we are trying to learn), but the dilemma is that we don't know which one of the 2^9
is the true concept of SICK that we're looking for, since in real world problems we
don't have all the instances in the instance space X available to us for learning. If
we had all the possible instances available, we would know the exact concept,
but the problem is that we might just have three or four examples of instances
available to us out of nine.
D     T   BP   SK
x1    N   L    1
x2    L   N    0
x3    N   N    0
Notice that this is not the instance space X, in fact it is D: the training set. We
don't have any idea about the instances that lie outside this set D. The learner is
to learn the true concept C based on only these three observations, so that once
it has learnt, it could classify the new patients as sick or healthy based on the
input parameters.
7.10.3 Hypothesis space
The above condition is typically the case in almost all the real world problems
where learning is to be done based on a few available examples. In this situation,
the learner has to hypothesize. It would be insensible to exhaustively search over
the entire concept space, since there are 2^9 concepts. This is just a toy problem
with only 9 possible instances in the instance space; just imagine how huge the
concept space would be for real world problems that involve larger attribute sets.
So the learner has to apply some bias, either a search bias or a language bias, to
reduce the size of the concept space. This reduced concept space becomes the
hypothesis space. For example, the most common language
bias is that the hypothesis space uses the conjunctions (AND) of the attributes,
i.e.
H = <T, BP>
H denotes the hypothesis space; here it is the conjunction of the attributes T and
BP. If written in English it would mean:
H = <T, BP>:
IF "Temperature" = T AND "Blood Pressure" = BP
THEN
H=1
ELSE
H=0
Now if we fill in these two blanks with some particular values of T and BP, it would
form a hypothesis, e.g. for T = N and BP = N:
     BP
     H | 0 0 0
     N | 0 1 0
     L | 0 0 0
         L N H   T
For h = <L, L>:
     BP
     H | 0 0 0
     N | 0 0 0
     L | 1 0 0
         L N H   T
Notice that this is the C2 we presented before in the concept space section:
0 0 0
0 0 0
1 0 0
This means that if the true concept of SICK that we wanted to learn was C2, then
the hypothesis h = <L, L> would have been the solution to our problem. But you
must still be wondering what the use is of having separate conventions for
hypotheses and concepts, when in the end we arrived at the same thing: C2 =
<L, L> = h. Well, the advantage is that now we are not required to look at 2^9
different concepts; instead we are only going to have to look at a maximum of
17 different hypotheses before reaching the concept. We'll see in a moment
how that is possible.
We said H = <T, BP>. Now T and BP here can take three values for sure: L, N
and H, but now they can take two more values: ? and Ø. Where ? means that for
any value, H = 1, and Ø means that there will be no value for which H will be 1.
For example, h1 = <?, ?>: [For any value of T or BP, the person is sick]
Similarly h2 = <?, N>: [For any value of T AND for BP = N, the person is sick]
h1 = <?, ?>:

     BP
     H | 1 1 1
     N | 1 1 1
     L | 1 1 1
         L N H   T

h2 = <?, N>:

     BP
     H | 0 0 0
     N | 1 1 1
     L | 0 0 0
         L N H   T
h3 = < Ø, Ø >: [For no value of T or BP, the person is sick]

     BP
     H | 0 0 0
     N | 0 0 0
     L | 0 0 0
         L N H   T
Having said all this, how does this still reduce the hypothesis space to 17? Well
it's simple, now each attribute T and BP can take 5 values each: L, N, H, ? and
Ø. So there are 5 x 5 = 25 total hypotheses possible. This is a tremendous
reduction from 2^9 = 512 to 25.
But if we want to represent h4 = < Ø , L>, it would be the same as h3, meaning
that there are some redundancies within the 25 hypotheses. These redundancies
are caused by Ø: if there's an 'Ø' in the T or the BP or both, we'll have the
same hypothesis h3 as the outcome, all zeros. To calculate the number of
semantically distinct hypotheses, we need one hypothesis which outputs all
zeros, since it's a distinct hypothesis than others, so that's one, plus we need to
know the rest of the combinations. This primarily means that T and BP can now
take 4 values instead of 5, which are: L, N, H and ?. This implies that there are
now 4 x 4 = 16 different hypotheses possible. So the total distinct hypotheses
are: 16 + 1 = 17. This is a wonderful idea, but it comes at a cost: what if the
true concept doesn't lie in the conjunctive hypothesis space? This is often the
case. We can then try different hypothesis languages. Some prior knowledge
about the problem always helps.
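The counting argument above is easy to verify mechanically. This illustrative Python sketch (not part of the original text; the '?'/'Ø' encoding follows this section) enumerates all 25 syntactic hypotheses and counts how many behave differently on the nine instances:

from itertools import product

VALUES = ['L', 'N', 'H']

def classify(h, x):
    # h = (t, bp): '?' matches any value, 'Ø' matches no value.
    if 'Ø' in h:
        return 0
    return int(all(c == '?' or c == v for c, v in zip(h, x)))

instances = list(product(VALUES, repeat=2))                 # the 9 instances
hypotheses = list(product(VALUES + ['?', 'Ø'], repeat=2))   # the 25 syntactic hypotheses

# Two hypotheses are semantically the same if they label all 9 instances identically.
behaviours = {tuple(classify(h, x) for x in instances) for h in hypotheses}
print(len(hypotheses), len(behaviours))  # prints: 25 17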
7.10.4 Version space and searching
The version space is the set of all the hypotheses that are consistent with all the
training examples. When we are given a set of training examples D, it is possible
that there might be more than one hypothesis from the hypothesis space that is
consistent with all the training examples. By consistent we mean h(xi) = C(xi).
That is, if the true output of a concept [c(xi)] is 1 or 0 for an instance, then the
output by our hypothesis [h(xi)] is 1 or 0 as well, respectively. If this is true for
every instance in our training set D, we can say that the hypothesis is consistent.
Let us take the following training set D:
D     T   BP   SK
x1    H   H    1
x2    L   L    0
x3    N   N    0
One of the consistent hypotheses can be h1 = <H, H>:

     BP
     H | 0 0 1
     N | 0 0 0
     L | 0 0 0
         L N H   T

But then there are other hypotheses consistent with D, such as h2 = <H, ?>:

     BP
     H | 0 0 1
     N | 0 0 1
     L | 0 0 1
         L N H   T

Although h2 classifies some of the unseen instances that are not in the training
set D differently from h1, it is still consistent over all the instances in D. Similarly,
there is another hypothesis, h3 = <?, H>:

     BP
     H | 1 1 1
     N | 0 0 0
     L | 0 0 0
         L N H   T

Notice the change in h3 as compared to h2; this is again consistent with D.
Version space is denoted as VS_H,D = {h1, h2, h3}. This translates as: the version
space is a subset of hypothesis space H, composed of h1, h2 and h3, that is
consistent with D.
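This version space can be confirmed by brute force. The sketch below (illustrative Python under the same encoding; D is the training set from the table above) tests every semantically distinct hypothesis against D:

from itertools import product

def h_output(h, x):
    # Conjunctive hypothesis h = (t, bp): '?' matches anything, 'Ø' matches nothing.
    if 'Ø' in h:
        return 0
    return int(all(c == '?' or c == v for c, v in zip(h, x)))

# Training set D: ((T, BP), SK)
D = [(('H', 'H'), 1), (('L', 'L'), 0), (('N', 'N'), 0)]

space = list(product(['L', 'N', 'H', '?'], repeat=2)) + [('Ø', 'Ø')]
version_space = [h for h in space if all(h_output(h, x) == sk for x, sk in D)]
print(version_space)  # [('H', 'H'), ('H', '?'), ('?', 'H')], i.e. h1, h2 and h3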
7.11 Concept learning as search
Now that we are well familiar with most of the terminologies of machine learning,
we can define the learning process in technical terms as:
"We have to assume that the concept lies in the hypothesis space. So we search
for a hypothesis belonging to this hypothesis space that best fits the training
examples, such that the output given by the hypothesis is same as the true
output of concept."
In short:
Assume C ∈ H; search for an h ∈ H that best fits D,
such that ∀ xi ∈ D, h(xi) = C(xi).
The stress here is on the word 'search'. We need to somehow search through the
hypothesis space.
7.11.1 General to specific ordering of hypothesis space
Many algorithms for concept learning organize the search through the hypothesis
space by relying on a very useful structure that exists for any concept learning
problem: a general-to-specific ordering of hypotheses. By taking advantage of
this naturally occurring structure over the hypothesis space, we can design
learning algorithms that exhaustively search even infinite hypothesis spaces
without explicitly enumerating every hypothesis. To illustrate the general-to-
specific ordering, consider two hypotheses:
h1 = < H, H >
h2 = < ?, H >
Now consider the sets of instances that are classified positive by h1 and by h2.
Because h2 imposes fewer constraints on the instance, it classifies more
instances as positive. In fact, any instance classified positive by h1 will also be
classified positive by h2. Therefore, we say that h2 is more general than h1.
So all the hypotheses in H can be ordered according to their generality, starting
from < ?, ? > which is the most general hypothesis since it always classifies all
the instances as positive. On the contrary we have < Ø , Ø > which is the most
specific hypothesis, since it doesn't classify a single instance as positive.
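The ordering itself can be written down as a small predicate. Here is an illustrative Python sketch (ours, using the same tuple encoding as before) that returns True when g is more general than or equal to s:

def more_general_or_equal(g, s):
    # g must cover every instance s covers: each constraint of g is '?'
    # or agrees with the corresponding constraint of s ('Ø' covers nothing).
    for cg, cs in zip(g, s):
        if cg == '?' or cs == 'Ø':
            continue              # g is unconstrained here, or s covers nothing anyway
        if cg != cs:
            return False          # g demands a value that s does not demand
    return True

print(more_general_or_equal(('?', 'H'), ('H', 'H')))  # True: <?, H> is more general
print(more_general_or_equal(('H', 'H'), ('?', 'H')))  # False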
7.11.2 FIND-S
FIND-S finds the maximally specific hypothesis possible within the version space
given a set of training data. How can we use the general to specific ordering of
hypothesis space to organize the search for a hypothesis consistent with the
observed training examples? One way is to begin with the most specific possible
hypothesis in H, then generalize the hypothesis each time it fails to cover an
observed positive training example. (We say that a hypothesis "covers" a positive
example if it correctly classifies the example as positive.) To be more precise
about how the partial ordering is used, consider the FIND-S algorithm:
Initialize h to the most specific hypothesis in H
For each positive training instance x
    For each attribute constraint ai in h
        If the constraint ai is satisfied by x
            Then do nothing
        Else
            Replace ai in h by the next more general constraint that is satisfied by x
Output hypothesis h
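A direct Python translation of this pseudocode, for the two-attribute conjunctive hypotheses of the SICK domain, might look like this (an illustrative sketch, not part of the original course material):

def find_s(examples):
    h = ['Ø', 'Ø']                     # the most specific hypothesis: matches nothing
    for x, label in examples:
        if label != 1:
            continue                   # FIND-S simply ignores negative examples
        for i, value in enumerate(x):
            if h[i] == 'Ø':
                h[i] = value           # first generalization: adopt this example's value
            elif h[i] not in ('?', value):
                h[i] = '?'             # next more general constraint: accept any value
    return tuple(h)

D = [(('H', 'H'), 1), (('L', 'L'), 0), (('N', 'H'), 1)]
print(find_s(D))  # ('?', 'H'), matching the trace worked out below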
To illustrate this algorithm, let us assume that the learner is given the sequence
of following training examples from the SICK domain:
D     T   BP   SK
x1    H   H    1
x2    L   L    0
x3    N   H    1
The first step of FIND-S is to initialize h to the most specific hypothesis in H:
h=<Ø,Ø>
Upon observing the first training example (< H, H >, 1), which happens to be a
positive example, it becomes obvious that our hypothesis is too specific. In
particular, none of the "Ø" constraints in h are satisfied by this training example,
so each Ø is replaced by the next more general constraint that fits this particular
example; namely, the attribute values for this very training example:
h=<H,H>
This is our h after we have seen the first example, but this h is still very specific. It
asserts that all instances are negative except for the single positive training
example we have observed.
Upon encountering the second example; in this case a negative example, the
algorithm makes no change to h. In fact, the FIND-S algorithm simply ignores
every negative example. While this may at first seem strange, notice that in the
current case our hypothesis h is already consistent with the new negative
example (i.e. h correctly classifies this example as negative), and hence no
revision is needed. In the general case, as long as we assume that the
hypothesis space H contains a hypothesis that describes the true target concept
c and that the training data contains no errors and conflicts, then the current
hypothesis h can never require a revision in response to a negative example.
To complete our trace of FIND-S, the third (positive) example leads to a further
generalization of h, this time substituting a "?" in place of any attribute value in h
that is not satisfied by the new example. The final hypothesis is:
h = < ?, H >
This hypothesis will term all the future patients which have BP = H as SICK for all
the different values of T.
There might be other hypotheses in the version space but this one was the
maximally specific with respect to the given three training examples. For
generalization purposes we might be interested in the other hypotheses but
FIND-S fails to find them. Also, in real world problems the training data is often
inconsistent and contains conflicting errors. This is another drawback of
FIND-S: it assumes consistency within the training set.
7.11.3 Candidate-Elimination algorithm
Although FIND-S outputs a hypothesis from H that is consistent with the training
examples, it is just one of many hypotheses from H that might fit the
training data equally well. The key idea in the Candidate-Elimination algorithm is to
output a description of the set of all hypotheses consistent with the training
examples. This subset of all hypotheses is actually the version space with
respect to the hypothesis space H and the training examples D, because it
contains all possible versions of the target concept.
The Candidate-Elimination algorithm represents the version space by storing only
its most general members (denoted by G) and its most specific members
(denoted by S). Given only these two sets S and G, it is possible to enumerate all
members of the version space as needed by generating the hypotheses that lie
between these two sets in general-to-specific partial ordering over hypotheses.
Candidate-Elimination algorithm begins by initializing the version space to the set
of all hypotheses in H; that is by initializing the G boundary set to contain the
most general hypothesis in H, for example for the SICK problem, the G0 will be:
G0 = {< ?, ? >}
The S boundary set is also initialized to contain the most specific (least general)
hypothesis:
S0 = {< Ø , Ø >}
These two boundary sets (G and S) delimit the entire hypothesis space, because
every other hypothesis in H is both more general than S0 and more specific than
G0. As each training example is observed one by one, the S boundary is made
more and more general, whereas the G boundary set is made more and more
specific, to eliminate from the version space any hypotheses found inconsistent
with the new training example. After all the examples have been processed, the
computed version space contains all the hypotheses consistent with these
examples. The algorithm is summarized below:
Initialize G to the set of maximally general hypotheses in H
Initialize S to the set of maximally specific hypotheses in H
For each training example d, do
    If d is a positive example
        Remove from G any hypothesis inconsistent with d
        For each hypothesis s in S that is inconsistent with d
            Remove s from S
            Add to S all minimal generalizations h of s, such that
                h is consistent with d, and some member of G is more general than h
            Remove from S any hypothesis that is more general than another one in S
    If d is a negative example
        Remove from S any hypothesis inconsistent with d
        For each hypothesis g in G that is inconsistent with d
            Remove g from G
            Add to G all minimal specializations h of g, such that
                h is consistent with d, and some member of S is more specific than h
            Remove from G any hypothesis that is less general than another one in G
The Candidate-Elimination algorithm above is specified in terms of operations.
The detailed implementation of these operations will depend on the specific
problem and instances and their hypothesis space, however the algorithm can be
applied to any concept learning task. We will now apply this algorithm to our
designed problem SICK, to trace the working of each step of the algorithm. For
comparison purposes, we will choose the exact training set that was employed in
FIND-S:
D     T   BP   SK
x1    H   H    1
x2    L   L    0
x3    N   H    1
We know the initial values of G and S:
G0 = {< ?, ? >}
S0 = {< Ø, Ø >}
Now the Candidate-Elimination learner starts:
First training observation is: d1 = (<H, H>, 1) [A positive example]
G1 = G0 = {< ?, ? >}, since <?, ?> is consistent with d1; both give positive outputs.
Since S0 has only one hypothesis that is < Ø, Ø >, which implies S0(x1) = 0, which
is not consistent with d1, so we have to remove < Ø, Ø > from S1. Also, we add
minimally general hypotheses from H to S1, such that those hypotheses are
consistent with d1. The obvious choices are like <H,H>, <H,N>, <H,L>,
<N, H>, ..., <L, N>, <L, L>, but none of these except <H, H> is consistent with d1.
So S1 becomes:
S1 = {< H, H >}
G1 = {< ?, ? >}
Second training example is: d2 = (<L, L>, 0) [A negative example]
S2 = S1 = {< H, H>}, since <H, H> is consistent with d2: both give negative outputs
for x2.
G1 has only one hypothesis: < ?, ? >, which gives a positive output on x2, and
hence is not consistent, since SK(x2) = 0, so we have to remove it and add in its
place, the hypotheses which are minimally specialized. While adding we have to
take care of two things; we would like to revise the statement of the algorithm for
the negative examples:
"Add to G all minimal specializations h of g, such that
h is consistent with d, and some member of S is more specific than h"
The immediate one step specialized hypotheses of < ?, ? > are:
{< H, ? >, < N, ? >, < L, ? >, < ?, H >, < ?, N >, < ?, L >}
Out of these we have to get rid of the hypotheses which are not consistent with d2
= (<L, L>, 0). We see that all of the above listed hypotheses will give a 0
(negative) output on x2 = < L, L >, except for < L, ? > and < ?, L >, which give a 1
(positive) output on x2, and hence are not consistent with d2, and will not be
added to G2. This leaves us with {< H, ? >, < N, ? >, < ?, H >, < ?, N >}. This
takes care of the inconsistent hypotheses, but there's another condition in the
algorithm that we must take care of before adding all these hypotheses to G2. We
will repeat the statement again, this time highlighting the point under
consideration:
"Add to G all minimal specializations h of g, such that
h is consistent with d, and some member of S is more specific than h"
This is a very important condition, which is often ignored, and which results in the
wrong final version space. We know the current S we have is S2 = {< H, H >}.
Now, which of the hypotheses in {< H, ? >, < N, ? >, < ?, H >, < ?, N >} is
< H, H > more specific than? Certainly < H, H > is more specific than
< H, ? > and < ?, H >, so we remove < N, ? > and < ?, N > to get the final G2:
G2 = {< H, ? >, < ?, H >}
S2 = {< H, H>}
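The one-step specializations used in this step can be generated mechanically. An illustrative Python sketch (the value set is the one assumed throughout this example):

def minimal_specializations(g, values=('L', 'N', 'H')):
    # Replace each '?' in g, one position at a time, by every concrete value.
    specs = []
    for i, constraint in enumerate(g):
        if constraint == '?':
            for v in values:
                s = list(g)
                s[i] = v
                specs.append(tuple(s))
    return specs

print(minimal_specializations(('?', '?')))
# [('L', '?'), ('N', '?'), ('H', '?'), ('?', 'L'), ('?', 'N'), ('?', 'H')]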
Third and final training example is: d3 = (<N, H>, 1) [A positive example]
We see that in G2, < H, ? > is not consistent with d3, so we remove it:
G3 = {< ?, H >}
We also see that in S2, < H, H > is not consistent with d3, so we remove it and
add minimal generalizations of < H, H >. The two choices we have are:
< H, ? > and < ?, H >. We only keep < ?, H >, since the other one is not consistent
with d3. So our final version space is encompassed by S3 and G3:
G3 = {< ?, H >}
S3 = {< ?, H >}
It is only a coincidence that both G and S sets are the same. In bigger problems,
or even here if we had more examples, there was a chance that we'd get different
but consistent sets. These two sets of G and S outline the version space of a
concept. Note that the final hypothesis is the same one that was computed by
FIND-S.
7.12 Decision trees learning
Up until now we have been searching in conjunctive spaces which are formed by
ANDing the attributes, for instance:
IF Temperature = High AND Blood Pressure = High
THEN Person = SICK
But this is a very restrictive search, as we saw the reduction in hypothesis space
from 2^9 total possible concepts to 17. This can be risky if we're not sure if the true
concept will lie in the conjunctive space. So a safer approach is to relax the
searching constraints. One way is to involve OR into the search. Do you think
we'll have a bigger search space if we employ OR? Yes, most certainly; consider,
for example, the statement:
IF Temperature = High OR Blood Pressure = High
THEN Person = SICK
If we could use these kind of OR statements, we'd have a better chance of
finding the true concept, if the concept does not lie in the conjunctive space.
These are also called disjunctive spaces.
7.12.1 Decision tree representation
Decision trees give us disjunctions of conjunctions, that is, they have the form:
(A AND B) OR (C AND D)
In tree representation, this would translate into:
[Figure: a decision tree with two branches, one testing A and then B, the other testing C and then D, encoding (A AND B) OR (C AND D)]
where A, B, C and D are the attributes for the problem. This tree gives a positive
output if either A AND B attributes are present in the instance; OR C AND D
attributes are present. Through decision trees, this is how we reach the final
hypothesis. This is a hypothetical tree. In real problems, every tree has to have a
root node. There are various algorithms like ID3 and C4.5 to find decision trees
for learning problems.
7.12.2 ID3
ID3 stands for Iterative Dichotomiser 3; it was the third revision of the algorithm,
which won wide acclaim. The first step of ID3 is to find the root node. It uses a
special function, GAIN, to evaluate the information gain of each attribute. For
example, if there are 3 attributes, it will calculate the information gain of each.
Whichever attribute has the maximum information gain becomes the root node.
The rest of the attributes then compete for the next slots.
7.12.2.1 Entropy
In order to define information gain precisely, we begin by defining a measure
commonly used in statistics and information theory, called entropy, which
characterizes the purity/impurity of an arbitrary collection of examples. Given a
collection S, containing positive and negative examples of some target concept,
the entropy of S relative to this Boolean classification is:
Entropy(S) = - p+log2 p+ - p-log2 p-
where p+ is the proportion of positive examples in S and p- is the proportion of
negative examples in S. In all calculations involving entropy we define 0 log2 0 to
be 0.
To illustrate, suppose S is a collection of 14 examples of some Boolean concept,
including 9 positive and 5 negative examples, then the entropy of S relative to
this Boolean classification is:
Entropy(S) = - (9/14)log2 (9/14) - (5/14)log2 (5/14)
= 0.940
Notice that the entropy is 0, if all the members of S belong to the same class
(purity). For example, if all the members are positive (p+ = 1), then p- = 0 and so:
Entropy(S) = - 1log2 1 - 0log2 0
= - 1 (0) - 0
[since log2 1 = 0, also 0log2 0 = 0]
=0
Note the entropy is 1 when the collection contains equal number of positive and
negative examples (impurity). See for yourself by putting p+ and p- equal to 1/2.
Otherwise if the collection contains unequal numbers of positive and negative
examples, the entropy is between 0 and 1.
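In code, the definition is a few lines once the 0 log2 0 = 0 convention is handled. An illustrative Python sketch (ours) reproducing the values above:

from math import log2

def entropy(p_pos, p_neg):
    # By convention 0 log2 0 = 0, so zero-probability terms are skipped.
    return sum(-p * log2(p) for p in (p_pos, p_neg) if p > 0)

print(round(entropy(9/14, 5/14), 3))  # 0.94, the example above
print(entropy(1.0, 0.0))              # 0.0, a pure collection
print(entropy(0.5, 0.5))              # 1.0, maximum impurity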
7.12.2.2 Information gain
Given entropy as a measure of the impurity in a collection of training examples,
we can now define a measure of the effectiveness of an attribute in classifying
the training data. The measure we will use, called information gain, is simply the
expected reduction in entropy caused by partitioning the examples according to
this attribute. That is, if we use the attribute with the maximum information gain
as the node, then it will classify some of the instances as positive or negative with
100% accuracy, and this will reduce the entropy for the remaining instances. We
will now proceed to an example to explain further.
7.12.2.3 Example
Suppose we have the following hypothetical training data available to us given in
the table below. There are three attributes: A, B and E. Attribute A can take three
values: a1, a2 and a3. Attribute B can take two values: b1 and b2. Attribute E can
also take two values: e1 and e2. The concept to be learnt is a Boolean concept,
so C takes a YES (1) or a NO (0), depending on the values of the attributes.
S     A    B    E    C
d1    a1   b1   e2   YES
d2    a2   b2   e1   YES
d3    a3   b2   e1   NO
d4    a2   b2   e2   NO
d5    a3   b1   e2   NO
First step is to calculate the entropy of the entire set S. We know:
E(S) = - p+ log2 p+ - p- log2 p-
     = - (2/5) log2 (2/5) - (3/5) log2 (3/5) = 0.97
We have the entropy of the entire training set S with us now. We have to
calculate the information gain for each attribute A, B, and E based on this entropy
so that the attribute giving the maximum information is to be placed at the root of
the tree.
The formula for calculating the gain for A is:
G(S, A) = E(S) - (|Sa1|/|S|) E(Sa1) - (|Sa2|/|S|) E(Sa2) - (|Sa3|/|S|) E(Sa3)
where |Sa1| is the number of examples in which attribute A takes the value a1,
and E(Sa1) is the entropy of that subset, calculated from the proportions of YES
and NO values of C among the observations containing a1 for the value of A.
For example, from the table it is obvious that:
|S| = 5
|Sa1| = 1
[since there is only one observation of a1 which outputs a YES]
E(Sa1) = -1log21 - 0log20 = 0
[since log 1 = 0]
|Sa2| = 2
[one outputs a YES and the other outputs NO]
E(Sa2) = - (1/2) log2 (1/2) - (1/2) log2 (1/2) = - (1/2)(-1) - (1/2)(-1) = 1
|Sa3| = 2
[both observations of a3 output a NO]
E(Sa3) = - 0 log2 0 - 1 log2 1 = 0
[since log2 1 = 0]
Putting all these values in the equation for G(S,A) we get:
G(S, A) = 0.97 - (1/5)(0) - (2/5)(1) - (2/5)(0) = 0.57
Similarly for B, now since there are only two values observable for the attribute B:
G(S, B) = E(S) - (|Sb1|/|S|) E(Sb1) - (|Sb2|/|S|) E(Sb2)
G(S, B) = 0.97 - (2/5)(1) - (3/5)(- (1/3) log2 (1/3) - (2/3) log2 (2/3))
G(S, B) = 0.97 - 0.4 - (3/5)(0.52 + 0.39) = 0.02
Similarly for E:
G(S, E) = E(S) - (|Se1|/|S|) E(Se1) - (|Se2|/|S|) E(Se2) = 0.02
This tells us that information gain for A is the highest. So we will simply choose A
as the root of our decision tree. By doing that we'll check if there are any
conflicting leaf nodes in the tree. We'll get a better picture in the pictorial
representation shown below:
                 A
          /      |      \
        a1       a2       a3
         |        |        |
        YES  S' = [d2, d4]  NO
This is a tree of height one, and we have built this tree after only one iteration.
This tree correctly classifies 3 out of 5 training samples, based on only one
attribute A, which gave the maximum information gain. It will classify every
forthcoming sample that has a value of a1 in attribute A as YES, and each sample
having a3 as NO. The correctly classified samples are highlighted below:
S     A    B    E    C
d1    a1   b1   e2   YES   (correctly classified)
d2    a2   b2   e1   YES
d3    a3   b2   e1   NO    (correctly classified)
d4    a2   b2   e2   NO
d5    a3   b1   e2   NO    (correctly classified)
Note that a2 was not a good determinant for classifying the output C, because it
gives both YES and NO for d2 and d4 respectively. This means that now we have
to look at other attributes B and E to resolve this conflict. To build the tree further
we will ignore the samples already covered by the tree above. Our new sample
space will be given by S' as given in the table below:
S'    A    B    E    C
d2    a2   b2   e1   YES
d4    a2   b2   e2   NO
We'll apply the same process as above again. First we calculate the entropy for
this sub sample space S':
E(S') = - p+ log2 p+ - p- log2 p-
      = - (1/2) log2 (1/2) - (1/2) log2 (1/2) = 1
This gives us entropy of 1, which is the maximum value for entropy. This is also
obvious from the data, since half of the samples are positive (YES) and half are
negative (NO).
Since our tree already has a node for A, ID3 assumes that the tree will not repeat
that attribute. This makes sense: A has already divided the data as much as it
can, so it doesn't make any sense to repeat A in the intermediate nodes.
Give this a thought yourself too. Meanwhile, we will calculate the information gain
of B and E with respect to this new sample space S':
|S'| = 2
|S'b2| = 2
G(S', B) = E(S') - (|S'b2|/|S'|) E(S'b2)
G(S', B) = 1 - (2/2)(- (1/2) log2 (1/2) - (1/2) log2 (1/2)) = 1 - 1 = 0
Similarly for E:
|S'| = 2
|S'e1| = 1
[since there is only one observation of e1 which outputs a YES]
E(S'e1) = -1log21 - 0log20 = 0
[since log 1 = 0]
|S'e2| = 1
[since there is only one observation of e2 which outputs a NO]
E(S'e2) = -0log20 - 1log21 = 0
[since log 1 = 0]
Hence:
G(S', E) = E(S') - (|S'e1|/|S'|) E(S'e1) - (|S'e2|/|S'|) E(S'e2)
G(S', E) = 1 - (1/2)(0) - (1/2)(0) = 1 - 0 - 0 = 1
Therefore E gives us the maximum information gain, which is also true intuitively
since by looking at the table for S', we can see that B has only one value b2,
which doesn't help us decide anything, since it gives both, a YES and a NO.
Whereas, E has two values, e1 and e2; e1 gives a YES and e2 gives a NO. So we
put the node E in the tree which we are already building. The pictorial
representation is shown below:
                 A
          /      |      \
        a1       a2       a3
         |        |        |
        YES       E        NO
                /   \
              e1     e2
               |      |
              YES     NO
Now we will stop further iterations since there are no conflicting leaves that we
need to expand. This is our hypothesis h that satisfies each training example.
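The whole procedure generalizes to a short recursive function. Here is a minimal ID3 sketch in Python (illustrative, not the original course code; examples are tuples whose last element is the label, and attributes are referred to by index):

from math import log2

def entropy(examples):
    p = sum(1 for e in examples if e[-1] == 'YES') / len(examples)
    return sum(-q * log2(q) for q in (p, 1 - p) if q > 0)

def gain(examples, i):
    remainder = 0.0
    for v in {e[i] for e in examples}:
        subset = [e for e in examples if e[i] == v]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(examples) - remainder

def id3(examples, attrs):
    labels = {e[-1] for e in examples}
    if len(labels) == 1:
        return labels.pop()          # pure leaf: every example agrees
    if not attrs:
        # no attribute left to split on: return the majority label
        return max(labels, key=lambda l: sum(e[-1] == l for e in examples))
    a = max(attrs, key=lambda i: gain(examples, i))   # attribute with maximum gain
    return (a, {v: id3([e for e in examples if e[a] == v],
                       [i for i in attrs if i != a])
                for v in {e[a] for e in examples}})

S = [('a1', 'b1', 'e2', 'YES'), ('a2', 'b2', 'e1', 'YES'),
     ('a3', 'b2', 'e1', 'NO'), ('a2', 'b2', 'e2', 'NO'),
     ('a3', 'b1', 'e2', 'NO')]
print(id3(S, [0, 1, 2]))
# (0, {'a1': 'YES', 'a2': (2, {'e1': 'YES', 'e2': 'NO'}), 'a3': 'NO'})  (key order may vary)

This reproduces the tree built by hand above: A at the root, a pure YES leaf under a1, a pure NO leaf under a3, and a further split on E under a2.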
7.13 LEARNING: Connectionist
Although ID3 spans more of the concept space, there is still a possibility that the
true concept is not simply a mixture of disjunctions of conjunctions, but some
more complex arrangement of attributes.
Artificial Neural Networks (ANNs) can compute more complicated functions,
ranging from linear to any higher order quadratic, especially for non-Boolean
concepts. This new learning paradigm takes its roots from a biology-inspired
approach to learning. It is primarily a parallel distributed computing network in
which the focus of algorithms is on training rather than explicit programming.
Tasks for which the connectionist approach is well suited include:
·  Classification
   o  Fruits: apple or orange
·  Pattern recognition
   o  Fingerprint, face recognition
·  Prediction
   o  Stock market analysis, weather forecast
7.14 Biological aspects and structure of a neuron
The brain is a collection of about 100 billion interconnected neurons. Each
neuron is a cell that uses biochemical reactions to receive, process and transmit
information. A neuron's dendritic tree is connected to a thousand neighboring
neurons. When one of those neurons fires, a positive or negative charge is
received by one of the dendrites. The strengths of all the received charges are
added together through the processes of spatial and temporal summation. Spatial
summation occurs when several weak signals are converted into a single large
one, while temporal summation converts a rapid series of weak pulses from one
source into one large signal. The aggregate input is then passed to the soma (cell
body). The soma and the enclosed nucleus don't play a significant role in the
processing of incoming and outgoing data. Their primary function is to perform
the continuous maintenance required to keep the neuron functional. The part of
the soma that does concern itself with the signal is the axon hillock. If the
aggregate input is greater than the axon hillock's threshold value, then the neuron
fires, and an output signal is transmitted down the axon. The strength of the
output is constant, regardless of whether the input was just above the threshold,
or a hundred times as great. The output strength is unaffected by the many
divisions in the axon; it reaches each terminal button with the same intensity it
had at the axon hillock. This uniformity is critical in an analogue device such as a
brain where small errors can snowball, and where error correction is more difficult
than in a digital system. Each terminal button is connected to other neurons
across a small gap called a synapse.
7.14.1 Comparison between computers and the brain
                     Biological Neural Networks               Computers
Speed                Slow (milliseconds)                      Fast (nanoseconds)
Processing           Superior (massively parallel)            Inferior (sequential mode)
Size & complexity    10^11 neurons, 10^15 interconnections    Far fewer processing elements
Storage              Adaptable, interconnection strengths     Strictly replaceable
Fault tolerance      Extremely fault tolerant                 Inherently non fault tolerant
Control mechanism    Distributive control                     Central control
While this clearly shows that the human information processing system is
superior to conventional computers in many respects, it is still possible to realize
an artificial neural network which exhibits the above mentioned properties. We'll
start with a single perceptron, which builds on pioneering work done in 1943 by
McCulloch and Pitts.
7.15 Single perceptron
To capture the essence of biological neural systems, an artificial neuron is
defined as follows:
It receives a number of inputs (either from original data, or from the output of
other neurons in the neural network). Each input comes via a connection that has
a strength (or weight); these weights correspond to synaptic efficacy in a
biological neuron. Each neuron also has a single threshold value. The weighted
sum of the inputs is formed, and the threshold subtracted, to compose the
activation of the neuron. The activation signal is passed through an activation
function (also known as a transfer function) to produce the output of the neuron.
[Figure: a single perceptron; Input1 and Input2 are each multiplied by a weight, summed together with a bias, and passed through a threshold function to produce the output]
Neuron Firing Rule:
IF (Input1 x weight1) + (Input2 x weight2) + (Bias) satisfies Threshold value
Then Output = 1
Else Output = 0
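In Python, the firing rule above is just a weighted sum and a comparison. An illustrative sketch (ours; the particular numbers are arbitrary):

def perceptron_output(inputs, weights, bias, threshold=0.0):
    # Weighted sum of the inputs plus the bias, passed through a hard threshold.
    activation = sum(i * w for i, w in zip(inputs, weights)) + bias
    return 1 if activation >= threshold else 0

print(perceptron_output([1, 0], [0.5, 0.5], -0.2))  # 1: 0.5 - 0.2 reaches the threshold
print(perceptron_output([0, 0], [0.5, 0.5], -0.2))  # 0: the neuron does not fire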
7.15.1 Response of changing bias
The response of changing the bias of a neuron results in shifting the decision line
up or down, as shown by the following figures taken from MATLAB.
[Figure: the same decision line plotted for several bias values, shifted up and down]
7.15.2 Response of changing weight
The change in weight results in the rotation of the decision line. Hence this up
and down shift, together with the rotation of the straight line can achieve any
linear decision.
7.16 Linearly separable problems
There is a whole class of problems which are termed as linearly separable. This
name is given to them, because if we were to represent them in the input space,
we could classify them using a straight line. The simplest examples are the
logical AND or OR. We have drawn them in their input spaces, as this is a simple
2D problem. The upper sloping line in the diagram shows the decision boundary
for AND gate, above which, the output is 1, below is 0. The lower sloping line
decides for the OR gate similarly.
Input 1   Input 2   AND
0         0         0
0         1         0
1         0         0
1         1         1

Input 1   Input 2   OR
0         0         0
0         1         1
1         0         1
1         1         1

[Figure: the four points (0,0), (0,1), (1,0), (1,1) plotted in the Input 1 / Input 2 plane, with an upper sloping line as the AND decision boundary and a lower sloping line as the OR decision boundary]
A single perceptron simply draws a line, which becomes a hyperplane when the
data is more than 2-dimensional. Sometimes there are complex problems (as is
the case in real life). The data for these problems cannot be separated into their
respective classes by using a single straight line. These problems are not linearly
separable.
[Figure: a linearly separable data set beside a linearly non-separable one, the latter requiring a non-linear decision region]
Another example of a linearly non-separable problem is the XOR gate (exclusive OR). It shows how even a small data set of just 4 rows can make it impossible to draw a single-line decision boundary separating the 1s from the 0s.
Input 1   Input 2   Output
0         0         0
0         1         1
1         0         1
1         1         0

[Figure: the XOR points in the input space; (0,1) and (1,0) belong to class 1, while (0,0) and (1,1) belong to class 0]
Can you draw one line which separates the ones from zeros for the output?
We need two lines:
[Figure: two lines in the input space that together separate (0,1) and (1,0) from (0,0) and (1,1)]
A single-layer perceptron can perform pattern classification only on linearly separable patterns, regardless of the type of non-linearity (hard limiter, sigmoidal). Papert and Minsky in 1969 illustrated the limitations of Rosenblatt's single-layer perceptron (e.g. the requirement of linear separability and the inability to solve the XOR problem) and cast doubt on the viability of neural networks. However, the multi-layer perceptron and the back-propagation algorithm overcome many of the shortcomings of the single-layer perceptron.
7.17 Multiple layers of perceptrons
Just as in the previous example we saw that XOR needs two lines to separate
the data without incorporating errors. Likewise, there are many problems which
need to have multiple decision lines for a good acceptable solution. Multiple layer
perceptrons achieve this task by the introduction of one or more hidden layers.
Each neuron in the hidden layer is responsible for a different line. Together they
form a classification for the given problem.
[Figure: a multi-layer perceptron with an input layer (Input 1, Input 2 and a constant bias input of 1), one hidden layer, and an output layer]
Each neuron in the hidden layer forms a different decision line. Together the lines can construct arbitrary non-linear decision boundaries. These multi-layer perceptrons are the most basic artificial neural networks.
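As an illustration, XOR can be solved by hand-picking the weights of a 2-2-1 network (again using the fire() sketch from earlier) so that each hidden neuron draws one of the two lines and the output neuron combines them. These particular values are our own; training would find equivalent ones:

inputs = [0 0; 0 1; 1 0; 1 1];
xorOut = zeros(1,4);
for i = 1:4
    h1 = fire(inputs(i,:), [1 1], -0.5);      % lower line: fires above x1 + x2 = 0.5
    h2 = fire(inputs(i,:), [1 1], -1.5);      % upper line: fires above x1 + x2 = 1.5
    xorOut(i) = fire([h1 h2], [1 -1], -0.5);  % output: h1 AND (NOT h2)
end
disp([inputs xorOut'])   % expected final column: 0 1 1 0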
7.18 Artificial Neural Networks: supervised and unsupervised
A neural network is a massively parallel distributed computing system that has a
natural propensity for storing experiential knowledge and making it available for
use. It resembles the brain in two respects:
·  Knowledge is acquired by the network through a learning process (called
training)
·  Interneuron connection strengths known as synaptic weights are used to
store the knowledge
Knowledge in the artificial neural networks is implicit and distributed.
Advantages
·  Excellent for pattern recognition
·  Excellent classifiers
·  Handles noisy data well
·  Good for generalization
Drawbacks
·  The power of ANNs lies in their parallel architecture
   - Unfortunately, most machines we have are serial (Von Neumann architecture)
·  Lack of defined rules to build a neural network for a specific problem
   - Too many variables, for instance the learning algorithm, number of neurons per layer, number of layers, data representation etc.
·  Knowledge is implicit
·  Data dependency
But all these drawbacks don't mean that neural networks are useless artifacts. They are still arguably very powerful general-purpose problem solvers.
7.19 Basic terminologies
·  Number of layers
   o Single layer networks
   o Multilayer networks
[Figure: a single-layer network (only one input and one output layer) beside a two-layer network (one input, one hidden and one output layer)]
·  Direction of information (signal) flow
   o Feed-forward
   o Recurrent (feed-back)
[Figure: a feed-forward network beside a recurrent network]
·  Connectivity
   o Fully connected
   o Partially connected
[Figure: a fully connected network beside a partially connected one]
·  Learning methodology
   o Supervised: given a set of example input/output pairs, find a rule that does a good job of predicting the output associated with a new input.
   o Unsupervised: given a set of examples with no labeling, group them into sets called clusters.
Knowledge is not explicitly represented in ANNs; it is primarily encoded in the weights of the neurons within the network.
7.20 Design phases of ANNs
Feature Representation
·  The number of features is determined by the number of inputs for the problem. In many machine learning applications there is a huge number of features:
   o Text Classification (# of words)
   o Gene Arrays for DNA classification (5,000-50,000)
   o Images (512 x 512)
·  These large feature spaces make algorithms run slower. They also make the training process longer. The solution lies in finding a smaller feature space which is a subset of the existing features.
·  The feature space should show discrimination between the classes of the data. A patient's height, for example, is not a useful feature for classifying whether he is sick or healthy.
Training
·  Training is either supervised or unsupervised.
·  Remember when we said: we assume that the concept lies in the hypothesis space, so we search for a hypothesis belonging to this hypothesis space that best fits the training examples, such that the output given by the hypothesis is the same as the true output of the concept.
·  Finding the right hypothesis is the goal of the training session. Neural networks are thus doing function approximation, and training stops when the network has found the closest possible function, the one giving the minimum error over all the instances.
·  Training is the heart of learning; the objective is to find the best hypothesis that covers most of the examples. Learning is done simply by adjusting the weights of the network.
[Figure: the training loop: input, weighted sum of input, activation function, similarity measure, weight updation]
Similarity Measurement
·  A measure that tells the difference between the actual output of the network during training and the desired labeled output
·  The most common technique for measuring the total error in each iteration (epoch) of the neural network is the Mean Squared Error (MSE).
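In MATLAB, if desired holds the labeled outputs and actual holds the network outputs for one epoch (the matrices below are dummy values for illustration), the MSE is simply:

desired = [1 0; 0 1; 1 0];
actual  = [0.9 0.2; 0.1 0.8; 0.7 0.4];
mse = mean((desired(:) - actual(:)).^2)   % average of the squared differences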
Validation
·  During training, the training data is divided into k data sets; k-1 sets are used for training, and the remaining set is used for cross-validation. This ensures better results and avoids over-fitting. A minimal sketch of the split follows.
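The sizes below are illustrative, and the sketch assumes N is divisible by k:

N = 21; k = 7;
idx = reshape(randperm(N), k, []);   % k folds, N/k randomly chosen indices each
for fold = 1:k
    valIdx   = idx(fold, :);         % held-out fold for cross validation
    trainIdx = setdiff(1:N, valIdx); % remaining k-1 folds for training
    % ... train on the trainIdx rows, measure the error on the valIdx rows ...
end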
Stopping Criteria
·  One criterion is the MSE itself: we define a low threshold, usually 0.01, which stops the training once it is reached.
·  Another stopping criterion is the number of epochs, which defines the maximum number of times the data can be presented to the network for learning.
Application Testing
·  A network is said to generalize well when the input-output relationship computed by the network is correct (or nearly so) for input-output patterns (test data) never used in creating and training the network.
7.21 Supervised
Given a set of example input/output pairs, find a rule that does a good job of
predicting the output associated with a new input.
7.21.1 Back propagation algorithm
1. Randomize the weights {ws} to small random values (both positive and negative)
2. Select a training instance t, i.e. the vector {xi(t)}, i = 1,...,Ninp (a pair of input and output patterns), from the training set
3. Apply the network input vector to the network input
4. Calculate the network output vector {zk(t)}, k = 1,...,Nout
5. Calculate the errors for each of the outputs k, k = 1,...,Nout: the difference between the desired output and the network output
6. Calculate the necessary updates for the weights, Δws, in a way that minimizes this error
7. Adjust the weights of the network by Δws
8. Repeat these steps for each instance (pair of input-output vectors) in the training set until the error for the entire system falls below an acceptable threshold
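The following MATLAB sketch implements these steps for a small 2-2-1 sigmoid network learning XOR. The architecture, learning rate and epoch count are illustrative choices rather than values prescribed by the text, and since the weights start at random, an occasional run may settle in a poor local minimum.

X = [0 0; 0 1; 1 0; 1 1];                       % training instances
T = [0; 1; 1; 0];                               % desired outputs (XOR)
W1 = rand(2,2) - 0.5; b1 = rand(1,2) - 0.5;     % step 1: small random weights
W2 = rand(2,1) - 0.5; b2 = rand(1,1) - 0.5;
eta = 0.5;                                      % learning rate
sig = @(a) 1 ./ (1 + exp(-a));                  % sigmoid activation function
for epoch = 1:10000
    for t = 1:4                                 % step 2: select an instance
        x = X(t,:);                             % steps 3-4: forward pass
        h = sig(x*W1 + b1);
        z = sig(h*W2 + b2);
        err = T(t) - z;                         % step 5: output error
        dz = err .* z .* (1 - z);               % step 6: updates that reduce the error
        dh = (dz * W2') .* h .* (1 - h);
        W2 = W2 + eta * h' * dz; b2 = b2 + eta * dz;   % step 7: adjust the weights
        W1 = W1 + eta * x' * dh; b1 = b1 + eta * dh;
    end
end
round(sig(sig(X*W1 + repmat(b1,4,1))*W2 + b2))  % should print the column 0 1 1 0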
7.22 Unsupervised
·  Given a set of examples with no labeling, group them into sets called clusters
·  A cluster represents some specific underlying patterns in the data
·  Useful for finding patterns in large data sets
·  Form clusters of input data
·  Map the clusters into outputs
·  Given a new example, find its cluster, and generate the associated output
7.22.1 Self-organizing neural networks: clustering, quantization, function approximation, Kohonen maps
1. Each node's weights are initialized
2. A data input from training data (vector) is chosen at random and presented
to the cluster lattice
3. Every cluster centre is examined to calculate which weights are most like
the input vector. The winning node is commonly known as the Best
Matching Unit (BMU)
4. The radius of the neighborhood of the BMU is now calculated. Any nodes
found within this radius are deemed to be inside the BMU's neighborhood
5. Each neighboring node's (the nodes found in step 4) weights are adjusted
to make them more like the input vector. The closer a node is to the BMU,
the more its weights get altered
6. Repeat steps 2-5 for N iterations
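A compact MATLAB sketch of these steps for a one-dimensional lattice of cluster centres learning from 2-D data; the dummy data, lattice size, and the shrinking radius and decaying learning-rate schedules are all illustrative choices of our own:

data = rand(100, 2);                     % unlabeled training vectors (dummy data)
M = 5;                                   % number of nodes in the lattice
W = rand(M, 2);                          % step 1: initialize each node's weights
radius0 = 2; eta0 = 0.5; N = 1000;
for it = 1:N
    x = data(randi(100), :);                                % step 2: random input
    [minDist, bmu] = min(sum((W - repmat(x, M, 1)).^2, 2)); % step 3: Best Matching Unit
    radius = max(1, round(radius0 * (1 - it/N)));           % step 4: shrinking neighborhood
    eta = eta0 * (1 - it/N);                                % decaying learning rate
    for j = max(1, bmu-radius):min(M, bmu+radius)           % step 5: adjust the neighbors
        d = abs(j - bmu);                                   % closer nodes move more
        W(j,:) = W(j,:) + eta * exp(-d^2/2) * (x - W(j,:));
    end
end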
7.23 Exercise
1) We will change the problem size for SICK a little bit. If T can take on 4
values, and BP can take 5 values. For conjunctive bias, determine the
size of instance space and hypothesis space.
2) Is the following concept possible through a conjunctive or disjunctive
hypothesis? ( T AND BP ) or ( T OR BP )

   BP
   H   1 0 0
   N   0 1 0
   L   0 0 1
       L N H   T
Appendix - MATLAB CODE
makeTrainData.m
% Build the training set: 21 binary 10x10 character images (7 each of
% alif, bay and jeem), each flattened into a 100-element row vector.
letters = {'alif', 'bay', 'jeem'};
trainData = zeros(21, 100);
for L = 1:3
    for i = 1:7
        filename = strcat(letters{L}, int2str(i), '.bmp');
        tempImage = imread(filename);
        trainData((L-1)*7 + i, :) = reshape(tempImage, 1, 100);
    end
end
% One-of-three target coding: column 1 = alif, column 2 = bay, column 3 = jeem
targetData = zeros(21, 3);
targetData(1:7, 1) = 1;
targetData(8:14, 2) = 1;
targetData(15:21, 3) = 1;
save 'trainData' trainData targetData;
makeTestData.m
% Build the test set: 9 images (3 per character class), flattened the
% same way as the training data.
letters = {'alif', 'bay', 'jeem'};
testData = zeros(9, 100);
for L = 1:3
    for i = 1:3
        filename = strcat(letters{L}, int2str(i), '.bmp');
        tempImage = imread(filename);
        testData((L-1)*3 + i, :) = reshape(tempImage, 1, 100);
    end
end
targetData = zeros(9, 3);
targetData(1:3, 1) = 1;
targetData(4:6, 2) = 1;
targetData(7:9, 3) = 1;
save 'testData' testData targetData;
trainNN.m
% Create and train a feed-forward back-propagation network with 10 hidden
% and 3 output neurons, tansig activations in both layers.
load 'trainData.mat';
minMax = [min(trainData); max(trainData)]';      % per-feature input ranges required by newff
bpn = newff(minMax, [10 3], {'tansig' 'tansig'});
bpn.trainParam.epochs = 15;                      % stopping criterion: maximum epochs
bpn.trainParam.goal = 0.01;                      % stopping criterion: MSE threshold
bpn = train(bpn, trainData', targetData');       % newff/train expect one column per instance
save 'bpnNet' bpn;
testNN.m
% Measure classification accuracy, first on the training set, then on the
% unseen test set (generalization).
load('trainData');
load('bpnNet');
Y = sim(bpn, trainData');            % network outputs, one column per instance
[X, I] = max(Y);                     % I holds the index of the winning output neuron
errorCount = 0;
for i = 1:length(targetData)
    if ceil(i/7) ~= I(i)             % rows 1-7 are alif, 8-14 bay, 15-21 jeem
        errorCount = errorCount + 1;
    end
end
percentageAccuracyOnTraining = (1 - (errorCount/length(targetData))) * 100
%%%%%%%%%%%%%%%%%%%%%%%%%%
load('testData');
Y = sim(bpn, testData');
[X, I] = max(Y);
errorCount = 0;
for i = 1:length(targetData)
    if ceil(i/3) ~= I(i)             % the test set has 3 instances per class
        errorCount = errorCount + 1;
    end
end
percentageAccuracyOnTesting = (1 - (errorCount/length(targetData))) * 100