Geometric Models
Geometric models use the concept of instance space The most obvious example of geometric models is when all the features are numerical and can become coordinates in a Cartesian coordinate system. When we only have two or three features they are easy as visualise. However since many machine learning problems have hundreds or thousands of features and therefore dimensions, visualising these spaces is impossible. However many of the geometric concepts, such as linear transformations, still apply in this hyper space.This can help us better understand our models. For instance we expect many learning algorithms will be translation invariant, that is it does not matter where we place the origin in the coordinate system. Also we can use the geometric concept of euclidean distance to measure similarity between instances, this gives us a method to cluster like instances and form a decision boundary between them.
Supposing we are using our linear classifier to classify paragraphs as either happy or sad and we have devised a set of tests. Each test is associated with a weight ,w, to determine how much each test contributes to the overall result.
We can simply sum each test, multiplied by its weight, to get an overall score and create a decision rule that will create a boundary, say if the happy score is greater than a threshold t.
Each feature contributes independently to the overall result, hence the rules linearity. This contribution depends on each features relative weights. These weights can be positive or negative and each individual feature is not subject to the threshold in calculating the overall score.
We can rewrite this sum using vector notation using w for a vector of weights (w_{1},w_{2},...,w_{n}) and
x for a vector of test results (x_{1},x_{2},...x_{n}). Also if we make it an equality we can define the decision boundary:
w . x = t
We can think of w as a vector pointing between the 'centres of mass' of the positive (happy) examples, P , and the negative examples, N_{.} We can calculate these centres of mass by averaging:
Our aim now is to create a decision boundary half way between these centres of mass. We can see that w is proportional, or equal to, P  N and that (P + N)/ 2 will be on the decision boundary. So we can write :
fig Decision boundary
Insert image B05198_01_08.png
In practice real data is noisy and not necessarily that easily separable. Even when data is easily separable a particular decision boundary may not have much meaning. Consider data that is sparse, such as in text classification where the number of words is large compared to the number of instances of each word. In this large area of empty instance space it may be easy to find a decision boundary but which is the best one? One way is to use a margin to measure the distance between the decision boundary and its closest instance. We will explore these techniques later in the book.
Probabilistic Models
A typical example of a probabilistic model is the Bayesian classifier. Given some training data, D, and a probability based on an initial training set, a particular hypothesis, h, called the posteriori probability, P (h/D).
As an example consider we have a bag of marbles. We know that 40% of them are red and 60% are blue. We also know that of the red marbles half of them have flecks of white and all the blue marbles have flecks. When we reach into the bag to select a marble we can feel by its texture that it has flecks. What are the chances it is red?
Let P(RF) = the probability that a randomly drawn marble with flecks is red.
P(FR) = the probability of a red marble with flecks = 0.5
P(R) = the probability a marble is red = 0.4
P(F) = the probability that a marble has flecks = 0.5 x 0.4 + 1 x 0.6= 0.8
Probabilistic models allow us to allow us to explicitly calculate probabilities, rather than just a binary true or false. As we know the key thing we need to do is create a model that maps or features variable to a target variable. When we take a probabilistic approach we assume there is an underlying random process that creates a well defined, but unknown, probability distribution.
Consider a spam detector. Our feature variables, X, could consist of a set of words that indicate the email might be spam, The target variable Y is the instance class: spam or ham. We are interested in the conditional probability of Y given X. For each email instance there will be a feature vector, X, consisting of boolean values representing the presence of our spam words. We are trying to find Y our target boolean representing spam or not spam.
Now consider that we have two words, x_{1} and x_{2} , that constitute our feature vector X. From our training set we can construct a table such as the following:
P(Y = spam x_{1 }, x_{2}) 
P(Y = not spam x_{1 }, x_{2}) 

P(Y x_{1 }= 0, x_{2 } = 0) 
0.1 
0.9 
P(Y x_{1 }= 0, x_{2 } = 1) 
0.7 
0.3 
P(Y x_{1 }= 1, x_{2 } = 0) 
0.4 
0.6 
P(Y x_{1 }= 1, x_{2 } = 1) 
0.8 
0.2 
Table 1.1
We can see that once we begin adding more words to our feature vector it will quickly grow unmanageable. With a feature vector of n size we will have 2^{n }cases to distinguish. Fortunately there are other methods to deal with this problem as we shall see later.
The probabilities in the above table are known as posterior probabilities. These are used when we have knowledge from a prior distribution. For instance that one in ten emails is spam. However consider a case where we may know that X contains x_{2}= 1 but we are unsure of the value of x_{1 }This instance could belong in row two where the probability of it being spam is 0.7, or in row 4 where the probability is 0.8. The solution is to average these two rows using the probability of x_{1} =1 in any instance. That is the probability that a word, x_{1}, will appear in any email, spam or not.
P(Yx_{2}=1) = P(Yx_{1}=0,x_{2}=1)P(x_{1}=0) + P(x_{1}=1,x_{2}=1)P(x_{1}=1)
This is called a likelihood function. If we know, from a training set, that : P(x_{1}=1) = 0.1 and therefore P(x_{1}=0) = 0.9. So we can calculate the probability that an email contains the spam word : 0.7 * 0.9 + 0.8 * 0.1 = 0.71
This is an example of a likelihood function: P(XY). So why do want to know the probability of X , something we all ready know, conditioned on Y something we know nothing about. A way to look at this is to consider the probability of any email containing a particular random paragraph, say the the 127^{th} paragraph of War and Peace. Clearly this probability is small regardless if the email is spam or not. What we are really interested in is not the magnitude of these likelihoods but their ratio. How much more likely is an email containing a particular combination of words likely to be spam or not spam. These are sometimes called generative models because we can sample across all the variables involved.
We can use Bayes rule to transform between prior distributions and a likelihood function:
P(Y) is the prior probability, how likely is each class before having observed X. Similarly P(X) is the probability without taking into account Y. If we have only two classes we can work with ratios. For instance if we want to know how much the data favours each class.:
If the odds are less than one we assume that the class in the denominator the most likely. If it is greater than one then the class in the enumerator is the most likely. If we use the data from table 1.1 we calculate the following posterior odds.
The likelihood function is important in machine learning because it creates a generative model. If we know the probability distribution of each word in a vocabulary, together with the likelihood of each one appearing in either a spam or not spam email we can generate a random spam email according to the conditional probability P(XY = spam).
Logical Models
Logical models are based on algorithms. They can be translated into a set of formal rules that can be understood by humans. For example : if x_{1} and x_{2} = 1 then Class Y = spam. These are called decision rules
These logical rules can be organised into a tree structure. In fig we see that the instance space is iteratively partitioned at each branch. The leaves consist of rectangular areas (or hyper rectangles in the case of higher dimensions) representing segments of the instance space. Depending on the task we are solving the leaves are labelled with a class, probability, real number and so on.
Fig Feature tree
Feature trees are very use full in representing machine learning problems, even those that, at first sight , do not appear to have a tree structure. For instance in the Bayes classifier in the previous section we can partition our instance space in to as many regions as there are combinations of feature values. Decision tree models often employ a pruning technique to delete branches that give an incorrect result. In chapter 3 we will look at a number of ways to represent decision trees in python
Note that decision rules may overlap and make contradictory predictions.
They are then said to be logically inconsistent. Rules can also be incomplete in when they do not take in to account all the coordinates in the feature space._{ }There are a number of ways we can address these issues and we will look at these in detail later in the book.
Since tree learning algorithms usually work in a top down manner the first task is to find a good feature to split on at the top of the tree. We need to find a split that will result in a higher degree of purity in subsequent nodes. By purity I mean the degree to which training examples all belong to the same class. As we descend down the tree, at each level we find the training examples at each node increase in purity, that is they increasingly become separated into their own classes until we reach the leaf where all examples belong to the same class.
To look at this another way we are interested in lowering the entropy of subsequent nodes in our decision tree. Entropy, a measure of disorder, is high at the top of the tree (the root) and is progressively lowered at each node as the data is divided up into its respective classes.
In more complex problems, those with larger feature sets and decision rules, finding the optimum splits is sometimes not possible, at least not in acceptable amount of time. We are really interested in creating the shallowest tree to reach our leafs in the shortest path. The time it takes to analyse each node grows exponentially with each additional feature, so the optimum decision tree may take longer to find than actually using a sub optimum tree to perform the task.
An important property of logical models is that they can, to some extent, provide an explanation for their predictions. For example consider the predictions made by a decision tree. By tracing the path from leaf to root we can determine the conditions that resulted in the final result. This is one of the advantages of logical models in that they can be inspected by a human to reveal more about the problem.
Features
In the same way that decisions in real life are only as good as the information available to us, in a machine learning task the model is only as good as its features. Mathematically features are a functions that maps from the instance space to a set of values in a particular domain. In machine learning most measurements we make are numerical therefore the most common feature domain is the set of real numbers. Other common domains include boolean, true or false, or integers, say when we are counting the occurrence of a particular feature, or finite sets such as a set of colours or shapes.
Models are defined in terms of their features. Also single features can be turned into a model, known a s a univariate model. We can distinguishes between two use of features, related to the distinction between grouping and grading.
Firstly we can group our features by zooming into an area in the instance space. Let f be a feature counting the number of occurrences of a word, x_{1}, in an email X. We can set up conditions such as :
f(X) = 0, emails that do not contain x_{1, , }f(X) > 0, emails that contain x_{1} one or more times and so on. These conditions are called binary splits because they divide the instance space in to two groups. Those that satisfy the condition and those that don't. We can also split the instance space into more than two segments to create non binary splits. For instance where f(X) = 0 ; 0 < F(X) < 5 ; F(X) > 5 and so on.
Secondly we can grade our features, to calculate the independent contribution each one makes to the overall result. Recall our simple linear classifier. The decision rule of the form:
Since this rule is linear each feature makes an independent contribution to the score of an instance. This contribution depends on w_{i} . If it is positive then an positive x_{i} will increase the score. If w_{i} is negative a positive x_{i} decreases the score. If w_{i} is small or zero then the contribution it makes to the overall result is negligible. It can be seen that the features make a measurable contribution to the final prediction.
These two uses of features, as splits (grouping) and predictors (grading) can be combined into one model. A typical example occurs when want to approximate a non linear function say: y sin π x on the interval 1 < x < 1. Clearly simple linear model will not work. Of course the simple answer is to split the x axis into 1 < x 0 and 0 < On each of these segments we can find a reasonable linear approximation.
Fig Using grouping and grading
Insert image B05198_01_10.png
A lot of work can be done improving our models performance by feature construction and transformation. In most machine learning problems the features are not necessarily explicitly available. They need to be construction from raw data sets and then transformed into something our model can make use of. This is especially important in problems like text classification. In our simple spam example we used what is known as a bag of words representation because it disregards the order of the words. However by doing this we loose important information about the meaning of the text.
An important part of feature construction is discretisation. We can sometimes extract more information, or information that is more relevant to our task by dividing features into relevant chunks. For instance supposing our data consists of a list of peoples precise income and we are trying to determine if there is a relationship between financial income and the suburb lived in. Clearly it would be appropriate if our feature set did not consist of precise incomes but ranges of income. Although strictly speaking we loose information, if we choose our intervals appropriately we will not loose information related to our problem, and our model will perform better and give us results that are more easily interpreted.
This highlights the major tasks of feature selection, separating the signal from the noise.
Real world data will invariable contain a lot of information we do not need as well as just plain random noise, and separating the, perhaps small, part of the data that is relevant to our needs is important to the success of our model. It is of course important that we do not throw out information that may be important to us.
Often our features will be non linear, and linear regression may not give us good results. A trick is to transform the instance space itself. Supposing we have data such as that shown in figure below Clearly linear regression only gives us a reasonable fit, shown in the figure on the left. However we can improve this result if we square the instance space, that is we make x = x^{2} and y = y^{2 }, shown in the figure on the right.
Variance = .92 Variance = .97
Fig Transforming the instance space
Insert image B05198_01_11.png
We can go further and use a technique called the kernel trick. The idea is that we can create a higher dimensional implicit feature space. Pairs of data points are mapped from the raw data set to this higher dimensional space via a specified function, sometimes called a similarity function.
For instance:
Let x_{1} = (x_{1}, y_{1}) and x_{2} = (x_{2}, y_{2})
We create a 2D to 3D mapping:
The points in the 3D space corresponding to the 2D points x_{1} and x_{2} are:
and
Now the dot product of these two vectors is:
We can see that by squaring the dot product in the original 2D space we obtain the dot product in the 3D space without actually creating the feature vectors themselves. Here we have defined the kernel k(x_{1},x_{2}) = (x_{1 }_{•}x_{2})^{2. } Calculating the dot product in a higher dimensional space is often computationally cheaper and, as we will see, this technique is used quite widely in machine learning from support vector machines (SVM), principle component analysis (PCA) and correlation analysis.
The basic linear classifier we looked at earlier defines a decision boundary w • x = t . The vector, w, is equal to the difference between the mean of the positive example with the mean of the negative examples. pn. Suppose we have points n= (0,0) and p = (0,1) . Lets assume that we have obtained a positive mean from two training examples p_{1} = (1,1) and p_{2} = (1,1). Therefore we have:
We can now write the decision boundary as:
Using the kernel trick we can obtain the following decision boundary:
With the kernel we defined earlier:
We can now derive the decision boundary:
This is simply a circle around the origin with a radius √t
Using a kernel allows us to classify each new instance by evaluating it paired with each training example, rather than on the mean of positive and negative example as is the case for the basic linear classifier. This allows us to create more flexible, non linear, decision boundaries.
A very interesting and important aspect is the interaction between features. One form of interaction is correlation. For example words in a blog post, where we perhaps might expect there to be positive correlation between the words 'winter' and 'cold', and a negative correlation between winter and hot. What this means for your model depends on your task. If you are doing a sentiment analysis, you might want to consider reducing the weights of each word if they appear together, since the addition of another correlated word would be expected to contribute marginally less weight to the over all result than if that word appeared by itself.
Also with regards to sentiment analysis we often need to transform certain features to capture their meaning. For example the phrase 'not happy' contains a word that would, if we just used 1grams, contribute to a positive sentiment score when its sentiment is clearly negative. A solution, apart from using 2grams which may unnecessarily complicate the model, would be to recognise when these two words appear in a sequence and create a new feature 'not_happy' with an associated sentiment score.
Selecting and optimising features is work well spent. It can be a significant part of the design of learning systems. This iterative nature of design flips between two phases. Firstly understanding the properties of the phenomena you are studying and secondly testing your ideas with experimentation. This experimentation gives us deeper insight into the phenomena, allow us to optimise our features, gain deeper understanding and so on until we are satisfied our model gives us an accurate reflection of reality.
Summary
So far we have introduced a broad crosssection of machine learning problems, techniques and concepts. Hopefully by now you have an idea about how to begin tackling a new and unique problem by breaking it up into its components. We have reviewed some of the essential mathematics and explored ways to visualise our designs. We can see that the same problem can have many different representations and each one may highlight different aspects. Before we can begin modelling we need a well defined objective phrased in terms of a specific, feasible and meaningful question. We need to be clear how we can phrase the question in a way a machine can understand.
We have seen that the design process, although consisting of different, distinct activities, is not necessarily a linear process, but more an iterative one. We cycle through each particular phase, proposing and testing ideas until we feel can jump to the next phase. Sometimes we may jump back to a previous stage. We may sit at an equilibrium point, waiting for a particular event to occur, we may cycle through stages, or go through several stages in parallel.
One of the most important components of a machine learning system is its data. In the same way humans learn by interpreting sensory information, a machine learn from the information retrieved from data. Data varies hugely in quality, availability and type. In many machine learning projects most of the time can be spent collecting, verifying and 'massaging' data in to a shape that is acceptable to our learning algorithms. In the next chapter we will look at the details of this vitally important aspect of machine learning.