11.10 Statistics II

Student talks: from Monday the 17th; send slides by Monday at the latest. Between 45 and 60 minutes, with exercises for the group (make up your own or use the ones in the book).
* Collocations (5)
* Statistical Estimators (6.2)
* Lexical Acquisition (8)
* Clustering (14)
*() = chapter in the book.

==========================================
==These slides from Statistics Intro 1==

##Character bank: Ω ∑ ∈ ∏ ⊆ ∩ ∪ ##
## B_i = B with subscript i. B^i = B with superscript i.
## ∩ can also be notated as &.

Marginal Probability: (cont.)

Because:
            Blue       Red
  A         p1         p2
  B         p3         p4   | p3 + p4
  -----------------------------
            p1 + p3

p1 + p3 = ∑_{i=1}^{n} P(A ∩ B_i)

## Read ∑_{i=1}^{n} f(i) roughly as "for i in 1..n: add up f(i)" - the point is that i is the first index and n the last.
## i:n = i through n? # yes - like Python slice notation. He compared it to computer programming; the summand is a function f(i).

∏ = product
∑ = sum
∪ = union (or)
∪ can also be used as a big operator, just like ∑ or ∏.

=====
Bayes Rule
* P(A, B) = P(B, A) since A ∩ B = B ∩ A
* P(B, A) = P(B|A)P(A)   #Chain rule

argmax_i = the value of i which maximises the function. Could also write it explicitly: argmax_{i=1}^{n} f(i).

Slide 24
If we are interested in comparing the probabilities of events A1, A2, ... given B, we can ignore P(B) since it's the same for all A_i.

* argmax_i P(A_i|B) = argmax_i [P(B|A_i)P(A_i) / P(B)] = argmax_i P(B|A_i)P(A_i)

This idea is sometimes expressed as
* P(A|B) ∝ P(B|A)P(A)
## ∝ = proportional to (NOT infinity)

== Example for slide 24:
Trying to figure out the probabilities of possible German translations of an English sentence: look only at the probabilities of the German sentences and ignore that they all have the same B (the English sentence) as the condition.

P(G|E) ∝ P(E|G)P(G)

P(E|G) is the translation model: it gives high probability to those German strings that are likely to be the translation of the English.
P(G) is the language model. It doesn't care about the English, but it pushes to the top of the ranking those strings that are well-formed German.

By doing this, we've decomposed the problem into simpler ones: one asking whether the string is good German, one asking how the German is connected to the English.

Miloš: A normal model is discriminative. Here you switch the E|G with Bayes' rule, so this is a generative model, and if you really want to take a string of words and search for the best translation, that's an NP-hard problem. People now get around this by using discriminative models.
(For more on this: http://en.wikipedia.org/wiki/NP-hard )

==
This doesn't really map neatly onto the traditional balls-and-jars example - Bayes' theorem gets applied where B is held constant. However, let's try:
## We're doing this to find out which urn to pick if we're trying to get blue?
## I'm pretty sure. Not entirely. It's a hard example to convert to.

Background:
urn1 has two blue balls, one red ball
urn2 has three blue balls, one red ball

P(Urn1|blue) ∝ P(blue|Urn1)P(Urn1)
P(Urn2|blue) ∝ P(blue|Urn2)P(Urn2)
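## A minimal Python sketch of the proportionality/argmax idea on the urn example. The uniform priors P(Urn1) = P(Urn2) = 0.5 are an assumption (the notes don't say how an urn gets picked); the likelihoods just come from the ball counts above.

# Sketch only: which urn is most likely, given that we drew a blue ball?
# Assumption (not in the notes): each urn is chosen with prior probability 0.5.
likelihood = {"urn1": 2.0/3.0, "urn2": 3.0/4.0}   # P(blue | urn), from the ball counts
prior      = {"urn1": 0.5,     "urn2": 0.5}       # assumed uniform prior P(urn)

# P(urn | blue) ∝ P(blue | urn) * P(urn); P(blue) is the same for both urns, so skip it
score = {urn: likelihood[urn] * prior[urn] for urn in likelihood}
best  = max(score, key=score.get)                 # argmax over the urns
print(score)   # {'urn1': 0.333..., 'urn2': 0.375}
print(best)    # urn2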
==Slide 25==
Example: Suppose we are interested in a test to detect a disease which affects one in 100,000 people on average. A lab has developed a test which works, but is not perfect.
* If a person has the disease, the test will give a positive result with probability 0.97.
* If they do not, the test will still be positive with probability 0.007.
You took the test, and it gave a positive result. What is the probability that you actually have the disease?

For this question, you need to calculate the denominator of Bayes' rule (the probability of being tested positive). To do this, use the marginal probability.
(http://en.wikipedia.org/wiki/Marginal_distribution)

So, the unknown is the probability of you having the disease given that the test was positive (the left side). On the right, we know the probability that the test is positive given that you have the disease (.97), and we know the proportion of people who have the disease (1/100,000). In the denominator, we want the probability of being tested positive at all. To get this, we run the marginal probability: P(+) = ∑_{i=1}^{n} P(+ ∩ B_i). Here there are only two events - having or not having the disease. So we have .97 times 1/100,000 (having it) and .007 times 99,999/100,000 (not having it).

(Here we skipped a step. Basically, you have this:
P(+) = P(+,D) + P(+,N)
which says both events happen jointly (i.e., at the same time). Then we apply the chain rule, which says that P(A,B) = P(B)P(A|B). This is how we end up with...)

P(+) = P(+|D)*P(D) + P(+|N)*P(N)   <-- marginal probability for this variable.

Once you have this, you can plug it in and you get the result. So:
X = (.97/100000) / (.97/100000 + .007*(99999.0/100000.0))   #Tip - run this in Python. (Hence the .0)
X = 0.0013838105577612513

So there is only about a 0.14% chance that you actually have the disease.

==
Credits
Some material adapted from:
* Foundations of Statistical NLP (Companion website: http://nlp.stanford.edu/fsnlp/ )
* Intro to NLP slides by Jan Hajic (Homepage: http://ufal.mff.cuni.cz/~hajic/ )
* How to use probabilities slides by Jason Eisner (http://www.clsp.jhu.edu/workshops/ws08/documents/eisner-slides.pdf )

===Statistics 2======

==Slide 3==
Random Variables
Problem: different experiments have different sample spaces with different properties. When comparing experiments it can be hard or impossible to compare, say, heads (coin toss) with red balls (balls and jars), so having actual numbers is desirable. A random variable is a function that translates the sample space into numbers, e.g. the sample space of the coin-toss experiment (heads, tails) into 1 and 0.

* Function X : Ω → R^n (typically n = 1)
* It may be more convenient to work with real numbers than directly with events.
* Coin toss: X : {H,T} → {0,1}
* Sum of two dice throws: {1..6}^2 → {2..12}

Probability mass function:
* p(x) = P(X = x) = P(A) where A = {ω ∈ Ω : X(ω) = x}
We take the outcomes ω such that for each such outcome the output of X is x.

X(a,b) = a + b
Let's say that X = 7. What will A be?
{ω ∈ Ω : X(ω) = 7} = {(1,6), (2,5), (3,4), (4,3), (5,2), (6,1)}
Here (1,6) means the first die shows 1 and the second shows 6.
We are just translating our outcomes into numbers: we take an x out of the range of the function X and find all outcomes that, plugged into the function, have x as output.

==Slide 4==
Expectation is the mean (weighted average) of a random variable:
E(X) = ∑_x p(x) · x

Example: rolling a die:
E(X) = ∑_{x=1}^{6} p(x) · x = ∑_{x=1}^{6} x/6 = 3.5

==Slide 5==
Variance measures how much the values of a random variable vary:
Var(X) = E[(X − E(X))^2]
# squared because we always want a positive number here (and we don't want 0 variance just because the deviations balanced each other out)
Standard deviation σ is the square root of the variance.

What is the variance of a random variable describing a single throw of a die?
Sum over the numbers 1-6:
∑_{x=1}^{6} p(x) (x − 3.5)^2    ## mean = 3.5
The probability of any number is 1/6, so take p(x) out of the sum and draw it to the front:
1/6 ∑_{x=1}^{6} (x − 3.5)^2
So:
1/6 [(1−3.5)^2 + (2−3.5)^2 + (3−3.5)^2 + (4−3.5)^2 + (5−3.5)^2 + (6−3.5)^2] = 2.91666666666667
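## A quick Python check of the expectation and variance numbers above (nothing assumed beyond a fair six-sided die):

outcomes = range(1, 7)                           # faces of a fair die
E   = sum(outcomes) / 6.0                        # E(X) = ∑_x p(x)·x with p(x) = 1/6
Var = sum((x - E)**2 for x in outcomes) / 6.0    # Var(X) = E[(X − E(X))^2]

print(E)         # 3.5
print(Var)       # 2.9166666666666665, i.e. 35/12
print(Var**0.5)  # standard deviation σ, about 1.71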