Probability Theory Fundamentals

Introduction

In this series I want to explore some introductory concepts from statistics that may prove helpful to those learning machine learning or refreshing their knowledge. These topics lie at the heart of data science and come up regularly in a rich and diverse range of problems. It is always good to go through the basics again: that way we may discover knowledge that was previously hidden from us, so let's get started.

The first part will introduce fundamentals of probability theory.

Probability

Why do we need probabilities when we already have such great mathematical tooling? We have calculus to work with functions on the infinitesimal scale and to measure how they change. We developed algebra to solve equations, and we have dozens of other areas of mathematics that help us tackle almost any kind of hard problem we can think of.

The difficult part is that we all live in a chaotic universe where things cannot be measured exactly most of the time. When we study real-world processes we have to account for the numerous random events that distort our experiments. Uncertainty is everywhere, and we must tame it and put it to use. That is where probability theory and statistics come into play.

Nowadays these disciplines lie at the center of artificial intelligence, particle physics, social science, bioinformatics, and our everyday lives.

Since we are going to talk about statistics, we had better settle on what a probability is. Actually, this question has no single best answer. We will go through various views on probability below.

Frequentist probabilities

Imagine we are given a coin and want to check whether it is fair or not. How do we approach this? Let's conduct an experiment and record 1 if heads comes up and 0 if we see tails. Repeat this for 1000 tosses and count every 0 and 1. After some tedious experimenting we get these results: 600 heads (1s) and 400 tails (0s). If we then count how frequently heads or tails came up in the past, we get 60% and 40% respectively. These frequencies can be interpreted as the probabilities of the coin coming up heads or tails. This is called the frequentist view of probability.

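To make the frequentist view concrete, here is a minimal simulation sketch (NumPy is assumed; the 0.6 bias is simply the made-up coin from the example above):

    import numpy as np

    rng = np.random.default_rng(seed=42)

    # Simulate 1000 tosses of a coin that lands heads with probability 0.6
    tosses = rng.random(1000) < 0.6      # True = heads (1), False = tails (0)

    heads_frequency = tosses.mean()      # relative frequency of heads
    tails_frequency = 1 - heads_frequency
    print(heads_frequency, tails_frequency)

The more tosses we make, the more closely these relative frequencies settle around 0.6 and 0.4.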

Conditional probabilities

Frequently we want to know the probability of an event given that some other event has occurred. We write the conditional probability of an event A given an event B as P(A | B). Take rain for example:

  • What is the probability of rain given that we see thunder?
  • What is the probability of rain given that it is sunny?

From this Euler diagram we can see that P(Rain | Thunder) = 1: it is always raining when we see thunder (yes, this is not exactly true, but we will take it as true in our example).

What about P(Rain | Sunny)? Visually this probability is quite small, but how can we formulate it mathematically and do exact calculations? Conditional probability is defined as:

    P(Rain | Sunny) = P(Rain, Sunny) / P(Sunny)

In words, we divide the probability of both Rain and Sunny by the probability of Sunny weather.

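As a small numeric sketch of this definition (the probabilities below are made up purely for illustration):

    # Made-up probabilities, only to illustrate the definition
    p_sunny = 0.40            # P(Sunny)
    p_rain_and_sunny = 0.05   # P(Rain, Sunny)

    p_rain_given_sunny = p_rain_and_sunny / p_sunny
    print(p_rain_given_sunny)  # 0.125, i.e. P(Rain | Sunny) = 12.5%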

Dependent and independent events

Events are called independent if the occurrence of one event does not influence the probability of the other in any way. Take for example rolling a die and getting a 2 on the first roll and then again on the second roll. Those events are independent. We can state this as

    P(1st roll = 2, 2nd roll = 2) = P(1st roll = 2) · P(2nd roll = 2)

But why does this formula work? First, let's rename the events of the 1st and 2nd rolls as A and B to remove notational clutter, and then write the probability of the two rolls we have seen so far explicitly as their joint probability:

    P(A, B) = P(A) · P(B)

And now multiply and divide P(A) by P(B) (nothing changes, it cancels out) and recall the definition of conditional probability:

    P(A) = P(A) · P(B) / P(B) = P(A, B) / P(B) = P(A | B)

If we read the expression above from right to left, we find that P(A | B) = P(A). Basically, this means that A is independent of B! The same argument works for P(B), and we are done.

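A quick simulation sketch (NumPy assumed) makes this tangible: the relative frequency of a 2 on the first roll is roughly the same whether or not we condition on the second roll also being a 2:

    import numpy as np

    rng = np.random.default_rng(seed=0)
    n = 1_000_000

    first = rng.integers(1, 7, size=n)    # A: 1st roll of a fair die
    second = rng.integers(1, 7, size=n)   # B: 2nd roll of a fair die

    p_a = (first == 2).mean()                        # estimate of P(A)
    p_a_given_b = (first[second == 2] == 2).mean()   # estimate of P(A | B)

    print(p_a, p_a_given_b)   # both close to 1/6 ≈ 0.167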

Bayesian view on probability

There is an alternative way to look at probabilities, called the Bayesian view. The frequentist approach to statistics supposes the existence of one best, concrete combination of model parameters that we are looking to find. The Bayesian way, on the other hand, treats parameters in a probabilistic manner and views them as random variables. In Bayesian statistics, each parameter has its own probability distribution, which tells us how probable the parameters are given the data. Mathematically this can be written as

    P(parameters | data)

It all starts with a simple theorem that allows us to compute conditional probabilities based on prior knowledge:

    P(A | B) = P(B | A) · P(A) / P(B)

Despite its simplicity, Bayes' theorem has immense value, a vast area of application, and even a special branch of statistics called Bayesian statistics. There is a very nice blog post about Bayes' theorem if you are interested in how it can be derived; it is not that hard at all.

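As a tiny worked sketch of the theorem, we can turn P(Rain | Thunder) from the Euler-diagram example into P(Thunder | Rain); note that the values of P(Thunder) and P(Rain) below are made-up numbers used only for illustration:

    # Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)
    p_rain_given_thunder = 1.0   # taken from the example above
    p_thunder = 0.05             # made-up probability of thunder
    p_rain = 0.20                # made-up probability of rain

    p_thunder_given_rain = p_rain_given_thunder * p_thunder / p_rain
    print(p_thunder_given_rain)  # 0.25: thunder accompanies roughly 1 in 4 rainy days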

Distributions

What is a probability distribution, anyway? It is a law that tells us the probabilities of the different possible outcomes of an experiment, formulated as a mathematical function. Like any function, a distribution may have parameters that adjust its behavior.

When we measured the relative frequencies of the coin toss outcomes we actually calculated a so-called empirical probability distribution. It turns out that many uncertain processes in our world can be formulated in terms of probability distributions. For example, our coin outcomes follow a Bernoulli distribution, and if we wanted to calculate the probability of seeing a given number of heads in n trials we could use a Binomial distribution.

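For instance, here is a small sketch (standard library only) of the Binomial PMF, answering "what is the probability of seeing exactly k heads in n tosses" of the 60/40 coin from before:

    from math import comb

    def binomial_pmf(k, n, p):
        # Probability of exactly k successes in n independent Bernoulli(p) trials
        return comb(n, k) * p**k * (1 - p)**(n - k)

    # Probability of exactly 6 heads in 10 tosses with P(heads) = 0.6
    print(binomial_pmf(6, 10, 0.6))   # ≈ 0.251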

It is convenient to introduce a concept analogous to a variable that can be used in probabilistic settings: a random variable. Each random variable has some distribution assigned to it. Random variables are written in upper case by convention, and we may use the ~ symbol to specify the distribution assigned to a variable:

    X ~ Bernoulli(0.6)

This means that random variable X is distributed according to a Bernoulli distribution with probability of success (heads) equal to 0.6.

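In code, drawing samples of X ~ Bernoulli(0.6) might look like the following sketch (NumPy assumed; a Bernoulli variable is just a Binomial with a single trial):

    import numpy as np

    rng = np.random.default_rng(seed=1)

    # Ten draws of X ~ Bernoulli(0.6): each draw is 1 (heads) or 0 (tails)
    x = rng.binomial(n=1, p=0.6, size=10)
    print(x)          # a sequence of 0s and 1s
    print(x.mean())   # approaches 0.6 as the number of draws grows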

Continuous and discrete probability distributions

Probability distributions come in two flavors:

  • Discrete distributions deal with random variables that have a countable number of values, as was the case with our coin and the Bernoulli distribution. Discrete distributions are defined by functions called Probability Mass Functions (PMF).
  • Continuous distributions deal with continuous random variables that can (in theory) take an infinite number of values. Think of velocity and acceleration measured with noisy sensors. Continuous distributions are defined by functions called Probability Density Functions (PDF).

These types of distributions differ in their mathematical treatment: you will typically use summations with discrete distributions and integrals with continuous ones. Take the expected value as an example:

    E[X] = Σ x · P(X = x)    (discrete)
    E[X] = ∫ x · f(x) dx     (continuous, with density f)

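A minimal sketch of the discrete case, the expected value of a fair six-sided die computed as a sum and then checked by simulation (NumPy assumed):

    import numpy as np

    # Discrete case: E[X] = sum of x * P(X = x) over all outcomes of a fair die
    values = np.arange(1, 7)
    probs = np.full(6, 1 / 6)
    print((values * probs).sum())   # 3.5

    # Sanity check by simulation
    rng = np.random.default_rng(seed=2)
    rolls = rng.integers(1, 7, size=100_000)
    print(rolls.mean())             # close to 3.5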

Samples and statistics

Suppose we are doing research on human height and are eager to publish a mind-blowing scientific paper. We measured the heights of some strangers on the street, so our measurements are independent. The process of selecting a random subset of data from the true population is called sampling. A statistic is a function used to summarize the data using the values from the sample. A statistic you have likely met before is the sample mean:

    x̄ = (x_1 + x_2 + … + x_n) / n

Another example is the sample variance:

    s² = Σ (x_i − x̄)² / (n − 1)

This formula captures, overall, how far the data points deviate from their mean.

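A short sketch with NumPy (the heights below are made-up values in centimetres) that computes both statistics; ddof=1 gives the n − 1 denominator of the sample variance:

    import numpy as np

    # Made-up height measurements in centimetres
    heights = np.array([178.0, 165.5, 182.3, 171.1, 169.8, 175.4])

    sample_mean = heights.mean()
    sample_variance = heights.var(ddof=1)   # divides by n - 1, not n

    print(sample_mean, sample_variance)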

What if I want more?

Do you want to go deeper into probability theory and statistics? Great! You will definitely benefit from this knowledge, whether you want to get a solid understanding of the theory behind machine learning or are just curious.

  • Entry level: Khan Academy is a great free resource. The course will take you through the basics in a very intuitive and simple form
  • Intermediate level: All of Statistics by Larry Wasserman is a great and concise resource that presents almost all of the important topics in statistics. Beware that this book assumes you are familiar with linear algebra and calculus
  • Advanced level: I bet you will be tailoring your personal reading list by this time 🙃