Naive Bayes Classification in R (Machine Learning With R by Brett Lantz)

From Machine Learning With R by Brett Lantz

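Background

The book starts from spam classification to introduce joint probability and Venn diagrams, and from there it arrives at Bayes' theorem.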

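The naive Bayes classifier is a conditional probability model built on Bayes' formula for conditional probability:

$$p(C \mid F_1, \dots, F_n) = \frac{p(C)\, p(F_1, \dots, F_n \mid C)}{p(F_1, \dots, F_n)}$$

that is,

posterior = prior × likelihood / evidence (the evidence is also called the marginal likelihood)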
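As a quick numeric check of the formula, here is a minimal sketch in base R. The counts are made up for illustration (100 messages, 20 of them spam, a given word appearing in 5 messages, 4 of which are spam) and do not come from the SMS dataset.

```r
# Hypothetical counts, chosen only to illustrate Bayes' formula
prior      <- 20 / 100   # p(spam)
likelihood <- 4 / 20     # p(word | spam)
evidence   <- 5 / 100    # p(word)

# posterior = prior * likelihood / evidence
posterior <- prior * likelihood / evidence
print(posterior)         # approximately 0.8
```

With these numbers, seeing the word raises the probability of spam from the prior of 0.2 to a posterior of about 0.8.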

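A naive Bayes classifier combines this model with a decision rule: pick the class with the maximum a posteriori (MAP) probability. Under the naive assumption that the features are independent given the class, the likelihood factors into a product over the individual features, and the evidence can be dropped because it is the same for every class.

The corresponding classifier formula:

$$\mathrm{classify}(f_1, \dots, f_n) = \arg\max_{c}\; p(C = c) \prod_{i=1}^{n} p(F_i = f_i \mid C = c)$$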
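The decision rule can be sketched in a few lines of base R. The priors, the word probabilities, and the helper name `classify` are all hypothetical, picked only to show how the product and the argmax work. This is not how e1071 implements it internally.

```r
# Hypothetical class priors p(C = c)
prior <- c(spam = 0.2, ham = 0.8)

# Hypothetical p(word present | class); rows are classes, columns are words
p_word <- rbind(
  spam = c(free = 0.60, meeting = 0.05),
  ham  = c(free = 0.10, meeting = 0.30)
)

classify <- function(present, prior, p_word) {
  # present: named logical vector, TRUE if the word occurs in the message
  lik <- sapply(rownames(p_word), function(cl) {
    p <- p_word[cl, names(present)]
    # Use p for words that are present and 1 - p for words that are absent
    prod(ifelse(present, p, 1 - p))
  })
  score <- prior[names(lik)] * lik   # unnormalised posterior per class
  names(which.max(score))            # MAP class
}

print(classify(c(free = TRUE,  meeting = FALSE), prior, p_word))
print(classify(c(free = FALSE, meeting = TRUE),  prior, p_word))
```

For the first message the unnormalised scores are 0.2 × 0.6 × 0.95 = 0.114 for spam and 0.8 × 0.1 × 0.7 = 0.056 for ham, so it is labelled spam. The second message comes out as ham.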

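The next ingredient is the Laplace estimator: add a small number (typically 1) to every count in the frequency table, so that no feature has zero probability in any class. Without it, a single word never seen in one class would force the whole product for that class to zero.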
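The effect is easy to see on a small frequency table. The counts below are invented, and the formula is one common form of additive smoothing, assumed here for illustration: add `laplace` to each cell and `laplace` times the number of feature levels to each row total.

```r
# Hypothetical counts of messages without/with a given word, per class
counts <- rbind(
  spam = c(No = 20, Yes = 0),
  ham  = c(No = 70, Yes = 10)
)

laplace <- 1

# Raw conditional probabilities: p(Yes | spam) is exactly zero
raw <- counts / rowSums(counts)

# Smoothed probabilities: every cell is now strictly positive
smoothed <- (counts + laplace) / (rowSums(counts) + laplace * ncol(counts))

print(raw)
print(smoothed)
```

Each row of `smoothed` still sums to 1, and p(Yes | spam) moves from 0 to 1/22.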

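SMS message filtering example

Preprocessing

Preprocess the English text: convert it to lowercase and remove numbers, stop words, punctuation, and extra whitespace.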

#!/usr/bin/env Rscript

# Machine Learning With R by Brett Lantz

library('tm')     # See print(vignette('tm'))

sms_raw <- read.csv('sms_spam.csv', stringsAsFactors = FALSE)
sms_raw$type <- factor(sms_raw$type)

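# Create the corpus, then pre-process it
sms_corpus <- VCorpus(VectorSource(sms_raw$text))

# tolower() is a base R function rather than a tm transformation,
# so wrap it in content_transformer() to keep the result a corpus
corpus_clean <- tm_map(sms_corpus, content_transformer(tolower))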
corpus_clean <- tm_map(corpus_clean, removeNumbers)
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords())
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)

# Build a sparse matrix
sms_dtm <- DocumentTermMatrix(corpus_clean)

Splitting the dataset

# Build training data and test data
num_train <- 4169
num_total <- length(sms_raw$text)

sms_raw_train <- sms_raw[1:num_train, ]
sms_raw_test <- sms_raw[(num_train + 1):num_total, ]

sms_dtm_train <- sms_dtm[1:num_train, ]
sms_dtm_test <- sms_dtm[(num_train + 1):num_total, ]

sms_corpus_train <- corpus_clean[1:num_train]
sms_corpus_test <- corpus_clean[(num_train + 1):num_total]
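A sequential split like this only gives representative subsets if the rows are already in random order, so it is worth comparing the class proportions of the two parts. The sketch below uses a made-up data frame (`toy`, with a repeating ham/spam pattern) in place of `sms_raw`. On the real data the same `prop.table(table(...))` call would be applied to `sms_raw_train$type` and `sms_raw_test$type`.

```r
# Hypothetical stand-in for sms_raw: 100 messages, every fourth one is spam
toy <- data.frame(
  type = factor(rep(c('ham', 'ham', 'ham', 'spam'), 25)),
  text = paste('message', 1:100),
  stringsAsFactors = FALSE
)

num_train <- 75
toy_train <- toy[1:num_train, ]
toy_test  <- toy[(num_train + 1):nrow(toy), ]

# Share of ham and spam in each subset; the two should be close
prop_train <- prop.table(table(toy_train$type))
prop_test  <- prop.table(table(toy_test$type))

print(prop_train)
print(prop_test)
```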

Word clouds

# Wordcloud
library('wordcloud')

wordcloud(sms_corpus_train, min.freq = 40, random.order = FALSE)

spam <- subset(sms_raw_train, type == 'spam')
ham <- subset(sms_raw_train, type == 'ham')

wordcloud(spam$text, max.words = 40, scale = c(3, .5))
wordcloud(ham$text, max.words = 40, scale = c(3, .5))

Creating indicator features for frequent words

# Create indicator features for words that appear in at least 5 messages
sms_dict <- findFreqTerms(sms_dtm_train, 5)
sms_train <- DocumentTermMatrix(sms_corpus_train, list(dictionary = sms_dict))
sms_test <- DocumentTermMatrix(sms_corpus_test, list(dictionary = sms_dict))

convert_counts <- function(x) {
  # Convert counts into factors('Yes' or 'No')
  x <- ifelse(x > 0, 1, 0)
  x <- factor(x, levels = c(0, 1), labels = c('No', 'Yes'))
  x
}

sms_train <- apply(sms_train, MARGIN = 2, convert_counts)
sms_test <- apply(sms_test, MARGIN = 2, convert_counts)
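To see what this conversion does, here is a small self-contained sketch that applies the same `convert_counts` logic to a made-up count matrix (`toy`) instead of the real document-term matrix.

```r
convert_counts <- function(x) {
  # Turn word counts into a 'No'/'Yes' indicator
  x <- ifelse(x > 0, 1, 0)
  factor(x, levels = c(0, 1), labels = c('No', 'Yes'))
}

# Hypothetical counts: 3 messages (rows) by 2 words (columns)
toy <- matrix(c(0, 2, 1,
                0, 0, 3),
              nrow = 3,
              dimnames = list(NULL, c('free', 'call')))

toy_ind <- apply(toy, MARGIN = 2, convert_counts)
print(toy_ind)
```

Note that `apply()` flattens the factors it gets back, so the result is a character matrix of 'No' and 'Yes' values rather than a set of factor columns.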

Prediction and results

# Naive Bayes (e1071) classifier
library('e1071')

sms_classifier <- naiveBayes(sms_train, sms_raw_train$type)
sms_test_pred <- predict(sms_classifier, sms_test)

# Evaluation
library('gmodels')
CrossTable(sms_test_pred, sms_raw_test$type,
  prop.chisq = FALSE, prop.t = FALSE,
  dnn = c('predicted', 'actual'))

# Improve
sms_classifier2 <- naiveBayes(sms_train, sms_raw_train$type,
  laplace = 1)
sms_test_pred2 <- predict(sms_classifier2, sms_test)
CrossTable(sms_test_pred2, sms_raw_test$type,
  prop.chisq = FALSE, prop.t = FALSE,
  dnn = c('predicted', 'actual'))
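The cross table can also be boiled down to a few summary numbers. The sketch below uses two short made-up label vectors in place of `sms_test_pred` and `sms_raw_test$type`, and treats spam as the positive class. The values are illustrative and are not results from the SMS data.

```r
# Hypothetical actual and predicted labels for 8 messages
lv        <- c('ham', 'spam')
actual    <- factor(c('ham', 'ham', 'ham',  'spam', 'spam', 'ham', 'spam', 'ham'), levels = lv)
predicted <- factor(c('ham', 'ham', 'spam', 'spam', 'ham',  'ham', 'spam', 'ham'), levels = lv)

# Confusion matrix: rows are predictions, columns are true labels
conf <- table(predicted = predicted, actual = actual)

accuracy  <- sum(diag(conf)) / sum(conf)               # share of correct predictions
precision <- conf['spam', 'spam'] / sum(conf['spam', ])  # predicted spam that really is spam
recall    <- conf['spam', 'spam'] / sum(conf[, 'spam'])  # real spam that was caught

print(conf)
print(c(accuracy = accuracy, precision = precision, recall = recall))
```

For a spam filter, precision matters in particular, since a ham message wrongly flagged as spam may never be read.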

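Naive Bayes relies on the assumption that the features are independent of one another given the class, and it is commonly used to filter spam email and SMS messages.

References: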

  1. Machine Learning With R by Brett Lantz
  2. https://zh.wikipedia.org/zh-hans/朴素贝叶斯分类器
  3. https://archive.ics.uci.edu/ml/index.php

Author: YanWen

Web developer
