Extracting English Text with tm

A collected R snippet that uses the tm library to strip superfluous words from English text.

library('tm')

cleaning_text_file <- function(filename) {
  # Read the file into a single string
  text <- paste(readLines(filename), collapse=' ')

  # Replace non-word characters and digits with spaces
  # (backslashes must be doubled inside an R string literal)
  text <- gsub(pattern='(\\W)|(\\d)', replacement=' ', text)

  # Lower
  text <- tolower(text)

  # Remove stop words
  text <- removeWords(text, stopwords())

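  # Filter single-letter tokens (text is already lowercase here;
  # [A-z] would also match the punctuation between 'Z' and 'a')
  text <- gsub(pattern='\\b[a-z]\\b', replacement=' ', text)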

  # Remove extra spaces
  text <- stripWhitespace(text)

  text
}
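
To see what the regex steps do, here is a minimal sketch that runs them on an in-memory string instead of a file. The sample sentence and the variable names are made up for illustration. The stop-word step is skipped so the block runs without tm, and the whitespace collapse uses base R as a stand-in for stripWhitespace.

```r
# Hypothetical sample input, not from the original post
sample_text <- "R 4.3 is great: a B c, it's 100% fun!"

# Step 1: replace non-word characters and digits with spaces
step1 <- gsub("(\\W)|(\\d)", " ", sample_text)

# Step 2: lowercase everything
step2 <- tolower(step1)

# Step 3: drop single-letter tokens such as "r", "a", "b", "c", "s"
step3 <- gsub("\\b[a-z]\\b", " ", step2)

# Step 4: collapse runs of whitespace and trim the ends
# (base R stand-in for tm's stripWhitespace, which does not trim)
step4 <- trimws(gsub("\\s+", " ", step3))

print(step4)
# [1] "is great it fun"
```

In the full function, removeWords with the default English stop-word list should also drop common words such as "is" and "it". Note that punctuation is stripped first, so a contraction like "it's" is split into "it" and "s" before the stop-word list is applied.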

Author: YanWen

Web developer
