Package 'ngramrr' reference manual

Title:	A Simple General Purpose N-Gram Tokenizer
Description:	A simple n-gram (contiguous sequences of n items from a given sequence of text) tokenizer to be used with the 'tm' package with no 'rJava'/'RWeka' dependency.
Authors:	Chung-hong Chan <[email protected]>
Maintainer:	Chung-hong Chan <[email protected]>
License:	GPL-2
Version:	0.2.0
Built:	2025-02-06 04:11:33 UTC
Source:	https://github.com/chainsawriot/ngramrr

Wrappers to DocumentTermMatrix and TermDocumentMatrix to use n-gram tokenization

Description

Wrappers to DocumentTermMatrix (dtm2) and TermDocumentMatrix (tdm2) that use the n-gram tokenization provided by ngramrr.

Usage

dtm2(x, char = FALSE, ngmin = 1, ngmax = 2, rmEOL = TRUE, ...)

tdm2(x, char = FALSE, ngmin = 1, ngmax = 2, rmEOL = TRUE, ...)

Arguments

`x`	character vector, `Source` or `Corpus` to be converted
`char`	logical, whether to use character n-grams. char = FALSE denotes word n-grams.
`ngmin`	integer, minimum order of n-gram
`ngmax`	integer, maximum order of n-gram
`rmEOL`	logical, whether to remove n-grams containing an EOL character
`...`	additional options passed to `DocumentTermMatrix` or `TermDocumentMatrix`

Value

DocumentTermMatrix (from dtm2) or TermDocumentMatrix (from tdm2)
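
As a rough illustration of how the two return shapes relate, the following base-R sketch counts word bigrams per document and arranges the counts both ways. It does not use tm or ngramrr; `bigrams_sketch`, `tdm_sketch` and `dtm_sketch` are hypothetical names introduced here, and the objects are plain matrices rather than the sparse classes tm returns.

```r
docs <- c("here we are now", "entertain us", "here we are now")

# Hypothetical helper (not part of ngramrr): word bigrams of one string.
bigrams_sketch <- function(s) {
  w <- strsplit(s, " ", fixed = TRUE)[[1]]
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1))
}

tokens <- lapply(docs, bigrams_sketch)
terms <- sort(unique(unlist(tokens)))

# Count each term per document; vapply yields one column per document.
counts <- vapply(tokens,
                 function(tk) as.integer(table(factor(tk, levels = terms))),
                 integer(length(terms)))

# Terms in rows, documents in columns (the term-document layout).
tdm_sketch <- matrix(counts, nrow = length(terms),
                     dimnames = list(Terms = terms,
                                     Docs = as.character(seq_along(docs))))

# Documents in rows, terms in columns (the document-term layout).
dtm_sketch <- t(tdm_sketch)
dtm_sketch
```

The only difference between the two layouts in this sketch is the transpose, so the choice between them depends on whether downstream code expects documents or terms in rows.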

Examples

nirvana <- c("hello hello hello how low", "hello hello hello how low",
"hello hello hello how low", "hello hello hello",
"with the lights out", "it's less dangerous", "here we are now", "entertain us",
"i feel stupid", "and contagious", "here we are now", "entertain us",
"a mulatto", "an albino", "a mosquito", "my libido", "yeah", "hey yay")
dtm2(nirvana, ngmax = 3, removePunctuation = TRUE)

General purpose n-gram tokenizer

Description

A non-Java-based n-gram tokenizer to be used with the tm package. Supports both character and word n-grams.

Usage

ngramrr(x, char = FALSE, ngmin = 1, ngmax = 2, rmEOL = TRUE)

Arguments

`x`	input string.
`char`	logical, whether to use character n-grams. char = FALSE denotes word n-grams.
`ngmin`	integer, minimum order of n-gram
`ngmax`	integer, maximum order of n-gram
`rmEOL`	logical, whether to remove n-grams containing an EOL character

Value

vector of n-grams
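
To make the roles of `char`, `ngmin` and `ngmax` concrete, here is a minimal base-R sketch of n-gram generation. It is not the package's implementation: `ngrams_sketch` is a hypothetical helper, it has no `rmEOL` handling, and the order and exact content of its output (for example, how whitespace is treated in character n-grams) may differ from what `ngramrr` returns.

```r
# Hypothetical helper (not part of ngramrr): n-grams of a single string.
ngrams_sketch <- function(x, char = FALSE, ngmin = 1, ngmax = 2) {
  stopifnot(length(x) == 1, ngmin >= 1, ngmax >= ngmin)
  # Units are single characters (char = TRUE) or whitespace-separated words.
  units <- if (char) strsplit(x, "")[[1]] else strsplit(x, "[[:space:]]+")[[1]]
  units <- units[nzchar(units)]
  sep <- if (char) "" else " "
  out <- character(0)
  for (n in ngmin:ngmax) {
    if (n > length(units)) break  # no n-gram of this order fits
    starts <- seq_len(length(units) - n + 1)
    grams <- vapply(starts,
                    function(i) paste(units[i:(i + n - 1)], collapse = sep),
                    character(1))
    out <- c(out, grams)
  }
  out
}

# Word n-grams of order 1 to 3: 5 unigrams, 4 bigrams, 3 trigrams.
ngrams_sketch("hello hello hello how low", ngmax = 3)

# Character n-grams of order 1 to 2.
ngrams_sketch("low", char = TRUE, ngmax = 2)
```

Every order from `ngmin` to `ngmax` contributes its own sliding window over the same units, which is why the result grows quickly as `ngmax` increases.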

Examples

library(tm)
library(ngramrr)

nirvana <- c("hello hello hello how low", "hello hello hello how low",
"hello hello hello how low", "hello hello hello",
"with the lights out", "it's less dangerous", "here we are now", "entertain us",
"i feel stupid", "and contagious", "here we are now", "entertain us",
"a mulatto", "an albino", "a mosquito", "my libido", "yeah", "hey yay")

ngramrr(nirvana[1], ngmax = 3)
ngramrr(nirvana[1], ngmax = 3, char = TRUE)
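# VCorpus is used so that the custom tokenizer is applied; in tm >= 0.7,
# Corpus() on a VectorSource builds a SimpleCorpus, for which custom
# tokenize functions may be ignored.
nirvanacor <- VCorpus(VectorSource(nirvana))
TermDocumentMatrix(nirvanacor, control = list(tokenize = function(x) ngramrr(x, ngmax = 3)))

# Character n-grams

TermDocumentMatrix(nirvanacor, control = list(tokenize =
function(x) ngramrr(x, char = TRUE, ngmax = 3), wordLengths = c(1, Inf)))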
