Title: | A Simple General Purpose N-Gram Tokenizer |
---|---|
Description: | A simple n-gram (contiguous sequences of n items from a given sequence of text) tokenizer to be used with the 'tm' package with no 'rJava'/'RWeka' dependency. |
Authors: | Chung-hong Chan <[email protected]> |
Maintainer: | Chung-hong Chan <[email protected]> |
License: | GPL-2 |
Version: | 0.2.0 |
Built: | 2024-11-08 04:13:00 UTC |
Source: | https://github.com/chainsawriot/ngramrr |
Wrappers to DocumentTermMatrix and TermDocumentMatrix to use n-gram tokenization provided by ngramrr.
dtm2(x, char = FALSE, ngmin = 1, ngmax = 2, rmEOL = TRUE, ...)
tdm2(x, char = FALSE, ngmin = 1, ngmax = 2, rmEOL = TRUE, ...)
x |
character vector |
char |
logical, using character n-gram. char = FALSE denotes word n-gram. |
ngmin |
integer, minimum order of n-gram |
ngmax |
integer, maximum order of n-gram |
rmEOL |
logical, remove n-grams with EOL character |
... |
Additional options for DocumentTermMatrix or TermDocumentMatrix |
ngramrr, DocumentTermMatrix, TermDocumentMatrix
nirvana <- c("hello hello hello how low", "hello hello hello how low",
             "hello hello hello how low", "hello hello hello",
             "with the lights out", "it's less dangerous",
             "here we are now", "entertain us",
             "i feel stupid", "and contagious",
             "here we are now", "entertain us",
             "a mulatto", "an albino", "a mosquito", "my libido",
             "yeah", "hey yay")
dtm2(nirvana, ngmax = 3, removePunctuation = TRUE)
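To make the shape of the result concrete, here is a minimal base-R sketch of a document-term count matrix built over word unigrams and bigrams. It is an illustration only, not how dtm2 or tdm2 are implemented: the helper `ngrams_of_order` and the variable names are hypothetical, the output is a plain integer matrix rather than a tm sparse matrix object, and options such as rmEOL and removePunctuation are not modeled.

```r
# Hypothetical helper (not part of ngramrr or tm): word n-grams of one fixed order n.
ngrams_of_order <- function(doc, n) {
  words <- strsplit(doc, "\\s+")[[1]]
  words <- words[nzchar(words)]                  # drop empty tokens
  if (length(words) < n) return(character(0))    # document too short for this order
  vapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

docs <- c("here we are now", "entertain us", "here we are now")

# Tokenize every document into unigrams and bigrams.
tokens <- lapply(docs, function(d) c(ngrams_of_order(d, 1), ngrams_of_order(d, 2)))

# Count each term per document; vapply yields terms x documents, so transpose.
terms <- sort(unique(unlist(tokens)))
dtm <- t(vapply(tokens,
                function(tk) as.integer(table(factor(tk, levels = terms))),
                integer(length(terms))))
dimnames(dtm) <- list(docs = as.character(seq_along(docs)), terms = terms)

dtm      # documents in rows, terms in columns (DocumentTermMatrix orientation)
t(dtm)   # terms in rows, documents in columns (TermDocumentMatrix orientation)
```

The two orientations hold the same counts; which one is more convenient depends on whether later steps iterate over documents or over terms.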
A non-Java-based n-gram tokenizer to be used with the tm package. Supports both character and word n-grams.
ngramrr(x, char = FALSE, ngmin = 1, ngmax = 2, rmEOL = TRUE)
x |
input string. |
char |
logical, using character n-gram. char = FALSE denotes word n-gram. |
ngmin |
integer, minimum order of n-gram |
ngmax |
integer, maximum order of n-gram |
rmEOL |
logical, remove n-grams with EOL character |
vector of n-grams
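The roles of char, ngmin, and ngmax can be illustrated with a small sliding-window sketch in base R. This is a hedged approximation, not ngramrr's actual implementation: the function name `sketch_ngrams` is hypothetical, rmEOL is not modeled, and the real function may order its output or treat whitespace and punctuation differently.

```r
# Hypothetical helper (not ngramrr itself): sliding-window n-grams of orders ngmin..ngmax.
sketch_ngrams <- function(x, char = FALSE, ngmin = 1, ngmax = 2) {
  stopifnot(length(x) == 1, ngmin >= 1, ngmax >= ngmin)
  if (char) {
    units <- strsplit(x, "")[[1]]          # one unit per character
    sep <- ""
  } else {
    units <- strsplit(x, "\\s+")[[1]]      # one unit per whitespace-delimited word
    units <- units[nzchar(units)]
    sep <- " "
  }
  out <- character(0)
  for (n in ngmin:ngmax) {
    if (n > length(units)) break           # no window of this size fits
    starts <- seq_len(length(units) - n + 1)
    out <- c(out, vapply(starts,
                         function(i) paste(units[i:(i + n - 1)], collapse = sep),
                         character(1)))
  }
  out
}

sketch_ngrams("here we are now", ngmax = 2)
sketch_ngrams("yeah", char = TRUE, ngmin = 2, ngmax = 3)
```

In this sketch the first call yields the four single words followed by the three adjacent word pairs, and the second yields the character bigrams of "yeah" followed by its trigrams.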
library(tm)
library(ngramrr)
nirvana <- c("hello hello hello how low", "hello hello hello how low",
             "hello hello hello how low", "hello hello hello",
             "with the lights out", "it's less dangerous",
             "here we are now", "entertain us",
             "i feel stupid", "and contagious",
             "here we are now", "entertain us",
             "a mulatto", "an albino", "a mosquito", "my libido",
             "yeah", "hey yay")
# Word n-grams, then character n-grams, of the first line
ngramrr(nirvana[1], ngmax = 3)
ngramrr(nirvana[1], ngmax = 3, char = TRUE)
# VCorpus is used here because recent tm versions may return a SimpleCorpus
# from Corpus(), and custom tokenizers are ignored for that class
nirvanacor <- VCorpus(VectorSource(nirvana))
# Word n-gram
TermDocumentMatrix(nirvanacor, control = list(tokenize = function(x) ngramrr(x, ngmax = 3)))
# Character n-gram
TermDocumentMatrix(nirvanacor, control = list(tokenize = function(x) ngramrr(x, char = TRUE, ngmax = 3), wordLengths = c(1, Inf)))