Package: textrecipes 1.1.0.9000

Emil Hvitfeldt

textrecipes: Extra 'Recipes' for Text Processing

Converting text to numerical features requires purpose-built procedures, which are implemented as steps that follow the conventions of the 'recipes' package. These steps allow for tokenization, filtering, counting (tf and tf-idf), and feature hashing.

Authors: Emil Hvitfeldt [aut, cre], Michael W. Kearney [cph], Posit Software, PBC [cph, fnd]

textrecipes_1.1.0.9000.tar.gz
textrecipes_1.1.0.9000.zip (r-4.7) | textrecipes_1.1.0.9000.zip (r-4.6) | textrecipes_1.1.0.9000.zip (r-4.5)
textrecipes_1.1.0.9000.tgz (r-4.6-x86_64) | textrecipes_1.1.0.9000.tgz (r-4.6-arm64) | textrecipes_1.1.0.9000.tgz (r-4.5-x86_64) | textrecipes_1.1.0.9000.tgz (r-4.5-arm64)
textrecipes_1.1.0.9000.tar.gz (r-4.7-arm64) | textrecipes_1.1.0.9000.tar.gz (r-4.7-x86_64) | textrecipes_1.1.0.9000.tar.gz (r-4.6-arm64) | textrecipes_1.1.0.9000.tar.gz (r-4.6-x86_64)
textrecipes_1.1.0.9000.tgz (r-4.6-emscripten)
manual.pdf | manual.html
DESCRIPTION | NEWS
card.svg | card.png
textrecipes/json (API)

# Install 'textrecipes' in R:
install.packages('textrecipes', repos = c('https://tidymodels.r-universe.dev', 'https://cloud.r-project.org'))
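
# Example usage (illustrative sketch):
As a rough illustration of the steps described above (tokenization followed by tf-idf counting), a recipe could be assembled as in the sketch below. The `reviews` data frame and its `text` column are hypothetical and made up for this example; `step_tokenize()` and `step_tfidf()` come from the package's export list, and `recipe()`, `prep()`, and `bake()` come from 'recipes'.

```r
library(recipes)
library(textrecipes)

# Hypothetical toy data, invented for this sketch (not shipped with the package).
reviews <- data.frame(
  text = c(
    "The food was great",
    "The service was slow",
    "Great food, slow service"
  ),
  stringsAsFactors = FALSE
)

# Tokenize the text column, then turn the tokens into tf-idf features.
rec <- recipe(~ text, data = reviews) |>
  step_tokenize(text) |>
  step_tfidf(text)

# Estimate the recipe on the toy data and return the processed training set.
prepped <- prep(rec)
features <- bake(prepped, new_data = NULL)

# One row per document, one tfidf_text_* column per distinct token.
print(dim(features))
print(names(features))
```

Other steps from the export list, such as `step_stopwords()` or `step_tokenfilter()`, would typically be placed between the tokenization and counting steps; they are left out here to keep the sketch free of extra dependencies.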

Bug tracker: https://github.com/tidymodels/textrecipes/issues

Pkgdown/docs site: https://textrecipes.tidymodels.org

10.35 score | 164 stars | 1 package | 1.1k scripts | 1.3k downloads | 33 exports | 65 dependencies

Last updated from: 855427fb6a. Checks: 13 OK. Indexed: yes.

Target | Result | Time
linux-devel-arm64 | OK | 219
linux-devel-x86_64 | OK | 221
source / vignettes | OK | 285
linux-release-arm64 | OK | 250
linux-release-x86_64 | OK | 259
macos-release-arm64 | OK | 129
macos-release-x86_64 | OK | 327
macos-oldrel-arm64 | OK | 116
macos-oldrel-x86_64 | OK | 388
windows-devel | OK | 170
windows-release | OK | 162
windows-oldrel | OK | 156
wasm-release | OK | 138

Exports: %>%, all_tokenized, all_tokenized_predictors, count_functions, ngram, required_pkgs, show_tokens, step_clean_levels, step_clean_names, step_dummy_hash, step_lda, step_lemma, step_ngram, step_pos_filter, step_sequence_onehot, step_stem, step_stopwords, step_text_normalization, step_textfeature, step_texthash, step_tf, step_tfidf, step_tokenfilter, step_tokenize, step_tokenize_bpe, step_tokenize_sentencepiece, step_tokenize_wordpiece, step_tokenmerge, step_untokenize, step_word_embeddings, tidy, tokenlist, tunable

Dependencies: class, cli, clock, codetools, cpp11, data.table, diagram, digest, dplyr, farver, future, future.apply, generics, ggplot2, globals, glue, gower, gtable, hardhat, ipred, isoband, KernSmooth, labeling, lattice, lava, lifecycle, listenv, lubridate, magrittr, MASS, Matrix, nnet, numDeriv, parallelly, pillar, pkgconfig, prodlim, progressr, purrr, R6, RColorBrewer, Rcpp, recipes, rlang, rpart, S7, scales, shape, SnowballC, sparsevctrs, SQUAREM, stringi, stringr, survival, tibble, tidyr, tidyselect, timechange, timeDate, tokenizers, tzdb, utf8, vctrs, viridisLite, withr

Cookbook - Using more complex recipes involving text
Counting select words | Removing words in addition to the stop words list | Letter distributions | TF-IDF of ngrams of stemmed tokens

Last update: 2025-04-24
Started: 2018-11-04

Under the hood - tokenlist
tokens attribute | lemma and pos attributes

Last update: 2025-04-24
Started: 2020-04-08

Working with n-grams
Only using step_tokenize() | Using step_tokenize() and step_ngram()

Last update: 2025-04-24
Started: 2020-04-08

Readme and manuals

Help Manual

Help page | Topics
Role Selection | all_tokenized all_tokenized_predictors
List of all feature counting functions | count_functions
Sample sentences with emojis | emoji_samples
Show token output of recipe | show_tokens
Clean Categorical Levels | step_clean_levels tidy.step_clean_levels
Clean Variable Names | step_clean_names tidy.step_clean_names
Indicator Variables via Feature Hashing | step_dummy_hash tidy.step_dummy_hash
Calculate LDA Dimension Estimates of Tokens | step_lda tidy.step_lda
Lemmatization of Token Variables | step_lemma tidy.step_lemma
Generate n-grams From Token Variables | step_ngram tidy.step_ngram
Part of Speech Filtering of Token Variables | step_pos_filter tidy.step_pos_filter
Positional One-Hot encoding of Tokens | step_sequence_onehot tidy.step_sequence_onehot
Stemming of Token Variables | step_stem tidy.step_stem
Filtering of Stop Words for Tokens Variables | step_stopwords tidy.step_stopwords
Normalization of Character Variables | step_text_normalization tidy.step_text_normalization
Calculate Set of Text Features | step_textfeature tidy.step_textfeature
Feature Hashing of Tokens | step_texthash tidy.step_texthash
Term frequency of Tokens | step_tf tidy.step_tf
Term Frequency-Inverse Document Frequency of Tokens | step_tfidf tidy.step_tfidf
Filter Tokens Based on Term Frequency | step_tokenfilter tidy.step_tokenfilter
Tokenization of Character Variables | step_tokenize tidy.step_tokenize
BPE Tokenization of Character Variables | step_tokenize_bpe tidy.step_tokenize_bpe
Sentencepiece Tokenization of Character Variables | step_tokenize_sentencepiece tidy.step_tokenize_sentencepiece
Wordpiece Tokenization of Character Variables | step_tokenize_wordpiece tidy.step_tokenize_wordpiece
Combine Multiple Token Variables Into One | step_tokenmerge tidy.step_tokenmerge
Untokenization of Token Variables | step_untokenize tidy.step_untokenize
Pretrained Word Embeddings of Tokens | step_word_embeddings tidy.step_word_embeddings
Create Token Object | tokenlist