Package: textrecipes 1.0.7.9000

Emil Hvitfeldt

textrecipes: Extra 'Recipes' for Text Processing

Converting text to numerical features requires specifically created procedures, which are implemented as steps according to the 'recipes' package. These steps allow for tokenization, filtering, counting (tf and tfidf) and feature hashing.
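As a rough illustration of how these steps chain together, the sketch below tokenizes a text column, keeps the most frequent tokens, and computes tf-idf features. The `reviews` data frame and its contents are made up for this example, and the cutoff of 5 tokens is an arbitrary choice, not a recommendation.

```r
library(recipes)
library(textrecipes)

# Hypothetical toy data, invented for illustration only
reviews <- data.frame(
  text = c(
    "The quick brown fox jumps over the lazy dog",
    "A lazy dog sleeps all day",
    "The fox is quick and clever"
  ),
  stringsAsFactors = FALSE
)

rec <- recipe(~ text, data = reviews) %>%
  step_tokenize(text) %>%                     # split each string into word tokens
  step_tokenfilter(text, max_tokens = 5) %>%  # keep at most the 5 most frequent tokens
  step_tfidf(text)                            # one tf-idf column per retained token

prepped <- prep(rec)                          # estimate vocabulary and idf weights
features <- bake(prepped, new_data = NULL)    # numeric features for the training data

print(features)
```

Swapping `step_tfidf()` for `step_tf()` or `step_texthash()` should give raw counts or hashed features instead; both are listed among the package exports.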

Authors: Emil Hvitfeldt [aut, cre], Michael W. Kearney [cph], Posit Software, PBC [cph, fnd]

# Install 'textrecipes' in R:

install.packages('textrecipes', repos = c('https://tidymodels.r-universe.dev', 'https://cloud.r-project.org'))

Bug tracker: https://github.com/tidymodels/textrecipes/issues

Pkgdown site: https://textrecipes.tidymodels.org

Datasets:

emoji_samples - Sample sentences with emojis

On CRAN:

10.87 score, 160 stars, 1 package, 964 scripts, 1.3k downloads, 33 exports, 57 dependencies

Last updated 5 days ago from: 7a1182a21e. Checks: 5 OK, 7 NOTE. Indexed: yes.

Target	Result	Latest binary
Doc / Vignettes	OK	Mar 07 2025
R-4.5-win-x86_64	OK	Mar 07 2025
R-4.5-mac-x86_64	OK	Mar 07 2025
R-4.5-mac-aarch64	OK	Mar 07 2025
R-4.5-linux-x86_64	OK	Mar 07 2025
R-4.4-win-x86_64	NOTE	Mar 07 2025
R-4.4-mac-x86_64	NOTE	Mar 07 2025
R-4.4-mac-aarch64	NOTE	Mar 07 2025
R-4.4-linux-x86_64	NOTE	Mar 07 2025
R-4.3-win-x86_64	NOTE	Mar 07 2025
R-4.3-mac-x86_64	NOTE	Mar 07 2025
R-4.3-mac-aarch64	NOTE	Mar 07 2025

Exports: %>% all_tokenized all_tokenized_predictors count_functions ngram required_pkgs show_tokens step_clean_levels step_clean_names step_dummy_hash step_lda step_lemma step_ngram step_pos_filter step_sequence_onehot step_stem step_stopwords step_text_normalization step_textfeature step_texthash step_tf step_tfidf step_tokenfilter step_tokenize step_tokenize_bpe step_tokenize_sentencepiece step_tokenize_wordpiece step_tokenmerge step_untokenize step_word_embeddings tidy tokenlist tunable

Dependencies: class cli clock codetools cpp11 data.table diagram digest dplyr fansi future future.apply generics globals glue gower hardhat ipred KernSmooth lattice lava lifecycle listenv lubridate magrittr MASS Matrix nnet numDeriv parallelly pillar pkgconfig prodlim progressr purrr R6 Rcpp recipes rlang rpart shape SnowballC sparsevctrs SQUAREM stringi stringr survival tibble tidyr tidyselect timechange timeDate tokenizers tzdb utf8 vctrs withr

Cookbook - Using more complex recipes involving text

Emil Hvitfeldt

Rendered from cookbook---using-more-complex-recipes-involving-text.Rmd using knitr::rmarkdown on Mar 07 2025.

Last update: 2024-11-09
Started: 2018-11-04

Under the hood - tokenlist

Rendered from tokenlist.Rmd using knitr::rmarkdown on Mar 07 2025.

Last update: 2024-04-01
Started: 2020-04-08

Working with n-grams

Rendered from Working-with-n-grams.Rmd using knitr::rmarkdown on Mar 07 2025.

Last update: 2024-04-01
Started: 2020-04-08
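For readers who have not opened the n-grams vignette, a minimal sketch of the idea follows. It is not taken from the vignette itself; the `docs` data frame is invented, and the choice of bigrams (`num_tokens = 2`, `min_num_tokens = 2`) is only for illustration.

```r
library(recipes)
library(textrecipes)

# Hypothetical input, not from the vignette
docs <- data.frame(
  text = c("one two three four", "alpha beta gamma"),
  stringsAsFactors = FALSE
)

rec <- recipe(~ text, data = docs) %>%
  step_tokenize(text) %>%                               # word tokens first
  step_ngram(text, num_tokens = 2, min_num_tokens = 2)  # then join adjacent pairs

# show_tokens() preps the recipe and returns the tokens per row as a list
bigrams <- show_tokens(rec, text)
print(bigrams)
```

Setting `min_num_tokens` lower than `num_tokens` should keep the shorter n-grams alongside the longer ones.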

Readme and manuals

Help Manual

Help page	Topics
Role Selection	all_tokenized all_tokenized_predictors
List of all feature counting functions	count_functions
Sample sentences with emojis	emoji_samples
Show token output of recipe	show_tokens
Clean Categorical Levels	step_clean_levels tidy.step_clean_levels
Clean Variable Names	step_clean_names tidy.step_clean_names
Indicator Variables via Feature Hashing	step_dummy_hash tidy.step_dummy_hash
Calculate LDA Dimension Estimates of Tokens	step_lda tidy.step_lda
Lemmatization of Token Variables	step_lemma tidy.step_lemma
Generate n-grams From Token Variables	step_ngram tidy.step_ngram
Part of Speech Filtering of Token Variables	step_pos_filter tidy.step_pos_filter
Positional One-Hot encoding of Tokens	step_sequence_onehot tidy.step_sequence_onehot
Stemming of Token Variables	step_stem tidy.step_stem
Filtering of Stop Words for Tokens Variables	step_stopwords tidy.step_stopwords
Normalization of Character Variables	step_text_normalization tidy.step_text_normalization
Calculate Set of Text Features	step_textfeature tidy.step_textfeature
Feature Hashing of Tokens	step_texthash tidy.step_texthash
Term frequency of Tokens	step_tf tidy.step_tf
Term Frequency-Inverse Document Frequency of Tokens	step_tfidf tidy.step_tfidf
Filter Tokens Based on Term Frequency	step_tokenfilter tidy.step_tokenfilter
Tokenization of Character Variables	step_tokenize tidy.step_tokenize
BPE Tokenization of Character Variables	step_tokenize_bpe tidy.step_tokenize_bpe
Sentencepiece Tokenization of Character Variables	step_tokenize_sentencepiece tidy.step_tokenize_sentencepiece
Wordpiece Tokenization of Character Variables	step_tokenize_wordpiece tidy.step_tokenize_wordpiece
Combine Multiple Token Variables Into One	step_tokenmerge tidy.step_tokenmerge
Untokenization of Token Variables	step_untokenize tidy.step_untokenize
Pretrained Word Embeddings of Tokens	step_word_embeddings tidy.step_word_embeddings
Create Token Object	tokenlist
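Most steps in the table above are paired with a `tidy.step_*` method. The sketch below shows one way these might be used to inspect a prepped recipe; the `notes` data frame is hypothetical, and the exact columns returned can differ between steps.

```r
library(recipes)
library(textrecipes)

# Hypothetical toy data for illustration
notes <- data.frame(
  text = c("This is words", "They are nice!"),
  stringsAsFactors = FALSE
)

prepped <- recipe(~ text, data = notes) %>%
  step_tokenize(text) %>%
  step_tf(text) %>%
  prep()

# Overview of the recipe: one row per step
steps_overview <- tidy(prepped)
print(steps_overview)

# Details for the first step (the tokenizer), dispatched to tidy.step_tokenize
tokenize_info <- tidy(prepped, number = 1)
print(tokenize_info)
```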