Changes in version 1.1.0.9000

Changes in version 1.1.0 (2025-03-18)

Improvements

- The following steps have gained the argument sparse. When set to "yes", they will produce sparse vectors. (#277)
  - step_dummy_hash()
  - step_texthash()
  - step_tf()
  - step_tfidf()

Changes in version 1.0.7 (2025-01-23)

Improvements

- Documentation for tidy methods for all steps has been improved to describe the return value more accurately. (#262)
- Calling ?tidy.step_*() now sends you to the documentation for step_*() where the outcome is documented. (#261)
- step_textfeatures() has been made faster and more robust. (#265)

Bug Fixes

- Fixed bug in step_clean_levels() where it would produce NAs for character columns. (#274)

Changes in version 1.0.6 (2023-11-15)

- textfeatures has been removed from Suggests. (#255)
- step_textfeatures() no longer returns a politeness feature. (#254)

Changes in version 1.0.5 (2023-10-20)

- step_untokenize() and step_normalization() now return factors instead of strings. (#247)

Changes in version 1.0.4 (2023-08-17)

Improvements

- step_clean_names() now throws an informative error if needed non-standard role columns are missing during bake(). (#235)
- The keep_original_cols argument has been added to step_tokenmerge(). With this change, every step that produces new columns should have the keep_original_cols argument. (#242)
- Many internal changes to improve consistency, along with slight speed increases.

Bug Fixes

- Fixed bug where step_dummy_hash() and step_texthash() would add new columns before old columns. (#235)
- Fixed bug where vocabulary_size wasn't tunable in step_tokenize_bpe(). (#239)

Changes in version 1.0.3 (2023-04-14)

Improvements

- Steps with tunable arguments now have those arguments listed in the documentation.
- All steps that add new columns will now informatively error if a name collision occurs.

Bug Fixes

- Fixed bug where step_tf() wasn't tunable for the weight argument.

Changes in version 1.0.2 (2022-12-21)

- Setting token = "tweets" in step_tokenize() has been deprecated due to tokenizers::tokenize_tweets() being deprecated. (#209)
- step_sequence_onehot(), step_dummy_hash(), step_dummy_texthash() now return integers. step_tf() returns integers when weight_scheme is "binary" or "raw count".
- All steps now have required_pkgs() methods.

Changes in version 1.0.1 (2022-10-06)

- Examples no longer include if (require(...)) code.

Changes in version 1.0.0 (2022-07-02)

- Indicate which steps support case weights (none), to align documentation with other packages.

Changes in version 0.5.2 (2022-05-04)

- Remove use of okc_text in vignette.
- Fix bug in printing of tokenlists.

Changes in version 0.5.1 (2022-03-29)

- step_tfidf() now correctly saves the idf values and applies them to the testing data set.
- tidy.step_tfidf() now returns calculated IDF weights.

Changes in version 0.5.0 (2022-03-20)

New steps

- step_dummy_hash() generates binary indicators (possibly signed) from simple factor or character vectors.
- step_tokenize() has gotten a couple of cousin functions, step_tokenize_bpe(), step_tokenize_sentencepiece() and step_tokenize_wordpiece(), which wrap {tokenizers.bpe}, {sentencepiece} and {wordpiece} respectively (#147).

Improvements and Other Changes

- Added all_tokenized() and all_tokenized_predictors() to more easily select tokenized columns (#132).
- Use show_tokens() to more easily debug a recipe involving tokenization.
- Reorganize documentation for all recipe step tidy methods (#126).
- Steps now have a dedicated subsection detailing what happens when tidy() is applied. (#163)
- All recipe steps now officially support empty selections to be more aligned with dplyr and other packages that use tidyselect (#141).
- step_ngram() has been given a speed increase to put it in line with other packages' performance.
- step_tokenize() will now try to error if vocabulary size is too low when using engine = "tokenizers.bpe" (#119).
- The warning given by step_tokenfilter() when filtering failed to apply now refers to the correct argument name (#137).
- step_tf() now returns 0 instead of NaN when there aren't any tokens present (#118).
- step_tokenfilter() now has a new argument filter_fun, which takes a function that can be used to filter tokens. (#164)
- tidy.step_stem() now correctly shows if a custom stemmer was used.
- Added keep_original_cols argument to step_lda(), step_texthash(), step_tf(), step_tfidf(), step_word_embeddings(), step_dummy_hash(), step_sequence_onehot(), and step_textfeatures() (#139).

Breaking Changes

- Steps with a prefix argument now create names according to the pattern prefix_variablename_name/number. (#124)

Changes in version 0.4.1 (2021-07-11)

Bug fixes

- Fixed a bug in step_tokenfilter() and step_sequence_onehot() that sometimes caused crashes in R 4.1.0.

Changes in version 0.4.0 (2020-11-12)

Breaking Changes

- step_lda() now takes a tokenlist instead of a character variable. See readme for more detail.

New Features

- step_sequence_onehot() now takes tokenlists as input.
- Added {tokenizers.bpe} engine to step_tokenize().
- Added {udpipe} engine to step_tokenize().
- Added new steps for cleaning variable names or levels with {janitor}, step_clean_names() and step_clean_levels(). (#101)

Changes in version 0.3.0 (2020-07-08)

- The stopwords package has been moved from Imports to Suggests.
- step_ngram() gained an argument min_num_tokens to be able to return multiple n-grams together. (#90)
- Added step_text_normalization() to perform Unicode normalization on character vectors. (#86)

Changes in version 0.2.3 (2020-05-22)

Changes in version 0.2.2 (2020-05-10)

- step_word_embeddings() gained an argument aggregation_default to specify the value used in cases where no words match the embedding.

Changes in version 0.2.1 (2020-05-04)

Changes in version 0.2.0 (2020-04-14)

- step_tokenize() gained an engine argument to specify packages other than tokenizers to tokenize.
- spacyr has been added as an engine to step_tokenize().
- step_lemma() has been added to extract the lemma attribute from tokenlists.
- step_pos_filter() has been added to allow filtering of tokens based on their part-of-speech tags.
- step_ngram() has been added to generate ngrams from tokenlists.
- step_stem() now correctly uses the options argument. (Thanks to @grayskripko for finding bug, #64)

Changes in version 0.1.0 (2020-03-05)

- step_word2vec() has been changed to step_lda() to reflect what is actually happening.
- step_word_embeddings() has been added. Allows for use of pre-trained word embeddings to convert token columns to vectors in a high-dimensional "meaning" space. (@jonthegeek, #20)
- text2vec has been changed from Imports to Suggests.
- textfeatures has been changed from Imports to Suggests.
- step_tfidf() calculations are slightly changed due to a flaw in the original implementation: https://github.com/dselivanov/text2vec/issues/280.

Changes in version 0.0.2 (2019-09-07)

- A custom stemming function can now be used in step_stem() using the custom_stemmer argument.
- step_textfeatures() has been added; it allows multiple numerical features to be pulled from text.
- step_sequence_onehot() has been added; it allows one-hot encoding of sequences of fixed width.
- step_word2vec() has been added; it calculates word2vec dimensions.
- step_tokenmerge() has been added; it combines multiple list columns into one list column.
- step_texthash() now correctly accepts the signed argument.
- Documentation has been improved to showcase the importance of filtering tokens before applying step_tf() and step_tfidf().

Changes in version 0.0.1 (2018-12-17)

First CRAN version