Jonathan Mee
I can tokenize by writing my own function:

    std::vector<std::string> Foo(const std::string& input) {
        auto start = std::find(std::cbegin(input), std::cend(input), ' ');
        std::vector<std::string> output{ std::string(std::cbegin(input), start) };

        while (start != std::cend(input)) {
            const auto finish = std::find(++start, std::cend(input), ' ');

            output.push_back(std::string(start, finish));
            start = finish;
        }

        return output;
    }
This has several issues. Most importantly: doesn't C++ already provide me something to do this? But also:
1. `Foo` includes spaces in the tokens
1. `Foo` makes a token for each space, even repeated spaces
1. `Foo` only delimits based on spaces, not other white space
Is there something better available to me?
Top Answer
Jonathan Mee
There are 4 solutions which C++ provides, listed from least to most expensive at run time:
1. [`std::strtok`](https://en.cppreference.com/w/cpp/string/byte/strtok)
1. [`std::ranges::split_view`](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0789r1.pdf)
1. [`std::istream_iterator`](https://en.cppreference.com/w/cpp/iterator/istream_iterator)
1. [`std::regex_token_iterator`](https://en.cppreference.com/w/cpp/regex/regex_token_iterator)
They are discussed in detail below:
# `std::strtok`
`std::strtok` is a destructive tokenizer, meaning:
1. `std::strtok` modifies the string being tokenized, so it cannot operate on `const std::string`s or `const char*`s; if the string to be tokenized needs to be preserved, a copy must be made for `std::strtok` to consume
1. Because `std::strtok` depends upon modifying the string being tokenized, tokenization of multiple strings cannot be interleaved, though some implementations, such as [`strtok_s`](https://msdn.microsoft.com/en-us/library/ftsafwz3.aspx/), do support this
1. Additionally, the standard places no thread-safety requirements on `std::strtok`, though some implementations, such as [`strtok_s`](https://msdn.microsoft.com/en-us/library/ftsafwz3.aspx/), are thread safe
You could rewrite `Foo` with `std::strtok` as follows:

    std::vector<std::string> Foo(std::string input) {
        std::vector<std::string> output;

        for (auto i = std::strtok(std::data(input), " "); i != nullptr; i = std::strtok(nullptr, " ")) {
            output.push_back(i);
        }

        return output;
    }
In fact this solves issues **1** and **2** as listed in your question, since `std::strtok` skips runs of delimiters and never includes them in the returned tokens; issue **3** remains as written, though it could be addressed by passing every whitespace character, e.g. `" \t\n"`, as the delimiter set. The price is the destructive, non-reentrant behavior described above.
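Since the delimiter set is just a C string, whitespace beyond the space character can be handled by listing every whitespace character as a delimiter. A minimal sketch (the function name and delimiter set are my own; the copy is consumed, leaving the caller's string intact):

```cpp
#include <cstring>
#include <string>
#include <vector>

// Tokenize a by-value copy of the input with std::strtok, passing
// spaces, tabs, newlines, and carriage returns as delimiters so any
// whitespace splits tokens. std::strtok overwrites delimiters in the
// copy with null characters; the caller's string is untouched.
std::vector<std::string> TokenizeWhitespace(std::string input) {
    constexpr const char* delimiters = " \t\n\r";
    std::vector<std::string> output;

    for (auto token = std::strtok(input.data(), delimiters);
         token != nullptr; token = std::strtok(nullptr, delimiters)) {
        output.push_back(token);
    }

    return output;
}
```

Note that `input.data()` returning a modifiable `char*` requires C++17.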
# `std::ranges::split_view`
C++20 gives us `std::ranges::split_view`; the exact implementation is not yet official, but the examples we've been given suggest `Foo` should be written like:

    std::vector<std::string> Foo(const std::string& input) {
        std::vector<std::string> output;

        for (const auto& i : input | std::ranges::views::split(' ')) {
            output.emplace_back(std::cbegin(i), std::cend(i));
        }

        return output;
    }
This suffers from issues **2** and **3** as listed in your question (issue **1** is solved, since the delimiter is not included in the produced tokens), and it improves over `std::strtok` by tokenizing without destroying `input`. Note that the C++20 standard hasn't been finalized; I've used [this resource](https://ezoeryou.github.io/blog/article/2019-01-10-range-view.html) in prototyping `std::ranges::split_view` code.
# `std::istream_iterator`
`std::istream_iterator` requires a `std::istringstream` to be created, but makes `Foo` very easy to write:

    std::vector<std::string> Foo(const std::string& input) {
        std::istringstream stream(input);

        return { std::istream_iterator<std::string>(stream), std::istream_iterator<std::string>() };
    }
This solves all issues listed in your question, but adds the cost of constructing a `std::istringstream`.
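As a quick sanity check (the function name and sample input here are my own), stream extraction skips any run of whitespace, so tabs, newlines, and repeated spaces all behave as a single delimiter:

```cpp
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// operator>> for std::string skips leading whitespace before each
// extraction, so no empty tokens are produced and all whitespace
// characters act as delimiters.
std::vector<std::string> StreamTokenize(const std::string& input) {
    std::istringstream stream(input);

    return { std::istream_iterator<std::string>(stream),
             std::istream_iterator<std::string>() };
}
```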
# `std::regex_token_iterator`
`std::regex_token_iterator` requires a regex which captures tokens. This provides greater flexibility, because the delimiters need not be whitespace, but a regex must be run over the string to be tokenized. If `Foo` were rewritten with a `std::regex_token_iterator` it would look something like:

    std::vector<std::string> Foo(const std::string& input) {
        const std::regex token_pattern{ R"((?:^|\s*)(\S+))" };

        return { std::sregex_token_iterator(std::cbegin(input), std::cend(input), token_pattern, 1), std::sregex_token_iterator() };
    }
This solves all the issues listed in your question, but adds the cost of running a regex on the string to be tokenized.
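Because the token pattern is an arbitrary regex, the same approach extends to non-whitespace delimiters. A sketch of my own (the comma delimiter and names are illustrative, not from the question):

```cpp
#include <regex>
#include <string>
#include <vector>

// Capture runs of non-comma characters; submatch 1 of each match is a
// token, so empty fields between consecutive commas are skipped.
std::vector<std::string> SplitOnCommas(const std::string& input) {
    static const std::regex token_pattern{ R"(([^,]+))" };

    return { std::sregex_token_iterator(std::cbegin(input), std::cend(input), token_pattern, 1),
             std::sregex_token_iterator() };
}
```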