Jonathan Mee
I can tokenize by writing my own function:

    std::vector<std::string> Foo(const std::string& input) {
        auto start = std::find(std::cbegin(input), std::cend(input), ' ');
        std::vector<std::string> output{ std::string(std::cbegin(input), start) };

        while (start != std::cend(input)) {
            const auto finish = std::find(++start, std::cend(input), ' ');

            output.push_back(std::string(start, finish));
            start = finish;
        }

        return output;
    }
This has several issues. Most importantly: doesn't C++ already provide me something to do this? But also:
1. `Foo` includes spaces in the tokens
1. `Foo` makes a token for each space, even repeated spaces
1. `Foo` only delimits based on spaces, not other white space
Is there something better available to me?
Top Answer
Jonathan Mee
There are 4 solutions which C++ provides, listed from least to most expensive at run time:
1. [`std::strtok`](https://en.cppreference.com/w/cpp/string/byte/strtok)
1. [`std::ranges::split_view`](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0789r1.pdf)
1. [`std::istream_iterator`](https://en.cppreference.com/w/cpp/iterator/istream_iterator)
1. [`std::regex_token_iterator`](https://en.cppreference.com/w/cpp/regex/regex_token_iterator)
They are discussed in detail below:
# `std::strtok`
`std::strtok` is a destructive tokenizer, meaning:
1. `std::strtok` modifies the string being tokenized, so it cannot operate on `const std::string`s or `const char*`s; if the string to be tokenized needs to be preserved, a copy must be made for `std::strtok` to consume
1. Because `std::strtok` depends upon modifying the string being tokenized, tokenization of multiple strings cannot be interleaved, though some implementations, such as [`strtok_s`](https://msdn.microsoft.com/en-us/library/ftsafwz3.aspx/), do support this
1. Additionally, the standard places no thread-safety requirements on `std::strtok`, though some implementations, such as [`strtok_s`](https://msdn.microsoft.com/en-us/library/ftsafwz3.aspx/), are thread safe
You could rewrite `Foo` with `std::strtok` as follows:

    std::vector<std::string> Foo(std::string input) {
        std::vector<std::string> output;

        for (auto i = std::strtok(std::data(input), " "); i != nullptr; i = std::strtok(nullptr, " ")) {
            output.push_back(i);
        }

        return output;
    }
In fact this solves issues **1** and **2** as listed in your question, since `std::strtok` skips runs of delimiters and never includes them in the returned tokens; issue **3** remains as written, though it could be addressed by passing every whitespace character, e.g. `" \t\n"`, as the delimiter set. The price is the destructive, non-reentrant behavior described above.
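Since the delimiter set is just a C string, whitespace beyond the space character can be handled by listing every whitespace character as a delimiter. A minimal sketch (the function name and delimiter set are my own; the copy is consumed, leaving the caller's string intact):

```cpp
#include <cstring>
#include <string>
#include <vector>

// Tokenize a by-value copy of the input with std::strtok, passing
// spaces, tabs, newlines, and carriage returns as delimiters so any
// whitespace splits tokens. std::strtok overwrites delimiters in the
// copy with null characters; the caller's string is untouched.
std::vector<std::string> TokenizeWhitespace(std::string input) {
    constexpr const char* delimiters = " \t\n\r";
    std::vector<std::string> output;

    for (auto token = std::strtok(input.data(), delimiters);
         token != nullptr; token = std::strtok(nullptr, delimiters)) {
        output.push_back(token);
    }

    return output;
}
```

Note that `input.data()` returning a modifiable `char*` requires C++17.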
# `std::ranges::split_view`
C++20 gives us `std::ranges::split_view`; the exact implementation is not yet official, but the examples we've been given suggest `Foo` should be written like:

    std::vector<std::string> Foo(const std::string& input) {
        std::vector<std::string> output;

        for (const auto& i : input | std::ranges::views::split(' ')) {
            output.emplace_back(std::cbegin(i), std::cend(i));
        }

        return output;
    }
This suffers from issues **2** and **3** as listed in your question (issue **1** is solved, since the delimiter is not included in the produced tokens), and it improves over `std::strtok` by tokenizing without destroying `input`. Note that the C++20 standard hasn't been finalized; I've used [this resource](https://ezoeryou.github.io/blog/article/2019-01-10-range-view.html) in prototyping `std::ranges::split_view` code.
# `std::istream_iterator`
`std::istream_iterator` requires a `std::istringstream` to be created, but makes `Foo` very easy to write:

    std::vector<std::string> Foo(const std::string& input) {
        std::istringstream stream(input);

        return { std::istream_iterator<std::string>(stream), std::istream_iterator<std::string>() };
    }
This solves all issues listed in your question, but adds the cost of constructing a `std::istringstream`.
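As a quick sanity check (the function name and sample input here are my own), stream extraction skips any run of whitespace, so tabs, newlines, and repeated spaces all behave as a single delimiter:

```cpp
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// operator>> for std::string skips leading whitespace before each
// extraction, so no empty tokens are produced and all whitespace
// characters act as delimiters.
std::vector<std::string> StreamTokenize(const std::string& input) {
    std::istringstream stream(input);

    return { std::istream_iterator<std::string>(stream),
             std::istream_iterator<std::string>() };
}
```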
# `std::regex_token_iterator`
`std::regex_token_iterator` requires a regex which captures tokens. This provides greater flexibility, because the delimiters need not be whitespace, but a regex must be run over the string to be tokenized. If `Foo` were rewritten with a `std::regex_token_iterator` it would look something like:

    std::vector<std::string> Foo(const std::string& input) {
        const std::regex token_pattern{ R"((?:^|\s*)(\S+))" };

        return { std::sregex_token_iterator(std::cbegin(input), std::cend(input), token_pattern, 1), std::sregex_token_iterator() };
    }
This solves all the issues listed in your question, but adds the cost of running a regex on the string to be tokenized.
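Because the token pattern is an arbitrary regex, the same approach extends to non-whitespace delimiters. A sketch of my own (the comma delimiter and names are illustrative, not from the question):

```cpp
#include <regex>
#include <string>
#include <vector>

// Capture runs of non-comma characters; submatch 1 of each match is a
// token, so empty fields between consecutive commas are skipped.
std::vector<std::string> SplitOnCommas(const std::string& input) {
    static const std::regex token_pattern{ R"(([^,]+))" };

    return { std::sregex_token_iterator(std::cbegin(input), std::cend(input), token_pattern, 1),
             std::sregex_token_iterator() };
}
```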