An inline assembly routine using "plain old" jz/jnz is unlikely to be faster than what you have; that said, you have a few inefficiencies in your code:
- you're retrieving tstring.length()once per loop iteration; that's unnecessary.
- you're using random indexing, tstring[len]which might be a more-expensive operation than using a forward iterator.
- you're calling stuff()during the loop; depending on what exactly that does, it might be faster to just let the loop build a list of locations within the string first (so that the scanned string as well as the scanning code stays cache-hot and is not evicted by whateverstuff()does), and only afterwards iterate over those results.
There's already a likely low-level optimized standard library function available,strchr(), for exactly that kind of scanning. The C++ STL std::string::find() is also likely to have been optimized for the purpose (and/or might use strchr() in the char specialization).
In particular, strchr() has SSE2 (using pcmpeqb, maskmov... and bsf) or SSE4.2 (using the string op pcmpistri) implementations; for examples/actual SSE code doing this, check e.g. strchr() in GNU libc (as used on Linux). See also the references and comments here (suitably named website ...).
My advice: Check your library implementation / documentation, and/or the actual generated assembly code for your program. You might well be using fast code already ... or would be if you'd switch from your hand-grown character-by-character simple search to just using std::string::find() or strchr().
If this is ultra-speed-critical, then inlining assembly code for strchr() as used by known/tested implementations (watch licensing) would eliminate function calls and gain a few cycles. Depends on your requirements ... code, benchmark, vary, benchmark again, ...