Reading a UTF-8 file and parsing characters

Question

I originally had the below code using std::wstring and was using wide strings that were statically typed into the code.

Later I learned that UTF-8 will "fit" into std::string and that there was no real need for std::wstring but that I might need some encoding translations later on. So I have a UTF-8 encoded text file that I'm reading in.

#include <iostream>
#include <fstream>
#include <string>

class A
{
public:
    A(std::istream& stream)
    :
        m_stream(stream),
        m_lineNumber(1),
        m_characterNumber(1)
    {

    }

    bool OutputKnownWords()
    {
        while(m_stream.good())
        {
            if(Take("MIDDLE"))
                std::cout << "Found middle" << std::endl;
            else if(Take("BEGIN"))
                std::cout << "Found begin" << std::endl;
            else if(Take("END"))
                std::cout << "Found end" << std::endl;
            else if(Take(" "))
                std::cout << "parsed out space" << std::endl;
            else
                return false;
        }
        return true;
    }

protected:

    std::istream::char_type Get()
    {
        auto c = m_stream.get();
        ++m_characterNumber;
        if(c == '\n')
        {
            ++m_lineNumber;
            m_characterNumber = 1;
        }
        return c;
    }

    bool Take(const std::string& str)
    {
        if(!Match(str))
            return false;

        for(std::string::size_type i = 0; i < str.size(); ++i)
            Get();

        return true;
    }

    bool Match(const std::string& str)
    {
        auto cursorPos = m_stream.tellg();

        std::string readStr(str.size(),'\0');

        m_stream.read(&readStr[0],str.size());

        if(std::size_t(m_stream.gcount()) < str.size() || readStr != str)
        {
            if(!m_stream.good())
                m_stream.clear();
            m_stream.seekg(cursorPos);
            return false;
        }
        m_stream.seekg(cursorPos);
        return true;
    }

    std::istream& m_stream;
    std::size_t m_lineNumber;
    std::size_t m_characterNumber;
};

int main()
{
    std::ifstream file("test.txt");
    if(!file.is_open())
    {
        std::cerr << "could not open file" << std::endl;
        return 0;
    }

    A a(file);

    if(!a.OutputKnownWords())
    {
        std::cerr << "something went wrong" << std::endl;
        return 0;
    }

    return 0;
}

test.txt

BEGIN MIDDLE
END

So I would expect that this program outputs:

Found begin
parsed out space
Found middle
parsed out space
Found end

However, OutputKnownWords returns an error. I stepped through with the debugger and I found that the seekg calls in Match appear to not be setting the correct position. It's like, each test is out by one character.

When I was doing this with wide strings statically typed I had no problem.

I sort of think this might be related to the difference between UTF-8 encoding vs std::string's idea of a "character". But I'm not sure how then to handle how many "characters" are in an std::string.

I don't think this is related to the question "tellg() function give wrong size of file?", because I'm not doing anything with the position returned by tellg other than using it to restore the stream position.
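
On the "characters" point: std::string::size() counts bytes (char elements), not Unicode code points. For ASCII-only tokens such as BEGIN or END the two numbers are the same, so the lengths used by Take and Match are not affected by the file being UTF-8. A minimal sketch of counting code points, assuming the input is valid UTF-8 (countCodePoints is a hypothetical helper name, not part of the code above):

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-8 encoded std::string.
// Every code point has exactly one byte that is NOT a continuation byte
// (continuation bytes have the bit pattern 10xxxxxx), so counting those
// bytes gives the number of code points. Assumes the input is valid UTF-8.
std::size_t countCodePoints(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8)
    {
        if ((byte & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For a string holding the two bytes 0xC3 0xA9 (the UTF-8 encoding of "é"), size() is 2 while this helper reports 1.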

Note that `while (m_stream.good())` is really not that different from `while (!m_stream.eof())`, and [`eof` inside a loop condition is generally always wrong](http://stackoverflow.com/questions/5605125/why-is-iostreameof-inside-a-loop-condition-considered-wrong). — Some programmer dude, Aug 04 '18 at 05:59
Where is the UTF-8? Also, in order to get closer to a [mcve], you could use a stringstream, which would allow putting all this into one file. — Ulrich Eckhardt, Aug 04 '18 at 06:09
@UlrichEckhardt the `test.txt` is UTF-8 encoded. It's in two files because I don't have an error when I use a `stringstream`. — NeomerArcana, Aug 04 '18 at 06:12
Try opening the stream in binary mode, due to newline translations seeks in text file streams are unreliable — Alan Birtles, Aug 04 '18 at 06:17
Possible duplicate of [tellg() function give wrong size of file?](https://stackoverflow.com/questions/22984956/tellg-function-give-wrong-size-of-file) — Alan Birtles, Aug 04 '18 at 06:23
@AlanBirtles I don't believe it's to do with `tellg()` because I do no translations on it after calling it, I just use it to go back to where I was before the `Match()` test. — NeomerArcana, Aug 04 '18 at 06:30
1. Run the test on a western encoded file. 2. Read about UTF-8 decoding. — zdf, Aug 04 '18 at 06:59
@ZDF there should be a prize for most non-helpful comment. You would win. — NeomerArcana, Aug 04 '18 at 07:03
There is no UTF-8 decoding in your code. Search std for UTF-8 decoding. Irony? - Love, Sheldon. — zdf, Aug 04 '18 at 07:11
Use your debugger to step through the program. Find the point where the expected behaviour diverges from the actual behaviour. Observe the values of local variables. The problem has nothing to do with UTF-8 whatsoever. — n. m. could be an AI, Aug 04 '18 at 07:17
@n.m. thanks for the suggestion. I've done this but I still can't work out the issue. Maybe you could copy and paste the code and show me where the problem is by posting an answer. — NeomerArcana, Aug 04 '18 at 07:25
Pay special attention to `readStr` in `Match`. Does it always have the value you expect? — n. m. could be an AI, Aug 04 '18 at 11:06
@n.m. no it doesn't. That's like the whole problem. Seekg to a position returned by tellg seems to be like "misaligned" — NeomerArcana, Aug 04 '18 at 11:37
"Seekg to a position returned by tellg seems to be like "misaligned" No it isn't. Since you have the newline character in the string, it stands to reason that you have failed to read the newline character, rather than there's any tellg misalignment. — n. m. could be an AI, Aug 04 '18 at 11:58
@n.m. reading a newline or not has no bearing on whether seekg to a cursor previously returned by tellg will work correctly. — NeomerArcana, Aug 04 '18 at 12:17
Reading a newline or not has everything to do with the overall correctness of your program. If you don't read the newline correctly, your results will be wrong. You blame your wrong results on seekg/tellg but there's no evidence for it whatsoever. Did you rule out incorrect reading of the newline? How? — n. m. could be an AI, Aug 04 '18 at 13:31
Let's try a minimal change. Add this: `else if(Take("\n")) std::cout << "parsed out newline" << std::endl;` Does it improve the results? — n. m. could be an AI, Aug 04 '18 at 13:42
@n.m. I think you should read the comments on one of the answers to appreciate where the error is. — NeomerArcana, Aug 04 '18 at 22:28
I think you should try what I have suggested before alleging that I don't understand something in your code. — n. m. could be an AI, Aug 05 '18 at 02:55

score 1 · Accepted Answer · answered Aug 04 '18 at 07:42

A much simpler and more efficient version of your code would be:

#include <iostream>
#include <fstream>
#include <string>

class A
{
public:
    A(std::istream& stream)
        :
        m_stream(stream),
        m_lineNumber(0),
        m_characterNumber(0)
    {

    }

    bool OutputKnownWords()
    {
        while (m_stream.good())
        {
            if (Take("MIDDLE"))
                std::cout << "Found middle" << std::endl;
            else if (Take("BEGIN"))
                std::cout << "Found begin" << std::endl;
            else if (Take("END"))
                std::cout << "Found end" << std::endl;
            else if (Take(" "))
                std::cout << "parsed out space" << std::endl;
            else
                return !m_stream.good();
        }
        return true;
    }

protected:

    bool Take(const std::string& str)
    {
        if (!Match(str))
            return false;

        m_characterNumber += str.size();

        return true;
    }

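    bool readLine()
    {
        // Test the result of getline itself: a final line without a trailing
        // newline sets eofbit but is still read successfully.
        if (!std::getline(m_stream, line))
            return false;
        m_characterNumber = 0;
        m_lineNumber++;
        return true;
    }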

    bool Match(const std::string& str)
    {
        while (m_characterNumber >= line.size())
        {
            if (!readLine())
            {
                return false;
            }
        }
        if (line.size() - m_characterNumber < str.size())
        {
            return false;
        }
        return line.substr(m_characterNumber, str.size()) == str;
    }

    std::istream& m_stream;
    std::size_t m_lineNumber;
    std::size_t m_characterNumber;
    std::string line;
};

int main()
{
    std::ifstream file("test.txt");
    if (!file.is_open())
    {
        std::cerr << "could not open file" << std::endl;
        return 0;
    }

    A a(file);

    if (!a.OutputKnownWords())
    {
        std::cerr << "something went wrong" << std::endl;
        return 0;
    }

    return 0;
}

this seems to work. I note you're not using tellg and seekg. I would really like to get to the bottom of this. — NeomerArcana, Aug 04 '18 at 08:01
@NeomerArcana seeking in text streams is unreliable (seems especially so in MinGW), as seeking is not necessary to implement your program I'd just go with the simpler option — Alan Birtles, Aug 04 '18 at 21:54
This works and after some research, windows seeking in text mode is a bit broken. — NeomerArcana, Aug 05 '18 at 03:29

score 0 · Answer 2 · answered Aug 04 '18 at 07:05

I'm not sure what exactly you are ultimately trying to achieve, but your code seems unnecessarily complex and difficult to understand and debug. There are a couple of issues. The first is that you never consume the newline character, as you only call Get for the number of characters in your string (if you debug, you'll notice that the string you read in Match is "\nEN" instead of "END"). The second is that if you try to read a string but the stream doesn't return the requested number of characters, you clear the error flags and return. This means you'll never reach the end of the file, and your while condition of stream.good() will never fail.

To fix these issues change Match to:

bool Match(const std::string& str)
{
    auto cursorPos = m_stream.tellg();
    if (m_stream.peek() == '\n')
    {
        //consume the newline
        Get();
        cursorPos = m_stream.tellg();
    }

    std::string readStr(str.size(), '\0');

    m_stream.read(&readStr[0], str.size());
    if (m_stream.gcount() == 0)
    {
        // must be at EOF
        return false;
    }
    if (std::size_t(m_stream.gcount()) < str.size() || readStr != str)
    {
        std::cout << "expected '" << str << "' actual '" << readStr << "'\n";
        if (!m_stream.good())
            m_stream.clear();
        m_stream.seekg(cursorPos);
        return false;
    }
    m_stream.seekg(cursorPos);
    return true;
}

change OutputKnownWords to:

bool OutputKnownWords()
{
    while (m_stream.good())
    {
        if (Take("MIDDLE"))
            std::cout << "Found middle" << std::endl;
        else if (Take("BEGIN"))
            std::cout << "Found begin" << std::endl;
        else if (Take("END"))
            std::cout << "Found end" << std::endl;
        else if (Take(" "))
            std::cout << "parsed out space" << std::endl;
        else
            // stream can only be not good at EOF
            return !m_stream.good();
    }
    return true;
}

Thanks for the suggestion, but this doesn't fix the core problem. The output `expected 'MIDDLE' actual 'BEGIN ' Found begin expected 'MIDDLE' actual ' MIDDL' expected 'BEGIN' actual 'IDDLE' expected 'END' actual 'IDD' expected ' ' actual 'I' something went wrong` might explain what is happening. — NeomerArcana, Aug 04 '18 at 07:13
@NeomerArcana it works for me, did you try debugging? What platform and compiler are you using? — Alan Birtles, Aug 04 '18 at 07:26
Possibly a bug in the mingw stream implementation? Almost looks like peek is consuming a character — Alan Birtles, Aug 04 '18 at 07:36
Also looks like the stream is being opened in text mode, which can cause problems due to how it attempts to handle CRLF as line terminator. (It tries to be helpful, and often the CRLF handling is helpful. Except when it doesn't work well, such as trying to treat the file as binary.) — Eljay, Aug 04 '18 at 13:20
@Eljay seems to be going wrong before it reaches the newline though according to the above debug output. — Alan Birtles, Aug 04 '18 at 21:55
