I have been searching the internet but could not find an existing tool for extracting words from a file using a specific delimiter in C++. Does anyone know of an existing C++ library or piece of code that does this? Here is what I want to achieve:
- Objective: extract words from a file using a delimiter
- Word type: words can be made of any combination of unsigned characters (within the UTF-8 encoding set), so \0 should also be treated as a character. Only the delimiter should separate two words from each other.
- File type: text file
I have tried the following code:
#include <iostream>
using std::cout;
using std::endl;
#include <fstream>
using std::ifstream;
#include <cstring>
const int MAX_TOKENS_PER_FILE = 100000;
const int MAX_CHARS_PER_LINE = 512;
const int MAX_TOKENS_PER_LINE = 256;
const char* const DELIMITER = " ";
int main()
{
  int index = 0, keyword_num = 0;
  // stores all the words that are in a file
  unsigned char *keywords_extracted[MAX_TOKENS_PER_FILE];    
  // create a file-reading object
  ifstream fin;
  fin.open("data.txt"); // open a file
  if (!fin.good()) 
    return 1; // exit if file not found
  // read each line of the file
  while (!fin.eof())
  {
    // read an entire line into memory
    char buf[MAX_CHARS_PER_LINE];
    fin.getline(buf, MAX_CHARS_PER_LINE);
    // parse the line into blank-delimited tokens
    int n = 0; // a for-loop index
    // array to store memory addresses of the tokens in buf
    const char* token[MAX_TOKENS_PER_LINE] = {}; // initialize to 0
    // parse the line
    token[0] = strtok(buf, DELIMITER); // first token
    if (token[0]) // zero if line is blank
    {
      keywords_extracted[keyword_num] = (unsigned char *)token[0];
      keyword_num++;
      for (n = 1; n < MAX_TOKENS_PER_LINE; n++)
      {
        token[n] = strtok(0, DELIMITER); // subsequent tokens
        if (!token[n]) break; // no more tokens
        keywords_extracted[keyword_num] = (unsigned char *)token[n];
        keyword_num++;
      }
    }
  }
  // process (print) the tokens
  for (index = 0; index < keyword_num; index++)
    cout << keywords_extracted[index] << endl;
}
But I have a problem with the above code:
- The first word/entry in keywords_extracted is being replaced with '0', as the content of the last line the program reads is empty (correct me if I'm doing or assuming anything wrong).
Is there a way to overcome this problem in the above code, or are there any existing libraries for this functionality? Sorry for the lengthy explanation; I am just trying to be clear.
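For comparison, here is a minimal sketch of one standard-library approach; it is an illustration under stated assumptions, not a definitive implementation. It assumes the delimiter is a single byte and that empty words between repeated delimiters should be dropped, as strtok does. The names split_stream and demo are hypothetical. Each word is copied into its own std::string, so nothing points into a line buffer that is reused on the next read (which may be what overwrites earlier entries in the code above), and std::getline with an explicit delimiter treats '\0' and '\n' as ordinary characters.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of any library): splits everything readable
// from `in` into words separated by the single byte `delimiter`.
// Every other byte, including '\0' and '\n', stays inside a word.
std::vector<std::string> split_stream(std::istream& in, char delimiter)
{
    std::vector<std::string> words;
    std::string word;
    // std::getline with an explicit delimiter stops only at that byte.
    while (std::getline(in, word, delimiter)) {
        // Assumption: drop empty words produced by repeated delimiters,
        // mirroring what strtok does.
        if (!word.empty())
            words.push_back(word); // stores an independent copy
    }
    return words;
}

// Quick in-memory self-check of the sketch: the embedded '\0' and the '\n'
// remain inside their words; only ' ' separates them.
void demo()
{
    const std::string data("ab\0cd ef\ngh", 11);
    std::istringstream in(data);
    const std::vector<std::string> words = split_stream(in, ' ');
    assert(words.size() == 2);
    assert(words[0] == std::string("ab\0cd", 5));
    assert(words[1] == "ef\ngh");
}
```

With a file, the same function could be called as split_stream(fin, ' ') on a std::ifstream opened with std::ios::binary, so that no newline translation takes place.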