Alt+Ctrl+Backspace: C++ utf8 to wstring conversion routine

Some time ago I was looking for a C++ routine that could convert a UTF-8 string (stored in std::string) into std::wstring and vice versa. It is certainly possible to do that using setlocale and some C functions, but I wanted something done in "pure C++". I did some research, didn't find anything, and in the end wrote the conversion routine myself (using information from Wikipedia).

Here is the code:

Str.h:

#ifndef STR_H
#define STR_H
#include <string>
#include <iostream>

typedef std::string Str;
typedef std::wstring WStr;

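std::ostream& operator<<(std::ostream& f, const WStr& s);
std::istream& operator>>(std::istream& f, WStr& s);
void utf8toWStr(WStr& dest, const Str& src);
void wstrToUtf8(Str& dest, const WStr& src);
// value-returning overloads defined in Str.cpp
WStr utf8toWStr(const Str& str);
Str wstrToUtf8(const WStr& str);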

#endif

Str.cpp:

/*
Copyright (c) 2009 SegFault aka "ErV" (altctrlbackspace.blogspot.com)

Redistribution and use of this source code, with or without modification, is
permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright
 notice, this list of conditions and the following disclaimer.

THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR IMPLIED
WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO
EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#include "Str.h"
#ifdef UTF8TEST
#include <stdio.h>
#endif

void utf8toWStr(WStr& dest, const Str& src){
 dest.clear();
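 wchar_t w = 0; // character being assembled from a multibyte sequence
 int bytes = 0; // continuation bytes still expected
 wchar_t err = L'�'; // substituted for any byte sequence that cannot be decoded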
 for (size_t i = 0; i < src.size(); i++){
  unsigned char c = (unsigned char)src[i];
  if (c <= 0x7f){//first byte
   if (bytes){
    dest.push_back(err);
    bytes = 0;
   }
   dest.push_back((wchar_t)c);
  }
  else if (c <= 0xbf){//second/third/etc byte
   if (bytes){
    w = ((w << 6)|(c & 0x3f));
    bytes--;
    if (bytes == 0)
     dest.push_back(w);
   }
   else
    dest.push_back(err);
  }
  else if (c <= 0xdf){//2byte sequence start
   bytes = 1;
   w = c & 0x1f;
  }
  else if (c <= 0xef){//3byte sequence start
   bytes = 2;
   w = c & 0x0f;
  }
  else if (c <= 0xf7){//4byte sequence start
   bytes = 3;
   w = c & 0x07;
  }
  else{
   dest.push_back(err);
   bytes = 0;
  }
 }
 if (bytes)
  dest.push_back(err);
}

void wstrToUtf8(Str& dest, const WStr& src){
 dest.clear();
 for (size_t i = 0; i < src.size(); i++){
  wchar_t w = src[i];
  if (w <= 0x7f)
   dest.push_back((char)w);
  else if (w <= 0x7ff){
   dest.push_back(0xc0 | ((w >> 6)& 0x1f));
   dest.push_back(0x80| (w & 0x3f));
  }
  else if (w <= 0xffff){
   dest.push_back(0xe0 | ((w >> 12)& 0x0f));
   dest.push_back(0x80| ((w >> 6) & 0x3f));
   dest.push_back(0x80| (w & 0x3f));
  }
  else if (w <= 0x10ffff){
   dest.push_back(0xf0 | ((w >> 18)& 0x07));
   dest.push_back(0x80| ((w >> 12) & 0x3f));
   dest.push_back(0x80| ((w >> 6) & 0x3f));
   dest.push_back(0x80| (w & 0x3f));
  }
  else
   dest.push_back('?');
 }
}

Str wstrToUtf8(const WStr& str){
 Str result;
 wstrToUtf8(result, str);
 return result;
}

WStr utf8toWStr(const Str& str){
 WStr result;
 utf8toWStr(result, str);
 return result;
}

std::ostream& operator<<(std::ostream& f, const WStr& s){
 Str s1;
 wstrToUtf8(s1, s);
 f << s1;
 return f;
}

std::istream& operator>>(std::istream& f, WStr& s){
 Str s1;
 f >> s1;
 utf8toWStr(s, s1);
 return f;
}

#ifdef UTF8TEST
bool utf8test(){
 WStr w1;
 //for (wchar_t c = 1; c <= 0x10ffff; c++){
 for (wchar_t c = 0x100000; c <= 0x100002; c++){
  w1 += c; 
 }
 Str s = wstrToUtf8(w1);
 WStr w2 = utf8toWStr(s);
 bool result = true;
 if (w1.length() != w2.length()){
  printf("length differs\n");
  //std::cout << "length differs" << std::endl;
  result = false;
 }
 
 printf("w1: %ls\ns: %s\nw2: %ls\n", w1.c_str(), s.c_str(), w2.c_str());
 
 for (size_t i = 0; i < w1.size(); i++)
  if (w1[i] != w2[i]){
   result = false;
   printf("character at pos %lx differs (expected %.8lx got %.8lx)\n", (unsigned long)i, (unsigned long)w1[i], (unsigned long)w2[i]);
   //std::cout << "character at pos " << i  << " differs" << std::endl;
   break;
  }
  
 if (!result){
  printf("utf8 dump: \n");
  for (size_t i = 0; i < s.size(); i++)
   printf("%2x ", (unsigned char)s[i]);
 }
 
 return result;
}

int main(int argc, char** argv){
 std::wstring ws(L"фыва");
 std::string s("фыва");
 std::cout << ws << s << std::endl;
 std::cout << wstrToUtf8(utf8toWStr("фыва")) << std::endl;
 if (utf8test())
  std::cout << "utf8Test successful" << std::endl;
 else
  std::cout << "utf8Test failed" << std::endl;
 return 0;
}
#endif
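
The error handling in utf8toWStr can also be looked at in isolation. Below is a minimal sketch, not the code from Str.cpp: decodeUtf8 is a hypothetical name, it writes into std::u32string (C++11) so the result does not depend on the size of wchar_t, and it assumes U+FFFD as the replacement character. Unlike utf8toWStr it also reports a sequence that is cut short by another start byte; like utf8toWStr it does not reject overlong encodings.

```cpp
#include <cstddef>
#include <string>

// Hypothetical sketch: decode UTF-8 into code points, emitting U+FFFD
// (assumed replacement character) for every malformed sequence.
std::u32string decodeUtf8(const std::string& src){
 const char32_t err = 0xfffd;
 std::u32string dest;
 char32_t cp = 0;
 int pending = 0; // continuation bytes still expected
 for (std::size_t i = 0; i < src.size(); i++){
  unsigned char c = static_cast<unsigned char>(src[i]);
  if (c >= 0x80 && c <= 0xbf){ // continuation byte
   if (!pending){
    dest.push_back(err); // continuation byte with no open sequence
    continue;
   }
   cp = (cp << 6) | (c & 0x3f);
   if (--pending == 0)
    dest.push_back(cp);
   continue;
  }
  if (pending){ // previous sequence was cut short
   dest.push_back(err);
   pending = 0;
  }
  if (c <= 0x7f)
   dest.push_back(c); // plain ASCII
  else if (c <= 0xdf){ pending = 1; cp = c & 0x1f; } // 2byte sequence start
  else if (c <= 0xef){ pending = 2; cp = c & 0x0f; } // 3byte sequence start
  else if (c <= 0xf7){ pending = 3; cp = c & 0x07; } // 4byte sequence start
  else
   dest.push_back(err); // 0xf8..0xff never appear in UTF-8
 }
 if (pending)
  dest.push_back(err); // input ended in the middle of a sequence
 return dest;
}
```

For example, decodeUtf8("\xD1\x84") yields the single code point 0x444, while decodeUtf8("\xD1") yields a single replacement character because the continuation byte is missing.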

The code was successfully tested on a 32-bit Linux system (see the "utf8test()" routine) and seems to work. It should work on 32-bit Windows as well, but keep in mind that wchar_t in MSVC is 2 bytes wide, so on Windows the routine won't handle Unicode characters in the range 0x10000..0x10ffff.
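
Where wchar_t is 16 bits wide, characters above 0xffff are normally stored as UTF-16 surrogate pairs, which wstrToUtf8 would encode as two separate 3-byte sequences. One way around this is sketched below, under assumptions: utf16ToUtf8 and appendUtf8 are hypothetical names that are not part of Str.cpp, the input is taken as std::u16string (C++11) so the sketch behaves the same on every platform, and unpaired surrogates are replaced with U+FFFD.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not in Str.cpp): append one code point as UTF-8.
static void appendUtf8(std::string& dest, char32_t cp){
 if (cp <= 0x7f)
  dest.push_back(static_cast<char>(cp));
 else if (cp <= 0x7ff){
  dest.push_back(static_cast<char>(0xc0 | (cp >> 6)));
  dest.push_back(static_cast<char>(0x80 | (cp & 0x3f)));
 }
 else if (cp <= 0xffff){
  dest.push_back(static_cast<char>(0xe0 | (cp >> 12)));
  dest.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3f)));
  dest.push_back(static_cast<char>(0x80 | (cp & 0x3f)));
 }
 else{
  dest.push_back(static_cast<char>(0xf0 | (cp >> 18)));
  dest.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3f)));
  dest.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3f)));
  dest.push_back(static_cast<char>(0x80 | (cp & 0x3f)));
 }
}

// Hypothetical sketch: convert UTF-16 code units to UTF-8,
// joining surrogate pairs into a single code point first.
std::string utf16ToUtf8(const std::u16string& src){
 std::string dest;
 for (std::size_t i = 0; i < src.size(); i++){
  char32_t cp = src[i];
  if (cp >= 0xd800 && cp <= 0xdbff && i + 1 < src.size()
      && src[i + 1] >= 0xdc00 && src[i + 1] <= 0xdfff){
   // high surrogate followed by low surrogate: rebuild the code point
   cp = 0x10000 + ((cp - 0xd800) << 10) + (src[i + 1] - 0xdc00);
   i++;
  }
  else if (cp >= 0xd800 && cp <= 0xdfff)
   cp = 0xfffd; // unpaired surrogate (assumed policy: replace it)
  appendUtf8(dest, cp);
 }
 return dest;
}
```

On a platform with 2-byte wchar_t the same pairing step could be applied to the std::wstring contents before encoding; that variant is left out here because it cannot be exercised on systems where wchar_t is 4 bytes.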

If you need a routine like this, feel free to use it; just don't claim you wrote it.

The code is available under a modified BSD license.

2 comments:

Stefano Sabatini, February 18, 2010 at 7:15 PM
Hi, in this portion of code:

void utf8toWStr(WStr& dest, const Str& src){
dest.clear();
wchar_t w = 0;
int bytes = 0;
wchar_t err = L'�';

What's the funny char '?' meant to represent?

Can you point to the specs you followed for implementing this conversion?

Thanks for sharing, regards.

SegFault, March 26, 2010 at 6:41 AM
>> What's the funny char '?' meant to represent?

The "funny char ?" is a Unicode symbol that is used in place of a character that could not be converted from UTF-8 to wchar_t because of some kind of error.

>> Can you point to the specs you followed for implementing this conversion?

The specs were taken from Wikipedia:
http://en.wikipedia.org/wiki/Utf8

2009/03/07
