2009/03/07

C++ utf8 to wstring conversion routine


Some time ago I was looking for C++ conversion routine that could convert utf8 string(stored in std::string) into std::wstring and vice versa. It is certainly possible to do that using setlocale and some C functions, but I wanted something done in "pure C++". I did some research, didn't find anything, and after all wrote conversion routine myself (using information on wikipedia).

Here is the code:



Str.h:
#ifndef STR_H
#define STR_H
#include
#include

typedef std::string Str;
typedef std::wstring WStr;

std::ostream& operator<<(std::ostream& f, const WStr& s); std::istream& operator>>(std::istream& f, WStr& s);
void utf8toWStr(WStr& dest, const Str& src);
void wstrToUtf8(Str& dest, const WStr& src);

#endif




Str.cpp:
/*
Copyright (c) 2009 SegFault aka "ErV" (altctrlbackspace.blogspot.com)

Redistribution and use of this source code, with or without modification, is
permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.

THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR IMPLIED
WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO
EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#include "Str.h"
#ifdef UTF8TEST
#include
#endif

void utf8toWStr(WStr& dest, const Str& src){
dest.clear();
wchar_t w = 0;
int bytes = 0;
wchar_t err = L'�';
for (size_t i = 0; i < src.size(); i++){
unsigned char c = (unsigned char)src[i];
if (c <= 0x7f){//first byte
if (bytes){
dest.push_back(err);
bytes = 0;
}
dest.push_back((wchar_t)c);
}
else if (c <= 0xbf){//second/third/etc byte
if (bytes){
w = ((w << 6)|(c & 0x3f));
bytes--;
if (bytes == 0)
dest.push_back(w);
}
else
dest.push_back(err);
}
else if (c <= 0xdf){//2byte sequence start
bytes = 1;
w = c & 0x1f;
}
else if (c <= 0xef){//3byte sequence start
bytes = 2;
w = c & 0x0f;
}
else if (c <= 0xf7){//3byte sequence start
bytes = 3;
w = c & 0x07;
}
else{
dest.push_back(err);
bytes = 0;
}
}
if (bytes)
dest.push_back(err);
}

void wstrToUtf8(Str& dest, const WStr& src){
dest.clear();
for (size_t i = 0; i < src.size(); i++){
wchar_t w = src[i];
if (w <= 0x7f)
dest.push_back((char)w);
else if (w <= 0x7ff){
dest.push_back(0xc0 | ((w >> 6)& 0x1f));
dest.push_back(0x80| (w & 0x3f));
}
else if (w <= 0xffff){
dest.push_back(0xe0 | ((w >> 12)& 0x0f));
dest.push_back(0x80| ((w >> 6) & 0x3f));
dest.push_back(0x80| (w & 0x3f));
}
else if (w <= 0x10ffff){
dest.push_back(0xf0 | ((w >> 18)& 0x07));
dest.push_back(0x80| ((w >> 12) & 0x3f));
dest.push_back(0x80| ((w >> 6) & 0x3f));
dest.push_back(0x80| (w & 0x3f));
}
else
dest.push_back('?');
}
}

Str wstrToUtf8(const WStr& str){
Str result;
wstrToUtf8(result, str);
return result;
}

WStr utf8toWStr(const Str& str){
WStr result;
utf8toWStr(result, str);
return result;
}

std::ostream& operator<<(std::ostream& f, const WStr& s){
Str s1;
wstrToUtf8(s1, s);
f << s1;
return f;
}

std::istream& operator>>(std::istream& f, WStr& s){
Str s1;
f >> s1;
utf8toWStr(s, s1);
return f;
}

#ifdef UTF8TEST
bool utf8test(){
WStr w1;
//for (wchar_t c = 1; c <= 0x10ffff; c++){
for (wchar_t c = 0x100000; c <= 0x100002; c++){
w1 += c;
}
Str s = wstrToUtf8(w1);
WStr w2 = utf8toWStr(s);
bool result = true;
if (w1.length() != w2.length()){
printf("length differs\n");
//std::cout << "length differs" << std::endl;
result = false;
}

printf("w1: %S\ns: %s\nw2: %S\n", w1.c_str(), s.c_str(), w2.c_str());

for (size_t i = 0; i < w1.size(); i++)
if (w1[i] != w2[i]){
result = false;
printf("character at pos %x differs (expected %.8x got %.8x)\n", i, w1[i], w2[i]);
//std::cout << "character at pos " << i << " differs" << std::endl;
break;
}

if (!result){
printf("utf8 dump: \n");
for (size_t i = 0; i < s.size(); i++)
printf("%2x ", (unsigned char)s[i]);
}

return result;
}

int main(int argc, char** argv){
std::wstring ws(L"фыва");
std::string s("фыва");
std::cout << ws << s << std::endl;
std::cout << wstrToUtf8(utf8toWStr("фыва")) << std::endl;
if (utf8test())
std::cout << "utf8Test succesful" << std::endl;
else
std::cout << "utf8Test failed" << std::endl;
return 0;
}
#endif



Code was successfully tested on 32bit linux system (see "utf8test()" routine) and seems to work. Should work on 32bit windows platform as well, but keep in mind that wchar_t on msvc has size of 2 bytes, so on windows platform routine won't handle unicode characters in range 0xffff..0x10ffff).

If you need routine like this, feel free to use it, just don't claim you wrote it.
Code is available under modified BSD license.

2 comments:

  1. Hi, in this portion of code:

    void utf8toWStr(WStr& dest, const Str& src){
    dest.clear();
    wchar_t w = 0;
    int bytes = 0;
    wchar_t err = L'�';

    What's the funny char '?' meant to represent?

    Can you point to the specs tou followed for implementing this convertion?

    Thanks for sharing, regards.

    ReplyDelete
  2. >> What's the funny char '?' meant to represent?

    "Funny char ?" is a unicode symbol which will be used instead of character that could not be converted from utf8 to wchars due to some kind of error.

    >> Can you point to the specs tou followed implementing this convertion?
    Specs were taken from wikipedia.
    http://en.wikipedia.org/wiki/Utf8

    ReplyDelete