Showing posts with label snippet. Show all posts

2009/03/25

How to use sprintf/wsprintf with std::string/std::wstring


Explanation


One of the few problems with std::string (and std::wstring) in C++ is the lack of a sprintf-like function that returns a std::string as its result. A sprintf-style constructor would also be nice to have. Of course, there are many alternatives to sprintf: the Boost Format library, std::stringstream, and the QString class in Qt 4. Still, plain old sprintf is the most compact and the easiest to use of them. std::stringstream needs several statements (lines) to build even a simple formatted string, so unless you really like it, it is overkill for small strings. Boost Format requires Boost, or at least a part of it. And QString requires an entire Qt installation and acceptance of one of the few available licenses, and it still won't be as compact as sprintf.

Solution



The first thing that comes to mind is to create a temporary buffer, sprintf into it, and then assign the buffer to a std::string. As your program grows, you'll eventually get sick of all the temporary buffers, and each use still takes a few lines of code. Here is a better way to do it:
Str.h:


#ifndef STR_H
#define STR_H
#include <string>
#include <cstdarg>

typedef std::string Str;
typedef std::wstring WStr;

WStr swprintf(const wchar_t* format, ...);
WStr vswprintf(const wchar_t* format, va_list args);
WStr swprintf(const WStr& format, ...);
WStr vswprintf(const WStr& format, va_list args);

Str sprintf(const char* format, ...);
Str vsprintf(const char* format, va_list args);
Str sprintf(const Str& format, ...);
Str vsprintf(const Str& format, va_list args);

#endif




Str.cpp:

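#include "Str.h"
#include <cstdio>
#include <cwchar>

// These functions overload the C library names (sprintf, vsprintf,
// swprintf, vswprintf); the different parameter lists keep them apart.

WStr swprintf(const wchar_t* format, ...){
    va_list args;
    va_start(args, format);
    WStr result = vswprintf(format, args);
    va_end(args);
    return result;
}

WStr vswprintf(const wchar_t* format, va_list args){
    const int bufferSize = 16384;
    wchar_t buffer[bufferSize];
    // the standard vswprintf returns a negative value if the output does not fit
    int written = vswprintf(buffer, bufferSize, format, args);
    if (written < 0)
        return WStr();
    return WStr(buffer, written);
}

// Caution: calling va_start on a reference parameter is undefined behavior
// according to the C++ standard, even though it worked with gcc here.
WStr swprintf(const WStr& format, ...){
    va_list args;
    va_start(args, format);
    WStr result = vswprintf(format, args);
    va_end(args);
    return result;
}

WStr vswprintf(const WStr& format, va_list args){
    return vswprintf(format.c_str(), args);
}

Str sprintf(const char* format, ...){
    va_list args;
    va_start(args, format);
    Str result = vsprintf(format, args);
    va_end(args);
    return result;
}

Str vsprintf(const char* format, va_list args){
    const int bufferSize = 16384;
    char buffer[bufferSize];
    // vsnprintf always terminates the buffer; longer output is silently truncated
    vsnprintf(buffer, bufferSize, format, args);
    return Str(buffer);
}

// Same caution as above: va_start on a reference parameter.
Str sprintf(const Str& format, ...){
    va_list args;
    va_start(args, format);
    Str result = vsprintf(format, args);
    va_end(args);
    return result;
}

Str vsprintf(const Str& format, va_list args){
    return vsprintf(format.c_str(), args);
}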





This makes it possible to create a formatted string quickly, in a single function call.

Problems:



This code works with gcc on a Linux system, but it might require some tweaking for other compilers. In general, wchar_t-based printf functions can cause problems when making a cross-platform application. For example, the default version of vswprintf in MinGW has a different number of arguments: it has no "buffer size" argument, so it is vulnerable to buffer overruns (the Linux version does not have this problem) and will need to be replaced by another function. Likewise, _snwprintf exists in MSVC but is missing from gcc.
Yet another problem is that the Linux and Windows versions of the wprintf-related functions may handle %s and %S differently. As far as I remember, with MinGW %S in swprintf does the same thing as %s in the Linux version of printf, and vice versa. These problems can be partially fixed with a few hacks, but people making portable applications with swprintf should be aware of them.
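
One way to sidestep the %s/%S difference for wide strings is the %ls conversion, which the C standard defines as taking a wchar_t* argument in both the narrow and the wide printf families. Here is a minimal sketch of that idea. The helper name bracketWide is made up for illustration, the code assumes the standard four-argument swprintf, and I have not checked it against every MinGW runtime.

```cpp
#include <cwchar>
#include <string>

// Hypothetical helper (not part of Str.h): wraps a wide string in brackets.
// %ls always means "wchar_t string", so its meaning does not depend on
// whether the runtime treats a plain %s as narrow or wide.
std::wstring bracketWide(const std::wstring& text){
    wchar_t buffer[256];
    int written = std::swprintf(buffer, 256, L"[%ls]", text.c_str());
    if (written < 0)              // output did not fit, or conversion failed
        return std::wstring();
    return std::wstring(buffer, written);
}
```

With this, bracketWide(L"abc") should produce L"[abc]" no matter which way the runtime interprets %s.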

Another problem is the hard-coded size limit for created strings. It can be changed, but it still doesn't look "nice" when used with C++ string classes. The limit can be bypassed for std::string by using vsnprintf, which returns the number of characters the complete output needs, so you can allocate a buffer of that size dynamically, format into it, and then assign it to the std::string. For std::wstring this is harder: vswnprintf is not available in gcc and doesn't appear to be standard, and the standard vswprintf only reports failure when the buffer is too small, without telling how much space is needed.
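
The dynamic-buffer idea for std::string could look roughly like the sketch below. It assumes a C99-conforming vsnprintf (one that returns the required length when given a too-small buffer) and a C++11 compiler for va_copy. The name formatString is hypothetical and is not part of the Str.h shown above.

```cpp
#include <cstdarg>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: "measure first, then format" with vsnprintf.
std::string formatString(const char* format, ...){
    va_list args;
    va_start(args, format);

    // vsnprintf consumes the argument list, so measure with a copy
    va_list argsCopy;
    va_copy(argsCopy, args);
    int needed = std::vsnprintf(nullptr, 0, format, argsCopy);
    va_end(argsCopy);

    std::string result;
    if (needed > 0){
        std::vector<char> buffer(needed + 1);   // +1 for the terminating zero
        std::vsnprintf(&buffer[0], buffer.size(), format, args);
        result.assign(&buffer[0], needed);
    }
    va_end(args);
    return result;
}
```

The price is that every string gets formatted twice (once to measure, once for real), which is usually acceptable for small strings.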

Other solutions


In my opinion, the best (though not the fastest) way to make a custom sprintf for the std::string classes is probably to write it from scratch (in C++ that might be easier) or to derive it from an existing C implementation of sprintf (for example, you can take one from the FreeBSD source repository). The reason is that printf implementations differ from each other. The catch is that it won't be easy: you will probably need some time to write a standard-compliant (or "format-compliant") version of swprintf/sprintf that operates on std::string/std::wstring. Of course, you can also implement a limited version of those functions.

Another way is to write your own formatting routines or to use the ones already mentioned: the Boost Format library, QString or std::stringstream. This way you won't get a sprintf function, but a few formatting routines to use instead.
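
For comparison, this is roughly what the std::stringstream route looks like. It is only a small sketch, and the function name and message layout are invented for the example. It takes three statements where a sprintf-style call would take one, which is the verbosity complained about earlier.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: builds a short description with a string stream.
std::string describeItem(const std::string& name, int count, double price){
    std::ostringstream out;                         // 1. create the stream
    out << name << ": " << count << " pcs at " << price;  // 2. format into it
    return out.str();                               // 3. extract the string
}
```

On the plus side, the stream version is type-safe and has no fixed buffer size.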

2009/03/07

How to find identical files on Linux using a Python script


When you have two directories (A and B) with a lot of files, and some files in B are identical to files in A, you can use this Python script to find the identical files.

find_identical.py:
#!/usr/bin/env python
# Written for Python 2 (uses print statements and the "commands" module).
import os
import sys
import commands

if len(sys.argv) != 3:
    print """wrong number of arguments!
usage: find_identical.py source destination"""
    sys.exit(1)

src = sys.argv[1]
dest = sys.argv[2]

#print "src: %s, dest: %s" % (src, dest)
print "#!/bin/sh"

# md5sum prints one "<checksum>  <file name>" line per file
srcText = commands.getoutput("md5sum %s/*" % src)
destText = commands.getoutput("md5sum %s/*" % dest)

def dictFromString(val):
    # maps checksum -> file name
    result = {}
    for line in val.split("\n"):
        line = line.split(None, 1)  # split once, so names with spaces survive
        if len(line) == 2:
            result[line[0]] = line[1]
    return result

def listFromString(val):
    # list of [checksum, file name] pairs; blank lines are skipped
    result = []
    for line in val.split("\n"):
        line = line.split(None, 1)
        if len(line) == 2:
            result.append(line)
    return result

srcDict = dictFromString(srcText)
destList = listFromString(destText)
filesFound = False
for file in destList:
    if srcDict.has_key(file[0]):
        print "rm \"%s\" #identical to %s" % (file[1], srcDict[file[0]])
        filesFound = True

if not filesFound:
    print "#no identical files found"
    sys.exit(1)




The script needs two arguments: the names of the first and second directories. By default it prints (to standard output) a shell script that removes the files from directory B that are also present in directory A.
So, to remove the files from B that are already present in A, run:
find_identical.py A B >applychanges.sh
bash applychanges.sh


It is done this way so that you can examine the list of files to be removed before removing them. This behavior can easily be changed by modifying the line:
 print "rm \"%s\" #identical to %s" % (file[1], srcDict[file[0]])

to whatever you need.

The script needs md5sum to work.
Note that Linux software that does the same thing (finds identical files) already exists. So this script is mostly useful for learning Python, or as a base for another script.
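
Since most snippets on this blog are C++, here is a rough sketch of just the matching step in C++. It is not a port of the whole script. It assumes the md5sum output for both directories has already been captured as text, and the names parseSums and findDuplicates are made up for the example.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, std::string> SumAndName;

// Parses md5sum-style text: checksum, a space, a mode character
// (space or '*'), then the file name; one file per line.
std::vector<SumAndName> parseSums(const std::string& text){
    std::vector<SumAndName> result;
    std::istringstream lines(text);
    std::string line;
    while (std::getline(lines, line)){
        std::string::size_type space = line.find(' ');
        if (space == std::string::npos || space + 2 >= line.size())
            continue;                       // skip blank or malformed lines
        result.push_back(SumAndName(line.substr(0, space), line.substr(space + 2)));
    }
    return result;
}

// Returns the names from destText whose checksum also occurs in srcText.
std::vector<std::string> findDuplicates(const std::string& srcText,
                                        const std::string& destText){
    std::map<std::string, std::string> srcBySum;    // checksum -> file name
    std::vector<SumAndName> src = parseSums(srcText);
    for (std::size_t i = 0; i < src.size(); i++)
        srcBySum[src[i].first] = src[i].second;

    std::vector<std::string> duplicates;
    std::vector<SumAndName> dest = parseSums(destText);
    for (std::size_t i = 0; i < dest.size(); i++)
        if (srcBySum.count(dest[i].first))
            duplicates.push_back(dest[i].second);
    return duplicates;
}
```

The idea is the same as in the Python version: build a checksum-to-name map for directory A, then look up every checksum from directory B in it.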

C++ utf8 to wstring conversion routine


Some time ago I was looking for a C++ routine that could convert a UTF-8 string (stored in a std::string) into a std::wstring and vice versa. It is certainly possible to do that with setlocale and some C functions, but I wanted something done in "pure C++". I did some research, didn't find anything, and in the end wrote the conversion routines myself (using information from Wikipedia).

Here is the code:



Str.h:
#ifndef STR_H
#define STR_H
#include <string>
#include <iostream>

typedef std::string Str;
typedef std::wstring WStr;

std::ostream& operator<<(std::ostream& f, const WStr& s);
std::istream& operator>>(std::istream& f, WStr& s);
void utf8toWStr(WStr& dest, const Str& src);
void wstrToUtf8(Str& dest, const WStr& src);
WStr utf8toWStr(const Str& str);
Str wstrToUtf8(const WStr& str);

#endif




Str.cpp:
/*
Copyright (c) 2009 SegFault aka "ErV" (altctrlbackspace.blogspot.com)

Redistribution and use of this source code, with or without modification, is
permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.

THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR IMPLIED
WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO
EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#include "Str.h"
#ifdef UTF8TEST
#include <cstdio>
#endif

void utf8toWStr(WStr& dest, const Str& src){
    dest.clear();
    wchar_t w = 0;
    int bytes = 0;          // continuation bytes still expected
    wchar_t err = 0xfffd;   // U+FFFD REPLACEMENT CHARACTER
    for (size_t i = 0; i < src.size(); i++){
        unsigned char c = (unsigned char)src[i];
        if (c <= 0x7f){//single-byte (ASCII) character
            if (bytes){
                dest.push_back(err);
                bytes = 0;
            }
            dest.push_back((wchar_t)c);
        }
        else if (c <= 0xbf){//second/third/etc byte
            if (bytes){
                w = ((w << 6)|(c & 0x3f));
                bytes--;
                if (bytes == 0)
                    dest.push_back(w);
            }
            else
                dest.push_back(err);
        }
        else if (c <= 0xdf){//2byte sequence start
            if (bytes)
                dest.push_back(err);//previous sequence was cut short
            bytes = 1;
            w = c & 0x1f;
        }
        else if (c <= 0xef){//3byte sequence start
            if (bytes)
                dest.push_back(err);
            bytes = 2;
            w = c & 0x0f;
        }
        else if (c <= 0xf7){//4byte sequence start
            if (bytes)
                dest.push_back(err);
            bytes = 3;
            w = c & 0x07;
        }
        else{
            dest.push_back(err);
            bytes = 0;
        }
    }
    if (bytes)
        dest.push_back(err);
}

void wstrToUtf8(Str& dest, const WStr& src){
    dest.clear();
    for (size_t i = 0; i < src.size(); i++){
        wchar_t w = src[i];
        if (w <= 0x7f)
            dest.push_back((char)w);
        else if (w <= 0x7ff){
            dest.push_back((char)(0xc0 | ((w >> 6) & 0x1f)));
            dest.push_back((char)(0x80 | (w & 0x3f)));
        }
        else if (w <= 0xffff){
            dest.push_back((char)(0xe0 | ((w >> 12) & 0x0f)));
            dest.push_back((char)(0x80 | ((w >> 6) & 0x3f)));
            dest.push_back((char)(0x80 | (w & 0x3f)));
        }
        else if (w <= 0x10ffff){
            dest.push_back((char)(0xf0 | ((w >> 18) & 0x07)));
            dest.push_back((char)(0x80 | ((w >> 12) & 0x3f)));
            dest.push_back((char)(0x80 | ((w >> 6) & 0x3f)));
            dest.push_back((char)(0x80 | (w & 0x3f)));
        }
        else
            dest.push_back('?');
    }
}

Str wstrToUtf8(const WStr& str){
    Str result;
    wstrToUtf8(result, str);
    return result;
}

WStr utf8toWStr(const Str& str){
    WStr result;
    utf8toWStr(result, str);
    return result;
}

std::ostream& operator<<(std::ostream& f, const WStr& s){
    Str s1;
    wstrToUtf8(s1, s);
    f << s1;
    return f;
}

std::istream& operator>>(std::istream& f, WStr& s){
    Str s1;
    f >> s1;
    utf8toWStr(s, s1);
    return f;
}

#ifdef UTF8TEST
bool utf8test(){
    WStr w1;
    //for (wchar_t c = 1; c <= 0x10ffff; c++){
    for (wchar_t c = 0x100000; c <= 0x100002; c++){
        w1 += c;
    }
    Str s = wstrToUtf8(w1);
    WStr w2 = utf8toWStr(s);
    bool result = true;
    if (w1.length() != w2.length()){
        printf("length differs\n");
        //std::cout << "length differs" << std::endl;
        result = false;
    }

    // %ls is the standard spelling for a wchar_t string argument
    printf("w1: %ls\ns: %s\nw2: %ls\n", w1.c_str(), s.c_str(), w2.c_str());

    // stop at the shorter string so a length mismatch cannot read out of range
    for (size_t i = 0; i < w1.size() && i < w2.size(); i++)
        if (w1[i] != w2[i]){
            result = false;
            printf("character at pos %x differs (expected %.8x got %.8x)\n",
                (unsigned)i, (unsigned)w1[i], (unsigned)w2[i]);
            //std::cout << "character at pos " << i << " differs" << std::endl;
            break;
        }

    if (!result){
        printf("utf8 dump: \n");
        for (size_t i = 0; i < s.size(); i++)
            printf("%2x ", (unsigned char)s[i]);
    }

    return result;
}

int main(int argc, char** argv){
    std::wstring ws(L"фыва");
    std::string s("фыва");
    std::cout << ws << s << std::endl;
    std::cout << wstrToUtf8(utf8toWStr("фыва")) << std::endl;
    if (utf8test())
        std::cout << "utf8Test successful" << std::endl;
    else
        std::cout << "utf8Test failed" << std::endl;
    return 0;
}
#endif



The code was successfully tested on a 32-bit Linux system (see the "utf8test()" routine) and seems to work. It should work on 32-bit Windows as well, but keep in mind that wchar_t in MSVC is 2 bytes wide, so on Windows the routines won't handle Unicode characters above 0xffff (the range 0x10000..0x10ffff).
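
To handle that range with a 2-byte wchar_t, each character above 0xffff would have to be stored as a UTF-16 surrogate pair. The sketch below shows only the splitting step. The name appendUtf16 is hypothetical, the code is not wired into the routines above, and it does not validate its input (values above 0x10ffff or lone surrogates are not rejected).

```cpp
#include <vector>

// Hypothetical helper: appends one Unicode code point to a UTF-16 sequence.
// Code points above 0xffff are split into a surrogate pair.
void appendUtf16(std::vector<unsigned short>& dest, unsigned long codePoint){
    if (codePoint <= 0xffff){
        dest.push_back((unsigned short)codePoint);
        return;
    }
    unsigned long v = codePoint - 0x10000;                  // 20 significant bits remain
    dest.push_back((unsigned short)(0xd800 | (v >> 10)));   // high surrogate: top 10 bits
    dest.push_back((unsigned short)(0xdc00 | (v & 0x3ff))); // low surrogate: bottom 10 bits
}
```

For example, U+1F600 becomes the two units 0xd83d and 0xde00, while U+0444 stays a single unit.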

If you need a routine like this, feel free to use it, just don't claim you wrote it.
The code is available under a modified BSD license.