c++ - have a program get the absolute path of its directory

Thanks.

Yeh, basically, since wchar_t is a 2-byte type and some characters might need 3 or 4 bytes, what I really need to do is just use a fixed 4-byte type, like:

Code:
#include <iostream>
#include <string>
#include <fstream>

using namespace std;

// Pseudocode: ustring32, uchar32, uifstream32, umain32, and the U32()
// literal macro are imagined fixed-width 32-bit counterparts of
// wstring, wchar_t, wifstream, wmain and L"".
inline ustring32 generatePath( const uchar32* p, const ustring32& file ) {
    const ustring32 s( p );
    if ( s.find( U32("/") ) != ustring32::npos ) {
        return s.substr( 0, s.rfind( U32("/") ) ) + U32("/") + file;
    } else if ( s.find( U32("\\") ) != ustring32::npos ) {
        return s.substr( 0, s.rfind( U32("\\") ) ) + U32("\\") + file;
    } else {
        return file;
    }
}

int umain32( int argc, uchar32** argv ) {
    if ( argc < 1 ) {
        return 1;
    }
    const ustring32 path( generatePath( argv[0], U32("file.txt") ) );
    uifstream32 in( path.c_str() );
}

For something like this, I wouldn't mind using 4-byte chars.

(This isn't super important or anything. I'm just exploring the possibility of supporting unicode paths.)
 
Wow! What is it that you're writing that you'll be selling in rural China?
 
I've come up with this so far:

Code:
#define _UNICODE
#include <cstdio>
#include <cwchar>
#include <cstdlib>
#include <tchar.h>

using namespace std;

int _tmain(int argc, _TCHAR* argv[] ) {
    if (argc != 3) {
        _tprintf( _T("Usage: this_program outputfile text\n") );
        return 1;
    }
    FILE* out = _tfopen( argv[1], _T("w") );
    if (!out) {
        return 1;
    }
    // UTF-16LE BOM, written a byte at a time
    _ftprintf(out, _T("%c"), 0xFF );
    _ftprintf(out, _T("%c"), 0xFE );
    for (size_t i = 0; i < wcslen( argv[2] ); ++i ) {
        // split each wchar_t into its low and high bytes (little-endian order)
        _ftprintf(out, _T("%c"), ( argv[2][i] & 0xff ) );
        _ftprintf(out, _T("%c"), ( argv[2][i] >> 8 ) );
    }
    fclose( out );
}

Only compiles under vc++. (mingw has a fit about an undefined _tmain and an undefined reference to WinMain@16.)

Anyway, I realized that the radic symbol (√, U+221A), for example, can indeed be represented with 2 bytes.

In the code above, if I print the whole string (or each character) straight to the file, it comes out as gibberish. But if I split each character into 2 bytes and then write each byte to the file, I'm able to get some things working. (I had help with the bitwise operators for the first byte.)

Basically, I can now do:

Code:
program.exe <unicode path to a file with characters that can't be represented by just one byte> <unicode text with characters that can't be represented by just one byte>

meaning, with that program, I can do:

program "c:\documents and settings\user\desktop\&#8730;\file.txt" "abcdef&#8730;&#8730;¶¶1234"

and it will actually work. (in my limited testing)
 
I think I found out why I had to split up the bytes.

1.) On x86, wide characters are stored Little Endian, so the single wchar_t 0xFEFF sits in memory (and lands on disk) as the bytes FF FE.

2.) FF FE is exactly the UTF-16 Little Endian BOM, which tells whatever reads the file which byte order to expect.

3.) In text mode, the runtime converts wide characters to the ANSI code page on the way out, which mangles anything it can't represent; writing the bytes myself was just working around that.

So the Little Endian BOM is the byte sequence FF FE, but since the wchar_t itself is stored Little Endian, I have to write it as the value 0xFEFF.


Code:
#define _UNICODE
#include <cstdio>
#include <cwchar>
#include <cstdlib>
#include <tchar.h>

using namespace std;

int _tmain(int argc, _TCHAR* argv[] ) {
    if (argc != 3) {
        _tprintf( _T("Usage: this_program outputfile text\n") );
        return 1;
    }
    FILE* out = _tfopen( argv[1], _T("wb") );  // binary mode: no text-mode conversion
    if (!out) {
        return 1;
    }
    _ftprintf(out, _T("%c"), 0xFEFF );  // one wchar_t lands on disk as the bytes FF FE
    _ftprintf(out, _T("%s"), argv[2] );
    fclose( out );
}

I also switched to writing in binary.

Here's the same thing, but cleaner.

Code:
#define _UNICODE
#include <cstdio>
#include <cwchar>

using namespace std;

int wmain(int argc, wchar_t* argv[] ) {
    if (argc != 3) {
        wprintf(L"Usage: this_program outputfile text\n"); 
        return 1;
    }
    FILE* out = _wfopen( argv[1], L"wb" );
    if (!out) {
        return 1;
    }
    fwprintf(out, L"%c", 0xFEFF );  // UTF-16LE BOM
    fwprintf(out, L"%s", argv[2] );
    fclose( out );
}
 
Here we go:

Code:
#define _UNICODE
#include <cstdio>
#include <string>
#include <cwchar>

using namespace std;

// swap the executable name at the end of argv[0] for the given file name
inline wstring generatePath( const wchar_t* p, const wstring& file) {
    const wstring s( p );
    if ( s.find(L"/") != wstring::npos) {
        return s.substr( 0, s.rfind( L"/" ) ) + L"/" + file;
    } else if ( s.find(L"\\") != wstring::npos ) {
        return s.substr( 0, s.rfind( L"\\" ) ) + L"\\" + file;
    } else {
        return file;
    }
}

int wmain(int argc, wchar_t* argv[] ) {
    if (argc < 1) {
        return 1;
    }
    const wstring file( generatePath( argv[0], L"file.txt") );
    FILE* out = _wfopen( file.c_str(), L"wb" );
    if (!out) {
        return 1;
    }
    fwprintf(out, L"%c", 0xFEFF );  // UTF-16LE BOM
    fwprintf(out, L"%s", L"testing√ 1, 2, 3" );
    fclose( out );
}

There's a wifstream in vc++, but its open() only takes a narrow char* path, so it doesn't support unicode paths, which is why I stuck with _wfopen().
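
For reading the file back, here's a minimal sketch along the same lines (assuming a UTF-16LE file with a BOM, like the ones written above; vc++ only):

Code:
#include <cstdio>
#include <cwchar>

using namespace std;

int wmain(int argc, wchar_t* argv[] ) {
    if (argc != 2) {
        wprintf(L"Usage: this_program inputfile\n");
        return 1;
    }
    FILE* in = _wfopen( argv[1], L"rb" );
    if (!in) {
        return 1;
    }
    wchar_t c;
    // skip the BOM if there is one, otherwise start over at byte 0
    if ( fread( &c, sizeof c, 1, in ) == 1 && c != 0xFEFF ) {
        rewind( in );
    }
    while ( fread( &c, sizeof c, 1, in ) == 1 ) {
        wprintf( L"%c", c );
    }
    fclose( in );
}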

Since mingw is such a pain with this, I'm investigating STLport from www.stlport.org.

However, that's a pain in itself to even build with mingw.
 
I finally got the stlport 5.0 library to build.

Here's where you get it:
http://sourceforge.net/projects/stlport

Here's how you build it for mingw (you need to use msys to build it):

cd stlport/build/lib
make -f gcc.mak depend
make -f gcc.mak all
make -f gcc.mak install-release-shared
make -f gcc.mak install-release-static
make -f gcc.mak install-dbg-shared
make -f gcc.mak install-dbg-static
make -f gcc.mak install-stldbg-static
make -f gcc.mak install-stldbg-shared

Then the dlls will be in the stlport/bin directory and the libs will be in the stlport/lib directory.

Here's how you compile with it using mingw:

Code:
g++ -s -Wall -Wextra -mthreads file.cpp -o file "-Lpath_to_stlport_libs" "-Ipath_to_stlport_includes" -lstlport.5.0

Then, the program depends on libstlport.5.0.dll and mingwm10.dll to run. (mingwm10.dll should be in the mingw bin directory.)

Anyway, once it's all set up, I can do this:

Code:
#define _UNICODE
#include <iostream>
#include <fstream>

using namespace std;

int main(int argc, wchar_t* argv[]) {  // nonstandard signature - see the edit below
    if ( argc != 3) {
        wcout << L"usage: this [outfile] [text]" << endl;
        return 1;
    }
    wofstream out( argv[1], ios_base::binary );
    if ( !out ) {
        wcout << L"Error writing to file: " << argv[1];
        return 1;
    }
    out << 0xFEFF << argv[2];
}

There, the stream constructor can accept a wchar_t* as an arg. Also, wcout works, etc.

That compiles, but doesn't actually open the stream correctly for some reason. I'm asking on the mingw list for help figuring out why, but I'm getting a little farther.

Edit:

Actually, since I'm not using wmain(), argv elements are just char*, which means that the wofstream in stlport probably doesn't support opening a wchar_t* path either. So, I'm stuck with _wfopen() or using the windows api with CreateFileW().
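
For example, here's a minimal sketch of the CreateFileW() route (the file name is hard-coded just for illustration):

Code:
#include <windows.h>

int main() {
    HANDLE h = CreateFileW( L"file.txt", GENERIC_WRITE, 0, NULL,
                            CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL );
    if ( h == INVALID_HANDLE_VALUE ) {
        return 1;
    }
    const wchar_t bom = 0xFEFF;                 // UTF-16LE BOM
    const wchar_t text[] = L"testing 1, 2, 3";
    DWORD written;
    WriteFile( h, &bom, sizeof bom, &written, NULL );
    WriteFile( h, text, sizeof text - sizeof(wchar_t), &written, NULL ); // drop the terminating null
    CloseHandle( h );
}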

Also, I found more on why wmain() doesn't work with Mingw.
http://sourceforge.net/mailarchive/message.php?msg_id=5078083

Looks like I'd have to hack the mingw src, build my own distro and then use stlport along with it to do what I want. Or, I could just use vc++ like I did in the post above.
 
Ah, I figured out how to get unicode argv support in mingw without wmain (since mingw doesn't have wmain).

Code:
#define _UNICODE
#include <windows.h>
#include <cwchar>
#include <cstdio>

using namespace std;

int main() {
    int argc;
    // CommandLineToArgvW hands back a wide argv even though main() is narrow
    wchar_t** argv = CommandLineToArgvW( GetCommandLineW(), &argc );
    if (!argv) {
        return 1;
    }
    if (argc != 3) {
        wprintf(L"usage: this [file] [text]\n");
        LocalFree(argv);
        return 1;
    }
    FILE* out = _wfopen( argv[1], L"wb");
    if (!out) {
        LocalFree(argv);
        return 1;
    }
    fwprintf(out, L"%c", 0xFEFF);  // UTF-16LE BOM
    fwprintf(out, L"%s", argv[2]);
    fclose(out);
    LocalFree(argv);  // CommandLineToArgvW allocates with LocalAlloc
}

@mikeblas

I've also been experimenting with having a 32bit char and 32bit string. I'm not too happy with a 16bit char. Storing in a 32bit char and being able to convert back and forth between encodings would be kind of neat. I don't have things sorted out for converting down to utf-8 yet, but here's an example of what I have so far. I'm pretty sure I'll have to provide different conversion methods depending on the range, though (see the sketch after the code).

Code:
#include <fstream>
#include <stdint.h>

using namespace std;

int main() {
    ofstream out("utf8test.txt");
    uint16_t c = 0x221A; // code point for the radic symbol
    out << (char)0xEF << (char)0xBB << (char)0xBF; // utf-8 signature (BOM)
    // 3-byte utf-8 sequence for code points in the 0x0800 - 0xFFFF range;
    // for 0x221A this should come out as 0xE2 0x88 0x9A
    out << (char)(0xE0 | c >> 12);
    out << (char)(0x80 | c >> 6 & 0x3F);
    out << (char)(0x80 | c & 0x3F);
}
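
Since the number of utf-8 bytes depends on the range the code point falls in, here's a sketch of what that range-dependent conversion might look like (the helper name is just made up):

Code:
#include <string>
#include <stdint.h>

using namespace std;

// encode one 32-bit code point as utf-8 (1 to 4 bytes)
string toUTF8( uint32_t c ) {
    string s;
    if ( c < 0x80 ) {                       // 1 byte: plain ascii
        s += (char)c;
    } else if ( c < 0x800 ) {               // 2 bytes
        s += (char)(0xC0 | c >> 6);
        s += (char)(0x80 | c & 0x3F);
    } else if ( c < 0x10000 ) {             // 3 bytes
        s += (char)(0xE0 | c >> 12);
        s += (char)(0x80 | c >> 6 & 0x3F);
        s += (char)(0x80 | c & 0x3F);
    } else {                                // 4 bytes, up to 0x10FFFF
        s += (char)(0xF0 | c >> 18);
        s += (char)(0x80 | c >> 12 & 0x3F);
        s += (char)(0x80 | c >> 6 & 0x3F);
        s += (char)(0x80 | c & 0x3F);
    }
    return s;
}

toUTF8( 0x221A ) should give the same 0xE2 0x88 0x9A bytes as the hand-rolled version above.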
 
Shadow2531 said:
I'm not too happy with a 16bit char.
Why not? Every application I've shipped for international markets has been fine with UTF-16; it even supports surrogates for that standard required by the Chinese government ... what was it? (LATER: I was thinking of GB18030.)

With 32-bit characters, you're buying a lot of problems (storage size and performance, most notably). I wonder what benefit you're realizing from it that makes you think it is such a strong requirement.
 
^^ Oops. I didn't word that correctly.

I'm not too happy with the wide character support in mingw and vc++ with regard to STL extensions like wifstream, wofstream, wmain, etc. (I don't think they're really defined as part of the STL, which is why I said 'extensions'.)

I think things should be more unicode aware. That's why I'm experimenting with my own methods.

So far, I'm just using 16bit chars + padding and really don't make use of the other 2 bytes. Also, I think the most I could ever use at this time is 24bits anyway. You might say, "What a waist!".

One bonus of 32bit (if I do it correctly) is that it's a fixed number of bytes to work with. With utf-16, a character can take more than 2 bytes, which makes utf-16 variable-width like utf-8 and can possibly cause some problems with overloading wchar_t (I assume).

Since utf-8 and utf-16 are variable width, it seems like a regular old char array would be in order for storing them and you would rely on methods to get what you want out of the data. With utf-32, you've got 4 *expected* bytes for each char, which to me seems simpler.

Of course, I can always make a u16char struct/class, which either replaces wchar_t or improves on it without the wastefulness and performance hit of a 32bit char and then work around any limitations. (I'm also using all of this to practice structs/classes etc.)

As an example: say you had to read utf-32 files. Would you just omit the last 2 bytes and store the first 2 as utf-16 in your program, hoping that no character needed more than 2 bytes?

I'm thinking too far ahead here though and I haven't encountered any of the known problems with utf-8. I'm only experimenting. The idea of handling everything as 32bit internally sounds great to me if I ignore the performance/resource issues.
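
Edit: thinking about my own question some more, I guess the right answer is "no": a code point above 0xFFFF gets split into a utf-16 surrogate pair rather than truncated. A rough sketch (the helper name is made up):

Code:
#include <string>
#include <stdint.h>

using namespace std;

// convert one 32-bit code point to utf-16:
// one unit, or a surrogate pair for code points above 0xFFFF
wstring toUTF16( uint32_t c ) {
    wstring s;
    if ( c < 0x10000 ) {
        s += (wchar_t)c;
    } else {
        c -= 0x10000;                          // 20 bits remain
        s += (wchar_t)(0xD800 | c >> 10);      // high surrogate
        s += (wchar_t)(0xDC00 | (c & 0x3FF));  // low surrogate
    }
    return s;
}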
 
Shadow2531 said:
You might say, "What a waist!".
Nah. I'd say "what a waste!" But I guess it would sound the same.

Shadow2531 said:
One bonus of 32bit (if I do it correctly) is that it's a fixed number of bytes to work with.
Sure. But that means you're wasting half your memory bandwidth and therefore half your cache by always fetching twice as much data as necessary.

32-bit characters are very, very rare. Unless you're processing archaic Chinese, screwing around with that GB18030 standard, or getting into some real esoteric language research or genealogy, you're not going to come across these characters.

At Microsoft, we worry about GB18030 because the Chinese government mandated that software sold in the country must support GB18030. The reason is that they are worried about data processing equipment forcing the obsolescence of their language and culture because UTF-16 (as it originally stood) couldn't represent antiquated characters.

UTF-16 efficiently stores the characters that Chinese and other complex languages use every day, no problem. So why not optimize your code for the most common case?

Sure, you have to deal with the possibility that one character is wider than the next, but there are functions in the NLS API that help you with this, and once you get used to the rules, it's not that hard to deal with.
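
The rules are mostly just surrogate arithmetic. Here's a rough sketch using plain bit tests rather than any particular NLS helper:

Code:
#include <cstddef>
#include <stdint.h>

// decode one code point from a utf-16 buffer, advancing i by
// 1 or 2 units; assumes well-formed input
uint32_t nextCodePoint( const wchar_t* s, size_t& i ) {
    uint32_t hi = s[i++];
    if ( hi >= 0xD800 && hi <= 0xDBFF ) {  // high surrogate
        uint32_t lo = s[i++];              // low surrogate (0xDC00 - 0xDFFF)
        return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
    }
    return hi;                             // ordinary 16-bit character
}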
 
mikeblas said:
Nah. I'd say "what a waste!" But I guess it would sound the same.

Oops! Yeh, I meant waste. :eek:

mikeblas said:
UTF-16 efficiently stores the characters that Chinese and other complex languages use every day, no problem. So why not optimize your code for the most common case?

I see your point.

Thanks
 