Applescript and UTF-8 arguments

The following tip is based on a hint by mzs found on MacOSXHints.com. And note: this article relates only to Tiger. This issue has been resolved in OS X Leopard and Applescript 2.0.

Although the Mac has been a great environment for working with UTF-8 text (8-bit Unicode), I’ve found a few corners where it’s rather difficult to preserve the encoding of my text. One of these is passing UTF-8 arguments to Applescripts on the command-line, using the osascript utility.

To step back for a second: the reason I need UTF-8 support everywhere is that I sometimes work with Persian texts, which use an Arabic alphabet. In general, most Cocoa application display Arabic text fairly well (though a large number of them have no clue when it comes to properly formatting right-to-left text; this means that when I type an exclamation mark, it often appears to the right of my entered text, rather than to the left). But in the non-Cocoa world, which includes Carbon apps and the command-line, UTF-8 is either non-existent or very poor.

For example, as a result of my work in Persian, I have files that both contain Persian text and have Persian filenames. The default setup for the Mac is pretty well suited for handling this at the Cocoa-level of things, such as the Finder, TextEdit, and so on. But on the command-line, things are a bit different. For one, Terminal.app must be reconfigured to properly display Unicode characters. Then, you have to pass the -w flag to /bin/ls to get Unicode bytes in filenames to render correctly.

If you want pass a Persian filename to a script, many programs do not handle it at all. Some work transparently – they pass the encoded bytes right along to the underlying filesystem calls, which works great. But others convert the encoded filenames to their own encoding (usually MacRoman) which completely destroys UTF-8 characters. osascript is one of these.

If you write an Applescript with an “on run” handler, and call it with osascript, passing a UTF-8 encoded filename, your “on run” handler’s argument list will look nothing like what you passed in. But there is a trick for getting around this limitation. It appears that osascript does not translate data passed in via pipe. We can use this knowledge to trick osascript into reading its argument list in a different way instead of “on run”.

To do this requires making a shell script with two forks. The data fork is a regular shell script whose job is to package the argument list into a string that can be piped directly to osascript. The resource fork is the Applescript itself, compiled to read and unpackage those arguments from the other side of the pipe.

First, the script template, which is always the same:

#!/bin/sh

case $# in
0)
    echo "Usage: ${0##*/} file [ file... ]" >&2
    exit 1 ;;
esac

{   arg=$1
    echo -nE "$arg"
    shift

    for arg in "$@"; do
        echo -ne '\x00'; echo -nE "$arg"
    done
} | /usr/bin/osascript -- "$0"

Next, the Applescript template. After this header, refer to your argument list using the argv list:

set argv to do shell script "/bin/cat"

set AppleScript's text item delimiters to ASCII character 0
set argv to argv's text items
set AppleScript's text item delimiters to {""}

-- The rest of your script follows here...

To bind these pieces together, we’ll assume you’ve called the shell script template.sh, and your Applescript myscript.script. First you need to compile the Applescript:

osacompile -o myscript.scpt -- myscript.script

Then bind the compiled Applescript to the resource fork of the final script:

ditto -rsrc myscript.scpt myscript

Next, copy the shell script template to the data fork of the final script:

cat -- template.sh > myscript

And finally, mark the script executable and delete the byproducts:

chmod 755 myscript
rm myscript.scpt

Now you can run myscript and pass it a UTF-8 encoded filename, and the Applescript will see it as a properly encoded string of type “Unicode text”.