Converting JSON string collections to binary assets for non-UTF-8 compliant devices

#JSON #UTF8 #Localization #Translation

Published on 14 April 2022 by Andrew Owen (4 minutes)

Last time I wrote about localization with Weblate. This week, I’ll show how the SE Basic IV project takes JSON output from Weblate and converts it into binaries that can be used with 8-bit code pages by the interpreter. This is a very niche use case, but hopefully there are some general lessons that you can apply to your own projects.

The project uses a Git repository hosted on GitHub. At the top level is a locales folder containing the English JSON template used to produce the other language files. These files are mostly identified by a two-letter ISO language code, except where disambiguation is required (such as Latin American Spanish).

Here’s a snippet of the template:

    {
        "NAME": "English",
        "FILENAME": "EN.LN",
        "ICONV": "IBM437",
        "SCROLL": "Scroll?____",
        "ERROR": " in ",
        "READY": "Ready_______",
        "SYNTAX": "Syntax error",
    ...
        "PATH": "Path not found"
    }

The NAME field is used by Weblate and the build script (for reporting purposes).

The FILENAME field sets the output filename.

The ICONV field sets the code page to be used when converting from UTF-8.

The first three translatable terms contain padding in the form of underscores. This is required because the interpreter expects these terms to be stored at fixed addresses. The build script converts every underscore to a null byte (code point 0). The remaining terms are null-terminated: the script joins the terms with single underscores, which become the terminators.
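To make the padding concrete, here's a minimal shell sketch. It isn't part of the project's build script: it uses tr in place of Perl to swap the underscores for null bytes, and the variable names are only illustrative.

```shell
# Padded term from the template: 7 visible characters plus 4 underscores.
term='Scroll?____'

# Swap every underscore for a null byte (tr stands in for the Perl one-liner),
# then count all bytes and the null bytes among them.
total=$(printf '%s' "$term" | tr '_' '\0' | wc -c | tr -d ' ')
nulls=$(printf '%s' "$term" | tr '_' '\0' | tr -cd '\0' | wc -c | tr -d ' ')

# The length is unchanged, so the term still fills its fixed-size slot.
echo "bytes: $total, of which null: $nulls"
```

One consequence of this scheme is that a translated string can't contain a literal underscore, because every underscore ends up as a null byte.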

The entries in the JSON file can be in any order. The build script will output them in the required order.
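The order doesn't matter because each term is looked up by key. As a rough sketch (assuming jq is installed; the sample file, the shortened key list and the variable names are hypothetical), the lookup could be written like this:

```shell
set -eu
work=$(mktemp -d)

# Sample locale file with the keys deliberately out of order.
cat > "$work/en.json" <<'EOF'
{
    "READY": "Ready_______",
    "ERROR": " in ",
    "NAME": "English",
    "SCROLL": "Scroll?____"
}
EOF

# Hypothetical, shortened list of keys in the order the interpreter expects.
required_order='SCROLL ERROR READY'

joined=''
for key in $required_order; do
    # -r prints the raw string; -e makes jq exit non-zero if the key is missing.
    value=$(jq -e -r --arg k "$key" '.[$k]' "$work/en.json")
    # Join with a single underscore, as the build script does.
    joined="${joined}${value}_"
done

echo "$joined"
```

The -e flag is my addition. Without it, jq -r prints the literal string null for a missing key, which would end up in the binary unnoticed.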

When changes are pushed to the locales folder, a GitHub Actions workflow is triggered that launches a GitHub-hosted runner to run the build script. The workflow is defined in a YAML file:

    name: build localized message bundles
    on:
      push:
        paths:
          - 'locales/**'
      workflow_dispatch:
    jobs:
      locales:
        runs-on: ubuntu-latest
        permissions:
          contents: write
        steps:
          - uses: actions/checkout@v4
          - run: |
              ./scripts/locales.sh
              git config user.name github-actions
              git config user.email github-actions@github.com
              git add .
              git commit -m "locales"
              git push origin main

The name parameter is used for reporting purposes.

The on parameter sets when the workflow runs: in this case, on a push that changes any file at any depth (**) in the locales folder. Because no branches filter is set, a push to any branch triggers the workflow; add a branches filter under push to restrict it to main. The workflow_dispatch parameter enables you to run the workflow manually from the GitHub Actions web interface.

The jobs parameter defines what the action does. Here locales is a job name.

The runs-on parameter sets the OS used by the hosted runner, in this case Ubuntu latest (x64). You can also specify other OSes, but if you have to pay for your runners (for instance because you’re developing commercial software), Linux is the cheapest option.

The steps parameter lists the steps that make up the job. The uses parameter enables you to use predefined actions such as actions/checkout@v4. This puts a copy of the repository on the hosted runner.

The run: | parameter enables you to specify a list of shell commands, one per line. This step performs these tasks:

Run the locales.sh shell script.
Set the Git user.name and user.email. These will show that changes were made by an action.
Add all changes.
Commit the changes.
Push the changes.
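One thing to watch for with the commit step: git commit exits with an error when there is nothing to commit, which would fail the run if the build produced identical output. Here's a minimal sketch of a guard, assuming git is installed. The commit_if_changed helper is a hypothetical name of mine, not something from the project, and the demo runs in a throwaway repository.

```shell
set -eu

# Hypothetical helper: stage everything, then commit only if the index changed.
commit_if_changed() {
    git add .
    if git diff --cached --quiet; then
        echo "nothing to commit"
    else
        git commit -q -m "$1"
        echo "committed"
    fi
}

# Demo in a temporary repository so nothing real is touched.
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.name github-actions
git config user.email github-actions@github.com

echo '{"NAME": "English"}' > en.json
first=$(commit_if_changed "locales")    # new file, so a commit is made
second=$(commit_if_changed "locales")   # no changes, so the commit is skipped
echo "$first / $second"
```

In the workflow, the same check could wrap the git commit and git push lines so that a build with no changes ends quietly.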

Here’s a snippet of the locales.sh script:

    cd locales
    for f in *.json; do
        jname=$f
        # metadata: display name, output filename, target code page
        name=$(jq -r .NAME "$jname")
        echo "Generating $name"
        fname=$(jq -r .FILENAME "$jname")
        iconv=$(jq -r .ICONV "$jname")
        # translatable terms
        scroll=$(jq -r .SCROLL "$jname")
        error=$(jq -r .ERROR "$jname")
        ready=$(jq -r .READY "$jname")
        syntax=$(jq -r .SYNTAX "$jname")
    ...
        path=$(jq -r .PATH "$jname")
        # join the terms with underscores and append padding
        echo $scroll"_"$error"_"$ready"_"$ok"_"$break"_"$for"_"$syntax"_"$gosub"_"$data"_"$call"_"$overflow"_"$memory"_"$line"_"$subscript"_"$variable"_"$address"_"$statement"_"$type"_"$screen"_"$device"_"$stream"_"$channel"_"$function"_"$buffer"_"$next"_"$wend"_"$while"_"$file"_"$input"_"$path"____________________________________________________________________________________________________________________________________________________________________________________________________________________" > TEMP.LN
        # convert from UTF-8 to the locale's 8-bit code page
        iconv -f UTF-8 -t "$iconv" TEMP.LN > "$fname"
        # trim to the fixed size expected by the interpreter
        head -c 608 "$fname" > TEMP.LN
        mv TEMP.LN "$fname"
        # replace underscores with null bytes
        perl -pi -e 's/_/\0/g' "$fname"
        mv "$fname" "../ChloeVM.app/Contents/Resources/chloehd/SYSTEM/LANGUAGE.S/$fname"
    done

The script uses a for loop to iterate through every JSON file in the locales folder. Most of the actual work is done by two tools:

jq - a command line JSON processor.
iconv - a command line character encoding tool.
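To show what iconv does to the bytes, here's a small sketch, assuming your iconv knows the IBM437 code page (as the template's ICONV field suggests). The sample word is my own and is written with explicit UTF-8 byte escapes so the result doesn't depend on how the script file is saved.

```shell
set -eu

# "café" in UTF-8: é is the two bytes 0xC3 0xA9 (octal 303 251).
utf8_len=$(printf 'caf\303\251' | wc -c | tr -d ' ')

# In code page 437, é is the single byte 0x82, so the string gets shorter.
cp437_len=$(printf 'caf\303\251' | iconv -f UTF-8 -t IBM437 | wc -c | tr -d ' ')
cp437_hex=$(printf 'caf\303\251' | iconv -f UTF-8 -t IBM437 | od -An -tx1 | tr -d ' \n')

# prints: UTF-8: 5 bytes, IBM437: 4 bytes (63616682)
echo "UTF-8: $utf8_len bytes, IBM437: $cp437_len bytes ($cp437_hex)"
```

Because the byte count can change during conversion, it makes sense that the script trims to a fixed length after running iconv rather than before.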

For each key in the JSON file, the script uses jq to create an equivalent shell variable. The process for each file goes like this:

The terms are joined with underscores, followed by additional underscore padding, and written to a temporary file called TEMP.LN.
Then iconv converts the temporary file from UTF-8 to the appropriate code page and writes the result to the output filename.
The head command trims the binary to a fixed length of 608 bytes, writing the result back to TEMP.LN.
The temporary file is renamed over the output file.
Perl replaces the underscores (_) with the null character (code point 0).
The binary is moved into the correct place in the default file system.
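The steps above can be tried end to end on a small scale. This is a sketch under stated assumptions, not the project's script: it uses three sample terms, a demo length of 40 bytes instead of 608, a hypothetical XX.LN output name, and tr in place of Perl. It skips the final move into the file system.

```shell
set -eu
work=$(mktemp -d)
size=40   # assumed demo length; the real script trims to 608 bytes

# Three sample terms; "Prêt" is written with explicit UTF-8 bytes for ê.
scroll='Scroll?____'
error=' in '
ready=$(printf 'Pr\303\252t________')

# A run of underscores long enough to fill the file before trimming.
padding=$(head -c "$size" /dev/zero | tr '\0' '_')

# 1. Join the terms with underscores, add padding, write the temporary file.
printf '%s' "${scroll}_${error}_${ready}_${padding}" > "$work/TEMP.LN"
# 2. Convert from UTF-8 to the 8-bit code page.
iconv -f UTF-8 -t IBM437 "$work/TEMP.LN" > "$work/XX.LN"
# 3. Trim to the fixed length, then 4. rename over the output file.
head -c "$size" "$work/XX.LN" > "$work/TEMP.LN"
mv "$work/TEMP.LN" "$work/XX.LN"
# 5. Replace underscores with null bytes (tr stands in for Perl here).
tr '_' '\0' < "$work/XX.LN" > "$work/XX.bin"

final_len=$(wc -c < "$work/XX.bin" | tr -d ' ')
echo "final length: $final_len"
```

Whatever the length of the translated terms, the output is always exactly the fixed size, which is what lets the interpreter find the first terms at known addresses.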

Image: Original by Pineapple Supply Co.