Breaking our Latin-1 assumptions (2017)

So in my previous post I explored a specific (wrong) assumption that programmers
tend to make about the nature of code points and text.

I was asked multiple times about other assumptions we tend to make. There are a lot. Most
Latin-based scripts are simple, and most programmers spend their time dealing with Latin
text, so the complexities of other scripts never come up.

I thought it would be useful to share my personal list of
scripts that break our Latin-1 assumptions. This is a list I mentally check against
whenever I am attempting to reason about text. I check if I’m making any assumptions that
break in these scripts. Most of these concepts are independent of Unicode, so any program
would have to deal with this regardless of encoding.

I again recommend going through eevee’s post, since it covers many related issues.
Awesome-Unicode also has a lot of random tidbits about Unicode.

Anyway, here’s the list. Note that a lot of the concepts here exist in scripts other than the
ones listed; these are just the scripts I use for comparing.

Arabic / Hebrew

Both Arabic and Hebrew are RTL scripts; they read right-to-left. This may even affect how
a page is laid out; see the Hebrew Wikipedia for an example.

They both have a concept of letters changing how they look depending on where they are in the word.
Hebrew has the “sofit” letters, which use separate code points. For example, Kaf (כ) should be typed
as ך at the end of a word. Greek has something similar with the sigma.
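Here’s a rough Python sketch of what “separate code points” means in practice. Python is just my choice for illustration, and the Greek word is an arbitrary example; the only library used is the standard `unicodedata` module.

```python
import unicodedata

# Hebrew: the regular and final ("sofit") kaf are two distinct code points.
kaf = "\u05DB"        # כ
final_kaf = "\u05DA"  # ך
kaf_names = (unicodedata.name(kaf), unicodedata.name(final_kaf))

# Greek: capital sigma has two lowercase forms, σ and the word-final ς.
# Python's str.lower() picks between them based on context.
word = "\u039F\u0394\u039F\u03A3"  # ΟΔΟΣ
lowered = word.lower()
```

Note that lowercasing here is context-sensitive: the same capital letter maps to different code points depending on where it sits in the word.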

In Arabic, the letters can have up to four different forms, depending on whether they start a word,
end a word, are inside a word, or are used by themselves. These forms can look very different. They
don’t use separate code points for this, however. You can see a list of these forms here.
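One caveat worth sketching: Unicode does carry legacy “presentation form” code points for the individual shapes, kept for compatibility, but ordinary text uses only the base letter and leaves the shaping to the font. A small Python check, using the letter beh as an arbitrary example, shows the presentation forms all folding back to the one letter under NFKC:

```python
import unicodedata

BEH = "\u0628"  # ب, the code point you actually type

# Compatibility code points for the four positional shapes of beh.
forms = {
    "isolated": "\uFE8F",
    "final": "\uFE90",
    "initial": "\uFE91",
    "medial": "\uFE92",
}

# NFKC maps every presentation form back to the base letter.
normalized = {pos: unicodedata.normalize("NFKC", ch) for pos, ch in forms.items()}
```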

Arabic can get pretty tricky – the characters have to join up, and in cursive fonts (like those for Nastaliq),
you get a lot of complex ligatures.

As I mentioned in the last post, U+FDFD (﷽), a ligature representing the Basmala,
is also a character that breaks a lot of assumptions.

Indic scripts

Indic scripts are abugidas, where you have consonants with vowel modifiers. For example, क is
“kə”, where the upside down “e” is a schwa, something like an “uh” vowel sound. You can change the
vowel by adding a diacritic (e.g. ा), getting things like का (“kaa”), को (“koh”), and कू (“koo”).

You can also mash together consonants to create consonant clusters. The “virama” is a vowel-killer
symbol that removes the inherent schwa vowel. So, क + ् becomes क्. This sound itself is
unpronounceable since क is a stop consonant (vowel-killed consonants can be pronounced for nasal and some other
consonants though), but you can combine it with another consonant, as क् + र (“rə”), to get क्र
(“krə”). Consonants can be strung up infinitely, and you can stick one or more vowel diacritics
after that. Usually, you won’t see more than two consonants in a cluster, but larger ones are not
uncommon in Sanskrit (or when writing down some onomatopoeia). They may not get rendered as single
glyphs, depending on the font.
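To make the consonant + virama + consonant structure concrete, here’s a minimal Python sketch that spells out the code points behind the “krə” cluster and the “kaa” syllable. Nothing here does any rendering; it only inspects the string with the standard `unicodedata` module.

```python
import unicodedata

cluster = "\u0915\u094D\u0930"  # क्र ("krə"): KA + VIRAMA + RA
cluster_names = [unicodedata.name(ch) for ch in cluster]

kaa = "\u0915\u093E"            # का ("kaa"): KA + vowel sign AA
kaa_names = [unicodedata.name(ch) for ch in kaa]

# The virama is a nonspacing combining mark with its own combining class.
virama = "\u094D"
virama_info = (unicodedata.category(virama), unicodedata.combining(virama))
```

What looks like one letter on screen is three code points for the cluster and two for the syllable, so `len()` is not counting anything a reader would call a letter.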

One thing that crops up is that there’s no unambiguous concept of a letter here. There
is a concept of an “akshara”, which basically includes the vowel diacritics, and
depending on who you talk to may also include consonant clusters. Often a cluster is
considered an akshara or not depending on whether it’s drawn with an explicit virama
or forms a single glyph.

In general the nature of the virama as a two-way combining character in Unicode is pretty new.

Hangul

Korean does its own fun thing when it comes to conjoining characters. Hangul has a concept
of a “syllable block”, which is basically a letter. It’s made up of a leading consonant,
medial vowel, and an optional tail consonant. 각 is an example of
such a syllable block, and it can be typed as ᄀ + ᅡ + ᆨ. It can
also be typed as 각, which is a “precomposed form” (and a single code point).
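The two spellings of the same syllable block can be compared with a short Python sketch. The normalization calls are from the standard library; `compose_syllable` is a hypothetical helper of mine that applies the arithmetic Unicode uses for precomposed Hangul syllables (19 leads, 21 vowels, 28 tail slots including “no tail”).

```python
import unicodedata

jamo = "\u1100\u1161\u11A8"  # ᄀ + ᅡ + ᆨ (three conjoining jamo)
precomposed = "\uAC01"       # 각 (one code point)

composed = unicodedata.normalize("NFC", jamo)
decomposed = unicodedata.normalize("NFD", precomposed)

def compose_syllable(lead, vowel, tail=None):
    """Hypothetical helper: combine L, V and optional T jamo arithmetically."""
    l_index = ord(lead) - 0x1100
    v_index = ord(vowel) - 0x1161
    t_index = ord(tail) - 0x11A7 if tail else 0
    return chr(0xAC00 + (l_index * 21 + v_index) * 28 + t_index)
```

For example, `compose_syllable("\u1112", "\u1161", "\u11AB")` gives 한. The sketch assumes modern jamo only and does no range checking.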

These characters are examples of combining characters with very specific combining rules. Unlike
accents or other diacritics, these combining characters will combine with the surrounding characters
only when the surrounding characters form an L-V-T or L-V syllable block.

As I mentioned in my previous post, apparently syllable blocks with more (adjacent) Ls, Vs, and Ts are
also valid and used in Old Korean, so the grapheme segmentation algorithm in Unicode considers
“ᄀᄀᄀ각ᆨᆨ” to be a single grapheme (it explicitly mentions this).
I’m not aware of any fonts which render these as a single syllable block, or if that’s even
a valid thing to do.

Han scripts

So Chinese (Hanzi), Japanese (Kanji), Korean (Hanja), and Vietnamese (Hán tự, along with Chữ
Nôm) all share glyphs, collectively called “Han characters” (or CJK characters). These
languages at some point in their history borrowed the Chinese writing system, and made their own
changes to it to tailor to their needs.

Now, the Han characters are ideographs. This is not a phonetic script; individual characters
represent words. The word/idea they represent is not always consistent across languages. The
pronunciation is usually different too. Sometimes, the glyph is drawn slightly differently based on
the language used. There are around 80,000 Han ideographs in Unicode right now.

The concept of ideographs itself breaks some of our Latin-1 assumptions. For example, how
do you define Levenshtein edit distance for text using Han ideographs? The straight answer is that
you can’t, though if you step back and decide why you need edit distance you might be able
to find a workaround. For example, if you need it to detect typos, the user’s input method
may help. If it’s based on pinyin or bopomofo, you might be able to reverse-convert to the
phonetic script, apply edit distance in that space, and convert back. Or not. I only maintain
an idle curiosity in these scripts and don’t actually use them, so I’m not sure how well this would
work.

The concept of halfwidth characters is another quirk that breaks some assumptions: the same letter can exist as separate narrow and regular-width code points.
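As a rough illustration, Python’s `unicodedata.east_asian_width` reports the width class Unicode assigns to a character, and NFKC folds the halfwidth and fullwidth variants into their ordinary counterparts. The sample characters are my own picks.

```python
import unicodedata

samples = {
    "latin A": "A",
    "fullwidth A": "\uFF21",   # Ａ
    "katakana KA": "\u30AB",   # カ
    "halfwidth KA": "\uFF76",  # ｶ
}

# 'Na' = narrow, 'F' = fullwidth, 'W' = wide, 'H' = halfwidth
widths = {label: unicodedata.east_asian_width(ch) for label, ch in samples.items()}

# Compatibility normalization folds the width variants away.
folded = unicodedata.normalize("NFKC", "\uFF76\uFF21")
```

So code that pads columns by counting characters, or that compares strings without normalizing, can go wrong on text that mixes these forms.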

In the space of Unicode in particular, all of these scripts are represented by a single set of
ideographs. This is known as “Han unification”. This is a pretty controversial issue, but the
end result is that rendering may sometimes be dependent on the language of the text, which
e.g. in HTML you set with a lang attribute. The wiki page has some examples of
language-dependent characters.

Unicode also has a concept of variation selectors, which are code points that can be used to
select between variations for a code point that has multiple ways of being drawn. These
do get used in Han scripts.
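A minimal Python sketch of why this matters for string handling: the selector is an extra code point trailing the base character, so two strings a reader would call “the same character” can compare unequal. The base character 葛 is a commonly cited example that I’m assuming here, and `strip_variation_selectors` is a hypothetical helper, useful only for loose comparison since it throws the glyph choice away.

```python
import unicodedata

base = "\u845B"                      # 葛
with_selector = base + "\U000E0100"  # base + VARIATION SELECTOR-17
selector_name = unicodedata.name("\U000E0100")

def strip_variation_selectors(s: str) -> str:
    """Hypothetical helper: drop U+FE00..FE0F and U+E0100..E01EF."""
    return "".join(
        ch for ch in s
        if not (0xFE00 <= ord(ch) <= 0xFE0F or 0xE0100 <= ord(ch) <= 0xE01EF)
    )

# Normalization leaves the selector in place, so NFC alone won't unify them.
still_different = unicodedata.normalize("NFC", with_selector) != base
```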

While this doesn’t affect rendering, Unicode, as a system for describing text,
also has a concept of interlinear annotation characters. These are used to represent
furigana / ruby. Fonts don’t render this, but it’s useful if you want to represent
text that uses ruby. Similarly, there are ideographic description sequences which
can be used to “build up” glyphs from smaller ones when the glyph can’t be encoded in
Unicode. These, too, are not to be rendered, but can be used when you want to describe
the existence of a character like biáng. These are not things a programmer
needs to worry about; I just find them interesting and couldn’t resist mentioning them 🙂

Japanese speakers haven’t completely moved to Unicode; there are a lot of things out there
using Shift-JIS, and IIRC there are valid reasons for that (perhaps Han unification?). This
is another thing you may have to consider.
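In practice this mostly means not assuming the bytes you receive are UTF-8. A small Python sketch, using the built-in `shift_jis` codec and a sample string of my own:

```python
text = "\u65E5\u672C\u8A9E"  # 日本語

sjis_bytes = text.encode("shift_jis")
utf8_bytes = text.encode("utf-8")

# Decoding with the right codec round-trips cleanly.
roundtrip = sjis_bytes.decode("shift_jis")

# Decoding the same bytes as UTF-8 fails outright for this sample.
try:
    sjis_bytes.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False
```

This sample happens to fail loudly; other inputs may decode without an error and silently produce garbage, which is the nastier case.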

Finally, these scripts are often written vertically, top-down. Mongolian, while
not being a Han script, is written vertically sideways, which is pretty unique. The
CSS writing modes spec introduces various concepts related to this, though that’s mostly in the
context of the Web.

Thai / Khmer / Burmese / Lao

These scripts don’t use spaces to split words. Instead, they have rules for what kinds of sequences
of characters start and end a word. This can be determined programmatically; however, IIRC the
Unicode spec does not attempt to deal with this. There are libraries you can use here instead.
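A quick Python sketch of the assumption that breaks: whitespace tokenization, which works passably for English, returns a whole Thai sentence as one “word”. The sample sentence is mine. Real segmentation needs a dictionary-backed library (ICU’s break iterators are one option I’m aware of); the standard library has nothing for this.

```python
thai = "ฉันรักภาษาไทย"  # roughly "I love the Thai language", no spaces
english = "I love the Thai language"

thai_tokens = thai.split()
english_tokens = english.split()
```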

Latin scripts themselves!

Turkish uses a Latin-based script. But it has a quirk: the uppercase of “i” is
a dotted “İ”, and the lowercase of “I” is “ı”. If doing case-based operations, try to use
a Unicode-aware library, and try to provide the locale if possible.
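To show what goes wrong without a locale, here’s a Python sketch. Python’s `str.upper()` and `str.lower()` take no locale, so they apply the default mappings. `turkish_upper` and `turkish_lower` are hypothetical helpers I’m using only to illustrate the rule; a real program should lean on a locale-aware library instead.

```python
# Default (locale-independent) mappings, wrong for Turkish text:
default_upper = "istanbul".upper()
default_lower = "ISPARTA".lower()

def turkish_upper(s: str) -> str:
    """Hypothetical helper: i -> İ, ı -> I, then the default mapping."""
    return s.replace("i", "\u0130").replace("\u0131", "I").upper()

def turkish_lower(s: str) -> str:
    """Hypothetical helper: İ -> i, I -> ı, then the default mapping."""
    return s.replace("\u0130", "i").replace("I", "\u0131").lower()

# Bonus quirk: lowercasing İ without a locale yields "i" + combining dot above.
dotted_lower = "\u0130".lower()
```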

Also, not all code points have a single-codepoint uppercase version. The eszett (ß) capitalizes
to “SS”. There’s also the “capital” eszett ẞ, but its usage seems to vary and I’m not exactly
sure how it interacts here.
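For what it’s worth, here is how Python’s built-in case mappings treat the two characters. This only shows one implementation’s behaviour, not what is typographically correct in German.

```python
eszett = "\u00DF"          # ß
capital_eszett = "\u1E9E"  # ẞ

upper_of_eszett = eszett.upper()          # grows to two characters
lower_of_capital = capital_eszett.lower()
roundtrip = eszett.upper().lower()        # does not come back to ß

# casefold() is the tool for caseless comparison.
same_word = "stra\u00DFe".casefold() == "STRASSE".casefold()
```

The takeaway is that case conversion can change a string’s length and is not reversible, so buffers sized from the input or “upper then lower” round-trips are both unsafe assumptions.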

While Latin-1 uses precomposed characters, Unicode also introduces ways to specify the same
characters via combining diacritics. Treating these the same involves using the normalization
algorithms (NFC/NFD).
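A minimal Python sketch of the problem and the fix, using “é” as an arbitrary example; `same_text` is a hypothetical helper name.

```python
import unicodedata

precomposed = "\u00E9"   # é as one code point (what Latin-1 has)
combining = "e\u0301"    # e + COMBINING ACUTE ACCENT

naive_equal = precomposed == combining

def same_text(a: str, b: str) -> bool:
    """Hypothetical helper: compare after normalizing both sides to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

nfd_form = unicodedata.normalize("NFD", precomposed)
```

Either form works for comparison as long as both sides use the same one; NFC tends to be the more compact choice for storage.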

Emoji

Well, not a script. But emoji is weird enough that it breaks many of our assumptions. The
scripts above cover most of these, but it’s sometimes easier to think of them
in the context of emoji.

The main thing with emoji is that you can use a zero-width-joiner character to glue emoji together.

For example, the family emoji 👩‍👩‍👧‍👦 (may not render for you) is made by using the woman/man/girl/boy
emoji and gluing them together with ZWJs. You can see its decomposition in uniview.
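If you don’t have uniview handy, a few lines of Python give the same decomposition. The sequence below is written with escapes and is the woman, woman, girl, boy variant of the family emoji, which is my reading of the one shown above.

```python
import unicodedata

ZWJ = "\u200D"
family = "\U0001F469" + ZWJ + "\U0001F469" + ZWJ + "\U0001F467" + ZWJ + "\U0001F466"

# One visible glyph (with font support), seven code points.
code_point_count = len(family)
names = [unicodedata.name(ch) for ch in family]

# Splitting on the joiner recovers the individual people.
members = [unicodedata.name(ch) for ch in family.split(ZWJ)]
```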

There are more sequences like this, which you can see in the emoji-zwj-sequences file. For
example, MAN + ZWJ + COOKING will give a male cook emoji (font support is sketchy).
Similarly, SWIMMER + ZWJ + FEMALE SIGN is a female swimmer. You have both sequences of
the form “gendered person + zwj + thing”, and “emoji containing human + zwj + gender”,
IIRC due to legacy issues.

There are also modifier characters that let you change the skin tone of an emoji that
contains a human (or human body part, like the hand-gesture emojis) in it.
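As a small Python sketch, a modifier is just a second code point placed directly after the base emoji. The waving hand and the particular tone are arbitrary picks of mine.

```python
import unicodedata

wave = "\U0001F44B"         # waving hand
medium_tone = "\U0001F3FD"  # one of the five skin tone modifiers

toned_wave = wave + medium_tone
modifier_name = unicodedata.name(medium_tone)

# The modifiers occupy the contiguous range U+1F3FB..U+1F3FF.
def is_skin_tone_modifier(ch: str) -> bool:
    return 0x1F3FB <= ord(ch) <= 0x1F3FF
```

Truncating a string by code point count can therefore split the modifier from its base and change what the user sees.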

Finally, the flag emoji are pretty special snowflakes. For example, 🇪🇸 is the Spanish
flag. It’s made up of two regional indicator characters for “E” and “S”.

Unicode didn’t want to deal with adding new flags each time a new country or territory pops up. Nor
did they want to get into the tricky business of determining what a country is, for example
when dealing with disputed territories. So instead, they just defined these regional indicator
symbols. Fonts are supposed to take pairs of RI symbols and map the country code to a flag.
This mapping is up to them, so it’s totally valid for a font to render a regional indicator
pair “E” + “S” as something other than the flag of Spain. On some Chinese systems, for example,
the flag for Taiwan (🇹🇼) may not render.
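The mapping from a two-letter code to a regional indicator pair is plain arithmetic, which a short Python sketch can show. `flag` and `country_code` are hypothetical helper names; whether the result renders as a flag is entirely up to the font.

```python
import unicodedata

RI_BASE = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A

def flag(code: str) -> str:
    """Hypothetical helper: two ASCII letters -> regional indicator pair."""
    return "".join(chr(RI_BASE + ord(c) - ord("A")) for c in code.upper())

def country_code(pair: str) -> str:
    """Hypothetical helper: regional indicator pair -> ASCII letters."""
    return "".join(chr(ord(c) - RI_BASE + ord("A")) for c in pair)

spain = flag("ES")
spain_names = [unicodedata.name(ch) for ch in spain]
```

Since a flag is two code points with no joiner between them, a run of several flags is just a run of regional indicators, and splitting it at an odd offset re-pairs every symbol after the cut.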


I highly recommend comparing against this relatively small list of scripts the next time you
are writing code that does heavy manipulation of user-provided strings.
