UTF-8 Croatian code page

UTF-8 is a character encoding standard used for electronic communication. It is defined by the Unicode Standard, and its name is derived from Unicode Transformation Format - 8-bit. Almost every web page is stored in UTF-8. UTF-8 can encode all 1'112'064 valid Unicode scalar values using a variable-width encoding of one to four one-byte (8-bit) code units. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. It was designed for backward compatibility with ASCII: the first 128 code points (ASCII) need one byte. The next 1'920 code points need two bytes, which covers the remainder of almost all Latin-script alphabets as well as Greek, Cyrillic, Hebrew and Arabic. Three bytes are needed for the remaining 61'440 code points of the Basic Multilingual Plane, which include most Chinese, Japanese and Korean characters. Four bytes are needed for the 1'048'576 code points of the other planes, which include pictographic symbols such as emoji.

The description that follows considers only the Latin alphabet, with Croatian and US English as examples, that is, the special characters that need a second byte under the UTF-8 scheme described above. The number of bytes used changes automatically, depending on the character to be represented.
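The byte counts above follow directly from the code point ranges. The following is a minimal Python sketch that checks them against Python's built-in encoder; the names `utf8_length`, `range_sizes` and `checks` are illustrative and do not come from the original text.

```python
def utf8_length(code_point: int) -> int:
    """Number of UTF-8 bytes needed for one Unicode scalar value."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates are not scalar values")
    if code_point <= 0x7F:
        return 1          # ASCII
    if code_point <= 0x7FF:
        return 2          # includes the Croatian letters
    if code_point <= 0xFFFF:
        return 3          # rest of the Basic Multilingual Plane
    return 4              # supplementary planes

# How many scalar values fall into each byte length.
# The three-byte range excludes the 2,048 surrogate code points.
range_sizes = {
    1: 0x80,
    2: 0x800 - 0x80,
    3: 0x10000 - 0x800 - 0x800,
    4: 0x110000 - 0x10000,
}

# Compare the rule with the built-in encoder for one sample per length.
samples = ["A", "č", "€", "😀"]
checks = [
    (f"U+{ord(ch):04X}", utf8_length(ord(ch)), len(ch.encode("utf-8")))
    for ch in samples
]
```

Here `range_sizes` evaluates to 128, 1920, 61440 and 1048576, which add up to 1112064, and every entry in `checks` has the same length from the rule and from the encoder.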

Diacritical marks are signs of various shapes: dots, dashes, ticks, circles, etc. They are added to a letter (on any side) to give the letter or the word a distinct sound value. Letters carrying such marks are called diacritics or diacritical letters. In the Croatian alphabet, the letters with diacritical marks are č, ć, đ, š and ž; if the digraph dž, which is built from such a letter, is counted as well, the diacritics are č, ć, đ, dž, š and ž.

If uppercase and lowercase letters are both taken into account and the digraph 'dž' is excluded, the set of graphemes is as follows:

Č, Ć, Đ, Š, Ž - č, ć, đ, š, ž

When the Croatian and English versions of this page are compared, the Croatian diacritics appear identically on both, even though one page is declared as Croatian and the other as English. What matters, therefore, is how the second byte is encoded. Both pages were created with the 'Dreamweaver' software.

If you open the stored HTML files in a hex editor, you get the results shown in the following figure.

Figure 1. Codes for Croatian graphemes according to the code page 'windows-1250' and 'utf-8'

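According to the above, the following applies (in the utf-8 column, the hexadecimal value is the two-byte encoding and the decimal value is the Unicode code point):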

    
           windows-1250                 utf-8
        ==================       ==================
           Č = C8 = 200            Č = C48C = 268
           Ć = C6 = 198            Ć = C486 = 262
           Đ = D0 = 208            Đ = C490 = 272
           Š = 8A = 138            Š = C5A0 = 352
           Ž = 8E = 142            Ž = C5BD = 381
           č = E8 = 232            č = C48D = 269
           ć = E6 = 230            ć = C487 = 263
           đ = F0 = 240            đ = C491 = 273
           š = 9A = 154            š = C5A1 = 353
           ž = 9E = 158            ž = C5BE = 382

The first code group corresponds to enclosure e.), 'Latin 2'. The second code group corresponds to enclosure g.), 'Latin Extended-A'. You can find the complete set of characters for the UTF-8 code at https://www.fileformat.info/info/charset/UTF-8/list.htm.
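The two code groups in the table can be recomputed rather than read off a hex editor. The sketch below assumes Python's `cp1250` codec as the implementation of 'windows-1250'; the helper names `table_row` and `encode_two_bytes` are hypothetical. It also builds the two UTF-8 bytes by hand from the bit layout 110xxxxx 10xxxxxx, to show where values such as C4 8C come from.

```python
LETTERS = "ČĆĐŠŽčćđšž"

def table_row(ch: str) -> tuple:
    """Return (letter, cp1250 hex, cp1250 decimal, UTF-8 hex, code point)."""
    legacy = ch.encode("cp1250")   # one byte in windows-1250
    utf8 = ch.encode("utf-8")      # two bytes in UTF-8
    return (ch, legacy.hex().upper(), legacy[0], utf8.hex().upper(), ord(ch))

rows = [table_row(ch) for ch in LETTERS]

def encode_two_bytes(code_point: int) -> bytes:
    """Hand-built UTF-8 for U+0080..U+07FF (layout 110xxxxx 10xxxxxx)."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("not in the two-byte range")
    lead = 0xC0 | (code_point >> 6)      # top 5 bits go into the first byte
    trail = 0x80 | (code_point & 0x3F)   # low 6 bits go into the second byte
    return bytes([lead, trail])

# The hand-built bytes should match the built-in encoder for every letter.
matches = all(encode_two_bytes(ord(ch)) == ch.encode("utf-8") for ch in LETTERS)
```

For Č (code point 268, hexadecimal 10C) the top five bits are 00100 and the low six bits are 001100, which gives the bytes C4 and 8C.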

On a system configured for Croatian, the Windows OS natively uses the 'windows-1250' code page (essentially Latin 2), so web-page editing software (e.g. 'Dreamweaver') needs to know how to switch to the 'UTF-8' encoding. The change should be made through the menu for changing page parameters, after which the grapheme codes are converted automatically. If a page editor is used that cannot make this change, the bytes stay in 'windows-1250' while the page is read as 'UTF-8', and the result is invalid: �, �, �, �, � - �, �, �, �, �.
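The invalid result can be reproduced in a few lines. This is a sketch under the assumption that the page bytes are stored in 'windows-1250' (Python codec `cp1250`) but interpreted as UTF-8; the variable names are illustrative.

```python
text = "Č, Ć, Đ, Š, Ž - č, ć, đ, š, ž"

# Bytes as a windows-1250 editor would store them: one byte per letter.
raw = text.encode("cp1250")

# A reader that assumes UTF-8 cannot interpret those single bytes and
# substitutes U+FFFD (the replacement character) for each of them.
garbled = raw.decode("utf-8", errors="replace")

# Without errors="replace" the same decode fails outright.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Decoding with the code page that was actually used restores the text.
repaired = raw.decode("cp1250")
```

Each of the ten letters becomes one replacement character in `garbled`, while the commas, spaces and the hyphen survive because they are plain ASCII in both encodings.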

The author's computer runs the 'Windows 10 Enterprise 2016 LTSB - x86' operating system. If these pages are opened in the 'Edge' web browser and the paragraph with diacritic characters is copied to 'Notepad' and saved as a .txt file, the characters are displayed correctly. If the content is then copied from the text file into a newly opened file in 'Dreamweaver' and saved as a .html file, the diacritic characters are still displayed correctly. A 'casual' user sees no change. However, if both saved files are opened in a hexadecimal editor, e.g. 'HxD Hex Editor', it can be seen that the .txt file is encoded in the 'Latin 2' code page and the .html file in the 'UTF-8' code page, which means that the OS and the 'Dreamweaver' software ensure that this code conversion takes place properly. Of course, 'Dreamweaver' must be set to use Unicode (UTF-8) as the default page encoding.

So, there is no need to worry about this, which was not the case decades ago, more precisely until 1991, when the 'Unicode Standard' was established by the association 'The Unicode Consortium', which takes care of keeping the standard up to date. Almost the entire IT industry respects this standard. Without it, the author of these lines would not have been able to write this story.
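The conversion that happens between the .txt and .html files can be sketched as follows. This is not how 'Notepad' or 'Dreamweaver' work internally; it is a minimal illustration that assumes only the two encodings discussed here, and `guess_encoding` and `to_utf8` are hypothetical helper names.

```python
def guess_encoding(data: bytes) -> str:
    """Return "utf-8" if the bytes are valid UTF-8, otherwise "cp1250".

    Assumption: the data is in one of the two encodings from the text.
    """
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "cp1250"

def to_utf8(data: bytes) -> bytes:
    """Re-encode the data as UTF-8, whichever of the two it started in."""
    return data.decode(guess_encoding(data)).encode("utf-8")

# Bytes as they would appear in the .txt file (one byte per letter).
txt_bytes = "š i ž".encode("cp1250")

# Bytes as they would appear in the .html file (two bytes per letter).
html_bytes = to_utf8(txt_bytes)
```

In a hex editor `txt_bytes` would show as 9A 20 69 20 9E and `html_bytes` as C5 A1 20 69 20 C5 BE, although both display as the same text. Trying UTF-8 first is a reasonable heuristic for text with diacritics, because single 'windows-1250' letter bytes surrounded by ASCII are never valid UTF-8.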