Charset policy

From Rosalab Wiki
[[ru:Charset_policy]]
 
[[Category:Translation and localization]]
 
[[Category:Packaging Policies]]

Latest revision as of 10:53, 30 May 2012

This page explains why spec files should be encoded in UTF-8 (or plain ASCII), and what kinds of problems arise otherwise.


In short, the whole spec file should be written in UTF-8 if any non-ASCII characters have to be used. Places where such characters occur frequently include changelogs, descriptions, summaries and names. Using a legacy encoding inside a spec file introduces a number of risks, which at worst may leave the spec file broken.


Why?

Although rpm supports translations of the description and summary inside the spec file itself, this has proven to be a nightmare: each packager uses their own encoding of choice, which leaves a file with mixed encodings. No editor can handle such a spec file properly once more than one translation exists. For description or summary translations, po files are therefore the way to go.

In ROSA, such translations are forbidden by policy. Summary translations can be done in the mdv-rpm-summary package, which also has some disadvantages. There are several proposals on how the translation process could be improved (and decoupled from the build process), but none of them has been implemented yet.


Others may be unable to read your file

Everybody has their own system, and they may not migrate because the old encoding "just works". However, many of the encodings in this world are not compatible with each other. Their only common part is the ASCII characters (so ASCII is safe for spec files). Non-ASCII ISO-8859-1 characters, for example, show up only as junk in other locales.

This, of course, requires both the writer and the reader to migrate to UTF-8, but UTF-8 is already the de facto standard in the Linux world. Otherwise, people are less likely to help, because what you code or write is unreadable to them (unless you do not want others to help in the first place).


Editors may damage the spec file

This is related to the first point, but worse. On systems whose legacy charset is not ISO-8859-1, text editors may be unable to handle the "junk" text, and may even attempt to "correct" it by modifying characters, leaving the file even more broken.


How to fix?

In the spec file itself

It is the spec file that matters, so remember to use UTF-8 throughout the file. To test if your spec file is indeed in UTF-8, use iconv to filter it:

iconv -f UTF-8 -t UTF-8 -o /dev/null yourpackage.spec

If it doesn't complain, the file is in UTF-8 (or ASCII). Otherwise it reports the position at which the UTF-8 check failed; you can also remove the -o /dev/null argument to look at the output yourself.
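The iconv check above can be wrapped in small shell helpers, and the same tool can convert a legacy-encoded file once you know its original charset. This is a minimal sketch, not an official tool: the function names is_utf8 and to_utf8 are hypothetical, and the source charset (ISO-8859-1 in the usage comment) is an assumption you must replace with the encoding the file was really written in.

```shell
# is_utf8 FILE (hypothetical helper name)
# Succeeds if FILE is valid UTF-8; plain ASCII also passes.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# to_utf8 CHARSET FILE (hypothetical helper name)
# Prints FILE, interpreted as CHARSET, converted to UTF-8 on stdout.
# iconv cannot guess CHARSET; the caller has to know it.
to_utf8() {
    iconv -f "$1" -t UTF-8 "$2"
}

# Usage (assuming the file was written in ISO-8859-1):
#   is_utf8 yourpackage.spec || to_utf8 ISO-8859-1 yourpackage.spec > yourpackage.spec.utf8
#   for f in *.spec; do is_utf8 "$f" || echo "not UTF-8: $f"; done
```

Writing the converted text to a new file rather than over the original lets you review the result before replacing the spec file.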


In various config files

Remember to use UTF-8 in ~/.rpmmacros as well, since your name is read from there (from the %packager line):

%distribution ROSA Linux
%vendor ROSA
%packager Test Packager <blahblah@rosalinux.org>

If you also use rebuild-rpm and similar building tools, please remember to change your name in the corresponding config files too.


UTF-8 enabled editors

Many text editors, both graphical and text-mode ones, support UTF-8 natively.


Language environment

Use UTF-8 in your shell environment too whenever possible; that eliminates a lot of headaches. For example, these are the locale settings as printed by the locale command:

LANG=ru_RU.UTF-8
LC_CTYPE=ru_RU.UTF-8
LC_NUMERIC=ru_RU.UTF-8
LC_TIME=ru_RU.UTF-8
LC_COLLATE=ru_RU.UTF-8
LC_MONETARY=ru_RU.UTF-8
LC_MESSAGES=ru_RU.UTF-8
LC_PAPER=ru_RU.UTF-8
LC_NAME=ru_RU.UTF-8
LC_ADDRESS=ru_RU.UTF-8
LC_TELEPHONE=ru_RU.UTF-8
LC_MEASUREMENT=ru_RU.UTF-8
LC_IDENTIFICATION=ru_RU.UTF-8
LC_ALL=

You can change the language settings in the ~/.i18n file in your home directory.
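To check from a script whether the environment is set up as in the listing above, you can look at the locale name that governs character handling. The following is a minimal sketch with hypothetical function names (effective_ctype, locale_name_is_utf8); it relies only on the standard precedence LC_ALL, then LC_CTYPE, then LANG, and on the usual .UTF-8 or .utf8 codeset suffix in locale names.

```shell
# effective_ctype (hypothetical helper name)
# Prints the locale name that applies to character handling:
# LC_ALL overrides LC_CTYPE, which overrides LANG; empty values count as unset.
effective_ctype() {
    printf '%s\n' "${LC_ALL:-${LC_CTYPE:-${LANG:-C}}}"
}

# locale_name_is_utf8 NAME (hypothetical helper name)
# Succeeds if NAME, e.g. ru_RU.UTF-8, carries a UTF-8 codeset suffix.
locale_name_is_utf8() {
    case "$1" in
        *.UTF-8|*.utf8|*.UTF-8@*|*.utf8@*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage:
#   locale_name_is_utf8 "$(effective_ctype)" || echo "shell locale is not UTF-8"
```

Running locale charmap is an alternative check; it asks the C library directly instead of inspecting the locale name.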


Note:
This Policy is based on the Mandriva Charset Policy.