JIS-Unicode conversion table for pTeX/upTeX

Created on 24 Feb 2013

Abstract

The major platforms implement several different conversion tables between "Unicode" and "JIS X 0208 (JIS)". These differences sometimes break round-trip conversion when data are transferred between platforms. To avoid this instability, many-to-one conversion is widely used, where "many" means several code points in "Unicode" and "one" means a single code point in "JIS".
This article proposes a practical conversion method for pTeX/upTeX in the ptexenc library.

Strategy

To achieve stable conversion in a multi-platform environment, the following strategy is adopted. Conversion from Unicode to JIS is mainly used in pTeX (whose internal encoding is Shift_JIS or EUC-JP with the JIS X 0208 character set). Conversion from JIS to Unicode is used in upTeX (whose internal encoding is Unicode).
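The two directions can be pictured as two lookup tables: a many-to-one table for Unicode to JIS and a one-to-one table for JIS to Unicode. The sketch below is illustrative only. The function names are hypothetical, and the choice of which code points are accepted and which one is returned is an assumption made for the example, not a copy of the ptexenc tables.

```python
from typing import Optional

# Illustrative data only: the accepted variants and the chosen Unicode
# code point per JIS code are assumptions for this sketch.
JIS_WAVE_DASH = 0x2141   # ku-ten 1-33
JIS_MINUS_SIGN = 0x215D  # ku-ten 1-61

# Unicode -> JIS (pTeX side): many Unicode code points map to one JIS code.
FROM_UNICODE = {
    0x301C: JIS_WAVE_DASH,   # WAVE DASH
    0xFF5E: JIS_WAVE_DASH,   # FULLWIDTH TILDE
    0x2212: JIS_MINUS_SIGN,  # MINUS SIGN
    0xFF0D: JIS_MINUS_SIGN,  # FULLWIDTH HYPHEN-MINUS
}

# JIS -> Unicode (upTeX side): each JIS code maps to exactly one code point.
TO_UNICODE = {
    JIS_WAVE_DASH: 0x301C,
    JIS_MINUS_SIGN: 0x2212,
}


def unicode_to_jis(ucs: int) -> Optional[int]:
    """Hypothetical helper: many-to-one lookup, None if unmapped."""
    return FROM_UNICODE.get(ucs)


def jis_to_unicode(jis: int) -> Optional[int]:
    """Hypothetical helper: one-to-one lookup, None if unmapped."""
    return TO_UNICODE.get(jis)


# Whichever variant a platform produced, the JIS result is the same.
for ucs in (0x301C, 0xFF5E):
    jis = unicode_to_jis(ucs)
    print(f"U+{ucs:04X} -> JIS 0x{jis:04X} -> U+{jis_to_unicode(jis):04X}")
```

With this shape, JIS to Unicode to JIS always returns the original JIS code, while Unicode input from any platform collapses to a single JIS code point.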

The conversion table

The proposed conversion table follows. "from Unicode" denotes conversion from Unicode to JIS, and "to Unicode" denotes conversion from JIS to Unicode.
JIS X 0208 | Unicode | Implementation | ptexenc
hex | ku-ten | Character Name | CMap H,V | UCS2 | Character Name | CMap UniJIS-UTF16-H,V | JIS | Windows | MAC | Java | to Unicode | from Unicode
0x2126 1-6 KATAKANA MIDDLE DOT 638 U+2022 • BULLET 119 -----△ ‡2
U+2219 ∙ BULLET OPERATOR 117 -----△ ‡2
U+22C5 ⋅ DOT OPERATOR ------△ ‡2
U+30FB ・ KATAKANA MIDDLE DOT 638 †1
0x2131 1-17 OVERLINE 649 U+203E ‾ OVERLINE 226 --✓(98)--
U+FFE3 ￣ FULLWIDTH MACRON 649 ✓(99) (†1)
0x213D 1-29 EM DASH 661 U+2012 ‒ FIGURE DASH 114 -----△ ‡2
U+2013 – EN DASH 114 -----△ ‡2
U+2014 — EM DASH 138 -✓(140)-
U+2015 ― HORIZONTAL BAR 661 --✓(131) †2
0x2141 1-33 WAVE DASH 665 U+223C ∼ TILDE OPERATOR 100 -----△ ‡2
U+223E ∾ INVERTED LAZY S ------△ ‡2
U+301C 〜 WAVE DASH 665 - †3
U+FF5E ～ FULLWIDTH TILDE 665 ----
0x2142 1-34 DOUBLE VERTICAL LINE 666 U+2016 ‖ DOUBLE VERTICAL LINE 666 - †4
U+2225 ∥ PARALLEL TO 15489 ‡1 ----
0x2143 1-35 VERTICAL LINE 667 U+2223 ∣ DIVIDES ------△ ‡2
U+FF5C ｜ FULLWIDTH VERTICAL LINE 667 †1
0x2144 1-36 HORIZONTAL ELLIPSIS 668 U+2026 … HORIZONTAL ELLIPSIS 668 ✓(99) (†1)
U+22EF ⋯ MIDLINE HORIZONTAL ELLIPSIS ---✓(98)--
0x215D 1-61 MINUS SIGN 693 U+2212 − MINUS SIGN 693 - †3
U+FF0D － FULLWIDTH HYPHEN-MINUS 693 ----
0x216F 1-79 YEN SIGN 711 U+00A5 ¥ YEN SIGN 61 ---
U+FFE5 ￥ FULLWIDTH YEN SIGN 711 (✓)- †5
0x2171 1-81 CENT SIGN 713 U+00A2 ¢ CENT SIGN 102 --
U+FFE0 ￠ FULLWIDTH CENT SIGN 713 --- †5
0x2172 1-82 POUND SIGN 714 U+00A3 £ POUND SIGN 103 --
U+FFE1 ￡ FULLWIDTH POUND SIGN 714 --- †5
0x224C 2-44 NOT SIGN 751 U+00AC ¬ NOT SIGN 153 --
U+FFE2 ￢ FULLWIDTH NOT SIGN 751 --- †5
0x227E 2-94 LARGE CIRCLE 779 U+20DD  ⃝ COMBINING ENCLOSING CIRCLE 16328 ‡1 -----△ ‡2
U+25EF ◯ LARGE CIRCLE 779 †1
"✓" denotes "Yes". "-" denotes "No". "△" denotes "Negligible".
JIS: Conversion table given in JIS X 0208. It is informative (参考), not normative (規定).
Windows: Conversion table of Windows CP932.
MAC: Conversion table of Macintosh OS. (98) and (99) denote Mac OS before August 1998 and after September 1999, respectively.
Java: Conversion table of Java JRE. (131) and (140) denote Java before JRE1.3.1 and after JRE1.4.0, respectively.
†1: The other candidates are rare and negligible.
†2: The Windows conversion is invalid under the JIS definition of the character name. However, if we follow the JIS conversion, CMap H,V and CMap UniJIS-UTF16-H,V are not consistent.
†3: Both candidates have the same CID in CMap UniJIS-UTF16-H,V. The Windows conversion is invalid under the JIS definition of the character name.
†4: If we follow the Windows conversion, CMap H,V and CMap UniJIS-UTF16-H,V are not consistent. Moreover, the Windows conversion is invalid under the JIS definition of the character name.
†5: If we follow the JIS conversion, CMap H,V and CMap UniJIS-UTF16-H,V are not consistent. In addition, the Windows and JIS conversions differ only in character width (FULLWIDTH or not).
‡1: Adobe-Japan1-5.
‡2: The candidate is rare and negligible.
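The "hex" and "ku-ten" columns of the table, and the Shift_JIS and EUC-JP forms used internally by pTeX, are related by fixed arithmetic on the two JIS X 0208 bytes. The following is a minimal sketch of that arithmetic; the function names are hypothetical and are not the ptexenc API.

```python
def jis_to_kuten(jis: int) -> tuple[int, int]:
    """Split a JIS X 0208 code such as 0x2126 into (ku, ten) = (1, 6)."""
    return (jis >> 8) - 0x20, (jis & 0xFF) - 0x20


def jis_to_euc(jis: int) -> bytes:
    """EUC-JP sets the high bit of each JIS byte."""
    return bytes([(jis >> 8) | 0x80, (jis & 0xFF) | 0x80])


def jis_to_sjis(jis: int) -> bytes:
    """Shift_JIS packs two JIS rows into one lead byte."""
    hi, lo = jis >> 8, jis & 0xFF
    if hi & 1:
        # Odd row: trail bytes start at 0x40 and skip 0x7F.
        lo += 0x1F
        if lo >= 0x7F:
            lo += 1
    else:
        # Even row: trail bytes start at 0x9F.
        lo += 0x7E
    hi = (hi - 0x21) // 2 + 0x81
    if hi > 0x9F:
        # Lead bytes skip the range 0xA0-0xDF.
        hi += 0x40
    return bytes([hi, lo])


# A few JIS codes taken from the "hex" column of the table.
for jis in (0x2126, 0x2141, 0x224C, 0x227E):
    ku, ten = jis_to_kuten(jis)
    print(f"0x{jis:04X}  {ku}-{ten}  "
          f"EUC-JP {jis_to_euc(jis).hex()}  Shift_JIS {jis_to_sjis(jis).hex()}")
```

For instance, 0x2141 (1-33, WAVE DASH) becomes the bytes A1 C1 in EUC-JP and 81 60 in Shift_JIS.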

References

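JIS-Unicode間の変換表の選択について (On choosing a conversion table between JIS and Unicode) by 市岡 耕平-san (in Japanese)
CMap Resources at SourceForge
増補改訂 JIS漢字辞典 (JIS Kanji Dictionary, enlarged and revised edition) by 日本規格協会 (Japanese Standards Association), ISBN 4-542-20129-5 (in Japanese)
従来の文字コードとUnicodeの対応に関する諸問題 (Problems concerning the correspondence between legacy character codes and Unicode) by ITO Takayuki-san (in Japanese)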

$Lastupdate: Sun Feb 24 01:28:29 2013 $
Tanaka