weixin_39818691
2020-11-23 05:22

ICU-9562 Update language tag mapping per the latest IANA registry

uloc_forLanguageTag has a few mapping tables to map grandfathered language tags and deprecated language subtags to their preferred or modern values.

Update them based on the latest version of the IANA language subtag registry. [1]

Five grandfathered tags without a preferred value are still mapped to the values ICU has used for them, for backward compatibility, until the wisdom of continuing to do so is reviewed.

In addition, map redundant language tags to their preferred values regardless of whether they're followed by other subtags (e.g. zh-yue vs. zh-yue-u-co-pinyin).
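The prefix handling described above can be sketched as follows. This is a minimal, self-contained illustration and not the code in ICU: the table `kRedundant`, the struct `TagMapping`, and the function `replaceRedundantPrefix` are hypothetical names, and the table holds only two entries for brevity.

```cpp
#include <cstddef>
#include <string>

// Hypothetical, abbreviated table of redundant tags and their preferred values.
struct TagMapping {
    const char* from;
    const char* to;
};

static const TagMapping kRedundant[] = {
    {"zh-yue", "yue"},
    {"zh-gan", "gan"},
};

// Lowercases ASCII letters only; language tags are ASCII and case-insensitive.
static std::string asciiLower(const std::string& s) {
    std::string out(s);
    for (std::size_t i = 0; i < out.size(); ++i) {
        if (out[i] >= 'A' && out[i] <= 'Z') {
            out[i] = static_cast<char>(out[i] - 'A' + 'a');
        }
    }
    return out;
}

// Replaces a redundant tag at the start of `tag`, whether or not more
// subtags follow. Anything after the matched prefix is kept unchanged.
std::string replaceRedundantPrefix(const std::string& tag) {
    const std::string lower = asciiLower(tag);
    for (const TagMapping& m : kRedundant) {
        const std::string from(m.from);
        if (lower.compare(0, from.size(), from) != 0) continue;
        // Match only at a subtag boundary, so "zh-yuefoo" is left alone.
        if (lower.size() == from.size() || lower[from.size()] == '-') {
            return std::string(m.to) + tag.substr(from.size());
        }
    }
    return tag;
}
```

With this sketch, both `zh-yue` and `zh-yue-u-co-pinyin` have their leading `zh-yue` replaced by `yue`, while a tag that merely starts with the same characters is returned untouched.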

Similarly, ja-latn-hepburn-heploc is mapped to ja-latn-alalc97 (the variant subtag 'heploc' with the prefix 'ja-latn-hepburn' has the preferred value 'alalc97').

Update the mapping for deprecated language subtags (e.g. 'jw' to 'jv' and a bunch of 3-letter language codes).

Add a new table for deprecated region subtags to map them to their modern values. (e.g. 'DD' to 'DE').
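As a rough illustration of the two kinds of table mentioned above, here is a self-contained sketch. It is not the ICU implementation: `kLanguages`, `kRegions`, and `modernizeSubtags` are hypothetical names, the tables are heavily abbreviated, and the region detection assumes a simplified tag shape (the first two-letter subtag after the language and before any singleton is treated as the region).

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct SubtagMapping {
    const char* deprecated;
    const char* preferred;
};

// Hypothetical, abbreviated tables; the registry contains many more entries.
static const SubtagMapping kLanguages[] = {
    {"in", "id"}, {"iw", "he"}, {"ji", "yi"}, {"jw", "jv"},
};
static const SubtagMapping kRegions[] = {
    {"BU", "MM"}, {"DD", "DE"}, {"ZR", "CD"},
};

static char lowerAscii(char c) {
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

static bool sameIgnoringCase(const std::string& a, const std::string& b) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (lowerAscii(a[i]) != lowerAscii(b[i])) return false;
    }
    return true;
}

// Returns the preferred value if `subtag` is in `table`, else `subtag` itself.
template <std::size_t N>
static std::string preferredFor(const std::string& subtag,
                                const SubtagMapping (&table)[N]) {
    for (std::size_t i = 0; i < N; ++i) {
        if (sameIgnoringCase(subtag, table[i].deprecated)) return table[i].preferred;
    }
    return subtag;
}

static std::vector<std::string> splitOnHyphen(const std::string& tag) {
    std::vector<std::string> parts;
    std::size_t start = 0;
    for (;;) {
        const std::size_t dash = tag.find('-', start);
        if (dash == std::string::npos) {
            parts.push_back(tag.substr(start));
            return parts;
        }
        parts.push_back(tag.substr(start, dash - start));
        start = dash + 1;
    }
}

// Maps a deprecated language subtag and a deprecated region subtag to
// their modern values, leaving every other subtag as it was.
std::string modernizeSubtags(const std::string& tag) {
    std::vector<std::string> parts = splitOnHyphen(tag);
    parts[0] = preferredFor(parts[0], kLanguages);
    for (std::size_t i = 1; i < parts.size(); ++i) {
        if (parts[i].size() == 1) break;  // singleton: extension or private use
        if (parts[i].size() == 2) {       // simplified region test (assumption)
            parts[i] = preferredFor(parts[i], kRegions);
            break;
        }
    }
    std::string out = parts[0];
    for (std::size_t i = 1; i < parts.size(); ++i) {
        out += '-';
        out += parts[i];
    }
    return out;
}
```

Keeping languages and regions in separate tables avoids accidental cross-matches between two-letter language codes and two-letter region codes.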

Add a new test case for deprecated language and region mapping and a few more cases for updated grandfathered and redundant tag mapping.

[1] https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

Checklist
  • [x] Issue filed at https://unicode-org.atlassian.net : ICU-______
  • [x] Update PR title to include Issue number
  • [x] Issue accepted
  • [x] Tests included
  • [ ] Documentation is changed or added

This question comes from the open-source project: unicode-org/icu


18 answers

  • weixin_39818691 5 months ago

    It turned out that git misapplied part of my change at the wrong place during the rebase. It's fixed and all the tests should pass now.

  • weixin_39756273 5 months ago

    This test failure needs some further investigation:

    FAIL: en-GB-oed; got "en_GB_OXENDICT"; expected "en_GB=oed"

    The strings on the right hand side, "en_GB_OXENDICT" and "en_GB=oed" are the ICU internal representation (our legacy locale ID strings) and the updated mapping of the BCP-47 grandfathered language tags has caused this to change.

    Is that right? Should this change? Is "en_GB_OXENDICT" really a valid locale ID? Might there be code out there expecting "en_GB=oed" that now will fail?

    If this all is as it should be, then the expected string would just need to be updated in TestForLanguageTag() in icu4c/source/test/intltest/loctest.cpp.

  • weixin_39818691 5 months ago

    The test was written with the old data in mind. I just updated the expectation. The expected value should be en_GB_OXENDICT (en-GB-oxendict is the preferred value for en-GB-oed per the latest IANA language subtag registry). Before this PR, en-GB-oed was converted to en-GB-x-oed, leading to en_GB=oed after uloc_forLanguageTag.

    Remember comment on your PR adding that test ? :-)

  • weixin_39818691 5 months ago

    "en_GB_OXENDICT"

    I believe that's the way a BCP 47 variant subtag is handled by uloc_forLanguageTag.

  • weixin_39756273 5 months ago

    This question remains unanswered: Is the legacy ICU locale ID "en_GB_OXENDICT" both valid and the correct one to use for Oxford English?

  • weixin_39963255 5 months ago

    Since 2015, en-GB-oxendict is the proper replacement for en-GB-oed.

    See https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

    
    Type: grandfathered
    Tag: en-GB-oed
    Description: English, Oxford English Dictionary spelling
    Added: 2003-07-09
    Deprecated: 2015-04-17
    Preferred-Value: en-GB-oxendict
    
  • weixin_39818691 5 months ago

    This question remains unanswered: Is the legacy ICU locale ID "en_GB_OXENDICT" both valid and the correct one to use for Oxford English?

    The answer is positive. See how 'de-1901', 'de-Latn-1901' or 'de-DE-1901' is handled by uloc_forLanguageTag. '1901' is a variant subtag. So is 'oxendict' in en-GB-oxendict.
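To make the variant handling concrete, here is a simplified sketch of how a tag of the shape language[-region][-variant] could be turned into a legacy-style locale ID. It does not call ICU and is not ICU's algorithm: `toLegacyLocaleId` is a hypothetical helper, it ignores scripts, extensions, and multiple variants, and the doubled underscore for a missing region is an assumption based on the usual language_REGION_VARIANT layout.

```cpp
#include <cstddef>
#include <string>
#include <vector>

static std::string mapAscii(const std::string& s, bool toUpper) {
    std::string out(s);
    for (std::size_t i = 0; i < out.size(); ++i) {
        char c = out[i];
        if (toUpper && c >= 'a' && c <= 'z') c = static_cast<char>(c - 'a' + 'A');
        if (!toUpper && c >= 'A' && c <= 'Z') c = static_cast<char>(c - 'A' + 'a');
        out[i] = c;
    }
    return out;
}

// A region subtag is two letters or three digits.
static bool looksLikeRegion(const std::string& s) {
    if (s.size() == 2) {
        for (std::size_t i = 0; i < 2; ++i) {
            const char c = s[i];
            if (!((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))) return false;
        }
        return true;
    }
    if (s.size() == 3) {
        for (std::size_t i = 0; i < 3; ++i) {
            if (s[i] < '0' || s[i] > '9') return false;
        }
        return true;
    }
    return false;
}

// Handles only language[-region][-variant]; anything else is out of scope.
std::string toLegacyLocaleId(const std::string& tag) {
    std::vector<std::string> parts;
    std::size_t start = 0;
    for (;;) {
        const std::size_t dash = tag.find('-', start);
        if (dash == std::string::npos) {
            parts.push_back(tag.substr(start));
            break;
        }
        parts.push_back(tag.substr(start, dash - start));
        start = dash + 1;
    }

    std::size_t i = 1;
    std::string region;
    if (i < parts.size() && looksLikeRegion(parts[i])) {
        region = mapAscii(parts[i++], true);
    }
    std::string variant;
    if (i < parts.size()) variant = mapAscii(parts[i], true);

    std::string id = mapAscii(parts[0], false);
    // The region slot is kept (possibly empty) whenever a variant follows.
    if (!region.empty() || !variant.empty()) id += "_" + region;
    if (!variant.empty()) id += "_" + variant;
    return id;
}
```

Under these assumptions, 'oxendict' in en-GB-oxendict and '1901' in de-DE-1901 land in the same variant position of the resulting ID.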

    can chime in. :-)

  • weixin_39818691 5 months ago

    https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry has this:

    
    %%
    Type: variant
    Subtag: oxendict
    Description: Oxford English Dictionary spelling
    Added: 2015-04-17
    Prefix: en
    %%
    
  • weixin_39818691 5 months ago

    Another exhibit:

    uloc_toLanguageTag( uloc_forLanguageTag ("en-GB-oed"))

    gives back 'en-GB-oxendict'.

  • weixin_39756273 5 months ago

    I took a look at the CLDR data and the ICU data built from it, and what I found was that OXENDICT is registered there as a variant which means that "en_GB_OXENDICT" is the correct locale ID to use.

    A quick test with icu::Locale::getDisplayName() shows that this indeed is the locale ID that ICU itself understands to be Oxford English:

    
    Locale ID    : "en_GB_OXENDICT"
    Display Name : "English (United Kingdom, Oxford English Dictionary spelling)"
    
    Locale ID    : "en_GB=oed"
    Display Name : "English (United Kingdom, Private-Use=oed)"
    

    This means that ICU has been doing this wrong, converting the BCP-47 language tag "en-GB-oed" into an internal representation that it doesn't understand itself, for the past 8 years. I had not expected that.

  • weixin_39818691 5 months ago

    PTAL. Thanks

  • weixin_39818691 5 months ago

    , do you have any concern? If so, let me know. Otherwise, could you merge?
    approved it :-) so that he can merge it if we don't hear from you. Thanks !

    I'll follow up this PR with a PR for ICU4J.

  • weixin_39818691 5 months ago

    BTW, this PR does not handle the two variant subtags for 'hy' that were deprecated this year with preferred values. I just discovered this while looking at one of the tests in the Ecma 262 test suite that V8 fails. It's a bit tricky to handle correctly ('arevela' has to be dropped when the language subtag is 'hy'; 'arevmda' has to be dropped and 'hy' replaced by 'hyw' when the language subtag is 'hy'), but there might be a way (perhaps the way preeuro is handled could be reused?).

    
    %%
    Type: variant
    Subtag: arevela
    Description: Eastern Armenian
    Added: 2006-09-18
    Deprecated: 2018-03-24
    Preferred-Value: hy
    Prefix: hy
    %%
    Type: variant
    Subtag: arevmda
    Description: Western Armenian
    Added: 2006-09-18
    Deprecated: 2018-03-24
    Preferred-Value: hyw
    Prefix: hy
    %%
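The rule described for these two entries can be sketched as below. This is only an illustration of the intended result, not a proposal for how ICU should implement it: `replaceArmenianVariants` is a hypothetical name, and the sketch looks for the variants anywhere before the first singleton instead of validating full tag syntax.

```cpp
#include <cstddef>
#include <string>
#include <vector>

static std::string toLowerAscii(const std::string& s) {
    std::string out(s);
    for (std::size_t i = 0; i < out.size(); ++i) {
        if (out[i] >= 'A' && out[i] <= 'Z') {
            out[i] = static_cast<char>(out[i] - 'A' + 'a');
        }
    }
    return out;
}

// Drops 'arevela' under 'hy'; drops 'arevmda' and switches 'hy' to 'hyw'.
// Tags whose language subtag is not 'hy' are returned unchanged.
std::string replaceArmenianVariants(const std::string& tag) {
    std::vector<std::string> parts;
    std::size_t start = 0;
    for (;;) {
        const std::size_t dash = tag.find('-', start);
        if (dash == std::string::npos) {
            parts.push_back(tag.substr(start));
            break;
        }
        parts.push_back(tag.substr(start, dash - start));
        start = dash + 1;
    }
    if (toLowerAscii(parts[0]) != "hy") return tag;

    std::vector<std::string> kept(1, parts[0]);
    bool pastSingleton = false;  // extensions and private use are left alone
    bool western = false;
    for (std::size_t i = 1; i < parts.size(); ++i) {
        if (parts[i].size() == 1) pastSingleton = true;
        const std::string lower = toLowerAscii(parts[i]);
        if (!pastSingleton && lower == "arevela") continue;
        if (!pastSingleton && lower == "arevmda") {
            western = true;
            continue;
        }
        kept.push_back(parts[i]);
    }
    if (western) kept[0] = "hyw";

    std::string out = kept[0];
    for (std::size_t i = 1; i < kept.size(); ++i) {
        out += '-';
        out += kept[i];
    }
    return out;
}
```

The difference from a plain subtag-to-subtag table is visible here: the preferred value replaces the language subtag, so the variant has to be removed and the language rewritten in one step.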
    
  • weixin_39517054 5 months ago

    , do you have any concern? If so, let me know. Otherwise, could you merge?

    No. Looks fine to me. As long as you have a plan for porting to ICU4J soon (after ICU 63), please go ahead and merge these changes to ICU4C (for ICU 63). Thanks.

  • weixin_39818691 5 months ago

    Thanks. I don't have the privilege to press the merge button. Could you (or or ) press the button? Thanks :-)

  • weixin_39818691 5 months ago

    , can you take a look? Thanks !

    : FYI

  • weixin_39818691 5 months ago

    I'll make a separate PR for Java.

  • weixin_39818691 5 months ago

    Hold on. Rebasing made a new test fail. Let me fix that.
