Database Working Group

[db-wg] Internationalized domain names in the data base?

Ronald F. Guilmette

2019-11-02 21:55:50 CET

Sorry if I am interrupting any ongoing discussion,  but I just
have a quick and simple question...

Is it permitted to have internationalized domain names appear
within the database?

By that I really mean to ask if it is permissible to have there
appear in the data base IDNs which are written in their UTF-8
encoded forms, rather than, say, in punycode?

I have found at least one specific case where an IDN does appear
in the data base as a UTF-8 encoded string, but since I had
never seen that before, I just wanted to know if that was an
anomalous mistake or if it was considered normal, acceptable,
and routine.

Ed Shryane

2019-11-05 09:58:52 CET

RIPE NCC staff member

Hello Ronald,

> On 2 Nov 2019, at 21:55, Ronald F. Guilmette via db-wg <db-wg _at_ ripe _dot_ net> wrote:
> 
> Sorry if I am interrupting any ongoing discussion,  but I just
> have a quick and simple question...
> 
> Is it permitted to have internationalized domain names appear
> within the database?
> 

Currently the RIPE database supports the Latin-1 (ISO-8859-1) character set only.

There was previous discussion in April - May 2015 to support UTF-8:

https://www.ripe.net/ripe/mail/archives/db-wg/2015-April/004516.html
https://www.ripe.net/ripe/mail/archives/db-wg/2015-May/004542.html

The proposal was to allow UTF-8 in free-text attributes, except for primary keys.

> By that I really mean to ask if it is permissible to have there
> appear in the data base IDNs which are written in their UTF-8
> encoded forms, rather than, say, in punycode?
> 

The RIPE database only contains reverse domain objects (i.e. to register reverse delegations).


> I have found at least one specific case where an IDN does appear
> in the data base as a UTF-8 encoded string, but since I had
> never seen that before, I just wanted to know if that was an
> anomalous mistake or if it was considered normal, acceptable,
> and routine.
> 


Please let me know in which object you found this. The DB team spent some effort recently improving (non-) Latin-1 character handling (in updates and queries), there shouldn't be any non Latin-1 characters remaining.

Regards
Ed Shryane
RIPE NCC



Ronald F. Guilmette

2019-11-06 00:13:21 CET

In message <B335DD85-CED0-41A3-A504-E0A7E6E41D2B _at_ ripe _dot_ net>, 
Edward Shryane <eshryane _at_ ripe _dot_ net> wrote:

>> Is it permitted to have internationalized domain names appear
>> within the database?
>
>Currently the RIPE database supports the Latin-1 (ISO-8859-1) character
>set only.

Yes.  Please forgive me.  I asked the Wrong Question entirely.  See below.

>> I have found at least one specific case where an IDN does appear
>> in the data base as a UTF-8 encoded string, but since I had
>> never seen that before, I just wanted to know if that was an
>> anomalous mistake or if it was considered normal, acceptable,
>> and routine.

Mea culpa!  I misspoke.

What I found was *not* an internationalized domain name, per se.  Well,
maybe it was/is and maybe it wasn't/isn't.  I'll let you all decide,
and then you can tell me if I have used improper terminology to
describe what I found.

The issue came up as I was performing some automated processing relating
to certain abuse contact email addresses relating to certain RIPE ASNs.

More specifically, one of my automated tools got rather badly confused
by the abuse reporting addresses for AS5464 and AS42486, both of which
consist of the email address:

    abuse@zürich.email

The domain name portion of this address may or may not be a proper sort
of internationalized domain name.  I am frankly not sure about that now,
one way or the other.  I just saw a character that was not a traditional
7-bit ASCII character and then I improperly leapt to the conclusion that
this must be one of those internationalized domain names that have bedeviled
some of my other home-brew tools in the past.

The problem, of course, is that one lower-case letter "u" with the associated
umlaut above it.  On my system here, the "od -c" command indicates that this
one character is encoded NOT as any kind of UTF-8 sequence, but rather that
it is simply encoded as a single byte with the value 374 (octal).
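The byte value described above can be reproduced with a short Python sketch. This is only an illustration of the two encodings of the one character in question; the variable names are made up for the example.

```python
# How the single character "u with umlaut" (U+00FC) is encoded
# under ISO-8859-1 (Latin-1) and under UTF-8.
ch = "\u00fc"

latin1 = ch.encode("latin-1")  # a single byte
utf8 = ch.encode("utf-8")      # a two-byte sequence

# Show the Latin-1 byte in octal, the notation "od -c" uses for
# bytes it cannot print.
print(latin1, oct(latin1[0]))
print(utf8)
```

The Latin-1 form is the single byte 0xFC, which is 374 in octal, whereas a UTF-8 encoded string would carry the two bytes 0xC3 0xBC in the same position.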

As I now know, that byte value, when construed in accordance with ISO-8859-1,
does in fact represent a lower-case "u" with an umlaut.  So at least in
this limited sense I now know what the person who put that domain name
into the data base had intended.  However I am not yet persuaded that
simply using ISO-8859-1 encoding was either the best choice or even an
entirely appropriate choice in this instance.  It was certainly convenient
for the writer that a lower-case "u" with an umlaut could be represented
within ISO-8859-1, thus making it unnecessary to resort to UTF-8 in this
particular instance, but it does cause me to wonder a bit about what may
transpire on the day when some RIPE member finds it appropriate and
necessary to add to the data base some contact email address consisting
in part of an IDN, where said IDN is, in its native form, something in
Arabic, Farsi, Hebrew or Chinese.

For my own part, I am merely an out-of-date and ancient relic of a happier
and simpler time, here in the United States, when 7-bit ASCII was sufficient
for anything and everything.  As such, I cannot help but long for a return
to that level of simplicity, parochial as it might be.  But since that is
not going to happen anytime soon, I can only hope that RIPE and other
regions will come to some agreement regarding the proper representation of
IDNs within their respective data bases.  If ISO-8859-1 is the standard
chosen, I will certainly adjust my tools accordingly.  If however some
other standard is set, then I merely hope that I will be on the circulation
list when that memo is issued.


Regards,
rfg


P.S.  Not that anybody should really care, but for this one lone researcher
it would be maximally convenient if all domain names represented within the
data base were encoded as punycode, where necessary.  In fact, it is my
belief that 99.99% of them already are, which thus renders the "transition"
to that standard essentially pain free.

Ed Shryane

2019-11-06 16:04:11 CET

RIPE NCC staff member

Hello Ronald, DB-WG,

> On 6 Nov 2019, at 00:13, Ronald F. Guilmette via db-wg <db-wg _at_ ripe _dot_ net> wrote:
> 
>>> I have found at least one specific case where an IDN does appear
>>> in the data base as a UTF-8 encoded string, but since I had
>>> never seen that before, I just wanted to know if that was an
>>> anomalous mistake or if it was considered normal, acceptable,
>>> and routine.
> 
> Mea culpa!  I misspoke.
> 

Thanks for clarifying!

> What I found was *not* an internationalized domain name, per se.  Well,
> maybe it was/is and maybe it wasn't/isn't.  I'll let you all decide,
> and then you can tell me if I have used improper terminology to
> describe what I found.
> 

The email address you found is the only IDN (i.e. non-ASCII) email address in the RIPE database (so far).

It's currently considered a valid value in the RIPE database, as it's composed of Latin-1 characters, and the attribute syntax check passes.

There is also an MX record for the domain (although the host dc-eb0309b6496a.xn--zrich-kva.email is currently unreachable for me).

However, it may cause inter-operability issues, as the sending mail server needs to handle IDN addresses correctly.

DB-WG: should we allow non-ASCII addresses in the RIPE database?

> P.S.  Not that anybody should really care, but for this one lone researcher
> it would be maximally convenient if all domain names represented within the
> data base were encoded as punycode, where necessary.  In fact, it is my
> belief that 99.99% of them already are, which thus renders the "transition"
> to that standard essentially pain free.
> 

DB-WG: is punycode for domain names a viable alternative for encoding non-ASCII email addresses?

For example, the punycode equivalent abuse@xn--zrich-kva.email is already a valid value for the e-mail (or abuse-c) attribute.
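As a rough sketch of what such a conversion could look like, the following Python function rewrites only the domain part of an address into its punycode form. The function name `to_ascii_address` is hypothetical, and the sketch relies on Python's built-in `idna` codec, which implements the older IDNA 2003 rules (RFC 3490); it does not describe how Whois itself would do this.

```python
def to_ascii_address(address: str) -> str:
    """Return the address with its domain part in punycode (ASCII) form.

    The local part (before the last "@") is left untouched.
    """
    local, sep, domain = address.rpartition("@")
    if not sep:
        raise ValueError(f"not an email address: {address!r}")
    # Python's built-in "idna" codec converts each label of the domain
    # name, adding the "xn--" prefix where a label is non-ASCII.
    ascii_domain = domain.encode("idna").decode("ascii")
    return f"{local}@{ascii_domain}"


print(to_ascii_address("abuse@z\u00fcrich.email"))
```

For the address discussed in this thread this prints abuse@xn--zrich-kva.email, and an address that is already plain ASCII passes through unchanged.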

Regards
Ed Shryane
RIPE NCC



Piotr Strzyżewski

2019-11-06 21:17:42 CET

On Wed, Nov 06, 2019 at 04:04:11PM +0100, Edward Shryane via db-wg wrote:

Hi!

> DB-WG: should we allow non-ASCII addresses in the RIPE database?

Do you mean email addresses or street addresses as well?

> > P.S.  Not that anybody should really care, but for this one lone researcher
> > it would be maximally convenient if all domain names represented within the
> > data base were encoded as punycode, where necessary.  In fact, it is my
> > belief that 99.99% of them already are, which thus renders the "transition"
> > to that standard essentially pain free.
> > 
> 
> DB-WG: is punycode for domain names a viable alternative for encoding non-ASCII email addresses?

Works for me.

> For example, the punycode equivalent abuse@xn--zrich-kva.email is already a valid value for the e-mail (or abuse-c) attribute.

-- 
Piotr Strzyżewski

Ed Shryane

2019-11-06 21:47:33 CET

RIPE NCC staff member

Hi Piotr, DB-WG,

> On 6 Nov 2019, at 21:17, Piotr Strzyzewski <Piotr.Strzyzewski _at_ polsl _dot_ pl> wrote:
> 
>> DB-WG: should we allow non-ASCII addresses in the RIPE database?
> 
> Do you mean email addresses or street addresses as well?
> 

I mean: do we continue to allow non-ASCII (i.e. Latin-1 encoded) IDN email addresses, such as the example mentioned? Or do we automatically encode non-ASCII characters as punycode?

>>> P.S.  Not that anybody should really care, but for this one lone researcher
>>> it would be maximally convenient if all domain names represented within the
>>> data base were encoded as punycode, where necessary.  In fact, it is my
>>> belief that 99.99% of them already are, which thus renders the "transition"
>>> to that standard essentially pain free.
>>> 
>> 
>> DB-WG: is punycode for domain names a viable alternative for encoding non-ASCII email addresses?
> 
> Works for me.

We can add explicit support for the punycode format, and allow (full) IDN email addresses to be used (as this syntax should be interchangeable with the normal form).

Whois could automatically translate to and from the punycode format, if an IDN format address is encountered.

> 
>> For example, the punycode equivalent abuse@xn--zrich-kva.email is already a valid value for the e-mail (or abuse-c) attribute.
> 
> -- 
> Piotr Strzyżewski

Regards
Ed Shryane
RIPE NCC


Piotr Strzyżewski

2019-11-06 22:05:54 CET

On Wed, Nov 06, 2019 at 09:47:33PM +0100, Edward Shryane wrote:

Hi Edward, DB-WG,

> > On 6 Nov 2019, at 21:17, Piotr Strzyzewski <Piotr.Strzyzewski _at_ polsl _dot_ pl> wrote:
> > 
> >> DB-WG: should we allow non-ASCII addresses in the RIPE database?
> > 
> > Do you mean email addresses or street addresses as well?
> > 
> 
> I mean: do we continue to allow non-ASCII (i.e. Latin-1 encoded) IDN email addresses, such as the example mentioned? Or do we automatically encode non-ASCII characters as punycode?

I do not object to having properly coded non-ASCII email addresses in the
database. 

> >>> P.S.  Not that anybody should really care, but for this one lone researcher
> >>> it would be maximally convenient if all domain names represented within the
> >>> data base were encoded as punycode, where necessary.  In fact, it is my
> >>> belief that 99.99% of them already are, which thus renders the "transition"
> >>> to that standard essentially pain free.
> >>> 
> >> 
> >> DB-WG: is punycode for domain names a viable alternative for encoding non-ASCII email addresses?
> > 
> > Works for me.
> 
> We can add explicit support for the punycode format, and allow (full) IDN email addresses to be used (as this syntax should be interchangeable with the normal form).
> 
> Whois could automatically translate to and from the punycode format, if an IDN format address is encountered.
> 
> > 
> >> For example, the punycode equivalent abuse@xn--zrich-kva.email is already a valid value for the e-mail (or abuse-c) attribute.

And what about RDAP?

Piotr

-- 
Piotr Strzyżewski

Ronald F. Guilmette

2019-11-06 22:22:31 CET

In message <DBB71EC8-7564-4AAB-B490-5A894B39AF72 _at_ ripe _dot_ net>, 
Edward Shryane <eshryane _at_ ripe _dot_ net> wrote:

>> What I found was *not* an internationalized domain name, per se.  Well,
>> maybe it was/is and maybe it wasn't/isn't.  I'll let you all decide,
>> and then you can tell me if I have used improper terminology to
>> describe what I found.
>
>The email address you found is the only IDN (i.e. non-ASCII) email
>address in the RIPE database (so far).

What I found is definitely *not* "US-ASCII" i.e. 7-bit ASCII.  It is
a separate question as to whether or not what I found qualifies,
properly, under the relevant RFCs, as being a proper sort of a
representation of an "IDN".  (I suspect it does not.)

The relevant current RFCs appear to be RFC5890 and possibly RFC5891,
RFC5892, and RFC5894, but I'm sorry to say that each of these is rather
complex, and I do not have time available right now to dredge into
them and learn the real current rules.  All I can say is that a brief
glance at these RFCs seems to indicate that RFC5892 is the most
directly relevant, and that RFC5892 appears to say that Unicode
must be used for representation of IDNs.

The domain name I found *is* ISO-8859-1 (Latin-1) but does not appear
to me to be Unicode.
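One way a tool might tell these cases apart is sketched below in Python. The function name `guess_encoding` is hypothetical and the approach is only a heuristic: every byte string decodes successfully as Latin-1, so "latin-1" here really means "8-bit data that is not valid UTF-8".

```python
def guess_encoding(raw: bytes) -> str:
    """Heuristically classify raw bytes as ascii, utf-8 or latin-1."""
    if raw.isascii():
        return "ascii"
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Assumption: anything 8-bit that is not valid UTF-8 is
        # treated as ISO-8859-1, the only other encoding in play here.
        return "latin-1"


# The address as found in the data base: a lone 0xFC byte.
print(guess_encoding(b"abuse@z\xfcrich.email"))
# The same address if it had been stored as UTF-8.
print(guess_encoding("abuse@z\u00fcrich.email".encode("utf-8")))
# The punycode form is plain 7-bit ASCII.
print(guess_encoding(b"abuse@xn--zrich-kva.email"))
```

A lone 0xFC byte followed by an ASCII letter is not a valid UTF-8 sequence, which is why the first case falls through to the Latin-1 branch.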

>It's currently considered a valid value in the RIPE database, as it's
>composed of Latin-1 characters, and the attribute syntax check passes.

Yes.

>There is also an MX record for the domain (although the host
>dc-eb0309b6496a.xn--zrich-kva.email is currently unreachable for me).
>
>However, it may cause inter-operability issues, as the sending mail
>server needs to handle IDN addresses correctly.

Yes.

>DB-WG: should we allow non-ASCII addresses in the RIPE database?

More precisely, the question should be, I think:  (a) Should characters
that are non-US-ASCII be allowed in the data base generally, and
separately (b) how should IDNs be represented in the data base?

>DB-WG: is punycode for domain names a viable alternative for encoding
>non-ASCII email addresses?

I think that in order to be comprehensive, domain names appearing in the
data base *must* be encoded *either* as punycode *or* else as UTF-8.
I don't believe that ISO-8859-1 (Latin-1) will be able to do the job
entirely, but the other two options will.

>For example, the punycode equivalent abuse@xn--zrich-kva.email is
>already a valid value for the e-mail (or abuse-c) attribute.

Yes, and the same can be said generally. i.e. the (punycoded) domain name
xn--zrich-kva.email is in all respects a substitute for its Unicode
equivalent.  Thus, xn--zrich-kva.email may be used, for example, as
the argument to the "dig" command, and/or in all other contexts where
a fully qualified domain name may be used.

Regards,
rfg

Ronald F. Guilmette

2019-11-06 23:35:58 CET

In message <1AA95AD8-3729-4BBB-A921-1535429A9658 _at_ ripe _dot_ net>, 
Edward Shryane <eshryane _at_ ripe _dot_ net> wrote:

>>> DB-WG: should we allow non-ASCII addresses in the RIPE database?
>>
>> Do you mean email addresses or street addresses as well?
>>
>
>I mean: do we continue to allow non-ASCII (i.e. Latin-1 encoded) IDN email
>addresses, such as the example mentioned? Or do we automatically encode
>non-ASCII characters as punycode?

I want to be crystal clear here.  Street addresses, person names, city
names, or any other data value (except for ASNs, IP addresses, ISO 3166
country codes, and domain names) that are encoded in full 8-bit ISO-8859-1
within the data base do not present any terrific problems for me personally
because, generally speaking, I don't anticipate that I will ever be trying
to parse those person names, street names, city names, etc.  I will just
use them "as is" and in whatever encoding they happen to be in when I
receive them.

Quite certainly, within the RIPE region there are billions upon billions
of person names, street names, and city names that cannot be accurately
represented in US-ASCII, nor even, I must note, in ISO-8859-1.  (I am
thinking of your fellow RIPE members in places where Cyrillic is used,
and also your fellow RIPE members in Israel and elsewhere.)

In ancient times (e.g. prior to the issuance of, for example, RFC3490
in March, 2003) 7-bit US-ASCII was used fairly exclusively within the
data bases of all of the Regional Internet Registries.  And I, for one,
am greatly appreciative of all of the effort and contortions, over so
many years, that so many people have gone through in order to try, as
best as they could, to anglicize person, street, and city names, especially
those that were not really amenable to that process, and to convert them
all into some 7-bit ASCII approximation of the actual "native" strings.
Even though this conversion process has often rendered the resulting
anglicized versions substantially inaccurate, it has served to keep
processing code simple, at least up until now.

Now however I see that 8-bit ISO-8859-1 encodings are creeping in, at
least to the RIPE data base.  I am torn by this.  On the one hand this
new development augurs a sea change which will likely end by complicating
a lot of tools, and not only my own.  On the other hand, the benefits are
clear; more accurate representations of person, street, and city names
within the data base... BUT still quite limited to names that can be
accurately represented within ISO-8859-1, a character set which excludes
some very large swaths of RIPE territory.

Even at the risk of making my own life more complicated, I have to say
that I personally place a higher value on accuracy than I do on simplicity.
For this reason, it is my feeling that the data base should evolve in
the direction of UTF-8 and *not* in the rather different and far more
limiting direction of ISO-8859-1.

That having been said however, domain names are a really very special
and different concern.  I personally am not aware of any standard which
suggests that domain names should ever be written in ISO-8859-1.  Rather,
for domain names, the available choices of representation seem to be either
(a) 7-bit US-ASCII or else (b) punycode (RFC3492) or else (c) UTF-8.

Obviously, 7-bit US-ASCII is really no longer an option, and hasn't been
ever since the publication of RFC3490 in 2003.  At the present moment,
punycode can be used, and can represent all domain names with 100%
accuracy, even while allowing the evolution of the encoding of other
data base fields to proceed and to be debated independently.

The bottom line is that in the short term, and for the immediate future,
I believe that there is no other sensible choice except to decree that
all domain names within the data base shall be represented in punycode
form.


>Whois could automatically translate to and from the punycode format, if
>an IDN format address is encountered.

Yes, but please just leave this to the WHOIS *client* to handle.  It is
less desirable, I think, to perform this conversion on the server side.


Regards,
rfg

Ronald F. Guilmette

2019-11-06 23:46:13 CET

In message <20191106210554.GB5460 _at_ hydra.ck.polsl _dot_ pl>, 
Piotr Strzyzewski <Piotr.Strzyzewski _at_ polsl _dot_ pl> wrote:

>I do not object to having properly coded non-ASCII email addresses in the
>database. 

First, just to be clear, we are really only discussing the representation
of domain names within the data base.  Of course, any email address contains
one of those, but we are specifically -not- discussing the representation
of the user-ID portion of any email address in the data base.

Second, it is nice that you are OK with "properly coded non-ASCII" domain
names in the data base.  So am I.  That's not the question.  The question
is how should IDNs be -represented- within the data base.

As I have stated, it is my opinion that the only two viable options at
the present time are either (a) punycode or else (b) UTF-8.

ISO-8859-1 is not, as far as I know, a standardized or appropriate way
of encoding IDNs in any context.  If I am wrong about that, then please
do correct me and please do point me at the RFC which states otherwise.


Regards,
rfg

Ed Shryane

2019-11-08 11:10:10 CET

RIPE NCC staff member

Hello Ronald, DB-WG,

> On 6 Nov 2019, at 23:46, Ronald F. Guilmette via db-wg <db-wg _at_ ripe _dot_ net> wrote:
> 
> In message <20191106210554.GB5460 _at_ hydra.ck.polsl _dot_ pl>, 
> Piotr Strzyzewski <Piotr.Strzyzewski _at_ polsl _dot_ pl> wrote:
> 
>> I do not object to having properly coded non-ASCII email addresses in the
>> database. 
> 
> First, just to be clear, we are really only discussing the representation
> of domain names within the data base.  Of course, any email address contains
> one of those, but we are specifically -not- discussing the representation
> of the user-ID portion of any email address in the data base.
> 

Understood. The user-ID (local) portion of an email address is not affected, only the domain.

> Second, it is nice that you are OK with "properly coded non-ASCII" domain
> names in the data base.  So am I.  That's not the question.  The question
> is how should IDNs be -represented- within the data base.
> 
> As I have stated, it is my opinion that the only two viable options at
> the present time are either (a) punycode or else (b) UTF-8.

DB-WG:

- if (a), should the RIPE database automatically convert IDN domain names in email addresses into punycode?
- or if (b), should the RIPE database support UTF-8 for the domain part of IDN email addresses? This is technically possible on the Whois server side, but it's a large change for clients.

> 
> ISO-8859-1 is not, as far as I know, a standardized or appropriate way
> of encoding IDNs in any context.  If I am wrong about that, then please
> do correct me and please do point me at the RFC which states otherwise.
> 

Using ISO-8859-1 to encode IDN email addresses in the RIPE database does cause some issues:

- Only a small subset of the Unicode character set is supported; characters outside ISO-8859-1 are substituted with a '?' on Whois update.
- ISO-8859-1 encoded email addresses may not be handled properly by Whois clients or mail servers.
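The substitution in the first point can be imitated in Python, whose `errors="replace"` mode for `str.encode` happens to use the same '?' placeholder. This is only an analogy for illustration, not a description of the Whois implementation, and the sample strings are arbitrary.

```python
# A place name with two characters outside ISO-8859-1 (U+0141 and
# U+017A) and one inside it (U+00F3).
city = "\u0141\u00f3d\u017a"

# Characters that Latin-1 cannot represent are replaced with "?".
stored = city.encode("latin-1", errors="replace")
print(stored)

# A name that fits entirely in Latin-1 survives unchanged.
print("z\u00fcrich".encode("latin-1", errors="replace"))
```

The first result is b'?\xf3d?': the information in the replaced characters is lost and cannot be recovered from the stored value.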

> 
> Regards,
> rfg
> 

Regards
Ed Shryane
RIPE NCC


Nick Hilliard

2019-11-08 11:19:02 CET

Edward Shryane via db-wg wrote on 08/11/2019 10:10:
> - if (a), should the RIPE database automatically convert IDN domain
> names in email addresses into punycode?

Where though?  Only in fields which contain email addresses?  Or 
free-text fields too?

Nick


Ed Shryane

2019-11-08 11:25:56 CET

RIPE NCC staff member

Hi Nick,

> On 8 Nov 2019, at 11:19, Nick Hilliard <nick _at_ foobar _dot_ org> wrote:
> 
> Edward Shryane via db-wg wrote on 08/11/2019 10:10:
>> - if (a), should the RIPE database automatically convert IDN domain
>> names in email addresses into punycode?
> 
> Where though?  Only in fields which contain email addresses?  Or free-text fields too?
> 

I suggest only in fields which contain email addresses: upd-to, mnt-nfy, notify, e-mail.
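A minimal sketch of that scoping, in Python, might look as follows. The function name `punycode_email_attributes` and the sample object are hypothetical, the parsing is deliberately simplified (no continuation lines, no end-of-line comments), and the conversion uses Python's built-in `idna` codec (IDNA 2003) rather than anything Whois-specific.

```python
# The attributes named above; all other attributes are left alone.
EMAIL_ATTRIBUTES = {"upd-to", "mnt-nfy", "notify", "e-mail"}


def punycode_email_attributes(rpsl: str) -> str:
    """Convert the domain part of email attributes to punycode."""
    out = []
    for line in rpsl.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() in EMAIL_ATTRIBUTES and "@" in value:
            local, _, domain = value.strip().rpartition("@")
            ascii_domain = domain.encode("idna").decode("ascii")
            # Preserve the whitespace between the colon and the value.
            padding = value[: len(value) - len(value.lstrip())]
            line = f"{key}:{padding}{local}@{ascii_domain}"
        out.append(line)
    return "\n".join(out)


sample = (
    "role:           Example Abuse Role\n"
    "e-mail:         abuse@z\u00fcrich.email\n"
    "remarks:        free text is left untouched"
)
print(punycode_email_attributes(sample))
```

Only the e-mail line changes (to abuse@xn--zrich-kva.email); the role and remarks lines pass through exactly as submitted.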


> Nick
> 

Regards
Ed Shryane
RIPE NCC


Piotr Strzyżewski

2019-11-08 16:02:27 CET

On Wed, Nov 06, 2019 at 02:46:13PM -0800, Ronald F. Guilmette via db-wg wrote:
> In message <20191106210554.GB5460 _at_ hydra.ck.polsl _dot_ pl>, 
> Piotr Strzyzewski <Piotr.Strzyzewski _at_ polsl _dot_ pl> wrote:
> 
> >I do not object to having properly coded non-ASCII email addresses in the
> >database. 
> 
> First, just to be clear, we are really only discussing the representation
> of domain names within the data base.  Of course, any email address contains
> one of those, but we are specifically -not- discussing the representation
> of the user-ID portion of any email address in the data base.
> 
> Second, it is nice that you are OK with "properly coded non-ASCII" domain
> names in the data base.  So am I.  That's not the question.  The question
> is how should IDNs be -represented- within the data base.

Properly. As I said. The specification for that is in the relevant RFC.
I do not see any reason for not allowing people to use proper email
addresses.

-- 
Piotr Strzyżewski

Piotr Strzyżewski

2019-11-08 16:06:10 CET

On Fri, Nov 08, 2019 at 11:10:10AM +0100, Edward Shryane via db-wg wrote:
> DB-WG:
> 
> - if (a), should the RIPE database automatically convert IDN domain names in email addresses into punycode?

That is just a workaround for the general problem of having UTF8 in the
DB.

> - or if (b), should the RIPE database support UTF-8 for the domain part of IDN email addresses? This is technically possible on the Whois server side, but it's a large change for clients.

We should make up our minds about UTF8. One way or another.

-- 
Piotr Strzyżewski

Ronald F. Guilmette

2019-11-08 23:23:09 CET

In message <0A69C7DE-D5E2-4B95-9643-82103F87B92B _at_ ripe _dot_ net>, 
Edward Shryane <eshryane _at_ ripe _dot_ net> wrote:

>Using ISO-8859-1 to encode IDN email addresses in the RIPE database does
>cause some issues:

We agree on that point, 100%.

>- Only a small subset of the Unicode character set is supported;
>characters outside ISO-8859-1 are substituted with a '?' on Whois
>update.

Yes.  And this is really rather entirely sub-optimal.

>- ISO-8859-1 encoded email addresses may not be handled properly by
>Whois clients or mail servers.

I personally am not too concerned about WHOIS client tools.  They can
adapt or die. :-)

It is certainly the case however that most or all existing WHOIS clients
do not contain any UTF-8 decoding logic, and that they thus will display
only 7-bit US-ASCII or, in some cases, that and also ISO-8859-1 encoded
single byte characters.

For all of these existing clients & tools it would be maximally convenient
to be able to cut-and-paste email addresses out of the WHOIS data base
records, as these tools render them, and directly into mail clients.
Either a UTF-8 encoding or a punycode encoding (of domain name) -might-
possibly work for that.

I personally prefer punycode because it is effectively the lowest common
denominator.  It does not force WHOIS clients or tools to support anything
beyond simple and primitive 7-bit US-ASCII, and yet it can still express
100% of all modern IDNs.
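To illustrate the point, the Python sketch below encodes a few names from different scripts and checks that the result is pure 7-bit ASCII and converts back without loss. The sample names (other than the one from this thread) are arbitrary, and Python's built-in `idna` codec follows the older IDNA 2003 rules, so results for some characters may differ from the RFC 5890 series.

```python
# Latin with an umlaut, Cyrillic, and Chinese labels.
names = [
    "z\u00fcrich.email",
    "\u043f\u0440\u0438\u043c\u0435\u0440.example",
    "\u4e2d\u6587.example",
]

# Map each Unicode name to its punycode (ASCII) byte form.
encoded = {name: name.encode("idna") for name in names}

for name, ascii_form in encoded.items():
    # Every encoded name is printable 7-bit ASCII, so any WHOIS
    # client or terminal can display it unchanged.
    print(ascii_form.decode("ascii"))
```

A tool that wants to show the native form can call `.decode("idna")` on the stored value, while tools that only understand ASCII need no changes at all.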


Regards,
rfg