Why we normalize RDAP responses

Two correct RDAP clients return different strings for the same domain. One returns GOOGLE.COM, the other google.com. One returns ns1.google.com., the other ns1.google.com. One returns ["Server Transfer Prohibited"], the other ["server transfer prohibited"]. The DNS zone is the same. The bytes are not.

The moment you dedupe a cache by domain name, compare two abuse contacts across registrars, or sort domains by expiration, that difference becomes a bug. This post shows one domain in raw form and in our normalized form, then walks through the rules behind every change.

One domain, two responses

A real Verisign response for google.com, condensed to the fields you would actually read. (The full response also ships ICANN notices, a terms-of-service block, and rdapConformance metadata that we omit here.)

{
  "handle": "2138514_DOMAIN_COM-VRSN",
  "ldhName": "GOOGLE.COM",
  "status": [
    "client delete prohibited", "client transfer prohibited", "client update prohibited",
    "server delete prohibited", "server transfer prohibited", "server update prohibited"
  ],
  "nameservers": [
    { "ldhName": "NS1.GOOGLE.COM" },
    { "ldhName": "NS2.GOOGLE.COM" },
    { "ldhName": "NS3.GOOGLE.COM" },
    { "ldhName": "NS4.GOOGLE.COM" }
  ],
  "events": [
    { "eventAction": "registration", "eventDate": "1997-09-15T04:00:00Z" },
    { "eventAction": "expiration",   "eventDate": "2028-09-14T04:00:00Z" },
    { "eventAction": "last changed", "eventDate": "2019-09-09T15:39:04Z" }
  ],
  "entities": [{
    "handle": "292",
    "roles": ["registrar"],
    "links": [
      { "href": "http://www.markmonitor.com", "type": "text/html", "rel": "about" }
    ],
    "publicIds": [{ "type": "IANA Registrar ID", "identifier": "292" }],
    "vcardArray": ["vcard", [
      ["version", {}, "text", "4.0"],
      ["fn", {}, "text", "MarkMonitor Inc."]
    ]],
    "entities": [{
      "roles": ["abuse"],
      "vcardArray": ["vcard", [
        ["version", {}, "text", "4.0"],
        ["fn", {}, "text", ""],
        ["tel", {"type": "voice"}, "uri", "tel:+1.2086851750"],
        ["email", {}, "text", "[email protected]"]
      ]]
    }]
  }]
}

The same domain through our API:

{
  "domain": "google.com",
  "handle": "2138514_DOMAIN_COM-VRSN",
  "status": [
    "client delete prohibited", "client transfer prohibited", "client update prohibited",
    "server delete prohibited", "server transfer prohibited", "server update prohibited"
  ],
  "nameservers": ["ns1.google.com", "ns2.google.com", "ns3.google.com", "ns4.google.com"],
  "dates": {
    "registered": "1997-09-15T04:00:00Z",
    "expires":    "2028-09-14T04:00:00Z",
    "updated":    "2019-09-09T15:39:04Z"
  },
  "registrar": {
    "name": "MarkMonitor Inc.",
    "iana_id": "292",
    "abuse_email": "[email protected]",
    "abuse_phone": "+12086851750",
    "url": "http://www.markmonitor.com"
  }
}

The rest of the post explains each change.

Hostnames: case, trailing dots, stuffed annotations

The GOOGLE.COM → google.com and NS1.GOOGLE.COM → ns1.google.com changes above came from one rule applied to three different registry quirks.

Verisign (.com) returns uppercase. GOOGLE.COM and google.com point to the same DNS zone, but "GOOGLE.COM" === "google.com" is false. .org, .de, .co.uk, and .fi all return lowercase; .com is the registry where this matters.

DENIC (.de) and Nominet (.co.uk) add a trailing dot:

"nameservers": [
  { "ldhName": "ns1.google.com." },
  { "ldhName": "ns2.google.com." },
  { "ldhName": "ns3.google.com." },
  { "ldhName": "ns4.google.com." }
]

The trailing dot is valid DNS notation, but most clients strip it before storage. Equality breaks the moment one client strips it and another does not.

Traficom (.fi) puts status, IP addresses, and free text inside the hostname field:

ns1.hostingpalvelu.fi [31.217.192.71] [ok]
ns1.cloudcity.fi [185.220.76.33] [2a0b:f240:0:3::33] [ok]
a.dns.gandi.net [technical check not done]

RFC 9083 §5.2 says ldhName must be a valid LDH (letter, digit, hyphen) name. Status belongs in a separate status array. The Finnish registry ignores both.

Our rule: take the first whitespace-separated token, lowercase it, drop a trailing dot, reject anything that does not match ^[a-z0-9.-]+$. Applied to the three rows above:

ns1.hostingpalvelu.fi [31.217.192.71] [ok]                → ns1.hostingpalvelu.fi
ns1.cloudcity.fi [185.220.76.33] [2a0b:f240:0:3::33] [ok] → ns1.cloudcity.fi
a.dns.gandi.net [technical check not done]                → a.dns.gandi.net

The per-nameserver DNS check status that .fi smuggled inside the name is dropped. If you need it, query the nameserver object directly and read its status array, which is where the spec puts it.
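The rule above is short enough to sketch in a few lines. This is a minimal illustration of the steps as described, not our production code; the function name `normalize_hostname` is made up for the example.

```python
import re
from typing import Optional

# Same character class as the rule in the text: lowercase LDH labels plus dots.
_HOSTNAME_RE = re.compile(r"[a-z0-9.-]+")


def normalize_hostname(raw: str) -> Optional[str]:
    """Return a canonical hostname, or None when the value is rejected."""
    tokens = raw.split()          # split on any run of whitespace
    if not tokens:
        return None               # empty or whitespace-only input
    host = tokens[0].lower()      # first token only, lowercased
    if host.endswith("."):
        host = host[:-1]          # drop a single trailing dot
    if not _HOSTNAME_RE.fullmatch(host):
        return None               # e.g. "[ok]" or anything with stray characters
    return host


print(normalize_hostname("NS1.GOOGLE.COM"))
print(normalize_hostname("ns1.google.com."))
print(normalize_hostname("ns1.hostingpalvelu.fi [31.217.192.71] [ok]"))
```

All three calls yield a lowercase name with no trailing dot and no annotations, so the results can be compared with plain string equality.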

Dates: positional with garbage values

The events array above became three named fields (dates.registered, dates.expires, dates.updated). Two normalizations happen in the same loop.

The first is structural. Every consumer of RDAP writes the same loop to find expiration: iterate events, match eventAction === "expiration", read eventDate. We do it once. If a registry does not publish one of the three named events (DENIC, for example, ships only last changed), the field is null. Extra event types some registries include (reregistration, last update of RDAP database) are not exposed in the normalized output.
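The structural half of that loop might look like the sketch below. The names `EVENT_FIELDS` and `extract_dates` are hypothetical, and keeping the first occurrence when an event type is repeated is an assumption of this example. Date values are copied verbatim here.

```python
from typing import Dict, List, Optional

# RDAP eventAction -> field name in the normalized "dates" object.
EVENT_FIELDS = {
    "registration": "registered",
    "expiration": "expires",
    "last changed": "updated",
}


def extract_dates(events: Optional[List[dict]]) -> Dict[str, Optional[str]]:
    """Turn the positional events array into three named fields."""
    dates: Dict[str, Optional[str]] = {name: None for name in EVENT_FIELDS.values()}
    for event in events or []:
        field = EVENT_FIELDS.get(event.get("eventAction"))
        if field is None:
            continue                      # e.g. "reregistration": not exposed
        if dates[field] is None:          # assumption: first occurrence wins
            dates[field] = event.get("eventDate")
    return dates


only_last_changed = [{"eventAction": "last changed", "eventDate": "2018-03-12T21:44:25+01:00"}]
print(extract_dates(only_last_changed))
```

A registry that publishes only last changed ends up with registered and expires set to None, which serializes to null.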

The second is value sanitation. Everything comes out in ISO 8601 UTC Z form, fractional seconds dropped, timezone offsets converted:

Raw	Normalized
`2018-03-12T21:44:25+01:00`	`2018-03-12T20:44:25Z`
`1998-10-21T04:00:00.896Z`	`1998-10-21T04:00:00Z`
`1969-12-31T23:59:59Z`	`null` (rejected, pre-1985)
`2200-01-01T00:00:00Z`	`null` (rejected, post-2100)
`0000-00-00T00:00:00Z`	`null` (rejected)
`not-a-date`	`null` (rejected)

Pre-1985 and post-2100 values are almost always database junk: Unix epoch leakage, off-by-1000 typos, default DBMS values. A domain expiration monitor that trusts a raw 1969-12-31 date will mark the domain as already expired. We return null and let the caller decide.
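A rough sketch of the value sanitation, using only the Python standard library. The name `normalize_date` is hypothetical. Treating the years 1985 through 2100 as the inclusive accepted range and reading offset-less timestamps as UTC are assumptions of this example, not statements about our implementation.

```python
from datetime import datetime, timezone
from typing import Optional

MIN_YEAR, MAX_YEAR = 1985, 2100  # assumed inclusive bounds


def normalize_date(raw: object) -> Optional[str]:
    """Return an ISO 8601 UTC timestamp with a Z suffix, or None if rejected."""
    if not isinstance(raw, str):
        return None
    text = raw.strip()
    if text[-1:] in ("Z", "z"):
        text = text[:-1] + "+00:00"   # older Pythons do not parse a bare "Z"
    try:
        parsed = datetime.fromisoformat(text)
        if parsed.tzinfo is None:
            parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
        parsed = parsed.astimezone(timezone.utc)
    except (ValueError, OverflowError):
        return None                   # "not-a-date", year 0000, out-of-range math
    if not MIN_YEAR <= parsed.year <= MAX_YEAR:
        return None                   # epoch leakage, far-future junk
    # Fractional seconds are dropped by formatting without them.
    return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")


for value in ("2018-03-12T21:44:25+01:00", "1998-10-21T04:00:00.896Z", "1969-12-31T23:59:59Z"):
    print(value, "->", normalize_date(value))
```

Returning None instead of raising keeps the decision with the caller, which matches the behavior described above.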

Statuses: two valid forms, RFC 8056 maps between them

The status array above passed through unchanged because Verisign currently ships the RDAP form already. So do DENIC, AFNIC, SIDN, MarkMonitor, Nominet, and PIR when you curl any of them today. But the RDAP spec allows a second form too, and a normalizer that does not handle it will break the day a registry switches.

RFC 9083 §10.2.2 defines the RDAP status vocabulary in lowercase with spaces (client transfer prohibited, pending delete, redemption period). RFC 8056 §2 provides a one-to-one mapping back to the EPP wire form (clientTransferProhibited, pendingDelete, redemptionPeriod) that registries use internally.

An input in the EPP form like

"status": ["clientTransferProhibited", "serverDeleteProhibited", "OK"]

becomes

"status": ["client transfer prohibited", "server delete prohibited", "active"]

OK maps to active. Mixed case (Server Transfer Prohibited) is lowercased. Values not in the RFC 8056 table (AFNIC's server recover prohibited, registry-specific values like reserved or premium) pass through lowercased. Duplicates are removed.

We do not invent a canonical form for registry-specific statuses. AFNIC's server recover prohibited and a hypothetical registry's server recovery prohibited stay distinct strings; we are not in a position to guess that they mean the same thing.
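The status rules can be sketched as a lookup table plus a pass-through. The table below is an abridged excerpt of the RFC 8056 §2 mapping, keyed by the lowercased EPP token; the names `EPP_TO_RDAP` and `normalize_statuses` are hypothetical, and preserving input order is an assumption of this example.

```python
from typing import Iterable, List

# Abridged excerpt of the EPP -> RDAP mapping (see RFC 8056 section 2 for the full table).
EPP_TO_RDAP = {
    "ok": "active",
    "clientdeleteprohibited": "client delete prohibited",
    "clienttransferprohibited": "client transfer prohibited",
    "clientupdateprohibited": "client update prohibited",
    "serverdeleteprohibited": "server delete prohibited",
    "servertransferprohibited": "server transfer prohibited",
    "serverupdateprohibited": "server update prohibited",
    "pendingdelete": "pending delete",
    "redemptionperiod": "redemption period",
}


def normalize_statuses(values: Iterable[str]) -> List[str]:
    """Map EPP forms to RDAP forms, lowercase the rest, drop duplicates."""
    result: List[str] = []
    for value in values:
        key = " ".join(value.split()).lower()   # trim, collapse spaces, lowercase
        status = EPP_TO_RDAP.get(key, key)      # unknown values pass through lowercased
        if status not in result:
            result.append(status)
    return result


print(normalize_statuses(["clientTransferProhibited", "serverDeleteProhibited", "OK"]))
```

Values already in the RDAP form miss the table and come back unchanged, so the function is safe to run on both forms.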

Phone numbers: libphonenumber to E.164

Phone numbers arrive in vcardArray as tel: URIs, in whatever shape the upstream chose. MarkMonitor's abuse phone above (tel:+1.2086851750 → +12086851750) is one case; the rest of the input space looks like this:

tel:+1 208 685 1750
tel:+33.899701761
tel:+49 228 18123797

Under string equality, each of these is a different cache key, a different dedupe row, a different contact entry. We push every parseable number through libphonenumber and return E.164. A few edge cases took most of the work.

Stray leading zeros. Some upstreams emit a real country code behind a stray zero: +0.218915035945 for Libya, +033.534272850 for France. The ambiguous case is +0.6479286442. Strip the zero and you get +6479286442, which libphonenumber parses as a valid New Zealand number (+64). The actual number is a Toronto landline in area code 647 (+1.647...). Without context, the wrong rescue is indistinguishable from the right one.

So we gate the rescue on the contact's address country. If adr.cc is CA, we reject the NZ interpretation and pass the original through. If adr.cc is LY, the Libya rescue goes through. When the contact has no country, the rescue is skipped entirely.
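The gating logic can be illustrated without libphonenumber. The sketch below replaces the real parse with a simple calling-code prefix check, so it is only an approximation of the idea; `CALLING_CODES` is a tiny hypothetical table and `rescue_leading_zero` is a made-up name.

```python
import re
from typing import Optional

# Hypothetical, deliberately tiny ISO 3166-1 alpha-2 -> calling code table.
CALLING_CODES = {"CA": "1", "US": "1", "FR": "33", "NZ": "64", "LY": "218"}

# "+", one or more stray zeros, an optional separator, then the rest of the number.
_STRAY_ZERO_RE = re.compile(r"\+0+[.\s-]?(\d[\d.\s-]*)")


def rescue_leading_zero(raw: str, country: Optional[str]) -> str:
    """Strip stray zeros after '+' only when the result fits the contact's country."""
    original = raw.strip()
    match = _STRAY_ZERO_RE.fullmatch(original)
    if match is None:
        return original                       # no stray zero: nothing to rescue
    code = CALLING_CODES.get((country or "").upper())
    if code is None:
        return original                       # no usable country: skip the rescue
    digits = re.sub(r"\D", "", match.group(1))
    if not digits.startswith(code):
        return original                       # e.g. the NZ reading for a CA contact
    return "+" + digits


print(rescue_leading_zero("+0.218915035945", "LY"))
print(rescue_leading_zero("+0.6479286442", "CA"))
```

The Libya number is rescued because the digits start with 218. The Toronto number is passed through because its digits do not start with Canada's calling code.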

Redaction markers. REDACTED, Redacted for Privacy, GDPR Masked. These are not phone numbers; they are signals. We leave them unchanged so you can tell "masked for privacy" apart from "missing."

Unparseable numbers. +1.555 and +886.000000 fail libphonenumber's isValidNumber check. (208) 685-1750 has no leading +, so the country is ambiguous. In every case we return the trimmed original rather than drop the field, so an upstream value is never silently lost.
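The "never drop the upstream value" policy is independent of the parser, so it can be sketched with the parser injected. Here `shape_only_e164` is a toy stand-in for a libphonenumber-backed function: it checks only the shape of the digits and does not replicate isValidNumber, so a value like +886.000000 would slip through it. Both function names are hypothetical, and returning the trimmed input with any tel: prefix intact is an assumption.

```python
import re
from typing import Callable, Optional


def shape_only_e164(text: str) -> Optional[str]:
    """Toy parser: accept '+' followed by 8 to 15 digits, ignoring separators."""
    if not text.startswith("+"):
        return None                               # no country code: ambiguous
    digits = re.sub(r"[\s.\-()]", "", text[1:])
    if not re.fullmatch(r"[1-9]\d{7,14}", digits):
        return None
    return "+" + digits


def normalize_phone(raw: str,
                    to_e164: Callable[[str], Optional[str]] = shape_only_e164) -> str:
    """Return E.164 when the parser accepts the value, else the trimmed original."""
    trimmed = raw.strip()
    number = trimmed[4:] if trimmed.lower().startswith("tel:") else trimmed
    return to_e164(number.strip()) or trimmed     # redaction markers land here too


for value in ("tel:+1 208 685 1750", "REDACTED", "+1.555", "(208) 685-1750"):
    print(repr(value), "->", repr(normalize_phone(value)))
```

Redaction markers need no special case in this sketch: they fail to parse and therefore come back unchanged.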

Encoding: when the upstream lies about UTF-8

RDAP responses are JSON, and RFC 8259 requires JSON to be UTF-8. A few registries ship Windows-1250, ISO-8859-2, Big5, or GBK while declaring Content-Type: application/rdap+json; charset=utf-8. The google.com example above was ASCII; a Polish registrant exposes the problem immediately.

What a naive client sees when decoding mislabelled bytes as UTF-8:

"registrant": {
  "name": "Wojew?dztwo ?l?skie",
  "city": "??d?"
}

What we return for the same response:

"registrant": {
  "name": "Województwo Śląskie",
  "city": "Łódź"
}

The bytes on the wire are identical. Only one of the two is usable.

Our decoder, in order:

1. If Content-Type declares a non-UTF-8 charset, trust it and convert via iconv.
2. Otherwise, if the body is valid UTF-8, pass it through.
3. Otherwise, run mb_detect_encoding against a short list (Windows-1252, Windows-1251, Windows-1254, ISO-8859-1, ISO-8859-2, Big5, GBK). If a non-UTF-8 candidate matches strictly, convert it.
4. As a last resort, replace invalid bytes so the response does not break a downstream JSON parser.

What we cannot do reliably: detect a registry that mislabels Windows-1250 bytes as UTF-8. PHP's mbstring does not ship Windows-1250 detection, and the byte ranges overlap with other encodings just enough that guessing is unsafe. Step 1 handles every registry that declares its charset honestly; the case where a registry both ships Windows-1250 and lies about it is the remaining gap.
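The decoder order can be sketched in Python, with one important caveat: this example uses trial decoding in place of mb_detect_encoding, which is a much weaker test. Single-byte codecs such as latin-1 accept any byte sequence, so the candidate list here acts as a priority order and mislabelled ISO-8859-2 bytes would be claimed by cp1252 first. The names `declared_charset` and `decode_body` are hypothetical.

```python
import codecs
import re
from typing import Optional, Sequence

# Python codec names standing in for the detection list in the text.
CANDIDATES = ("cp1252", "cp1251", "cp1254", "latin-1", "iso8859-2", "big5", "gbk")


def declared_charset(content_type: Optional[str]) -> Optional[str]:
    """Extract and canonicalize the charset parameter, or None if absent or unknown."""
    match = re.search(r'charset\s*=\s*"?([^";\s]+)', content_type or "", re.IGNORECASE)
    if match is None:
        return None
    try:
        return codecs.lookup(match.group(1)).name
    except LookupError:
        return None


def decode_body(body: bytes, content_type: Optional[str] = None,
                candidates: Sequence[str] = CANDIDATES) -> str:
    charset = declared_charset(content_type)
    # Step 1: a declared non-UTF-8 charset is trusted.
    if charset and charset != "utf-8":
        try:
            return body.decode(charset)
        except UnicodeDecodeError:
            pass
    # Step 2: valid UTF-8 passes through.
    try:
        return body.decode("utf-8")
    except UnicodeDecodeError:
        pass
    # Step 3: first candidate that decodes without error wins (trial decoding).
    for name in candidates:
        try:
            return body.decode(name)
        except UnicodeDecodeError:
            continue
    # Step 4: last resort, replace invalid bytes with U+FFFD.
    return body.decode("utf-8", errors="replace")


polish = "Łódź".encode("iso-8859-2")
print(decode_body(polish, "application/rdap+json; charset=iso-8859-2"))
```

With an honest Content-Type the Polish bytes decode correctly in step 1. The same bytes labelled as UTF-8 fall through to step 3, where the result depends entirely on how good the detector is.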

Contacts: vCard flattened

The MarkMonitor entities[0].vcardArray[1] walk above produced registrar.name, and the nested abuse vCard produced registrar.abuse_email and registrar.abuse_phone. The same flattening applies to registrant, admin, tech, and billing contacts when a registry publishes them.

.com does not include a registrant in the registry response, so for a richer example, here is the registrant block from an AFNIC lookup of google.fr:

"entities": [{
  "objectClassName": "entity",
  "handle": "GIHU100-FRNIC",
  "roles": ["registrant"],
  "vcardArray": ["vcard", [
    ["version", {}, "text", "4.0"],
    ["fn", {}, "text", "Google Ireland Holdings Unlimited Company"],
    ["org", {}, "text", "Google Ireland Holdings Unlimited Company"],
    ["adr", {"cc": "IE"}, "text", ["", "", "70 Sir John Rogerson's Quay", "Dublin", "", "2", ""]],
    ["email", {}, "text", "[email protected]"],
    ["tel", {"type": "voice"}, "uri", "tel:+353.14361000"]
  ]]
}]

To read the name, you walk entities[0].vcardArray[1], find the entry whose first element is "fn", and take its fourth element. The address is a seven-element positional array (PO box, extended, street, locality, region, postcode, country), most of which are empty strings.

After flattening:

"entities": {
  "registrant": {
    "handle": "GIHU100-FRNIC",
    "name": "Google Ireland Holdings Unlimited Company",
    "organization": "Google Ireland Holdings Unlimited Company",
    "email": "[email protected]",
    "phone": "+35314361000",
    "address": "70 Sir John Rogerson's Quay\nDublin\n2",
    "country_code": "IE"
  }
}

Contacts are keyed by role, not by array index. Missing fields are null. Addresses are joined with newlines, each line trimmed, empty positional slots dropped. Email goes lowercase. Country code comes from the adr parameter cc and is uppercased to match ISO 3166-1 alpha-2. The phone goes through libphonenumber and comes out in E.164, as in the section above.

We keep one value per field. A vCard can ship two email entries or three tel entries with different type parameters; we take the first non-fax telephone and the first email, and the rest are dropped.
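The flattening rules above can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, only string-valued properties are handled, and the phone cleanup is a simple digit filter standing in for the libphonenumber step.

```python
import re
from typing import Any, Dict, List, Optional


def _first(props: List[list], name: str, skip_fax: bool = False) -> Optional[list]:
    """Return the first jCard property called `name`, optionally skipping fax numbers."""
    for prop in props:
        if len(prop) < 4 or prop[0] != name:
            continue
        types = prop[1].get("type", []) if isinstance(prop[1], dict) else []
        if isinstance(types, str):
            types = [types]
        if skip_fax and any(str(t).lower() == "fax" for t in types):
            continue
        return prop
    return None


def _text(props: List[list], name: str) -> Optional[str]:
    prop = _first(props, name)
    value = prop[3] if prop else None
    return value.strip() if isinstance(value, str) and value.strip() else None


def flatten_contact(entity: Dict[str, Any]) -> Dict[str, Optional[str]]:
    vcard = entity.get("vcardArray") or []
    props = vcard[1] if len(vcard) > 1 else []
    adr = _first(props, "adr")
    tel = _first(props, "tel", skip_fax=True)
    email = _text(props, "email")

    address = country_code = None
    if adr:
        slots = adr[3] if isinstance(adr[3], list) else [adr[3]]
        lines = [s.strip() for s in slots if isinstance(s, str) and s.strip()]
        address = "\n".join(lines) or None            # empty slots dropped
        cc = adr[1].get("cc") if isinstance(adr[1], dict) else None
        country_code = cc.upper() if isinstance(cc, str) and cc else None

    phone = None
    if tel and isinstance(tel[3], str):
        # Stand-in for libphonenumber: keep '+' and digits only.
        phone = re.sub(r"[^\d+]", "", tel[3]) or None

    return {
        "handle": entity.get("handle"),
        "name": _text(props, "fn"),
        "organization": _text(props, "org"),
        "email": email.lower() if email else None,
        "phone": phone,
        "address": address,
        "country_code": country_code,
    }


def flatten_entities(entities: List[Dict[str, Any]]) -> Dict[str, dict]:
    """Key contacts by role; the first entity seen for a role wins."""
    result: Dict[str, dict] = {}
    for entity in entities:
        contact = flatten_contact(entity)
        for role in entity.get("roles", []):
            result.setdefault(role, contact)
    return result
```

Calling `flatten_entities` on the google.fr entities array shown earlier yields a single registrant key whose fields match the flattened output above.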

Try it on a domain you care about

curl -H "Authorization: Bearer YOUR_TOKEN" \
  https://rdapapi.io/api/v1/domain/google.com

The actual response:

{
  "domain": "google.com",
  "unicode_name": null,
  "handle": "2138514_DOMAIN_COM-VRSN",
  "status": [
    "client delete prohibited", "client transfer prohibited", "client update prohibited",
    "server delete prohibited", "server transfer prohibited", "server update prohibited"
  ],
  "registrar": {
    "name": "MarkMonitor Inc.",
    "iana_id": "292",
    "abuse_email": "[email protected]",
    "abuse_phone": "+12086851750",
    "url": "http://www.markmonitor.com"
  },
  "dates": {
    "registered": "1997-09-15T04:00:00Z",
    "expires":    "2028-09-14T04:00:00Z",
    "updated":    "2019-09-09T15:39:04Z"
  },
  "nameservers": ["ns1.google.com", "ns2.google.com", "ns3.google.com", "ns4.google.com"],
  "dnssec": false,
  "entities": {},
  "meta": {
    "rdap_server": "https://rdap.verisign.com/com/v1/",
    "raw_rdap_url": "https://rdap.verisign.com/com/v1/domain/google.com",
    "cached": false
  }
}

Run it against your hardest domain. If you find a registry quirk we have missed, send the raw response to [email protected] and we will add it.

The same lookup in our SDKs:

# pip install rdapapi
from rdapapi import RdapApi

api = RdapApi("YOUR_TOKEN")
domain = api.domain("google.com")

// npm install rdapapi
import { RdapClient } from "rdapapi";

const api = new RdapClient("YOUR_TOKEN");
const domain = await api.domain("google.com");

// composer require rdapapi/rdapapi-php
$api = new \RdapApi\RdapApi("YOUR_TOKEN");
$domain = $api->domain("google.com");

// go get github.com/rdapapi/rdapapi-go
client := rdapapi.NewClient("YOUR_TOKEN")
domain, _ := client.Domain("google.com", nil)

// io.rdapapi:rdapapi-java (Maven Central)
RdapClient client = new RdapClient("YOUR_TOKEN");
DomainResponse domain = client.domain("google.com");

Full API and SDK documentation.