[routing-wg] Improving operations at RIPE NCC TA (Was: Delay in publishing RPKI objects)
Daniel Suchy
danny at danysek.cz
Wed Feb 17 17:22:46 CET 2021
Hello,

I agree with Job that the reliability of the TA needs to be improved, and I
fully support the ideas described by Job below.

- Daniel

On 2/17/21 4:58 PM, Job Snijders via routing-wg wrote:
> Dear RIPE NCC,
>
> On Wed, Feb 17, 2021 at 11:28:32AM +0100, Nathalie Trenaman wrote:
>>> The multitude of RPKI service impacting events as a result of
>>> maloperation of the RIPE NCC trust anchor are starting to give me
>>> cause for concern.
>>
>> I’m sorry to hear this. Transparency is key for us, this means that we
>> report any event. In this case, we were not compliant with our CPS and
>> this non-publishing period had operational impact.
>
> From the previous email there might be a misunderstanding about what
> rpki-client and Routinator do. Both utilities help Relying Parties
> validate X.509 signed CMS objects and transform the validated content
> into authorizations and attestations. Neither utility is an SLA or CPS
> compliance monitor. RIPE NCC - as CA operator - needs different tools.
>
> Neither utility has been designed to interpret the Certification
> Practice Statement (written in a natural language) and subsequently
> programmatically transform the described 'Service Level' into metrics
> suitable for monitoring.
>
> A relying party can never tell the difference between a publication
> pipeline being empty because CAs didn't issue new objects, or a
> publication pipeline being empty because of a malfunction in one of RIPE
> NCC's RPKI subsystems.
>
> More examples of 'out of scope' functionality for Relying Party
> software: validators don't monitor whether lirportal.ripe.net is
> functional, whether RIPE NCC's BPKI API endpoints are operational, or
> whether LIRs paid their invoices; the list is quite long. The validators
> by themselves are the wrong tool for RPKI CPS/SLA monitoring.
>
> You state "transparency is key for us", but I fear ad-hoc low-quality
> a-posteriori reports are not the appropriate mechanism to impress and
> reassure this community regarding 'transparency'.
>
> I have some tangible suggestions to RIPE NCC that will improve the
> reliability of the Trust Anchor and potentially help rebuild trust:
>
> A need for Certificate Transparency
> -----------------------------------
>
> RIPE NCC should set up a Certificate Transparency project which publicly
> shows which certificates (fingerprints) were issued when, and store such
> information in immutable logs, accessible to all.
>
> How can anyone trust a Trust Anchor which does not offer transparency
> about its issuance process?
>
> Lack of transparency to signer software
> ---------------------------------------
>
> The RIPE NCC WHOIS database software is open source, as is most of the
> software for RIPE Atlas, K-ROOT, and other efforts RIPE NCC has
> undertaken over the years.
>
> Why has the signer source code still not been open sourced? Why can't
> members review the code related to scheduled changes? Why is an
> organisation proclaiming 'transparency' being opaque about how the RPKI
> certificates are issued?
>
> Lack of Public status dashboard
> -------------------------------
>
> RIPE NCC should set up a website like https://rpki-status.ripe.net/
> which shows dashboards with graphs and traffic lights related to each
> (best effort) commitment listed in the CPS. RIPE NCC should continuously
> publish & revoke & delete objects and verify whether those activities
> are visible externally, and then automatically report whether any
> potential delays observed are within the Service Levels outlined in the
> CPS.
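
The publish-and-verify check suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions: the "canary" object, the `CanaryObservation` type, the `publication_status` function, and the 10-minute limit are all hypothetical and are not taken from the RIPE NCC CPS or from any existing tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical service level, for illustration only (not a value from any CPS).
MAX_PUBLICATION_DELAY = timedelta(minutes=10)


@dataclass
class CanaryObservation:
    """One test object issued by the CA and watched by an external probe."""
    issued_at: datetime           # when the CA signed the canary object
    seen_at: Optional[datetime]   # first external sighting, None if not seen yet


def publication_status(obs: CanaryObservation, now: datetime,
                       limit: timedelta = MAX_PUBLICATION_DELAY) -> str:
    """Return a traffic-light colour for a single canary object."""
    if obs.seen_at is not None:
        # The object became visible: compare the observed delay with the limit.
        delay = obs.seen_at - obs.issued_at
        return "green" if delay <= limit else "red"
    # Not visible yet: acceptable while still inside the limit, a breach after it.
    return "amber" if now - obs.issued_at <= limit else "red"
```

Because the CA side knows when the canary was issued, a check of this shape can tell "nothing was issued" apart from "something was issued but never appeared", which a relying party alone cannot do.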
>
> Metrics that come to mind:
>
> * delta between last certificate issuance & successful publication
> * Object count in the repository, repo size (and graphs)
> * Time-To-Keyroll (and graphs on duration & frequency)
> * Resource utilisation of various RPKI subsystems
> * aggregate bandwidth consumption for RPKI endpoints (including rrdp, API, rsync)
> * Graphs & logs of overlap between INRs listed on EE certificates under
>   the RIPE TA and other commonly used TAs, matched against known
>   transfers. This will help detect compromises as well as understand
>   whether transfers are successful or not.
> * Unique client IP count for RSYNC & RRDP for last hour/day/week
> * Number of CS/hostmaster tickets mentioning RPKI
>
> There are plenty of aspects to monitor, perhaps some notes should be
> copied from how the DNS root is monitored.
>
> Lack of operational experience with BGP ROV at RIPE NCC
> -------------------------------------------------------
>
> I believe a number of potential future incidents related to the RIPE
> NCC Trust Anchor can be prevented (or remediation time reduced) if RIPE
> NCC themselves apply RPKI based BGP Origin Validation 'invalid ==
> reject' policies on the AS 3333 EBGP border routers. RIPE NCC OPS
> themselves having a dependency on the RPKI services will increase
> organization-wide exposure to the (lack of) well-being of the Trust
> Anchor services, and given the short communication channels between the
> OPS team and the RPKI team my expectation is that we'll see problems
> being solved faster and perhaps even problems being prevented.
>
> An analogy: RIPE NCC is a chef refusing to eat their own food.
> How can we trust RIPE NCC to operate RPKI services, when RIPE NCC
> themselves refuse to apply the cryptographic products to their BGP
> routing decisions? "RPKI for thee but not for me?"
>
> Surely RIPE NCC staff has not disabled DNSSEC sig checking on their
> resolvers, or disabled WebPKI TLS validation in their browsers?
> I'm not joking, it makes zero sense to participate in a PKI and at the
> same time not participate in the same way everyone outside RIPE NCC
> depends on the service.
>
> I am very aware of potential for circular dependencies between BGP and
> RPKI, and I know exactly how catch-22s can be avoided. Unfortunately it
> appears my feedback is ignored and problem reports remain unresolved.
>
> Reporting issues has become a thankless effort, useless because no
> remediation actions are taken, and obviously RIPE NCC staff are growing
> tired of hearing about problems (but if one wishes to stop hearing about
> problems... perhaps solve the issues, instead of a 'head in the sand'
> approach?!)
>
> Conclusion & Call to action
> ---------------------------
>
> There is a fair chunk of work ahead for RIPE NCC, but RIPE NCC has a
> multi-million budget and talented dedicated staff to achieve the above.
> None of the above is impossible or unreasonable to ask from Trust
> Anchors.
>
> I recognize how the above information reflects negatively on the current
> state of the RIPE NCC Trust Anchor. But the reality of the situation is
> that we see an outage every few weeks, and there is an apparent lack of
> architectural oversight on how to improve. I really hope this is a
> temporary state of being, on which we can look back a year from now as
> "haha remember those RPKI teething pains?". I wish for RIPE NCC to
> be successful in operating the Trust Anchor.
>
> RIPE NCC would do well to allow themselves to be vulnerable to criticism
> by exposing service level metrics and efforts like production of public
> Merkle tree logs - reflecting the certificate issuance process. RIPE NCC
> should allow itself to be held accountable - which can only happen if
> there is visibility into where friction exists.
>
> Does RIPE NCC understand the precariousness of the current situation and
> the negative impact on the long term viability of the RPKI if course is
> not corrected?
>
> This email outlines deliverables, will RIPE NCC commit to those? What
> timelines can the community expect? What kind of help is needed to
> achieve this?
>
> Kind regards,
>
> Job
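
For readers unfamiliar with the "public Merkle tree logs" mentioned in the quoted message, here is a minimal sketch of the tree hash used by Certificate Transparency (RFC 6962 style: SHA-256 with a 0x00 prefix for leaves and 0x01 for interior nodes). The entries are placeholder byte strings standing in for certificate fingerprints; nothing here reflects how RIPE NCC issues or logs certificates, and a real log would also need signed tree heads plus inclusion and consistency proofs.

```python
import hashlib
from typing import List


def leaf_hash(entry: bytes) -> bytes:
    # Leaves are prefixed with 0x00 so a leaf can never be confused with a node.
    return hashlib.sha256(b"\x00" + entry).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    # Interior nodes are prefixed with 0x01.
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_root(entries: List[bytes]) -> bytes:
    """Compute the Merkle tree hash over an ordered, append-only list of entries."""
    n = len(entries)
    if n == 0:
        return hashlib.sha256(b"").digest()
    if n == 1:
        return leaf_hash(entries[0])
    # Split at the largest power of two strictly smaller than n.
    k = 1
    while k * 2 < n:
        k *= 2
    return node_hash(merkle_root(entries[:k]), merkle_root(entries[k:]))
```

Publishing the root after each issuance means that any later change to an earlier entry produces a different root, which is what lets outside observers detect a rewritten history.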