For a project like Signal, there are competing aspects of security:
- privacy and anonymity: keep as little identifiable information around as possible. This can be a life or death thing under repressive governments.
- safety and anti-abuse: reliably block bad actors such as spammers, and make it possible for users to reliably block specific people (e.g. a creepy stalker). This is really important for Signal to have a chance at mass appeal (which in turn makes it less suspicious to have Signal installed).
Phone number verification is the state-of-the-art approach to making it more expensive for bad actors to create thousands of burner accounts. The cost is that it prevents fully anonymous participation (how much so depends on how difficult it is to get a prepaid SIM in your country).
Signal points out that sending verification SMS is actually one of its largest cost centers, currently accounting for 6M USD of its 14M USD infrastructure budget: https://signal.org/blog/signal-is-expensive/
I’m sure they would be thrilled if there were cheaper anti-abuse measures.
The text does technically give the reason on the first page:
Here, “regular language” is a technical term, and the statement is correct.
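To make the "regular" part concrete, here is a minimal Python sketch on a toy language of nested `<a>` tags (my own illustration, not taken from the text). It uses Python's built-in `re` module, which has no recursion construct. The pattern `DEPTH_2` and both function names are made up for this example. The idea is that a regex in the strict sense has to hard-code a maximum nesting depth, while a trivial counter handles any depth:

```python
import re

# Toy language (an assumption for illustration): properly nested <a>...</a>
# tags with no text content, e.g. "<a><a></a></a>".

# Without recursion, a pattern can only spell out a fixed nesting depth.
# This one accepts sequences of <a> elements nested at most 2 levels deep.
DEPTH_2 = re.compile(r"(?:<a>(?:<a></a>)*</a>)*")


def matches_regex(s: str) -> bool:
    """True if the whole string fits the fixed-depth pattern."""
    return DEPTH_2.fullmatch(s) is not None


def is_balanced(s: str) -> bool:
    """Recognize arbitrary nesting depth by counting open tags."""
    depth = 0
    pos = 0
    while pos < len(s):
        if s.startswith("<a>", pos):
            depth += 1
            pos += len("<a>")
        elif s.startswith("</a>", pos):
            depth -= 1
            if depth < 0:  # closing tag without a matching opener
                return False
            pos += len("</a>")
        else:
            return False  # anything else is outside the toy language
    return depth == 0


two_deep = "<a><a></a></a>"
three_deep = "<a><a><a></a></a></a>"
print(matches_regex(two_deep), is_balanced(two_deep))
print(matches_regex(three_deep), is_balanced(three_deep))
```

You can always extend the pattern by one more level, but no finite pattern of this kind covers every depth, which is the sense in which the language is not regular.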
The text goes on to discuss Perl regexes, which I think are able to parse at least all languages in LL(*). I’m fairly sure that is sufficient to recognize XML, but am not quite certain about HTML5. The WHATWG standard doesn’t define HTML5 syntax with a grammar, but with a stateful parsing procedure which defies normal placement in the Chomsky hierarchy.
This, of course, is the real reason: even if such a regex is technically possible with some regex engines, creating it is extremely exhausting and each time you look into the spec to understand an edge case you suffer 1D6 SAN damage.