Blog internal - How IPs are masked (v6)

06 Oct 2024

This post revisits an entry I made in October '22. Shortly after the IP masking went into effect, changes were made to also match and mask IPv6 addresses. The method used was sub-optimal until I recently fixed it. This post documents the changes.

IP masking before changes (only IPv4 support)
s/^.*[[:space:]](([[:digit:]]{1,3}\.){3})[[:digit:]]{1,3}(.*$)/$1XXX$3/;
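
The IPv4-only substitution can be tried outside the original script. What follows is a minimal Python sketch, not the code that runs on the blog. It assumes a port in which \s and \d stand in for the POSIX classes [[:space:]] and [[:digit:]], which Python's re module does not support. The function name mask_ipv4_only is made up for this example.

```python
import re

# Python port of the IPv4-only substitution. \s and \d replace the POSIX
# classes [[:space:]] and [[:digit:]] used in the original expression.
IPV4_ONLY = re.compile(r"^.*\s((\d{1,3}\.){3})\d{1,3}(.*$)")


def mask_ipv4_only(line: str) -> str:
    """Replace the last octet with XXX (hypothetical helper for this sketch)."""
    # \1 is the first three octets, \3 is the rest of the line.
    return IPV4_ONLY.sub(r"\1XXX\3", line, count=1)


print(mask_ipv4_only("blog 192.168.1.1 entry1"))
print(mask_ipv4_only("blog 2001:db8:a::123 entry3"))
```

The leading part of the line is matched but not captured, so it does not appear in the result. A line without an IPv4 address, such as the IPv6 sample, passes through unchanged, which is why IPv6 support had to be added.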

IP masking after changes (with IPv6 support)
s/^.*\s((\d{1,3}\.){3}|(\w{0,4}:){1,5})((\d{1,3}\s)|(\w{0,4}:?){0,4}\s)(.*$)/$1XXX $7/;

Deconstructing the substitution regex (by capture groups)
^.*\s # start
(\d{1,3}\.){3} # CG2
(\w{0,4}:){1,5} # CG3
CG2|CG3 # CG1

(\d{1,3}\s) # CG5
(\w{0,4}:?){0,4} # CG6
CG5|CG6\s # CG4
(.*$) # CG7

/$1XXX $7/; # substitute (return only "CG1XXX CG7")
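
The same breakdown can be written as a single commented pattern. This is a sketch in Python, assuming a direct port of the expression with re.VERBOSE so that each capture group carries its own comment. The names MASK and mask are invented here and are not taken from the blog's code.

```python
import re

# Port of the IPv6-capable substitution, laid out with re.VERBOSE.
# Whitespace and '#' comments inside the pattern are ignored by the engine.
MASK = re.compile(
    r"""
    ^.*\s                    # start: everything before the address, not captured
    (                        # CG1: part of the address that is kept
        (\d{1,3}\.){3}       # CG2: first three IPv4 octets
      | (\w{0,4}:){1,5}      # CG3: up to five leading IPv6 groups
    )
    (                        # CG4: part of the address that is discarded
        (\d{1,3}\s)          # CG5: last IPv4 octet, or a short all-digit IPv6 group
      | (\w{0,4}:?){0,4}\s   # CG6: remaining IPv6 groups
    )
    (.*$)                    # CG7: rest of the line
    """,
    re.VERBOSE,
)


def mask(line: str) -> str:
    """Return CG1 + 'XXX ' + CG7 (hypothetical helper for this sketch)."""
    return MASK.sub(r"\1XXX \7", line, count=1)


for sample in (
    "blog 192.168.1.1 entry1",
    "blog 2001:db8:1234:ffff:ffff:ffff:ffff:ffff entry2",
    "blog 2001:db8:a::123 entry3",
):
    print(mask(sample))
```

Only CG1 and CG7 are referenced in the replacement, so the text matched by the start of the pattern is dropped along with CG4.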

input:
blog 192.168.1.1 entry1
blog 2001:db8:1234:ffff:ffff:ffff:ffff:ffff entry2
blog 2001:db8:a::123 entry3

returns:
192.168.1.XXX entry1
2001:db8:1234:ffff:ffff:XXX entry2
2001:db8:a::XXX entry3

As shown above, capture group 4 is discarded. It matches either the last octet of an IPv4 address or (|) the remaining groups of an IPv6 address. For a fully written IPv6 address this swallows 48 bits (the last three 16-bit groups) and replaces them with three X characters.
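
The 48-bit figure corresponds to a fully written address such as entry2. As a rough check, the following Python sketch counts the address bits held in capture group 4 for each sample line. It assumes a compact port of the substitution pattern, and discarded_bits is a hypothetical helper that exists only for this illustration.

```python
import re

# Compact Python port of the IPv6-capable pattern; group 4 is the discarded part.
MASK = re.compile(
    r"^.*\s((\d{1,3}\.){3}|(\w{0,4}:){1,5})((\d{1,3}\s)|(\w{0,4}:?){0,4}\s)(.*$)"
)


def discarded_bits(line: str) -> int:
    """Count the address bits that capture group 4 removes from one line."""
    match = MASK.match(line)
    if match is None:
        return 0  # no address found, nothing is masked
    if "." in match.group(1):
        return 8  # IPv4: exactly one octet is discarded
    dropped = match.group(4).strip()
    # IPv6: every non-empty colon-separated group stands for 16 bits.
    return 16 * sum(1 for group in dropped.split(":") if group)


for sample in (
    "blog 192.168.1.1 entry1",
    "blog 2001:db8:1234:ffff:ffff:ffff:ffff:ffff entry2",
    "blog 2001:db8:a::123 entry3",
):
    print(sample.split()[-1], discarded_bits(sample))
```

For the three sample lines this gives 8, 48, and 16 bits. A compressed address such as entry3 loses only its final group, because the groups hidden behind :: are never written out in the first place.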

Beyond these improvements, I should mention that the effectiveness of this method for anonymization has been criticized. The extent to which stripping bits from an address contributes to anonymization is still under discussion, given the assignment algorithms used for IPv6 addresses. See [Arxiv 1707.03900] for more information on this topic. Since the traffic on my blog is too low to draw any inferences across IPv6 addresses, I'll stick to the much simpler method of stripping bits.