Comment by alphazard

1 month ago

I seem to remember Yahoo finance (I think it was them, maybe someone else) introducing benign errors into their market data feeds, to prevent scraping. This lead to people doing 3 requests instead of just 1, to correct the errors, which was very expensive for them, so they turned it off.

I don't think watermarking is a winning game for the watermarker, with enough copies any errors can be cancelled.

3 comments

alphazard

coppsilgold 1 month ago

> I don't think watermarking is a winning game for the watermarker, with enough copies any errors can be cancelled.

This is a very common assumption that turns out to be false.

There are Tardos probabilistic codes (see the paper I linked) which have the watermark scale as the square of the traitor count.

For example, with a watermark of just 400 bits, 4 traitors (who try their best to corrupt the watermark) will stand out enough to merit investigation and with 800 bits be accused without any doubt. This is for a binary alphabet, with text you can generate a bigger alphabet and have shorter watermarks.

These are typically intended for tracing pirated content, so they carry the so-called Marking Assumption (if given two or more versions of a piece of content, you must choose one. A pirate isn't going to corrupt or remove a piece of video, that would be unsuitable for leaking). So it would likely be possible to get better results with documents, may require larger watermarks to get such traitors reliably.

alphazard 1 month ago
This was a fascinating read, thanks for posting.
I'm not totally convinced that the threat model is realistic. The watermarker has to embed the watermark, the only place to do that is in the least significant bits of whatever the message is. If it's an audio file then the least significant bits of each sample would work. If it's a video file then the LSBs in a DCT bin may also be unnoticeable. It can really only go in certain places, without it affecting the content in a meaningful way. If it's in a header, or separate known location, then the pirate can just delete those bits.
The threat model presented says the pirates have to go with one of the copies, or only correct errors that are different between 2 copies. That's the part that I don't think is realistic. If the pirates knew that the file was marked, and the scheme used to mark it, but didn't know the key (a standard threat model for things like encryption), then they could inject their own noise into wherever the watermark could be hiding, and now the problem is the watermarker trying to send a message on a noisy channel, where the pirates have a jammer. I don't even think you have to sacrifice quality, since the copy you have already has noise, and you just need to inject the same amount (or more).
- coppsilgold 1 month ago
  
  It's more sophisticated than that. A single movie can be fragmented into 1000s of fragments, each fragment carries 1 bit. It's called A/B forensic watermarking. So you need to insert a 1-bit watermark into a video segment that is a few megabytes, there is no feasible way to defeat this as a pirate unless the watermarker is incompetent. Averaging will not work.
  See AWS offering:
  For large-scale per-viewer, implement a content identification strategy that allows you to trace back to specific clients, such as per-user session-based watermarking. With this approach, media is conditioned during transcoding and the origin serves a uniquely identifiable pattern of media segments to the end user. A session to a user-mapping service receives encrypted user ID information in the header or cookies of the request context and uses this information to determine the uniquely identifiable pattern of media segments to serve to the viewer. This approach requires multiple distinctly watermarked copies of content to be transcoded, with a minimum of two sets of content for A/B watermarking. Forensic watermarking also requires YUV decompression, so encoding time for 4K feature length content can take upwards of 20 hours. DRM service providers in the AWS Partner Network (APN) are available to aid in the deployment of per-viewer content forensics.
  <https://docs.aws.amazon.com/wellarchitected/latest/streaming...>
  This will be more challenging for text. Not as difficult for images.
  > the only place to do that is in the least significant bits
  This is also false, it's the most naive way to watermark content. They do it in the mid range frequencies these days. And then make the watermarks robust to resizing, re-encoding, cropping and even rotation in some cases. They survive when someone holds a camera to record a screen.