Comment by nttylock

4 hours ago

The false positive rate you're describing matches what we see running similarity detection on generated text instead of code: cosine similarity alone flags a lot of same-topic pairs that aren't actually duplicates. What helped was combining the embedding score with a structural signal (AST edit distance for code, overlapping headings and citations for text) so no single metric makes the call. Also worth surfacing the raw similarity score in the CLI output instead of just a binary duplicate flag, since people will want to tune the threshold per codebase.
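For illustration, here is a minimal sketch of the "no single metric makes the call" idea, not what either tool actually does. It assumes the embeddings come from elsewhere, and it uses a sequence ratio over AST node types as a cheap stand-in for a real AST edit distance. The function names and both thresholds are hypothetical.

```python
import ast
import math
from difflib import SequenceMatcher


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def node_types(source):
    """Flatten Python source into its sequence of AST node type names."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]


def structural_similarity(src_a, src_b):
    """Stand-in for AST edit distance: similarity ratio of node-type sequences."""
    return SequenceMatcher(None, node_types(src_a), node_types(src_b)).ratio()


def is_duplicate(emb_a, emb_b, src_a, src_b, min_cosine=0.85, min_structural=0.9):
    """Flag a pair only when BOTH signals agree; thresholds are arbitrary."""
    cos = cosine(emb_a, emb_b)
    struct = structural_similarity(src_a, src_b)
    return {
        "cosine": cos,
        "structural": struct,
        "duplicate": cos >= min_cosine and struct >= min_structural,
    }


ADD = "def add(a, b):\n    return a + b\n"
PLUS = "def plus(x, y):\n    return x + y\n"
LOOP = "def show(items):\n    for i in items:\n        print(i)\n"

# Same made-up embedding for every pair, so only the structural signal differs.
emb = [1.0, 0.0]
print(is_duplicate(emb, emb, ADD, PLUS))
print(is_duplicate(emb, emb, ADD, LOOP))
```

Returning both raw values alongside the flag also makes it easy to print them in the CLI output for per-codebase tuning.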

My solution for false positives is simpler:

1. The tool uses only cosine similarity, plus a boost that depends on the distance between the fragments in the codebase.

2. Classification with an LLM. This can be done by the coding agent already used with the project, which gives better results than integrating this pass into the tool. LLMs used for coding are pretty good at this.

I assumed that this is not a problem I need to solve inside the tool. I'm aware this is not deterministic, but this is by design.

Regarding surfacing the raw similarity: currently, the score (raw similarity + boost) is visible in the report, so the score threshold can be tuned based on data. The raw similarity threshold can also be configured, but the raw similarity itself is not displayed. I will think about how to handle this.
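One possible shape for a report row that shows the raw similarity next to the boost and the combined score is sketched below. This is only an illustration: the boost formula, its direction (here, farther apart means a larger boost), the threshold names and the default values are all assumptions, since the thread does not say how the tool computes them.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


def path_distance(a, b):
    """Directory steps from a's folder up to the shared ancestor and down to b's."""
    pa = PurePosixPath(a).parent.parts
    pb = PurePosixPath(b).parent.parts
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)


def boost_for_distance(distance, per_step=0.01, cap=0.05):
    """Hypothetical boost: grows with distance, capped. The real formula may differ."""
    return min(cap, per_step * distance)


@dataclass
class Match:
    path_a: str
    path_b: str
    raw: float    # plain cosine similarity
    boost: float  # distance-based adjustment

    @property
    def score(self):
        return self.raw + self.boost


def evaluate(path_a, path_b, raw, min_raw=0.80, min_score=0.85):
    """Return a Match when both (assumed) thresholds pass, otherwise None."""
    boost = boost_for_distance(path_distance(path_a, path_b))
    match = Match(path_a, path_b, raw, boost)
    if match.raw < min_raw or match.score < min_score:
        return None
    return match


def format_row(m):
    """One report line exposing raw, boost and score so each threshold can be tuned."""
    return (f"{m.path_a} <-> {m.path_b}  "
            f"raw={m.raw:.3f}  boost={m.boost:+.3f}  score={m.score:.3f}")


m = evaluate("src/a/x.py", "src/b/y.py", raw=0.84)
if m is not None:
    print(format_row(m))
```

With all three numbers in each row, both thresholds can be picked from the same report data.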