Comment by klibertp
4 hours ago
Correct me if I'm wrong, but looking at [1] it seems to be specifically using function definitions (I'm guessing this works with functions, methods, and lambdas (the "<unknown>" part)?) as units of repetition. If yes, that's fine, but I would seriously consider adding some settings to allow the user to control that granularity. Sometimes, the repeated code is a conditional branch within larger functions (i.e., "every else:" or "every except Ex:" looks the same). If the functions are large enough, the dissimilarity of the rest of the body would (probably?) cause such things to be missed.
I would also consider - perhaps as a separate pass, with scoring set differently - to analyze comments (especially docstrings in Python). If I read the code correctly, you're currently just stripping them, which is the right thing to do when looking for code duplication, but duplicated docstrings are also often a signal that something is wrong in the codebase. The "different scoring" is because we expect docstring to be structured similarly (at least more than normal code), so some tweaking would be needed.
Finally: very nice project, congrats! :)
[1] https://github.com/rafal-qa/slopo/blob/main/src/slopo/indexi...
Currently, only whole functions (including function-like constructs depending on language) are considered as unit.
Skipping the extraction of conditional branches was my decision to not overcomplicate the first versions, which was intended to validate the idea. I will add this in future versions because I agree it's needed for large functions.
I don't think it needs configurable granularity. In the current version, there is an analogous mechanism: when functions are nested, both outer and inner are embedded separately. When both are similar to each other, this pair is excluded. Inner or outer functions can appear in results depending on similarity to other units.
Regarding comments, they are removed and I will think about handling them. The challenge is not with extraction, but with how to present this in a report. This may be a nice addition because coding agents often add comments.
Thanks for the feedback.