Rafal-qa/slopo: Embedding-based Code Duplication Detector · GitHub

A lightweight CLI tool for detecting non-exact code duplication using embedding models.

It focuses on the similar code that is hardest to detect and most damaging: similarly written snippets that sit far apart in the codebase, often spread across different modules or buried within a single large file. Exact copy-pastes are easy to identify with other tools, and duplicates that are close to each other are easy to spot for humans or AI.

See slopo.dev for a more high-level description of the problem.

Supported languages: Python, TypeScript, JavaScript, Java, Kotlin, C#, Go, Rust

It takes a different approach than conventional duplication detection. For each code unit, it computes an embedding, then looks for pairs whose embeddings are close. Similar code is not necessarily a duplicate, so each pair is only a potential duplicate that still has to be confirmed. Conversely, code that does the same thing but is implemented in a completely different way produces distant embeddings and will not be detected.
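The idea of "pairs whose embeddings are close" can be illustrated with a minimal sketch. It is not Slopo's implementation: the vectors are made-up 3-dimensional stand-ins for real embeddings, the unit names are hypothetical, and the brute-force pair loop is only meant to show the principle.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def candidate_pairs(embeddings, threshold):
    """Return (name_a, name_b, similarity) for pairs at or above the threshold.

    `embeddings` maps a code unit name to its vector. The loop compares every
    pair, which is O(n^2) and only suitable for illustration.
    """
    pairs = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(embeddings.items(), 2):
        sim = cosine_similarity(vec_a, vec_b)
        if sim >= threshold:
            pairs.append((name_a, name_b, sim))
    # Most similar pairs first.
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Hypothetical code units with toy vectors (not real embedding output).
toy_embeddings = {
    "parsers/python.py::parse_function": [0.9, 0.1, 0.0],
    "parsers/java.py::parse_method": [0.8, 0.2, 0.1],
    "report/render.py::write_index": [0.0, 0.1, 0.9],
}

for name_a, name_b, sim in candidate_pairs(toy_embeddings, threshold=0.9):
    print(f"{name_a} <-> {name_b}: {sim:.3f}")
```

Only the two parser units end up as a candidate pair here; the report unit points in a different direction and falls below the threshold.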

The result is a set of groups (clusters) of similar code units, sorted by similarity and distance in the codebase. These clusters serve as input for your AI coding agent, which can check whether each cluster is a true duplicate. Reviewed clusters can be marked as ignored or moved forward for refactoring.

See documents/example-reports, generated from Slopo's own code (src directory, git tag v0.2.0).

This example confirmed that there is a lot of duplication in the per-language code parsers: some of it exact copies, some of it similar code. This needs to be reimplemented.

Installation uses uv (see the uv installation instructions), a Python package manager, to install slopo from PyPI in an isolated virtual environment. There is no need to install Python separately.

Run slopo init to create a config file template containing further instructions. Only the directory with the code to analyze and the embedding model configuration are required.

Embeddings are calculated using an external provider. For best results, consider a model dedicated to code, for example one from Voyage AI (it works fine with lower dimensions, e.g. 512).

You can use any model provider compatible with LiteLLM; see its documentation for details.

The provider API key can be set as an environment variable for better security.

Run slopo show-config to verify your configuration and show all configurable parameters. Most are optional with sensible defaults.

You are now ready to index the code, calculate embeddings, and generate a report:

slopo index
slopo embed
slopo analyze

This section demonstrates how Slopo can be used in a real development workflow.

It uses incremental re-indexing (updating the index only for changed files) and slopo.ignore.txt to discard groups that have already been reviewed.

Create your first analysis and check the results. You will find index.md, which contains a list of all clusters, and per-file cluster details.
You may want to exclude certain directories or file patterns; it’s usually a good idea to exclude tests. You can also tune the thresholds if the report is too large or too small.
Once satisfied with the analysis results, ask your AI coding agent to filter out groups that are not true duplicates. This is a common case, since not every piece of similar code is a duplicate that needs handling. Ask the AI agent to add the hashes of discarded clusters to slopo.ignore.txt.
Rerun the analysis to generate the report without the reviewed clusters. This is the basis for refactoring, which can be performed by an AI agent.
The ignore file can be committed to your Git repository and shared across the team. New and modified clusters will reappear in the report. The configuration file can also be committed, without the API key. Don’t commit slopo.db; it is your local data.
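The ignore-file step can be pictured with a small sketch. Everything specific in it is an assumption made for illustration, not Slopo's documented behavior: the hash function, the 12-character hash length, and the ignore-file syntax (one hash per line, with `#` comments) are all hypothetical. It only shows why a content-derived hash lets reviewed clusters stay hidden while modified ones reappear.

```python
import hashlib


def cluster_hash(code_units):
    """Stable hash of a cluster: order-independent, changes when any unit changes."""
    digest = hashlib.sha256()
    for code in sorted(code_units):
        digest.update(code.encode("utf-8"))
        digest.update(b"\0")  # separator so unit boundaries affect the hash
    return digest.hexdigest()[:12]


def load_ignored(text):
    """Parse ignore-file text (hypothetical format): one hash per line,
    blank lines and trailing # comments are skipped."""
    ignored = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            ignored.add(line)
    return ignored


def unreviewed(clusters, ignored):
    """Keep only clusters whose hash is not listed in the ignore set."""
    return [c for c in clusters if cluster_hash(c) not in ignored]


# A cluster that was reviewed and discarded as a false positive.
reviewed = ["def a(): return 1", "def b(): return 1"]
ignore_text = f"{cluster_hash(reviewed)}  # reviewed, not a true duplicate\n"
ignored = load_ignored(ignore_text)

# The same cluster after one unit was edited: its hash no longer matches.
modified = ["def a(): return 2", "def b(): return 1"]

for cluster in unreviewed([reviewed, modified], ignored):
    print(cluster_hash(cluster), cluster)
```

Because the hash depends on the code itself, editing any unit in a discarded cluster produces a new hash, so the cluster shows up again for review.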

Run slopo --help and slopo show-config to explore the options yourself at any time.

Most configuration is done with a configuration file, with two exceptions:

The location of the configuration file can be overridden with the --config option.
The API key can be set with the SLOPO_EMBEDDING_API_KEY environment variable, which is also loaded from a .env file in the current directory.

Keep in mind that the following parameters cannot be changed after the first indexing. To change them, you have to remove slopo.db and run index and embed from scratch: source_dir, embedding_model, embedding_dimensions, body_node_count_threshold.

All configurable parameters

source_dir: Source directory with the code to index, as an absolute or relative path.
source_dir_exclude: .gitignore-style patterns to exclude from indexing.
db_file: SQLite database file with the tool's data.
report_dir: Output directory for analysis reports.
ignore_file: Text file with ignored groups.
embedding_model: Embedding model name in LiteLLM format.
embedding_dimensions: Embedding dimensions compatible with the model used.
embedding_api_key: API key for the embedding provider. Optional if configured with an environment variable.
embedding_batch_size and embedding_batch_chars: Requests to the embedding API are batched for performance. The defaults are fine for most cases.
similarity_threshold: Controls the minimum cosine similarity between embeddings.
rerank_threshold: Controls the minimum similarity after applying a boost that reflects distance in the codebase.
body_node_count_threshold: Minimum number of AST nodes inside the body (excluding signatures and annotations). This value represents the minimum complexity a code unit must have to be included, which is a more accurate measure than text length. Increase it if you see unwanted, very small code units in the report.
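To get a feel for what counting AST nodes in a body means, here is a rough sketch using Python's built-in ast module. It is a stand-in for illustration only: Slopo's own parser and counting rules are not described here and may differ, the threshold value is hypothetical, and this version only skips the signature and decorators.

```python
import ast


def body_node_count(func):
    """Count AST nodes inside a function body, ignoring signature and decorators."""
    return sum(1 for stmt in func.body for _ in ast.walk(stmt))


source = '''
def get_id(self):
    return self.id

def total(items):
    result = 0
    for item in items:
        if item.price > 0:
            result += item.price * item.qty
    return result
'''

MIN_NODES = 10  # hypothetical threshold, not Slopo's default

for node in ast.parse(source).body:
    if isinstance(node, ast.FunctionDef):
        count = body_node_count(node)
        verdict = "keep" if count >= MIN_NODES else "skip"
        print(f"{node.name}: {count} body nodes -> {verdict}")
```

A trivial getter has only a handful of nodes and is skipped, while the loop-and-condition function clears the threshold. Renaming variables or reformatting the code changes the text length but not the node count, which is why node count is the more stable measure of complexity.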

Similar code units are filtered in two passes, each with its own configurable threshold. The pipeline is as follows:

similarity_threshold filters out code unit pairs whose embeddings are not sufficiently similar. Cosine similarity ranges from -1 to 1, where 1 means identical.
Similar pairs are divided into groups.
Units in groups are re-ranked after boosts are applied. For units in different files, the boost is calculated from the number of directory hops required to reach the other file in the pair (maximum 15%). If they are in the same file, the boost is calculated from the distance in number of lines (maximum 10%).
rerank_threshold filters out groups whose highest-scoring pair does not score high enough.
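The pipeline above can be sketched in a few functions. Only the two caps (15% and 10%) come from the description; everything else is an assumption made for illustration: the linear boost curve, the multiplicative application, the hop and line distances at which the cap is reached, and the use of connected components for grouping. The file paths are hypothetical.

```python
from pathlib import PurePosixPath

MAX_DIR_BOOST = 0.15     # cap stated in the description
MAX_LINE_BOOST = 0.10    # cap stated in the description
DIR_HOPS_FOR_MAX = 6     # assumption: hops at which the cap is reached
LINE_GAP_FOR_MAX = 1000  # assumption: line gap at which the cap is reached


def directory_hops(path_a, path_b):
    """Steps up from a's directory to the common ancestor, then down to b's."""
    dirs_a = PurePosixPath(path_a).parent.parts
    dirs_b = PurePosixPath(path_b).parent.parts
    common = 0
    for x, y in zip(dirs_a, dirs_b):
        if x != y:
            break
        common += 1
    return (len(dirs_a) - common) + (len(dirs_b) - common)


def boosted(similarity, unit_a, unit_b):
    """Apply a linear, capped boost; a unit is a (path, start_line) tuple."""
    (path_a, line_a), (path_b, line_b) = unit_a, unit_b
    if path_a == path_b:
        boost = MAX_LINE_BOOST * min(abs(line_a - line_b) / LINE_GAP_FOR_MAX, 1.0)
    else:
        hops = directory_hops(path_a, path_b)
        boost = MAX_DIR_BOOST * min(hops / DIR_HOPS_FOR_MAX, 1.0)
    return similarity * (1.0 + boost)


def group_pairs(pairs):
    """Merge pairs that share a unit into groups (connected components)."""
    parent = {}

    def find(u):
        parent.setdefault(u, u)
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for a, b, _ in pairs:
        parent[find(a)] = find(b)

    groups = {}
    for a, b, sim in pairs:
        groups.setdefault(find(a), []).append((a, b, sim))
    return list(groups.values())


def pipeline(pairs, similarity_threshold, rerank_threshold):
    """Filter pairs, group them, keep groups whose best boosted pair passes."""
    kept = [p for p in pairs if p[2] >= similarity_threshold]
    result = []
    for group in group_pairs(kept):
        best = max(boosted(sim, a, b) for a, b, sim in group)
        if best >= rerank_threshold:
            result.append((best, group))
    return sorted(result, key=lambda g: g[0], reverse=True)


# Hypothetical pairs: ((path, start_line), (path, start_line), cosine similarity)
pairs = [
    (("src/parsers/python.py", 10), ("src/parsers/java.py", 12), 0.93),
    (("src/parsers/java.py", 12), ("src/report/render.py", 40), 0.91),
    (("src/cli.py", 5), ("src/cli.py", 30), 0.90),
    (("src/cli.py", 5), ("src/config.py", 8), 0.70),
]

for score, group in pipeline(pairs, similarity_threshold=0.85, rerank_threshold=0.92):
    print(f"{score:.4f}: {len(group)} pair(s)")
```

In this toy run the last pair is dropped by similarity_threshold, the two nearby units in the same file get almost no boost and fall below rerank_threshold, and the group that spans two directories survives.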

The main goal of this tool is to detect non-exact code duplication, but exact copies (identical code on multiple paths) are also reported, just handled slightly differently from similar code:

The report shows the code once, listing each path where it appears, rather than repeating the same snippets.
The analyze command reports the “similarity ratio” (the share of code units marked as similar) in two variants: including and excluding exact copies.
