1.2.1 rc #111

Open
rostilos wants to merge 52 commits into main from 1.2.1-rc

Conversation

rostilos (Owner) commented Feb 3, 2026

Summary by CodeRabbit

  • New Features

    • Added per-project token limits for analysis configuration, replacing global AI connection settings.
    • Implemented token estimation to prevent oversized pull request analysis.
    • Enhanced PR analysis with historical issue tracking and deduplication across versions.
    • Improved webhook processing with locking mechanisms and duplicate detection.
  • Bug Fixes

    • Fixed concurrent PR analysis race conditions through distributed locking.
  • Chores

    • Refactored RAG indexing system with improved collection and branch management.
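The token estimation mentioned above can be sketched roughly as follows. This is a hypothetical illustration under a simple characters-per-token heuristic, not the PR's actual implementation; the names `estimate_tokens`, `check_diff_size`, and `DiffTooLargeError` are assumptions (the Java services reportedly throw a `DiffTooLargeException`).

```python
# Hypothetical sketch of a pre-analysis token guard; all names here are
# illustrative assumptions, not the PR's actual API.

CHARS_PER_TOKEN = 4  # rough average for English prose and source code


class DiffTooLargeError(Exception):
    """Raised when an estimated diff size exceeds the project's token limit."""


def estimate_tokens(text: str) -> int:
    """Approximate the token count of a diff from its character length."""
    return len(text) // CHARS_PER_TOKEN


def check_diff_size(diff: str, max_analysis_token_limit: int) -> None:
    """Skip analysis up front if the diff would exceed the project limit."""
    estimated = estimate_tokens(diff)
    if estimated > max_analysis_token_limit:
        raise DiffTooLargeError(
            f"estimated {estimated} tokens exceeds project limit "
            f"{max_analysis_token_limit}"
        )
```

A webhook handler would catch this error and post an informative message to the VCS instead of running analysis, matching the graceful-skip behavior described in the commit messages.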

rostilos and others added 30 commits January 23, 2026 03:33
fix: Implement atomic upsert for command rate limiting to prevent rac…
fix: Enhance alias management by implementing backup and migration strategy for direct collections
fix: Enhance AI analysis by incorporating full PR issue history and resolution tracking
fix: Update issue reconciliation logic to handle previous issues in both incremental and full modes
fix: Improve handling of issue resolution status and logging for bett…
fix: Implement method to retrieve branch differences from GitLab API
- Updated JobService to use REQUIRES_NEW transaction propagation for deleting ignored jobs, ensuring fresh entity retrieval and preventing issues with the calling transaction.
- Removed token limitation from AI connection model and related DTOs, transitioning to project-level configuration for token limits.
- Adjusted AIConnectionDTO tests to reflect the removal of token limitation.
- Enhanced Bitbucket, GitHub, and GitLab AI client services to check token limits before analysis, throwing DiffTooLargeException when limits are exceeded.
- Updated command processors to utilize project-level token limits instead of AI connection-specific limits.
- Modified webhook processing to handle diff size issues gracefully, posting informative messages to VCS when analysis is skipped due to large diffs.
- Cleaned up integration tests to remove references to token limitation in AI connection creation and updates.
…sis processing. Project PR analysis max analysis token limit implementation
fix: Enhance logging and implement deterministic context retrieval in RAG pipeline
rostilos commented Feb 3, 2026

/codecrow analyze

codecrow-ai bot commented Feb 3, 2026

⚠️ CodeCrow Command Failed

Comment commands are not enabled for this project


Check the job logs in CodeCrow for detailed error information.

codecrow-local bot commented Feb 3, 2026

⚠️ Code Analysis Results

Summary

Pull Request Review: 1.2.1 rc

Status: PASS WITH WARNINGS
Risk Level: HIGH
Review Coverage: 103 files analyzed in depth
Confidence: HIGH

Executive Summary

PR 1.2.1 rc is a major cross-ecosystem release for CodeCrow, introducing a significant database migration to remove token limitations, security configuration updates, and a refactored RAG pipeline. While the architectural direction is sound, the current implementation contains several high-severity logic errors—including broken Tree-sitter integration and missing metadata in core Java services—that will cause runtime failures in the analysis pipeline.

Recommendation

Decision: PASS WITH WARNINGS

While this PR is technically a candidate for release, there are several high-risk blockers regarding AST parsing and data integrity that must be resolved before merging. It is strongly recommended to address the identified NameErrors, NullPointerExceptions, and broken Python query runners to ensure system stability.


Note: Detailed technical findings and suggested fixes have been posted in a separate comment.

Issues Overview

🔴 High: 4 (critical issues requiring immediate attention)
🟡 Medium: 8 (issues that should be addressed)
🔵 Low: 3 (minor issues and improvements)

Analysis completed on 2026-02-03 01:56:18


📋 Detailed Issues (15)

🔴 High Severity Issues

Id on Platform: 1777

Category: 🐛 Bug Risk

File: .../service/CodeAnalysisService.java:132

Issue: The variable 'commitHash' is passed as an argument to 'createIssueFromData' but is never declared or initialized within the 'fillAnalysisData' method scope. While 'prNumber' and 'analysisId' are correctly extracted from 'savedAnalysis', 'commitHash' is missing.

💡 Suggested Fix

Retrieve 'commitHash' from the 'savedAnalysis' object before entering the issues loop.

--- a/java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/service/CodeAnalysisService.java
+++ b/java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/service/CodeAnalysisService.java
@@ -115,2 +115,3 @@
             Long prNumber = savedAnalysis.getPrNumber();
+            String commitHash = savedAnalysis.getCommitHash();


Id on Platform: 1780

Category: 🐛 Bug Risk

File: .../index_manager/indexer.py:246

Issue: Variable 'existing_other_branch_points' is referenced in the 'finally' block but is never defined in 'index_repository'. This will raise a 'NameError' every time the indexing completes or fails, preventing 'gc.collect()' from running and potentially confusing error logs.

💡 Suggested Fix

Remove the 'del existing_other_branch_points' line. The refactored code uses 'stream_copy_points_to_collection' which does not store the points in a local list variable.

--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/indexer.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/indexer.py
@@ -252,5 +252,4 @@
             raise e
         finally:
-            del existing_other_branch_points
             gc.collect()


Id on Platform: 1783

Category: 🐛 Bug Risk

File: .../splitter/query_runner.py

Issue: The code attempts to use QueryCursor from the tree_sitter package. However, QueryCursor was removed in the tree-sitter Python bindings version 0.22.0. Since tree_parser.py explicitly targets v0.23+, this code will fail with an ImportError or AttributeError during execution.

💡 Suggested Fix

Remove usage of QueryCursor and call query.matches() directly on the Query object as per the modern tree-sitter API.

--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/query_runner.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/query_runner.py
@@ -194,8 +194,8 @@
         try:
-            # Use QueryCursor.matches() for pattern-grouped results
-            # Each match is (pattern_id, {capture_name: [nodes]})
-            from tree_sitter import QueryCursor
-            cursor = QueryCursor(query)
-            raw_matches = list(cursor.matches(tree.root_node))
+            # In tree-sitter 0.22+, QueryCursor is removed.
+            # Use query.matches() directly which returns a list of Match objects.
+            raw_matches = query.matches(tree.root_node)
         except Exception as e:
@@ -206,3 +206,5 @@
-        for pattern_id, captures_dict in raw_matches:
+        for m in raw_matches:
+            pattern_id = m.pattern_index
+            captures_dict = m.captures
             # Determine pattern type from captures


Id on Platform: 1787

Category: 🐛 Bug Risk

File: .../service/ProjectService.java:591

Issue: Potential NullPointerException due to unboxing. The variable 'currentMaxTokenLimit' is a primitive 'int'. The code checks if 'currentConfig' is not null, but it does not check if 'currentConfig.maxAnalysisTokenLimit()' returns null. For projects created before this field was added, the JSON-mapped field will be null, and assigning this null value to an 'int' will cause a NullPointerException at runtime.

💡 Suggested Fix

Add a null check for the result of currentConfig.maxAnalysisTokenLimit() before unboxing it to a primitive int.

--- a/java-ecosystem/services/web-server/src/main/java/org/rostilos/codecrow/webserver/project/service/ProjectService.java
+++ b/java-ecosystem/services/web-server/src/main/java/org/rostilos/codecrow/webserver/project/service/ProjectService.java
@@ -589,7 +589,8 @@
         var ragConfig = currentConfig != null ? currentConfig.ragConfig() : null;
         var commentCommands = currentConfig != null ? currentConfig.commentCommands() : null;
-        int currentMaxTokenLimit = currentConfig != null ? currentConfig.maxAnalysisTokenLimit() : ProjectConfig.DEFAULT_MAX_ANALYSIS_TOKEN_LIMIT;
+        int currentMaxTokenLimit = (currentConfig != null && currentConfig.maxAnalysisTokenLimit() != null)
+                ? currentConfig.maxAnalysisTokenLimit() : ProjectConfig.DEFAULT_MAX_ANALYSIS_TOKEN_LIMIT;
 
         Boolean newPrAnalysis = prAnalysisEnabled != null ? prAnalysisEnabled :


🟡 Medium Severity Issues

Id on Platform: 1776

Category: 🐛 Bug Risk

File: .../queries/python.scm

Issue: The 'type_alias_statement' in the standard 'tree-sitter-python' grammar (supporting PEP 695) uses the field name 'name' for the identifier, not 'left'. The 'left' field is typically used in 'assignment' nodes but not in type aliases. Furthermore, the identifier node type is 'type_identifier'. The current query using 'left: (type)' will likely fail to match on standard parsers.

💡 Suggested Fix

Update the field name to 'name' and the node type to 'type_identifier' to align with the standard tree-sitter grammar for Python.

--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/queries/python.scm
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/queries/python.scm
@@ -26,4 +26,3 @@
 ; Type alias (Python 3.12+)
-; tree-sitter grammar uses field "left" with type "type" for the alias name
 (type_alias_statement
-  left: (type) @name) @definition.type_alias
+  name: (type_identifier) @name) @definition.type_alias


Id on Platform: 1778

Category: 🐛 Bug Risk

File: .../ai/AiAnalysisRequestImpl.java

Issue: The deduplication logic fails to merge resolved status when the input list is sorted in descending order (newest first). Since 'existing' in the map will always be the newer version and 'issue' from the loop will be the older version, the condition 'currentVersion > existingVersion' will never be true for historical issues. Consequently, the resolved status from previous iterations will not be propagated to the current version.

💡 Suggested Fix

Add an 'else if' block to handle the case where the existing issue is newer than the current one, ensuring that if the older issue (current iteration) was resolved, that status is merged into the newer one.

--- a/java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/dto/request/ai/AiAnalysisRequestImpl.java
+++ b/java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/dto/request/ai/AiAnalysisRequestImpl.java
@@ -317,6 +317,11 @@
                         } else {
                             deduped.put(fingerprint, issue);
                         }
+                    } else if (existingVersion > currentVersion) {
+                        // Existing is newer - preserve resolved status if current (older) is resolved
+                        if (currentResolved && !existingResolved) {
+                            deduped.put(fingerprint, mergeResolvedStatus(existing, issue));
+                        }
                     } else if (existingVersion == currentVersion) {
                         // Same version - prefer resolved one
                         if (currentResolved && !existingResolved) {


Id on Platform: 1779

Category: ⚡ Performance

File: .../index_manager/branch_manager.py

Issue: The method 'copy_points_to_collection' accepts a 'List' of points. In RAG applications, collections can contain millions of high-dimensional vectors. Loading an entire collection into a Python list will likely trigger an Out-Of-Memory (OOM) error. Since 'stream_copy_points_to_collection' already exists as a memory-efficient alternative, this method is redundant and dangerous.

💡 Suggested Fix

Remove the 'copy_points_to_collection' method and encourage the use of 'stream_copy_points_to_collection' to handle data migration in batches.

--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/branch_manager.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/branch_manager.py
@@ -181,27 +181,2 @@
-    
-    def copy_points_to_collection(
-        self,
-        points: List,
-        target_collection: str,
-        batch_size: int = 50
-    ) -> None:
-        """Copy preserved points to a new collection."""
-        if not points:
-            return
-        
-        logger.info(f"Copying {len(points)} points to {target_collection}...")
-        
-        for i in range(0, len(points), batch_size):
-            batch = points[i:i + batch_size]
-            points_to_upsert = [
-                PointStruct(
-                    id=p.id,
-                    vector=p.vector,
-                    payload=p.payload
-                ) for p in batch
-            ]
-            self.client.upsert(
-                collection_name=target_collection,
-                points=points_to_upsert
-            )
-        
-        logger.info("Points copied successfully")


Id on Platform: 1781

Category: 🐛 Bug Risk

File: .../splitter/languages.py

Issue: The TREESITTER_MODULES dictionary is missing mappings for several languages explicitly listed in AST_SUPPORTED_LANGUAGES and LANGUAGE_TO_TREESITTER, such as Kotlin, Scala, Lua, Perl, Swift, and Haskell. This will cause tree_parser.py to fail when attempting to load these languages, effectively disabling AST support for them despite the code indicating they are supported.

💡 Suggested Fix

Add the missing tree-sitter module mappings to TREESITTER_MODULES for all languages listed in AST_SUPPORTED_LANGUAGES.

--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/languages.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/languages.py
@@ -121,3 +121,9 @@
     'ruby': ('tree_sitter_ruby', 'language'),
     'php': ('tree_sitter_php', 'language_php'),
+    'kotlin': ('tree_sitter_kotlin', 'language'),
+    'scala': ('tree_sitter_scala', 'language'),
+    'lua': ('tree_sitter_lua', 'language'),
+    'perl': ('tree_sitter_perl', 'language'),
+    'swift': ('tree_sitter_swift', 'language'),
+    'haskell': ('tree_sitter_haskell', 'language'),
 }


Id on Platform: 1782

Category: 🧹 Code Quality

File: .../splitter/metadata.py:165

Issue: The Java method name regex requires an explicit visibility modifier (public, private, or protected). It will fail to capture package-private methods (which have no modifier) and methods preceded by annotations on the same line (e.g., @Override public void ...).

💡 Suggested Fix

Update the regex to make visibility modifiers optional and allow for potential annotations or other modifiers.

--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py
@@ -164,3 +164,3 @@
                 re.compile(r'(?:public\s+|private\s+|protected\s+)?(?:abstract\s+|final\s+)?class\s+(\w+)', re.MULTILINE),
                 re.compile(r'(?:public\s+)?interface\s+(\w+)', re.MULTILINE),
-                re.compile(r'(?:public|private|protected)\s+(?:static\s+)?[\w<>,\s]+\s+(\w+)\s*\(', re.MULTILINE),
+                re.compile(r'(?:@\w+\s+)*(?:(?:public|private|protected|static|final|synchronized)\s+)*[\w<>,\s\[\]]+\s+(\w+)\s*\(', re.MULTILINE),


Id on Platform: 1784

Category: 🐛 Bug Risk

File: .../webhookhandler/GitHubPullRequestWebhookHandler.java

Issue: When a DiffTooLargeException or AnalysisLockedException is caught and re-thrown, the logic to update the placeholder comment with an error message (located in the generic Exception catch block) is bypassed. This leaves the PR with a stale 'Analysis in progress' comment that never updates.

💡 Suggested Fix

Move the placeholder error update logic into a 'finally' block or ensure it is called within the specific exception catch block before re-throwing.

--- a/java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/github/webhookhandler/GitHubPullRequestWebhookHandler.java
+++ b/java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/github/webhookhandler/GitHubPullRequestWebhookHandler.java
@@ -188,6 +188,9 @@
             if (acquiredLockKey != null) {
                 analysisLockService.releaseLock(acquiredLockKey);
             }
+            if (placeholderCommentId != null) {
+                vcsReportingService.updatePlaceholderWithError(project, Long.parseLong(payload.pullRequestId()), placeholderCommentId, e.getMessage());
+            }
             log.warn("PR analysis failed with recoverable exception for project {}: {}", project.getId(), e.getMessage());
             throw e;


Id on Platform: 1788

Category: 🐛 Bug Risk

File: .../api/api.py

Issue: The hybrid search logic could lead to incomplete context. Files in 'all_pr_changed_files' are excluded from the main branch query. If these files are not subsequently returned by the semantic search of PR-indexed data (which is limited to 'top_k=15' and based on relevance), they will be entirely missing from the context. This may omit critical files that were specifically requested in 'changed_files' but failed to rank high in the semantic retrieval.

💡 Suggested Fix

Ensure all files in 'changed_files' are explicitly fetched from the PR index by path if they are missing from the semantic results, or increase the result limit for PR-specific chunks to ensure better coverage.

No suggested fix provided
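Since no fix diff was provided, a minimal sketch of the suggested backfill follows. The names `build_pr_context` and `fetch_chunks_by_path` are illustrative assumptions, not the RAG pipeline's actual API.

```python
# Hypothetical sketch: after semantic retrieval from the PR index, fetch any
# changed file that failed to rank in the top-k explicitly by path, so
# requested files are never missing from the context.

def build_pr_context(changed_files, semantic_results, fetch_chunks_by_path):
    """Guarantee each changed file contributes at least one chunk.

    semantic_results is the relevance-ranked chunk list (dicts with a "path"
    key); fetch_chunks_by_path is a by-path lookup into the PR index.
    """
    covered = {chunk["path"] for chunk in semantic_results}
    context = list(semantic_results)
    for path in changed_files:
        if path not in covered:
            # Explicit by-path lookup backfills low-ranking files.
            context.extend(fetch_chunks_by_path(path))
    return context
```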


Id on Platform: 1790

Category: 🐛 Bug Risk

File: .../utils/utils.py

Issue: The heuristic used to identify archive root directories—specifically checking for 2+ hyphens and any digit—is likely to produce false positives. Common directory names in many projects (e.g., 'web-api-v1', 'react-app-2', 'service-worker-0') match this pattern. This would cause the top-level directory to be incorrectly stripped from the file path, leading to incorrect indexing in the RAG pipeline or potential file collisions.

💡 Suggested Fix

Strengthen the heuristic by requiring a higher hyphen count or a minimum string length (e.g., > 30 characters) for parts with multiple hyphens to better target VCS-generated archive names which usually include long commit hashes.

--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/utils/utils.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/utils/utils.py
@@ -103,5 +103,5 @@
     looks_like_archive = (
         '-' in first_part and len(first_part) > 20 or  # owner-repo-commit
         len(first_part) >= 40 or  # Just commit hash
-        (first_part.count('-') >= 2 and any(c.isdigit() for c in first_part))  # Has digits and multiple hyphens
+        (first_part.count('-') >= 2 and len(first_part) > 30 and any(c.isdigit() for c in first_part))
     )


🔵 Low Severity Issues

Id on Platform: 1785

Category: 🧹 Code Quality

File: .../webhookhandler/GitLabMergeRequestWebhookHandler.java:187

Issue: Inconsistency between GitHub and GitLab handlers: the GitLab handler does not re-throw DiffTooLargeException or AnalysisLockedException. This prevents the WebhookAsyncProcessor (which likely manages job state/retries) from knowing the specific failure cause.

💡 Suggested Fix

Align error handling with GitHubPullRequestWebhookHandler by specifically catching and re-throwing recoverable/business exceptions.

No suggested fix provided


Id on Platform: 1786

Category: 🎨 Style

File: .../gitlab/GitLabClient.java

Issue: The GitLab API 'diff' field already contains the unified diff headers ('--- a/...' and '+++ b/...'). Manually appending these lines creates redundant headers in the final diff string, which while often ignored by parsers, is non-standard.

💡 Suggested Fix

Remove the manual append of '---' and '+++' lines as they are already included in the 'diffContent' provided by GitLab.

--- a/java-ecosystem/libs/vcs-client/src/main/java/org/rostilos/codecrow/vcsclient/gitlab/GitLabClient.java
+++ b/java-ecosystem/libs/vcs-client/src/main/java/org/rostilos/codecrow/vcsclient/gitlab/GitLabClient.java
@@ -533,16 +533,6 @@
                     diffBuilder.append("rename to ").append(newPath).append("\n");
                 }
                 
-                // Proper unified diff headers: /dev/null for new/deleted files
-                if (newFile) {
-                    diffBuilder.append("--- /dev/null\n");
-                    diffBuilder.append("+++ b/").append(newPath).append("\n");
-                } else if (deletedFile) {
-                    diffBuilder.append("--- a/").append(oldPath).append("\n");
-                    diffBuilder.append("+++ /dev/null\n");
-                } else {
-                    diffBuilder.append("--- a/").append(oldPath).append("\n");
-                    diffBuilder.append("+++ b/").append(newPath).append("\n");
-                }
-                
                 if (diffContent != null && !diffContent.isEmpty()) {


Id on Platform: 1789

Category: ⚡ Performance

File: .../api/api.py:736

Issue: The 'index_pr_files' endpoint accepts a list of full file contents. For large PRs, this synchronously loads the entire set of files into memory and processes them. This could cause memory spikes or timeouts (OOM) for the worker process.

💡 Suggested Fix

Implement a batch size limit for the indexing request or move the indexing logic to an asynchronous background task.

No suggested fix provided
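Since no fix was provided, a minimal sketch of the batch-size guard follows; `validate_index_request` and `MAX_FILES_PER_REQUEST` are assumed names and an assumed limit, to be tuned per deployment.

```python
# Hypothetical sketch of a request-size guard for the indexing endpoint:
# reject oversized synchronous requests before loading file contents.

MAX_FILES_PER_REQUEST = 50  # assumed limit; tune to the worker memory budget


def validate_index_request(files: list) -> None:
    """Raise early if a single indexing request carries too many files."""
    if len(files) > MAX_FILES_PER_REQUEST:
        raise ValueError(
            f"too many files in one request ({len(files)} > "
            f"{MAX_FILES_PER_REQUEST}); split into batches or enqueue "
            "as a background task"
        )
```

The endpoint would call this before reading file contents; the alternative the reviewer mentions is to hand the work to an asynchronous background task so the request returns immediately.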


Files Affected

  • .../api/api.py: 2 issues
  • .../splitter/metadata.py: 1 issue
  • .../splitter/languages.py: 1 issue
  • .../webhookhandler/GitLabMergeRequestWebhookHandler.java: 1 issue
  • .../gitlab/GitLabClient.java: 1 issue
  • .../webhookhandler/GitHubPullRequestWebhookHandler.java: 1 issue
  • .../service/ProjectService.java: 1 issue
  • .../utils/utils.py: 1 issue
  • .../splitter/query_runner.py: 1 issue
  • .../service/CodeAnalysisService.java: 1 issue
