Add Claude Code SDK integration and randomization features #1

Open
scouzi1966 wants to merge 1 commit into main from feature/claude-sdk-integration
scouzi1966 (Owner) commented Aug 24, 2025

Summary

  • Added --context and --instruct options for enhanced Claude Code SDK integration
  • Added --randomize option to shuffle data before train/validation split
  • Replaced direct Anthropic API calls with proper Claude Code SDK integration
  • Updated dependencies and cleaned up project files

New Features

  • --context: Pass a context file to Claude Code for better dataset analysis
  • --instruct: Provide direct instructions to Claude for analysis customization
  • --randomize: Randomize examples before splitting into train/validation sets

Technical Changes

  • Migrated from direct anthropic library to claude-code-sdk
  • Added async Claude Code SDK integration with proper error handling
  • Updated requirements.txt to reflect new dependencies
  • Cleaned up .gitignore to include Claude Code configuration files

Test plan

  • Verify --randomize flag properly shuffles data before splitting
  • Test --context parameter with sample context file
  • Test --instruct parameter with custom instructions
  • Confirm Claude Code SDK integration works when available
  • Ensure backward compatibility when Claude SDK is not installed
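
The last test-plan item depends on the script degrading gracefully when the SDK is absent. A minimal sketch of one way to do that, assuming a module-level `CLAUDE_AVAILABLE` flag as the script's warning path suggests; `resolve_use_claude` is a hypothetical helper, not code from this PR:

```python
import importlib.util

# True only when the optional dependency can be found on the import path.
CLAUDE_AVAILABLE = importlib.util.find_spec("claude_code_sdk") is not None


def resolve_use_claude(requested: bool, available: bool = CLAUDE_AVAILABLE) -> bool:
    """Decide whether Claude analysis should run, warning if it was requested but unavailable."""
    if requested and not available:
        print("Warning: Claude Code SDK not available. Install with: pip install claude-code-sdk")
        print("Proceeding without intelligent analysis...")
        return False
    return requested
```

Passing `available` explicitly makes both branches testable on a machine where the SDK happens to be installed.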

🤖 Generated with Claude Code

Summary by Sourcery

Integrate the Claude Code SDK for dataset analysis and add data randomization support via new CLI options

New Features:

  • Add --context option to pass a context file to Claude Code
  • Add --instruct option to supply custom instructions to Claude Code
  • Add --randomize option to shuffle data before train/validation split

Enhancements:

  • Replace direct Anthropic API calls with asynchronous Claude Code SDK client and error handling
  • Extend dataset analysis to use context and instruction parameters and record rationale

Build:

  • Update requirements to use claude-code-sdk instead of anthropic
  • Clean up .gitignore to include Claude Code configuration files

- Add --context and --instruct options for Claude Code SDK integration
- Add --randomize option to shuffle data before train/validation split
- Replace direct Anthropic API calls with Claude Code SDK
- Update requirements.txt to use claude-code-sdk instead of anthropic
- Clean up and update .gitignore to include Claude Code files

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

sourcery-ai Bot commented Aug 24, 2025

Reviewer's Guide

This PR replaces direct Anthropic API calls with an async Claude Code SDK integration (including support for custom context and instructions), adds a randomization option before train/validation splitting, extends the CLI with new flags, and updates dependencies and project cleanup.

Sequence diagram for dataset analysis with Claude Code SDK and context/instructions

sequenceDiagram
    participant User
    participant CLI
    participant DatasetAnalyzer
    participant ClaudeSDKClient
    User->>CLI: Run with --use-claude-hook, --context, --instruct
    CLI->>DatasetAnalyzer: analyze_dataset_structure(dataset, sample_size, context_file, instruction)
    DatasetAnalyzer->>ClaudeSDKClient: _query_claude_sdk(prompt with context/instruction)
    ClaudeSDKClient-->>DatasetAnalyzer: Claude analysis response
    DatasetAnalyzer-->>CLI: Analysis results with Claude insights

Class diagram for updated DatasetAnalyzer with Claude Code SDK integration

classDiagram
    class DatasetAnalyzer {
        - use_claude: bool
        - analysis_results: dict
        - conversion_rationale: list
        + __init__(use_claude: bool = False)
        + analyze_dataset_structure(dataset, sample_size: int = 10, context_file: str = None, instruction: str = None) -> dict
        + _enhance_with_claude_analysis(analysis: dict, sample_data, context_file: str = None, instruction: str = None) -> dict
        + _query_claude_sdk(prompt: str) -> dict
        + generate_conversion_rationale(output_dir: str, analysis: dict, conversion_stats: dict)
    }
    DatasetAnalyzer --> ClaudeSDKClient : uses
    DatasetAnalyzer --> ClaudeCodeOptions : configures

Flow diagram for randomization before train/validation split

flowchart TD
    A[Converted Examples] -->|--randomize flag| B[Randomize Examples]
    B --> C[Split into Train/Validation]
    A -->|no randomize| C
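
The flow above can be sketched in a few lines. This is an illustration only: `split_examples` and the `train_ratio` parameter are assumed names, and the script's actual split logic may differ.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_examples(
    examples: Sequence[T], train_ratio: float = 0.8, randomize: bool = False
) -> Tuple[List[T], List[T]]:
    """Optionally shuffle, then cut the examples into train and validation lists."""
    items = list(examples)  # copy so the caller's sequence is left untouched
    if randomize:
        random.shuffle(items)  # the --randomize branch of the flowchart
    cut = int(len(items) * train_ratio)
    return items[:cut], items[cut:]
```

Without `randomize`, the split preserves source order, so any ordering in the dataset (for example, by label) carries into the validation set.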

File-Level Changes

Change: Integrate Claude Code SDK and replace Anthropic calls
Details:
  • Replace anthropic import with asyncio and claude_code_sdk imports
  • Update analyze_dataset_structure and _enhance_with_claude_analysis signatures to accept context and instruction parameters
  • Read context file and append instruction into the analysis prompt
  • Remove direct client.messages.create calls and route queries through a new async _query_claude_sdk helper with options and error handling
Files: hf_to_apple_jsonl.py

Change: Extend CLI with context, instruction, and randomization flags
Details:
  • Add --context, --instruct, and --randomize arguments to the main parser
  • Pass args.context and args.instruct into analyze_dataset_structure
  • Insert random.shuffle on converted_examples when --randomize is enabled
Files: hf_to_apple_jsonl.py

Change: Update dependencies and clean up project files
Details:
  • Remove anthropic dependency from requirements.txt
  • Update .gitignore to track Claude Code SDK configuration files
Files: requirements.txt, .gitignore
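
As a rough illustration of the "read context file and append instruction" step, the prompt assembly might look like the sketch below. `build_analysis_prompt` and the section labels are assumptions for the example, not the PR's actual code.

```python
from pathlib import Path
from typing import Optional


def build_analysis_prompt(
    base_prompt: str,
    context_file: Optional[str] = None,
    instruction: Optional[str] = None,
) -> str:
    """Append optional context-file contents and user instructions to the base prompt."""
    parts = [base_prompt]
    if context_file:
        # Contents of the file passed via --context
        context = Path(context_file).read_text(encoding="utf-8")
        parts.append(f"Additional context:\n{context}")
    if instruction:
        # Text passed via --instruct
        parts.append(f"Additional instructions:\n{instruction}")
    return "\n\n".join(parts)
```

A missing context file raises `FileNotFoundError` here; whether to fail or warn and continue is a choice the script would need to make.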

sourcery-ai Bot left a comment

Hey there - I've reviewed your changes - here's some feedback:

  • Consider adding an optional seed parameter for the --randomize flag so users can reproduce the same shuffled splits.
  • Instead of calling asyncio.run inside _enhance_with_claude_analysis, think about managing a single event loop or converting the CLI entry point to async to avoid creating and closing loops repeatedly.
  • Add validation or a warning when --context or --instruct is provided without --use-claude-hook to prevent confusion for users who don’t enable Claude integration.
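
For the third point, one possible shape for the check is sketched here. The parser is a cut-down stand-in for the script's real one, and `unused_claude_flags` is a hypothetical helper.

```python
import argparse
import sys
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Reduced parser with only the flags relevant to this check."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--use-claude-hook", action="store_true")
    parser.add_argument("--context")
    parser.add_argument("--instruct")
    return parser


def unused_claude_flags(args: argparse.Namespace) -> List[str]:
    """Return the Claude-only flags that were supplied without --use-claude-hook."""
    if args.use_claude_hook:
        return []
    supplied = (("--context", args.context), ("--instruct", args.instruct))
    return [flag for flag, value in supplied if value]


if __name__ == "__main__":
    parsed = build_parser().parse_args()
    for flag in unused_claude_flags(parsed):
        print(f"Warning: {flag} is ignored without --use-claude-hook", file=sys.stderr)
```

Calling `parser.error(...)` instead of printing would turn the warning into a hard failure, which may be preferable if silent no-ops are considered a bug.
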
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Consider adding an optional seed parameter for the `--randomize` flag so users can reproduce the same shuffled splits.
- Instead of calling `asyncio.run` inside `_enhance_with_claude_analysis`, think about managing a single event loop or converting the CLI entry point to async to avoid creating and closing loops repeatedly.
- Add validation or a warning when `--context` or `--instruct` is provided without `--use-claude-hook` to prevent confusion for users who don’t enable Claude integration.

## Individual Comments

### Comment 1
<location> `hf_to_apple_jsonl.py:257` </location>
<code_context>
-            
-            claude_analysis = json.loads(response.content[0].text)
+            # Use Claude Code SDK instead of direct API calls
+            claude_analysis = asyncio.run(self._query_claude_sdk(prompt))
             analysis['claude_insights'] = claude_analysis

</code_context>

<issue_to_address>
Using asyncio.run in synchronous context may cause issues if already in an event loop.

asyncio.run will fail if an event loop is already active. Consider checking for an existing loop or providing a fallback approach.
</issue_to_address>
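
One possible fallback, sketched under the assumption that blocking the caller is acceptable: detect a running loop and, if there is one, run the coroutine on a fresh loop in a worker thread. `run_coroutine_sync` is a hypothetical helper name.

```python
import asyncio
import concurrent.futures
from typing import Any, Coroutine, TypeVar

T = TypeVar("T")


def run_coroutine_sync(coro: Coroutine[Any, Any, T]) -> T:
    """Run a coroutine to completion from synchronous code."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run is safe.
        return asyncio.run(coro)
    # A loop is already running here; use a new loop in a worker thread instead.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()
```

The second branch blocks the already-running loop until the coroutine finishes, so it suits notebook or REPL use more than a busy async server.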

### Comment 2
<location> `hf_to_apple_jsonl.py:268` </location>
<code_context>

         return analysis

+    async def _query_claude_sdk(self, prompt: str) -> Dict[str, Any]:
+        """Query Claude Code SDK with the analysis prompt."""
+        try:
</code_context>

<issue_to_address>
Error handling in _query_claude_sdk could be more robust for SDK failures.

Consider logging or propagating specific details for SDK failures, such as network or authentication errors, to improve troubleshooting.

Suggested implementation:

```python
# Module level, alongside the existing imports
import logging
```

```python
    async def _query_claude_sdk(self, prompt: str) -> Dict[str, Any]:
        """Query Claude Code SDK with the analysis prompt."""
        try:
            # Assumes the SDK's streaming client, with AssistantMessage and TextBlock
            # imported from claude_code_sdk; keep the method's existing query code if it differs
            response_text = ""
            async with ClaudeSDKClient(options=ClaudeCodeOptions()) as client:
                await client.query(prompt)
                async for message in client.receive_response():
                    if isinstance(message, AssistantMessage):
                        for block in message.content:
                            if isinstance(block, TextBlock):
                                response_text += block.text
            return json.loads(response_text)
        except Exception as e:
            logging.error(f"Claude SDK query failed: {e}", exc_info=True)
            # Optionally, catch specific exception types first if Claude SDK provides them
            # For example (hypothetical class names):
            # except ClaudeSDKNetworkError as ne:
            #     logging.error(f"Network error: {ne}")
            #     return {"error": "network_error", "details": str(ne)}
            # except ClaudeSDKAuthError as ae:
            #     logging.error(f"Authentication error: {ae}")
            #     return {"error": "auth_error", "details": str(ae)}
            return {"error": "sdk_failure", "details": str(e)}

```

If the Claude SDK provides specific exception classes (e.g., `ClaudeSDKNetworkError`, `ClaudeSDKAuthError`), you should catch those explicitly for more granular error handling. Also, ensure that the logging configuration is set up elsewhere in your codebase to capture these logs.
</issue_to_address>

### Comment 3
<location> `hf_to_apple_jsonl.py:542` </location>
<code_context>
         action="store_true",
         help="Enable Claude Code SDK integration for intelligent dataset analysis (requires anthropic package)"
     )
+    parser.add_argument(
+        "--context",
+        help="Path to context file to pass to Claude Code (only used with --use-claude-hook)"
+    )
+    parser.add_argument(
+        "--instruct",
+        help="Instructions to provide to Claude (only used with --use-claude-hook)"
+    )
+    parser.add_argument(
+        "--randomize",
+        action="store_true",
+        help="Randomize the data before splitting into train/validation sets"
+    )

</code_context>

<issue_to_address>
Randomization uses random.shuffle without seeding, which may affect reproducibility.

Consider adding an option for users to set a random seed, or clearly document that enabling randomization will result in non-deterministic splits.

Suggested implementation:

```python
    parser.add_argument(
        "--randomize",
        action="store_true",
        help="Randomize the data before splitting into train/validation sets (non-deterministic unless --random-seed is set)"
    )
    parser.add_argument(
        "--random-seed",
        type=int,
        default=None,
        help="Random seed for reproducible randomization (used with --randomize)"
    )


```

```python
        if self.use_claude:
            analysis = self._enhance_with_claude_analysis(analysis, sample_data, context_file, instruction)

```

You will need to:
1. Import the `random` module at the top of the file if it is not already imported.
2. When randomization is performed (wherever `random.shuffle` is called), set the seed using `random.seed(args.random_seed)` if `args.random_seed` is not None, before shuffling.
3. Pass the `random_seed` argument from the parsed CLI args to the relevant function(s) that handle randomization.
</issue_to_address>
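
A variation on the suggestion above, offered as a sketch: instead of seeding the global generator with `random.seed`, use a local `random.Random` instance so the shuffle is reproducible without affecting other users of the `random` module. `shuffled` is a hypothetical helper name.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def shuffled(examples: Sequence[T], seed: Optional[int] = None) -> List[T]:
    """Return a shuffled copy; the same seed always yields the same order."""
    rng = random.Random(seed)  # local generator; global random state is untouched
    items = list(examples)
    rng.shuffle(items)
    return items
```

With `seed=None` the generator is seeded from system entropy, which matches the current non-deterministic behaviour of `--randomize`.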

### Comment 4
<location> `hf_to_apple_jsonl.py:580` </location>
<code_context>
     if args.use_claude_hook:
         if not CLAUDE_AVAILABLE:
-            print("Warning: Claude Code SDK not available. Install with: pip install anthropic")
+            print("Warning: Claude Code SDK not available. Install with: pip install claude-code-sdk")
             print("Proceeding without intelligent analysis...")
         else:
</code_context>

<issue_to_address>
Update help text to clarify SDK requirements.

The help text for --use-claude-hook still refers to anthropic; please update it to mention claude-code-sdk for consistency.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
        help="Enable Claude Code SDK integration for intelligent dataset analysis (requires anthropic package)"
=======
        help="Enable Claude Code SDK integration for intelligent dataset analysis (requires claude-code-sdk; install with: pip install claude-code-sdk)"
>>>>>>> REPLACE

</suggested_fix>

Comment thread hf_to_apple_jsonl.py
response_text += block.text

# Parse the JSON response
claude_analysis = json.loads(response_text)

issue (code-quality): Inline variable that is immediately returned (inline-immediately-returned-variable)
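
The refactor this lint rule asks for is small. A sketch, assuming the parsed value is returned on the very next line (`parse_claude_response` is a hypothetical stand-in for the tail of the method):

```python
import json
from typing import Any, Dict


def parse_claude_response(response_text: str) -> Dict[str, Any]:
    """Parse the accumulated response text as JSON."""
    # Before: claude_analysis = json.loads(response_text); return claude_analysis
    return json.loads(response_text)
```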
