Add Claude Code SDK integration and randomization features by scouzi1966 · Pull Request #1 · scouzi1966/HfToAppleTrain

scouzi1966 · 2025-08-24T01:21:56Z

Summary

Added --context and --instruct options for enhanced Claude Code SDK integration
Added --randomize option to shuffle data before train/validation split
Replaced direct Anthropic API calls with proper Claude Code SDK integration
Updated dependencies and cleaned up project files

New Features

--context: Pass a context file to Claude Code for better dataset analysis
--instruct: Provide direct instructions to Claude for analysis customization
--randomize: Randomize examples before splitting into train/validation sets

Technical Changes

Migrated from direct anthropic library to claude-code-sdk
Added async Claude Code SDK integration with proper error handling
Updated requirements.txt to reflect new dependencies
Cleaned up .gitignore to include Claude Code configuration files

Test plan

Verify --randomize flag properly shuffles data before splitting
Test --context parameter with sample context file
Test --instruct parameter with custom instructions
Confirm Claude Code SDK integration works when available
Ensure backward compatibility when Claude SDK is not installed

🤖 Generated with Claude Code

Summary by Sourcery

Integrate the Claude Code SDK for dataset analysis and add data randomization support via new CLI options

New Features:

Add --context option to pass a context file to Claude Code
Add --instruct option to supply custom instructions to Claude Code
Add --randomize option to shuffle data before train/validation split

Enhancements:

Replace direct Anthropic API calls with asynchronous Claude Code SDK client and error handling
Extend dataset analysis to use context and instruction parameters and record rationale

Build:

Update requirements to use claude-code-sdk instead of anthropic
Clean up .gitignore to include Claude Code configuration files

- Add --context and --instruct options for Claude Code SDK integration
- Add --randomize option to shuffle data before train/validation split
- Replace direct Anthropic API calls with Claude Code SDK
- Update requirements.txt to use claude-code-sdk instead of anthropic
- Clean up and update .gitignore to include Claude Code files

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

sourcery-ai · 2025-08-24T01:22:06Z

Reviewer's Guide

This PR replaces direct Anthropic API calls with an async Claude Code SDK integration (including support for custom context and instructions), adds a randomization option before train/validation splitting, extends the CLI with new flags, and updates dependencies and project cleanup.

Sequence diagram for dataset analysis with Claude Code SDK and context/instructions

sequenceDiagram
    participant User
    participant CLI
    participant DatasetAnalyzer
    participant ClaudeSDKClient
    User->>CLI: Run with --use-claude-hook, --context, --instruct
    CLI->>DatasetAnalyzer: analyze_dataset_structure(dataset, sample_size, context_file, instruction)
    DatasetAnalyzer->>ClaudeSDKClient: _query_claude_sdk(prompt with context/instruction)
    ClaudeSDKClient-->>DatasetAnalyzer: Claude analysis response
    DatasetAnalyzer-->>CLI: Analysis results with Claude insights

Class diagram for updated DatasetAnalyzer with Claude Code SDK integration

classDiagram
    class DatasetAnalyzer {
        - use_claude: bool
        - analysis_results: dict
        - conversion_rationale: list
        + __init__(use_claude: bool = False)
        + analyze_dataset_structure(dataset, sample_size: int = 10, context_file: str = None, instruction: str = None) -> dict
        + _enhance_with_claude_analysis(analysis: dict, sample_data, context_file: str = None, instruction: str = None) -> dict
        + _query_claude_sdk(prompt: str) -> dict
        + generate_conversion_rationale(output_dir: str, analysis: dict, conversion_stats: dict)
    }
    DatasetAnalyzer --> ClaudeSDKClient : uses
    DatasetAnalyzer --> ClaudeCodeOptions : configures

Flow diagram for randomization before train/validation split

flowchart TD
    A[Converted Examples] -->|--randomize flag| B[Randomize Examples]
    B --> C[Split into Train/Validation]
    A -->|no randomize| C
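
The flow above amounts to an optional shuffle followed by a slice. A minimal sketch, assuming a hypothetical `split_examples` helper and a hypothetical `valid_ratio` parameter; the PR itself only states that `random.shuffle` is applied to `converted_examples` when `--randomize` is set:

```python
import random
from typing import Any, List, Tuple


def split_examples(
    examples: List[Any],
    valid_ratio: float = 0.1,  # hypothetical parameter, not taken from the PR
    randomize: bool = False,
) -> Tuple[List[Any], List[Any]]:
    """Optionally shuffle, then split into (train, validation)."""
    items = list(examples)  # copy so the caller's list is left untouched
    if randomize:
        random.shuffle(items)
    n_valid = int(len(items) * valid_ratio)
    split_at = len(items) - n_valid
    return items[:split_at], items[split_at:]


train, valid = split_examples(list(range(10)), valid_ratio=0.2, randomize=True)
print(len(train), len(valid))  # prints: 8 2
```

Without `randomize`, the validation set is simply the tail of the input, which is why shuffling matters when the source dataset is ordered.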

File-Level Changes

| Change | Details | Files |
| --- | --- | --- |
| Integrate Claude Code SDK and replace Anthropic calls | Replace anthropic import with asyncio and claude_code_sdk imports; update analyze_dataset_structure and _enhance_with_claude_analysis signatures to accept context and instruction parameters; read context file and append instruction into the analysis prompt; remove direct client.messages.create calls and route queries through a new async _query_claude_sdk helper with options and error handling | `hf_to_apple_jsonl.py` |
| Extend CLI with context, instruction, and randomization flags | Add --context, --instruct, and --randomize arguments to the main parser; pass args.context and args.instruct into analyze_dataset_structure; insert random.shuffle on converted_examples when --randomize is enabled | `hf_to_apple_jsonl.py` |
| Update dependencies and clean up project files | Remove anthropic dependency from requirements.txt; update .gitignore to track Claude Code SDK configuration files | `requirements.txt` `.gitignore` |
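
As a rough illustration of the "read context file and append instruction into the analysis prompt" step, the two optional inputs could be folded into the prompt like this. The `build_analysis_prompt` name and the section labels are hypothetical; the PR text does not show the actual prompt wording.

```python
from pathlib import Path
from typing import Optional


def build_analysis_prompt(
    base_prompt: str,
    context_file: Optional[str] = None,
    instruction: Optional[str] = None,
) -> str:
    """Append optional context-file contents and an instruction to a prompt."""
    parts = [base_prompt]
    if context_file:
        path = Path(context_file)
        if path.is_file():
            parts.append("Additional context:\n" + path.read_text(encoding="utf-8"))
        else:
            # Assumption: a missing file is skipped with a warning, not a hard error.
            print(f"Warning: context file not found: {context_file}")
    if instruction:
        parts.append("Additional instructions:\n" + instruction)
    return "\n\n".join(parts)


print(build_analysis_prompt("Analyze this dataset.", instruction="Focus on dialogue fields."))
```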

sourcery-ai

Hey there - I've reviewed your changes - here's some feedback:

Consider adding an optional seed parameter for the --randomize flag so users can reproduce the same shuffled splits.
Instead of calling asyncio.run inside _enhance_with_claude_analysis, think about managing a single event loop or converting the CLI entry point to async to avoid creating and closing loops repeatedly.
Add validation or a warning when --context or --instruct is provided without --use-claude-hook to prevent confusion for users who don’t enable Claude integration.
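
The third suggestion could be handled with a small post-parse check. This is a sketch only: the flag names come from the PR, while `build_parser` and `unused_claude_options` are hypothetical helpers.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Parser with only the three flags relevant to this check."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--use-claude-hook", action="store_true")
    parser.add_argument("--context")
    parser.add_argument("--instruct")
    return parser


def unused_claude_options(args: argparse.Namespace) -> List[str]:
    """Return the Claude-only flags that were set without --use-claude-hook."""
    if args.use_claude_hook:
        return []
    flags = []
    if args.context:
        flags.append("--context")
    if args.instruct:
        flags.append("--instruct")
    return flags


args = build_parser().parse_args(["--instruct", "focus on dialogue fields"])
for flag in unused_claude_options(args):
    print(f"Warning: {flag} has no effect without --use-claude-hook")
```

A warning keeps existing invocations working; calling `parser.error(...)` instead would turn the same condition into a hard failure.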

Prompt for AI Agents

Please address the comments from this code review:
## Overall Comments
- Consider adding an optional seed parameter for the `--randomize` flag so users can reproduce the same shuffled splits.
- Instead of calling `asyncio.run` inside `_enhance_with_claude_analysis`, think about managing a single event loop or converting the CLI entry point to async to avoid creating and closing loops repeatedly.
- Add validation or a warning when `--context` or `--instruct` is provided without `--use-claude-hook` to prevent confusion for users who don’t enable Claude integration.

## Individual Comments

### Comment 1
<location> `hf_to_apple_jsonl.py:257` </location>
<code_context>
-            
-            claude_analysis = json.loads(response.content[0].text)
+            # Use Claude Code SDK instead of direct API calls
+            claude_analysis = asyncio.run(self._query_claude_sdk(prompt))
             analysis['claude_insights'] = claude_analysis

</code_context>

<issue_to_address>
Using asyncio.run in synchronous context may cause issues if already in an event loop.

asyncio.run will fail if an event loop is already active. Consider checking for an existing loop or providing a fallback approach.
</issue_to_address>

### Comment 2
<location> `hf_to_apple_jsonl.py:268` </location>
<code_context>

         return analysis

+    async def _query_claude_sdk(self, prompt: str) -> Dict[str, Any]:
+        """Query Claude Code SDK with the analysis prompt."""
+        try:
</code_context>

<issue_to_address>
Error handling in _query_claude_sdk could be more robust for SDK failures.

Consider logging or propagating specific details for SDK failures, such as network or authentication errors, to improve troubleshooting.

Suggested implementation:

```python
import json
import logging
from typing import Any, Dict

# Assumes the installed claude-code-sdk exposes these names.
from claude_code_sdk import AssistantMessage, ClaudeCodeOptions, TextBlock, query


class DatasetAnalyzer:
    async def _query_claude_sdk(self, prompt: str) -> Dict[str, Any]:
        """Query Claude Code SDK with the analysis prompt."""
        response_text = ""
        try:
            # query() yields messages as they arrive; collect the text blocks.
            async for message in query(prompt=prompt, options=ClaudeCodeOptions()):
                if isinstance(message, AssistantMessage):
                    for block in message.content:
                        if isinstance(block, TextBlock):
                            response_text += block.text
            return json.loads(response_text)
        except json.JSONDecodeError as e:
            logging.error(f"Claude response was not valid JSON: {e}")
            return {"error": "invalid_json", "details": str(e)}
        except Exception as e:
            # If the SDK provides narrower exception classes (names such as
            # ClaudeSDKNetworkError or ClaudeSDKAuthError are hypothetical),
            # catch them before this handler for more specific reporting.
            logging.error(f"Claude SDK query failed: {e}", exc_info=True)
            return {"error": "sdk_failure", "details": str(e)}
```

If the Claude SDK provides specific exception classes (e.g., `ClaudeSDKNetworkError`, `ClaudeSDKAuthError`), you should catch those explicitly for more granular error handling. Also, ensure that the logging configuration is set up elsewhere in your codebase to capture these logs.
</issue_to_address>

### Comment 3
<location> `hf_to_apple_jsonl.py:542` </location>
<code_context>
         action="store_true",
         help="Enable Claude Code SDK integration for intelligent dataset analysis (requires anthropic package)"
     )
+    parser.add_argument(
+        "--context",
+        help="Path to context file to pass to Claude Code (only used with --use-claude-hook)"
+    )
+    parser.add_argument(
+        "--instruct",
+        help="Instructions to provide to Claude (only used with --use-claude-hook)"
+    )
+    parser.add_argument(
+        "--randomize",
+        action="store_true",
+        help="Randomize the data before splitting into train/validation sets"
+    )

</code_context>

<issue_to_address>
Randomization uses random.shuffle without seeding, which may affect reproducibility.

Consider adding an option for users to set a random seed, or clearly document that enabling randomization will result in non-deterministic splits.

Suggested implementation:

```python
    parser.add_argument(
        "--randomize",
        action="store_true",
        help="Randomize the data before splitting into train/validation sets (non-deterministic unless --random-seed is set)"
    )
    parser.add_argument(
        "--random-seed",
        type=int,
        default=None,
        help="Random seed for reproducible randomization (used with --randomize)"
    )


```

```python
        if self.use_claude:
            analysis = self._enhance_with_claude_analysis(analysis, sample_data, context_file, instruction)

```

You will need to:
1. Import the `random` module at the top of the file if it is not already imported.
2. When randomization is performed (wherever `random.shuffle` is called), set the seed using `random.seed(args.random_seed)` if `args.random_seed` is not None, before shuffling.
3. Pass the `random_seed` argument from the parsed CLI args to the relevant function(s) that handle randomization.
</issue_to_address>
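
Steps 2 and 3 could be wired up roughly as follows. This sketch uses a dedicated `random.Random` instance instead of the `random.seed` call suggested above, so the global generator is left alone; that substitution and the `shuffle_examples` name are assumptions, not part of the PR.

```python
import argparse
import random
from typing import Any, List, Optional


def shuffle_examples(examples: List[Any], seed: Optional[int] = None) -> List[Any]:
    """Return a shuffled copy; the same seed yields the same order."""
    rng = random.Random(seed)  # seed=None seeds from system entropy or time
    shuffled = list(examples)
    rng.shuffle(shuffled)
    return shuffled


parser = argparse.ArgumentParser()
parser.add_argument("--randomize", action="store_true")
parser.add_argument("--random-seed", type=int, default=None)
args = parser.parse_args(["--randomize", "--random-seed", "42"])

converted_examples = list(range(10))
if args.randomize:
    converted_examples = shuffle_examples(converted_examples, args.random_seed)
print(converted_examples)
```

Recording the seed alongside the output files would let a later run reproduce the exact train/validation split.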

### Comment 4
<location> `hf_to_apple_jsonl.py:580` </location>
<code_context>
     if args.use_claude_hook:
         if not CLAUDE_AVAILABLE:
-            print("Warning: Claude Code SDK not available. Install with: pip install anthropic")
+            print("Warning: Claude Code SDK not available. Install with: pip install claude-code-sdk")
             print("Proceeding without intelligent analysis...")
         else:
</code_context>

<issue_to_address>
Update help text to clarify SDK requirements.

The help text for --use-claude-hook still refers to anthropic; please update it to mention claude-code-sdk for consistency.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
        help="Enable Claude Code SDK integration for intelligent dataset analysis (requires anthropic package)"
=======
        help="Enable Claude Code SDK integration for intelligent dataset analysis (requires claude-code-sdk; install with: pip install claude-code-sdk)"
>>>>>>> REPLACE

</suggested_fix>

sourcery-ai · 2025-08-24T01:22:48Z

+                            response_text += block.text
+
+            # Parse the JSON response
+            claude_analysis = json.loads(response_text)


issue (code-quality): Inline variable that is immediately returned (inline-immediately-returned-variable)

scouzi1966 wants to merge 1 commit into main from feature/claude-sdk-integration
