Add Claude Code SDK integration and randomization features #1
- Add --context and --instruct options for Claude Code SDK integration
- Add --randomize option to shuffle data before train/validation split
- Replace direct Anthropic API calls with Claude Code SDK
- Update requirements.txt to use claude-code-sdk instead of anthropic
- Clean up and update .gitignore to include Claude Code files

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Reviewer's Guide
This PR replaces direct Anthropic API calls with an async Claude Code SDK integration (including support for custom context and instructions), adds a randomization option before train/validation splitting, extends the CLI with new flags, updates dependencies, and cleans up the project.
Sequence diagram for dataset analysis with Claude Code SDK and context/instructions
sequenceDiagram
participant User
participant CLI
participant DatasetAnalyzer
participant ClaudeSDKClient
User->>CLI: Run with --use-claude-hook, --context, --instruct
CLI->>DatasetAnalyzer: analyze_dataset_structure(dataset, sample_size, context_file, instruction)
DatasetAnalyzer->>ClaudeSDKClient: _query_claude_sdk(prompt with context/instruction)
ClaudeSDKClient-->>DatasetAnalyzer: Claude analysis response
DatasetAnalyzer-->>CLI: Analysis results with Claude insights
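The diagram shows the context file and the instruction being folded into the prompt before it reaches the SDK. The PR's actual prompt assembly is not shown in this conversation, so the following is only a minimal sketch of that step; `build_prompt` and the section labels are hypothetical names chosen for illustration.

```python
from pathlib import Path
from typing import Optional


def build_prompt(base_prompt: str,
                 context_file: Optional[str] = None,
                 instruction: Optional[str] = None) -> str:
    """Append optional context and instruction sections to a base prompt.

    Hypothetical helper: the name and the section labels are assumptions,
    not taken from the PR.
    """
    parts = [base_prompt]
    if context_file:
        # Read the user-supplied context file as UTF-8 text.
        context = Path(context_file).read_text(encoding="utf-8")
        parts.append(f"Additional context:\n{context}")
    if instruction:
        parts.append(f"Additional instructions:\n{instruction}")
    return "\n\n".join(parts)
```

With neither option supplied the base prompt is returned unchanged, so the non-Claude path and the plain `--use-claude-hook` path behave as before.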
Class diagram for updated DatasetAnalyzer with Claude Code SDK integration
classDiagram
class DatasetAnalyzer {
- use_claude: bool
- analysis_results: dict
- conversion_rationale: list
+ __init__(use_claude: bool = False)
+ analyze_dataset_structure(dataset, sample_size: int = 10, context_file: str = None, instruction: str = None) -> dict
+ _enhance_with_claude_analysis(analysis: dict, sample_data, context_file: str = None, instruction: str = None) -> dict
+ _query_claude_sdk(prompt: str) -> dict
+ generate_conversion_rationale(output_dir: str, analysis: dict, conversion_stats: dict)
}
DatasetAnalyzer --> ClaudeSDKClient : uses
DatasetAnalyzer --> ClaudeCodeOptions : configures
Flow diagram for randomization before train/validation split
flowchart TD
A[Converted Examples] -->|--randomize flag| B[Randomize Examples]
B --> C[Split into Train/Validation]
A -->|no randomize| C
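The flow above amounts to an optional shuffle followed by a positional split. A minimal sketch, assuming a hypothetical `split_examples` helper and a `val_fraction` parameter (neither name appears in the PR):

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_examples(examples: Sequence[T],
                   val_fraction: float = 0.1,
                   randomize: bool = False) -> Tuple[List[T], List[T]]:
    """Optionally shuffle, then split into (train, validation)."""
    items = list(examples)  # copy so the caller's data keeps its order
    if randomize:
        random.shuffle(items)  # in-place shuffle of the copy
    n_val = int(len(items) * val_fraction)
    split_at = len(items) - n_val
    return items[:split_at], items[split_at:]
```

Without `randomize`, the validation set is simply the tail of the converted examples, which matters when the source dataset is ordered by topic or difficulty.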
Hey there - I've reviewed your changes - here's some feedback:
- Consider adding an optional seed parameter for the `--randomize` flag so users can reproduce the same shuffled splits.
- Instead of calling `asyncio.run` inside `_enhance_with_claude_analysis`, think about managing a single event loop or converting the CLI entry point to async to avoid creating and closing loops repeatedly.
- Add validation or a warning when `--context` or `--instruct` is provided without `--use-claude-hook` to prevent confusion for users who don’t enable Claude integration.
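For the third point, one possible shape of the check is sketched below. The three flag names come from the PR; `build_parser`, `unused_claude_flags`, and `parse_args` are hypothetical helpers, and the warning wording is an assumption.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument("--use-claude-hook", action="store_true")
    parser.add_argument("--context")
    parser.add_argument("--instruct")
    return parser


def unused_claude_flags(args: argparse.Namespace) -> List[str]:
    """Return the Claude-only flags that were set without --use-claude-hook."""
    if args.use_claude_hook:
        return []
    flags = []
    if args.context is not None:
        flags.append("--context")
    if args.instruct is not None:
        flags.append("--instruct")
    return flags


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    args = build_parser().parse_args(argv)
    ignored = unused_claude_flags(args)
    if ignored:
        # A warning keeps existing invocations working; parser.error() would be stricter.
        print(f"Warning: {', '.join(ignored)} ignored without --use-claude-hook")
    return args
```

Whether to warn or to exit with an error is a design choice; a warning is the less disruptive option for scripts that already pass these flags.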
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Consider adding an optional seed parameter for the `--randomize` flag so users can reproduce the same shuffled splits.
- Instead of calling `asyncio.run` inside `_enhance_with_claude_analysis`, think about managing a single event loop or converting the CLI entry point to async to avoid creating and closing loops repeatedly.
- Add validation or a warning when `--context` or `--instruct` is provided without `--use-claude-hook` to prevent confusion for users who don’t enable Claude integration.
## Individual Comments
### Comment 1
<location> `hf_to_apple_jsonl.py:257` </location>
<code_context>
-
- claude_analysis = json.loads(response.content[0].text)
+ # Use Claude Code SDK instead of direct API calls
+ claude_analysis = asyncio.run(self._query_claude_sdk(prompt))
analysis['claude_insights'] = claude_analysis
</code_context>
<issue_to_address>
Calling `asyncio.run` from synchronous code fails if an event loop is already running.
`asyncio.run` raises `RuntimeError` when it is called from a thread whose event loop is already running (for example, inside a notebook or another async application). Consider checking for a running loop or providing a fallback approach.
</issue_to_address>
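One possible fallback, sketched under the assumption that the caller must stay synchronous: detect a running loop and, if there is one, run the coroutine on a worker thread with its own loop. `run_coro_sync` is a hypothetical helper, not part of the PR or of the SDK.

```python
import asyncio
import threading
from typing import Any, Coroutine, Dict


def run_coro_sync(coro: Coroutine[Any, Any, Any]) -> Any:
    """Run a coroutine to completion from synchronous code."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run is safe.
        return asyncio.run(coro)

    # A loop is already running here: run the coroutine in a worker thread
    # with its own event loop and block until it finishes.
    outcome: Dict[str, Any] = {}

    def worker() -> None:
        try:
            outcome["value"] = asyncio.run(coro)
        except BaseException as exc:  # re-raised in the calling thread below
            outcome["error"] = exc

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]
```

Note that the fallback branch blocks the already-running loop while it waits, so it suits short one-off calls; converting the entry point to async, as the overall comments suggest, avoids the problem altogether.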
### Comment 2
<location> `hf_to_apple_jsonl.py:268` </location>
<code_context>
return analysis
+ async def _query_claude_sdk(self, prompt: str) -> Dict[str, Any]:
+ """Query Claude Code SDK with the analysis prompt."""
+ try:
</code_context>
<issue_to_address>
Error handling in _query_claude_sdk could be more robust for SDK failures.
Consider logging or propagating specific details for SDK failures, such as network or authentication errors, to improve troubleshooting.
Suggested implementation:
```python
import logging

async def _query_claude_sdk(self, prompt: str) -> Dict[str, Any]:
    """Query Claude Code SDK with the analysis prompt."""
    try:
        # Assuming ClaudeSDKClient and query are used here; this call is a
        # placeholder for the SDK call the PR already makes.
        client = ClaudeSDKClient()
        options = ClaudeCodeOptions()
        result = await query(client, prompt, options)
        return result
    # Optionally, catch specific exception types first if Claude SDK provides them
    # For example:
    # except ClaudeSDKNetworkError as ne:
    #     logging.error(f"Network error: {ne}")
    #     return {"error": "network_error", "details": str(ne)}
    # except ClaudeSDKAuthError as ae:
    #     logging.error(f"Authentication error: {ae}")
    #     return {"error": "auth_error", "details": str(ae)}
    except Exception as e:
        logging.error(f"Claude SDK query failed: {e}", exc_info=True)
        return {"error": "sdk_failure", "details": str(e)}
```
If the Claude SDK provides specific exception classes (e.g., `ClaudeSDKNetworkError`, `ClaudeSDKAuthError`), you should catch those explicitly, before the generic handler, for more granular error handling. Also, ensure that the logging configuration is set up elsewhere in your codebase to capture these logs.
</issue_to_address>
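Even without knowing the SDK's exception classes, failures can be separated by kind using only the standard library. The sketch below assumes the SDK call is wrapped in some coroutine that returns the reply text; `query_as_json`, `fetch_text`, and `timeout_s` are hypothetical names, and the error keys merely extend the `{"error": ..., "details": ...}` shape used in the suggestion above.

```python
import asyncio
import json
import logging
from typing import Any, Awaitable, Callable, Dict

logger = logging.getLogger(__name__)


async def query_as_json(fetch_text: Callable[[str], Awaitable[str]],
                        prompt: str,
                        timeout_s: float = 60.0) -> Dict[str, Any]:
    """Call fetch_text(prompt) and parse the reply as a JSON object.

    Failures are reported as {"error": kind, "details": message}.
    """
    try:
        text = await asyncio.wait_for(fetch_text(prompt), timeout=timeout_s)
    except asyncio.TimeoutError:
        logger.error("Claude query timed out after %ss", timeout_s)
        return {"error": "timeout", "details": f"no reply within {timeout_s}s"}
    except Exception as exc:  # SDK-specific types could be caught before this
        logger.error("Claude query failed: %s", exc, exc_info=True)
        return {"error": "sdk_failure", "details": str(exc)}

    try:
        parsed = json.loads(text)
    except json.JSONDecodeError as exc:
        logger.error("Claude reply was not valid JSON: %s", exc)
        return {"error": "invalid_json", "details": str(exc)}
    if not isinstance(parsed, dict):
        return {"error": "invalid_json", "details": "expected a JSON object"}
    return parsed
```

Keeping the JSON parsing in its own `try` block distinguishes "the call failed" from "the call succeeded but the model's reply was not parseable", which are usually debugged differently.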
### Comment 3
<location> `hf_to_apple_jsonl.py:542` </location>
<code_context>
action="store_true",
help="Enable Claude Code SDK integration for intelligent dataset analysis (requires anthropic package)"
)
+ parser.add_argument(
+ "--context",
+ help="Path to context file to pass to Claude Code (only used with --use-claude-hook)"
+ )
+ parser.add_argument(
+ "--instruct",
+ help="Instructions to provide to Claude (only used with --use-claude-hook)"
+ )
+ parser.add_argument(
+ "--randomize",
+ action="store_true",
+ help="Randomize the data before splitting into train/validation sets"
+ )
</code_context>
<issue_to_address>
Randomization uses random.shuffle without seeding, which may affect reproducibility.
Consider adding an option for users to set a random seed, or clearly document that enabling randomization will result in non-deterministic splits.
Suggested implementation:
```python
parser.add_argument(
    "--randomize",
    action="store_true",
    help="Randomize the data before splitting into train/validation sets (non-deterministic unless --random-seed is set)"
)
parser.add_argument(
    "--random-seed",
    type=int,
    default=None,
    help="Random seed for reproducible randomization (used with --randomize)"
)
```
```python
if self.use_claude:
    analysis = self._enhance_with_claude_analysis(analysis, sample_data, context_file, instruction)
```
You will need to:
1. Import the `random` module at the top of the file if it is not already imported.
2. When randomization is performed (wherever `random.shuffle` is called), set the seed using `random.seed(args.random_seed)` if `args.random_seed` is not None, before shuffling.
3. Pass the `random_seed` argument from the parsed CLI args to the relevant function(s) that handle randomization.
</issue_to_address>
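The steps above can be sketched as follows. This variant uses a private `random.Random` instance instead of `random.seed`, so the global generator is left alone; that substitution, along with the helper names `build_parser` and `maybe_shuffle`, is an assumption for illustration and not the reviewer's exact proposal.

```python
import argparse
import random
from typing import List


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument("--randomize", action="store_true")
    parser.add_argument("--random-seed", type=int, default=None)
    return parser


def maybe_shuffle(examples: List[dict], args: argparse.Namespace) -> List[dict]:
    """Return a shuffled copy when --randomize is set, else the input unchanged."""
    if not args.randomize:
        return examples
    # seed=None lets Random pick its own entropy, i.e. a non-deterministic order.
    rng = random.Random(args.random_seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    return shuffled
```

Running twice with the same `--random-seed` value then yields the same order, and therefore the same train/validation split, on the same Python version.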
### Comment 4
<location> `hf_to_apple_jsonl.py:580` </location>
<code_context>
if args.use_claude_hook:
if not CLAUDE_AVAILABLE:
- print("Warning: Claude Code SDK not available. Install with: pip install anthropic")
+ print("Warning: Claude Code SDK not available. Install with: pip install claude-code-sdk")
print("Proceeding without intelligent analysis...")
else:
</code_context>
<issue_to_address>
Update help text to clarify SDK requirements.
The help text for --use-claude-hook still refers to anthropic; please update it to mention claude-code-sdk for consistency.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
help="Randomize the data before splitting into train/validation sets"
)
args = parser.parse_args()
analyzer = None
if args.use_claude_hook:
=======
help="Randomize the data before splitting into train/validation sets"
)
parser.add_argument(
"--use-claude-hook",
action="store_true",
help="Enable Claude Code SDK-based analysis (requires claude-code-sdk, install with: pip install claude-code-sdk)"
)
args = parser.parse_args()
analyzer = None
if args.use_claude_hook:
>>>>>>> REPLACE
</suggested_fix>
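One caveat when applying the replacement above: the code context in Comment 3 shows a `store_true` argument whose help text still says "requires anthropic package", which suggests `--use-claude-hook` is already registered earlier in the parser. Under argparse's default conflict handling, registering the same option string a second time raises `argparse.ArgumentError`. A sketch, assuming the help text is edited in the existing call instead (`build_parser` and `duplicate_flag_is_rejected` are hypothetical names):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Single registration of the flag, with help text naming the new package.
    parser.add_argument(
        "--use-claude-hook",
        action="store_true",
        help="Enable Claude Code SDK integration for intelligent dataset analysis "
             "(requires claude-code-sdk: pip install claude-code-sdk)",
    )
    return parser


def duplicate_flag_is_rejected() -> bool:
    """Return True if adding --use-claude-hook a second time raises ArgumentError."""
    parser = build_parser()
    try:
        parser.add_argument("--use-claude-hook", action="store_true")
    except argparse.ArgumentError:
        return True
    return False
```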
    response_text += block.text

    # Parse the JSON response
    claude_analysis = json.loads(response_text)
issue (code-quality): Inline variable that is immediately returned (inline-immediately-returned-variable)
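The refactoring this rule asks for is small. A sketch, with a hypothetical `parse_analysis` function standing in for the tail of `_query_claude_sdk`:

```python
import json
from typing import Any, Dict


def parse_analysis(response_text: str) -> Dict[str, Any]:
    """Parse the accumulated response text as JSON."""
    # Before:
    #     claude_analysis = json.loads(response_text)
    #     return claude_analysis
    # After: return the expression directly.
    return json.loads(response_text)
```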
Summary
- Add `--context` and `--instruct` options for enhanced Claude Code SDK integration
- Add `--randomize` option to shuffle data before train/validation split
New Features
- `--context`: Pass a context file to Claude Code for better dataset analysis
- `--instruct`: Provide direct instructions to Claude for analysis customization
- `--randomize`: Randomize examples before splitting into train/validation sets
Technical Changes
- Move from the `anthropic` library to `claude-code-sdk`
- Update `requirements.txt` to reflect new dependencies
- Update `.gitignore` to include Claude Code configuration files
Test plan
- `--randomize` flag properly shuffles data before splitting
- `--context` parameter with sample context file
- `--instruct` parameter with custom instructions
🤖 Generated with Claude Code
Summary by Sourcery
Integrate the Claude Code SDK for dataset analysis and add data randomization support via new CLI options