cholla_chem is a library for fast, comprehensive, and customizable name-to-SMILES conversion.
It can use existing name-to-SMILES resolvers, including OPSIN, PubChem, and CIRpy.
This library also implements the following new resolvers:
- Manually curated dataset of common names not correctly resolved by other resolvers (e.g. 'NaH')
- Structural formula resolver (e.g. 'CH3CH2CH2COOH')
- Inorganic shorthand resolver (e.g. '[Cp*RhCl2]2')
The following string editing/manipulation strategies may be applied to compounds to assist with name-to-SMILES resolution:
- String sanitization for special characters and mojibake encoding errors
- Name correction for OCR errors, typos, pagination errors, etc.
- Splitting compounds on common delimiters (useful for mixtures of compounds, e.g. 'BH3•THF')
- Peptide shorthand expansion (e.g. 'cyclo(Asp-Arg-Val-Tyr-Ile-His-Pro-Phe)' -> 'cyclo(l-aspartyl-l-arginyl-l-valyl-l-tyrosyl-l-isoleucyl-l-histidyl-l-prolyl-l-phenylalanyl)')
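Two of the strategies above, delimiter splitting and peptide shorthand expansion, can be sketched in a few lines. This is an illustration only, not cholla_chem's implementation: the function names, the delimiter set, and the residue table are assumptions, and the table covers only the residues that appear in the example above.

```python
import re

# Assumed delimiter set for illustration; the library's actual list may differ.
MIXTURE_DELIMITERS = ("•", "·")

# Three-letter codes taken from the peptide example above.
# A real table would cover every amino acid.
RESIDUE_NAMES = {
    "Asp": "l-aspartyl",
    "Arg": "l-arginyl",
    "Val": "l-valyl",
    "Tyr": "l-tyrosyl",
    "Ile": "l-isoleucyl",
    "His": "l-histidyl",
    "Pro": "l-prolyl",
    "Phe": "l-phenylalanyl",
}


def split_mixture(name):
    """Split a compound name on the assumed mixture delimiters."""
    pattern = "|".join(re.escape(d) for d in MIXTURE_DELIMITERS)
    return [part.strip() for part in re.split(pattern, name) if part.strip()]


def expand_cyclic_peptide(shorthand):
    """Expand 'cyclo(Xaa-Yaa-...)' into residue names.

    The input is returned unchanged if it does not match the pattern
    or contains a residue code missing from RESIDUE_NAMES.
    """
    match = re.fullmatch(r"cyclo\(([A-Za-z]{3}(?:-[A-Za-z]{3})*)\)", shorthand)
    if match is None:
        return shorthand
    codes = match.group(1).split("-")
    if any(code not in RESIDUE_NAMES for code in codes):
        return shorthand
    return "cyclo(" + "-".join(RESIDUE_NAMES[code] for code in codes) + ")"


print(split_mixture("BH3•THF"))
print(expand_cyclic_peptide("cyclo(Asp-Arg-Val-Tyr-Ile-His-Pro-Phe)"))
```

Each fragment returned by the splitter could then be resolved separately, and the expanded peptide name passed to a systematic-name resolver.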
When resolvers disagree on the SMILES for a compound name, one of several SMILES selection methods can be used to pick the "best" candidate. See the documentation for more details.
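One plausible selection method is a weighted vote, where each resolver contributes its weight to the SMILES it returned and the highest total wins. The sketch below is a minimal illustration under that assumption, not the library's actual algorithm; the function name, the default weight of 1.0, the tie-breaking rule, and the candidate data are all hypothetical.

```python
from collections import defaultdict


def pick_weighted_smiles(smiles_to_resolvers, resolver_weights):
    """Return the SMILES with the highest summed resolver weight.

    smiles_to_resolvers maps each candidate SMILES to the resolvers that
    returned it. Resolvers missing from resolver_weights count as 1.0
    (an assumption). Ties go to the candidate listed first.
    """
    if not smiles_to_resolvers:
        return None
    totals = defaultdict(float)
    for smiles, resolvers in smiles_to_resolvers.items():
        for resolver in resolvers:
            totals[smiles] += resolver_weights.get(resolver, 1.0)
    # max() returns the first key with the largest total, so dict order breaks ties.
    return max(totals, key=totals.get)


# Made-up disagreement between three resolvers, for illustration only.
candidates = {
    "CCO": ["opsin"],
    "CCN": ["pubchem", "cirpy"],
}
weights = {"opsin": 4, "pubchem": 3, "cirpy": 2}
print(pick_weighted_smiles(candidates, weights))  # CCN (3 + 2 beats 4)
```

With these weights, two lower-weight resolvers that agree outvote a single higher-weight one; raising the first weight above 5 would flip the result.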
Install cholla_chem with pip directly from this repo:
pip install git+https://github.com/denovochem/cholla_chem.git
Resolve chemical names to SMILES by passing a string or a list of strings:
from cholla_chem import resolve_compounds_to_smiles
resolved_smiles = resolve_compounds_to_smiles(compounds_list=['aspirin'])
"{'aspirin': 'CC(=O)Oc1ccccc1C(=O)O'}"
See detailed information, including which resolver returned which SMILES, with detailed_name_dict=True:
from cholla_chem import resolve_compounds_to_smiles
resolved_smiles = resolve_compounds_to_smiles(
    compounds_list=['2-acetyloxybenzoic acid'],
    detailed_name_dict=True
)
"{'2-acetyloxybenzoic acid': {
'SMILES': 'CC(=O)Oc1ccccc1C(=O)O',
'SMILES_source': ['pubchem_default', 'opsin_default'],
'SMILES_dict': {
'CC(=O)Oc1ccccc1C(=O)O': ['pubchem_default', 'opsin_default']
},
'additional_info': {}
}}"
Many aspects of the name-to-SMILES resolution process can be customized, including the resolvers that are used, the configuration of those resolvers, and the strategy used to pick the best SMILES.
In this example, we resolve chemical names with OPSIN, PubChem, and CIRpy, and use a custom consensus weighting approach to pick the best SMILES:
from cholla_chem import (
    OpsinNameResolver,
    PubChemNameResolver,
    CIRpyNameResolver,
    resolve_compounds_to_smiles,
)
opsin_resolver = OpsinNameResolver(
    resolver_name='opsin',
    resolver_weight=4
)
pubchem_resolver = PubChemNameResolver(
    resolver_name='pubchem',
    resolver_weight=3
)
cirpy_resolver = CIRpyNameResolver(
    resolver_name='cirpy',
    resolver_weight=2
)
resolved_smiles = resolve_compounds_to_smiles(
    compounds_list=['2-acetyloxybenzoic acid'],
    resolvers_list=[opsin_resolver, pubchem_resolver, cirpy_resolver],
    smiles_selection_mode='weighted',
    detailed_name_dict=True
)
"{'2-acetyloxybenzoic acid': {
'SMILES': 'CC(=O)Oc1ccccc1C(=O)O',
'SMILES_source': ['opsin', 'pubchem', 'cirpy'],
'SMILES_dict': {
'CC(=O)Oc1ccccc1C(=O)O': ['opsin', 'pubchem', 'cirpy']
},
'additional_info': {}
}}"
cholla_chem can also be used as a command line tool. The command line interface can resolve a single chemical name passed as an argument or read names from a file.
Resolve compounds directly from the command line:
cholla-chem "aspirin"
Resolve compounds from a file:
cholla-chem --input names.txt --output results.tsv
See help for more options:
cholla-chem --help
See documentation for more details.
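A file-based workflow like the one above can also be approximated from Python. The sketch below is not the CLI's implementation: it assumes one compound name per line in the input file and a two-column name/SMILES TSV as output, neither of which is specified here, and `resolve_file` is a hypothetical helper. The `resolve` argument is any callable that maps a list of names to a dict of name to SMILES.

```python
import csv
from pathlib import Path


def resolve_file(input_path, output_path, resolve):
    """Read one name per line, resolve the names, and write name/SMILES rows as TSV.

    The input layout and output columns are assumptions made for this sketch.
    """
    lines = Path(input_path).read_text(encoding="utf-8").splitlines()
    # Drop blank lines and duplicates while keeping first-seen order.
    names = list(dict.fromkeys(line.strip() for line in lines if line.strip()))
    results = resolve(names)
    with open(output_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["name", "smiles"])
        for name in names:
            # Names the resolver could not handle get an empty SMILES column.
            writer.writerow([name, results.get(name, "")])
    return results
```

If `resolve_compounds_to_smiles` returns a dict keyed by name, as the example output earlier suggests, it could be passed as `lambda names: resolve_compounds_to_smiles(compounds_list=names)`.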
Full documentation is available here
- Feature ideas and bug reports are welcome on the Issue Tracker.
- Fork the source code on GitHub, make changes and file a pull request.
cholla_chem is licensed under the MIT license.