Category Properties: Helping the Optimizer Understand Your Categories
Attach numerical descriptors to categorical variables so the optimizer knows how similar your options are.
When you define a categorical variable like "Solvent" with options Ethanol, Methanol, and Toluene, the optimizer has no way to know that Ethanol and Methanol are chemically similar while Toluene is very different. Without that information, it treats all categories as equally distant, which can waste experiments.
Properties (also called descriptors) solve this. By attaching numerical values to each category, you give the optimizer a way to measure similarity and make smarter suggestions.
How It Works
You define one or more properties on a categorical variable (e.g. Molecular Weight, Boiling Point).
You assign a numerical value for each property on each category.
The optimizer uses these values as a dense numerical vector to compute distances between categories.
Instead of treating categories as unrelated labels, the optimizer now sees each one as a point in a numerical space, and can infer that nearby points are likely to behave similarly.
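The three steps above can be sketched in plain Python. This is a minimal illustration, not the optimizer's actual API: the `CategoricalVariable` class and its methods are hypothetical names introduced here, and the boiling points are standard literature values (°C) added for the example.

```python
from dataclasses import dataclass, field


@dataclass
class CategoricalVariable:
    """Hypothetical container for illustration; not the optimizer's real API."""

    name: str
    properties: list                              # ordered property names
    values: dict = field(default_factory=dict)    # category -> {property: value}

    def add_category(self, category, props):
        """Step 2: assign one numerical value per property to a category."""
        missing = [p for p in self.properties if p not in props]
        if missing:
            raise ValueError(f"{category} is missing values for: {missing}")
        self.values[category] = {p: float(props[p]) for p in self.properties}

    def vectors(self):
        """Step 3: one numerical vector per category, in property order."""
        return {
            category: tuple(row[p] for p in self.properties)
            for category, row in self.values.items()
        }


# Step 1: define the properties on the variable.
solvent = CategoricalVariable("Solvent", ["Molecular Weight", "Boiling Point"])

solvent.add_category("Ethanol", {"Molecular Weight": 46.07, "Boiling Point": 78.4})
solvent.add_category("Methanol", {"Molecular Weight": 32.04, "Boiling Point": 64.7})
solvent.add_category("Toluene", {"Molecular Weight": 92.14, "Boiling Point": 110.6})

vectors = solvent.vectors()
print(vectors["Ethanol"])  # (46.07, 78.4)
```

Rejecting a category that lacks a value for any property keeps every vector the same length, which any distance calculation over these vectors would need.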
Example: Solvents
Without properties: Ethanol, Methanol, Acetone, and Toluene are four unrelated options. A result for one tells the optimizer nothing about the others, so it has to test each of them independently.
With properties:
Solvent | Molecular Weight (g/mol) | XLogP | TPSA (Å²) |
Ethanol | 46.07 | -0.31 | 20.23 |
Methanol | 32.04 | -0.74 | 20.23 |
Acetone | 58.08 | -0.24 | 17.07 |
Toluene | 92.14 | 2.73 | 0.00 |
Now the optimizer can see that Ethanol and Methanol are similar (close values), while Toluene is very different (high XLogP, zero TPSA). If Ethanol gives a good result, the optimizer is more likely to try Methanol next than Toluene.
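To make the similarity concrete, the sketch below computes distances from Ethanol using the table values. It assumes z-score scaling of each property followed by Euclidean distance; the optimizer's internal scaling and distance measure may differ, and the function names are illustrative.

```python
import math
from statistics import mean, pstdev

# Columns: Molecular Weight, XLogP, TPSA (values from the table above).
SOLVENTS = {
    "Ethanol": (46.07, -0.31, 20.23),
    "Methanol": (32.04, -0.74, 20.23),
    "Acetone": (58.08, -0.24, 17.07),
    "Toluene": (92.14, 2.73, 0.00),
}


def standardize(table):
    """Z-score each property so that no single unit dominates the distance.

    Assumes every property varies across categories (non-zero spread).
    """
    columns = list(zip(*table.values()))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return {
        name: tuple((x - m) / s for x, m, s in zip(row, means, stds))
        for name, row in table.items()
    }


def distances_from(reference, table):
    """Euclidean distance from one category to every other, after scaling."""
    scaled = standardize(table)
    ref = scaled[reference]
    return {
        name: math.dist(ref, vec)
        for name, vec in scaled.items()
        if name != reference
    }


dist = distances_from("Ethanol", SOLVENTS)
for name, d in sorted(dist.items(), key=lambda item: item[1]):
    print(f"{name:<9} {d:.2f}")
```

Under these assumptions Methanol and Acetone both land much closer to Ethanol than Toluene does. The exact ordering among near neighbors depends on the scaling chosen, which is one reason to treat this as an illustration rather than a prediction of what the optimizer will suggest.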
Example: Ligands
Ligand | Cone Angle (°) | % Buried Volume | TEP (cm⁻¹) |
PPh3 | 145 | 27.6 | 2068.9 |
PCy3 | 170 | 32.4 | 2056.4 |
P(tBu)3 | 182 | 36.5 | 2056.1 |
dppf | 180 | 31.2 | 2064.3 |
Steric (cone angle, buried volume) and electronic (TEP) descriptors let the optimizer explore the ligand space efficiently without testing every option.
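One way to picture how descriptors let results transfer between ligands is a kernel similarity, as used in Gaussian-process surrogate models. The sketch below assumes z-scored descriptors and an RBF kernel with a length scale of 1.0; both are illustrative choices, not a statement of what the optimizer uses internally.

```python
import math
from statistics import mean, pstdev

# Columns: cone angle (deg), % buried volume, TEP (cm^-1), from the table above.
LIGANDS = {
    "PPh3": (145.0, 27.6, 2068.9),
    "PCy3": (170.0, 32.4, 2056.4),
    "P(tBu)3": (182.0, 36.5, 2056.1),
    "dppf": (180.0, 31.2, 2064.3),
}


def standardize(table):
    """Z-score each descriptor column (assumes non-zero spread)."""
    columns = list(zip(*table.values()))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return {
        name: tuple((x - m) / s for x, m, s in zip(row, means, stds))
        for name, row in table.items()
    }


def rbf_similarity(a, b, length_scale=1.0):
    """RBF kernel: 1.0 for identical vectors, approaching 0 as they move apart."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-squared_distance / (2 * length_scale ** 2))


scaled = standardize(LIGANDS)
similarity = {
    (a, b): rbf_similarity(scaled[a], scaled[b])
    for a in scaled
    for b in scaled
}

for other in ("PCy3", "PPh3"):
    print(f"P(tBu)3 vs {other}: {similarity[('P(tBu)3', other)]:.4f}")
```

With these settings P(tBu)3 is far more similar to PCy3 than to PPh3, so a measurement on one bulky, electron-rich phosphine would inform predictions for the other much more than it would for PPh3.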
How to Choose Good Properties
Good descriptors capture the physical or chemical differences that matter for your experiment. Here are some guidelines:
For Chemical Compounds
Molecular weight: a basic size descriptor that is relevant in most cases.
XLogP (computed octanol-water partition coefficient): captures hydrophobicity/polarity.
TPSA (topological polar surface area): captures polarity and hydrogen-bonding capacity.
Boiling point: relevant for reactions where temperature matters.
pKa: relevant for acid/base chemistry.
Steric descriptors (cone angle, % buried volume): often crucial for catalysis.
Electronic descriptors (Hammett sigma, TEP): capture electronic effects in reactions.
For Non-Chemical Categories
Reactor type: volume (mL), max pressure (bar), max temperature (°C)
Supplier: purity (%), lead time (days), cost ($/kg)
Protocol: duration (min), number of steps, temperature range
General Principles
2–5 properties is typical. More is not always better — noisy or irrelevant descriptors can hurt performance.
Choose properties that differentiate. If all categories have the same value for a property, it adds no information.
Use properties that capture different aspects. A mix of size, polarity, and shape descriptors captures more information than three size descriptors.
Physical relevance matters. Properties related to the mechanism of your reaction work better than arbitrary numbers.
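Two of these principles can be checked mechanically before an optimization starts. The sketch below flags properties that are constant across categories and pairs that are almost perfectly correlated. The function names and the 0.95 threshold are illustrative assumptions; the two extra columns (cone angle in radians, and a flag that is 1 for every ligand) are added only to trigger each check.

```python
import math
from statistics import mean, pstdev

# One list per property, ordered PPh3, PCy3, P(tBu)3, dppf.
COLUMNS = {
    "Cone Angle (deg)": [145.0, 170.0, 182.0, 180.0],
    "Cone Angle (rad)": [math.radians(a) for a in (145.0, 170.0, 182.0, 180.0)],
    "% Buried Volume": [27.6, 32.4, 36.5, 31.2],
    "TEP (cm^-1)": [2068.9, 2056.4, 2056.1, 2064.3],
    "Is phosphine (0/1)": [1.0, 1.0, 1.0, 1.0],
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    covariance = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mx) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - my) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


def screen_properties(columns, corr_threshold=0.95):
    """Return (constant properties, highly correlated property pairs)."""
    constant = [name for name, col in columns.items() if pstdev(col) == 0]
    informative = [name for name in columns if name not in constant]
    redundant = []
    for i, a in enumerate(informative):
        for b in informative[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= corr_threshold:
                redundant.append((a, b, r))
    return constant, redundant


constant, redundant = screen_properties(COLUMNS)
print("No information:", constant)
print("Redundant pairs:", [(a, b) for a, b, _ in redundant])
```

With only a handful of categories, correlations are noisy, so a flagged pair is a prompt to reconsider the descriptor set rather than proof that one property must be dropped.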
Without Properties vs With Properties
Aspect | Without Properties | With Properties
How categories are seen | Unrelated labels (one-hot encoded) | Points in a numerical space |
Similarity | All pairs equally distant | Distance reflects real differences |
Exploration | Must try every category | Can infer from similar categories |
Efficiency | More experiments needed | Typically fewer experiments to find the optimum
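The "all pairs equally distant" row can be verified directly. This small sketch one-hot encodes the four solvents and measures every pairwise Euclidean distance; the helper name is illustrative.

```python
import math
from itertools import combinations

CATEGORIES = ["Ethanol", "Methanol", "Acetone", "Toluene"]


def one_hot(categories):
    """Encode each category as a vector with a single 1.0 in its own slot."""
    return {
        category: tuple(1.0 if i == j else 0.0 for j in range(len(categories)))
        for i, category in enumerate(categories)
    }


encoded = one_hot(CATEGORIES)
pair_distances = {
    (a, b): math.dist(encoded[a], encoded[b])
    for a, b in combinations(CATEGORIES, 2)
}

# Any two distinct one-hot vectors differ in exactly two positions,
# so every pair is sqrt(2) apart regardless of the chemistry.
for pair, d in pair_distances.items():
    print(pair, round(d, 4))
```

Because every distance is identical, a one-hot encoding gives the optimizer no basis for expecting Methanol to behave more like Ethanol than Toluene does. Descriptor vectors replace that flat geometry with one that reflects measured differences.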
Further Reading
The approach of using physicochemical descriptors to inform Bayesian optimization of categorical variables builds on published research:
Gryffin: Bayesian optimization of categorical variables informed by expert knowledge — Demonstrates how descriptor-informed optimization outperforms one-hot encoding across chemistry applications including solvent selection, ligand screening, and perovskite design.
Enhancing Bayesian Optimization by Creating Different Descriptor Datasets — Shows how the choice of descriptors affects optimization performance and compares different descriptor strategies.
Bayesian Optimization for Materials Design with Mixed Variables — Integrates a latent variable approach to handle mixed quantitative and qualitative design variables.
Bayesian optimization with known experimental and design constraints for chemistry — Practical guide to applying constrained Bayesian optimization in chemistry applications.
