Analysis Date: January 2026 | Database: CSD v5.46 (Release Nov 2024)
Machine Learning (ML) models and Neural Network Potentials (NNPs) require physically consistent training data. The FlexCryst engine flagged 151,157 records in the CSD v5.46 release as unsuitable for AI training. The screening removes unphysical artifacts, misplaced atoms, and symmetry errors that would otherwise lead to divergent energy gradients.
| Category | Count | Exclusion Criteria |
|---|---|---|
| Atoms | 35,110 | System size exceeds Java Heap Space limits. |
| Z' (Asymmetric Unit) | 33,867 | Mismatch between chemical formula and 3D residue count. |
| Hydrogens | 32,795 | Missing or incomplete 3D coordinates for H-atoms. |
| Polymer | 28,976 | Infinite frameworks (MOFs/COFs) excluded for discrete analysis. |
| Fractional | 9,735 | Disordered sites preventing unique atomic mapping. |
| Clash | 3,951 | Unphysical inter-atomic distances (see Section 2). |
| Symmetry | 3,942 | Symmetry operation or space group assignment inconsistencies. |
| Rhombohedral | 1,430 | Ambiguity in R-3 hexagonal/rhombohedral settings. |
| Residues | 682 | Complexity limit: > 40 molecules per unit cell. |
| Hofmann | 511 | Expert-validated outliers (visual artifacts/bonding errors). |
| AtomSymbol | 158 | Rare elements/Noble gases with insufficient parameterization. |
| Total | 151,157 | Total unique Refcode rejections. |
The FlexCryst Clash Test is a critical diagnostic tool for detecting wrong settings and refinement errors. Impossible atomic distances generate "ghost" forces that degrade ML model quality.
| Atom Pair | Threshold (Å) | Empirical Outlier Examples (Refcodes) |
|---|---|---|
| H...H / D...D | 1.89 | BIBPUZ (1.79 Å), COSWUC (1.81 Å), RIFSEF01 (1.86 Å), DACDER (1.87 Å), SITQIV (1.88 Å) |
| C...C | 2.44 | EMANEN03 (2.44 Å) |
| O...O | 1.89 | Radial distribution standard minimum: 2.23 Å. |
In many refinements, heavy atoms are placed accurately, but hydrogen atoms are positioned geometrically without energetic validation. These misplaced H-atoms create artificial clashes. Identifying these via the 1.89 Å threshold prevents AI models from learning unphysical, high-repulsion gradients.
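The threshold check can be illustrated with a minimal Python sketch. It is not the FlexCryst implementation: the function name `find_clashes`, the input layout, the decision to skip pairs inside the same molecule, and the strict `<` comparison are all assumptions made for illustration. Only the three threshold values come from the table above.

```python
import math
from itertools import combinations

# Thresholds in Å, copied from the table above. Deuterium is mapped to hydrogen.
THRESHOLDS = {
    ("H", "H"): 1.89,
    ("C", "C"): 2.44,
    ("O", "O"): 1.89,
}


def normalize(element):
    """Treat D like H so that D...D contacts use the H...H threshold."""
    return "H" if element == "D" else element


def find_clashes(atoms):
    """Return (i, j, distance) for contacts shorter than the tabulated threshold.

    `atoms` is a list of (element, molecule_id, (x, y, z)) with Cartesian
    coordinates in Å. Pairs inside the same molecule are skipped (assumption),
    because bonded atoms are legitimately closer than these limits.
    """
    clashes = []
    for (i, a), (j, b) in combinations(enumerate(atoms), 2):
        if a[1] == b[1]:
            continue  # same molecule: not an intermolecular contact
        key = tuple(sorted((normalize(a[0]), normalize(b[0]))))
        limit = THRESHOLDS.get(key)
        if limit is None:
            continue  # no threshold tabulated for this element pair
        distance = math.dist(a[2], b[2])
        if distance < limit:  # strict comparison is an assumption
            clashes.append((i, j, round(distance, 3)))
    return clashes


# Hypothetical example: two hydrogens of different molecules 1.79 Å apart.
example = [
    ("H", 0, (0.0, 0.0, 0.0)),
    ("C", 0, (0.0, 0.0, 1.09)),
    ("H", 1, (1.79, 0.0, 0.0)),
]
print(find_clashes(example))
```

A record with a non-empty result would be a candidate for the Clash category. A production check would also have to include symmetry-generated and lattice-translated copies of each atom, which this sketch leaves out.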
Space group No. 14 (P21/c) frequently suffers from mixed settings: P21/a, P21/b, and P21/c. When coordinates reported in one setting are combined with the symmetry operations of another, the generated symmetry copies overlap artificially. FlexCryst uses distance violations as a proxy to detect and exclude these symmetry-misassigned structures.
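The effect of a mixed setting can be reproduced with a short sketch. The general positions below are the standard ones for P21/c and P21/a with unique axis b. Everything else (the function name `shortest_image_contact`, the two-atom test structure, and the cell) is hypothetical and chosen only to make the overlap visible. The sketch assumes atoms on general positions and is not FlexCryst's procedure.

```python
import math
from itertools import product

# General positions (fractional coordinates) for two settings of space
# group No. 14 with unique axis b.
SETTINGS = {
    "P21/c": [
        lambda x, y, z: (x, y, z),
        lambda x, y, z: (-x, y + 0.5, -z + 0.5),
        lambda x, y, z: (-x, -y, -z),
        lambda x, y, z: (x, -y + 0.5, z + 0.5),
    ],
    "P21/a": [
        lambda x, y, z: (x, y, z),
        lambda x, y, z: (-x + 0.5, y + 0.5, -z),
        lambda x, y, z: (-x, -y, -z),
        lambda x, y, z: (x + 0.5, -y + 0.5, z),
    ],
}


def shortest_image_contact(frac_atoms, cell, setting):
    """Shortest distance (Å) from any atom to a symmetry copy of any atom.

    `cell` is (a, b, c, beta_in_degrees) for a monoclinic cell. Only the
    non-identity operators are applied, so bonds inside the asymmetric
    unit are not counted.
    """
    a, b, c, beta_deg = cell
    cos_beta = math.cos(math.radians(beta_deg))
    best = math.inf
    for p in frac_atoms:
        for q in frac_atoms:
            for op in SETTINGS[setting][1:]:
                image = op(*q)
                # Reduce the fractional difference to the nearest lattice copy.
                base = [i - j - round(i - j) for i, j in zip(image, p)]
                # Also test neighbouring cells, since the cell is not orthogonal.
                for shift in product((-1, 0, 1), repeat=3):
                    dx, dy, dz = (d + s for d, s in zip(base, shift))
                    d2 = ((a * dx) ** 2 + (b * dy) ** 2 + (c * dz) ** 2
                          + 2 * a * c * dx * dz * cos_beta)
                    best = min(best, math.sqrt(d2))
    return best


# Hypothetical two-atom asymmetric unit and cell.
atoms = [(0.10, 0.10, 0.10), (0.41, 0.60, 0.90)]
cell = (10.0, 10.0, 10.0, 100.0)
for name in SETTINGS:
    print(name, round(shortest_image_contact(atoms, cell, name), 2))
```

With these coordinates the P21/c operators leave every symmetry copy more than 3 Å away, while the P21/a operators place a copy of the first atom about 0.1 Å from the second. A distance violation of this kind is what marks the record as symmetry-misassigned.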
Structures in space group R-3 are excluded due to frequent ambiguity between hexagonal and rhombohedral settings, which leads to incorrectly transformed coordinates and false distances.
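The two settings can be told apart from the cell metric alone, which is one way to see where the ambiguity arises. The sketch below is an illustration, not the FlexCryst test: the function name `r3_setting` and the single tolerance used for both lengths and angles are assumptions.

```python
def r3_setting(a, b, c, alpha, beta, gamma, tol=0.01):
    """Guess the axis setting of an R-centred trigonal cell from its metric.

    Hexagonal axes:    a = b, alpha = beta = 90, gamma = 120.
    Rhombohedral axes: a = b = c, alpha = beta = gamma.
    Lengths are in Å, angles in degrees; `tol` is applied to both (assumption).
    """
    def close(u, v):
        return abs(u - v) <= tol

    if close(a, b) and close(alpha, 90) and close(beta, 90) and close(gamma, 120):
        return "hexagonal"
    if close(a, b) and close(b, c) and close(alpha, beta) and close(beta, gamma):
        return "rhombohedral"
    return "inconsistent"


print(r3_setting(10.0, 10.0, 25.0, 90.0, 90.0, 120.0))  # hexagonal axes
print(r3_setting(7.0, 7.0, 7.0, 55.0, 55.0, 55.0))      # rhombohedral axes
```

A record whose cell metric points to one setting while its symmetry operators or coordinates follow the other would yield the false distances described above.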
Asymmetric Unit (Z'): We exclude structures where the residue count contradicts the reported chemical formula, ensuring consistent graph mapping.
Complexity Limit: Systems with more than 40 molecules in the unit cell are rejected for computational stability. Standard co-crystals and solvates remain fully supported.
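These two filters can be sketched as a single function. The limit of 40 molecules is taken from the text. The formula layout (comma-separated residues with an optional integer multiplier such as `2(H2 O1)`), the helper names `expected_residues` and `rejection_reason`, and the order of the checks are assumptions for illustration. Fractional multipliers are not handled.

```python
import re

MAX_MOLECULES_PER_CELL = 40  # complexity limit quoted in the text


def expected_residues(formula):
    """Count residues in a formula such as 'C12 H10 N2 O2,2(H2 O1)'.

    Each comma-separated part counts once unless it starts with an integer
    multiplier, for example '2(...)'. This layout is an assumption.
    """
    total = 0
    for part in formula.split(","):
        match = re.match(r"\s*(\d+)\(", part)
        total += int(match.group(1)) if match else 1
    return total


def rejection_reason(formula, found_residues, molecules_per_cell):
    """Return the exclusion category for a record, or None if it passes.

    `found_residues` is the number of discrete molecules found in the 3D
    coordinates of the asymmetric unit (hypothetical input).
    """
    if molecules_per_cell > MAX_MOLECULES_PER_CELL:
        return "Residues"
    if expected_residues(formula) != found_residues:
        return "Z'"
    return None


# Hypothetical dihydrate: one host molecule plus two waters.
formula = "C12 H10 N2 O2,2(H2 O1)"
print(rejection_reason(formula, found_residues=3, molecules_per_cell=12))
print(rejection_reason(formula, found_residues=2, molecules_per_cell=12))
print(rejection_reason(formula, found_residues=3, molecules_per_cell=48))
```

The dihydrate passes when all three residues are present, which matches the statement that standard co-crystals and solvates remain supported. It is rejected only when a residue is missing from the 3D data or the cell holds more than 40 molecules.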
The full list of 151,157 Refcodes is available for institutional research.