Technical Report: AI-Compliance Screening

Analysis Date: January 2026 | Database: CSD v5.46 (Release Nov 2024)

Executive Summary

Machine Learning (ML) models and Neural Network Potentials (NNPs) require physically consistent training data. The FlexCryst engine flagged 151,157 records in the CSD v5.46 release as unsuitable for AI training. This screening filters out unphysical artifacts, misplaced atoms, and symmetry errors that would otherwise lead to divergent energy gradients.

1. Rejection Statistics & Categories

Category | Count | Exclusion Criteria
Atoms | 35,110 | System size exceeds Java Heap Space limits.
Z' (Asymmetric Unit) | 33,867 | Mismatch between chemical formula and 3D residue count.
Hydrogens | 32,795 | Missing or incomplete 3D coordinates for H-atoms.
Polymer | 28,976 | Infinite frameworks (MOFs/COFs) excluded for discrete analysis.
Fractional | 9,735 | Disordered sites preventing unique atomic mapping.
Clash | 3,951 | Unphysical inter-atomic distances (see Section 2).
Symmetry | 3,942 | Symmetry operation or space group assignment inconsistencies.
Rhombohedral | 1,430 | Ambiguity in R-3 hexagonal/rhombohedral settings.
Residues | 682 | Complexity limit: > 40 molecules per unit cell.
Hofmann | 511 | Expert-validated outliers (visual artifacts/bonding errors).
AtomSymbol | 158 | Rare elements/Noble gases with insufficient parameterization.
Total | 151,157 | Total unique Refcode rejections.
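
As a quick consistency check, the category counts can be summed and compared with the reported total. The short Python sketch below does that and derives each category's share. The counts are copied from the table; the variable and function names are illustrative only and are not part of FlexCryst.

```python
# Category counts copied from the rejection table.
REJECTION_COUNTS = {
    "Atoms": 35_110,
    "Z' (Asymmetric Unit)": 33_867,
    "Hydrogens": 32_795,
    "Polymer": 28_976,
    "Fractional": 9_735,
    "Clash": 3_951,
    "Symmetry": 3_942,
    "Rhombohedral": 1_430,
    "Residues": 682,
    "Hofmann": 511,
    "AtomSymbol": 158,
}
REPORTED_TOTAL = 151_157  # "Total" row of the table


def category_shares(counts):
    """Return each category's fraction of all rejections."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


total = sum(REJECTION_COUNTS.values())
shares = category_shares(REJECTION_COUNTS)

for name, share in shares.items():
    print(f"{name:<22} {share:6.1%}")
print("Sum matches reported total:", total == REPORTED_TOTAL)
```

The counts add up exactly to the reported total, which is consistent with each Refcode being counted under a single category.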

2. Advanced Clash Detection & Diagnostic Thresholds

The FlexCryst Clash Test is a critical diagnostic tool for detecting incorrect space-group settings and refinement errors. Physically impossible interatomic distances generate spurious ("ghost") forces that degrade ML model quality.

Atom Pair | Threshold (Å) | Empirical Outlier Examples (Refcodes) / Notes
H...H / D...D | 1.89 | BIBPUZ (1.79 Å), COSWUC (1.81 Å), RIFSEF01 (1.86 Å), DACDER (1.87 Å), SITQIV (1.88 Å)
C...C | 2.44 | EMANEN03 (2.44 Å)
O...O | 1.89 | Radial distribution standard minimum: 2.23 Å.
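
The threshold test itself can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the FlexCryst implementation: it assumes the non-bonded contact distances have already been computed elsewhere, that pairs not listed in the table are never flagged, and that a contact counts as a clash when it is strictly shorter than the threshold. The source does not say whether the comparison is strict, and the function name `find_clashes` is hypothetical.

```python
# Thresholds in Å, copied from the table. Keys are sets of element symbols,
# so frozenset(["H", "H"]) and frozenset(["H"]) refer to the same pair type.
CLASH_THRESHOLDS = {
    frozenset(["H"]): 1.89,  # H...H
    frozenset(["D"]): 1.89,  # D...D
    frozenset(["C"]): 2.44,  # C...C
    frozenset(["O"]): 1.89,  # O...O
}


def find_clashes(contacts):
    """Return the contacts that fall below their pair threshold.

    contacts: iterable of (element_a, element_b, distance_in_angstrom)
    for non-bonded atom pairs (assumed to be precomputed).
    """
    clashes = []
    for elem_a, elem_b, dist in contacts:
        limit = CLASH_THRESHOLDS.get(frozenset([elem_a, elem_b]))
        # Assumption: strict "<"; pairs without a listed threshold are skipped.
        if limit is not None and dist < limit:
            clashes.append((elem_a, elem_b, dist, limit))
    return clashes


# Example input: the first distance is the BIBPUZ value from the table,
# the others are made up for illustration.
example_contacts = [
    ("H", "H", 1.79),
    ("H", "H", 2.10),
    ("C", "C", 2.60),
    ("O", "O", 1.50),
    ("N", "N", 1.00),  # no N...N threshold in the table, so never flagged
]
example_clashes = find_clashes(example_contacts)
for elem_a, elem_b, dist, limit in example_clashes:
    print(f"{elem_a}...{elem_b}: {dist:.2f} Å is below {limit:.2f} Å")
```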

Misplaced Hydrogen Atoms

In many refinements, heavy atoms are placed accurately, but hydrogen atoms are positioned geometrically without energetic validation. These misplaced H-atoms create artificial clashes. Identifying these via the 1.89 Å threshold prevents AI models from learning unphysical, high-repulsion gradients.

Symmetry Misassignments (e.g., Space Group 14)

Space group No. 14 (P21/c) is frequently reported in mixed settings: P21/a, P21/b, and P21/c. Applying the symmetry operations of one setting to coordinates given in another creates artificial atomic overlaps. FlexCryst uses distance violations as a proxy to detect and exclude these symmetry-misassigned structures.
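
To illustrate why a setting mix-up shows up as a distance violation, the Python sketch below applies the general-position operations of P21/c and of P21/a (both unique axis b) to the same atom and reports the shortest distance to any of its own symmetry copies. The cell parameters and the atom position are made-up values chosen for illustration, and the helper names are hypothetical; this is not the FlexCryst algorithm.

```python
import math
from itertools import product

# General-position operations on fractional coordinates (x, y, z).
# Both sets belong to space group No. 14 with unique axis b; they differ
# in the glide direction (along c versus along a).
P21_C = [
    lambda x, y, z: (x, y, z),
    lambda x, y, z: (-x, y + 0.5, -z + 0.5),
    lambda x, y, z: (-x, -y, -z),
    lambda x, y, z: (x, -y + 0.5, z + 0.5),
]
P21_A = [
    lambda x, y, z: (x, y, z),
    lambda x, y, z: (-x + 0.5, y + 0.5, -z),
    lambda x, y, z: (-x, -y, -z),
    lambda x, y, z: (x + 0.5, -y + 0.5, z),
]


def distance(p, q, cell):
    """Shortest distance (Å) between fractional points in a monoclinic cell,
    searching the neighbouring periodic images."""
    a, b, c, beta_deg = cell
    cos_beta = math.cos(math.radians(beta_deg))
    base = [qi - pi for pi, qi in zip(p, q)]
    base = [d - round(d) for d in base]  # wrap into roughly [-0.5, 0.5]
    best = math.inf
    for shift in product((-1, 0, 1), repeat=3):
        dx, dy, dz = (d + s for d, s in zip(base, shift))
        d2 = (a * dx) ** 2 + (b * dy) ** 2 + (c * dz) ** 2 \
            + 2 * a * c * cos_beta * dx * dz
        best = min(best, math.sqrt(d2))
    return best


def shortest_self_contact(atom, ops, cell):
    """Shortest distance from an atom to its own symmetry copies
    (the identity operation at index 0 is skipped)."""
    return min(distance(atom, op(*atom), cell) for op in ops[1:])


cell = (4.0, 12.0, 20.0, 100.0)  # a, b, c in Å and beta in degrees (made up)
atom = (0.10, 0.24, 0.30)        # fractional coordinates (made up)

d_c_setting = shortest_self_contact(atom, P21_C, cell)
d_a_setting = shortest_self_contact(atom, P21_A, cell)
print(f"P21/c operations: {d_c_setting:.2f} Å")
print(f"P21/a operations: {d_a_setting:.2f} Å")
```

With these made-up numbers, the P21/a operations place a symmetry copy about 2.0 Å from the atom, below the 2.44 Å C...C threshold, while the P21/c operations keep the nearest copy more than 6 Å away.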

Rhombohedral Setting Ambiguity (Space Group R-3)

Structures in space group R-3 are excluded due to frequent ambiguity between hexagonal and rhombohedral settings, which leads to incorrectly transformed coordinates and false distances.
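
The two settings can often be told apart from the cell parameters alone: hexagonal axes have a = b, alpha = beta = 90° and gamma = 120°, while rhombohedral axes have a = b = c and alpha = beta = gamma. The Python sketch below encodes that check. It is a heuristic for illustration, not the FlexCryst procedure (which excludes these structures); the function name and the tolerances are assumptions. It only classifies the cell and cannot tell whether the deposited coordinates were actually given in that setting, which is the ambiguity described above.

```python
def classify_r3_setting(a, b, c, alpha, beta, gamma,
                        tol_len=0.01, tol_ang=0.1):
    """Guess the axis setting of an R-3 cell from its parameters.

    Lengths in Å, angles in degrees. The tolerances are assumed values.
    Returns "hexagonal", "rhombohedral", or "unclear".
    """
    def close(u, v, tol):
        return abs(u - v) <= tol

    hexagonal = (
        close(a, b, tol_len)
        and close(alpha, 90.0, tol_ang)
        and close(beta, 90.0, tol_ang)
        and close(gamma, 120.0, tol_ang)
    )
    rhombohedral = (
        close(a, b, tol_len)
        and close(b, c, tol_len)
        and close(alpha, beta, tol_ang)
        and close(beta, gamma, tol_ang)
        and not close(alpha, 90.0, tol_ang)  # 90° would be a cubic-shaped cell
    )
    if hexagonal:
        return "hexagonal"
    if rhombohedral:
        return "rhombohedral"
    return "unclear"


# Made-up cells for illustration.
print(classify_r3_setting(12.0, 12.0, 30.0, 90.0, 90.0, 120.0))
print(classify_r3_setting(10.0, 10.0, 10.0, 75.0, 75.0, 75.0))
print(classify_r3_setting(10.0, 11.0, 12.0, 90.0, 95.0, 90.0))
```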

3. Complexity & Data Integrity Logic

Asymmetric Unit (Z'): We exclude structures where the residue count contradicts the reported chemical formula, ensuring consistent graph mapping.

Complexity Limit: Systems with more than 40 molecules in the unit cell are rejected for computational stability. Standard co-crystals and solvates remain fully supported.
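
The two rules in this section can be expressed as a small filter. The Python sketch below is an illustration only: the record fields (`formula_residues`, `residues_3d`, `molecules_per_cell`), the order of the checks, and the reading of "contradicts" as a simple inequality of residue counts are all assumptions, and the Refcodes in the example are invented placeholders.

```python
from dataclasses import dataclass
from typing import Optional

MAX_MOLECULES_PER_CELL = 40  # complexity limit stated in the text


@dataclass
class Entry:
    """Hypothetical per-structure record; field names are assumptions."""
    refcode: str
    formula_residues: int    # residues implied by the chemical formula
    residues_3d: int         # residues found in the 3D coordinates
    molecules_per_cell: int  # molecules in the unit cell


def rejection_reason(entry: Entry) -> Optional[str]:
    """Return the rejection category name, or None if the entry passes."""
    if entry.formula_residues != entry.residues_3d:
        return "Z' (Asymmetric Unit)"
    if entry.molecules_per_cell > MAX_MOLECULES_PER_CELL:
        return "Residues"
    return None


# Invented placeholder entries, not real CSD Refcodes.
entries = [
    Entry("EXAMPL01", formula_residues=2, residues_3d=2, molecules_per_cell=8),
    Entry("EXAMPL02", formula_residues=2, residues_3d=1, molecules_per_cell=4),
    Entry("EXAMPL03", formula_residues=1, residues_3d=1, molecules_per_cell=48),
]
reasons = {e.refcode: rejection_reason(e) for e in entries}
for refcode, reason in reasons.items():
    print(refcode, "->", reason or "accepted")
```

The first placeholder entry stands for a two-component co-crystal or solvate: it passes both checks because its residue counts agree and it stays under the 40-molecule limit.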

Request the Exclusion List

The full list of 151,157 Refcodes is available for institutional research. Use the secure button below to generate a request email.