A practitioner's walkthrough of building a pure-Python parser for the largest EXPRESS schema in the STEP corpus — covering grammar, architecture, implementation, and what the MIM_LF entity hierarchy reveals for agentic digital twin infrastructure.
This paper presents a pragmatic Python parser that extracts ISO 10303 (STEP) EXPRESS schemas into structured entity and type definitions, enabling programmatic reasoning over the MIM_LF (ISO/TS 10303-442) schema used in model-based 3D engineering. The parser reveals critical architectural patterns—root scarcity, hierarchy depth, SELECT proliferation—that inform validation engines and LLM context generation. This foundation layer unlocks agentic digital twins, cross-standard analysis, and automated constraint solving for the built environment intelligence domain.
This paper presents a methodology for the automated extraction and structural analysis of **ISO 10303 (STEP)** schemas, with a specific focus on the **long-form Module Interpreted Model (MIM_LF) of ISO/TS 10303-442**. As model-based enterprises transition toward agentic digital twins, programmatic reasoning over the underlying EXPRESS-defined (ISO 10303-11) information architecture becomes a critical bottleneck.
I propose a Python-based parsing pipeline that treats normative EXPRESS files as immutable remote resources, transforming flat schema blocks into rich, hierarchical data structures. By analyzing entity-root scarcity, inheritance depth, and SELECT type proliferation within the **AP242** ecosystem, this study provides a foundational "instrumentation layer" for downstream LLM context injection and cross-standard gap analysis. This work represents an interpretive implementation of the STEP architecture, intended to bridge the gap between static normative documentation and dynamic, machine-actionable engineering informatics.
This paper explains how to automatically read and understand STEP schemas using Python—a foundational skill for anyone building digital twins in construction, real estate, or engineering. STEP (ISO 10303) is the international standard for exchanging building and product data between different software systems. As the construction industry moves toward autonomous digital twins—AI systems that can reason about building data throughout its lifecycle—the ability to programmatically "read" and interpret STEP schemas becomes essential. This paper presents a Python-based tool that converts complex STEP schema files into structured, machine-readable data. By analyzing how entities (types of objects), inheritance (parent-child relationships), and rules are organized within the AP242 STEP profile, this work provides the underlying infrastructure needed for digital twins to validate data, generate AI-ready context, and compare different standards. In essence, this enables AI systems to understand the formal rules of how building and product information should be organized.
**Executive Summary: Automating STEP Schema Interpretation for Digital Twin Operations**
ISO 10303 (STEP) is the foundational data standard for product lifecycle information across engineering-intensive industries. As enterprises transition toward autonomous digital twins and AI-driven asset management, the ability to programmatically interpret and validate STEP schemas becomes a critical competitive advantage—one that currently requires manual, expert-intensive analysis.
This research presents a lightweight automation framework that transforms complex STEP schema definitions into machine-actionable data structures. By eliminating manual schema interpretation, organizations can:
- **Accelerate digital twin deployment** — Enable AI systems to validate data integrity automatically
- **Reduce technical debt** — Eliminate vendor-specific data translation layers
- **Enable cross-system integration** — Standardize data exchange across disconnected systems (CAD, BIM, ERP, asset management)
- **Lower compliance risk** — Automate validation against formal standards
The framework serves as a foundational instrumentation layer for the Built Environment Intelligence (BEI) discipline—positioning organizations to move from static, document-centric workflows to continuous, data-driven asset optimization.
[1] Introduction: The Architecture of Ambition
The **ISO 10303** standard, colloquially known as **STEP**, represents perhaps the most structurally ambitious effort in the history of standardized information. Its goal is a total, computationally processable representation of a product's entire lifecycle—independent of vendor, CAD system, or engineering discipline. At the heart of this ambition is **EXPRESS**, a data modeling language defined in **Part 11** that predates modern schema formats yet exceeds them in semantic depth.
Navigating the STEP corpus, particularly the **MIM_LF (ISO/TS 10303-442)** schema used in **AP242**, presents a significant technical challenge. These schemas are not merely lists of attributes; they are complex directed acyclic graphs (DAGs) defined by inheritance, global rules, and procedural functions.
STEP (ISO 10303), often called the 'universal product language,' is one of the most ambitious technical standards ever created. Its goal is simple but profound: to provide a single, computer-readable way to describe a product's entire lifecycle, from design through manufacturing to operations and maintenance, that works regardless of which software you use or which industry you're in. At the heart of STEP is EXPRESS, a specialized language for defining data models that predates XML, JSON, and other modern formats, yet is more semantically rich (more capable of expressing meaning and constraints). Navigating STEP schemas, especially MIM_LF (the long-form Module Interpreted Model) used in AP242 for engineering, is technically challenging. These aren't simple lists of attributes like a spreadsheet. They're complex graphs of relationships: think of them as rulebooks that define not just what data exists, but what relationships between entities are valid and what constraints apply. This complexity is precisely what makes STEP powerful for construction and digital twins, but it also makes it hard to work with.
**The Strategic Opportunity: Breaking Vendor Silos Through Standardized Information Architecture**
ISO 10303 represents a decades-long international effort to create a vendor-agnostic, discipline-agnostic representation of product and asset information. While the standard is technically complex, its strategic value is clear:
**The Problem it Solves:**
- Design data locked in proprietary CAD formats
- Construction and operations data fragmented across disconnected systems (Revit, Navisworks, ERP, CMMS)
- Manual, error-prone data translation between systems
- Vendor lock-in and high switching costs
**The Opportunity:**
- A single semantic foundation that enables true interoperability across the product lifecycle
- Reduction in data transformation costs (currently 30-40% of IT budgets in large enterprises)
- Enables AI/ML initiatives that require standardized, validated inputs
- Positions organizations to adopt next-generation digital twins and autonomous asset management
This research provides the technical foundation to unlock that opportunity at scale.
### Scope, Technical Disclosure & Intellectual Property
This paper serves as a position piece on the **automated parsing and interpretive synthesis** of these schemas. It is important to note the following boundaries:
* **Interpretive Implementation:** All parsing logic, data structures (`EntityDef`, `TypeDef`), and analytical summaries are the original work of the author, representing an implementation-level interpretation of the EXPRESS grammar.
* **Normative References:** This study references the **ISO/TS 10303-442** schema as a "ground truth" resource. In compliance with intellectual property standards, this paper does not redistribute the copyrighted text of the ISO standards; the provided code fetches normative files directly from official technical repositories (such as the **MBX-IF** and the **ISO standards server**) for the purpose of interoperability analysis.
* **Research Intent:** The primary objective is to enable "agentic reasoning": allowing digital twin systems to validate data and generate context based on the formal semantics encoded within the standard.
* **Source Materials:** The schemas referenced (e.g., `mim_lf.exp`) are the intellectual property of the International Organization for Standardization (ISO).
This paper describes a practical approach to automatically reading and interpreting STEP schemas. A few important points about boundaries and permissions: First, the parsing code and data structures (EntityDef, TypeDef) presented here are original implementations—I'm building my own 'translator' rather than using existing ISO standards directly. Second, this work references the ISO/TS 10303-442 schema as the ground truth, but respects intellectual property standards by not reproducing the ISO standard's copyrighted text. Instead, it provides the logic and code to fetch and interpret these files from official sources. Third, all parsing logic, dataclasses, and analysis results are original work representing a practical interpretation of how the STEP standard's grammar works. The main goal is to enable 'agentic reasoning'—allowing AI systems and digital twins to validate building data and generate context based on the formal rules encoded in standards. Finally, this work doesn't redistribute the standard itself; the code automatically fetches normative schema files from official repositories (like the MBX-IF hub or ISO's servers) for interoperability analysis.
**Scope and Strategic Intent**
This framework is positioned as a **practical instrumentation layer** for organizations seeking to:
- Evaluate STEP/IFC adoption feasibility
- Automate schema validation and compliance checking
- Generate standardized context for AI/ML systems reasoning over engineering data
- Support cross-standard interoperability analysis (STEP vs. IFC vs. proprietary formats)
**Important Clarifications:**
- This is not a standards endorsement tool; it is an analysis and interpretation framework
- Compliance with ISO standards requires human validation; automation accelerates analysis
- The framework is designed to work with publicly available schema resources (ISO, MBX-IF)
- Applicable to any organization operating with STEP-based data exchange requirements
The intent is to democratize STEP expertise, enabling mid-market and enterprise organizations to adopt standardized data practices without large professional services investments.
[2] Context — Why MIM_LF?
ISO 10303 (STEP — Standard for the Exchange of Product model data) is one of the most
structurally ambitious standards ever published. Its ambition is total: a single,
computationally processable representation of product data across its entire lifecycle,
independent of any system, vendor, or discipline.
Part 11 of the standard defines EXPRESS — a declarative schema language purpose-built for
this task. EXPRESS predates most modern schema languages and shares conceptual DNA with
Pascal and object-oriented type systems, but operates at a higher semantic level. It
describes not just structure but constraints, derivations, and formal invariants. An
EXPRESS schema is simultaneously a data model, a type system, and a partial axiom set.
The standard is partitioned into Application Protocols (APs), each targeting a domain:
AP203 for configuration-controlled design, AP214 for automotive, AP242 for managed
model-based 3D engineering. Each AP publishes an Application Interpreted Model (AIM) —
a large monolithic EXPRESS schema — alongside its normative text.
The later modular restructuring changed this. The ISO/TS 10303-1xxx parts define
Application Modules (AMs): small, composable EXPRESS schemas with well-defined interfaces.
Each 4xx part defines the top-level module for one AP, so an AP is now assembled from a
library of modules rather than written as one giant file.
Each module carries a Module Interpreted Model (MIM). Integrating the MIMs of all required
modules into a single schema yields the functional equivalent of the old AIM, but traceable
to its module origins.
ISO/TS 10303-442 is the top-level Application Module for AP242, and `mim_lf.exp` is its
MIM in long form (the "LF"), with every interfaced declaration written out. It is the most
comprehensive single EXPRESS file in the STEP corpus: it integrates the complete set of
Application Modules needed for AP242 ed.2 and represents the current state of the STEP
information architecture for model-based enterprise.
Parsing this file is not an academic exercise. For any agentic digital twin system that
needs to reason over STEP data — validating STEP Part 21 file instances, generating
schema-grounded LLM context, or performing cross-standard gap analysis between ISO 10303
and ISO 16739 (IFC) — this schema is the ground truth.
STEP (Standard for the Exchange of Product Model Data, ISO 10303) is one of the most structurally comprehensive standards in existence. Its ambition is total: a single, computer-readable way to represent all product data (including buildings, components, and systems) across its entire lifecycle, independent of software vendor or engineering discipline. EXPRESS (defined in Part 11 of ISO 10303) is the specialized language that makes STEP possible. Although EXPRESS predates modern schema languages like JSON Schema, it is more expressive: it was designed to capture not just structure, but constraints and business rules. An EXPRESS schema is simultaneously a data model, a type system, and a formal rule set. STEP is organized into Application Protocols (APs), each targeting a specific domain: AP203 for product design, AP214 for automotive, AP242 for model-based 3D engineering and manufacturing. Each AP includes an Application Interpreted Model (AIM), a large schema that defines all the entities and relationships needed for that domain. More recently, STEP moved to a modular approach. Instead of writing one giant schema, APs are now assembled from smaller, reusable Application Modules (AMs), like building blocks for information models. Each module carries a Module Interpreted Model (MIM), and the long-form MIM (the "LF" in MIM_LF) is what you get when you combine all the modules needed for an AP into a single working schema. ISO/TS 10303-442 provides this long-form MIM for AP242, the application protocol for managed model-based 3D engineering. The mim_lf.exp file is the largest single EXPRESS file in the STEP ecosystem: it integrates hundreds of modules into one schema, representing everything needed for model-based enterprise and digital twins. For any AI system or digital twin that needs to reason over STEP data (validating that data is correct, generating AI-readable documentation, or comparing STEP to other standards like IFC), this schema is the authoritative source.
**STEP as Industry Infrastructure: Why Standardized Data Models Matter**
ISO 10303 (STEP) is the internationally standardized framework for capturing and exchanging product lifecycle information across engineering, manufacturing, and operations. Think of it as the "TCP/IP of industrial data"—a foundational layer that enables interoperability independent of vendor, system, or discipline.
**Why This Matters for Real Estate and Construction:**
- **Designed for complexity** — Accommodates geometry, materials, performance, configuration, and constraints in a single framework
- **Discipline-agnostic** — Works for mechanical, electrical, structural, and building systems
- **Lifecycle coverage** — Spans design, manufacturing, operations, and maintenance phases
- **Vendor independence** — Forces suppliers to compete on capability, not lock-in
**Current State of Industry Adoption:**
AP242 (Managed Model-Based 3D Engineering) represents the most comprehensive application of STEP for modern digital engineering. It integrates hundreds of standardized information modules into a single semantic schema, the largest standardized data model in the engineering world.
For organizations with complex asset portfolios (real estate, construction, infrastructure), STEP adoption eliminates the fragmentation penalty currently paid through manual data translation, duplicate entry, and decision-making delays.
The MBX-IF (Model-Based Experience Implementer Forum), hosted at
`https://www.mbx-if.org/home/mbx/resources/express-schemas/`, maintains the canonical
distribution point for EXPRESS schemas across the STEP and related families. It is the
practitioner's first stop — not ISO's official website, which distributes schemas as
normative attachments to purchased standards documents. The MBX-IF page provides direct
download links for current editions of schemas including:
- MIM_LF (ISO/TS 10303-442) — the subject of this exercise
- IFC EXPRESS schemas (ISO 16739) — for cross-standard work
- SMRL (STEP Module and Resource Library)
- Individual Application Module schemas
The URL we will use is the normative Ed.7 MIM_LF schema from the ISO standards server
itself, which MBX-IF references:
https://standards.iso.org/iso/ts/10303/-442/ed-7/tech/express/mim_lf.exp
This file is authoritative. It is large — expect several megabytes — and it is the
integration of hundreds of modules, each contributing entities, types, rules, and
functions into a single flat SCHEMA...END_SCHEMA block.
The MBX-IF (Model-Based Experience Implementer Forum) at https://www.mbx-if.org/home/mbx/resources/express-schemas/ is where practitioners find EXPRESS schemas. While ISO officially publishes schemas as attachments to purchased standard documents, MBX-IF provides direct download links that are much more convenient. You'll find there: the MIM_LF schema (ISO/TS 10303-442), the subject of this work; IFC schemas (ISO 16739), for comparing building data standards; SMRL (STEP Module and Resource Library); and individual Application Module schemas. For this exercise, we use the authoritative Ed.7 MIM_LF schema directly from ISO's standards server: https://standards.iso.org/iso/ts/10303/-442/ed-7/tech/express/mim_lf.exp. This file is the source of truth. It's large (several megabytes) because it integrates hundreds of modules into a single flat SCHEMA...END_SCHEMA block, with every entity, type, and rule physically present.
**Where Standards Live: The Model-Based Experience Implementer Forum**
ISO maintains STEP schemas as normative documents attached to paid standards. However, the Model-Based Experience Implementer Forum (MBX-IF, https://www.mbx-if.org) serves as the practical distribution point—freely accessible and versioned. This is where practitioners access the authoritative schema definitions needed for implementation and validation.
**Strategic Implication:** Standards governance and version control are critical to digital transformation success. Organizations should establish clear policies around which standards versions are in use across their platforms and systems.
[3] EXPRESS Grammar — A Practitioner's Reference
Before parsing, you need to know exactly what you are parsing. EXPRESS (ISO 10303-11)
defines a small but precise grammar. The constructs relevant to schema analysis are:
**SCHEMA block**: the top-level container. A long-form file such as `mim_lf.exp` contains exactly one.
SCHEMA mim_lf;
  USE FROM <schema_name> (<entity>, <entity>, ...);
  REFERENCE FROM <schema_name> (...);
  ...declarations...
END_SCHEMA;
The USE FROM and REFERENCE FROM statements are the module interface. In a long-form
schema such as MIM_LF, these are resolved (all referenced entities are physically
present), so we can ignore them for parsing purposes.
**ENTITY**: the primary structural unit. Entities support multiple inheritance,
declared with SUBTYPE OF and constrained with SUPERTYPE OF.
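ENTITY <entity_name>
  [ABSTRACT] [SUPERTYPE OF (<supertype_expression>)]
  [SUBTYPE OF (<parent>, <parent>, ...)];
  <attribute> : [OPTIONAL] <type>;
  ...
[DERIVE
  <attribute> : <type> := <expression>;
  ...]
[INVERSE
  <attribute> : [SET|BAG] [<bounds>] OF <entity> FOR <attribute>;
  ...]
[UNIQUE
  <label> : <attribute>, ...;
  ...]
[WHERE
  <label> : <boolean_expression>;
  ...]
END_ENTITY;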
Before parsing a STEP schema, you need to understand its grammar. EXPRESS (ISO 10303-11) uses a precise, compact syntax. The main constructs are: **SCHEMA block**—the container for everything, with module imports via USE FROM and REFERENCE FROM statements. **ENTITY**—the core building block, representing a type of object (like 'product', 'shape', 'document'). Entities support inheritance: a SUBTYPE OF declaration says 'this entity is a specialization of a parent entity.' A SUPERTYPE OF declaration constrains how this entity can be specialized. **DERIVE section**—defines computed attributes (attributes whose values are calculated from other attributes, not stored directly). This matters for digital twins because derived attributes encode business logic. **INVERSE section**—defines backward relationships (if entity A references entity B, the inverse lets you navigate from B back to A). **WHERE section**—formal rules that every valid instance must follow. These are the schema's constraints, expressed as boolean logic. They're crucial for validation. **TYPE**—defines aliases, enumerations (closed lists of values), SELECT types (discriminated unions—'this attribute can be one of these types'), and aggregations (collections like SET, LIST, ARRAY). SELECT types are particularly important in STEP because they represent the flexibility that real-world data needs. For example, 'measure_value' might be a SELECT that unifies different kinds of measurements. **RULE**—global constraints spanning multiple entities. These define cross-entity consistency. **FUNCTION/PROCEDURE**—procedural logic for computing values. Understanding these constructs is the foundation for reading any STEP schema.
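To make the ENTITY header rules concrete, here is a minimal sketch of pulling the ABSTRACT flag, the SUBTYPE OF parents, and the raw SUPERTYPE OF constraint out of a header string. The regexes and the helper names (`balanced_contents`, `parse_header`) are assumptions made for illustration, not part of the parser described in this paper. The SUPERTYPE OF expression can contain nested parentheses such as `ONEOF (...)`, so it is extracted with a small depth counter rather than a regex.

```python
import re
from typing import Optional

# Illustrative patterns; all names here are hypothetical.
ABSTRACT_RE = re.compile(r"\bABSTRACT\b", re.IGNORECASE)
SUBTYPE_RE = re.compile(r"\bSUBTYPE\s+OF\s*\(([^)]*)\)", re.IGNORECASE)
SUPERTYPE_RE = re.compile(r"\bSUPERTYPE\s+OF\s*\(", re.IGNORECASE)


def balanced_contents(text: str, open_idx: int) -> Optional[str]:
    """Return the text inside the parenthesis that opens at open_idx."""
    depth = 0
    for i in range(open_idx, len(text)):
        if text[i] == "(":
            depth += 1
        elif text[i] == ")":
            depth -= 1
            if depth == 0:
                return text[open_idx + 1 : i]
    return None  # unbalanced input


def parse_header(header: str) -> dict:
    """Extract inheritance facts from the header text that follows the entity name."""
    sub = SUBTYPE_RE.search(header)
    sup = SUPERTYPE_RE.search(header)
    return {
        "abstract": bool(ABSTRACT_RE.search(header)),
        "supertypes": [p.strip() for p in sub.group(1).split(",")] if sub else [],
        # sup.end() - 1 is the index of the opening parenthesis.
        "supertype_constraint": balanced_contents(header, sup.end() - 1) if sup else None,
    }


header = "ABSTRACT SUPERTYPE OF (ONEOF (circle, square)) SUBTYPE OF (item, thing)"
info = parse_header(header)
print(info)
```

The constraint is kept as raw text on purpose: interpreting `ONEOF`, `AND`, and `ANDOR` would need a real expression parser, which is outside the scope of a structural extraction pass.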
**How STEP Schemas Encode Information Rules**
STEP schemas define three critical things:
1. **Structure** — What data elements exist and how they relate
2. **Constraints** — What values are valid and why (formal business rules)
3. **Inheritance** — How specialized concepts derive from general ones (reducing redundancy)
For digital twins and autonomous validation, constraint rules are the most valuable. A schema constraint like "a building must have at least one accessible emergency exit" is encoded formally, enabling AI systems to validate instances automatically rather than relying on manual inspection.
This formalization is what separates STEP from simpler approaches (spreadsheets, document standards). It enables **machine reasoning**—the foundation of next-generation asset management.
[4] Parser Architecture
This parser is deliberately not a full PEG or ANTLR grammar. That would be correct
engineering for a production compiler — but our goal is structured extraction, not
validation. The EXPRESS grammar is regular enough that a well-designed regex scanner,
combined with section-aware string splitting, covers the 95% case for the entities
and types we care about.
The pipeline has five stages:
1. **Acquisition** — HTTP fetch with local cache. STEP files use Latin-1 encoding
per ISO 10303-21; the schema files follow the same convention.
2. **Normalisation** — Strip comments (block `(* *)` and line `--`), which can
contain spurious keywords that fool naive pattern matchers.
3. **Schema extraction** — Isolate the SCHEMA...END_SCHEMA block and extract the
schema name. In MIM_LF this gives us the module identity.
4. **Declaration scanning** — Regex scan for ENTITY and TYPE blocks. Because these
blocks do not nest in EXPRESS, a non-overlapping DOTALL regex is sufficient and
fast even on megabyte-scale inputs.
5. **Block parsing** — For each extracted block, parse the header (ABSTRACT,
SUPERTYPE OF, SUBTYPE OF), split the body into sections (explicit, DERIVE,
INVERSE, UNIQUE, WHERE), and parse each section into typed dataclass instances.
The output is a list of `EntityDef` and `TypeDef` dataclasses, which feed directly
into hierarchy analysis, JSON serialisation for LLM context, or graph construction
for downstream agentic reasoning.
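Stages 2 to 4 can be sketched in a few lines. The patterns and function names below (`strip_comments`, `scan_entities`) are hypothetical and simplified: they do not handle nested remarks or `--` inside string literals, and the sample schema is invented for the demonstration.

```python
import re

# Illustrative patterns; not necessarily the regexes a production scanner would use.
BLOCK_COMMENT = re.compile(r"\(\*.*?\*\)", re.DOTALL)
LINE_COMMENT = re.compile(r"--[^\n]*")
SCHEMA_BLOCK = re.compile(
    r"\bSCHEMA\s+(\w+)\s*;(.*?)\bEND_SCHEMA\s*;", re.DOTALL | re.IGNORECASE
)
ENTITY_BLOCK = re.compile(
    r"\bENTITY\s+(\w+)(.*?)\bEND_ENTITY\s*;", re.DOTALL | re.IGNORECASE
)


def strip_comments(text: str) -> str:
    """Stage 2: drop (* ... *) and -- remarks so keywords inside them are ignored."""
    text = BLOCK_COMMENT.sub(" ", text)
    return LINE_COMMENT.sub("", text)


def scan_entities(text: str) -> tuple[str, dict[str, str]]:
    """Stages 3 and 4: isolate the schema block, then map entity name -> raw block text."""
    clean = strip_comments(text)
    match = SCHEMA_BLOCK.search(clean)
    if match is None:
        raise ValueError("no SCHEMA ... END_SCHEMA block found")
    name, body = match.group(1), match.group(2)
    blocks = {m.group(1): m.group(2).strip() for m in ENTITY_BLOCK.finditer(body)}
    return name, blocks


sample = """
SCHEMA demo; (* ENTITY fake; END_ENTITY; *)
ENTITY shape; -- base entity
  name : STRING;
END_ENTITY;
ENTITY circle SUBTYPE OF (shape);
  radius : REAL;
END_ENTITY;
END_SCHEMA;
"""
schema_name, entities = scan_entities(sample)
print(schema_name, sorted(entities))
```

The `\b` word boundary keeps `ENTITY` from matching inside `END_ENTITY`, because the underscore counts as a word character. The commented-out `fake` entity shows why normalisation has to run before scanning.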
This parser deliberately avoids being a full, production-grade compiler grammar (which would use PEG or ANTLR tools). Why? Because our goal isn't absolute validation; it's practical extraction and understanding. EXPRESS is structured enough that careful regex patterns combined with intelligent string splitting handle 95% of real-world cases. The pipeline has five stages: (1) **Acquisition**: fetch the schema file over HTTP with local caching. STEP files use Latin-1 (ISO 8859-1) encoding, not UTF-8, for historical reasons. (2) **Normalisation**: remove comments, which can confuse simple pattern matching. EXPRESS has two comment forms: block comments (`(* ... *)`) and line comments (`--`). (3) **Schema extraction**: isolate the SCHEMA...END_SCHEMA block and identify the schema name. (4) **Declaration scanning**: use regex to find all ENTITY and TYPE blocks. Since these don't nest, a straightforward regex is fast even on megabyte files. (5) **Block parsing**: for each block, parse the header (ABSTRACT, SUPERTYPE OF, SUBTYPE OF), split the body into sections (explicit attributes, DERIVE, INVERSE, UNIQUE, WHERE), and convert each into typed Python objects (dataclasses). The output is lists of EntityDef and TypeDef objects, ready for hierarchy analysis, LLM context generation, or graph construction.
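The section-aware splitting in stage 5 might look like the sketch below. It assumes each section keyword starts its own line, which is conventional in published schemas but not required by the grammar. The names `split_sections` and `parse_explicit` are hypothetical, and the attribute regex only handles the simple `name : [OPTIONAL] type;` form (not comma-separated attribute lists).

```python
import re

# Section keywords are assumed to appear at the start of a line.
SECTION_RE = re.compile(
    r"^\s*(DERIVE|INVERSE|UNIQUE|WHERE)\b", re.MULTILINE | re.IGNORECASE
)
ATTR_RE = re.compile(r"(\w+)\s*:\s*(OPTIONAL\s+)?([^;]+);", re.IGNORECASE)


def split_sections(body: str) -> dict[str, str]:
    """Split an ENTITY body into 'explicit' plus one entry per section keyword."""
    sections: dict[str, str] = {}
    current, start = "explicit", 0
    for match in SECTION_RE.finditer(body):
        sections[current] = body[start : match.start()].strip()
        current, start = match.group(1).lower(), match.end()
    sections[current] = body[start:].strip()
    return sections


def parse_explicit(section: str) -> list[tuple[str, str, bool]]:
    """Return (name, type expression, optional flag) for each simple attribute."""
    return [
        (m.group(1), " ".join(m.group(3).split()), bool(m.group(2)))
        for m in ATTR_RE.finditer(section)
    ]


body = """
  id : identifier;
  description : OPTIONAL text;
DERIVE
  label : STRING := id;
WHERE
  WR1 : LENGTH(id) > 0;
"""
sections = split_sections(body)
attrs = parse_explicit(sections["explicit"])
print(sorted(sections), attrs)
```

Each section is handed to its own small parser afterwards, so a failure on an unusual DERIVE expression does not prevent the explicit attributes from being extracted.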
**Practical Engineering: Why Lightweight Automation Beats Comprehensive Parsing**
The approach described here prioritizes **rapid deployment and maintainability** over perfect grammatical parsing. Rather than building a heavyweight compiler (which would require months and deep compiler expertise), we extract the 95% of schema information that matters for business use cases:
- Entity definitions and inheritance chains
- Attributes and their data types
- Formal constraints (WHERE rules)
- Type definitions and enumerations
**Why This Matters:**
- **Faster time-to-value** — Deploy schema analysis capabilities in weeks, not months
- **Easier maintenance** — Lightweight code is easier to understand and modify as standards evolve
- **Vendor independence** — No external dependencies means no licensing risks
- **Scalability** — Processes even the largest schemas (MIM_LF = several megabytes) efficiently
[5] Implementation
The implementation requires only the Python standard library. No third-party packages.
```python
import re
import json
import urllib.request
from dataclasses import dataclass, field
from pathlib import Path
from collections import defaultdict
from typing import Optional
# region CONSTANTS
MBXIF_RESOURCES = "https://www.mbx-if.org/home/mbx/resources/express-schemas/"
SCHEMA_URL = "https://standards.iso.org/iso/ts/10303/-442/ed-7/tech/express/mim_lf.exp"
CACHE_DIR = Path("src/var/data/schemas")
# endregion
```
This implementation needs only Python's standard library—no external packages. We use the built-in dataclasses for typed structures, regex for pattern matching, urllib for HTTP, pathlib for file handling, and json for serialization. This minimal footprint means you can run this parser anywhere Python runs, with no dependency management overhead. Here's what we import: `re` (regular expressions), `json` (serialization), `urllib.request` (HTTP fetching), `dataclasses` (typed data structures), `pathlib.Path` (file handling), `collections.defaultdict` (efficient key-value grouping), and `typing.Optional` (type hints). Constants define where to fetch from: MBX-IF's resource hub and ISO's standards server.
**Zero External Dependencies: Enterprise-Grade Simplicity**
This framework uses only Python standard library components.
**Strategic Advantages:**
- No vendor dependencies or licensing complications
- Works in air-gapped and highly secured environments
- Trivial to deploy and maintain across enterprise infrastructure
- Minimal operational risk
This design makes it suitable for critical business systems where supply chain risk and vendor lock-in are concerns.
Dataclasses give us typed, inspectable objects without the overhead of ORM-style
frameworks. The `section` field on `Attribute` distinguishes explicit attributes
from derived and inverse ones — a distinction that matters for digital twin reasoning,
because derived attributes represent computable knowledge while inverse attributes
encode the reverse traversal of relationships.
```python
# region DATA STRUCTURES
@dataclass
class Attribute:
    """
    An attribute on an EXPRESS ENTITY.

    Attributes
    ----------
    name : str
        Attribute identifier as declared in the schema.
    type_expr : str
        Raw type expression string, e.g. 'SET [1:?] OF product_definition'.
    optional : bool
        True if preceded by OPTIONAL keyword.
    section : str
        Source section: 'explicit' | 'derived' | 'inverse'.
    """

    name: str
    type_expr: str
    optional: bool = False
    section: str = "explicit"


@dataclass
class WhereRule:
    """
    A WHERE domain rule from an ENTITY or TYPE declaration.

    Attributes
    ----------
    label : str
        Rule identifier, e.g. 'WR1', 'UR1'.
    expression : str
        Boolean expression text (not parsed further here).
    """

    label: str
    expression: str


@dataclass
class EntityDef:
    """
    Parsed representation of an EXPRESS ENTITY declaration.

    Attributes
    ----------
    name : str
        Entity name.
    abstract : bool
        True if the entity carries the ABSTRACT keyword.
    supertypes : list[str]
        Names from SUBTYPE OF (...): the direct parents in the hierarchy.
    supertype_constraint : str | None
        The ONEOF(...) or AND expression from SUPERTYPE OF, if present.
        This encodes the partitioning constraint on the entity's subtypes.
    attributes : list[Attribute]
        All attributes across explicit, derived, and inverse sections.
    where_rules : list[WhereRule]
        Formal domain invariants declared in the WHERE section.
    """

    name: str
    abstract: bool = False
    supertypes: list[str] = field(default_factory=list)
    supertype_constraint: Optional[str] = None
    attributes: list[Attribute] = field(default_factory=list)
    where_rules: list[WhereRule] = field(default_factory=list)


@dataclass
class TypeDef:
    """
    Parsed representation of an EXPRESS TYPE declaration.

    Attributes
    ----------
    name : str
        Type name.
    kind : str
        One of: 'alias' | 'enumeration' | 'select' | 'aggregate'.
    base : str
        Underlying content: primitive type, comma-separated enum values,
        SELECT members, or aggregation expression.
    where_rules : list[WhereRule]
        Formal constraints on alias types (e.g., string length restrictions).
    """

    name: str
    kind: str
    base: str
    where_rules: list[WhereRule] = field(default_factory=list)
# endregion
```
Dataclasses, Python's built-in way to create typed, inspectable objects, give us clean, readable representations of STEP schema elements without heavy frameworks. An **Attribute** represents a single piece of data on an entity: its name, type expression (as text), whether it's optional, and which section it came from (explicit storage, derived computation, or inverse relationship). The section field matters for digital twins because it tells you whether this attribute is stored data or computed logic. A **WhereRule** captures a formal domain constraint: its label ('WR1', 'UR2', etc.) and the boolean expression it enforces. An **EntityDef** is a parsed STEP entity: its name, whether it's abstract (can't be directly instantiated), its parent entities (supertypes), any constraint on how it can be subtyped (supertype_constraint), all its attributes across all sections, and its where rules. A **TypeDef** is a parsed STEP type: its name, what kind it is (alias, enumeration, select, or aggregate), the content (what it aliases to, or the members of a select), and any constraints. Because every field is a string, boolean, or list, these dataclasses convert to plain dictionaries with `dataclasses.asdict` and from there to JSON, which makes them well suited to feeding LLMs, storing in databases, or passing to downstream analysis.
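As a small illustration of the JSON path, the sketch below serialises a nested dataclass with `dataclasses.asdict` and `json.dumps`. `DemoAttribute` and `DemoEntity` are trimmed, hypothetical stand-ins for `Attribute` and `EntityDef` so that the example runs on its own; the entity values are invented.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DemoAttribute:  # hypothetical stand-in for Attribute
    name: str
    type_expr: str
    optional: bool = False
    section: str = "explicit"


@dataclass
class DemoEntity:  # hypothetical stand-in for EntityDef
    name: str
    supertypes: list[str] = field(default_factory=list)
    attributes: list[DemoAttribute] = field(default_factory=list)


entity = DemoEntity("circle", ["shape"], [DemoAttribute("radius", "REAL")])

# asdict recurses into nested dataclasses and lists, so one call is enough.
payload = json.dumps(asdict(entity), indent=2)
print(payload)
```

The resulting string can be pasted into an LLM prompt or written to disk as-is. Reading it back yields plain dictionaries, so rebuilding the dataclass instances is a separate step.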
**How We Represent Schemas for Machine Reasoning**
Schemas are transformed into three key data structures:
1. **Entity Definitions** — Represent real-world objects (parts, buildings, assets) and their attributes, inheritance hierarchies, and formal rules
2. **Type Definitions** — Represent data domains and constraints (enumerations like "material type," unions of related concepts like "measurement units")
3. **Formal Rules** — Capture business logic that must hold true ("load-bearing walls must have structural strength >= X", "a project must reference a valid budget")
These structures feed directly into:
- **AI/LLM context generation** — Enabling large language models to reason correctly about domain data
- **Automated validation** — Checking that data instances comply with schema constraints
- **Digital twin instantiation** — Ensuring simulations contain semantically valid information
The fetch function treats the schema URL as an immutable remote resource. After the
first download it reads from disk. STEP-family files historically use Latin-1 (ISO 8859-1)
encoding, not UTF-8 — the schema body is 7-bit ASCII in practice, but the encoding
declaration matters for robustness.
```python
# region ACQUISITION
def fetch_schema(url: str, cache_dir: Path) -> str:
    """
    Download EXPRESS schema with local caching.

    STEP-family files are conventionally decoded as ISO 8859-1 (Latin-1);
    EXPRESS schema files (.exp) are treated the same way here. The cached
    copy is written and read back as UTF-8.

    Parameters
    ----------
    url : str
        Full URL to the .exp file.
    cache_dir : Path
        Directory to cache the raw file.

    Returns
    -------
    str
        Schema text (decoded from Latin-1 on first download).
    """
    filename = url.split("/")[-1]
    cache_path = cache_dir / filename
    if cache_path.exists():
        print(f"[cache hit] {cache_path}")
        # The cache is written as UTF-8 below, so it must be read as UTF-8;
        # reading it as Latin-1 would corrupt any non-ASCII character.
        return cache_path.read_text(encoding="utf-8")
    print(f"[fetch] {url}")
    req = urllib.request.Request(
        url,
        headers={"User-Agent": "EXPRESS-Schema-Parser/1.0 (research)"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        raw = resp.read()
    text = raw.decode("latin-1")
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(text, encoding="utf-8")
    print(f"[saved] {cache_path} ({len(text):,} chars)")
    return text
# endregion
```
The fetch_schema function downloads the schema file and caches it locally. Once cached, it reads from disk—eliminating redundant network calls. STEP and EXPRESS files historically use Latin-1 (ISO 8859-1) encoding, not UTF-8, following ISO 10303-21 (the STEP physical file format). While the content is usually 7-bit ASCII, respecting the encoding declaration ensures robustness. The function adds a User-Agent header to indicate this is a research tool, then stores the downloaded file as UTF-8 locally for easier downstream handling.
**Fetching and Caching Standards Data**
Schemas are treated as immutable remote resources, fetched once and cached locally. This approach provides:
- **Reliability** — Works even if external sources become temporarily unavailable
- **Auditability** — Clear version control of which standards versions are in use
- **Performance** — Subsequent runs use cached data
- **Compliance** — Demonstrates controlled standards governance
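One way to make the auditability point concrete is to record a content fingerprint next to the cached file, so a later run can show exactly which bytes were analysed. This is a sketch, not part of the pipeline above: `record_fingerprint` and the `.sha256.json` sidecar name are hypothetical, and the demo writes a tiny stand-in file to a temporary directory.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def record_fingerprint(cache_path: Path) -> dict:
    """Write <file>.sha256.json beside the cached schema and return the record."""
    data = cache_path.read_bytes()
    record = {
        "file": cache_path.name,
        "sha256": hashlib.sha256(data).hexdigest(),
        "bytes": len(data),
    }
    sidecar = cache_path.with_name(cache_path.name + ".sha256.json")
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return record


# Demo on a stand-in file; a real run would pass the cached .exp path.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.exp"
    demo.write_bytes(b"SCHEMA demo; END_SCHEMA;")
    demo_record = record_fingerprint(demo)
    sidecar_exists = (Path(tmp) / "demo.exp.sha256.json").exists()

print(demo_record)
```

Comparing the stored digest against a fresh one is a cheap check that the cached schema has not changed between runs.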
Comment stripping is stage zero of parsing. The MIM_LF schema is heavily commented:
entity-level annotations, source module traceability markers, and STEP editorial notes
all appear in block comments. The regex must be non-greedy on block comments. The
line-comment pattern does not inspect string literals, so a `--` inside a quoted string
would be cut off at that point.
ISO 10303-11 allows block comments (embedded remarks) to nest, so the non-greedy regex
is a simplification: it is only correct for schemas that contain no nested `(* ... *)` pairs.
```python
# region NORMALISATION
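def strip_comments(text: str) -> str:
    """
    Remove EXPRESS comment forms from schema text.

    EXPRESS defines two comment forms (ISO 10303-11, Remarks):
    - Block: (* ... *), which the standard allows to nest; the regex
      below assumes the schema contains no nested block comments.
    - Line: -- to end of line

    String literals are not inspected, so comment markers inside a quoted
    string are treated as comments.

    Parameters
    ----------
    text : str
        Raw schema text.

    Returns
    -------
    str
        Schema text with comments removed (each block comment becomes a
        single space; line comments are dropped up to the newline).
    """
    # Block comments: non-greedy, DOTALL for multiline
    text = re.sub(r'\(\*.*?\*\)', ' ', text, flags=re.DOTALL)
    # Line comments: to end of line (the newline itself is kept)
    text = re.sub(r'--[^\n]*', '', text)
    return text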


def extract_schema_block(text: str) -> tuple[str, str]:
    """
    Isolate the primary SCHEMA...END_SCHEMA block.

    Parameters
    ----------
    text : str
        Normalised schema text.

    Returns
    -------
    tuple[str, str]
        (schema_name, schema_body) where schema_body is everything
        between the opening semicolon and END_SCHEMA;.

    Raises
    ------
    ValueError
        If no valid SCHEMA block is found.
    """
    m = re.search(
        r'\bSCHEMA\s+(\w+)\s*;(.*?)\bEND_SCHEMA\s*;',
        text,
        re.DOTALL | re.IGNORECASE,
    )
    if not m:
        raise ValueError("No SCHEMA...END_SCHEMA block found in text")
    return m.group(1), m.group(2)
# endregion
```
Comment stripping is the first parse stage because EXPRESS comments can contain keywords that fool naive pattern matching. EXPRESS has two comment forms: block comments with (* and *) delimiters, and line comments starting with --. ISO 10303-11 permits block comments to nest, so the non-greedy regex used here is only safe on schemas without nested comments. The line-comment regex also does not protect `--` inside string literals (though in practice STEP schemas don't have strings with comment-like content). After comments are removed, the extract_schema_block function isolates the SCHEMA...END_SCHEMA block and extracts the schema name. For MIM_LF, this gives us the module identity.
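If nested block comments or comment markers inside strings ever turn out to matter, a small character scanner handles both cases that the regexes skip. The sketch below is an alternative, not the parser's own code: `strip_remarks` is a hypothetical name, and it only treats single-quoted EXPRESS string literals (with `''` as an escaped quote) as protected regions.

```python
def strip_remarks(text: str) -> str:
    """Remove nested (* ... *) and -- remarks, leaving '...' literals intact."""
    out: list[str] = []
    i, n, depth = 0, len(text), 0
    while i < n:
        two = text[i:i + 2]
        if two == "(*":
            depth += 1                      # open (possibly nested) remark
            i += 2
        elif depth and two == "*)":
            depth -= 1                      # close one nesting level
            i += 2
            if depth == 0:
                out.append(" ")             # outermost remark becomes one space
        elif depth:
            i += 1                          # inside a remark: discard
        elif two == "--":
            j = text.find("\n", i)          # tail remark: skip to end of line
            i = n if j == -1 else j
        elif text[i] == "'":
            j = i + 1                       # string literal: copy verbatim
            while j < n:
                if text[j] == "'":
                    if text[j + 1:j + 2] == "'":
                        j += 2              # '' is an escaped quote
                        continue
                    break
                j += 1
            out.append(text[i:j + 1])
            i = j + 1
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


sample = "a (* outer (* inner *) still outer *) b -- tail\nc := 'x -- y (* z *)';"
cleaned = strip_remarks(sample)
print(cleaned)
```

The scanner is linear in the input length, so the cost over a schema of this size stays small compared with the download.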
**Preparing Raw Standards for Analysis**
Raw schema files contain extensive human-readable comments and annotations. The first processing step removes these while preserving semantic content, producing a clean input for programmatic analysis. This is equivalent to how manufacturing removes surface contamination before precision machining.
Entity parsing is the heart of the exercise. The key insight is that each ENTITY block
has a clear two-part structure: a **header** (everything before the first semicolon)
and a **body** (the attribute and constraint sections). The sections in the body are
delimited by reserved words (DERIVE, INVERSE, UNIQUE, WHERE), which never appear as
attribute names. UNIQUE can appear inside a type expression, though (`LIST OF UNIQUE`),
so a keyword should only count as a section marker when it opens a statement.
The attribute line format `name : [OPTIONAL] type_expr` is regular enough for a simple
split, but the type expression can be arbitrarily complex:
    components : LIST [1:?] OF UNIQUE action_item;
    mapping : LIST [0:?] OF LIST [1:?] OF REAL;
    ref : representation_or_representation_reference;
We capture the full type expression as a string and defer deeper type parsing to a
separate stage (not implemented here, but straightforward to add).
```python
# region ENTITY PARSING
def _split_entity_sections(body: str) -> dict[str, str]:
    """
    Partition entity body into named sections.

    EXPRESS entities partition their body with keyword markers.
    This function slices the body string at those markers.

    Parameters
    ----------
    body : str
        Entity body text (after the header semicolon, before END_ENTITY).

    Returns
    -------
    dict[str, str]
        Keys: 'explicit', 'DERIVE', 'INVERSE', 'UNIQUE', 'WHERE'
        (only the sections that are present).
        Values: raw text of each section.
    """
    # A keyword is a section marker only at the start of a statement
    # (start of body, or right after ';'). This keeps the UNIQUE in
    # 'LIST [1:?] OF UNIQUE x' from being mistaken for a UNIQUE section.
    section_re = re.compile(
        r'(?:\A|;)\s*(DERIVE|INVERSE|UNIQUE|WHERE)\b', re.IGNORECASE
    )
    parts: dict[str, str] = {}
    current_key = "explicit"
    current_start = 0
    for m in section_re.finditer(body):
        # Slice up to the keyword so the preceding ';' stays with its statement
        parts[current_key] = body[current_start : m.start(1)].strip()
        current_key = m.group(1).upper()
        current_start = m.end()
    parts[current_key] = body[current_start:].strip()
    return parts


def _parse_attribute_line(
    line: str, section: str = "explicit"
) -> Optional[Attribute]:
    """
    Parse a single attribute declaration line.

    Handles: explicit form (<name> : [OPTIONAL] <type>)
    and derived form (<name> : <type> := <expr>); for derived,
    pass only the left side of ':='.

    Declarations that list several names before the colon, or that
    redeclare an inherited attribute with a SELF-qualified name, are
    not recognised and return None.

    Parameters
    ----------
    line : str
        Raw attribute text, semicolons stripped.
    section : str
        Source section tag.

    Returns
    -------
    Attribute | None
        Parsed attribute, or None if the line is not a valid declaration.
    """
    line = line.strip().rstrip(';')
    if ':' not in line:
        return None
    colon_pos = line.index(':')
    name = line[:colon_pos].strip()
    # Attribute names are simple identifiers
    if not re.match(r'^\w+$', name):
        return None
    type_expr = line[colon_pos + 1:].strip()
    optional = bool(re.match(r'OPTIONAL\b', type_expr, re.IGNORECASE))
    if optional:
        type_expr = re.sub(r'^OPTIONAL\s+', '', type_expr, flags=re.IGNORECASE)
    return Attribute(name=name, type_expr=type_expr, optional=optional, section=section)


def parse_entity(name: str, raw_block: str) -> EntityDef:
    """
    Parse a complete ENTITY block into an EntityDef.

    Handles these parts of the EXPRESS entity grammar:
    - ABSTRACT and SUPERTYPE OF constraints
    - SUBTYPE OF inheritance declarations
    - Explicit, derived, and inverse attribute sections
    - WHERE domain rules
    UNIQUE rules are split off by the section scanner but not parsed.

    Parameters
    ----------
    name : str
        Entity name (already extracted by the scanner).
    raw_block : str
        Complete text from ENTITY ... END_ENTITY;

    Returns
    -------
    EntityDef
        Structured representation of the entity.
    """
    # Strip outer ENTITY/END_ENTITY wrappers
    body = re.sub(r'^ENTITY\s+\w+', '', raw_block, flags=re.IGNORECASE).strip()
    body = re.sub(r'END_ENTITY\s*;?\s*$', '', body, flags=re.IGNORECASE).strip()

    # Split header (before first ';') from attribute sections
    first_semi = body.find(';')
    header = body[:first_semi].strip() if first_semi != -1 else body
    remainder = body[first_semi + 1:].strip() if first_semi != -1 else ''

    # --- Header parsing ---
    is_abstract = bool(re.search(r'\bABSTRACT\b', header, re.IGNORECASE))

    # SUBTYPE OF (<parent>, ...): direct supertype list
    subtype_m = re.search(
        r'\bSUBTYPE\s+OF\s*\(([^)]+)\)', header, re.IGNORECASE
    )
    supertypes = (
        [s.strip() for s in subtype_m.group(1).split(',')]
        if subtype_m else []
    )

    # SUPERTYPE OF (<constraint>): partitioning constraint on subtypes,
    # typically ONEOF(...) or a combination of such groups.
    supertype_constraint: Optional[str] = None
    supertype_m = re.search(r'\bSUPERTYPE\s+OF\s*\(', header, re.IGNORECASE)
    if supertype_m:
        # Walk to the parenthesis that closes the constraint, so a trailing
        # SUBTYPE OF (...) clause is not swallowed and inner ')' are kept.
        start = end = supertype_m.end()
        depth = 1
        while end < len(header) and depth:
            depth += {'(': 1, ')': -1}.get(header[end], 0)
            end += 1
        stop = end - 1 if depth == 0 else len(header)
        supertype_constraint = header[start:stop].strip()

    # --- Body section parsing ---
    sections = _split_entity_sections(remainder)
    attributes: list[Attribute] = []

    # Explicit attributes: <name> : [OPTIONAL] <type>
    for stmt in re.split(r';', sections.get('explicit', '')):
        attr = _parse_attribute_line(stmt.strip(), 'explicit')
        if attr:
            attributes.append(attr)

    # Derived attributes: <name> : <type> := <expr>
    for stmt in re.split(r';', sections.get('DERIVE', '')):
        stmt = stmt.strip()
        if ':=' in stmt:
            attr_decl = stmt.split(':=', 1)[0]
            attr = _parse_attribute_line(attr_decl, 'derived')
            if attr:
                attributes.append(attr)

    # Inverse attributes: same format as explicit but carry set/bag cardinality
    for stmt in re.split(r';', sections.get('INVERSE', '')):
        attr = _parse_attribute_line(stmt.strip(), 'inverse')
        if attr:
            attributes.append(attr)

    # WHERE rules: <label> : <expr>
    where_rules: list[WhereRule] = []
    for stmt in re.split(r';', sections.get('WHERE', '')):
        stmt = stmt.strip()
        if ':' in stmt:
            colon = stmt.index(':')
            label = stmt[:colon].strip()
            expr = stmt[colon + 1:].strip()
            if re.match(r'^\w+$', label) and expr:
                where_rules.append(WhereRule(label=label, expression=expr))

    return EntityDef(
        name=name,
        abstract=is_abstract,
        supertypes=supertypes,
        supertype_constraint=supertype_constraint,
        attributes=attributes,
        where_rules=where_rules,
    )


def extract_entities(schema_body: str) -> list[EntityDef]:
    """
    Scan schema body and extract all ENTITY...END_ENTITY blocks.

    ENTITY blocks do not nest in EXPRESS, so a non-overlapping DOTALL
    regex scan is sufficient and O(n) in schema size.

    Parameters
    ----------
    schema_body : str
        Normalised schema body (comments stripped, USE/REFERENCE present).

    Returns
    -------
    list[EntityDef]
        Parsed entities in declaration order.
    """
    pattern = re.compile(
        r'\bENTITY\s+(\w+)(.*?)END_ENTITY\s*;',
        re.DOTALL | re.IGNORECASE,
    )
    entities: list[EntityDef] = []
    for m in pattern.finditer(schema_body):
        try:
            entity = parse_entity(m.group(1), m.group(0))
            entities.append(entity)
        except Exception as exc:
            print(f"[warn] parse failure on entity '{m.group(1)}': {exc}")
    return entities
# endregion
```
Entity parsing is the heart of the work. EXPRESS entities have a clear structure: a header (up to the first semicolon) containing ABSTRACT, SUPERTYPE OF, and SUBTYPE OF declarations; and a body containing attribute and constraint sections. The body sections are delimited by keywords (DERIVE, INVERSE, UNIQUE, WHERE) that never appear as attribute names, although UNIQUE can also occur inside a type expression such as 'LIST OF UNIQUE', so a keyword should only count as a section marker at the start of a statement. The core parsing strategy: split the body at these keywords, then parse each section. Explicit attributes have the form 'name : [OPTIONAL] type'; derived attributes have 'name : type := expression'; inverse attributes follow the same form but represent backward relationships. WHERE rules have the form 'label : boolean_expression'. The full type expression (e.g., 'SET [1:?] OF product_definition') is captured as text and preserved. This lets downstream tools handle type complexity without the parser having to fully understand nested aggregations and selects.
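The deferred type-parsing stage could look roughly like the sketch below, which unpacks aggregation expressions recursively and treats everything else as a named type. `TypeExpr` and `parse_type_expr` are hypothetical names, not part of the parser above, and the sketch assumes bounds are written as `[lower:upper]` and ignores ARRAY's `OPTIONAL` element marker.

```python
import re
from dataclasses import dataclass
from typing import Optional

# <AGG> [lower:upper] OF <element>; the bound spec is optional in EXPRESS.
_AGG_RE = re.compile(
    r'^(LIST|SET|BAG|ARRAY)\b\s*'
    r'(?:\[\s*([^:\]]+?)\s*:\s*([^\]]+?)\s*\])?\s*OF\s+(.*)$',
    re.IGNORECASE | re.DOTALL,
)


@dataclass
class TypeExpr:
    kind: str                           # 'named', 'list', 'set', 'bag', 'array'
    name: Optional[str] = None          # set for 'named' only
    lower: Optional[str] = None         # bound text, e.g. '1'
    upper: Optional[str] = None         # bound text, e.g. '?'
    unique: bool = False                # LIST/ARRAY ... OF UNIQUE
    element: Optional["TypeExpr"] = None


def parse_type_expr(text: str) -> TypeExpr:
    """Parse a type expression string captured by the attribute parser."""
    text = text.strip()
    m = _AGG_RE.match(text)
    if not m:
        return TypeExpr(kind="named", name=text)
    keyword, lower, upper, rest = m.groups()
    # Peel off a leading UNIQUE before recursing into the element type.
    rest, n_unique = re.subn(r'^UNIQUE\s+', '', rest, flags=re.IGNORECASE)
    return TypeExpr(
        kind=keyword.lower(),
        lower=lower,
        upper=upper,
        unique=bool(n_unique),
        element=parse_type_expr(rest),
    )


nested = parse_type_expr("LIST [0:?] OF LIST [1:?] OF UNIQUE action_item")
plain = parse_type_expr("representation_or_representation_reference")
print(nested)
```

Keeping this as a separate pass means the attribute parser stays simple, and the structured form is only computed for the attributes an analysis actually inspects.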
**Extracting Entity Definitions: The Core of Schema Analysis**
Entities represent the fundamental business objects in the standard—products, assets, processes, documents. Entity parsing extracts:
- **Name and abstract status** — Whether this is a concrete business object or a conceptual abstraction
- **Inheritance chain** — Which parent entities it extends, enabling attribute inference
- **Attributes** — The specific data elements that describe instances
- **Formal rules** — Business constraints that instances must satisfy
- **Derived attributes** — Values computed from other attributes (capturing domain logic)
For a building information model, this might extract that a "Room" entity has attributes (area, floor_number, occupancy_type), inherits from "Space" (which inherits from "IfcProduct"), and must satisfy "occupancy_type IN ['office', 'conference', 'storage', ...]".
This structured representation enables downstream validation and AI reasoning.
Type parsing is structurally simpler than entity parsing. The four flavours have
distinct opening tokens: `ENUMERATION OF`, `SELECT`, an aggregation keyword
(`LIST`/`SET`/`BAG`/`ARRAY`), or a direct type expression (alias). The only complication
is that SELECT and ENUMERATION bodies can span multiple lines with embedded comments —
hence the importance of normalising before scanning.
SELECT types in STEP schemas are particularly information-dense. The `measure_value`
SELECT, for instance, unifies twenty-odd measurement types under one name, encoding
the representational polymorphism that physical quantities require.
```python
# region TYPE PARSING
def extract_types(schema_body: str) -> list[TypeDef]:
    """
    Extract all TYPE...END_TYPE declarations from schema body.

    Parameters
    ----------
    schema_body : str
        Normalised schema body.

    Returns
    -------
    list[TypeDef]
        Parsed types with kind classification and WHERE rules.
    """
    pattern = re.compile(
        r'\bTYPE\s+(\w+)\s*=\s*(.*?)\s*END_TYPE\s*;',
        re.DOTALL | re.IGNORECASE,
    )
    types: list[TypeDef] = []
    for m in pattern.finditer(schema_body):
        name = m.group(1)
        definition = m.group(2).strip()
        if re.match(r'(?:EXTENSIBLE\s+)?ENUMERATION\b', definition, re.IGNORECASE):
            kind = 'enumeration'
            items_m = re.search(r'\(([^)]+)\)', definition, re.DOTALL)
            base = items_m.group(1).strip() if items_m else definition
        elif re.match(
            r'(?:EXTENSIBLE\s+)?(?:GENERIC_ENTITY\s+)?SELECT\b',
            definition,
            re.IGNORECASE,
        ):
            # The \b keeps alias types whose base merely starts with
            # "select" (e.g. a type named selected_item) out of this branch.
            kind = 'select'
            # Member lists are flat, so stop at the first ')' rather than
            # running on into any parentheses of a trailing WHERE clause.
            items_m = re.search(r'\(([^)]*)\)', definition, re.DOTALL)
            base = items_m.group(1).strip() if items_m else definition
        elif re.match(r'\b(LIST|SET|BAG|ARRAY)\b', definition, re.IGNORECASE):
            kind = 'aggregate'
            base = definition.split(';')[0].strip()
        else:
            kind = 'alias'
            base = definition.split(';')[0].strip()

        # WHERE rules in TYPE declarations (common on alias types)
        where_rules: list[WhereRule] = []
        where_m = re.search(
            r'\bWHERE\b(.*)', definition, re.DOTALL | re.IGNORECASE
        )
        if where_m:
            for rule_text in re.split(r';', where_m.group(1)):
                rule_text = rule_text.strip()
                if ':' in rule_text:
                    label, expr = rule_text.split(':', 1)
                    label = label.strip()
                    if re.match(r'^\w+$', label):
                        where_rules.append(
                            WhereRule(label=label, expression=expr.strip())
                        )
        types.append(
            TypeDef(name=name, kind=kind, base=base, where_rules=where_rules)
        )
    return types
# endregion
```
Types in EXPRESS come in four flavors, each starting with a distinct keyword: ENUMERATION OF (closed list of values), SELECT (discriminated union—'this can be one of these types'), aggregation keywords (LIST, SET, BAG, ARRAY), or a direct type expression (alias). SELECT types are information-dense: they represent the representational flexibility that real data requires. For example, measure_value might unify twenty different measurement types. The parser classifies each type by its keyword, extracts the base content (enumeration values, select members, or aggregation expression), and captures any WHERE rules that constrain the type. This structure is ideal for understanding representational choices in the schema.
**Extracting Type Systems: Constraining Valid Values**
Types define what values are valid for attributes. Critical type patterns include:
- **Enumerations** — Closed lists (e.g., "material_type IN [wood, steel, concrete, composite]")
- **Union types** — "This attribute can be any of these types" (critical for handling polymorphic data)
- **Aggregations** — "This attribute holds a set/list of items" (e.g., a list of stakeholders, a set of cost codes)
For digital twins, type definitions are valuable because they enable validation without external reference—an AI system can verify that a material type value is valid purely from the schema.
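As a small illustration of schema-only validation, the sketch below checks a candidate value against the comma-separated member string that the parser stores in `TypeDef.base` for enumerations. The helper names are hypothetical and the material list is the illustrative one from the bullet above, not a real MIM_LF type. STEP Part 21 files write enumeration values between dots (`.STEEL.`), so the check strips them.

```python
def enum_members(base: str) -> set[str]:
    """Split an enumeration's stored member string into a lowercase set."""
    return {item.strip().lower() for item in base.split(",") if item.strip()}


def is_valid_enum_value(value: str, base: str) -> bool:
    """Case-insensitive membership test; accepts 'steel' or '.STEEL.'."""
    return value.strip().strip(".").lower() in enum_members(base)


# Illustrative enumeration content, as it would appear in TypeDef.base.
material_base = "wood, steel, concrete, composite"

ok = is_valid_enum_value(".STEEL.", material_base)
bad = is_valid_enum_value("glass", material_base)
print(ok, bad)
```

EXPRESS identifiers are case-insensitive, which is why both sides are lowercased before comparison.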
The hierarchy analysis functions transform the flat list of `EntityDef` objects into
a graph structure. `build_hierarchy` produces a parent→children adjacency map from
the SUBTYPE OF declarations, giving us the hierarchy in its natural downward direction.
Because an entity may declare several supertypes, the result is a directed acyclic
graph rather than a strict tree.
`compute_hierarchy_depth` computes the maximum depth of the subtree rooted at each
entity. In the MIM_LF schema, the hierarchy depth of root entities like
`representation_item` or `founded_item` is substantial — tracing these chains
reveals the expressive layering that STEP uses to model physical and conceptual
structures without redundancy.
`find_roots` identifies entities with no SUBTYPE OF declaration. These are the
hierarchy root points. In a large modular schema, most roots correspond to abstract
foundation entities defined in the STEP integrated resource schemas (Parts 41–50).
```python
# region HIERARCHY ANALYSIS
def build_hierarchy(entities: list[EntityDef]) -> dict[str, list[str]]:
    """
    Build parent → direct-children adjacency map.

    Parameters
    ----------
    entities : list[EntityDef]
        All parsed entities.

    Returns
    -------
    dict[str, list[str]]
        Maps each entity name to a list of its direct subtypes.
    """
    children: dict[str, list[str]] = defaultdict(list)
    for entity in entities:
        for parent in entity.supertypes:
            children[parent].append(entity.name)
    return dict(children)


def compute_hierarchy_depth(
    entity_name: str,
    children_map: dict[str, list[str]],
    _memo: dict | None = None,
) -> int:
    """
    Compute maximum depth of the subtree rooted at entity_name.

    Memoised recursive descent. Returns 0 for leaf entities.

    Parameters
    ----------
    entity_name : str
        Root of the subtree.
    children_map : dict[str, list[str]]
        Parent → children adjacency map from build_hierarchy.
    _memo : dict | None
        Internal memoisation cache; pass None on initial call.

    Returns
    -------
    int
        Maximum leaf depth in the subtree.
    """
    if _memo is None:
        _memo = {}
    if entity_name in _memo:
        return _memo[entity_name]
    kids = children_map.get(entity_name, [])
    if not kids:
        depth = 0
    else:
        depth = 1 + max(
            compute_hierarchy_depth(k, children_map, _memo) for k in kids
        )
    _memo[entity_name] = depth
    return depth


def find_roots(entities: list[EntityDef]) -> list[str]:
    """
    Return entity names that have no SUBTYPE OF declaration.

    These are the topmost nodes in the inheritance forest, typically
    abstract foundation types from the STEP integrated resources.

    Parameters
    ----------
    entities : list[EntityDef]
        All parsed entities.

    Returns
    -------
    list[str]
        Sorted list of root entity names.
    """
    all_names = {e.name for e in entities}
    has_parent = {e.name for e in entities if e.supertypes}
    return sorted(all_names - has_parent)
# endregion
```
These functions transform flat EntityDef lists into a graph structure. build_hierarchy creates a parent→children adjacency map from SUBTYPE OF declarations. compute_hierarchy_depth recursively calculates the maximum depth of each entity's subtree—how many levels of specialization branch below it. find_roots identifies top-level entities with no parents. In a large modular schema like MIM_LF, most roots correspond to abstract foundation entities from STEP's integrated resource schemas, and the subtrees extending from them reveal the semantic layering that STEP uses to model physical and conceptual structures with minimal redundancy.
**Understanding Entity Classification Trees**
STEP entities form inheritance hierarchies (similar to biological taxonomy). Analysis produces:
- **Root entities** — The foundational business concepts from which all others derive
- **Depth measurement** — How many levels of specialization exist (deeper = more complex domain)
- **Branching patterns** — How many distinct specializations emerge from each parent
**Why This Matters:**
- Identifies where organizational knowledge is concentrated (deep hierarchies)
- Reveals which concepts are foundational vs. specialized (strategic for training)
- Enables intelligent automated validation (validate against the appropriate level of generality)
The summary function aggregates everything into a serialisable report. Key metrics
for the MIM_LF schema include entity and type counts, the distribution of type kinds
(SELECT types are particularly numerous in STEP), the entity with the most attributes,
and the entities with the deepest and widest subtrees. This summary is immediately
usable as a system prompt context block for an LLM that needs to reason about the
schema — or as a dashboard data source for a schema observatory.
```python
# region REPORTING
def summarize(
    schema_name: str,
    entities: list[EntityDef],
    types: list[TypeDef],
) -> dict:
    """
    Produce a structured analysis summary of parsed schema artefacts.

    Parameters
    ----------
    schema_name : str
        Extracted schema identifier.
    entities : list[EntityDef]
        All parsed entities.
    types : list[TypeDef]
        All parsed types.

    Returns
    -------
    dict
        JSON-serialisable analysis report.
    """
    children_map = build_hierarchy(entities)
    depth_memo: dict[str, int] = {}

    abstract_count = sum(1 for e in entities if e.abstract)
    constrained_count = sum(1 for e in entities if e.supertype_constraint)
    with_where = sum(1 for e in entities if e.where_rules)

    attr_counts = {e.name: len(e.attributes) for e in entities}
    max_attr_count = max(attr_counts.values(), default=0)
    max_attr_entity = next(
        (n for n, c in attr_counts.items() if c == max_attr_count), None
    )

    # Deepest subtree root
    depths = {
        e.name: compute_hierarchy_depth(e.name, children_map, depth_memo)
        for e in entities
    }
    deepest_name = max(depths, key=depths.get) if depths else None

    type_kinds: dict[str, int] = defaultdict(int)
    for t in types:
        type_kinds[t.kind] += 1

    select_types = [t for t in types if t.kind == 'select']
    widest_select = max(
        select_types,
        key=lambda t: t.base.count(','),
        default=None,
    )

    top_branching = sorted(
        entities,
        key=lambda e: len(children_map.get(e.name, [])),
        reverse=True,
    )[:10]

    return {
        "schema": schema_name,
        "entities": {
            "total": len(entities),
            "abstract": abstract_count,
            "with_supertype_constraint": constrained_count,
            "with_where_rules": with_where,
            "most_attributes": {
                "entity": max_attr_entity,
                "count": max_attr_count,
            },
        },
        "types": {
            "total": len(types),
            "by_kind": dict(type_kinds),
            "widest_select": {
                "type": widest_select.name if widest_select else None,
                "member_count": (
                    widest_select.base.count(',') + 1 if widest_select else 0
                ),
            },
        },
        "hierarchy": {
            "roots": find_roots(entities),
            "deepest_subtree": {
                "root": deepest_name,
                "depth": depths.get(deepest_name, 0) if deepest_name else 0,
            },
            "top_10_by_direct_subtypes": [
                {
                    "entity": e.name,
                    "direct_subtypes": len(children_map.get(e.name, [])),
                    "abstract": e.abstract,
                }
                for e in top_branching
            ],
        },
    }
# endregion
```
The summarize function aggregates parsed schema data into a structured report. Key metrics include: entity and type counts; the distribution of type kinds (SELECT types are numerous in STEP); abstract entity count; entities with WHERE rules; the entity with the most attributes; the deepest subtree; and the entities with the most direct subtypes. This summary is immediately usable as context for an LLM that needs to reason about the schema, or as a dashboard for schema observatory tools.
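To show what "usable as context" might mean in practice, the sketch below flattens a report dictionary into a few lines of plain text suitable for a prompt. `render_context` is a hypothetical helper, and the `demo_report` numbers are made up to exercise the function; they are not results from MIM_LF.

```python
def render_context(report: dict, max_roots: int = 5) -> str:
    """Render the summarize() report as a compact plain-text block."""
    ents = report["entities"]
    types = report["types"]
    hier = report["hierarchy"]
    kinds = ", ".join(f"{k}={v}" for k, v in sorted(types["by_kind"].items()))
    roots = hier["roots"][:max_roots]
    more = len(hier["roots"]) - len(roots)
    suffix = f" (+{more} more)" if more > 0 else ""
    deepest = hier["deepest_subtree"]
    lines = [
        f"Schema: {report['schema']}",
        f"Entities: {ents['total']} ({ents['abstract']} abstract)",
        f"Types: {types['total']} ({kinds})",
        f"Roots: {', '.join(roots)}{suffix}",
        f"Deepest subtree: {deepest['root']} (depth {deepest['depth']})",
    ]
    return "\n".join(lines)


# Made-up report with the same shape as summarize() output.
demo_report = {
    "schema": "demo_schema",
    "entities": {"total": 4, "abstract": 1},
    "types": {"total": 3, "by_kind": {"select": 2, "enumeration": 1}},
    "hierarchy": {
        "roots": ["a", "b", "c"],
        "deepest_subtree": {"root": "a", "depth": 2},
    },
}

context_text = render_context(demo_report, max_roots=2)
print(context_text)
```

Truncating the root list keeps the block short; the full JSON report remains available for tools that need every entry.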
**Schema Analysis Dashboard: Metrics for Governance**
Key metrics extracted from any STEP schema:
- **Total entities, types, and rules** — Schema complexity
- **Abstract entity ratio** — How much of the schema is conceptual vs. instantiable
- **Constraint density** — How many formal rules govern instances (higher = more validation capability)
- **Type diversity** — Mix of enumerations, unions, and aggregations (indicates complexity for implementations)
- **Hierarchy depth and branching** — Where complexity lives
These metrics serve as a **standards governance dashboard**, enabling organizations to:
- Track schema evolution across versions
- Identify integration bottlenecks
- Allocate training and validation resources effectively
The pipeline function wires the stages together. It is idempotent: re-running on
a cached schema produces identical output. The JSON report is written alongside the
cached schema file for downstream consumption.
```python
# region PIPELINE
def run(
    url: str = SCHEMA_URL,
    cache_dir: Path = CACHE_DIR,
    output_path: Path | None = None,
) -> dict:
    """
    Full pipeline: fetch → normalise → parse → analyse → persist.

    Parameters
    ----------
    url : str
        EXPRESS schema URL (defaults to MIM_LF Ed.7).
    cache_dir : Path
        Local cache and output directory.
    output_path : Path | None
        JSON report output path. Defaults to
        cache_dir/<schema_name>_analysis.json.

    Returns
    -------
    dict
        Structured analysis report (also written to output_path).
    """
    raw = fetch_schema(url, cache_dir)
    clean = strip_comments(raw)
    schema_name, body = extract_schema_block(clean)
    print(f"[schema] {schema_name} ({len(body):,} chars)")

    entities = extract_entities(body)
    print(f"[entities] {len(entities):,} found")

    types = extract_types(body)
    print(f"[types] {len(types):,} found")

    report = summarize(schema_name, entities, types)
    resolved_output = output_path or (
        cache_dir / f"{schema_name}_analysis.json"
    )
    resolved_output.parent.mkdir(parents=True, exist_ok=True)
    resolved_output.write_text(
        json.dumps(report, indent=2), encoding="utf-8"
    )
    print(f"[output] {resolved_output}")
    return report


if __name__ == "__main__":
    import pprint
    pprint.pp(run())
# endregion
```
The run function orchestrates all stages: fetch raw schema → strip comments → extract schema block → scan for entities → scan for types → analyze and summarize → write JSON report. It's idempotent: re-running on a cached schema produces identical output. The JSON report is written alongside the cached schema for downstream use in LLM systems, validation tools, or gap analysis.
**The Analysis Workflow: From Standards Document to Machine-Actionable Insights**
The complete process is:
1. **Acquire** — Fetch the official schema from standards repositories
2. **Prepare** — Remove comments and normalize encoding
3. **Parse** — Extract entity, type, and rule definitions
4. **Analyze** — Build inheritance hierarchies, compute metrics, identify patterns
5. **Output** — Generate JSON reports for downstream consumption
The entire process is **idempotent**—running it multiple times produces identical results, supporting version control and audit trails. This is critical for regulated industries.
[6] What the Analysis Reveals
The MIM_LF schema is large by any measure. When you run the pipeline against the Ed.7
file you will encounter an entity count in the low thousands and a type count of similar
order. This scale reflects the scope of ISO/TS 10303-442: it integrates all Application
Modules needed to represent managed model-based engineering information across the full
product lifecycle — geometry, topology, materials, kinematics, process planning,
configuration management, documents, and more.
Several structural observations are immediately useful for downstream work.
**Root entity scarcity.** Despite thousands of entities, the number of hierarchy roots
is small — typically under twenty. Most roots are foundation types from the STEP
integrated resources: `founded_item`, `representation_item`, `action`, `product`,
`document`. These are the conceptual atoms of the STEP information model. Every
domain-specific entity, from `machining_feature` to `kinematic_pair`, traces back
through a chain of SUBTYPE OF declarations to one of these roots.
**Hierarchy depth.** Some subtrees are deep — depths exceeding ten levels are present.
This reflects the progressive specialisation pattern that STEP uses: a general concept
at the root becomes increasingly specific at the leaves. Parsing this depth is essential
for any reasoning system that needs to navigate the schema, because attribute inheritance
is implicit — a subtype's complete information set is the union of all its ancestors'
attributes, traversed up to the root.
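The "union of all ancestors' attributes" can be computed with a short upward walk. The sketch below is an assumption-laden illustration: `all_attributes` is a hypothetical helper, the two dictionaries are a heavily abbreviated toy slice of the hierarchy (derived attributes omitted), and attributes are de-duplicated by bare name, which ignores EXPRESS's qualified-name rules for clashes between supertypes.

```python
def all_attributes(
    name: str,
    supertypes: dict[str, list[str]],
    own_attrs: dict[str, list[str]],
) -> list[str]:
    """Return attribute names visible on `name`, ancestors first, each once."""
    visited: set[str] = set()
    ordered: list[str] = []

    def visit(entity: str) -> None:
        if entity in visited:          # shared ancestors are visited once
            return
        visited.add(entity)
        for parent in supertypes.get(entity, []):
            visit(parent)              # ancestors contribute first
        for attr in own_attrs.get(entity, []):
            if attr not in ordered:
                ordered.append(attr)

    visit(name)
    return ordered


# Toy slice: child -> direct supertypes, and entity -> own explicit attributes.
toy_supertypes = {
    "cartesian_point": ["point"],
    "point": ["geometric_representation_item"],
    "geometric_representation_item": ["representation_item"],
}
toy_attrs = {
    "representation_item": ["name"],
    "cartesian_point": ["coordinates"],
}

inherited = all_attributes("cartesian_point", toy_supertypes, toy_attrs)
print(inherited)
```

With real parser output, the two dictionaries would be built from `EntityDef.supertypes` and `EntityDef.attributes`.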
**ABSTRACT prevalence.** A significant fraction of entities carry the ABSTRACT keyword.
These can never be directly instantiated in a STEP Part 21 file: they exist purely to
carry shared attributes and establish the conceptual hierarchy. Intuition is a poor
guide to which entities these are. `shape_representation`, for example, reads like a
pure abstraction, yet it is instantiable and appears directly in many Part 21 files
alongside specialised subtypes like `advanced_brep_shape_representation`. For a
validation engine, reading the ABSTRACT flag from the parsed schema rather than
assuming it is critical to generating accurate instantiation constraints.
**SELECT proliferation.** The type analysis reveals that SELECT types outnumber
enumerations. SELECTs are STEP's discriminated union mechanism and are used pervasively
to represent representational alternatives: a `measure_value` can be any of many
measurement types; a `product_definition_or_reference` collapses two semantically
related types into one attribute slot. SELECT membership counts are a proxy for
semantic complexity — a SELECT with thirty members represents a design decision with
thirty implementation paths, each of which a reasoning engine must potentially traverse.
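Because a SELECT's members can themselves be SELECT types, the number of paths a reasoning engine faces is the count of leaf types reachable through nesting, not the direct member count. The sketch below flattens that nesting; `expand_select` is a hypothetical helper, and the type names in `toy_selects` are invented for illustration.

```python
def expand_select(name: str, selects: dict[str, list[str]]) -> list[str]:
    """Return the sorted non-SELECT types reachable from SELECT `name`."""
    leaves: set[str] = set()
    seen: set[str] = set()
    stack = [name]
    while stack:
        current = stack.pop()
        if current in seen:            # guard against repeated or cyclic refs
            continue
        seen.add(current)
        for member in selects.get(current, []):
            if member in selects:
                stack.append(member)   # nested SELECT: keep expanding
            else:
                leaves.add(member)     # entity or non-SELECT type
    return sorted(leaves)


# Invented example: select name -> direct members (from TypeDef.base.split(',')).
toy_selects = {
    "item_select": ["part", "doc_select"],
    "doc_select": ["document", "drawing", "part"],
}

leaf_members = expand_select("item_select", toy_selects)
print(leaf_members)
```

Ranking SELECT types by the length of this expanded list gives a somewhat more faithful complexity proxy than counting commas in the direct member list.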
**WHERE rule density.** Entities with WHERE rules are where the schema's semantic
depth lives. A WHERE rule like `WR1: SIZEOF(QUERY(item <* SELF.items |
NOT ('AP242_MANAGED_MODEL_BASED_3D_ENGINEERING...' IN TYPEOF(item)))) = 0`
is not just a constraint — it is a formal specification of valid composition.
Extracting and classifying WHERE rules is the first step toward building a schema-driven
constraint solver for digital twin validation pipelines.
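A first pass at classifying WHERE rules can be as simple as tagging each expression by the EXPRESS built-in functions it calls. The sketch below is one possible scheme: the tag names and `classify_rule` are hypothetical, and the sample expressions use an invented `DEMO_SCHEMA.SOLID` type name rather than a real MIM_LF rule.

```python
import re
from collections import Counter

# Tag name -> pattern for a built-in EXPRESS function call.
_RULE_PATTERNS = [
    ("type_membership", re.compile(r"\bTYPEOF\s*\(", re.IGNORECASE)),
    ("collection_query", re.compile(r"\bQUERY\s*\(", re.IGNORECASE)),
    ("cardinality", re.compile(r"\bSIZEOF\s*\(", re.IGNORECASE)),
    ("existence", re.compile(r"\bEXISTS\s*\(", re.IGNORECASE)),
]


def classify_rule(expression: str) -> list[str]:
    """Return every tag whose pattern occurs in the rule, or ['other']."""
    tags = [tag for tag, pattern in _RULE_PATTERNS if pattern.search(expression)]
    return tags or ["other"]


# Invented sample expressions with the same shape as typical STEP rules.
samples = [
    "SIZEOF(QUERY(item <* SELF.items | "
    "NOT ('DEMO_SCHEMA.SOLID' IN TYPEOF(item)))) = 0",
    "EXISTS(SELF.name)",
    "SELF > 0.0",
]

tag_counts = Counter(tag for expr in samples for tag in classify_rule(expr))
print(tag_counts)
```

Tag frequencies like these indicate which constraint families a solver would need to support first.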
The MIM_LF schema is massive. Running this parser produces entity counts in the low thousands and similar type counts, reflecting AP242's scope: it integrates all modules needed to represent managed model-based engineering information across the full product lifecycle—geometry, topology, materials, kinematics, process planning, configuration, documents. Several structural patterns emerge: (1) **Root scarcity**—thousands of entities but fewer than twenty roots. Most roots are foundation types like founded_item, representation_item, action, product, document. These are the conceptual atoms; every domain entity traces back through inheritance chains to one of these roots. (2) **Hierarchy depth**—some subtrees exceed ten levels, reflecting STEP's progressive specialization pattern. Understanding this depth matters for digital twins because entity attributes are inherited—a subtype includes all its ancestors' attributes. (3) **ABSTRACT prevalence**—many entities are abstract (can't be directly instantiated). They exist purely to organize shared attributes and the conceptual hierarchy. For validation, correctly identifying abstract entities is critical. (4) **SELECT proliferation**—SELECT types outnumber enumerations. They're STEP's discriminated union mechanism, used to represent representational alternatives. A SELECT with thirty members represents thirty different data representation paths. (5) **WHERE rule density**—entities with WHERE rules encode semantic depth. Rules like 'SIZEOF(QUERY(...)) = 0' are formal specifications of valid composition. Extracting and classifying these rules is step one toward building constraint solvers for digital twin validation.
**The Scope and Complexity of Modern Standards**
The most comprehensive STEP application (AP242, Managed Model-Based 3D Engineering) integrates:
- **3,000+ entity types** — Covering product structure, geometry, topology, materials, kinematics, process planning, documents, and configuration
- **20+ root concepts** — Fundamental building blocks (products, representations, documents, processes)
- **10+ levels of hierarchy** — Deep specialization in certain domains (geometry, for example)
- **Hundreds of formal constraints** — WHERE rules that define what makes data valid
**What This Means for Implementation:**
- **Complexity is real** — Enterprise data models are genuinely intricate
- **Standardization pays off** — Rather than each company inventing its own model, they use this shared foundation
- **Opportunity exists** — Organizations that master STEP can leverage decades of international standardization work
Companies attempting to build proprietary data models face decades of catch-up to achieve equivalent semantic richness.
[7] Next Steps — From Parser to Agentic Infrastructure
The `EntityDef` list produced by this parser is the input layer for several high-value
analyses.
**Entity relationship graph.** The SUBTYPE/SUPERTYPE data maps directly to a directed
acyclic graph. Using `networkx` (or a pure-Python adjacency structure), you can compute
ancestor chains, identify the common supertype of any two entities (critical for
polymorphic query planning), and detect isolated subtrees that indicate module
boundaries. Visualised, the STEP entity graph is one of the most structurally rich
graphs in the engineering informatics world.
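A minimal pure-Python sketch of the ancestor-chain and common-supertype queries described above. The `EntityDef` here is a stripped-down stand-in with only `name` and `supertypes` (an assumption about the parser's dataclass), and the six entities are toy data loosely modelled on the geometry resources rather than parser output.

```python
from dataclasses import dataclass, field


@dataclass
class EntityDef:
    """Stand-in for the parser's EntityDef; only the fields used here."""
    name: str
    supertypes: list[str] = field(default_factory=list)


def ancestors(name: str, index: dict[str, EntityDef]) -> list[str]:
    """All transitive supertypes of `name`, nearest first (breadth-first)."""
    seen: list[str] = []
    queue = list(index[name].supertypes)
    while queue:
        parent = queue.pop(0)
        # Skip repeats and supertypes that are missing from the parsed index.
        if parent in seen or parent not in index:
            continue
        seen.append(parent)
        queue.extend(index[parent].supertypes)
    return seen


def common_supertypes(a: str, b: str, index: dict[str, EntityDef]) -> list[str]:
    """Shared ancestors of a and b, ordered by distance from a.

    Each entity counts as its own ancestor, so a supertype/subtype pair works.
    """
    chain_b = {b, *ancestors(b, index)}
    return [n for n in [a, *ancestors(a, index)] if n in chain_b]


# Toy hierarchy for illustration.
entities = [
    EntityDef("representation_item"),
    EntityDef("geometric_representation_item", ["representation_item"]),
    EntityDef("point", ["geometric_representation_item"]),
    EntityDef("cartesian_point", ["point"]),
    EntityDef("curve", ["geometric_representation_item"]),
    EntityDef("line", ["curve"]),
]
index = {e.name: e for e in entities}
roots = [e.name for e in entities if not e.supertypes]

print(ancestors("cartesian_point", index))
print(common_supertypes("cartesian_point", "line", index))
```

The first element of `common_supertypes(...)` is the nearest shared supertype, which is the value a polymorphic query planner would typically want. Swapping the adjacency dictionary for a `networkx.DiGraph` is a reasonable next step once visualisation or graph metrics are needed.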
**Schema-to-LLM context injection.** Large language models operating over STEP data
need schema grounding. Rather than injecting the full schema text (prohibitively large),
inject targeted entity definitions: the entity being discussed, its direct supertype
chain, its attributes with types, and its WHERE rules. This parser produces exactly the
structured data needed to generate those targeted context blocks programmatically.
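One way such a targeted context block might be assembled is sketched below. The dataclasses are simplified stand-ins for the parser's output (the `where_rules` field holding `(label, expression)` pairs is an assumption), the output layout is arbitrary, and the sample entities and the `WR1` rule are toy data for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Attribute:
    name: str
    type_expr: str
    optional: bool = False


@dataclass
class EntityDef:
    """Stand-in for the parser's EntityDef; field names are assumptions."""
    name: str
    abstract: bool = False
    supertypes: list[str] = field(default_factory=list)
    attributes: list[Attribute] = field(default_factory=list)
    where_rules: list[tuple[str, str]] = field(default_factory=list)


def supertype_chain(name: str, index: dict[str, EntityDef]) -> list[str]:
    """Ancestors of `name`, nearest first, following every SUBTYPE OF parent."""
    chain: list[str] = []
    queue = list(index[name].supertypes)
    while queue:
        parent = queue.pop(0)
        if parent not in chain:
            chain.append(parent)
            if parent in index:
                queue.extend(index[parent].supertypes)
    return chain


def context_block(name: str, index: dict[str, EntityDef]) -> str:
    """Render one entity, its ancestors' attributes, and its WHERE rules."""
    entity = index[name]
    lines = [f"ENTITY {entity.name}" + (" (ABSTRACT)" if entity.abstract else "")]
    chain = supertype_chain(name, index)
    if chain:
        lines.append("Supertype chain: " + " -> ".join(chain))
    for owner in [name, *chain]:
        owner_def = index.get(owner)
        if owner_def is None:
            continue
        for attr in owner_def.attributes:
            opt = "OPTIONAL " if attr.optional else ""
            lines.append(f"  {owner}.{attr.name}: {opt}{attr.type_expr}")
    for label, expr in entity.where_rules:
        lines.append(f"  WHERE {label}: {expr}")
    return "\n".join(lines)


# Toy data; the WR1 rule is hypothetical, not taken from the schema.
index = {e.name: e for e in [
    EntityDef("representation_item", attributes=[Attribute("name", "label")]),
    EntityDef("geometric_representation_item", supertypes=["representation_item"]),
    EntityDef("point", supertypes=["geometric_representation_item"]),
    EntityDef("cartesian_point", supertypes=["point"],
              attributes=[Attribute("coordinates", "LIST [1:3] OF length_measure")],
              where_rules=[("WR1", "SIZEOF(SELF.coordinates) >= 1")]),
]}
block = context_block("cartesian_point", index)
print(block)
```

Prefixing each attribute with its owning entity keeps inherited attributes distinguishable in the prompt, which tends to matter when the model is asked where a given attribute is declared.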
**Cross-standard gap analysis.** IFC (ISO 16739, `IFC4.exp`) and STEP both model
building and construction artefacts — but with different entity hierarchies,
different attribute names, and different constraint philosophies. Parsing both schemas
with this infrastructure enables systematic attribute-level comparison: which STEP
entities have no IFC equivalent, where are the semantic mismatches, and where is
information lost in cross-schema exchange. This is the foundation layer for
OpenBIM/STEP convergence work.
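The comparison could start as simply as the sketch below, which pairs entities across two schemas by a normalised name and then diffs their attribute names. The normalisation rule (lower-case, drop an `Ifc` prefix and underscores) is a crude heuristic assumed for this example, the input shape (entity name mapped to a flattened attribute-name list) is a simplification of the parser output, and the sample data is a toy subset. Real gap analysis would need a curated mapping, since many equivalent concepts do not share names.

```python
def normalise(name: str) -> str:
    """Heuristic matching key: lower-case, drop an 'Ifc' prefix and underscores."""
    key = name.lower()
    if key.startswith("ifc"):
        key = key[3:]
    return key.replace("_", "")


def gap_report(step: dict[str, list[str]], ifc: dict[str, list[str]]) -> dict:
    """Compare two schemas given as {entity name: [attribute names]}."""
    step_keys = {normalise(n): n for n in step}
    ifc_keys = {normalise(n): n for n in ifc}
    matched = {}
    for key in step_keys.keys() & ifc_keys.keys():
        s_name, i_name = step_keys[key], ifc_keys[key]
        s_attrs = {normalise(a) for a in step[s_name]}
        i_attrs = {normalise(a) for a in ifc[i_name]}
        matched[(s_name, i_name)] = {
            "step_only": sorted(s_attrs - i_attrs),
            "ifc_only": sorted(i_attrs - s_attrs),
        }
    return {
        "step_only": sorted(step_keys[k] for k in step_keys.keys() - ifc_keys.keys()),
        "ifc_only": sorted(ifc_keys[k] for k in ifc_keys.keys() - step_keys.keys()),
        "matched": matched,
    }


# Toy subset with flattened (inherited + own) attribute names, for illustration.
step_entities = {
    "cartesian_point": ["name", "coordinates"],
    "kinematic_pair": ["name"],
}
ifc_entities = {
    "IfcCartesianPoint": ["Coordinates"],
    "IfcWall": ["GlobalId", "Name"],
}
report = gap_report(step_entities, ifc_entities)
print(report)
```

In this toy case the matched pair reports `name` as present only on the STEP side, which is the kind of attribute-level loss the paragraph above refers to.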
**STEP Part 21 file validation.** A STEP physical file (`.stp`, `.step`) lists entity
instances in a DATA section, one `#id=ENTITY_NAME(...);` record per instance. Each
record references an entity type by name and provides attribute values positionally.
The parsed schema gives you:
1. The ordered attribute list for each entity (determining positional mapping)
2. The type of each attribute (enabling value validation)
3. The WHERE rules (enabling constraint checking)
4. The ABSTRACT flags (enabling instantiation validity checks)
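A validation engine built on this parser can check a STEP Part 21 file against the
schema using only the Python standard library. Structural checks (entity names,
attribute counts, optionality, ABSTRACT flags) follow directly from the parsed data.
Evaluating WHERE rules additionally requires an EXPRESS expression evaluator, because
the parser captures each rule as expression text rather than as executable logic.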
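The structural part of that check can be sketched as follows. This is a simplified illustration under several assumptions: the dataclasses are stand-ins for the parser's output, attribute values arrive already split into a list (tokenising a real Part 21 record with nested parentheses and quoted strings is not shown), inherited explicit attributes are assumed to precede an entity's own, and redeclared attributes and complex (multi-leaf) instances are ignored. The two entities are hypothetical toy definitions.

```python
from dataclasses import dataclass, field


@dataclass
class Attribute:
    name: str
    type_expr: str
    optional: bool = False
    section: str = "explicit"


@dataclass
class EntityDef:
    """Stand-in for the parser's EntityDef; field names are assumptions."""
    name: str
    abstract: bool = False
    supertypes: list[str] = field(default_factory=list)
    attributes: list[Attribute] = field(default_factory=list)


def ordered_attributes(name, index, _seen=None):
    """Explicit attributes in positional order: inherited first, then own."""
    seen = set() if _seen is None else _seen
    if name in seen or name not in index:
        return []
    seen.add(name)
    entity = index[name]
    inherited = [a for parent in entity.supertypes
                 for a in ordered_attributes(parent, index, seen)]
    own = [a for a in entity.attributes if a.section == "explicit"]
    return inherited + own


def check_instance(type_name, values, index):
    """Return a list of structural issues for one instance (empty if none)."""
    entity = index.get(type_name.lower())
    if entity is None:
        return [f"unknown entity type {type_name}"]
    issues = []
    if entity.abstract:
        issues.append(f"{entity.name} is ABSTRACT and cannot be instantiated directly")
    attrs = ordered_attributes(entity.name, index)
    if len(values) != len(attrs):
        issues.append(f"expected {len(attrs)} values, got {len(values)}")
        return issues
    for attr, value in zip(attrs, values):
        # '$' is the Part 21 token for an unset value.
        if value == "$" and not attr.optional:
            issues.append(f"{attr.name} is mandatory but unset ($)")
    return issues


# Hypothetical toy schema: an abstract base and one concrete subtype.
index = {e.name: e for e in [
    EntityDef("toy_item", abstract=True, attributes=[Attribute("name", "label")]),
    EntityDef("toy_point", supertypes=["toy_item"], attributes=[
        Attribute("coordinates", "LIST [1:3] OF REAL"),
        Attribute("note", "text", optional=True),
    ]),
]}
print(check_instance("TOY_POINT", ["''", "(0.,0.,0.)", "$"], index))
print(check_instance("TOY_POINT", ["$", "(0.,0.,0.)", "$"], index))
```

Value-type checking (item 2 in the list above) would slot into the `zip` loop, using `attr.type_expr` together with the parsed `TypeDef` entries to decide what each positional value is allowed to be.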
**Agentic schema evolution tracking.** The MIM_LF schema changes between editions.
Diffing the parsed output of Ed.6 and Ed.7 — entities added, entities removed,
attributes changed, WHERE rules modified — gives a machine-readable changelog that
no human-authored release note fully captures. For long-lived digital twin programmes
operating on multi-edition STEP data, this parser is the instrumentation layer for
schema governance.
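A minimal sketch of such an edition diff, assuming each parsed edition has been reduced to a dictionary of entity name to `{attribute name: type expression}`. That reduced shape and the entity names below are hypothetical, chosen only to exercise each branch; WHERE-rule comparison would follow the same pattern with rule labels as keys.

```python
def diff_schemas(old: dict[str, dict[str, str]],
                 new: dict[str, dict[str, str]]) -> dict:
    """Diff two editions given as {entity: {attribute: type expression}}."""
    changed = {}
    for name in sorted(old.keys() & new.keys()):
        before, after = old[name], new[name]
        delta = {
            "attrs_added": sorted(after.keys() - before.keys()),
            "attrs_removed": sorted(before.keys() - after.keys()),
            "attrs_retyped": sorted(
                a for a in before.keys() & after.keys() if before[a] != after[a]
            ),
        }
        # Only record entities where at least one list is non-empty.
        if any(delta.values()):
            changed[name] = delta
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": changed,
    }


# Hypothetical editions; names and types are invented for the example.
edition_a = {
    "widget": {"id": "identifier", "size": "REAL"},
    "legacy_part": {"name": "label"},
    "stable_part": {"name": "label"},
}
edition_b = {
    "widget": {"id": "identifier", "size": "length_measure", "colour": "label"},
    "smart_part": {"name": "label"},
    "stable_part": {"name": "label"},
}
changelog = diff_schemas(edition_a, edition_b)
print(changelog)
```

Because the result is plain dictionaries and sorted lists, it serialises directly to JSON and can be stored alongside each edition as the machine-readable changelog described above.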
The EntityDef list produced by this parser feeds several high-value downstream analyses. (1) **Entity relationship graph**—SUBTYPE/SUPERTYPE data maps directly to a directed acyclic graph. Using networkx or pure-Python adjacency structures, you can compute ancestor chains, find common supertypes (critical for polymorphic query planning), and detect isolated subtrees indicating module boundaries. Visualized, the STEP entity graph is one of the richest engineering graphs in the world. (2) **Schema-to-LLM context injection**—LLMs reasoning over STEP data need schema grounding. Rather than injecting the full schema (prohibitively large), inject targeted definitions: the entity being discussed, its supertype chain, its attributes with types, and its WHERE rules. This parser produces exactly the structured data needed. (3) **Cross-standard gap analysis**—IFC (ISO 16739) and STEP both model buildings and construction artifacts but with different hierarchies, attributes, and rules. Parsing both with this infrastructure enables systematic comparison: which STEP entities have no IFC equivalent, where are semantic mismatches, where is information lost in exchange. This is foundational for OpenBIM/STEP convergence. (4) **STEP file validation**—a STEP physical file (.stp, .step) lists entity instances with attribute values. The parsed schema provides the ordered attribute list (for positional mapping), attribute types (for value validation), WHERE rules (for constraint checking), and ABSTRACT flags (for instantiation validity). A pure-Python validation engine needs nothing else. (5) **Agentic schema evolution tracking**—MIM_LF changes between editions. Diffing the parsed output of Ed.6 and Ed.7 gives a machine-readable changelog that release notes miss. For long-lived digital twin programs operating on multi-edition STEP data, this parser is the instrumentation layer for schema governance.
**Building Intelligence on Top of Schema Analysis**
Once extracted, schema data enables high-value business capabilities:
**1. Interoperability Analysis**
- Compare STEP vs. IFC vs. proprietary formats
- Identify where data would be lost in cross-system exchange
- Quantify integration effort and risk
**2. Intelligent Data Validation**
- Automatically check data instances against formal schema constraints
- Catch errors before they propagate downstream (design → manufacturing → operations)
- Reduce manual inspection and quality assurance costs
**3. AI/LLM Integration**
- Generate targeted context blocks for AI systems reasoning over domain data
- Enable semantic validation: "Does this design satisfy the formal building code constraints?"
- Bridge the gap between natural language queries and formal data requirements
**4. Digital Twin Development**
- Establish the formal semantic foundation for autonomous asset management
- Enable robots/agents to validate data, propose optimizations, and reason about lifecycle decisions
- Position the organization for next-generation automation
**5. Standards Evolution Tracking**
- Monitor how standards change across editions
- Plan migration strategies for multi-edition data environments
- Identify where standardization is maturing (stable entities) vs. evolving (new features)
[8] Conclusion: Toward Built Environment Intelligence
The AECO and Real Estate industries have historically remained laggards in the digital revolution, often treating data as a byproduct of the design process rather than its most valuable asset. While other sectors transitioned to real-time, data-driven decision-making decades ago, the built environment remains largely fragmented, trapped in static PDFs and "dark data" silos.
This research, and the development of the EXPRESS parsing infrastructure presented here, is a deliberate move toward a new, sovereign discipline: **Built Environment Intelligence (BEI)**. Unlike traditional Business Intelligence (BI), which often reports on historical financial metrics, BEI is rooted in the deep, formal semantics of the physical world. By programmatically deconstructing the **ISO 10303** architecture, we move beyond simple data exchange and toward **Agentic Digital Twins**.
Construction and real estate have historically lagged in the digital revolution, treating data as a byproduct of design rather than its most valuable asset. While manufacturing and other sectors embraced real-time, data-driven decision-making decades ago, the built environment remains fragmented—trapped in PDFs, email, and 'dark data' silos. This research and the EXPRESS parsing infrastructure presented here is a deliberate move toward a new discipline: **Built Environment Intelligence (BEI)**. Unlike traditional Business Intelligence, which reports on historical financial metrics, BEI is grounded in the formal semantics of the physical world. By programmatically deconstructing the ISO 10303 architecture, we move beyond static data exchange toward **Agentic Digital Twins**—AI systems that can reason, validate, and evolve building information throughout its lifecycle.
**Built Environment Intelligence: The Next Frontier**
Historically, real estate and construction have treated data as a byproduct—design documents are filed, as-builts are archived, and operational data decays into isolation. Meanwhile, other industries (aerospace, automotive, semiconductor) pioneered data-driven decision-making and autonomous reasoning.
Built Environment Intelligence (BEI) represents a deliberate shift: treating the physical asset and its metadata as the primary strategic asset.
**This Requires:**
- Standardized, formal semantic models (STEP, IFC)
- Automated interpretation and validation (this research)
- Agentic digital twins that reason autonomously
- Continuous data quality assurance
**The Payoff:**
- **Autonomous asset optimization** — Digital twins recommending maintenance, upgrades, reallocations
- **Predictive operations** — Machine learning models trained on standardized historical data
- **Lifecycle ROI** — Design decisions informed by actual operational data from previous projects
- **Risk reduction** — Formal constraints catch design errors before construction begins
- **Workforce augmentation** — Subject matter experts focus on strategy, not manual data translation
### The SME Perspective: Self-Exploration as Innovation
The methodology described in this paper is the result of an intentional self-exploration—a process of "learning by building" to master the complex DNA of the STEP and IFC standards. This positioning is critical for the future of the industry. As we move away from manual coordination and toward autonomous agents that can reason over a building's lifecycle, the role of the Subject Matter Expert must evolve.
The expert of tomorrow will not just "know" the standard; they will build the automated interpreters that allow AI to "understand" the standard. This parser is more than a technical utility; it is a foundational component of a **Reasoning Engine** that bridges the gap between the rigid, normative past of ISO standards and the fluid, agentic future of the Built Environment.
Through the lens of BEI, we do not merely see a schema of entities and types. We see the formal logic required to automate the world's largest asset class.
The methodology here is the result of intentional self-exploration—'learning by building' to master the complex language of STEP and IFC standards. This positioning matters for the industry's future. As we move from manual coordination toward autonomous agents reasoning over building lifecycles, the role of the subject matter expert must evolve. The expert of tomorrow won't just 'know' the standard; they'll build the automated interpreters that allow AI to 'understand' it. This parser is more than a technical utility; it's a foundational component of a **Reasoning Engine** that bridges the rigid, normative past of ISO standards and the fluid, agentic future of the Built Environment. Through the lens of BEI, we see not just a schema of entities and types, but the formal logic required to automate the world's largest asset class.
**Rethinking Expertise in the Age of Agentic Systems**
As organizations transition toward autonomous asset management and AI-driven decision-making, the role of Subject Matter Experts (SMEs) must evolve. The SME of tomorrow will not simply "know" standards—they will architect the systems that enable machines to reason about standards.
**Strategic Implications for Talent:**
- **New expertise tier emerges** — "Standards interpretation engineers" who build automation layers
- **Reduced routine work** — Manual data translation and validation becomes automated
- **Increased strategic work** — SMEs focus on exception handling, standards governance, and strategic data architecture
- **Competitive advantage** — Organizations that upskill SMEs on automation and AI gain significant leverage
This research is a foundational component of that evolution—demonstrating that standardized knowledge can be encapsulated in executable logic, freeing human expertise for higher-value strategic work.
[9] Appendix — Quick Reference
| EXPRESS Construct | Parser Output | Key Fields |
|-----------------------|---------------------|-----------------------------------------|
| ENTITY | EntityDef | name, abstract, supertypes, attributes |
| SUBTYPE OF (...) | EntityDef.supertypes| list[str] of parent names |
| SUPERTYPE OF (ONEOF) | EntityDef.supertype_constraint | raw constraint text |
| Explicit attribute | Attribute(section='explicit') | name, type_expr, optional |
| DERIVE attribute | Attribute(section='derived') | name, type_expr |
| INVERSE attribute | Attribute(section='inverse') | name, type_expr |
| WHERE rule | WhereRule | label, expression |
| TYPE = STRING (alias) | TypeDef(kind='alias') | base = 'STRING' |
| TYPE = SELECT (...) | TypeDef(kind='select') | base = comma-sep members |
| TYPE = ENUMERATION OF (...) | TypeDef(kind='enumeration') | base = comma-sep values |
| TYPE = LIST ... OF | TypeDef(kind='aggregate') | base = full agg expr |
| EXPRESS Construct | Parser Output | Key Fields |
|---|---|---|
| ENTITY | EntityDef | name, abstract, supertypes, attributes |
| SUBTYPE OF (...) | EntityDef.supertypes | list[str] of parent entity names |
| SUPERTYPE OF (ONEOF) | EntityDef.supertype_constraint | raw constraint text defining subtype partitioning |
| Explicit attribute | Attribute(section='explicit') | name, type_expr, optional |
| DERIVE attribute | Attribute(section='derived') | name, type_expr (computed, not stored) |
| INVERSE attribute | Attribute(section='inverse') | name, type_expr (backward relationship) |
| WHERE rule | WhereRule | label (e.g. WR1), logical expression |
| TYPE = STRING (alias) | TypeDef(kind='alias') | base = 'STRING' |
| TYPE = SELECT (...) | TypeDef(kind='select') | base = comma-separated member types |
| TYPE = ENUMERATION OF (...) | TypeDef(kind='enumeration') | base = comma-separated enumeration values |
| TYPE = LIST ... OF | TypeDef(kind='aggregate') | base = full aggregation expression |
| EXPRESS Construct | Business Meaning | Validation Value |
|---|---|---|
| ENTITY | A concrete business object (building, room, product, process) | Defines what must be instantiated and tracked |
| Explicit Attribute | Required data element | Must be present in all instances |
| Optional Attribute | Data element that may be left unset | Enables flexible but controlled data entry |
| Inheritance (SUBTYPE OF) | Specialization relationship | Enables inheritance of attributes; reduces redundancy |
| Formal Constraint (WHERE rule) | Business rule encoded formally | Enables automated compliance checking |
| Enumeration | Closed set of valid values | Enables dropdown/validation without external reference |
| Union Type (SELECT) | "Any of these alternatives" | Handles polymorphic data patterns |
| Aggregation (LIST/SET) | Collection of related items | Enables tracking relationships and dependencies |
| Schema | Applies To | Purpose | Access |
|---|---|---|---|
| MIM_LF (ISO/TS 10303-442) | Managed model-based 3D engineering across all disciplines | The broadest STEP application; suitable for complex product design and lifecycle management | https://standards.iso.org/iso/ts/10303/-442/ed-7/tech/express/mim_lf.exp |
| IFC4 EXPRESS (ISO 16739-1) | Building information modeling; architectural, structural, and MEP coordination | Industry standard for BIM; enables interoperability between design and operations tools | https://standards.buildingsmart.org/IFC/RELEASE/IFC4/FINAL/EXPRESS/IFC4.exp |
| MBX-IF Hub | All STEP and related schema resources | Central distribution point for practitioners; freely accessible with version control | https://www.mbx-if.org/home/mbx/resources/express-schemas/ |
Key Takeaways
- A Python-based regex parser can programmatically extract and structure EXPRESS schemas (ISO 10303) from normative files into typed `EntityDef` and `TypeDef` objects without third-party dependencies.
- MIM_LF (ISO/TS 10303-442) exhibits key structural patterns—few hierarchy roots, deep subtrees, high prevalence of ABSTRACT entities, and SELECT type proliferation—that directly inform validation engine design and LLM reasoning constraints.
- Parsed schema artifacts enable downstream infrastructure: STEP Part 21 file validation, schema-to-LLM context injection, cross-standard gap analysis (STEP vs. IFC), and schema evolution tracking across standard editions.
- This parser is foundational instrumentation for agentic digital twins, bridging the gap between static normative ISO documentation and machine-actionable engineering informatics.
- WHERE rules and supertype constraints encode formal semantic depth critical for autonomous constraint solving and schema-driven composition validation in the built environment intelligence domain.
Cite this work (APA & BibTeX)
Milotin, D. (2026),
“Parsing EXPRESS Schemas with Python — A Field Exercise on ISO/TS 10303-442 MIM_LF,”
Delta Persist.
@article{milotind2026,
author = {Milotin, Dragos},
title = {Parsing EXPRESS Schemas with Python — A Field Exercise on ISO/TS 10303-442 MIM_LF},
journal = {Delta Persist},
year = {2026},
note = {Field: Built Environment Intelligence. Accessed: 2026-02-23}
}