Revolutionizing Construction Management: How Ontology-Guided LLMs Decode Site Images for Smarter Decisions
In the fast-paced world of construction, real-time insights into on-site activities are crucial. Understanding what workers are doing, how equipment is being used, and whether tasks align with schedules can make or break a project’s success. Traditionally, this has relied on manual reporting or rigid AI models that struggle to adapt to the chaotic, ever-changing nature of construction sites.
Now, a groundbreaking approach is emerging: ontology-based prompting with Large Language Models (LLMs). This innovative method combines structured domain knowledge with the reasoning power of generative AI to infer complex construction activities directly from images—without needing thousands of labeled training examples.
In this comprehensive guide, we’ll explore how researchers have successfully applied this technique to achieve 73.68% accuracy in construction activity recognition, examine the role of ontologies like ConSE, and reveal why this hybrid strategy represents the future of intelligent site monitoring.
What Is Ontology-Based LLM Prompting in Construction AI?
At its core, ontology-based LLM prompting transforms raw visual data into semantically rich, structured inputs that large language models can understand and reason over. Unlike traditional image captioning—which often lacks precision—this method uses a formal construction-domain ontology to represent scene elements such as workers, equipment, materials, spatial relationships, and actions using standardized concepts.
An ontology, in this context, acts as a shared vocabulary and logical framework. It defines:
- Classes (e.g., Worker, Excavator, ConcretePouring)
- Attributes (e.g., posture, tool usage)
- Relationships (e.g., “operating”, “monitoring”, “near”)
This structure allows AI systems to move beyond simple object detection and toward context-aware semantic interpretation—a critical leap for practical applications in construction management.
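To make this concrete, here is a minimal Python sketch of how such a vocabulary could be encoded; the class and relation names below are illustrative placeholders, not the actual ConSE definitions:

from enum import Enum

# Illustrative only (not the actual ConSE schema): a tiny construction-scene vocabulary.
class EntityClass(Enum):
    WORKER = "Worker"
    EXCAVATOR = "Excavator"
    HOIST = "Hoist"
    FORMWORK_REGION = "FormworkRegion"

class Relation(Enum):
    OPERATING = "operating"
    MONITORING = "monitoring"
    IS_NEAR = "isNear"
    HAS_POSTURE = "hasPosture"

# A scene fact is simply a (subject, predicate, object) triple, for example:
fact = ("Worker_1", Relation.OPERATING.value, "Hoist_1")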
Key Takeaway: Ontology-based LLM prompting bridges the gap between computer vision outputs and high-level reasoning by grounding visual features in formal domain semantics.
Why Traditional Methods Fall Short
Before diving into the solution, it’s essential to understand the limitations of current approaches:
| METHOD | KEY LIMITATIONS |
|---|---|
| Supervised Learning | Requires massive annotated datasets; poor generalization to unseen scenarios; lacks interpretability |
| Knowledge-Based Rules | Static logic fails in dynamic environments; hard to scale across diverse projects |
| Standard Image Captioning + LLMs | Loses fine-grained visual detail; ambiguous descriptions reduce accuracy |
These shortcomings hinder scalability and reliability—especially when dealing with rare or complex activities like safety monitoring or multi-agent coordination.
The Proposed Framework: A Step-by-Step Breakdown
The research introduces a novel pipeline that leverages in-context learning within LLMs, guided by an ontology-driven representation of construction scenes.
Core Components of the Ontology-Based LLM Prompting System
1. Diverse Visual Information Extraction
Instead of relying solely on bounding boxes or captions, the system extracts multiple layers of visual data using advanced computer vision techniques:
| ALGORITHM | OUTPUT TYPE | ROLE IN ACTIVITY RECOGNITION |
|---|---|---|
| Object Detection | Entities (Worker_1, Excavator) | Identifies key actors and objects |
| Pose Estimation | Posture & Joint Data | Infers physical actions (e.g., lifting, drilling) |
| Attention Analysis | Gaze Direction | Reveals focus of attention (e.g., monitoring) |
| Relative Relationship Analysis | Spatial/Functional Links | Captures interactions (e.g., worker operating crane) |
| Grid-Based Spatial Indexing | N×N Position Tags | Encodes location context (N = 3 used in the study) |
Each entity is enriched with metadata and mapped to ontology classes, forming symbolic triples like:
(Worker_1, hasPosture, Standing)
(Worker_1, operating, Hoist_1)
(Hoist_1, lifting, Formwork_1)
(Formwork_1, locatedAt, GridCell_5)
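A minimal sketch of the grid-based spatial indexing step, assuming axis-aligned bounding boxes from the object detector and the 3x3 grid reported in the study (the helper name and example numbers are illustrative):

def grid_cells(bbox, image_w, image_h, n=3):
    """Return the 1-indexed cells of an n x n grid that a bounding box overlaps.

    bbox is (x_min, y_min, x_max, y_max) in pixels; cells are numbered row-major,
    so cell 5 is the centre of a 3 x 3 grid.
    """
    cell_w, cell_h = image_w / n, image_h / n
    x_min, y_min, x_max, y_max = bbox
    cols = range(int(x_min // cell_w), min(int(x_max // cell_w), n - 1) + 1)
    rows = range(int(y_min // cell_h), min(int(y_max // cell_h), n - 1) + 1)
    return [r * n + c + 1 for r in rows for c in cols]

# Example: a tall formwork panel in a 900x900 frame spans the centre and
# bottom-centre cells, which then yields triples such as (Formwork_1, locatedAt, GridCell_5).
print(grid_cells((350, 320, 520, 650), 900, 900))  # -> [5, 8]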
2. Ontology-Grounded Scene Representation
Using the ConSE ontology—a task-agnostic model covering low-level visuals to high-level semantics—the system converts image content into structured <Context> blocks. For example:
{
"entities": ["Worker_1", "Hoist_1", "Formwork_1"],
"attributes": {"Worker_1": "Standing", "Hoist_1": "Active"},
"relationships": [
{"subject": "Worker_1", "predicate": "operating", "object": "Hoist_1"},
{"subject": "Hoist_1", "predicate": "lifting", "object": "Formwork_1"}
],
"spatial_grid": {"Formwork_1": 5, "Worker_1": 4}
}
This formalism ensures consistency, reduces ambiguity, and enables precise alignment with known activity patterns.
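One way to enforce that consistency in practice is to validate every triple against the ontology vocabulary before it enters the prompt. The sketch below assumes illustrative class and predicate sets rather than the real ConSE definitions:

# Illustrative vocabulary; the real ConSE class and predicate sets are larger.
ALLOWED_CLASSES = {"Worker", "Hoist", "FormworkRegion", "Excavator", "Column"}
ALLOWED_PREDICATES = {"operating", "lifting", "hasAttentionOn", "isNear", "locatedAt", "hasPosture"}

def validate_triples(triples, entity_classes):
    """Keep only triples whose predicate and subject class exist in the ontology.

    triples: iterable of (subject_id, predicate, object_id_or_literal)
    entity_classes: dict such as {"Worker_1": "Worker", "Hoist_1": "Hoist"}
    """
    valid = []
    for s, p, o in triples:
        if p not in ALLOWED_PREDICATES:
            continue  # predicate not defined in the ontology
        if entity_classes.get(s) not in ALLOWED_CLASSES:
            continue  # subject is not a recognised ontology class
        valid.append((s, p, o))
    return valid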
3. Chain-of-Thought (CoT) Prompt Engineering
A major innovation lies in integrating Chain-of-Thought reasoning into prompts. Instead of asking “What activity is shown?”, the model is guided through intermediate steps:
“Based on the presence of a worker operating a hoist that is lifting formwork near a column, and given that no pouring or cutting is occurring, the most likely activity is FormworkSetup.”
This mimics human expert reasoning and significantly improves inference accuracy.
The full prompt architecture includes:
- Task Instruction Prompt: Sets role, context, and objective
- In-Context Examples: 3 structured <Context> → <Inference> pairs
- Output Formatting Guide: Ensures consistent response structure
See Figure 1 in the paper for a complete workflow diagram.
Evaluation: Performance Metrics and Results
To test the system, researchers evaluated GPT-3.5-turbo on a dataset of 53 images representing 29 construction activity types drawn from the ConSE ontology.
Base Experiment Configuration
| PARAMETER | SETTING |
|---|---|
| LLM Model | gpt-3.5-turbo-0613 |
| Temperature | 0.5 |
| Top_p | 0.3 |
| In-Context Examples | 3 |
| Visual Inputs | Entity + Attribute + Attention + Relationship + Spatial |
Predictions were allowed across all 41 ontology-defined activities, not just the 29 present in the dataset, to assess generalization.
Key Performance Metrics
Four evaluation metrics were used to capture different aspects of performance:
| METRIC | FORMULA | INTERPRETATION |
|---|---|---|
| Activity Accuracy | total true positives / total samples | Overall correctness of predicted activity labels |
| Activity+Elements Accuracy | exact match of both activity and involved entities | Measures group activity recognition capability |
| Redundancy Rate | (# of “None” added to GT) / (original # of GT labels) | Indicates over-prediction (false positives) |
| Miss Rate | (# of “None” added to prediction) / (# of GT labels) | Reflects under-prediction (false negatives) |
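For readers who want to reproduce the scoring, here is a minimal, illustrative interpretation of these metrics in Python; the “None”-padding convention follows the table above, but this helper is not the authors' evaluation script:

def evaluate(ground_truth, predictions):
    """Illustrative reading of the paper's metrics, not the authors' script.

    ground_truth, predictions: lists of per-image activity-label lists. The
    shorter side of each pair is treated as padded with "None", following the
    redundancy- and miss-rate definitions above.
    """
    tp = total = gt_total = redundant = missed = 0
    for gt, pred in zip(ground_truth, predictions):
        gt_total += len(gt)
        redundant += max(len(pred) - len(gt), 0)   # "None" labels added to GT
        missed += max(len(gt) - len(pred), 0)      # "None" labels added to prediction
        tp += sum(min(gt.count(a), pred.count(a)) for a in set(gt))
        total += max(len(gt), len(pred))           # padded sample count
    return {
        "activity_accuracy": tp / total,
        "redundancy_rate": redundant / gt_total,
        "miss_rate": missed / gt_total,
    }

# Example with two images:
print(evaluate([["SoilExcavation", "ProcessMonitoring"], ["FormworkSetUp"]],
               [["SoilExcavation"], ["FormworkSetUp", "ConcretePouring"]]))
# -> activity_accuracy 0.5, redundancy_rate ~0.33, miss_rate ~0.33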
Results from the base experiment:
| METRIC | SCORE |
|---|---|
| Activity Accuracy | 73.68% |
| Activity+Elements Accuracy | 50.00% |
| Redundancy Rate | 20.78% |
| Miss Rate | 3.91% |
While not surpassing top-performing supervised models (~80%), these results demonstrate strong few-shot generalization—achieving competitive accuracy with only three in-context examples, compared to thousands of labeled images required by conventional deep learning.
Ablation Studies: What Drives Success?
To isolate the impact of each design choice, ablation studies were conducted. Here’s what they revealed:
Impact of Chain-of-Thought Reasoning
Removing CoT components caused a significant drop:
- Activity Accuracy: ↓ from 73.68% to 65.78%
- Activity+Elements Accuracy: ↓ from 50.00% to 44.73%
- Miss Rate nearly doubled
✅ Insight: CoT prompting enhances logical consistency and helps the model connect interactions to higher-level activities.
Value of Advanced Visual Features
| FEATURE REMOVED | ACTIVITY ACC DROP | ACTIVITY+ELEMENTS ACC DROP |
|---|---|---|
| Only Entities (w/ E) | –9.21% | –27.63% |
| No Spatial Info (w/o S) | –6.57% | –8.57% |
| No Attention Cues (w/o Ac) | –6.57% | –13.16% |
| No Attributes/Relations (w/o AR) | –5.26% | –18.42% |
Notably, attention cues had a dramatic effect on ProcessMonitoring recognition:
- With cues: 14.29% accuracy
- Without: 0.00% accuracy
Similarly, adding attribute and relational data boosted performance from 0% to 40.79% in some configurations.
✅ Insight: Rich contextual signals like gaze direction and functional relationships are vital for disambiguating similar-looking scenes.
Optimal Randomness and Example Count
Interestingly, introducing moderate randomness (temperature = 1.0, top_p = 0.6) improved both metrics:
- Activity Accuracy: ↑ to 77.63%
- Activity+Elements Accuracy: ↑ to 53.95%
However, high randomness degraded performance due to incoherent outputs.
Regarding in-context examples:
- 1 example: 71.05% activity accuracy
- 2 examples: 69.73%
- 3 examples: 73.68%
✅ Insight: There’s a sweet spot—too few examples limit pattern learning; too many may introduce noise or confusion.
Practical Implications for Construction Professionals
This research isn’t just theoretical—it offers tangible benefits for real-world project management.
Five Strategic Advantages of Ontology-Based LLM Prompting
1. Data Efficiency: Achieves meaningful results with minimal labeled data—ideal for niche or evolving construction workflows.
2. Few-Shot Generalization: Adapts quickly to new project types without retraining, supporting rapid deployment across sites.
3. Semantic Structuring Enhances Interoperability: Ontologies enable integration with BIM, scheduling tools, and safety databases via standardized queries.
4. Transparent and Explainable Outputs: CoT reasoning provides natural-language explanations, building trust among engineers and site managers.
5. Scalable Across Tasks: The same framework can be repurposed for hazard detection, progress tracking, or compliance auditing.
For instance, one could extend the prompt to ask:
“Based on detected entities and their interactions, list potential safety risks and cite relevant OSHA regulations.”
This flexibility makes the approach highly adaptable.
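As a hedged sketch of that flexibility, the snippet below appends such a request to the prompt produced by the implementation at the end of this post; the constant and helper are hypothetical, not part of the paper:

# Hypothetical extension; neither the constant nor the helper appears in the paper.
SAFETY_EXTENSION = (
    "\nIn addition to the <Inference>, list potential safety risks suggested by the "
    "detected entities and their interactions, and cite the relevant OSHA regulations."
)

def construct_safety_prompt(recognizer, scene_context):
    """Reuse the activity-recognition prompt and bolt on a safety-analysis request."""
    base = recognizer.construct_full_prompt(scene_context)
    # Slot the extra instruction in just before the trailing "<Inference>:" cue.
    return base[: base.rfind("<Inference>:")] + SAFETY_EXTENSION + "\n<Inference>:"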
Limitations and Future Directions
Despite promising results, several challenges remain:
- Lack of Depth Perception: 2D spatial indexing struggles with depth ambiguity, especially in wide-angle shots.
- Dependency on Accurate CV Inputs: Errors in pose estimation or object detection propagate into flawed reasoning.
- Model Knowledge Gaps: Off-the-shelf LLMs may lack sufficient construction-specific expertise.
- Prompt Robustness: Poorly chosen in-context examples can mislead the model.
Future work should explore:
- Integration with vision-language models (VLMs) like BLIP-2 for end-to-end feature alignment
- Use of MPEG-7 descriptors or bounding box coordinates for richer spatial encoding
- Fine-tuning LLMs on construction-specific corpora to improve domain fluency
Conclusion: Toward Intelligent, Context-Aware Construction Sites
Ontology-based LLM prompting marks a pivotal shift in how we analyze construction imagery. By combining structured domain knowledge with generative reasoning, this method delivers a scalable, interpretable, and adaptive solution for activity recognition.
With an accuracy of 73.68% using only three in-context examples, it proves that high-performance AI doesn’t always require big data—it requires smart structuring.
As LLM context windows expand and multimodal models mature, embedding entire ontologies directly into prompts will unlock even deeper understanding, enabling real-time digital twins, automated progress validation, and proactive risk mitigation.
🔍 Want to implement this in your organization?
Explore open-source ontologies like ConSE or DiCon-SII, start curating small sets of annotated site photos, and experiment with GPT-4 or open-weight models like LLaMA via APIs.
We’d love to hear your thoughts! Have you tried using LLMs for construction analysis? Share your experiences in the comments below or download the full research paper here to dive deeper into the technical details.
Based on the research paper “Ontology-based prompting with large language models for inferring construction activities from construction images,” I have written a complete Python implementation of the proposed methodology.
# -*- coding: utf-8 -*-
"""
Ontology-Based Construction Activity Recognition using LLMs
This script provides a complete Python implementation of the methodology proposed
in the paper: "Ontology-based prompting with large language models for inferring
construction activities from construction images" by C. Zeng, T. Hartmann, and L. Ma.
The script simulates the end-to-end workflow described in the paper:
1. **Visual Information Extraction**: Simulates the output of various computer
vision models (e.g., object detection, pose estimation) to extract entities,
attributes, and relationships from an image.
2. **Ontology-based Scene Context Creation**: Transforms the extracted visual
information into a structured, ontology-guided text representation, as
shown in Figure 2 of the paper.
3. **Prompt Engineering**: Constructs a comprehensive prompt for a Large Language
Model (LLM) by combining:
- Task Formulation (Figure 3)
- In-Context Examples with Chain-of-Thought (CoT) reasoning
- Output Formulation (Figure 4)
- The new, unseen scene for inference.
4. **LLM Inference**: Simulates a call to a causal LLM (like GPT-3.5) to get the
inferred construction activity. The actual network call is mocked to ensure
the script runs without an API key.
5. **Output Parsing**: Displays the final inference from the model.
Note: Since the actual computer vision models, the specific ConSE ontology,
and the LLM API are external dependencies, this script uses mock data and a
simulated API call to focus on demonstrating the core logic and structure of
the proposed prompting technique.
"""
import json
class OntologyBasedActivityRecognizer:
"""
Encapsulates the entire process of inferring construction activities
from images using ontology-based prompting with an LLM.
"""
def __init__(self, llm_api_endpoint="mock_endpoint"):
"""
Initializes the recognizer.
Args:
llm_api_endpoint (str): The endpoint for the LLM API.
Here, it's a mock endpoint.
"""
self.llm_api_endpoint = llm_api_endpoint
print("OntologyBasedActivityRecognizer initialized.")
# --- Step 1: Simulate Visual Information Extraction ---
def _simulate_visual_information_extraction(self, image_path):
"""
Simulates the output of running multiple CV algorithms on an image.
In a real-world application, this function would use models for object
detection, pose estimation, attention analysis, etc.
Args:
image_path (str): The path to the input image.
Returns:
dict: A dictionary containing the extracted visual information.
"""
print(f"\n[Step 1] Simulating visual information extraction for: {image_path}")
# Mock data based on the image path. This simulates the CV pipeline output.
if "test_image_formwork.jpg" in image_path:
# This mock data is based on Figure 7 in the paper.
return {
"entities": [
{"id": "Worker_1", "class": "Worker", "grid_location": [8]},
{"id": "Hoist_1", "class": "Hoist", "grid_location": [2]},
{"id": "FormworkRegion_1", "class": "FormworkRegion", "grid_location": [5]},
{"id": "Column_1", "class": "Column", "grid_location": [4]},
{"id": "Column_2", "class": "Column", "grid_location": [6]},
{"id": "WoodenPlateRegion_1", "class": "WoodenPlateRegion", "grid_location": [7]},
],
"attributes": [
{"entity_id": "Worker_1", "property": "hasPosture", "value": "Standing"},
],
"relationships": [
{"subject": "Worker_1", "predicate": "hasAttentionOn", "object": "Hoist_1"},
{"subject": "Hoist_1", "predicate": "isNear", "object": "FormworkRegion_1"},
{"subject": "FormworkRegion_1", "predicate": "isAttachedTo", "object": "Column_2"},
]
}
# Default mock data for any other image
return {
"entities": [
{"id": "Worker_1", "class": "Worker", "grid_location": [5, 8]},
{"id": "ReinforcementRegion_1", "class": "ReinforcementRegion", "grid_location": [1, 2, 4, 5]},
],
"attributes": [
{"entity_id": "Worker_1", "property": "hasPosture", "value": "Bending"},
],
"relationships": []
}
# --- Step 2: Create Ontology-based Scene Context ---
def _create_ontology_guided_representation(self, visual_info):
"""
Formats the extracted visual information into a structured text block
based on the ontology schema. This implements the concept from Figure 2.
Args:
visual_info (dict): The dictionary from the simulation step.
Returns:
str: A formatted string representing the scene context.
"""
print("[Step 2] Creating ontology-guided scene representation.")
context_str = "<Context>:{\n"
context_str += " Entity:{\n"
for item in visual_info["entities"]:
loc = ",".join(map(str, item['grid_location']))
context_str += f" {item['id']} a {item['class']} ({loc})\n"
context_str += " }\n"
context_str += " Attribute:{\n"
for item in visual_info["attributes"]:
context_str += f" {item['entity_id']} {item['property']} {item['value']}\n"
context_str += " }\n"
context_str += " Relationship:{\n"
for item in visual_info["relationships"]:
context_str += f" {item['subject']} {item['predicate']} {item['object']}\n"
context_str += " }\n"
context_str += "}"
return context_str
# --- Step 3: Prompt Engineering ---
def _get_task_formulation_prompt(self):
"""
Returns the task formulation prompt as described in Figure 3.
This sets the role, context, and task for the LLM.
"""
return """Your role is that of a construction engineering professional, and I'm proposing a collaborative effort in understanding of how visual data in a specific context leads to inferences.
To start, I'll provide you with images for in-context learning. I'll address your limitations associated with processing image inputs by presenting examples based on the ontology-based knowledge graphs. The graph has two parts:
The <Context>: This section includes basic visual facts. Each image is divided into a 3x3 grid, and numerical annotations indicate the spatial location of each entity, whether fully or partially occupied, within the image.
The <Inference>: This part encapsulates the results inferred by combining the visual facts with underlying heuristics.
All elements in the input are represented as triples and have explicit semantic meaning. To fully understand the input, your task is to carefully analyze each triple by considering both the given ontology and your expertise in construction. Your primary goal is to learn the implicit heuristics connecting the <Context> and the <Inference>. This will require you to utilize the provided examples, the ontology, and your domain-specific knowledge."""
def _get_output_formulation_prompt(self):
"""
Returns the output formulation prompt as described in Figure 4.
This guides the LLM on how to structure its response.
"""
return """I believe you have already grasped the implicit heuristics. Now you should apply your learnt heuristics to generate the <Inference>, as per formatted in the examples, based on the new test <Context> I shall provide later. Here are some guidelines to follow when generating your answer:
1. When generating the activity type for an <Activity>, it's crucial to specifically map your understanding to the terms within the brackets that I will provide later.
Activity:[ConcretePouring, FormworkSetUp, ReinforcementAssembly, SoilExcavation, ProcessMonitoring, ...]
2. When selecting participants for an activity, it's essential to ensure that each activity includes, at the very least, either a worker or a heavy machine.
3. When the given context involves concurrent activities, you are encouraged to include multiple activities in your response.
4. Please refrain from making excessive assumptions about the situation. Instead, strive to provide reasonable inferences based on the context provided. If certain entities are not actively participating in any activities, please avoid suggesting activities for them."""
def _get_in_context_examples(self):
"""
Provides a list of structured in-context examples for few-shot learning.
Each example includes the context, the Chain-of-Thought (CoT) components,
and the final high-level semantic inference.
"""
return [
{
"context": """<Context>:{
Entity:{
Worker_1 a Worker (5,8)
Worker_2 a Worker (8)
ReinforcementRegion_1 a ReinforcementRegion (1,2,4,5)
SteelRod_1 a SteelRod (9)
}
Attribute:{
Worker_1 hasPosture Standing
Worker_2 hasPosture Bending
Worker_1 hasBodyPart Worker_Hand_1
}
Relationship:{
Worker_1 hasAttentionOn ReinforcementRegion_1
Worker_2 isConnectedTo SteelRod_1
Worker_Hand_1 isConnectedTo ReinforcementRegion_1
}
}""",
"inference": """<Inference>:{
the CoT components:{
WorkerBasicInteraction:{
Worker_1 installing ReinforcementRegion_1
Worker_2 assisting Worker_1
}
}
high-level semantics:{
Activity:{
Activity_1:{
Activity type:{
ReinforcementAssembly
}
Activity elements:{
Worker_1, Worker_2, ReinforcementRegion_1, SteelRod_1
}
}
}
}
}"""
},
# Add a second example for better learning
{
"context": """<Context>:{
Entity:{
Excavator_1 a Excavator (4,5,7,8)
Worker_1 a Worker (6)
SoilTrenchRegion_1 a SoilTrenchRegion (7,8)
SoilGroundRegion_1 a SoilGroundRegion (1,2,3)
}
Attribute:{
Excavator_1 hasPart Excavator_Bucket_1
}
Relationship:{
Worker_1 hasAttentionOn Excavator_1
Excavator_Bucket_1 isIn SoilTrenchRegion_1
}
}""",
"inference": """<Inference>:{
the CoT components:{
MachineryInteraction:{
Excavator_1 excavating SoilTrenchRegion_1
}
WorkerBasicInteraction:{
Worker_1 monitoring Excavator_1
}
}
high-level semantics:{
Activity:{
Activity_1:{
Activity type:{
SoilExcavation
}
Activity elements:{
Excavator_1, SoilTrenchRegion_1
}
},
Activity_2:{
Activity type:{
ProcessMonitoring
}
Activity elements:{
Worker_1, Excavator_1
}
}
}
}
}"""
}
]
def construct_full_prompt(self, new_scene_context):
"""
Assembles the complete prompt to be sent to the LLM.
Args:
new_scene_context (str): The ontology-guided representation of the new image.
Returns:
str: The complete, formatted prompt.
"""
print("[Step 3] Constructing the full LLM prompt.")
prompt = self._get_task_formulation_prompt()
prompt += "\n\n--- EXAMPLES ---\n"
for example in self._get_in_context_examples():
prompt += f"\n{example['context']}\n"
prompt += f"{example['inference']}\n"
prompt += "---\n"
prompt += "\n--- NEW SCENE TO ANALYZE ---\n"
prompt += self._get_output_formulation_prompt()
prompt += f"\n\n{new_scene_context}\n"
prompt += "<Inference>:" # This prompts the model to generate the inference part.
return prompt
# --- Step 4: Simulate LLM Inference ---
def _simulate_llm_inference(self, prompt):
"""
Simulates making an API call to a large language model.
In a real application, this would use a library like 'requests' or 'openai'.
Args:
prompt (str): The fully constructed prompt.
Returns:
str: The simulated text response from the LLM.
"""
print("[Step 4] Simulating LLM API call...")
# This is a mock response tailored to the "test_image_formwork.jpg" input.
# It mirrors the inference shown in Figure 7 of the paper.
if "FormworkRegion_1" in prompt and "Hoist_1" in prompt:
# A realistic, structured response from the LLM
mock_response = """{
"the CoT components":{
"WorkerBasicInteraction":{
"Worker_1 monitoring Hoist_1"
},
"MachineryInteraction":{
"Hoist_1 lifting FormworkRegion_1"
}
},
"high-level semantics":{
"Activity":{
"Activity_1":{
"Activity type":{
"FormworkSetUp"
},
"Activity elements":{
"Worker_1", "Hoist_1", "FormworkRegion_1"
}
}
}
}
}"""
return mock_response
return "{\n \"error\": \"Could not determine activity based on context.\" \n}"
# --- Example of a REAL API call (sketch, assuming the openai>=1.0 SDK) ---
#
# from openai import OpenAI
# client = OpenAI()  # reads OPENAI_API_KEY from the environment
#
# try:
#     response = client.chat.completions.create(
#         model="gpt-3.5-turbo",  # the paper reports gpt-3.5-turbo-0613
#         messages=[{"role": "user", "content": prompt}],
#         max_tokens=500,
#         temperature=0.5,  # as used in the paper
#         top_p=0.3         # as used in the paper
#     )
#     return response.choices[0].message.content.strip()
# except Exception as e:
#     return f"An API error occurred: {e}"
def run_inference_on_image(self, image_path):
"""
Executes the full pipeline for a single image.
Args:
image_path (str): The path to the image to analyze.
"""
# Step 1
visual_info = self._simulate_visual_information_extraction(image_path)
# Step 2
scene_context = self._create_ontology_guided_representation(visual_info)
# Step 3
full_prompt = self.construct_full_prompt(scene_context)
print("\n" + "="*50)
print(" FINAL PROMPT SENT TO LLM (Excerpt)")
print("="*50)
print(full_prompt[-1000:]) # Print the tail of the prompt for brevity
print("="*50 + "\n")
# Step 4
llm_response_str = self._simulate_llm_inference(full_prompt)
# Step 5 - Parse and display the result
print("[Step 5] Parsing and displaying the final inference.")
print("-" * 50)
print(" LLM INFERENCE RESULT")
print("-" * 50)
try:
# Try to pretty-print if it's JSON
parsed_response = json.loads(llm_response_str.replace("'", '"')) # Basic cleanup
print(json.dumps(parsed_response, indent=4))
except json.JSONDecodeError:
# Otherwise, print as raw text
print(llm_response_str)
print("-" * 50)
if __name__ == '__main__':
# Initialize the recognizer model
recognizer = OntologyBasedActivityRecognizer()
# Define a mock image file path to be analyzed.
# The simulation is keyed to this specific filename.
test_image_file = "mock_data/images/test_image_formwork.jpg"
# Run the full inference pipeline
recognizer.run_inference_on_image(test_image_file)