Customized Vision-Language Representations for Industrial Qualification: Bridging AI and Expert Knowledge in Additive Manufacturing

Illustration of a hybrid AI system linking microscope images of metal microstructures with expert textual assessments using vision-language models such as CLIP and FLAVA.

In the rapidly evolving world of additive manufacturing (AM), ensuring the quality and reliability of engineered materials is a critical bottleneck. Traditional qualification methods rely heavily on manual inspection and expert interpretation, leading to delays, inconsistencies, and scalability issues. A groundbreaking new study titled “Linking heterogeneous microstructure informatics with expert characterization knowledge through customized and hybrid vision-language representations for industrial qualification” proposes a transformative solution: customized vision-language representations (VLRs) that seamlessly integrate AI-driven image analysis with human expertise.

This article dives deep into the paper’s methodology, findings, and implications, offering engineers, researchers, and industrial stakeholders a clear understanding of how hybrid VLRs can revolutionize material qualification in advanced manufacturing.


What Are Vision-Language Representations (VLRs)?

Vision-Language Representations (VLRs) are AI models trained to understand and link visual data (like microscope images) with textual descriptions (such as expert assessments). These models learn a shared embedding space where images and text can be compared based on semantic similarity.

Two of the most prominent pre-trained VLRs used in this study are:

  • CLIP (Contrastive Language–Image Pre-training) by OpenAI
  • FLAVA (Foundational Language and Vision Alignment) by Meta AI

These models are pre-trained on massive datasets of image-text pairs, enabling them to perform zero-shot classification—that is, classifying new, unseen data without task-specific retraining.
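To make zero-shot classification concrete, here is a minimal sketch using the publicly available CLIP checkpoint on Hugging Face; the image path and the candidate descriptions are illustrative placeholders, not data from the paper.

# Minimal zero-shot classification sketch with a public CLIP checkpoint (Hugging Face Transformers).
# The image path and the candidate descriptions are illustrative placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("micrograph.png").convert("RGB")  # placeholder path
labels = [
    "a micrograph with uniformly distributed carbide particles",
    "a micrograph with clustered carbide particles",
]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into label probabilities.
probs = outputs.logits_per_image.softmax(dim=-1).squeeze()
print(dict(zip(labels, probs.tolist())))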


The Challenge: Bridging Data, Information, and Expert Knowledge

The paper addresses a fundamental gap in materials engineering: the disconnect between raw data and expert knowledge.

As illustrated in the Data–Information–Knowledge–Wisdom (DIKW) hierarchy (Figure 1 in the paper), the journey from raw micrographs to qualified materials involves multiple steps:

  1. Data: Raw pixel intensities from optical metallography.
  2. Information: Extracted features via semantic segmentation (e.g., carbide clusters, porosity).
  3. Knowledge: Expert interpretation of these features using domain-specific criteria.
  4. Wisdom: Application-specific decisions (e.g., accept/reject based on industry standards).

While deep learning models excel at converting data to information, the information-to-knowledge transition remains largely manual. This is where customized VLRs come in.


The Solution: Customized and Hybrid VLRs

The study introduces a novel framework that enhances generic VLRs through customization and hybridization, specifically tailored for industrial qualification of additively manufactured metal matrix composites (MMCs).

Step 1: Data and Preprocessing

The researchers used Ni-WC MMCs fabricated via directed energy deposition (DED), a form of AM. Optical metallography images were processed and segmented using a deep learning tool called MicroSegQ+, which identified key phases:

  • Red: Fusion line
  • Blue: Heat-affected zone (HAZ)
  • Green: Bead reinforcement area
  • Pink: Carbide particles
  • Dark Blue: Porosity

This segmentation transformed raw images into quantifiable microstructure information.
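As a rough illustration of how such a color-coded mask becomes quantifiable information, the sketch below computes phase area fractions from an RGB segmentation map; the exact RGB values are assumptions based on the color legend above, not the actual MicroSegQ+ palette.

# Sketch: converting a color-coded segmentation mask into phase area fractions.
# The RGB values below are assumptions based on the color legend above, not the MicroSegQ+ palette.
import numpy as np
from PIL import Image

PHASE_COLORS = {
    "fusion_line":   (255, 0, 0),      # red (assumed)
    "haz":           (0, 0, 255),      # blue (assumed)
    "reinforcement": (0, 255, 0),      # green (assumed)
    "carbides":      (255, 105, 180),  # pink (assumed)
    "porosity":      (0, 0, 139),      # dark blue (assumed)
}

def phase_fractions(mask_path):
    """Return the area fraction of each phase in an RGB-coded segmentation mask."""
    mask = np.array(Image.open(mask_path).convert("RGB"))
    total_pixels = mask.shape[0] * mask.shape[1]
    return {name: float(np.all(mask == rgb, axis=-1).sum()) / total_pixels
            for name, rgb in PHASE_COLORS.items()}

# Example: fractions = phase_fractions("segmented_micrograph.png")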

Step 2: Encoding Expert Knowledge

Six Expert Assessments (EAs) were defined to qualify the microstructures:

Assessment                     Criteria
Dilution (EA-1)                Area of matrix material mixing into substrate (≤5% acceptable)
HAZ (EA-2)                     Depth of thermally altered substrate (≤1 mm typical)
Reinforcement (EA-3)           Presence of distinct bead structure
Porosity (EA-4)                Pore size, area fraction, distribution
Carbide Dissolution (EA-5)     ≥30% carbide area fraction
Carbide Distribution (EA-6)    Uniform spatial distribution

Each assessment was encoded into positive and negative textual prompts, such as:

“An ideal microstructural image has uniformly distributed pink carbide particles in the green metal matrix…”

These color-aware textual descriptions were crucial for aligning language with visual features.
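In code, each assessment can be stored as a small set of positive and negative prompts, as in this sketch for the carbide-distribution criterion (the full script at the end of the article organizes all assessments this way); the wording paraphrases the article's example rather than the paper's exact prompt list.

# Sketch: one expert assessment encoded as positive/negative, color-aware prompts.
# The wording paraphrases the article's example; the paper's exact prompt set may differ.
carbide_distribution_prompts = {
    "positive": [
        "An ideal microstructural image has uniformly distributed pink carbide "
        "particles in the green metal matrix.",
    ],
    "negative": [
        "A non-ideal microstructural image has clustered or unevenly distributed "
        "pink carbide particles in the green metal matrix.",
    ],
}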


How Customized VLRs Work: The Net Similarity Scoring Approach

The core innovation lies in a customized similarity-based representation that computes a net similarity score for each microstructure.

1. Vision-Language Similarity (Using CLIP)

For a given image, the model computes cosine similarity with averaged positive and negative text embeddings:

\[ \bar{z}_{T}^{+} = \frac{1}{N^{+}} \sum_{i} z_{T,i}^{+}, \qquad \bar{z}_{T}^{-} = \frac{1}{N^{-}} \sum_{j} z_{T,j}^{-} \]

\[ \Delta_{\text{CLIP}} = \cos\!\left(z_I, \bar{z}_{T}^{+}\right) - \cos\!\left(z_I, \bar{z}_{T}^{-}\right) \]

A positive Δ indicates alignment with acceptable criteria.
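A compact sketch of this computation with the Hugging Face CLIP API is shown below; the prompts and image path are placeholders, and the same averaging-and-difference pattern carries over to the FLAVA vision-vision comparison in the next step. The full script at the end of the article wraps this logic into reusable functions.

# Sketch: net vision-language similarity (Delta_CLIP) for one query image.
# Prompts and the image path are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

pos_prompts = ["uniformly distributed pink carbides in a green metal matrix"]  # placeholder
neg_prompts = ["clustered pink carbides in a green metal matrix"]              # placeholder
image = Image.open("micrograph.png").convert("RGB")                            # placeholder path

with torch.no_grad():
    z_img = model.get_image_features(**processor(images=image, return_tensors="pt"))
    z_txt = model.get_text_features(**processor(text=pos_prompts + neg_prompts,
                                                return_tensors="pt", padding=True))

# L2-normalize, average the positive and negative prompt embeddings, then take the
# difference of cosine similarities: a positive delta means "closer to acceptable".
z_img = z_img / z_img.norm(dim=-1, keepdim=True)
z_txt = z_txt / z_txt.norm(dim=-1, keepdim=True)
z_pos = z_txt[:len(pos_prompts)].mean(dim=0, keepdim=True)
z_neg = z_txt[len(pos_prompts):].mean(dim=0, keepdim=True)
delta_clip = torch.cosine_similarity(z_img, z_pos) - torch.cosine_similarity(z_img, z_neg)
print(float(delta_clip))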

2. Vision-Vision Similarity (Using FLAVA)

Similarly, the model compares the query image to reference positive and negative image embeddings:

\[ \bar{z}_{I}^{+} = \frac{1}{M^{+}} \sum_{k} z_{I,k}^{+}, \qquad \bar{z}_{I}^{-} = \frac{1}{M^{-}} \sum_{l} z_{I,l}^{-} \]

\[ \Delta_{\text{FLAVA}} = \cos\!\left(z_q, \bar{z}_{I}^{+}\right) - \cos\!\left(z_q, \bar{z}_{I}^{-}\right) \]

3. Hybridization via Z-Score Normalization

To combine both models, the deltas are z-score normalized to align their distributions:

\[ z_{\text{CLIP}} = \frac{\Delta_{\text{CLIP}} – \mu_{\text{CLIP}}}{\sigma_{\text{CLIP}}}, \qquad z_{\text{FLAVA}} = \frac{\Delta_{\text{FLAVA}} – \mu_{\text{FLAVA}}}{\sigma_{\text{FLAVA}}} \]

The final hybrid similarity score is:

\[ \Delta_{\text{Hybrid}} = z_{\text{CLIP}} + z_{\text{FLAVA}} \]

If Δ_Hybrid ≥ 0, the sample is classified as acceptable.
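The toy example below uses made-up delta values (not the paper's results) to show how z-scoring over a batch can flip a borderline sample whose raw CLIP delta is slightly negative.

# Toy illustration of the z-score hybrid rule with made-up delta values (not the paper's data).
import numpy as np

delta_clip  = np.array([-0.004, 0.020, 0.031, -0.060])   # raw CLIP deltas for four samples
delta_flava = np.array([ 0.050, 0.012, -0.008, -0.041])  # raw FLAVA deltas for the same samples

def z_score(x):
    return (x - x.mean()) / x.std()                       # normalize over the evaluated batch

hybrid = z_score(delta_clip) + z_score(delta_flava)       # Delta_Hybrid = z_CLIP + z_FLAVA

# Sample 0 has a slightly negative raw CLIP delta, yet its strong FLAVA score pulls the
# hybrid value above the zero threshold, so it would be classified as acceptable.
print(hybrid)
print(hybrid >= 0)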


Key Findings: CLIP vs. FLAVA – Complementary Strengths

The study revealed distinct performance profiles:

Model    Strength                                                          Limitation
CLIP     Strong textual alignment; robust to phrasing variations          Lower visual sensitivity
FLAVA    High visual discrimination; better at image-to-image similarity  Sensitive to textual rewording

For example:

  • CLIP showed higher intra-class similarity among paraphrased technical descriptions.
  • FLAVA produced more discriminative visual embeddings, suppressing spurious similarities.

The hybrid approach leveraged both strengths, significantly improving classification accuracy.


Performance Evaluation: Top-K Accuracy and Classification Results

The models were evaluated using top-K accuracy across individual and cumulative expert assessments.

Table: Top-5 Accuracy – Individual vs. Cumulative Assessments

Assessment            FLAVA (Top-5)    CLIP (Top-5)    CLIP + Color (Top-5)
Dilution (EA1)        60%              60%             40% → 60%
Porosity (EA3)        40%              20%             20% → 50%
Distribution (EA5)    80%              80%             80% → 80%

Note: CLIP with color-aware prompts showed improved mid-rank retrieval.
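For readers who want to run a similar evaluation on their own data, here is a generic top-K accuracy sketch over a query-by-candidate similarity matrix; it is a common formulation and may differ from the paper's exact protocol.

# Generic top-K accuracy over a (queries x candidates) similarity matrix.
# This is a common formulation and may differ from the paper's exact evaluation protocol.
import numpy as np

def top_k_accuracy(similarity, correct_idx, k=5):
    """Fraction of queries whose correct candidate appears among the K most similar."""
    top_k = np.argsort(-similarity, axis=1)[:, :k]        # indices of the K highest-scoring candidates
    hits = [correct_idx[i] in top_k[i] for i in range(similarity.shape[0])]
    return float(np.mean(hits))

# Example with random scores: 10 queries, 40 candidates, correct candidate i for query i.
sim = np.random.rand(10, 40)
print(top_k_accuracy(sim, correct_idx=list(range(10)), k=5))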

Classification Outcomes

  • Distribution: 4 false positives, 3 false negatives
  • Dilution: 4 false positives (mostly negligible dilution)
  • Reinforcement Area: 7 false positives (due to other defects not in reference set)

Z-score normalization played a critical role in correcting borderline cases. For instance:

  • Sample 2: Negative raw CLIP delta (−0.0038) → Positive after normalization (0.1276) → Correct positive classification.
  • Sample 3: Positive FLAVA delta (0.0772) but negative CLIP z-score (−0.7706) → Final negative classification.

Why Hybrid VLRs Outperform Single Models

The hybrid framework offers several advantages:

  • Balanced Decision-Making: Combines textual robustness (CLIP) with visual sensitivity (FLAVA)
  • No Retraining Needed: Operates in zero-shot mode, ideal for industrial agility
  • Interpretable Results: Net similarity scores provide traceability
  • Scalable Customization: New criteria can be added via textual prompts

As the authors state:

“Z-score normalization plays a critical role in shifting borderline decisions and aligning the influence of each modality in the final prediction.”


Industrial Application: A Multi-Modal Qualification Knowledge Base

The paper proposes a vision-language detection tree (Figure 32) for automated, high-throughput qualification:

  1. Input: New micrograph uploaded
  2. Segmentation: DL model extracts features
  3. Similarity Scoring: Hybrid VLR computes Δ_Hybrid
  4. Decision: Accept/Reject based on threshold
  5. Output: Classification + aligned textual rationale
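A minimal sketch of this detection-tree flow is given below; segment, hybrid_delta, and rationale_for are hypothetical stubs standing in for the segmentation model, the hybrid scoring described above, and the prompt-alignment step.

# Hypothetical sketch of the proposed detection-tree flow. The three helpers below are
# placeholder stubs (not the paper's API); in practice they would wrap MicroSegQ+ segmentation,
# the hybrid CLIP + FLAVA scoring described above, and the prompt-alignment step.

def segment(micrograph_path):            # 2. stand-in for deep-learning segmentation
    return micrograph_path               #    would return a color-coded phase map

def hybrid_delta(features, refs):        # 3. stand-in for z-scored CLIP + FLAVA similarity
    return 0.0

def rationale_for(name, decision):       # 5. stand-in for the aligned textual rationale
    return f"{name}: {decision.lower()} based on reference prompt alignment"

def qualify(micrograph_path, assessments, threshold=0.0):
    """Run every expert assessment on one micrograph and return accept/reject decisions."""
    features = segment(micrograph_path)
    report = {}
    for name, refs in assessments.items():
        score = hybrid_delta(features, refs)
        decision = "Accept" if score >= threshold else "Reject"   # 4. threshold-based decision
        report[name] = {"score": score,
                        "decision": decision,
                        "rationale": rationale_for(name, decision)}
    return report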

Proposed UI Features (Figure 33):

  • Interactive Fusion Strategy Selection (z-score sum, weighted average)
  • Configurable Confidence Thresholding
  • Dual-Model Backend (CLIP + FLAVA)
  • Drag-and-Drop Image Upload
  • Contextualized Expert Feedback Output

This system enables human-in-the-loop decision-making without requiring AI expertise.


SEO-Optimized Takeaways for Industry Professionals

🔍 Primary Keywords

  • Vision-language representations
  • Industrial qualification
  • Additive manufacturing
  • Microstructure informatics
  • Expert knowledge integration

📌 Why This Matters for Your Business

  • Reduce inspection time from hours to seconds
  • Standardize quality control across facilities
  • Preserve expert knowledge in a reusable format
  • Enable rapid qualification of new materials and processes

🚀 Future-Ready Applications

  • Integration with digital twins and in-situ monitoring
  • Expansion to SEM, XCT, and EBSD data
  • Prompt engineering for dynamic criteria updates

Conclusion: A New Era of AI-Augmented Materials Engineering

This study demonstrates that customized and hybrid vision-language representations can effectively bridge the gap between raw microstructural data and expert knowledge. By leveraging CLIP for stable textual alignment and FLAVA for sensitive visual discrimination, the framework enables zero-shot, interpretable, and scalable qualification of advanced materials.

The z-score normalized hybrid scoring method ensures robust decisions, even near classification boundaries. Moreover, the modular design allows seamless integration into existing industrial workflows.

As additive manufacturing continues to grow, such AI-augmented systems will become essential for high-throughput, reliable, and traceable material qualification.


Call to Action: Embrace the Future of Industrial AI

Are you ready to transform your materials qualification process?

👉 Download the full paper to explore the methodology in depth.
👉 Contact research teams at McGill University or NRC Canada for collaboration opportunities.
👉 Experiment with CLIP and FLAVA models using open-source implementations on Hugging Face or GitHub.

Don’t let manual inspection slow you down.
Leverage vision-language AI to build smarter, faster, and more reliable manufacturing pipelines.

Below is a complete Python script that reproduces the core of the framework, zero-shot hybrid similarity scoring with pre-trained CLIP and FLAVA models, on a small mock dataset.

# -*- coding: utf-8 -*-
"""
This script implements the end-to-end model proposed in the paper:
"Linking heterogeneous microstructure informatics with expert characterization
knowledge through customized and hybrid vision-language representations for
industrial qualification" by Mutahar Safdar et al.

The code performs zero-shot classification of microstructural images based on
customized and hybrid similarity scores from pre-trained CLIP and FLAVA models.
"""

# Install necessary libraries if they are not already installed.
# !pip install transformers torch Pillow numpy scikit-learn

import torch
from transformers import CLIPProcessor, CLIPModel, FlavaProcessor, FlavaModel
from PIL import Image
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report
import os
import requests
from io import BytesIO

# --- 1. Model Loading ---
def load_models():
    """Loads the pre-trained CLIP and FLAVA models and processors."""
    print("Loading pre-trained models (CLIP and FLAVA)...")
    # Load CLIP model and processor from Hugging Face
    clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    # Load FLAVA model and processor from Hugging Face
    flava_model = FlavaModel.from_pretrained("facebook/flava-full")
    flava_processor = FlavaProcessor.from_pretrained("facebook/flava-full")
    print("Models loaded successfully.")
    return clip_model, clip_processor, flava_model, flava_processor

# --- 2. Feature Extraction ---
def get_image_embedding(image_path, model, processor):
    """Extracts feature embedding from a single image."""
    try:
        # Check if the image_path is a URL or a local file
        if image_path.startswith('http'):
            response = requests.get(image_path)
            image = Image.open(BytesIO(response.content)).convert("RGB")
        else:
            image = Image.open(image_path).convert("RGB")
    except (IOError, FileNotFoundError):
        print(f"Error: Image file not found at {image_path}")
        return None

    # Preprocess the image and get the pixel values
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        # Get the image features (embedding)
        if "clip" in model.config.model_type:
            embedding = model.get_image_features(pixel_values=inputs['pixel_values'])
        else: # FLAVA
            embedding = model.get_image_features(pixel_values=inputs['pixel_values'])
    # L2 normalize the embedding as per the paper's methodology
    embedding = embedding / embedding.norm(p=2, dim=-1, keepdim=True)
    return embedding

def get_text_embedding(text, model, processor):
    """Extracts feature embedding from a text prompt."""
    inputs = processor(text=text, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        # Get the text features (embedding)
        if "clip" in model.config.model_type:
            embedding = model.get_text_features(**inputs)
        else: # FLAVA - Note: FLAVA text embeddings are not directly comparable to image embeddings
            embedding = model.get_text_features(**inputs)
    # L2 normalize the embedding
    embedding = embedding / embedding.norm(p=2, dim=-1, keepdim=True)
    return embedding

def get_mean_embedding(paths, model, processor, input_type='image'):
    """Computes the mean embedding for a list of images or texts."""
    embeddings = []
    for path in paths:
        if input_type == 'image':
            emb = get_image_embedding(path, model, processor)
        else:
            emb = get_text_embedding(path, model, processor)
        if emb is not None:
            embeddings.append(emb)
    if not embeddings:
        return None
    # Stack embeddings and calculate the mean
    mean_emb = torch.mean(torch.stack(embeddings), dim=0)
    # Re-normalize the mean embedding
    mean_emb = mean_emb / mean_emb.norm(p=2, dim=-1, keepdim=True)
    return mean_emb

# --- 3. Similarity and Scoring ---
def cosine_similarity(emb1, emb2):
    """Computes cosine similarity between two embeddings."""
    # Ensure embeddings are on the same device and correctly shaped
    emb1 = emb1.cpu().numpy().flatten()
    emb2 = emb2.cpu().numpy().flatten()
    return np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))

def calculate_delta_scores(image_paths, clip_model, clip_processor, flava_model, flava_processor, pos_img_refs, neg_img_refs, pos_text_refs, neg_text_refs):
    """Calculates the raw delta scores for CLIP and FLAVA for all images."""
    print("\nCalculating delta scores for the dataset...")
    # Get mean reference embeddings
    mean_pos_img_flava = get_mean_embedding(pos_img_refs, flava_model, flava_processor, 'image')
    mean_neg_img_flava = get_mean_embedding(neg_img_refs, flava_model, flava_processor, 'image')
    mean_pos_text_clip = get_mean_embedding(pos_text_refs, clip_model, clip_processor, 'text')
    mean_neg_text_clip = get_mean_embedding(neg_text_refs, clip_model, clip_processor, 'text')

    delta_flava_scores = []
    delta_clip_scores = []

    for img_path in image_paths:
        # Get embeddings for the current input image
        img_emb_flava = get_image_embedding(img_path, flava_model, flava_processor)
        img_emb_clip = get_image_embedding(img_path, clip_model, clip_processor)

        if img_emb_flava is None or img_emb_clip is None:
            continue

        # Calculate FLAVA (vision-vision) similarity delta (Equation 8)
        sim_pos_flava = cosine_similarity(img_emb_flava, mean_pos_img_flava)
        sim_neg_flava = cosine_similarity(img_emb_flava, mean_neg_img_flava)
        delta_flava = sim_pos_flava - sim_neg_flava
        delta_flava_scores.append(delta_flava)

        # Calculate CLIP (vision-language) similarity delta (Equation 6)
        sim_pos_clip = cosine_similarity(img_emb_clip, mean_pos_text_clip)
        sim_neg_clip = cosine_similarity(img_emb_clip, mean_neg_text_clip)
        delta_clip = sim_pos_clip - sim_neg_clip
        delta_clip_scores.append(delta_clip)

    return np.array(delta_flava_scores), np.array(delta_clip_scores)

def z_score_normalize(scores):
    """Applies z-score normalization to a list of scores."""
    mean = np.mean(scores)
    std = np.std(scores)
    if std == 0:
        return np.zeros_like(scores)
    return (scores - mean) / std

# --- 4. Main Classification Logic ---
def classify_microstructures(image_paths, delta_flava_scores, delta_clip_scores):
    """Classifies images using the hybrid scoring method."""
    print("\nNormalizing scores and performing hybrid classification...")
    # Normalize the delta scores using z-score (Equations 9 & 10)
    z_flava = z_score_normalize(delta_flava_scores)
    z_clip = z_score_normalize(delta_clip_scores)

    # Calculate the final hybrid score (Equation 11)
    hybrid_scores = z_clip + z_flava

    # Classify based on the hybrid score
    # "Positive" if score >= 0, "Negative" if score < 0
    predictions = ["Positive" if score >= 0 else "Negative" for score in hybrid_scores]

    print("\n--- Classification Results ---")
    for i, img_path in enumerate(image_paths):
        print(f"Image: {os.path.basename(img_path):<15} | Δ_FLAVA: {delta_flava_scores[i]:.4f} (z={z_flava[i]:.4f}) | Δ_CLIP: {delta_clip_scores[i]:.4f} (z={z_clip[i]:.4f}) | Hybrid Score: {hybrid_scores[i]:.4f} | Prediction: {predictions[i]}")

    return predictions

# --- 5. Main Execution ---
if __name__ == "__main__":
    # --- Mock Dataset and Ground Truth ---
    # In a real scenario, this would be your dataset of image paths.
    # For demonstration, we'll use placeholder URLs.
    # Replace these with paths to your actual image files.
    # The paper uses 40 samples, but we'll use a smaller set for this example.
    
    # Let's create a dummy directory and some images for a local test
    if not os.path.exists("mock_images"):
        os.makedirs("mock_images")
    
    base_url = "https://placehold.co/600x400"
    image_files = {
        "sample1_good_dist.png": f"{base_url}/5cb85c/ffffff?text=Good+Dist", # Greenish
        "sample2_good_dist.png": f"{base_url}/6fbf6f/ffffff?text=Good+Dist", # Light Green
        "sample3_bad_dist.png": f"{base_url}/d9534f/ffffff?text=Bad+Dist", # Reddish
        "sample4_bad_dist.png": f"{base_url}/c9302c/ffffff?text=Bad+Dist", # Dark Red
        "sample5_good_dilution.png": f"{base_url}/5bc0de/ffffff?text=Good+Dilution", # Light Blue
        "sample6_bad_dilution.png": f"{base_url}/f0ad4e/ffffff?text=Bad+Dilution", # Orange
        "sample7_good_reinforce.png": f"{base_url}/292b2c/ffffff?text=Good+Reinforce", # Dark Grey
        "sample8_bad_reinforce.png": f"{base_url}/f7f7f7/000000?text=Bad+Reinforce" # Light Grey
    }

    for name, url in image_files.items():
        path = os.path.join("mock_images", name)
        if not os.path.exists(path):
            response = requests.get(url)
            with open(path, 'wb') as f:
                f.write(response.content)

    all_image_paths = [os.path.join("mock_images", f) for f in image_files.keys()]

    # Ground truth labels for each assessment (1 for Positive, 0 for Negative)
    # This is a mock ground truth for evaluation purposes.
    ground_truth = {
        "Distribution": [1, 1, 0, 0, 1, 0, 1, 0],
        "Dilution":     [1, 1, 1, 1, 1, 0, 1, 0],
        "Reinforcement":[1, 1, 1, 1, 1, 1, 1, 0]
    }
    
    # --- Reference Sets for Expert Assessments (based on the paper) ---
    # These should be paths to images and text prompts that define "good" vs "bad"
    expert_assessments = {
        "Distribution": {
            "pos_imgs": [os.path.join("mock_images", "sample1_good_dist.png")],
            "neg_imgs": [os.path.join("mock_images", "sample3_bad_dist.png")],
            "pos_texts": ["An ideal microstructural image has uniformly distributed pink carbide particles in the green metal matrix."],
            "neg_texts": ["A non-ideal microstructural image has uneven or clustered pink carbide particles in the green metal matrix."]
        },
        "Dilution": {
            "pos_imgs": [os.path.join("mock_images", "sample5_good_dilution.png")],
            "neg_imgs": [os.path.join("mock_images", "sample6_bad_dilution.png")],
            "pos_texts": ["An ideal microstructural image has the bead reinforcement area fully contained above the red fusion line, highlighting no dilution."],
            "neg_texts": ["A non-ideal microstructural image shows the bead reinforcement area intruding below the red fusion line, indicating dilution."]
        },
        "Reinforcement": {
            "pos_imgs": [os.path.join("mock_images", "sample7_good_reinforce.png")],
            "neg_imgs": [os.path.join("mock_images", "sample8_bad_reinforce.png")],
            "pos_texts": ["A green bead reinforcement area with pink carbide particles is present above the blue substrate."],
            "neg_texts": ["A non-ideal microstructural image shows an incomplete or poorly defined green bead reinforcement area."]
        }
    }
    
    # Load models
    clip_model, clip_processor, flava_model, flava_processor = load_models()

    # --- Run classification for each expert assessment ---
    for assessment, refs in expert_assessments.items():
        print(f"\n========================================================")
        print(f"PERFORMING ASSESSMENT FOR: {assessment.upper()}")
        print(f"========================================================")

        # 1. Calculate Delta Scores for all images
        delta_flava, delta_clip = calculate_delta_scores(
            all_image_paths,
            clip_model, clip_processor,
            flava_model, flava_processor,
            refs["pos_imgs"], refs["neg_imgs"],
            refs["pos_texts"], refs["neg_texts"]
        )

        # 2. Classify based on the scores
        predictions = classify_microstructures(all_image_paths, delta_flava, delta_clip)

        # 3. Evaluate the results
        true_labels_int = ground_truth[assessment]
        pred_labels_int = [1 if p == "Positive" else 0 for p in predictions]
        true_labels_str = ["Positive" if l == 1 else "Negative" for l in true_labels_int]

        print(f"\n--- Evaluation for {assessment} ---")
        try:
            print("Confusion Matrix:")
            print(confusion_matrix(true_labels_int, pred_labels_int))
            print("\nClassification Report:")
            print(classification_report(true_labels_str, predictions, zero_division=0))
        except Exception as e:
            print(f"Could not generate evaluation report. Error: {e}")
