🛠️ Graph Pipeline Phase 1: Geometric Data Extraction

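This document lays out the detailed implementation plan for geometric data extraction, the first stage of the P&ID Graph Pipeline. The goal is to go beyond plain text extraction and preserve the **physical position (coordinates)** and geometric attributes of every object in the drawing, so that the subsequent topology modeling stage can build on them.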

📦 1. Required Packages and Environment Setup

1.1 Python Packages

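Package	Purpose	Notes
`ezdxf`	DXF file parsing and entity extraction	Core library
`shapely`	Geometric operations (intersection, distance, bounding box)	Required for coordinate-based analysis
`numpy`	Bulk coordinate computation and matrix operations	Performance optimization
`pandas`	Structuring extracted object data and saving it as CSV/JSON	Data management
`pydantic`	Schema definition and validation for extracted data	Ensures data integrity
`pytesseract` / `pdf2image`	Region-based OCR extraction from PDF drawings	Needed only for PDF processing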

1.2 Installation Command

pip install ezdxf shapely numpy pandas pydantic pytesseract pdf2image
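
After installation, it can help to confirm that everything is importable before running the pipeline. The sketch below uses only the standard library. The function names `check_environment` and `missing_required` are hypothetical helpers, not part of any listed package. It also looks for the `tesseract` and `pdftoppm` executables, because `pytesseract` and `pdf2image` wrap external tools (Tesseract OCR and Poppler) that `pip` does not install.

```python
import importlib.util
import shutil

# Packages from the table above; here the import name equals the pip name.
REQUIRED_MODULES = ["ezdxf", "shapely", "numpy", "pandas", "pydantic"]
# Only needed for the optional PDF/OCR path.
OPTIONAL_MODULES = ["pytesseract", "pdf2image"]
# External executables used by the OCR packages (installed separately from pip).
OPTIONAL_BINARIES = ["tesseract", "pdftoppm"]


def check_environment() -> dict:
    """Return {name: bool} telling whether each module or binary was found."""
    status = {}
    for name in REQUIRED_MODULES + OPTIONAL_MODULES:
        status[name] = importlib.util.find_spec(name) is not None
    for binary in OPTIONAL_BINARIES:
        status[binary] = shutil.which(binary) is not None
    return status


def missing_required(status: dict) -> list:
    """List the required modules that were not found."""
    return [name for name in REQUIRED_MODULES if not status.get(name, False)]


status = check_environment()
for name, found in status.items():
    print(f"{name:12s} {'ok' if found else 'MISSING'}")
```

A non-empty result from `missing_required(status)` means the DXF path cannot run. Missing optional entries only affect the PDF/OCR path.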

📐 2. Detailed Design

2.1 Data Model (Schema)

Every extracted object follows the `GeometricEntity` model, which defines the following common attributes.

from pydantic import BaseModel
from typing import List, Tuple

class BoundingBox(BaseModel):
    min_x: float
    min_y: float
    max_x: float
    max_y: float
    center: Tuple[float, float]

class GeometricEntity(BaseModel):
    entity_id: str
    entity_type: str  # TEXT, LINE, CIRCLE, POLYLINE, ARC
    layer: str
    bbox: BoundingBox
    properties: dict  # text value, color, line weight, etc.
    coordinates: List[Tuple[float, float]]  # start/end points or the list of vertices
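
The `BoundingBox` fields, including `center`, can all be derived from an entity's coordinate list. The following is a minimal, dependency-free sketch of that derivation. `bbox_from_points` is a hypothetical helper name, and it returns a plain dict whose keys mirror the `BoundingBox` fields.

```python
from typing import Dict, List, Tuple


def bbox_from_points(points: List[Tuple[float, float]]) -> Dict[str, object]:
    """Compute min/max extents and the center of a non-empty list of (x, y) points."""
    if not points:
        raise ValueError("at least one point is required")
    xs = [float(x) for x, _ in points]
    ys = [float(y) for _, y in points]
    min_x, max_x = min(xs), max(xs)
    min_y, max_y = min(ys), max(ys)
    return {
        "min_x": min_x,
        "min_y": min_y,
        "max_x": max_x,
        "max_y": max_y,
        # Center of the box, not the centroid of the points.
        "center": ((min_x + max_x) / 2.0, (min_y + max_y) / 2.0),
    }


# A horizontal line from (0, 10) to (40, 10) yields a zero-height box.
line_bbox = bbox_from_points([(0.0, 10.0), (40.0, 10.0)])
print(line_bbox)
```

Assuming the models above are in scope, the dict can be passed straight to the schema as `BoundingBox(**line_bbox)`.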

2.2 Processing Pipeline Flow

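DXF Load: Load the drawing with ezdxf.readfile().
Entity Iteration: Iterate over the entities on every layer and classify them by type.
Coordinate Extraction:
- TEXT: Compute the BBox from the insertion point and the text length.
- LINE: Extract the start and end points.
- POLYLINE: Extract the full list of vertices.
- CIRCLE/ARC: Extract the center and radius (plus the start and end angles for ARC).
Spatial Normalization: Normalize the drawing coordinate system to the analysis system's coordinate system.
Structured Export: Save as JSON or to a database (PostgreSQL/PostGIS).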
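
The plan does not fix the target coordinate system for the Spatial Normalization step, so the sketch below shows only one possible choice: shift the drawing so its lower-left extent sits at the origin, scale it uniformly so the longer side equals `target_size`, and optionally flip the Y axis (useful when aligning with image-based OCR output, where Y grows downward). The names `drawing_extents`, `normalize_points`, `target_size`, and `flip_y` are assumptions made for illustration.

```python
from typing import List, Tuple

Point2D = Tuple[float, float]
Extents = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def drawing_extents(points: List[Point2D]) -> Extents:
    """Overall extents of all points collected from the drawing."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def normalize_points(points: List[Point2D], extents: Extents,
                     target_size: float = 1.0, flip_y: bool = False) -> List[Point2D]:
    """Translate to the origin and scale uniformly, preserving the aspect ratio."""
    min_x, min_y, max_x, max_y = extents
    span = max(max_x - min_x, max_y - min_y)
    if span == 0:
        raise ValueError("extents are degenerate (zero width and height)")
    height = (max_y - min_y) / span * target_size
    normalized = []
    for x, y in points:
        nx = (x - min_x) / span * target_size
        ny = (y - min_y) / span * target_size
        if flip_y:
            ny = height - ny  # mirror so that Y grows downward
        normalized.append((nx, ny))
    return normalized


points = [(100.0, 200.0), (300.0, 200.0), (300.0, 300.0)]
extents = drawing_extents(points)
print(normalize_points(points, extents))
print(normalize_points(points, extents, flip_y=True))
```

A uniform scale is used so that distances and angles stay proportional, which matters for the distance thresholds used later in topology modeling.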

💻 3. Implementation Guide (Example)

3.1 Core DXF Geometric Extraction Code

import ezdxf
from shapely.geometry import box
from typing import List

class PidGeometricExtractor:
    def __init__(self, file_path: str):
        self.doc = ezdxf.readfile(file_path)
        self.msp = self.doc.modelspace()

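    def get_bbox(self, entity):
        """Compute the entity's bounding box as a shapely box (None if the type is unsupported)."""
        if entity.dxftype() == 'TEXT':
            # Simplified BBox from the insertion point, text height, and character count.
            # Rough estimate only: ignores rotation, alignment, and real font metrics.
            p = entity.dxf.insert
            height = entity.dxf.height
            width = len(entity.dxf.text) * height * 0.6  # 0.6 = assumed average glyph aspect ratio
            return box(p.x, p.y, p.x + width, p.y + height)
        elif entity.dxftype() == 'LINE':
            start = entity.dxf.start
            end = entity.dxf.end
            return box(min(start.x, end.x), min(start.y, end.y),
                       max(start.x, end.x), max(start.y, end.y))
        # ... other entity types to be implemented
        return None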

    def extract_all(self) -> List[dict]:
        results = []
        for entity in self.msp:
            bbox_obj = self.get_bbox(entity)
            if bbox_obj is not None:
                results.append({
                    "id": entity.dxf.handle,
                    "type": entity.dxftype(),
                    "layer": entity.dxf.layer,
                    "bbox": {
                        "min_x": bbox_obj.bounds[0],
                        "min_y": bbox_obj.bounds[1],
                        "max_x": bbox_obj.bounds[2],
                        "max_y": bbox_obj.bounds[3]
                    },
                    "value": entity.dxf.text if entity.dxftype() == 'TEXT' else None
                })
        return results

# Usage example
extractor = PidGeometricExtractor("plant_drawing.dxf")
geometric_data = extractor.extract_all()

3.2 Utility Functions: Proximity Check

These are core utilities that Phase 2 (topology modeling) will rely on.

from shapely.geometry import Point

def is_near(entity_a_bbox, entity_b_bbox, threshold=5.0):
    """Check whether the shortest distance between two objects' bounding boxes is within the threshold."""
    return entity_a_bbox.distance(entity_b_bbox) <= threshold

def is_inside(point, bbox):
    """Check whether a point lies strictly inside the bounding box (points on the boundary are excluded)."""
    return bbox.contains(Point(point))
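
For axis-aligned boxes, the distance that `is_near` compares against the threshold can be written out by hand, which is handy for unit tests that should not depend on shapely. The sketch below is such a dependency-free equivalent, followed by a hypothetical `nearest_within` helper that picks the closest candidate for a text label. Both names and the sample boxes are assumptions for illustration.

```python
import math
from typing import Dict, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def bbox_distance(a: BBox, b: BBox) -> float:
    """Shortest distance between two axis-aligned boxes (0.0 if they touch or overlap)."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)  # horizontal gap, if any
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)  # vertical gap, if any
    return math.hypot(dx, dy)


def nearest_within(target: BBox, candidates: Dict[str, BBox],
                   threshold: float = 5.0) -> Optional[Tuple[str, float]]:
    """Return (id, distance) of the closest candidate within the threshold, else None."""
    best = None
    for cid, cbox in candidates.items():
        d = bbox_distance(target, cbox)
        if d <= threshold and (best is None or d < best[1]):
            best = (cid, d)
    return best


tag_bbox = (0.0, 0.0, 10.0, 5.0)        # a text label
lines = {
    "L1": (13.0, 0.0, 50.0, 0.0),       # horizontal line 3 units to the right
    "L2": (0.0, 20.0, 50.0, 20.0),      # horizontal line 15 units above
}
print(nearest_within(tag_bbox, lines))  # → ('L1', 3.0)
```

This kind of nearest-neighbor lookup is one way the Phase 2 stage could attach tag text to the line or symbol it labels.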

🚀 4. Phase 1 Completion Criteria (Definition of Done)

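Are the coordinates of every TEXT, LINE, and POLYLINE in the DXF file extracted without omission?
Is an accurate Bounding Box computed and stored for each object?
Is the extracted data saved as JSON conforming to the GeometricEntity schema?
(Optional) For PDF drawings, are text coordinates extracted via OCR?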
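
Parts of this checklist can be automated. The sketch below is one possible acceptance check, assuming the exported records are plain dicts that use the `GeometricEntity` field names. `validate_record` and `summarize` are hypothetical helpers, and the sample records are made up for illustration. It verifies the shape of the output, not the accuracy of the coordinates themselves.

```python
import json
from collections import Counter
from typing import Dict, List

REQUIRED_FIELDS = ("entity_id", "entity_type", "layer", "bbox", "properties", "coordinates")
BBOX_FIELDS = ("min_x", "min_y", "max_x", "max_y")


def validate_record(record: dict) -> List[str]:
    """Return a list of problems found in one exported record (empty if none)."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    bbox = record.get("bbox")
    if isinstance(bbox, dict):
        missing = [k for k in BBOX_FIELDS if k not in bbox]
        errors += [f"missing bbox field: {k}" for k in missing]
        if not missing and (bbox["min_x"] > bbox["max_x"] or bbox["min_y"] > bbox["max_y"]):
            errors.append("bbox min exceeds max")
    elif "bbox" in record:
        errors.append("bbox is not a mapping")
    if "coordinates" in record and not record["coordinates"]:
        errors.append("coordinates are empty")
    return errors


def summarize(records: List[dict]) -> Dict[str, dict]:
    """Count records per entity type and collect the invalid ones by id."""
    counts = Counter(r.get("entity_type", "UNKNOWN") for r in records)
    invalid = {}
    for r in records:
        errs = validate_record(r)
        if errs:
            invalid[r.get("entity_id", "?")] = errs
    return {"counts": dict(counts), "invalid": invalid}


records = [
    {"entity_id": "2A", "entity_type": "LINE", "layer": "PIPE",
     "bbox": {"min_x": 0, "min_y": 0, "max_x": 40, "max_y": 0, "center": [20, 0]},
     "properties": {}, "coordinates": [[0, 0], [40, 0]]},
    {"entity_id": "2B", "entity_type": "TEXT", "layer": "TAG",
     "bbox": {"min_x": 5, "min_y": 2, "max_x": 1, "max_y": 4},
     "properties": {"text": "P-101"}, "coordinates": []},
]
report = summarize(records)
print(json.dumps(report, indent=2))
```

Comparing `report["counts"]` against the entity counts read directly from the drawing gives a quick check for omissions, and an empty `report["invalid"]` indicates every record has the expected structure.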
