windpacer/ExperionCrawler

Fork 0

Files

windpacer 15c17522c8 MCP-서버 리팩토링 후 P&ID 추출 테스트전 다른 기능 확인 후 커밋

2026-05-04 10:35:13 +09:00

29 KiB

Raw Blame History

P&ID 도면 파싱 병렬 LLM 아키텍처 개선안 v2 - 구현 가이드

1. 개요

이 문서는 P&ID_병렬LLM_아키텍처_개선안_v2.md의 개선안을 현재 프로젝트(ExperionCrawler)에 맞춰 실제 구현 가능한 코드로 정리한 문서입니다.

1.1 현재 프로젝트 구조

계층	파일	역할
C# Application Layer	`src/Core/Application/Services/PidGraphService.cs`	MCP 서버 호출 및 결과 처리
C# Controller	`src/Web/Controllers/PidGraphController.cs`	HTTP API 엔드포인트
MCP Server (Python)	`mcp-server/server.py`	FastMCP 기반 서버
MCP Pipeline	`mcp-server/pipeline/extractor.py`	Phase 1: 기하학적 추출
MCP Pipeline	`mcp-server/pipeline/topology.py`	Phase 2: 위상 빌더
MCP Pipeline	`mcp-server/pipeline/mapper.py`	Phase 3: 지능형 매핑
MCP Pipeline	`mcp-server/pipeline/analyzer.py`	Phase 4: 영향도 분석

1.2 기존 문제점 (P&ID_병목현상_분석_보고서.md)

단계	문제점	심각도	현재 상태
Phase 1	ezdxf로 28,000개 엔티티 처리	0.58초 (양호)	✅ 해결됨
Phase 2	O(n²) 노드 병합	timeout (심각)	⚠️ R-tree 도입 필요
Phase 3	순차적 LLM API 호출	예측 불가능한 지연	⚠️ 병렬 실행 필요

2. 개선안 요약

2.1 Phase 1: 기하학적 추출 (ezdxf) - 변경 없음

# mcp-server/pipeline/extractor.py (현재 그대로 사용)
class PidGeometricExtractor:
    def __init__(self, file_path: str):
        self.doc = ezdxf.readfile(file_path)
        self.msp = self.doc.modelspace()
    
    def extract_and_save(self, output_path: str):
        results = []
        for entity in self.msp:
            bbox_obj = self.get_bbox(entity)
            # ... 추출 로직
        return results

특징:

LINE (20,868개, 72.4%): 기하학적 추출만으로 충분 (LLM 불필요)
TEXT (3,562개, 12.4%): LLM 매핑 필요
기타 (3,589개): LLM 매핑 필요

2.2 Phase 2: 위상 빌더 (공간 인덱스 도입) - 개선 필요

문제: _merge_nodes() 메서드의 O(n²) 복잡도로 timeout 발생

해결책: R-tree 공간 인덱스 도입 → O(n log n)으로 개선

# mcp-server/pipeline/topology.py (개선안)
from rtree import index  # 추가

class PidTopologyBuilder:
    def __init__(self, geometric_data: List[Dict[str, Any]], ...):
        self.data = geometric_data
        self.G = nx.DiGraph()
    
    def build_graph(self):
        # 1. 공간 인덱스 생성 (R-tree)
        self._build_spatial_index()
        
        # 2. 노드 병합 (O(n log n))
        self._merge_nodes_spatial()
        
        # 3. 태그-설비 연결
        self._link_tags_to_equipment()
        
        # 4. 배관 연결
        self._link_pipes()
    
    def _build_spatial_index(self):
        """R-tree 공간 인덱스 생성"""
        p = index.Property()
        self.idx = index.Index(properties=p)
        for i, item in enumerate(self.data):
            bbox = item['bbox']
            self.idx.insert(i, (
                bbox['min_x'], bbox['min_y'],
                bbox['max_x'], bbox['max_y']
            ))
    
    def _merge_nodes_spatial(self):
        """공간 인덱스를 사용한 병합 (O(n log n))"""
        merge_threshold = 2.0
        merged = []
        visited = set()
        
        for i, item in enumerate(self.data):
            if i in visited:
                continue
            
            bbox = item['bbox']
            # 인접 노드만 검색 (R-tree 사용)
            neighbors = list(self.idx.intersection((
                bbox['min_x'] - merge_threshold,
                bbox['min_y'] - merge_threshold,
                bbox['max_x'] + merge_threshold,
                bbox['max_y'] + merge_threshold
            )))
            
            # ... 병합 로직

2.3 Phase 3: 병렬 LLM 매핑 (test_dxf_extract_pid*.py 구조 도입)

문제: 현재 pid_intelligent_mapper.py는 비동기로 순차적 LLM 호출

해결책: test_dxf_extract_pid*.py의 병렬 처리 구조 도입

# test_dxf_extract_pid1.py, pid2.py, pid3.py의 공통 구조
chunks = [
    {
        'name': 'Field Instruments - Sensors',
        'system': 'Extract sensor tags only...',
        'user': 'Extract ALL tags of FT, FIT, LT, PT...'
    },
    {
        'name': 'Field Instruments - Valves',
        'system': 'Extract valve tags only...',
        'user': 'Extract ALL tags of FCV, TCV, LCV...'
    },
    {
        'name': 'System Tags',
        'system': 'Extract system tags only...',
        'user': 'Extract ALL tags of LI, PI, TI...'
    }
]

핵심 발견:

청크 단위 분할: 태그 유형별로 프롬프트를 분리
독립된 프로세스 실행: 각 청크를 별도의 Python 프로그램으로 실행
vLLM GPU 자원 최대화: 각 프로세스가 별도의 GPU에 할당됨

3. 구현 계획

3.1 Phase 2: R-tree 공간 인덱스 도입

3.1.1 의존성 추가

# mcp-server/pyproject.toml
[project.dependencies]
rtree = "^1.3.0"

3.1.2 topology.py 개선

# mcp-server/pipeline/topology.py (개선안)
import networkx as nx
from shapely.geometry import box, Point, LineString
from rtree import index  # 추가
import json
from typing import List, Dict, Any, Optional, Tuple

class PidTopologyBuilder:
    def __init__(self, geometric_data: List[Dict[str, Any]], ...):
        self.data = geometric_data
        self.all_tags = all_extracted_tags if all_extracted_tags else []
        self.config = config if config else {'dist_threshold': 50.0, 'tag_threshold': 100.0, 'merge_threshold': 2.0}
        self.G = nx.DiGraph()
    
    def build_graph(self):
        # 1. 공간 인덱스 생성 (R-tree)
        self._build_spatial_index()
        
        # 2. 노드 병합 (O(n log n))
        self.merged_data = self._merge_nodes_spatial()
        
        # 3. 노드 추가
        for item in self.merged_data:
            bbox_vals = item['bbox']
            bbox_geom = box(bbox_vals['min_x'], bbox_vals['min_y'], bbox_vals['max_x'], bbox_vals['max_y'])
            self.G.add_node(item['entity_id'],
                           type=item['entity_type'],
                           bbox=bbox_geom,
                           value=item.get('clean_value'),
                           layer=item.get('layer'))
        
        # 4. 태그-설비 연결
        self._link_tags_to_equipment()
        
        # 5. 배관 연결
        self._link_pipes()
    
    def _build_spatial_index(self):
        """R-tree 공간 인덱스 생성"""
        p = index.Property()
        self.idx = index.Index(properties=p)
        for i, item in enumerate(self.data):
            bbox = item['bbox']
            self.idx.insert(i, (
                bbox['min_x'], bbox['min_y'],
                bbox['max_x'], bbox['max_y']
            ))
    
    def _merge_nodes_spatial(self):
        """공간 인덱스를 사용한 병합 (O(n log n))"""
        merge_threshold = self.config.get('merge_threshold', 2.0)
        merged = []
        visited = set()
        
        for i, item in enumerate(self.data):
            if i in visited:
                continue
            
            bbox = item['bbox']
            # 인접 노드만 검색 (R-tree 사용)
            neighbors = list(self.idx.intersection((
                bbox['min_x'] - merge_threshold,
                bbox['min_y'] - merge_threshold,
                bbox['max_x'] + merge_threshold,
                bbox['max_y'] + merge_threshold
            )))
            
            # 병합 로직 (가장 큰 엔티티 기준)
            merged_item = item.copy()
            for j in neighbors:
                if j != i and j not in visited:
                    neighbor_item = self.data[j]
                    # BBox 병합
                    merged_item['bbox']['min_x'] = min(merged_item['bbox']['min_x'], neighbor_item['bbox']['min_x'])
                    merged_item['bbox']['min_y'] = min(merged_item['bbox']['min_y'], neighbor_item['bbox']['min_y'])
                    merged_item['bbox']['max_x'] = max(merged_item['bbox']['max_x'], neighbor_item['bbox']['max_x'])
                    merged_item['bbox']['max_y'] = max(merged_item['bbox']['max_y'], neighbor_item['bbox']['max_y'])
                    visited.add(j)
            
            merged.append(merged_item)
            visited.add(i)
        
        return merged
    
    def _link_tags_to_equipment(self):
        """태그-설비 논리적 연결 (Association)"""
        tags = [n for n, d in self.G.nodes(data=True) if d['type'] == 'TEXT']
        equipments = [n for n, d in self.G.nodes(data=True) if d['type'] not in ['TEXT', 'LINE', 'LWPOLYLINE']]
        
        for tag in tags:
            best_match = self._find_nearest_equipment(tag, equipments)
            if best_match:
                self.G.add_edge(tag, best_match, relation='associated_with')
    
    def _find_nearest_equipment(self, tag_id, equipment_ids):
        tag_bbox = self.G.nodes[tag_id]['bbox']
        min_dist = float('inf')
        nearest = None
        for eq_id in equipment_ids:
            eq_bbox = self.G.nodes[eq_id]['bbox']
            dist = tag_bbox.distance(eq_bbox)
            if dist < min_dist:
                min_dist = dist
                nearest = eq_id
        return nearest if min_dist < self.config['tag_threshold'] else None
    
    def _link_pipes(self):
        """배관 기반 물리적 연결 (Pipe)"""
        lines = [n for n, d in self.G.nodes(data=True) if d['type'] in ['LINE', 'LWPOLYLINE']]
        equipments = [n for n, d in self.G.nodes(data=True) if d['type'] not in ['TEXT', 'LINE', 'LWPOLYLINE']]
        
        for line_id in lines:
            original_item = next((item for item in self.merged_data if item['entity_id'] == line_id), None)
            if not original_item or not original_item.get('coordinates'):
                continue
            
            coords = original_item['coordinates']
            line_geom = LineString(coords)
            endpoints = [line_geom.coords[0], line_geom.coords[-1]]
            
            connected_nodes = []
            for pt in endpoints:
                p = Point(pt)
                for eq_id in equipments:
                    if self.G.nodes[eq_id]['bbox'].distance(p) < self.config['dist_threshold']:
                        connected_nodes.append(eq_id)
            
            connected_nodes = list(set(connected_nodes))
            
            if len(connected_nodes) >= 2:
                self.G.add_edge(connected_nodes[0], connected_nodes[1], relation='pipe')
    
    def validate_topology(self):
        """위상 무결성 검증"""
        isolated = list(nx.isolates(self.G))
        return {
            "isolated_nodes": isolated, 
            "node_count": self.G.number_of_nodes(), 
            "edge_count": self.G.number_of_edges()
        }
    
    def save_graph(self, output_path: str):
        """그래프 구조를 JSON 형태로 저장"""
        from networkx.readwrite import json_graph
        data = json_graph.node_link_data(self.G)
        with open(output_path, 'w', encoding='utf-8') as f:
            json.dump(data, f, indent=2, ensure_ascii=False)

3.2 Phase 3: 병렬 LLM 매핑

3.2.1 pid_extractor_line.py (LINE 배관 추출 - GPU 불필요)

# mcp-server/pipeline/pid_extractor_line.py (신규 생성)
#!/usr/bin/env python3
"""
P&ID 도면에서 LINE (배관) 추출 (CPU 전용, GPU 불필요)
"""

import ezdxf
import json
import sys

def main():
    if len(sys.argv) != 3:
        print("Usage: python pid_extractor_line.py <dxf_file> <output_dir>")
        sys.exit(1)
    
    dxf_file = sys.argv[1]
    output_dir = sys.argv[2]
    
    # DXF 파일 읽기
    doc = ezdxf.readfile(dxf_file)
    msp = doc.modelspace()
    
    # LINE 추출
    lines = []
    for entity in msp:
        if entity.dxftype() == 'LINE':
            start = entity.dxf.start
            end = entity.dxf.end
            line_data = {
                'entity_id': entity.dxf.handle,
                'entity_type': 'LINE',
                'start': (start.x, start.y),
                'end': (end.x, end.y),
                'length': ((end.x - start.x)**2 + (end.y - start.y)**2)**0.5
            }
            lines.append(line_data)
    
    # 결과 저장
    output_file = f'{output_dir}/line_data.json'
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(lines, f, indent=2, ensure_ascii=False)
    
    print(f'LINE 추출 완료: {len(lines)}개')

if __name__ == '__main__':
    main()

3.2.2 pid_extractor_text.py (TEXT 태그 추출 - GPU 0)

# mcp-server/pipeline/pid_extractor_text.py (신규 생성)
#!/usr/bin/env python3
"""
P&ID 도면에서 TEXT 태그 추출 (GPU 0 전용)
"""

import ezdxf
import json
import sys
from ezdxf.tools.text import plain_mtext
from openai import OpenAI

def main():
    if len(sys.argv) != 3:
        print("Usage: python pid_extractor_text.py <dxf_file> <output_dir>")
        sys.exit(1)
    
    dxf_file = sys.argv[1]
    output_dir = sys.argv[2]
    
    # DXF 파일 읽기
    doc = ezdxf.readfile(dxf_file)
    msp = doc.modelspace()
    
    # 텍스트 추출
    texts = []
    for entity in msp:
        if entity.dxftype() == 'TEXT':
            texts.append(entity.dxf.text)
        elif entity.dxftype() == 'MTEXT':
            try:
                plain = plain_mtext(entity.dxf.text)
                if plain.strip():
                    texts.append(plain)
            except Exception:
                pass
    
    text = '\n'.join(texts)
    
    # OpenAI 클라이언트 생성
    llm = OpenAI(
        base_url='http://localhost:8000/v1',
        api_key='dummy',
        timeout=1800
    )
    
    # 프롬프트
    system = (
        'You are a P&ID expert. Extract all tag names from TEXT entities.\n'
        'Return ONLY a JSON array.\n'
        '\n'
        'Format: [{"tagNo":"FICQ-10101","confidence":0.95},...]\n'
    )
    
    user = f'Extract ALL tag names from the text below:\n\n{text[:100000]}'
    
    # LLM 호출
    resp = llm.chat.completions.create(
        model='Qwen/Qwen3-Coder-Next-FP8',
        messages=[
            {'role': 'system', 'content': system},
            {'role': 'user', 'content': user},
        ],
        max_tokens=65536,
        temperature=0.1,
        extra_body={'chat_template_kwargs': {'enable_thinking': False}},
    )
    
    # 결과 저장
    raw = (resp.choices[0].message.content or '').strip()
    try:
        all_tags = json.loads(raw)
    except json.JSONDecodeError:
        all_tags = []
    
    output_file = f'{output_dir}/text_tags.json'
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(all_tags, f, indent=2, ensure_ascii=False)
    
    print(f'TEXT 태그 추출 완료: {len(all_tags)}개')

if __name__ == '__main__':
    main()

3.2.3 pid_extractor_valve.py (Valve 태그 추출 - GPU 1)

# mcp-server/pipeline/pid_extractor_valve.py (신규 생성)
#!/usr/bin/env python3
"""
P&ID 도면에서 Valve 태그 추출 (GPU 1 전용)
"""

import ezdxf
import json
import sys
from ezdxf.tools.text import plain_mtext
from openai import OpenAI

def main():
    if len(sys.argv) != 3:
        print("Usage: python pid_extractor_valve.py <dxf_file> <output_dir>")
        sys.exit(1)
    
    dxf_file = sys.argv[1]
    output_dir = sys.argv[2]
    
    # DXF 파일 읽기
    doc = ezdxf.readfile(dxf_file)
    msp = doc.modelspace()
    
    # 텍스트 추출
    texts = []
    for entity in msp:
        if entity.dxftype() == 'TEXT':
            texts.append(entity.dxf.text)
        elif entity.dxftype() == 'MTEXT':
            try:
                plain = plain_mtext(entity.dxf.text)
                if plain.strip():
                    texts.append(plain)
            except Exception:
                pass
    
    text = '\n'.join(texts)
    
    # OpenAI 클라이언트 생성
    llm = OpenAI(
        base_url='http://localhost:8000/v1',
        api_key='dummy',
        timeout=1800
    )
    
    # 프롬프트
    system = (
        'You are a P&ID expert. Extract valve tags only.\n'
        'Return ONLY a JSON array.\n'
        '\n'
        'Instrument types to extract: FCV, TCV, LCV, PCV, XV, FV, LV, PV, TV\n'
        'Format: [{"tagNo":"FCV-10101","confidence":0.95},...]\n'
    )
    
    user = f'Extract ALL tags of FCV, TCV, LCV, PCV, XV, FV, LV, PV, TV from the text below:\n\n{text[:100000]}'
    
    # LLM 호출
    resp = llm.chat.completions.create(
        model='Qwen/Qwen3-Coder-Next-FP8',
        messages=[
            {'role': 'system', 'content': system},
            {'role': 'user', 'content': user},
        ],
        max_tokens=65536,
        temperature=0.1,
        extra_body={'chat_template_kwargs': {'enable_thinking': False}},
    )
    
    # 결과 저장
    raw = (resp.choices[0].message.content or '').strip()
    try:
        all_tags = json.loads(raw)
    except json.JSONDecodeError:
        all_tags = []
    
    output_file = f'{output_dir}/valve_tags.json'
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(all_tags, f, indent=2, ensure_ascii=False)
    
    print(f'Valve 태그 추출 완료: {len(all_tags)}개')

if __name__ == '__main__':
    main()

3.2.4 pid_extractor_equipment.py (Equipment 태그 추출 - GPU 2)

# mcp-server/pipeline/pid_extractor_equipment.py (신규 생성)
#!/usr/bin/env python3
"""
P&ID 도면에서 Equipment 태그 추출 (GPU 2 전용)
"""

import ezdxf
import json
import sys
from ezdxf.tools.text import plain_mtext
from openai import OpenAI

def main():
    if len(sys.argv) != 3:
        print("Usage: python pid_extractor_equipment.py <dxf_file> <output_dir>")
        sys.exit(1)
    
    dxf_file = sys.argv[1]
    output_dir = sys.argv[2]
    
    # DXF 파일 읽기
    doc = ezdxf.readfile(dxf_file)
    msp = doc.modelspace()
    
    # 텍스트 추출
    texts = []
    for entity in msp:
        if entity.dxftype() == 'TEXT':
            texts.append(entity.dxf.text)
        elif entity.dxftype() == 'MTEXT':
            try:
                plain = plain_mtext(entity.dxf.text)
                if plain.strip():
                    texts.append(plain)
            except Exception:
                pass
    
    text = '\n'.join(texts)
    
    # OpenAI 클라이언트 생성
    llm = OpenAI(
        base_url='http://localhost:8000/v1',
        api_key='dummy',
        timeout=1800
    )
    
    # 프롬프트
    system = (
        'You are a P&ID expert. Extract equipment tags only.\n'
        'Return ONLY a JSON array.\n'
        '\n'
        'Equipment types to extract: Pump, Tank, Heat Exchanger, Vessel, Column\n'
        'Format: [{"tagNo":"P-101","confidence":0.95},...]\n'
    )
    
    user = f'Extract ALL tags of Pump, Tank, Heat Exchanger, Vessel, Column from the text below:\n\n{text[:100000]}'
    
    # LLM 호출
    resp = llm.chat.completions.create(
        model='Qwen/Qwen3-Coder-Next-FP8',
        messages=[
            {'role': 'system', 'content': system},
            {'role': 'user', 'content': user},
        ],
        max_tokens=65536,
        temperature=0.1,
        extra_body={'chat_template_kwargs': {'enable_thinking': False}},
    )
    
    # 결과 저장
    raw = (resp.choices[0].message.content or '').strip()
    try:
        all_tags = json.loads(raw)
    except json.JSONDecodeError:
        all_tags = []
    
    output_file = f'{output_dir}/equipment_tags.json'
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(all_tags, f, indent=2, ensure_ascii=False)
    
    print(f'Equipment 태그 추출 완료: {len(all_tags)}개')

if __name__ == '__main__':
    main()

3.2.5 pid_extractor_system.py (System 태그 추출 - GPU 3)

# mcp-server/pipeline/pid_extractor_system.py (신규 생성)
#!/usr/bin/env python3
"""
P&ID 도면에서 System 태그 추출 (GPU 3 전용)
"""

import ezdxf
import json
import sys
from ezdxf.tools.text import plain_mtext
from openai import OpenAI

def main():
    if len(sys.argv) != 3:
        print("Usage: python pid_extractor_system.py <dxf_file> <output_dir>")
        sys.exit(1)
    
    dxf_file = sys.argv[1]
    output_dir = sys.argv[2]
    
    # DXF 파일 읽기
    doc = ezdxf.readfile(dxf_file)
    msp = doc.modelspace()
    
    # 텍스트 추출
    texts = []
    for entity in msp:
        if entity.dxftype() == 'TEXT':
            texts.append(entity.dxf.text)
        elif entity.dxftype() == 'MTEXT':
            try:
                plain = plain_mtext(entity.dxf.text)
                if plain.strip():
                    texts.append(plain)
            except Exception:
                pass
    
    text = '\n'.join(texts)
    
    # OpenAI 클라이언트 생성
    llm = OpenAI(
        base_url='http://localhost:8000/v1',
        api_key='dummy',
        timeout=1800
    )
    
    # 프롬프트
    system = (
        'You are a P&ID expert. Extract system tags only.\n'
        'Return ONLY a JSON array.\n'
        '\n'
        'System types to extract: FICQ, TICA, PICA, LICA, FIC, TIC, PIC, LIC\n'
        'Format: [{"tagNo":"FICQ-10101","confidence":0.95},...]\n'
    )
    
    user = f'Extract ALL tags of FICQ, TICA, PICA, LICA, FIC, TIC, PIC, LIC from the text below:\n\n{text[:100000]}'
    
    # LLM 호출
    resp = llm.chat.completions.create(
        model='Qwen/Qwen3-Coder-Next-FP8',
        messages=[
            {'role': 'system', 'content': system},
            {'role': 'user', 'content': user},
        ],
        max_tokens=65536,
        temperature=0.1,
        extra_body={'chat_template_kwargs': {'enable_thinking': False}},
    )
    
    # 결과 저장
    raw = (resp.choices[0].message.content or '').strip()
    try:
        all_tags = json.loads(raw)
    except json.JSONDecodeError:
        all_tags = []
    
    output_file = f'{output_dir}/system_tags.json'
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(all_tags, f, indent=2, ensure_ascii=False)
    
    print(f'System 태그 추출 완료: {len(all_tags)}개')

if __name__ == '__main__':
    main()

3.2.6 merge_results.py (결과 병합)

# mcp-server/pipeline/merge_results.py (신규 생성)
#!/usr/bin/env python3
"""
병렬 추출 결과 병합 스크립트
"""

import json
import sys
import glob

def main():
    if len(sys.argv) != 3:
        print("Usage: python merge_results.py <input_dir> <output_file>")
        sys.exit(1)
    
    input_dir = sys.argv[1]
    output_file = sys.argv[2]
    
    # 모든 JSON 파일 읽기
    all_tags = []
    seen_tags = set()
    
    for filepath in glob.glob(f'{input_dir}/*_tags.json'):
        with open(filepath, 'r', encoding='utf-8') as f:
            tags = json.load(f)
            for tag in tags:
                tag_no = tag.get('tagNo')
                if tag_no and tag_no not in seen_tags:
                    seen_tags.add(tag_no)
                    all_tags.append(tag)
    
    # 결과 저장
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(all_tags, f, indent=2, ensure_ascii=False)
    
    print(f'총 추출 태그 수: {len(all_tags)}개')

if __name__ == '__main__':
    main()

3.3 run_parallel_extract.sh (병렬 실행 스크립트)

# mcp-server/run_parallel_extract.sh (신규 생성)
#!/bin/bash
# P&ID 태그 추출 병렬 실행 스크립트

DXF_FILE="/path/to/pid.dxf"
OUTPUT_DIR="/path/to/output"

# Phase 1: 기하학적 추출 (ezdxf)
echo "Phase 1: 기하학적 추출 시작..."
python -m pipeline.extractor "$DXF_FILE" "$OUTPUT_DIR/geometric_data.json"
echo "Phase 1 완료: geometric_data.json 저장"

# Phase 2: 위상 빌더 (공간 인덱스)
echo "Phase 2: 위상 빌더 시작..."
python -m pipeline.topology "$OUTPUT_DIR/geometric_data.json" "$OUTPUT_DIR/topology.json"
echo "Phase 2 완료: topology.json 저장"

# Phase 3: 병렬 LLM 매핑 (4개 프로그램 동시에 실행)
echo "Phase 3: 병렬 LLM 매핑 시작..."

# GPU 0: TEXT 태그 추출
python -m pipeline.pid_extractor_text "$DXF_FILE" "$OUTPUT_DIR" &
# GPU 1: VALVE 태그 추출
python -m pipeline.pid_extractor_valve "$DXF_FILE" "$OUTPUT_DIR" &
# GPU 2: EQUIPMENT 태그 추출
python -m pipeline.pid_extractor_equipment "$DXF_FILE" "$OUTPUT_DIR" &
# GPU 3: SYSTEM 태그 추출
python -m pipeline.pid_extractor_system "$DXF_FILE" "$OUTPUT_DIR" &

# 모든 프로세스가 완료될 때까지 대기
wait

# 결과 병합
echo "결과 병합 시작..."
python -m pipeline.merge_results "$OUTPUT_DIR" "$OUTPUT_DIR/merged_tags.json"
echo "Phase 3 완료: merged_tags.json 저장"

echo "전체 파이프라인 완료!"

4. 성능 예측

4.1 Phase 1: 기하학적 추출

현재: 1.4초
개선 후: 1.4초 (변화 없음)

4.2 Phase 2: 위상 빌더

현재: timeout (O(n²))
개선 후: 2-3초 (R-tree O(n log n))

4.3 Phase 3: 병렬 LLM 매핑

현재: 예측 불가 (순차적 API 호출)
개선 후: 5-10초 (4개 프로그램 병렬 실행)

예상 속도 향상:

Phase 2: 100배 이상 (timeout → 2-3초)
Phase 3: 3-5배 (순차적 → 병렬)

5. 구현 우선순위

순위	작업	예상 시간	영향도
1	R-tree 공간 인덱스 도입	1일	HIGH
2	pid_extractor_line.py 구현 (LINE 추출)	0.5일	HIGH
3	pid_extractor_text.py 구현 (TEXT 추출)	0.5일	HIGH
4	pid_extractor_valve.py 구현	0.5일	HIGH
5	pid_extractor_equipment.py 구현	0.5일	HIGH
6	pid_extractor_system.py 구현	0.5일	HIGH
7	merge_results.py 구현	0.5일	LOW
8	run_parallel_extract.sh 구현	0.5일	LOW
9	테스트 및 벤치마크	0.5일	LOW

총 예상 시간: 4.5일

6. GPU 자원 활용 전략

6.1 vLLM의 GPU 할당 방식

단일 프로세스 (현재 구조):

Python 프로세스 A
├─ LLM Request 1 → GPU 0 (100% 사용)
├─ LLM Request 2 → GPU 0 (대기)
└─ LLM Request 3 → GPU 0 (대기)

→ GPU 1, 2, 3은 놀고 있음

다중 프로세스 (개선안):

Python 프로세스 A → GPU 0 (100% 사용)
Python 프로세스 B → GPU 1 (100% 사용)
Python 프로세스 C → GPU 2 (100% 사용)
Python 프로세스 D → GPU 3 (100% 사용)

→ 모든 GPU 카드를 100% 활용

6.2 병렬 실행 명령어

# 4개 프로그램을 동시에 실행
python -m pipeline.pid_extractor_text /path/to/pid.dxf /path/to/output &
python -m pipeline.pid_extractor_valve /path/to/pid.dxf /path/to/output &
python -m pipeline.pid_extractor_equipment /path/to/pid.dxf /path/to/output &
python -m pipeline.pid_extractor_system /path/to/pid.dxf /path/to/output &

# 모든 프로세스가 완료될 때까지 대기
wait

# 결과 병합
python -m pipeline.merge_results /path/to/output /path/to/output/merged_tags.json

7. 현재 프로젝트와의 차이점

항목	기존 구조	개선안
Phase 2	O(n²) 병합 → timeout	R-tree O(n log n) → 2-3초
Phase 3	비동기 순차적 LLM 호출	4개 프로그램 병렬 실행
LINE 처리	LLM 매핑 대상	CPU 전용 기하학적 추출
GPU 활용	단일 GPU만 사용	4개 GPU 모두 사용
파일 구조	`pid_intelligent_mapper.py`	`pid_extractor_*.py` (4개)

8. 구현 체크리스트

R-tree 패키지 추가 (pyproject.toml)
topology.py에 _build_spatial_index() 및 _merge_nodes_spatial() 구현
pid_extractor_line.py 생성 (LINE 추출)
pid_extractor_text.py 생성 (TEXT 추출)
pid_extractor_valve.py 생성 (Valve 추출)
pid_extractor_equipment.py 생성 (Equipment 추출)
pid_extractor_system.py 생성 (System 추출)
merge_results.py 생성 (결과 병합)
run_parallel_extract.sh 생성 (병렬 실행 스크립트)
전체 파이프라인 테스트 및 벤치마크

9. 참고 자료

P&ID_병렬LLM_아키텍처_개선안_v2.md - 원본 개선안
P&ID_병목현상_분석_보고서.md - 병목 분석 보고서
P&ID_병렬LLM_아키텍처_개선안.md - 이전 개선안
futurePlan/End-to-End P&ID Graph Pipeline/ - 기존 파이프라인 코드

29 KiB Raw Blame History