DXF Extraction Logic Improvement Proposal

Written: 2026-05-19 | Model: Qwen3.6-27B-FP8

1. Current Pipeline Structure

[Frontend: app.js]
     | POST /api/pid/upload
     v
[PidController.cs]  ->  saved as uploads/pid/{filename}.dxf
     | POST /api/pid/extract
     v
[PidExtractorService.cs]
     |
     +- Step 1: C# netDxf text extraction
     |    - Stream -> temp file -> DxfDocument.Load()
     |    - Extract TEXT, MTEXT, Block Attribute
     |    - Map Circle center coordinates
     |    - Balloon recombination (O(n^2))
     |    - FilterDxfText (regex filtering)
     |    -> returns (filteredText, coords)
     |
     +- Step 2: MCP -> Python extract_pid_tags
     |    - HTTP localhost:5001 -> JSON-RPC
     |    - _extract_pid_tags_from_text() (pure regex, no LLM)
     |    - kind: classified as pipe | equipment | instrument | unknown
     |    -> JSON: {success, count, tags: [...]}
     |
     +- Step 3: MCP -> Python match_pid_tags
     |    - HTTP localhost:5001 -> JSON-RPC
     |    - Deterministic matching (exact 0.99, prefix 0.95)
     |    -> JSON: {success, count, mappings: [...]}
     |
     +- Step 4: DB save
     |    - Duplicate check (existing TagNo values are skipped)
     |    - Category classification (prefix rules)
     |    - TagClass classification (field vs system)
     |    - Coordinate assignment (coords)
     |    -> INSERT into the pid_equipment table
     |
     v
[Response: totalCount, confidenceItems, lowConfidenceItems, skippedDuplicates]

Parallel path: direct MCP tool (bypasses C# netDxf)
  parse_pid_dxf(filepath) -> _extract_pid_dxf_fast() -> ezdxf layer-aware
  - LINENO layer -> line master (service/line_no/size/material_spec)
  - All other TEXT/MTEXT -> tag candidates (classified as equipment/instrument by prefix)
  - No coordinate information
  - This path is not used by the C# PidExtractorService

2. Problems Found

P1 - Severe

#	Problem	Impact	Location
1	Duplicate parsing across C# and Python	C# extracts text with netDxf, then Python re-classifies it with regex. The same DXF is parsed twice	`PidExtractorService.cs:136-234` + `server.py:373-432`
2	Coordinate information lost	Python `_extract_pid_dxf_fast` uses layers but does not return coordinates	`server.py:296-366`
3	Temp file I/O	Stream -> temp file -> netDxf load -> delete. Unnecessary disk I/O	`PidExtractorService.cs:138-142`
3b	Block INSERT handling lost	C# extracts Block AttributeDefinitions, but Python `_extract_pid_dxf_fast` handles only TEXT/MTEXT	`PidExtractorService.cs:162-164` vs `server.py:310`
3c	Regex mismatch	C# `FilterDxfText` `[A-Z]{1,6}-\d{2,6}` vs Python `_PID_TAG_RE` `[A-Z]{1,4}-\d{3,6}` → Python is stricter, so some tags may be dropped	`PidExtractorService.cs:253` vs `server.py:247`
4	Balloon recombination is O(n^2)	Every func/num pair is compared. Performance bottleneck on large DXF files (1000+ texts)	`PidExtractorService.cs:277-308`
5	MCP server is a single point of failure	If the MCP server is down, the whole parse fails. No fallback	`PidExtractorService.cs:55-56`
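
The regex mismatch in row 3c can be made concrete with a small check. The sketch below uses the two patterns exactly as quoted in the table; the `^...$` anchors and the helper name `dropped_by_python` are assumptions added for illustration, not code from `server.py`.

```python
import re

# Patterns as quoted in row 3c; anchors added so whole tokens are compared.
CSHARP_FILTER_RE = re.compile(r"^[A-Z]{1,6}-\d{2,6}$")  # C# FilterDxfText
PYTHON_TAG_RE = re.compile(r"^[A-Z]{1,4}-\d{3,6}$")     # Python _PID_TAG_RE


def dropped_by_python(tokens):
    """Return tokens the C# filter accepts but the Python pattern rejects."""
    return [t for t in tokens
            if CSHARP_FILTER_RE.match(t) and not PYTHON_TAG_RE.match(t)]


sample = ["P-10101", "FICQA-101", "P-10", "TE-9101"]
lost = dropped_by_python(sample)
print(lost)
```

Running a check like this over the texts of a real drawing would show how many tags the stricter Python pattern loses once the C# path is removed.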

P2 - Needs Improvement

#	Problem	Impact	Location
6	FilterDxfText over-filters	Only the `[A-Z]{1,6}-\d{2,6}` pattern passes. Some valid tags may be dropped	`PidExtractorService.cs:240-260`
7	dxf-graph/ is isolated	`pid_geometric_extractor.py`, `pid_topology_builder.py`, etc. are not used in production	`dxf-graph/`
8	Backup files accumulating	3 backups of server.py in `mcp-server/.rooBackup/`	`mcp-server/.rooBackup/`
9	Tag matching prefix length limit	`_MIN_PREFIX_LEN = 4` -> tags with a short prefix such as `P-10101` cannot be prefix-matched	`server.py:1643`
10	netDxf dependency is unnecessary	Python ezdxf supports layer-aware parsing, so C# netDxf is redundant	`ExperionCrawler.csproj:32`

3. Improvement Direction

Core strategy: consolidate onto a single Python ezdxf parsing path

Remove C# netDxf and extend Python parse_pid_dxf so that it returns both coordinates and layer information. C# is responsible only for the MCP call and the DB save.

[Frontend]
     | POST /api/pid/extract
     v
[PidController.cs]
     |
     v
[PidExtractorService.cs]
     |
     +- Step 1: MCP -> Python parse_pid_dxf (extended)
     |    - ezdxf layer-aware parsing (based on the existing _extract_pid_dxf_fast)
     |    - Includes coordinates (TEXT/MTEXT positions)
     |    - Balloon recombination (moved to the Python side, R-tree optimized)
     |    - Block reference virtual entity expansion
     |    -> JSON: {tags: [{tagNo, kind, coords: {x, y}, ...}]}
     |
     +- Step 2: MCP -> Python match_pid_tags (unchanged)
     |    -> JSON: {mappings: [...]}
     |
     +- Step 3: DB save (unchanged)
     |    -> INSERT into the pid_equipment table
     |
     v
[Response]

Fallback: if the MCP connection fails, invoke the Python process directly

4. Todo List (by work unit)

Phase 1: Extend Python parsing (coordinates + Balloon)

1.1 Add coordinates to _extract_pid_dxf_fast
- Include (x, y, height) of each TEXT/MTEXT entity in the tag result
- Add Circle center mapping logic (port of the circleCoords logic in C# ExtractDxfText)
- File: mcp-server/server.py
1.2 Move Balloon recombination to the Python side
- Add a Balloon recombination function inside _extract_pid_tags_from_text
- Port the C# ReconstructBalloonTags logic to Python
- Introduce an R-tree (spatial index) to go from O(n^2) to O(n log n)
- File: mcp-server/server.py
1.3 Expand Block references into virtual entities
- Extract TEXT inside blocks via virtual_entities() on INSERT entities
- File: mcp-server/server.py
1.4 Extend the parse_pid_dxf return format
- Before: {success, fluid_dictionary, linenos, tags, stats}
- After: {success, fluid_dictionary, linenos, tags: [{..., coords: {x, y, h}}], stats}
- File: mcp-server/server.py
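
Item 1.3 could look roughly like the sketch below. It assumes the ezdxf `Insert` interface (`.attribs` for ATTRIB entities and `.virtual_entities()` for block content). The helper `iter_block_texts` and the `_stub` objects are hypothetical names, and the stubs stand in for real entities only so the example runs without a DXF file. MTEXT inside blocks is left out for brevity.

```python
from types import SimpleNamespace


def iter_block_texts(insert):
    """Yield (text, x, y) for one INSERT-like entity.

    Assumes the ezdxf Insert interface: `.attribs` holds ATTRIB entities and
    `.virtual_entities()` yields the block content transformed into place.
    """
    for attrib in insert.attribs:
        txt = (attrib.dxf.text or "").strip()
        if txt:
            yield txt, attrib.dxf.insert.x, attrib.dxf.insert.y
    for ent in insert.virtual_entities():
        if ent.dxftype() == "TEXT":
            txt = (ent.dxf.text or "").strip()
            if txt:
                yield txt, ent.dxf.insert.x, ent.dxf.insert.y


def _stub(kind, text, x, y):
    """Hypothetical stand-in for an ezdxf entity, used only for this demo."""
    dxf = SimpleNamespace(text=text, insert=SimpleNamespace(x=x, y=y))
    return SimpleNamespace(dxf=dxf, dxftype=lambda: kind)


fake_insert = SimpleNamespace(
    attribs=[_stub("ATTRIB", "P-10101", 1.0, 2.0)],
    virtual_entities=lambda: [_stub("TEXT", "FT", 3.0, 4.0),
                              _stub("LINE", "", 0.0, 0.0)],
)
block_texts = list(iter_block_texts(fake_insert))
print(block_texts)
```

In the real function the same generator would be fed from `msp.query('INSERT')`, and its output appended to the positioned text list before classification.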

Phase 2: C# cleanup

2.1 Remove the ExtractDxfText method
- Remove the netDxf dependency
- Remove using netDxf;
- File: PidExtractorService.cs
2.2 Remove the FilterDxfText method
- Filtering happens on the Python side, so it is not needed in C#
- File: PidExtractorService.cs
2.3 Remove the ReconstructBalloonTags method
- Moved to the Python side
- File: PidExtractorService.cs
2.4 Rewrite ExtractFromStreamAsync
- DXF: call McpClient.ParsePidDxfAsync(filepath)
- PDF: keep the existing ExtractPdfText + McpClient.ExtractPidTagsAsync (no change)
- File: PidExtractorService.cs
2.5 Remove the netDxf NuGet package
- Remove <PackageReference Include="netDxf" /> from ExperionCrawler.csproj
- File: ExperionCrawler.csproj
2.6 Add coordinate fields to the ExtractedItem model
- Add PosX and PosY fields
- File: PidExtractorService.cs (nested class)

Phase 3: MCP Fallback

3.1 Implement a fallback that invokes the Python process directly
- If the MCP connection fails, call Python _extract_pid_dxf_fast directly as a subprocess
- File: PidExtractorService.cs
3.2 Improve failure detection in McpClient.CallToolAsync
- Treat a response containing the string "도구 호출 실패:" ("tool call failed:") as an exception
- File: McpClient.cs

Phase 4: Tag matching improvements

4.1 Adjust _MIN_PREFIX_LEN
- Lower it from 4 to 3 (so tags with a short prefix such as P-10101 can also be prefix-matched)
- Add a condition that also matches the numeric part, to prevent false positives
- File: mcp-server/server.py
4.2 Add number-based matching
- P-10101 <-> p-10101.pv -> match when the number (10101) is identical
- The prefix must match too (prevents a false match of PSV-10101 <-> p-10101.pv)
- File: mcp-server/server.py

Phase 5: dxf-graph integration (optional)

5.1 Merge pid_geometric_extractor.py into server.py
- Merge the bbox calculation logic into _extract_pid_dxf_fast
- File: mcp-server/server.py
5.2 Expose pid_topology_builder.py as an MCP tool
- Extend the build_pid_graph_parallel tool
- File: mcp-server/server.py

Phase 6: Cleanup

6.1 Clean up .rooBackup
- Delete mcp-server/.rooBackup/
- File: mcp-server/.rooBackup/
6.2 Build verification
- dotnet build src/Web/ExperionCrawler.csproj
- dotnet test
6.3 Integration test
- Run an extraction test with a sample DXF file (uploads/pid/P10-EQP-BLOCK.dxf)
- Confirm that coordinates are stored in the DB
- Confirm fallback behavior when the MCP server is down

5. Code Changes

5.1 `mcp-server/server.py` - extend `_extract_pid_dxf_fast`

Current (lines 296-366):

async def _extract_pid_dxf_fast(filepath: str) -> dict:
    """Extract structure from a DXF using layer + regex only. No coordinate calculation, no LLM calls."""
    import ezdxf
    from ezdxf.tools.text import plain_mtext
    from collections import Counter

    def _work():
        doc = ezdxf.readfile(filepath)
        msp = doc.modelspace()

        linenos: list[dict] = []
        tags: list[dict] = []
        seen_tags: set[str] = set()

        for e in msp.query('TEXT MTEXT'):
            # ... text extraction and classification ...

After:

async def _extract_pid_dxf_fast(filepath: str) -> dict:
    """Extract structure from a DXF using layer + regex + coordinates. Includes coordinates, no LLM calls."""
    import ezdxf
    from ezdxf.tools.text import plain_mtext
    from collections import Counter
    import math

    def _work():
        doc = ezdxf.readfile(filepath)
        msp = doc.modelspace()

        linenos: list[dict] = []
        tags: list[dict] = []
        seen_tags: set[str] = set()

        # Collect Circle centers up front
        circles = []
        for c in msp.query('CIRCLE'):
            circles.append((c.dxf.center.x, c.dxf.center.y, c.dxf.radius))

        # Collect text entities (with coordinates)
        positioned_texts = []  # list of (text, x, y, h)

        for e in msp.query('TEXT MTEXT'):
            x, y = e.dxf.insert.x, e.dxf.insert.y
            if e.dxftype() == 'TEXT':
                txt = e.dxf.text or ""
                h = e.dxf.height or 0.0
            else:
                # In ezdxf, MTEXT keeps its content in e.text and its
                # character height in dxf.char_height (not dxf.text / dxf.height).
                try:
                    txt = plain_mtext(e.text or "")
                except Exception:
                    txt = e.text or ""
                h = e.dxf.char_height or 0.0
            txt = txt.strip()
            if not txt:
                continue
            positioned_texts.append((txt, x, y, h))
            layer = e.dxf.layer

            # Existing classification logic (layer-based LINENO, tag classification)
            # Include coords when appending a tag:
            # tags.append({
            #     "tagNo": txt, "kind": ..., "prefix": ..., "type": ...,
            #     "coords": {"x": x, "y": y, "h": h},
            #     "layer": layer
            # })

        # Map each text to the smallest Circle that contains it
        circle_coords = {}
        for (txt, x, y, h) in positioned_texts:
            best_r = float('inf')
            cx, cy = x, y
            for (ccx, ccy, cr) in circles:
                d = math.sqrt((x - ccx) ** 2 + (y - ccy) ** 2)
                if d < cr and cr < best_r:
                    best_r = cr
                    cx, cy = ccx, ccy
            if best_r < float('inf'):
                circle_coords[txt] = (cx, cy)

        # Balloon recombination (R-tree based)
        balloon_tags = _reconstruct_balloons_rtree(positioned_texts, circle_coords)
        for (tag, x, y, h) in balloon_tags:
            if tag not in seen_tags:
                seen_tags.add(tag)
                cls = _classify_pid_tag(tag)
                final_x, final_y = x, y
                # circle_coords is keyed by raw text, so this only hits when the
                # recombined tag also exists as a standalone text inside a Circle.
                if tag in circle_coords:
                    final_x, final_y = circle_coords[tag]
                tags.append({
                    "tagNo": tag,
                    **cls,
                    "instrumentType": cls["prefix"],
                    "confidence": 0.90,
                    "coords": {"x": final_x, "y": final_y, "h": h},
                })

        # compute stats ...

Balloon recombination function to add (R-tree based):

def _reconstruct_balloons_rtree(
    texts,
    circle_coords,
):
    """Pair ISA function codes with loop numbers by coordinate proximity.
    Uses an R-tree (spatial index) to bring the search down to O(n log n).
    """
    import math  # needed here: the import inside _extract_pid_dxf_fast is not visible
    import re
    try:
        import rtree
        HAS_RTREE = True
    except ImportError:
        HAS_RTREE = False

    _instr_func_re = re.compile(r'^[FPLTASHQWVXZBDCRK][ICTREVYSAQGZ]{1,3}$')
    _loop_num_re = re.compile(r'^\d{3,6}[A-Z]?$')

    funcs = [(t, x, y, h) for t, x, y, h in texts if _instr_func_re.match(t)]
    nums = [(t, x, y, h) for t, x, y, h in texts if _loop_num_re.match(t)]

    if not funcs or not nums:
        return []

    result = []
    seen = set()

    # Build the index only when it pays off; small inputs use the linear scan
    idx = None
    if HAS_RTREE and len(nums) > 50:
        idx = rtree.index.Index()
        for i, (t, x, y, h) in enumerate(nums):
            idx.insert(i, (x, y, x, y))

    for (ft, fx, fy, fh) in funcs:
        threshold = fh * 5.0 if fh > 0 else 12.0
        best_dist = float('inf')
        best_num = None
        nx, ny = fx, fy

        if idx is not None:
            # Only loop numbers inside the threshold box are candidates
            candidates = list(idx.intersection(
                (fx - threshold, fy - threshold, fx + threshold, fy + threshold)))
            for i in candidates:
                t, x, y, h = nums[i]
                d = math.sqrt((fx - x) ** 2 + (fy - y) ** 2)
                if d < best_dist:
                    best_dist = d
                    best_num = t
                    nx, ny = x, y
        else:
            for (t, x, y, h) in nums:
                d = math.sqrt((fx - x) ** 2 + (fy - y) ** 2)
                if d < best_dist:
                    best_dist = d
                    best_num = t
                    nx, ny = x, y

        if best_num and best_dist <= threshold:
            tag = f"{ft}-{best_num}"
            if tag.upper() not in seen:
                seen.add(tag.upper())
                # Place the recombined tag at the midpoint of the two texts
                result.append((tag, (fx + nx) / 2, (fy + ny) / 2, fh))

    return result
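
When `rtree` is not installed, the function above falls back to a full linear scan. As an aside, a uniform grid hash built from the standard library gives similar candidate pruning. This is a swapped-in technique for illustration, not the R-tree approach the plan specifies. The names `build_grid` and `nearest_within` are hypothetical, and the sketch assumes the cell size is at least as large as the largest pairing threshold.

```python
import math
from collections import defaultdict


def build_grid(points, cell):
    """Bucket (label, x, y) points into square cells of side `cell`."""
    grid = defaultdict(list)
    for label, x, y in points:
        grid[(math.floor(x / cell), math.floor(y / cell))].append((label, x, y))
    return grid


def nearest_within(grid, cell, fx, fy, threshold):
    """Return the label closest to (fx, fy) within `threshold`, or None.

    Requires threshold <= cell so the 3x3 cell neighbourhood is sufficient.
    """
    cx, cy = math.floor(fx / cell), math.floor(fy / cell)
    best, best_d = None, threshold
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for label, x, y in grid.get((cx + dx, cy + dy), ()):
                d = math.hypot(fx - x, fy - y)
                if d <= best_d:
                    best, best_d = label, d
    return best


# Loop numbers on the drawing, and a function code "TE" with text height 2.5
nums = [("9101", 10.0, 8.0), ("9102", 100.0, 100.0)]
threshold = 2.5 * 5.0  # same rule as above: h * 5.0
grid = build_grid(nums, cell=threshold)

hit = nearest_within(grid, threshold, 10.0, 10.0, threshold)
tag = f"TE-{hit}" if hit else None
miss = nearest_within(grid, threshold, 50.0, 50.0, threshold)
print(tag, miss)
```

Because thresholds vary with each function code's text height, a real implementation would size the cells from the largest threshold in the drawing.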

5.2 `mcp-server/server.py` - modify `_extract_pid_tags_from_text`

Change: include a coords field in the return format (PDF/OCR text has no coordinates)

def _extract_pid_tags_from_text(text: str, coords_map: dict | None = None) -> list[dict]:
    """Extract tag/LineNo values from plain text with regex.
    When coords_map is provided, coordinates are included.
    """
    # ... existing logic unchanged ...
    # when appending a tag:
    entry = {
        "tagNo": token,
        "kind": "equipment",
        "prefix": cls["prefix"],
        # ...
    }
    if coords_map and token in coords_map:
        entry["coords"] = coords_map[token]
    out.append(entry)
    return out

5.3 `mcp-server/server.py` - improve `match_pid_tags`

Current:

_MIN_PREFIX_LEN = 4  # minimum length for prefix matching

After:

_MIN_PREFIX_LEN = 3  # lets tags with a short prefix such as P-10101 match too

# Additional matching strategy: number-based matching
def _match_by_number(pid_norm, ex_index):
    """P-10101 <-> p-10101.pv -> match only when both the number (10101) and the prefix (P) agree."""
    m = re.match(r'^([a-z]+)-(\d+)$', pid_norm)
    if not m:
        return None, 0.0
    pid_prefix, pid_num = m.group(1), m.group(2)
    for ex_norm, ex_orig in ex_index.items():
        em = re.match(r'^([a-z]+)-(\d+)', ex_norm)
        if em and em.group(1) == pid_prefix and em.group(2) == pid_num:
            return ex_orig, 0.95
    return None, 0.0

5.4 `PidExtractorService.cs` - rewrite

Current ExtractFromStreamAsync (lines 36-134):

public async Task<PidExtractionResult> ExtractFromStreamAsync(Stream stream, string fileName, bool useImageMode = false)
{
    var ext = Path.GetExtension(fileName).ToLowerInvariant();

    string text;
    Dictionary<string, (double X, double Y, double H)>? coords = null;

    if (ext == ".dxf")
        (text, coords) = ExtractDxfText(stream);
    else if (ext == ".pdf")
        text = ExtractPdfText(stream);
    // ...

After:

public async Task<PidExtractionResult> ExtractFromStreamAsync(Stream stream, string fileName, bool useImageMode = false)
{
    var ext = Path.GetExtension(fileName).ToLowerInvariant();
    List<ExtractedItem> extractedItems;

    if (ext == ".dxf")
    {
        // GetTempFileName() + ".dxf" would leave a stray zero-byte .tmp file behind,
        // so build a fresh path instead.
        var tmp = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName() + ".dxf");
        try
        {
            // Close the file before parsing so the Python side reads fully flushed content.
            await using (var fs = File.Create(tmp))
            {
                await stream.CopyToAsync(fs);
            }
            extractedItems = await ParseDxfViaMcpAsync(tmp);
        }
        finally
        {
            if (File.Exists(tmp)) File.Delete(tmp);
        }
    }
    else if (ext == ".pdf")
    {
        var text = ExtractPdfText(stream);
        if (string.IsNullOrWhiteSpace(text))
            return new PidExtractionResult(0, 0, 0);
        var json = await _mcp.ExtractPidTagsAsync(text, "pdf");
        extractedItems = ParseJson(json);
    }
    else
    {
        throw new NotSupportedException("Supported formats: .dxf .pdf");
    }

    if (extractedItems.Count == 0)
    {
        _logger.LogWarning("P&ID extraction returned 0 items - file: {FileName}", fileName);
        return new PidExtractionResult(0, 0, 0);
    }

    // Remaining logic (mapping, duplicate check, DB save) is unchanged
}

New methods to add:

private async Task<List<ExtractedItem>> ParseDxfViaMcpAsync(string filePath)
{
    try
    {
        var json = await _mcp.ParsePidDxfAsync(filePath);
        using var doc = JsonDocument.Parse(json);
        var root = doc.RootElement;
        var items = new List<ExtractedItem>();

        if (root.TryGetProperty("tags", out var tagsEl) && tagsEl.ValueKind == JsonValueKind.Array)
        {
            foreach (var tag in tagsEl.EnumerateArray())
            {
                var item = new ExtractedItem
                {
                    TagNo = tag.TryGetProperty("tagNo", out var tn) ? tn.GetString() ?? "" : "",
                    InstrumentType = tag.TryGetProperty("prefix", out var p) ? p.GetString() : null,
                    LineNumber = tag.TryGetProperty("lineNumber", out var ln) ? ln.GetString() : null,
                    Confidence = tag.TryGetProperty("confidence", out var c) ? c.GetDouble() : 0.5,
                };

                // coords is a nested object, so map it by hand onto PosX/PosY
                if (tag.TryGetProperty("coords", out var coordsEl))
                {
                    item.PosX = coordsEl.TryGetProperty("x", out var cx) ? cx.GetDouble() : null;
                    item.PosY = coordsEl.TryGetProperty("y", out var cy) ? cy.GetDouble() : null;
                }

                items.Add(item);
            }
        }

        _logger.LogInformation("[PID] MCP parse complete: {Count} items (file: {File})",
            items.Count, Path.GetFileName(filePath));
        return items;
    }
    catch (Exception ex)
    {
        _logger.LogWarning(ex, "[PID] MCP parse failed - using fallback (file: {File})", filePath);
        return await FallbackParseDxfAsync(filePath);
    }
}

private async Task<List<ExtractedItem>> FallbackParseDxfAsync(string filePath)
{
    var result = new List<ExtractedItem>();
    try
    {
        var psi = new ProcessStartInfo
        {
            FileName = "python3",
            RedirectStandardOutput = true,
            UseShellExecute = false,
            CreateNoWindow = true,
        };
        // Pass the path as a separate argument (sys.argv[1]) instead of splicing it
        // into the Python source, which would allow Python code injection.
        psi.ArgumentList.Add("-c");
        psi.ArgumentList.Add(
            "import sys,json,asyncio; sys.path.insert(0,'mcp-server'); " +
            "from server import _extract_pid_dxf_fast; " +
            "d=asyncio.run(_extract_pid_dxf_fast(sys.argv[1])); " +
            "print(json.dumps(d))");
        psi.ArgumentList.Add(filePath);

        using var proc = Process.Start(psi);
        if (proc != null)
        {
            var output = await proc.StandardOutput.ReadToEndAsync();
            await proc.WaitForExitAsync();
            if (!string.IsNullOrWhiteSpace(output))
            {
                using var doc = JsonDocument.Parse(output);
                if (doc.RootElement.TryGetProperty("tags", out var tagsEl) &&
                    tagsEl.ValueKind == JsonValueKind.Array)
                {
                    foreach (var tag in tagsEl.EnumerateArray())
                    {
                        result.Add(new ExtractedItem
                        {
                            TagNo = tag.TryGetProperty("tagNo", out var tn) ? tn.GetString() ?? "" : "",
                            InstrumentType = tag.TryGetProperty("prefix", out var p) ? p.GetString() : null,
                            Confidence = tag.TryGetProperty("confidence", out var c) ? c.GetDouble() : 0.5,
                        });
                    }
                }
            }
        }
    }
    catch (Exception ex)
    {
        _logger.LogError(ex, "[PID] Fallback parse also failed");
    }
    return result;
}

Methods to remove:

ExtractDxfText (lines 136-234) - remove entirely
FilterDxfText (lines 240-260) - remove entirely
ReconstructBalloonTags (lines 277-308) - remove entirely
Related regexes: _instrFuncRe, _loopNumRe - remove

5.5 Extend the `ExtractedItem` model

Current:

public class ExtractedItem
{
    public string TagNo { get; set; } = "";
    public string? EquipmentName { get; set; }
    public string? InstrumentType { get; set; }
    public string? LineNumber { get; set; }
    public string? PidDrawingNo { get; set; }
    public double Confidence { get; set; } = 0.5;
}

After:

public class ExtractedItem
{
    public string TagNo { get; set; } = "";
    public string? EquipmentName { get; set; }
    public string? InstrumentType { get; set; }
    public string? LineNumber { get; set; }
    public string? PidDrawingNo { get; set; }
    public double Confidence { get; set; } = 0.5;
    public double? PosX { get; set; }
    public double? PosY { get; set; }
}

5.6 `ExperionCrawler.csproj` - remove netDxf

Current:

<!-- P&ID extraction -->
<PackageReference Include="netDxf" Version="2022.11.2" />
<PackageReference Include="PdfPig" Version="0.1.9" />

After:

<!-- P&ID extraction (PDF only - DXF uses Python ezdxf) -->
<PackageReference Include="PdfPig" Version="0.1.9" />

5.7 Change coordinate assignment on DB save

Current (lines 108-112):

if (coords != null && coords.TryGetValue(item.TagNo, out var c))
{
    newItem.PosX = c.X;
    newItem.PosY = c.Y;
}

After:

newItem.PosX = item.PosX;
newItem.PosY = item.PosY;

6. Add Python Dependency

In mcp-server/pyproject.toml or requirements.txt:

rtree>=1.0.0  # spatial indexing for Balloon recombination (optional; falls back to linear search if missing)

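7. Migration Plan

Staged rollout

Step	Work	Risk	Rollback
1	Extend Python `_extract_pid_dxf_fast` (with coordinates)	Low - compatible with the existing API	git revert
2	Python Balloon recombination + R-tree	Low - optional dependency	Fall back to running without rtree
3	Rewrite C# `ExtractFromStreamAsync` (MCP call)	Medium - depends on MCP	Restore the previous code
4	Remove netDxf	Low - after Phase 3 is complete	Re-add the package
5	Implement the fallback	Low - safety net	-
6	Backup cleanup, tests	None	-

Compatibility

Adding a coords field to the parse_pid_dxf return format is backward compatible (existing callers ignore coords)
extract_pid_tags is unchanged (used on the PDF path)
Adding PosX/PosY to the C# ExtractedItem is nullable, so it is compatible with existing code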
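
The backward-compatibility claim can be illustrated with a caller that treats `coords` as optional. This is a minimal sketch: the payloads are made-up examples shaped like the return format described in this document, and `tag_positions` is a hypothetical helper.

```python
import json


def tag_positions(payload: str):
    """Map tagNo -> (x, y), or None when a tag carries no `coords`."""
    out = {}
    for tag in json.loads(payload).get("tags", []):
        coords = tag.get("coords")
        out[tag["tagNo"]] = (coords["x"], coords["y"]) if coords else None
    return out


# Response shape before the change (no coords) and after it (coords added)
old = '{"success": true, "tags": [{"tagNo": "P-10101", "kind": "equipment"}]}'
new = ('{"success": true, "tags": [{"tagNo": "P-10101", "kind": "equipment", '
       '"coords": {"x": 1234.5, "y": 5678.9, "h": 2.5}}]}')

print(tag_positions(old))
print(tag_positions(new))
```

A caller that only reads `tagNo` and `kind` behaves identically on both payloads, which is what makes the added field safe to roll out first.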

8. Expected Performance Impact

Metric	Current	After	Improvement
DXF parse count	2 (C# + Python)	1 (Python)	50% reduction
Balloon recombination	O(n^2)	O(n log n)	~100x at 1000 texts
Disk I/O	temp file read/write	temp file kept (MCP requires a filepath argument)	No change ⚠️
NuGet dependencies	netDxf + PdfPig	PdfPig only	1 fewer
On MCP failure	Total failure	Fallback runs	-
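
The "~100x at 1000 texts" figure follows from comparing operation counts directly. The arithmetic below ignores constant factors and assumes roughly n function codes and n loop numbers, so it is an order-of-magnitude estimate rather than a measured speedup.

```python
import math

n = 1000                       # number of text entities
pairwise = n * n               # O(n^2): every func/num pair compared
indexed = n * math.log2(n)     # O(n log n): one indexed lookup per text
ratio = pairwise / indexed

print(f"{pairwise:,} vs {indexed:,.0f} comparisons -> about {ratio:.0f}x")
```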

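9. Verification Checklist

dotnet build src/Web/ExperionCrawler.csproj succeeds
dotnet test succeeds
Extracting the sample DXF (P10-EQP-BLOCK.dxf) stores coordinates in the DB
Extraction with the MCP server stopped triggers the fallback
The PDF extraction path is unaffected
Existing DB data is unaffected (existing coordinates are preserved)
Calling the parse_pid_dxf MCP tool directly returns coords
Balloon recombination results are verified (split tags such as TE-9101 are recombined)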

10. Execution Prompt (to hand to another LLM)

The prompt below can be passed to another LLM as-is. It contains the current state of the codebase, the exact locations to modify, and the caveats.

Start of prompt

You are a senior developer fluent in C# (.NET 8) and Python. Implement the DXF extraction logic improvements below exactly.

0. Codebase structure

src/
├── Core/Application/Services/PidExtractorService.cs   - primary file to modify (1067 lines)
├── Infrastructure/Mcp/McpClient.cs                    - MCP client (256 lines, do not modify)
└── Web/ExperionCrawler.csproj                         - netDxf to be removed here
mcp-server/
└── server.py                                          - Python parsing to be extended (2235 lines)

Single project: src/Web/ExperionCrawler.csproj. Core/Infrastructure are included via <Compile Include> globs.

1. Phase 1: modify Python `server.py` (most important)

File: mcp-server/server.py

1.1 Extend the `_extract_pid_dxf_fast` function (lines 296-366)

Current behavior: extracts TEXT/MTEXT from the DXF and classifies tags. No coordinate information.

What to change:

Add a coords: {"x": float, "y": float, "h": float} field to each tag
Add Circle center mapping logic (port the circleCoords logic inside ExtractDxfText at C# PidExtractorService.cs:167-188 to Python)
Also process virtual_entities() of Block INSERT entities (corresponds to the Block AttributeDefinitions handling at C# PidExtractorService.cs:162-164)

Circle mapping algorithm (based on the C# original):

for each TEXT entity:
  bestR = infinity
  for each CIRCLE:
    d = distance(TEXT.pos, CIRCLE.center)
    if d < CIRCLE.radius AND CIRCLE.radius < bestR:
      bestR = CIRCLE.radius
      mapped_coords = CIRCLE.center
  if bestR < infinity:
    circle_coords[TEXT.value] = mapped_coords

Add the Balloon recombination function:

Define a new function _reconstruct_balloons_rtree(texts, circle_coords) before _extract_pid_dxf_fast.

Port ReconstructBalloonTags at C# PidExtractorService.cs:277-308 to Python
Regex patterns:
- _instr_func_re = r'^[FPLTASHQWVXZBDCRK][ICTREVYSAQGZ]{1,3}$' (ISA function code)
- _loop_num_re = r'^\d{3,6}[A-Z]?$' (loop number)
Use an R-tree (the rtree package), falling back to linear search when it is missing. Catch ImportError for graceful degradation.
Threshold: threshold = h * 5.0 if h > 0 else 12.0
Returns: list[tuple[str, float, float, float]], i.e. (tag, x, y, h)

Return format change:

# before
tags.append({"tagNo": txt, **cls, "layer": layer})

# after
tags.append({
    "tagNo": txt,
    **cls,
    "layer": layer,
    "coords": {"x": final_x, "y": final_y, "h": h},
})

Note: the existing parse_pid_dxf tool (lines 1702-1724) wraps the result of _extract_pid_dxf_fast as {"success": True, **data} and returns it. Adding coords to the return format is backward compatible: existing callers ignore coords.

1.2 Improve `match_pid_tags` (lines 1622-1696)

Change 1: _MIN_PREFIX_LEN = 4 → _MIN_PREFIX_LEN = 3

Change 2: add a numeric matching condition to prefix matching (strategy 2).

The current prefix matching works via n.startswith(pid_norm + "."). A short prefix such as P-101 has the problem of matching all of P-101, P-1010, and P-10101. Check that the numeric part matches as well:

# Strategy 2 improved: match prefix and number together
import re
m = re.match(r'^([a-z]+)-(\d+)$', pid_norm)
if m:
    pid_prefix, pid_num = m.group(1), m.group(2)
    hit = next(
        (n for n in ex_norms
         if re.match(rf'^{re.escape(pid_prefix)}-{re.escape(pid_num)}(\.|-|$)', n)),
        None,
    )
    if hit:
        mappings.append({"pidTag": pid, "experionTag": ex_index[hit], "confidence": 0.95})
        continue
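
The effect of the `(\.|-|$)` boundary in the regex above can be checked in isolation. The sketch below wraps the same pattern in a hypothetical helper, `strict_prefix_hit`, and contrasts it with a bare `startswith(pid_norm)` check. That bare check is a hypothetical baseline for comparison only, not the current code. The tag names are made up.

```python
import re


def strict_prefix_hit(pid_norm, ex_norms):
    """First Experion tag whose prefix AND full number equal pid_norm's."""
    m = re.match(r"^([a-z]+)-(\d+)$", pid_norm)
    if not m:
        return None
    # The number must be followed by ".", "-" or end of string, so that
    # "p-101" cannot match "p-1010..." or "p-10101...".
    pattern = re.compile(
        rf"^{re.escape(m.group(1))}-{re.escape(m.group(2))}(\.|-|$)")
    return next((n for n in ex_norms if pattern.match(n)), None)


ex_norms = ["p-1010.pv", "p-10101.pv", "p-101.pv", "psv-101.pv"]

naive = next((n for n in ex_norms if n.startswith("p-101")), None)
strict = strict_prefix_hit("p-101", ex_norms)
print(naive, strict)
```

Because the prefix group is matched from the start of the string, `psv-101.pv` is also rejected for `p-101`, which is the PSV-versus-P false match that Phase 4 of the todo list calls out.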

2. Phase 2: modify C# `PidExtractorService.cs`

File: src/Core/Application/Services/PidExtractorService.cs

2.1 Add coordinate fields to the `ExtractedItem` model (lines 1052-1060)

public class ExtractedItem
{
    public string TagNo { get; set; } = "";
    public string? EquipmentName { get; set; }
    public string? InstrumentType { get; set; }
    public string? LineNumber { get; set; }
    public string? PidDrawingNo { get; set; }
    public double Confidence { get; set; } = 0.5;
    public double? PosX { get; set; }    // ← added
    public double? PosY { get; set; }    // ← added
}

2.2 Rewrite `ExtractFromStreamAsync` (lines 36-134)

Key change: on the DXF path, remove C# netDxf parsing and replace it with a call to MCP parse_pid_dxf.

Previous flow:
  DXF → ExtractDxfText(netDxf) → text + coords
       → MCP ExtractPidTagsAsync(text) → tags
       → mapping → DB save

New flow:
  DXF → Stream → temp file
       → MCP ParsePidDxfAsync(filepath) → tags (with coords)
       → mapping → DB save

Concrete implementation:

For .dxf:
- Save the Stream to a temp file with a .dxf extension (for example Path.Combine(Path.GetTempPath(), Path.GetRandomFileName() + ".dxf"); Path.GetTempFileName() + ".dxf" leaves a stray empty .tmp file behind)
- Close the temp file, then call ParseDxfViaMcpAsync(tmp) (new method, see below)
- Delete the temp file in finally
For .pdf: no change. Keep the existing ExtractPdfText + McpClient.ExtractPidTagsAsync
New method ParseDxfViaMcpAsync:
- Call _mcp.ParsePidDxfAsync(filePath)
- Parse the tags array from the JSON response
- Assign each tag's coords.x, coords.y to ExtractedItem.PosX, PosY
- On failure, call FallbackParseDxfAsync (Phase 3)

Coordinate assignment on DB save (lines 108-112):

// before
if (coords != null && coords.TryGetValue(item.TagNo, out var c))
{
    newItem.PosX = c.X;
    newItem.PosY = c.Y;
}
// after
newItem.PosX = item.PosX;
newItem.PosY = item.PosY;

2.3 Methods to remove

Delete the following methods entirely:

ExtractDxfText (lines 136-234)
FilterDxfText (lines 240-260)
ReconstructBalloonTags (lines 277-308)
Related regexes: _instrFuncRe, _loopNumRe (lines 265-270)

2.4 using directive to remove

Line 10: delete using netDxf;

2.5 Notes on implementing `ParseDxfViaMcpAsync`

McpClient.ParsePidDxfAsync calls the parse_pid_dxf MCP tool. Returned JSON:

{
  "success": true,
  "fluid_dictionary": {...},
  "linenos": [...],
  "tags": [
    {
      "tagNo": "P-10101",
      "kind": "equipment",
      "prefix": "P",
      "coords": {"x": 1234.5, "y": 5678.9, "h": 2.5},
      "layer": "INSTRUMENT"
    }
  ],
  "stats": {...}
}

Only the tags array needs to be parsed. linenos needs no separate handling (the existing code does not process it either).

2.6 Implement `FallbackParseDxfAsync` (Phase 3)

When the MCP connection fails, call Python directly as a subprocess:

private async Task<List<ExtractedItem>> FallbackParseDxfAsync(string filePath)
{
    // call _extract_pid_dxf_fast directly via python3 -c "..."
    // read the JSON from StandardOutput and parse it
}

Note: the file path may contain spaces, so escape it properly when invoking the subprocess. Use sys.path.insert(0, 'mcp-server') so that server.py can be imported.

3. Phase 3: modify `ExperionCrawler.csproj`

File: src/Web/ExperionCrawler.csproj

Line 32: delete <PackageReference Include="netDxf" Version="2022.11.2" />

4. Phase 4: Python dependency

Add rtree>=1.0.0 to mcp-server/requirements.txt or pyproject.toml (optional dependency).

5. Verification order

dotnet build src/Web/ExperionCrawler.csproj: compiles successfully
dotnet test: tests pass
Extraction test with a sample DXF file:
- Check that uploads/pid/P10-EQP-BLOCK.dxf exists
- Check in the DB that PosX, PosY are stored for the extraction result
Confirm fallback behavior when extracting a DXF with the MCP server stopped
Confirm the PDF extraction path is unaffected

6. Things you must never do

Do not modify McpClient.cs: the ParsePidDxfAsync method (lines 168-169) already exists, so use it as-is
Do not change the PDF path: keep the ExtractPdfText + ExtractPidTagsAsync flow as-is
Do not change the DB schema: the PidEquipment table already has pos_x and pos_y columns
Do not change the structure of the ParseJson method: ParseJson cannot handle the nested coords structure, so manual mapping (coords.x → PosX) is required in ParseDxfViaMcpAsync
Do not use netDxf types after removing using netDxf;: ExtractDxfText must be deleted entirely
Do not change the existing classification logic of _extract_pid_dxf_fast: only add coordinates, and keep the existing regex classification (_PID_LINENO_FULL_RE, _PID_TAG_RE, etc.) as-is

6b. Things you must handle (found on re-diagnosis)

Block INSERT handling is required: C# ExtractDxfText (lines 162-164) extracts Block AttributeDefinitions, but Python _extract_pid_dxf_fast only processes msp.query('TEXT MTEXT'). TEXT inside blocks must also be extracted via virtual_entities() on INSERT entities. See the implementation at pid_tracer.py:51.
Verify the regex mismatch: C# FilterDxfText's [A-Z]{1,6}-\d{2,6}(-[A-Z0-9]+)* is more permissive than Python _PID_TAG_RE's ^([A-Z]{1,4})-(\d{3,6})([A-Z])?$. Prefixes of 5 to 6 letters (FICQA-101), 2-digit numbers (P-10), and compound suffixes (PSV-10101-2A) may be dropped. The Python regex needs to be widened, or a separate fallback regex added.
Fallback call safety: inserting the file path directly into ProcessStartInfo.Arguments in FallbackParseDxfAsync carries a Python code injection risk. Because UseShellExecute = false this is not shell injection, but the file path could contain a value such as '__import__("os").system("rm -rf /")#'. Switch to running a separate Python script file, or to passing the path via sys.argv.
State that temp file I/O is unavoidable: MCP parse_pid_dxf takes filepath: str as its argument, so using a temp file is unavoidable. The improvement goal of "100% reduction in disk I/O" cannot be achieved.
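
For the fallback call safety item, one possible shape of a standalone fallback script is sketched below. The script name `fallback_parse.py` and the helper `run_fallback` are hypothetical. The parser is passed in as an argument so the sketch runs without `server.py`; in the real script it would be `_extract_pid_dxf_fast`, imported after `sys.path.insert(0, 'mcp-server')`.

```python
import asyncio
import json


def run_fallback(argv, parse):
    """Parse the DXF at argv[1] with coroutine function `parse`; return JSON.

    The path arrives as data in argv and is never spliced into source code,
    so quotes or Python expressions inside it cannot be executed.
    """
    if len(argv) != 2:
        raise SystemExit("usage: fallback_parse.py <dxf-path>")
    return json.dumps(asyncio.run(parse(argv[1])))


async def fake_parse(path):
    """Stand-in for _extract_pid_dxf_fast, used only for this demo."""
    return {"tags": [], "path": path}


# A hostile-looking path is handled as an ordinary string.
evil = "x'); __import__('os').system('echo pwned') #.dxf"
out = json.loads(run_fallback(["fallback_parse.py", evil], fake_parse))
print(out["path"] == evil)
```

On the C# side, adding the script path and the DXF path as separate entries of `ProcessStartInfo.ArgumentList` would keep the same separation between code and data.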

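7. Recommended work order

First modify and test the Python side (server.py)
Then modify the C# side (PidExtractorService.cs)
Finally remove netDxf from the csproj
Build → test → integration verification

Why the Python side comes first: after the C# change, MCP calls expect the Python side to return the new format (with coords), so Python has to be ready before integration testing is possible.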

End of prompt
