# DXF Extraction Logic Improvement Plan

> Date: 2026-05-19 | Model: Qwen3.6-27B-FP8

---

## 1. Current Pipeline Structure

```
[Frontend: app.js]
     | POST /api/pid/upload
     v
[PidController.cs]  ->  saved to uploads/pid/{filename}.dxf
     | POST /api/pid/extract
     v
[PidExtractorService.cs]
     |
     +- Step 1: text extraction with C# netDxf
     |    - Stream -> temp file -> DxfDocument.Load()
     |    - extract TEXT, MTEXT, Block Attribute
     |    - map Circle center coordinates
     |    - Balloon reconstruction (O(n^2))
     |    - FilterDxfText (regex filtering)
     |    -> returns (filteredText, coords)
     |
     +- Step 2: MCP -> Python extract_pid_tags
     |    - HTTP localhost:5001 -> JSON-RPC
     |    - _extract_pid_tags_from_text() (pure regex, no LLM)
     |    - kind classification: pipe | equipment | instrument | unknown
     |    -> JSON: {success, count, tags: [...]}
     |
     +- Step 3: MCP -> Python match_pid_tags
     |    - HTTP localhost:5001 -> JSON-RPC
     |    - deterministic matching (exact 0.99, prefix 0.95)
     |    -> JSON: {success, count, mappings: [...]}
     |
     +- Step 4: DB save
     |    - duplicate check (existing TagNo values are skipped)
     |    - Category classification (prefix rules)
     |    - TagClass classification (field vs system)
     |    - coordinate assignment (coords)
     |    -> INSERT into the pid_equipment table
     |
     v
[Response: totalCount, confidenceItems, lowConfidenceItems, skippedDuplicates]

Parallel path: direct MCP tool (bypasses C# netDxf)
  parse_pid_dxf(filepath) -> _extract_pid_dxf_fast() -> ezdxf layer-aware
  - LINENO layer -> line master (service/line_no/size/material_spec)
  - all other TEXT/MTEXT -> tag candidates (equipment/instrument classified by prefix)
  - no coordinate information
  - this path is not used by the C# PidExtractorService
```

---

## 2. Problems Found

### P1 - Critical

| # | Problem | Impact | Location |
|---|---------|--------|----------|
| 1 | **Duplicate C# <-> Python parsing** | Text is extracted with C# netDxf and then reclassified with Python regex, so the same DXF content is processed twice | `PidExtractorService.cs:136-234` + `server.py:373-432` |
| 2 | **Coordinate loss** | Python `_extract_pid_dxf_fast` uses layers but does not return coordinates | `server.py:296-366` |
| 3 | **Temp file I/O** | Stream -> temp file -> netDxf load -> delete. Extra disk I/O (it remains after the change, because `parse_pid_dxf` takes a file path) | `PidExtractorService.cs:138-142` |
| 3b | **Block INSERT handling lost** | C# extracts Block AttributeDefinitions, but Python `_extract_pid_dxf_fast` handles only TEXT/MTEXT | `PidExtractorService.cs:162-164` vs `server.py:310` |
| 3c | **Regex mismatch** | C# `FilterDxfText` `[A-Z]{1,6}-\d{2,6}` vs Python `_PID_TAG_RE` `[A-Z]{1,4}-\d{3,6}` → the Python pattern is stricter, so some tags may be dropped | `PidExtractorService.cs:253` vs `server.py:247` |
| 4 | **Balloon reconstruction is O(n^2)** | Every func/num pair is compared. A performance bottleneck on large DXF files (1000+ texts) | `PidExtractorService.cs:277-308` |
| 5 | **MCP server is a single point of failure** | If the MCP server is down, the whole parse fails. There is no fallback | `PidExtractorService.cs:55-56` |
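
The regex mismatch in row 3c can be checked directly. The following is a minimal sketch, assuming both patterns are applied to a whole token (the anchoring of the C# pattern is an assumption) and using the fuller C# pattern quoted later in this document. The sample tags and the helper name `dropped_by_python` are illustrative only.

```python
import re

# C#-side FilterDxfText pattern (assumed to be applied to a whole token).
CSHARP_FILTER_RE = re.compile(r"[A-Z]{1,6}-\d{2,6}(-[A-Z0-9]+)*")
# Python-side _PID_TAG_RE as quoted in this document.
PYTHON_TAG_RE = re.compile(r"^([A-Z]{1,4})-(\d{3,6})([A-Z])?$")


def dropped_by_python(tokens):
    """Return tokens the C# filter accepts but the Python regex rejects."""
    return [
        t for t in tokens
        if CSHARP_FILTER_RE.fullmatch(t) and not PYTHON_TAG_RE.match(t)
    ]


samples = ["P-10101", "FICQA-101", "P-10", "PSV-10101-2A"]
print(dropped_by_python(samples))
```

Running a check like this against real drawing text before removing `FilterDxfText` would show how many tags the stricter Python pattern actually loses.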

### P2 - Needs Improvement

| # | Problem | Impact | Location |
|---|---------|--------|----------|
| 6 | **FilterDxfText filters too aggressively** | Only the `[A-Z]{1,6}-\d{2,6}` pattern passes. Some valid tags may be dropped | `PidExtractorService.cs:240-260` |
| 7 | **dxf-graph/ is isolated** | `pid_geometric_extractor.py`, `pid_topology_builder.py`, etc. are not used in production | `dxf-graph/` |
| 8 | **Backup files accumulating** | Three server.py backups in `mcp-server/.rooBackup/` | `mcp-server/.rooBackup/` |
| 9 | **Prefix length limit in tag matching** | `_MIN_PREFIX_LEN = 4` -> short prefixes such as `P-10101` cannot be prefix-matched | `server.py:1643` |
| 10 | **netDxf dependency is unnecessary** | Python ezdxf supports layer-aware parsing, so C# netDxf is redundant | `ExperionCrawler.csproj:32` |

---

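## 3. Improvement Direction

### Core strategy: consolidate on a single Python ezdxf parsing path

Remove C# netDxf and extend Python `parse_pid_dxf` so that it returns both coordinates and layer information.
C# is then responsible only for the MCP call and the DB save.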

```
[Frontend]
     | POST /api/pid/extract
     v
[PidController.cs]
     |
     v
[PidExtractorService.cs]
     |
     +- Step 1: MCP -> Python parse_pid_dxf (extended)
     |    - ezdxf layer-aware parsing (based on the existing _extract_pid_dxf_fast)
     |    - coordinates included (TEXT/MTEXT position)
     |    - Balloon reconstruction (moved to Python, R-tree optimized)
     |    - Block reference expansion via virtual entities
     |    -> JSON: {tags: [{tagNo, kind, coords: {x, y}, ...}]}
     |
     +- Step 2: MCP -> Python match_pid_tags (unchanged)
     |    -> JSON: {mappings: [...]}
     |
     +- Step 3: DB save (unchanged)
     |    -> INSERT into the pid_equipment table
     |
     v
[Response]

Fallback: call the Python process directly when the MCP connection fails
```

---

## 4. Todo List (by work unit)

### Phase 1: Extend Python parsing (coordinates + Balloon)

- [ ] **1.1** Add coordinate information to `_extract_pid_dxf_fast`
  - Include `(x, y, height)` of each TEXT/MTEXT entity in the tag result
  - Add the Circle center mapping logic (port the circleCoords logic from C# `ExtractDxfText`)
  - File: `mcp-server/server.py`

- [ ] **1.2** Move Balloon reconstruction to Python
  - Add a Balloon reconstruction function inside `_extract_pid_tags_from_text`
  - Port the C# `ReconstructBalloonTags` logic to Python
  - Introduce an R-tree (spatial index) to go from O(n^2) to O(n log n)
  - File: `mcp-server/server.py`

- [ ] **1.3** Expand Block references via virtual entities
  - Extract TEXT inside blocks through `virtual_entities()` of `INSERT` entities
  - File: `mcp-server/server.py`

- [ ] **1.4** Extend the `parse_pid_dxf` return format
  - Before: `{success, fluid_dictionary, linenos, tags, stats}`
  - After: `{success, fluid_dictionary, linenos, tags: [{..., coords: {x, y, h}}], stats}`
  - File: `mcp-server/server.py`
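
Item 1.3 can be sketched as a small recursive walk. This is a minimal sketch under assumptions, not the planned implementation: `iter_text_like` is a hypothetical helper, it relies only on ezdxf's `Insert.attribs` and `Insert.virtual_entities()`, it skips MTEXT for brevity, and the `max_depth` guard is an added safety choice.

```python
from typing import Iterable, Iterator, Tuple


def iter_text_like(
    entities: Iterable, depth: int = 0, max_depth: int = 4
) -> Iterator[Tuple[str, float, float]]:
    """Yield (text, x, y) for TEXT/ATTRIB entities, descending into INSERTs."""
    for e in entities:
        kind = e.dxftype()
        if kind in ("TEXT", "ATTRIB"):
            yield (e.dxf.text or "").strip(), e.dxf.insert.x, e.dxf.insert.y
        elif kind == "INSERT" and depth < max_depth:
            # Attributes attached to the block reference carry per-instance values.
            yield from iter_text_like(e.attribs, depth + 1, max_depth)
            # virtual_entities() yields the block content transformed into the
            # coordinate system of the INSERT; nested INSERTs are walked again.
            yield from iter_text_like(e.virtual_entities(), depth + 1, max_depth)
```

Called as `iter_text_like(doc.modelspace())`, a walk like this would feed the same `(text, x, y)` tuples into tag classification regardless of whether the text sits directly in modelspace or inside a block.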

### Phase 2: C# cleanup

- [ ] **2.1** Remove the `ExtractDxfText` method
  - Remove the netDxf dependency
  - Remove `using netDxf;`
  - File: `PidExtractorService.cs`

- [ ] **2.2** Remove the `FilterDxfText` method
  - Filtering happens on the Python side, so the C# side no longer needs it
  - File: `PidExtractorService.cs`

- [ ] **2.3** Remove the `ReconstructBalloonTags` method
  - Moved to the Python side
  - File: `PidExtractorService.cs`

- [ ] **2.4** Rewrite `ExtractFromStreamAsync`
  - DXF: call `McpClient.ParsePidDxfAsync(filepath)`
  - PDF: keep the existing `ExtractPdfText` + `McpClient.ExtractPidTagsAsync` (no change)
  - File: `PidExtractorService.cs`

- [ ] **2.5** Remove the `netDxf` NuGet package
  - Remove `<PackageReference Include="netDxf" />` from `ExperionCrawler.csproj`
  - File: `ExperionCrawler.csproj`

- [ ] **2.6** Add coordinate fields to the `ExtractedItem` model
  - Add `PosX` and `PosY`
  - File: `PidExtractorService.cs` (nested class)

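### Phase 3: MCP Fallback

- [ ] **3.1** Implement a fallback that calls the Python process directly
  - When the MCP connection fails, call Python `_extract_pid_dxf_fast` directly as a subprocess
  - File: `PidExtractorService.cs`

- [ ] **3.2** Improve failure detection in `McpClient.CallToolAsync`
  - Treat a response containing the string `"도구 호출 실패:"` ("tool call failed:") as an exception
  - File: `McpClient.cs`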

### Phase 4: Tag matching improvements

- [ ] **4.1** Adjust `_MIN_PREFIX_LEN`
  - Lower it from 4 to 3 (so short prefixes such as P-10101 can also be prefix-matched)
  - Add a condition that also matches the numeric part, to prevent false positives
  - File: `mcp-server/server.py`

- [ ] **4.2** Add number-based matching
  - `P-10101` <-> `p-10101.pv` -> match when the number (10101) is identical
  - The prefix must match as well (prevents a false match of PSV-10101 <-> p-10101.pv)
  - File: `mcp-server/server.py`

### Phase 5: dxf-graph integration (optional)

- [ ] **5.1** Merge `pid_geometric_extractor.py` into `server.py`
  - Merge the bbox computation logic into `_extract_pid_dxf_fast`
  - File: `mcp-server/server.py`

- [ ] **5.2** Expose `pid_topology_builder.py` as an MCP tool
  - Extend the `build_pid_graph_parallel` tool
  - File: `mcp-server/server.py`

### Phase 6: Cleanup

- [ ] **6.1** Clean up `.rooBackup`
  - Delete `mcp-server/.rooBackup/`
  - File: `mcp-server/.rooBackup/`

- [ ] **6.2** Build verification
  - `dotnet build src/Web/ExperionCrawler.csproj`
  - `dotnet test`

- [ ] **6.3** Integration test
  - Run an extraction test with a sample DXF file (`uploads/pid/P10-EQP-BLOCK.dxf`)
  - Confirm that coordinates are stored in the DB
  - Confirm that the fallback works when the MCP server is down

---

## 5. Code Changes

### 5.1 `mcp-server/server.py` - extend `_extract_pid_dxf_fast`

**Current (lines 296-366):**

```python
async def _extract_pid_dxf_fast(filepath: str) -> dict:
    """Extract structure from a DXF using layer + regex only. No coordinate computation, no LLM calls."""
    import ezdxf
    from ezdxf.tools.text import plain_mtext
    from collections import Counter

    def _work():
        doc = ezdxf.readfile(filepath)
        msp = doc.modelspace()

        linenos: list[dict] = []
        tags: list[dict] = []
        seen_tags: set[str] = set()

        for e in msp.query('TEXT MTEXT'):
            # ... text extraction and classification ...
```

**After:**

```python
async def _extract_pid_dxf_fast(filepath: str) -> dict:
    """Extract structure from a DXF using layer + regex + coordinates. No LLM calls."""
    import ezdxf
    from ezdxf.tools.text import plain_mtext
    from collections import Counter
    import math

    def _work():
        doc = ezdxf.readfile(filepath)
        msp = doc.modelspace()

        linenos: list[dict] = []
        tags: list[dict] = []
        seen_tags: set[str] = set()

        # Collect circle centers up front: (center_x, center_y, radius)
        circles = [
            (c.dxf.center.x, c.dxf.center.y, c.dxf.radius)
            for c in msp.query('CIRCLE')
        ]

        # Collect text entities together with their position
        positioned_texts = []  # list of (text, x, y, h)

        for e in msp.query('TEXT MTEXT'):
            if e.dxftype() == 'TEXT':
                txt = e.dxf.text or ""
                h = e.dxf.height or 0.0
            else:
                # MTEXT keeps its content in e.text and its height in char_height
                raw = e.text or ""
                try:
                    txt = plain_mtext(raw)
                except Exception:
                    txt = raw
                h = e.dxf.get("char_height", 0.0) or 0.0
            x, y = e.dxf.insert.x, e.dxf.insert.y
            txt = txt.strip()
            if not txt:
                continue
            positioned_texts.append((txt, x, y, h))
            layer = e.dxf.layer

            # Existing classification logic (layer-based LINENO, tag classification).
            # Include coords when appending a tag:
            # tags.append({
            #     "tagNo": txt, "kind": ..., "prefix": ..., "type": ...,
            #     "coords": {"x": x, "y": y, "h": h},
            #     "layer": layer
            # })

        # Map each text to the center of the smallest circle that contains it
        circle_coords = {}
        for (txt, x, y, h) in positioned_texts:
            best_r = float('inf')
            cx, cy = x, y
            for (ccx, ccy, cr) in circles:
                d = math.hypot(x - ccx, y - ccy)
                if d < cr and cr < best_r:
                    best_r = cr
                    cx, cy = ccx, ccy
            if best_r < float('inf'):
                circle_coords[txt] = (cx, cy)

        # Balloon reconstruction (R-tree based)
        balloon_tags = _reconstruct_balloons_rtree(positioned_texts, circle_coords)
        for (tag, x, y, h) in balloon_tags:
            if tag not in seen_tags:
                seen_tags.add(tag)
                cls = _classify_pid_tag(tag)
                final_x, final_y = x, y
                # NOTE: circle_coords is keyed by raw entity text, so this lookup
                # only hits when the combined tag also exists as a single text.
                if tag in circle_coords:
                    final_x, final_y = circle_coords[tag]
                tags.append({
                    "tagNo": tag,
                    **cls,
                    "instrumentType": cls["prefix"],
                    "confidence": 0.90,
                    "coords": {"x": final_x, "y": final_y, "h": h},
                })

        # compute stats ...
```

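**Balloon reconstruction function to add (R-tree based):**

```python
def _reconstruct_balloons_rtree(
    texts,
    circle_coords,
):
    """Pair ISA function codes with loop numbers by coordinate proximity.

    Uses an R-tree (spatial index) to bring the search down to O(n log n).
    `circle_coords` is accepted for the caller's convenience and is not used here.
    """
    import math  # needed for the distance computation below
    import re
    try:
        import rtree
        HAS_RTREE = True
    except ImportError:
        HAS_RTREE = False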

    _instr_func_re = re.compile(r'^[FPLTASHQWVXZBDCRK][ICTREVYSAQGZ]{1,3}$')
    _loop_num_re = re.compile(r'^\d{3,6}[A-Z]?$')

    funcs = [(t, x, y, h) for t, x, y, h in texts if _instr_func_re.match(t)]
    nums = [(t, x, y, h) for t, x, y, h in texts if _loop_num_re.match(t)]

    if not funcs or not nums:
        return []

    result = []
    seen = set()

    idx = None
    if HAS_RTREE and len(nums) > 50:
        idx = rtree.index.Index()
        for i, (t, x, y, h) in enumerate(nums):
            idx.insert(i, (x, y, x, y))

    for (ft, fx, fy, fh) in funcs:
        threshold = fh * 5.0 if fh > 0 else 12.0
        best_dist = float('inf')
        best_num = None
        nx, ny = fx, fy

        if idx is not None:
            candidates = list(idx.intersection(
                (fx - threshold, fy - threshold, fx + threshold, fy + threshold)))
            for i in candidates:
                t, x, y, h = nums[i]
                d = math.sqrt((fx - x) ** 2 + (fy - y) ** 2)
                if d < best_dist:
                    best_dist = d
                    best_num = t
                    nx, ny = x, y
        else:
            for (t, x, y, h) in nums:
                d = math.sqrt((fx - x) ** 2 + (fy - y) ** 2)
                if d < best_dist:
                    best_dist = d
                    best_num = t
                    nx, ny = x, y

        if best_num and best_dist <= threshold:
            tag = f"{ft}-{best_num}"
            if tag.upper() not in seen:
                seen.add(tag.upper())
                result.append((tag, (fx + nx) / 2, (fy + ny) / 2, fh))

    return result
```
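
When `rtree` is not installed, the function above falls back to a linear scan. A uniform grid hash built from the standard library is one possible middle ground. The sketch below is a substituted technique, not the R-tree the plan calls for: `pair_with_grid` is a hypothetical name, and it assumes a single fixed `threshold` instead of the per-text `h * 5.0` threshold.

```python
import math
from collections import defaultdict


def pair_with_grid(funcs, nums, threshold=12.0):
    """Pair each function code with its nearest loop number within `threshold`.

    funcs, nums: lists of (text, x, y). Uses a uniform grid whose cell size
    equals `threshold`, so each lookup only inspects the 3x3 neighbouring cells.
    """
    grid = defaultdict(list)
    for t, x, y in nums:
        cell = (math.floor(x / threshold), math.floor(y / threshold))
        grid[cell].append((t, x, y))

    tags = []
    for ft, fx, fy in funcs:
        gx, gy = math.floor(fx / threshold), math.floor(fy / threshold)
        best, best_d = None, float("inf")
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for t, x, y in grid.get((gx + dx, gy + dy), ()):
                    d = math.hypot(fx - x, fy - y)
                    if d < best_d:
                        best, best_d = t, d
        if best is not None and best_d <= threshold:
            tags.append(f"{ft}-{best}")
    return tags


print(pair_with_grid([("TE", 100.0, 52.0)],
                     [("9101", 100.0, 48.0), ("9202", 400.0, 48.0)]))
```

Because the cell size equals the threshold, any number within range of a function code is guaranteed to sit in one of the nine inspected cells, which keeps the result identical to the linear scan for a fixed threshold.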

### 5.2 `mcp-server/server.py` - modify `_extract_pid_tags_from_text`

**Change:** include a `coords` field in the return format (PDF/OCR text has no coordinates)

```python
def _extract_pid_tags_from_text(text: str, coords_map: dict | None = None) -> list[dict]:
    """Extract tag/LineNo from plain text with regex.
    Coordinates are included when coords_map is provided.
    """
    # ... existing logic unchanged ...
    # when adding a tag:
    entry = {
        "tagNo": token,
        "kind": "equipment",
        "prefix": cls["prefix"],
        # ...
    }
    if coords_map and token in coords_map:
        entry["coords"] = coords_map[token]
    out.append(entry)
    return out
```

### 5.3 `mcp-server/server.py` - improve `match_pid_tags`

**Current:**
```python
_MIN_PREFIX_LEN = 4  # minimum length for prefix matching
```

**After:**
```python
import re

_MIN_PREFIX_LEN = 3  # lets short prefixes such as P-10101 match as well

# Additional matching strategy: number-based matching
def _match_by_number(pid_norm, ex_index):
    """P-10101 <-> p-10101.pv -> match only when number (10101) and prefix (P) both agree."""
    m = re.match(r'^([a-z]+)-(\d+)$', pid_norm)
    if not m:
        return None, 0.0
    pid_prefix, pid_num = m.group(1), m.group(2)
    for ex_norm, ex_orig in ex_index.items():
        # The number must end at '.', '-' or end of string (p-10101a.pv is rejected)
        em = re.match(r'^([a-z]+)-(\d+)(?:[.\-]|$)', ex_norm)
        if em and em.group(1) == pid_prefix and em.group(2) == pid_num:
            return ex_orig, 0.95
    return None, 0.0
```
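
To make the boundary rule concrete, here is a standalone sketch with a few sample names. It is illustrative only: `match_by_number` and the `names` list are hypothetical, and the real `ex_index` normalization in `server.py` is not reproduced here.

```python
import re


def match_by_number(pid_tag, experion_names):
    """Return the first Experion name whose prefix and number equal pid_tag's."""
    m = re.fullmatch(r"([a-z]+)-(\d+)", pid_tag.lower())
    if not m:
        return None
    prefix, number = m.groups()
    # The number must end at '.', '-' or end of string, so 101 never matches 1010.
    pattern = re.compile(rf"{re.escape(prefix)}-{re.escape(number)}(?:[.\-]|$)")
    for name in experion_names:
        if pattern.match(name.lower()):
            return name
    return None


names = ["P-1010.PV", "PSV-10101.PV", "P-10101.PV"]
print(match_by_number("P-10101", names))
print(match_by_number("P-101", names))
```

The first call resolves to `P-10101.PV` and skips `PSV-10101.PV`; the second finds nothing, because `P-101` is only a leading fragment of the longer numbers.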

### 5.4 `PidExtractorService.cs` - rewrite

**Current `ExtractFromStreamAsync` (lines 36-134):**

```csharp
public async Task<PidExtractionResult> ExtractFromStreamAsync(Stream stream, string fileName, bool useImageMode = false)
{
    var ext = Path.GetExtension(fileName).ToLowerInvariant();

    string text;
    Dictionary<string, (double X, double Y, double H)>? coords = null;

    if (ext == ".dxf")
        (text, coords) = ExtractDxfText(stream);
    else if (ext == ".pdf")
        text = ExtractPdfText(stream);
    // ...
```

**After:**

```csharp
public async Task<PidExtractionResult> ExtractFromStreamAsync(Stream stream, string fileName, bool useImageMode = false)
{
    var ext = Path.GetExtension(fileName).ToLowerInvariant();
    List<ExtractedItem> extractedItems;

    if (ext == ".dxf")
    {
        // GetTempFileName() would create an extra zero-byte .tmp file that is never
        // deleted once ".dxf" is appended, so build a unique path instead.
        var tmp = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName() + ".dxf");
        try
        {
            // Dispose the FileStream before the MCP call so the file is flushed
            // and no longer held open when Python reads it.
            await using (var fs = File.Create(tmp))
            {
                await stream.CopyToAsync(fs);
            }
            extractedItems = await ParseDxfViaMcpAsync(tmp);
        }
        finally
        {
            if (File.Exists(tmp)) File.Delete(tmp);
        }
    }
    else if (ext == ".pdf")
    {
        var text = ExtractPdfText(stream);
        if (string.IsNullOrWhiteSpace(text))
            return new PidExtractionResult(0, 0, 0);
        var json = await _mcp.ExtractPidTagsAsync(text, "pdf");
        extractedItems = ParseJson(json);
    }
    else
    {
        throw new NotSupportedException("Supported formats: .dxf .pdf");
    }

    if (extractedItems.Count == 0)
    {
        _logger.LogWarning("P&ID extraction returned 0 items - file: {FileName}", fileName);
        return new PidExtractionResult(0, 0, 0);
    }

    // The rest (mapping, duplicate check, DB save) is unchanged
}
```

**New methods to add:**

```csharp
private async Task<List<ExtractedItem>> ParseDxfViaMcpAsync(string filePath)
{
    try
    {
        var json = await _mcp.ParsePidDxfAsync(filePath);
        using var doc = JsonDocument.Parse(json);
        var root = doc.RootElement;
        var items = new List<ExtractedItem>();

        if (root.TryGetProperty("tags", out var tagsEl) && tagsEl.ValueKind == JsonValueKind.Array)
        {
            foreach (var tag in tagsEl.EnumerateArray())
            {
                var item = new ExtractedItem
                {
                    TagNo = tag.TryGetProperty("tagNo", out var tn) ? tn.GetString() ?? "" : "",
                    InstrumentType = tag.TryGetProperty("prefix", out var p) ? p.GetString() : null,
                    LineNumber = tag.TryGetProperty("lineNumber", out var ln) ? ln.GetString() : null,
                    Confidence = tag.TryGetProperty("confidence", out var c) ? c.GetDouble() : 0.5,
                };

                if (tag.TryGetProperty("coords", out var coordsEl))
                {
                    item.PosX = coordsEl.TryGetProperty("x", out var cx) ? cx.GetDouble() : null;
                    item.PosY = coordsEl.TryGetProperty("y", out var cy) ? cy.GetDouble() : null;
                }

                items.Add(item);
            }
        }

        _logger.LogInformation("[PID] MCP parsing complete: {Count} items (file: {File})",
            items.Count, Path.GetFileName(filePath));
        return items;
    }
    catch (Exception ex)
    {
        _logger.LogWarning(ex, "[PID] MCP parsing failed - using fallback (file: {File})", filePath);
        return await FallbackParseDxfAsync(filePath);
    }
}

private async Task<List<ExtractedItem>> FallbackParseDxfAsync(string filePath)
{
    var result = new List<ExtractedItem>();
    try
    {
        // The path travels as a separate argv entry (sys.argv[1]) and is never spliced
        // into the Python source, so a crafted file name cannot inject code.
        // Note: 'mcp-server' is resolved relative to the process working directory.
        const string script =
            "import sys,json,asyncio; sys.path.insert(0,'mcp-server'); " +
            "from server import _extract_pid_dxf_fast; " +
            "print(json.dumps(asyncio.run(_extract_pid_dxf_fast(sys.argv[1]))))";
        var psi = new ProcessStartInfo
        {
            FileName = "python3",
            RedirectStandardOutput = true,
            UseShellExecute = false,
            CreateNoWindow = true,
        };
        psi.ArgumentList.Add("-c");
        psi.ArgumentList.Add(script);
        psi.ArgumentList.Add(filePath);

        using var proc = Process.Start(psi);
        if (proc != null)
        {
            var output = await proc.StandardOutput.ReadToEndAsync();
            await proc.WaitForExitAsync();
            if (!string.IsNullOrWhiteSpace(output))
            {
                using var doc = JsonDocument.Parse(output);
                if (doc.RootElement.TryGetProperty("tags", out var tagsEl) &&
                    tagsEl.ValueKind == JsonValueKind.Array)
                {
                    foreach (var tag in tagsEl.EnumerateArray())
                    {
                        result.Add(new ExtractedItem
                        {
                            TagNo = tag.TryGetProperty("tagNo", out var tn) ? tn.GetString() ?? "" : "",
                            InstrumentType = tag.TryGetProperty("prefix", out var p) ? p.GetString() : null,
                            Confidence = tag.TryGetProperty("confidence", out var c) ? c.GetDouble() : 0.5,
                        });
                    }
                }
            }
        }
    }
    catch (Exception ex)
    {
        _logger.LogError(ex, "[PID] Fallback parsing failed as well");
    }
    return result;
}
```

**Methods to remove:**
- `ExtractDxfText` (lines 136-234) - remove entirely
- `FilterDxfText` (lines 240-260) - remove entirely
- `ReconstructBalloonTags` (lines 277-308) - remove entirely
- Related regexes: `_instrFuncRe`, `_loopNumRe` - remove

### 5.5 Extend the `ExtractedItem` model

**Current:**
```csharp
public class ExtractedItem
{
    public string TagNo { get; set; } = "";
    public string? EquipmentName { get; set; }
    public string? InstrumentType { get; set; }
    public string? LineNumber { get; set; }
    public string? PidDrawingNo { get; set; }
    public double Confidence { get; set; } = 0.5;
}
```

**After:**
```csharp
public class ExtractedItem
{
    public string TagNo { get; set; } = "";
    public string? EquipmentName { get; set; }
    public string? InstrumentType { get; set; }
    public string? LineNumber { get; set; }
    public string? PidDrawingNo { get; set; }
    public double Confidence { get; set; } = 0.5;
    public double? PosX { get; set; }
    public double? PosY { get; set; }
}
```

### 5.6 `ExperionCrawler.csproj` - remove netDxf

**Current:**
```xml
<!-- P&ID extraction -->
<PackageReference Include="netDxf" Version="2022.11.2" />
<PackageReference Include="PdfPig" Version="0.1.9" />
```

**After:**
```xml
<!-- P&ID extraction (PDF only - DXF is handled by Python ezdxf) -->
<PackageReference Include="PdfPig" Version="0.1.9" />
```

### 5.7 Change coordinate assignment on DB save

**Current (lines 108-112):**
```csharp
if (coords != null && coords.TryGetValue(item.TagNo, out var c))
{
    newItem.PosX = c.X;
    newItem.PosY = c.Y;
}
```

**After:**
```csharp
newItem.PosX = item.PosX;
newItem.PosY = item.PosY;
```

---

## 6. Additional Python Dependency

Add to `mcp-server/pyproject.toml` or `requirements.txt`:

```
rtree>=1.0.0  # spatial indexing for Balloon reconstruction (optional; falls back to a linear scan when missing)
```

---

## 7. Migration Plan

### Staged rollout

| Step | Work | Risk | Rollback |
|------|------|------|----------|
| 1 | Extend Python `_extract_pid_dxf_fast` (coordinates included) | Low - compatible with the existing API | git revert |
| 2 | Python Balloon reconstruction + R-tree | Low - optional dependency | fall back to running without rtree |
| 3 | Rewrite C# `ExtractFromStreamAsync` (MCP call) | Medium - depends on MCP | restore the previous code |
| 4 | Remove netDxf | Low - only after step 3 is complete | re-add the package |
| 5 | Implement the fallback | Low - safety net | - |
| 6 | Backup cleanup, tests | None | - |

### Compatibility

- Adding a `coords` field to the `parse_pid_dxf` return format is **backward compatible** (existing callers ignore coords)
- `extract_pid_tags` is unchanged (it is used by the PDF path)
- `PosX`/`PosY` on C# `ExtractedItem` are **nullable**, so existing code stays compatible

---

## 8. Expected Performance Impact

| Metric | Current | After | Improvement |
|--------|---------|-------|-------------|
| DXF parsing passes | 2 (C# + Python) | 1 (Python) | 50% fewer |
| Balloon reconstruction | O(n^2) | O(n log n) | ~100x at 1000 texts |
| Disk I/O | temp file read/write | temp file kept (MCP requires a filepath argument) | No change ⚠️ |
| NuGet dependencies | netDxf + PdfPig | PdfPig only | 1 fewer |
| On MCP failure | total failure | fallback runs | - |
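
The "~100x" figure in the Balloon row follows from comparing operation counts rather than from a measurement. A quick sanity check of that arithmetic, ignoring constant factors and assuming roughly `log2(n)` work per indexed lookup:

```python
import math

n = 1000
pairwise = n * n               # every func/num pair compared in the worst case
indexed = n * math.log2(n)     # n lookups at roughly log2(n) each
print(round(pairwise / indexed))
```

This is an asymptotic estimate only; the real speedup depends on how texts are distributed on the drawing and on the index overhead.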

---

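## 9. Verification Checklist

- [ ] `dotnet build src/Web/ExperionCrawler.csproj` succeeds
- [ ] `dotnet test` succeeds
- [ ] Extracting the sample DXF (`P10-EQP-BLOCK.dxf`) stores coordinates in the DB
- [ ] The fallback runs when extraction is attempted with the MCP server stopped
- [ ] The PDF extraction path is unaffected
- [ ] Existing DB data is unaffected (existing coordinates are kept)
- [ ] Calling the `parse_pid_dxf` MCP tool directly returns coords
- [ ] Balloon reconstruction results are verified (split tags such as TE-9101 are recombined)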
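
The coords item in the checklist can be automated with a small contract check on the tool's JSON response. A minimal sketch, assuming the response shape described in this document; `tags_missing_coords` and the sample payload are hypothetical.

```python
def tags_missing_coords(payload):
    """Return tagNo values whose `coords` lacks a numeric x or y."""
    missing = []
    for tag in payload.get("tags", []):
        coords = tag.get("coords") or {}
        has_xy = all(isinstance(coords.get(k), (int, float)) for k in ("x", "y"))
        if not has_xy:
            missing.append(tag.get("tagNo", ""))
    return missing


sample = {
    "success": True,
    "tags": [
        {"tagNo": "P-10101", "coords": {"x": 1234.5, "y": 5678.9, "h": 2.5}},
        {"tagNo": "TE-9101"},
        {"tagNo": "FT-100", "coords": {"x": 1.0}},
    ],
}
print(tags_missing_coords(sample))
```

An empty result for a DXF-derived response would indicate that every tag carries usable coordinates; PDF-derived responses are expected to list every tag, since that path has no coordinates.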

---

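## 10. Execution Prompt (to hand to another LLM)

> The prompt below can be passed to another LLM as-is. It contains the current state of the codebase, the exact locations to modify, and the caveats.

---

### Prompt start

You are a senior developer fluent in C# (.NET 8) and Python. Implement the DXF extraction logic improvement plan below **exactly**.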

---

### 0. Codebase structure

```
src/
├── Core/Application/Services/PidExtractorService.cs   - main file to modify (1067 lines)
├── Infrastructure/Mcp/McpClient.cs                    - MCP client (256 lines, do not modify)
└── Web/ExperionCrawler.csproj                         - netDxf to be removed here
mcp-server/
└── server.py                                          - Python parsing to be extended (2235 lines)
```

Single project: `src/Web/ExperionCrawler.csproj`. Core/Infrastructure are included through a `<Compile Include>` glob.

---

### 1. Phase 1: modify Python `server.py` (most important)

**File:** `mcp-server/server.py`

#### 1.1 Extend the `_extract_pid_dxf_fast` function (lines 296-366)

**Current behavior:** extracts TEXT/MTEXT from the DXF and classifies tags. No coordinate information.

**What to change:**

1. Add a `coords: {"x": float, "y": float, "h": float}` field to every tag
2. Add the Circle center mapping logic (port the circleCoords logic inside `ExtractDxfText` at C# `PidExtractorService.cs:167-188` to Python)
3. Also process `virtual_entities()` of Block INSERT entities (the counterpart of the Block AttributeDefinitions handling at C# `PidExtractorService.cs:162-164`)

**Circle mapping algorithm (see the C# original):**
```
foreach TEXT entity:
  bestR = infinity
  for each CIRCLE:
    d = distance(TEXT.pos, CIRCLE.center)
    if d < CIRCLE.radius AND CIRCLE.radius < bestR:
      bestR = CIRCLE.radius
      mapped_coords = CIRCLE.center
  if bestR < infinity:
    circle_coords[TEXT.value] = mapped_coords
```
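
The pseudocode above translates almost line for line into a pure function. A minimal sketch for reference, where `map_to_smallest_circle` and the sample circles are hypothetical and each circle is given as `(center_x, center_y, radius)`:

```python
import math


def map_to_smallest_circle(text_pos, circles):
    """Return the center of the smallest circle containing text_pos, or None."""
    x, y = text_pos
    best_r, best_center = math.inf, None
    for cx, cy, r in circles:
        # Strictly inside the circle, and smaller than the best one so far.
        if math.hypot(x - cx, y - cy) < r and r < best_r:
            best_r, best_center = r, (cx, cy)
    return best_center


circles = [(0.0, 0.0, 10.0), (1.0, 1.0, 3.0), (50.0, 50.0, 2.0)]
print(map_to_smallest_circle((1.5, 1.0), circles))
print(map_to_smallest_circle((30.0, 30.0), circles))
```

The first point lies inside both the radius-10 and the radius-3 circle and is mapped to the smaller one at `(1.0, 1.0)`; the second lies inside no circle and yields `None`, in which case the text keeps its own insert position.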

**Add the Balloon reconstruction function:**

Define a new function `_reconstruct_balloons_rtree(texts, circle_coords)` before `_extract_pid_dxf_fast`.

- Port `ReconstructBalloonTags` at C# `PidExtractorService.cs:277-308` to Python
- Regex patterns:
  - `_instr_func_re = r'^[FPLTASHQWVXZBDCRK][ICTREVYSAQGZ]{1,3}$'` (ISA function code)
  - `_loop_num_re = r'^\d{3,6}[A-Z]?$'` (loop number)
- Use an R-tree (the `rtree` package), but **fall back to a linear scan when it is missing**. Catch `ImportError` for graceful degradation.
- Threshold: `threshold = h * 5.0 if h > 0 else 12.0`
- Returns: `list[tuple[str, float, float, float]]`, that is `(tag, x, y, h)`

**Return format change:**

```python
# before
tags.append({"tagNo": txt, **cls, "layer": layer})

# after
tags.append({
    "tagNo": txt,
    **cls,
    "layer": layer,
    "coords": {"x": final_x, "y": final_y, "h": h},
})
```

**Note:** the existing `parse_pid_dxf` tool (lines 1702-1724) wraps the result of `_extract_pid_dxf_fast` as `{"success": True, **data}` and returns it. Adding `coords` to the return format is **backward compatible**: existing callers ignore coords.

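#### 1.2 Improve `match_pid_tags` (lines 1622-1696)

**Change 1:** `_MIN_PREFIX_LEN = 4` → `_MIN_PREFIX_LEN = 3`

**Change 2:** add a number-match condition to prefix matching (strategy 2).

The current prefix match works as `n.startswith(pid_norm + ".")`. Once the minimum length is lowered, a short tag such as `P-101` must not be allowed to match `P-1010` or `P-10101`, so verify that the numeric part matches in full as well:

```python
# Strategy 2 improvement: match prefix and number together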
import re
m = re.match(r'^([a-z]+)-(\d+)$', pid_norm)
if m:
    pid_prefix, pid_num = m.group(1), m.group(2)
    hit = next(
        (n for n in ex_norms
         if re.match(rf'^{re.escape(pid_prefix)}-{re.escape(pid_num)}(\.|-|$)', n)),
        None,
    )
    if hit:
        mappings.append({"pidTag": pid, "experionTag": ex_index[hit], "confidence": 0.95})
        continue
```

---

### 2. Phase 2: modify C# `PidExtractorService.cs`

**File:** `src/Core/Application/Services/PidExtractorService.cs`

#### 2.1 Add coordinate fields to the `ExtractedItem` model (lines 1052-1060)

```csharp
public class ExtractedItem
{
    public string TagNo { get; set; } = "";
    public string? EquipmentName { get; set; }
    public string? InstrumentType { get; set; }
    public string? LineNumber { get; set; }
    public string? PidDrawingNo { get; set; }
    public double Confidence { get; set; } = 0.5;
    public double? PosX { get; set; }    // added
    public double? PosY { get; set; }    // added
}
```

#### 2.2 Rewrite `ExtractFromStreamAsync` (lines 36-134)

**Key change:** on the DXF path, remove the C# netDxf parsing and replace it with a call to MCP `parse_pid_dxf`.

```
Previous flow:
  DXF → ExtractDxfText(netDxf) → text + coords
       → MCP ExtractPidTagsAsync(text) → tags
       → mapping → DB save

New flow:
  DXF → Stream → temp file
       → MCP ParsePidDxfAsync(filepath) → tags (with coords)
       → mapping → DB save
```

**Concrete implementation:**

1. For `.dxf`:
   - Save the Stream to a temp file with a unique `.dxf` name (for example `Path.Combine(Path.GetTempPath(), Path.GetRandomFileName() + ".dxf")`; `Path.GetTempFileName() + ".dxf"` leaves a stray `.tmp` file behind)
   - Close the file stream, then call `ParseDxfViaMcpAsync(tmp)` (new method, see below)
   - Delete the temp file in `finally`

2. For `.pdf`: **no change**. Keep the existing `ExtractPdfText` + `McpClient.ExtractPidTagsAsync`

3. New method `ParseDxfViaMcpAsync`:
   - Call `_mcp.ParsePidDxfAsync(filePath)`
   - Parse the `tags` array from the JSON response
   - Assign each tag's `coords.x`, `coords.y` to `ExtractedItem.PosX`, `PosY`
   - On failure, call `FallbackParseDxfAsync` (see 2.6)

4. Coordinate assignment on DB save (lines 108-112):
   ```csharp
   // before
   if (coords != null && coords.TryGetValue(item.TagNo, out var c))
   {
       newItem.PosX = c.X;
       newItem.PosY = c.Y;
   }
   // after
   newItem.PosX = item.PosX;
   newItem.PosY = item.PosY;
   ```

#### 2.3 Methods to remove

**Delete** the following methods **entirely**:
- `ExtractDxfText` (lines 136-234)
- `FilterDxfText` (lines 240-260)
- `ReconstructBalloonTags` (lines 277-308)
- Related regexes: `_instrFuncRe`, `_loopNumRe` (lines 265-270)

#### 2.4 Using directive to remove

Line 10: delete `using netDxf;`

#### 2.5 Notes for implementing `ParseDxfViaMcpAsync`

`McpClient.ParsePidDxfAsync` calls the `parse_pid_dxf` MCP tool. The returned JSON:

```json
{
  "success": true,
  "fluid_dictionary": {...},
  "linenos": [...],
  "tags": [
    {
      "tagNo": "P-10101",
      "kind": "equipment",
      "prefix": "P",
      "coords": {"x": 1234.5, "y": 5678.9, "h": 2.5},
      "layer": "INSTRUMENT"
    }
  ],
  "stats": {...}
}
```

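Only the `tags` array needs to be parsed. `linenos` needs no separate handling (the existing code does not process it either).

#### 2.6 Implement `FallbackParseDxfAsync` (MCP fallback)

When the MCP connection fails, call Python directly as a subprocess:

```csharp
private async Task<List<ExtractedItem>> FallbackParseDxfAsync(string filePath)
{
    // call _extract_pid_dxf_fast directly via python3 -c "..."
    // read the JSON from StandardOutput and parse it
}
```

**Note:** the file path may contain spaces or quotes, so pass it to the subprocess as a separate argument (read through `sys.argv`) instead of embedding it in the `-c` string (see 6b.3). Make server.py importable with `sys.path.insert(0, 'mcp-server')`.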

---

### 3. Phase 3: modify `ExperionCrawler.csproj`

**File:** `src/Web/ExperionCrawler.csproj`

Line 32: delete `<PackageReference Include="netDxf" Version="2022.11.2" />`

---

### 4. Phase 4: Python dependency

Add `rtree>=1.0.0` to `mcp-server/requirements.txt` or `pyproject.toml` (optional dependency).

---

### 5. Verification order

1. `dotnet build src/Web/ExperionCrawler.csproj`: compiles successfully
2. `dotnet test`: tests pass
3. Extraction test with the sample DXF file:
   - Check that `uploads/pid/P10-EQP-BLOCK.dxf` exists
   - Check in the DB that `PosX` and `PosY` are stored in the extraction result
4. Confirm that the fallback runs when a DXF is extracted with the MCP server stopped
5. Confirm that the PDF extraction path is unaffected

---

### 6. Things you must never do

1. **Do not modify `McpClient.cs`**: the `ParsePidDxfAsync` method (lines 168-169) already exists, so use it as-is
2. **Do not change the PDF path**: keep the `ExtractPdfText` + `ExtractPidTagsAsync` flow as it is
3. **Do not change the DB schema**: the `PidEquipment` table already has `pos_x` and `pos_y` columns
4. **Do not change the structure of the `ParseJson` method**: `ParseJson` cannot handle the nested `coords` structure, so `ParseDxfViaMcpAsync` must map it manually (`coords.x` → `PosX`)
5. **Do not use netDxf types after removing `using netDxf;`**: `ExtractDxfText` must be deleted in full
6. **Do not change the existing classification logic of `_extract_pid_dxf_fast`**: only add coordinates, and keep the existing regex classification (`_PID_LINENO_FULL_RE`, `_PID_TAG_RE`, etc.) as it is

### 6b. Things you must handle (found during re-diagnosis)

1. **Block INSERT handling is required**: C# `ExtractDxfText` (lines 162-164) extracts Block AttributeDefinitions, but Python `_extract_pid_dxf_fast` only processes `msp.query('TEXT MTEXT')`. TEXT inside blocks must also be extracted through `virtual_entities()` of `INSERT` entities. See the implementation at `pid_tracer.py:51`.
2. **Verify the regex mismatch**: C# `FilterDxfText`'s `[A-Z]{1,6}-\d{2,6}(-[A-Z0-9]+)*` is more permissive than Python `_PID_TAG_RE`'s `^([A-Z]{1,4})-(\d{3,6})([A-Z])?$`. Tags with a 5-6 letter prefix (`FICQA-101`), a 2-digit number (`P-10`), or a compound suffix (`PSV-10101-2A`) may be dropped. Either widen the Python regex or add a separate fallback regex.
3. **Fallback call safety**: inserting the file path directly into `ProcessStartInfo.Arguments` in `FallbackParseDxfAsync` carries a Python code injection risk. Because `UseShellExecute = false` this is not shell injection, but the file path could contain a value such as `'__import__("os").system("rm -rf /")#'`. Run a separate Python script file, or pass the path through `sys.argv`.
4. **State that temp file I/O is unavoidable**: MCP `parse_pid_dxf` takes `filepath: str` as its argument, so using a temp file is unavoidable. A "100% reduction in disk I/O" improvement goal cannot be met.
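
For item 3 above, a separate script that receives the path through `sys.argv` could look roughly like the following. This is a sketch under assumptions: the script itself, the `main` signature, and the injectable `extractor` parameter are hypothetical and exist so the wrapper can be exercised without `server.py`.

```python
import json
import sys


def main(argv, extractor, out=sys.stdout):
    """Run `extractor` on the DXF path given as argv[0] and print JSON.

    Returns a process exit code: 0 on success, 2 on bad usage, 1 on failure.
    The path is only ever handled as data, never evaluated as code.
    """
    if len(argv) != 1:
        print(json.dumps({"success": False, "error": "usage: <dxf path>"}), file=out)
        return 2
    try:
        data = extractor(argv[0])
    except Exception as exc:  # report failures as JSON so the caller can parse them
        print(json.dumps({"success": False, "error": str(exc)}), file=out)
        return 1
    print(json.dumps({"success": True, **data}), file=out)
    return 0
```

In the real script the entry point would presumably be something like `sys.exit(main(sys.argv[1:], lambda p: asyncio.run(_extract_pid_dxf_fast(p))))`, with the C# side passing the file path as its own process argument.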

---

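### 7. Recommended order of work

1. First modify and test the Python side (`server.py`)
2. Then modify the C# side (`PidExtractorService.cs`)
3. Finally remove netDxf from the `csproj`
4. Build → test → integration verification

Why Python goes first: after the C# change, the MCP call expects the Python side to return the new format (with coords), so Python must be ready before integration testing is possible.

### Prompt end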