A Detailed Walkthrough of the Recognition Program

Configuration file:

# ==========================================
# YOLOv8 standard three-head structure + SE attention
# ==========================================

# Parameters
nc: 4 # number of classes
scales:
  # [depth, width, max_channels]
  n: [0.33, 0.25, 1024]
  s: [0.33, 0.50, 1024]
  m: [0.67, 0.75, 768]
  l: [1.00, 1.00, 512]
  x: [1.00, 1.25, 512]

# Backbone (stock YOLOv8, unchanged)
backbone:
  # [from, repeats, module, args]
  - [-1, 1, Conv, [64, 3, 2]] # 0-P1/2
  - [-1, 1, Conv, [128, 3, 2]] # 1-P2/4
  - [-1, 3, C2f, [128, True]] # 2
  - [-1, 1, Conv, [256, 3, 2]] # 3-P3/8
  - [-1, 6, C2f, [256, True]] # 4
  - [-1, 1, Conv, [512, 3, 2]] # 5-P4/16
  - [-1, 6, C2f, [512, True]] # 6
  - [-1, 1, Conv, [1024, 3, 2]] # 7-P5/32
  - [-1, 3, C2f, [1024, True]] # 8
  - [-1, 1, SPPF, [1024, 5]] # 9

# Head (SE modules inserted before detection)
head:
  - [-1, 1, nn.Upsample, [None, 2, 'nearest']] # 10
  - [[-1, 6], 1, Concat, [1]] # 11 cat backbone P4: concatenate the upsampled features with backbone P4 along the channel dimension, fusing high-level semantics with low-level detail
  - [-1, 3, C2f, [512]] # 12 compress to 512 channels

  - [-1, 1, nn.Upsample, [None, 2, 'nearest']] # 13
  - [[-1, 4], 1, Concat, [1]] # 14 cat backbone P3
  - [-1, 3, C2f, [256]] # 15 (P3/8-small: catches small objects such as cables)

  # 🌟 [New] Add an SE "spotlight" on P3 to suppress carpet texture and highlight thin wires
  - [-1, 1, SEAttention, []] # 16: P3 after SE recalibration

  - [-1, 1, Conv, [256, 3, 2]] # 17: downsample again, reusing the SE-filtered features
  - [[-1, 12], 1, Concat, [1]] # 18 cat head P4
  - [-1, 3, C2f, [512]] # 19 (P4/16-medium: catches medium objects such as shoes)

  # 🌟 [New] Add an SE "spotlight" on P4 to suppress wall-adjacent background
  - [-1, 1, SEAttention, []] # 20: P4 after SE recalibration

  - [-1, 1, Conv, [512, 3, 2]] # 21
  - [[-1, 9], 1, Concat, [1]] # 22 cat head P5
  - [-1, 3, C2f, [1024]] # 23 (P5/32-large: catches large, distant objects)

  # 🌟 [New] Add an SE "spotlight" on P5
  - [-1, 1, SEAttention, []] # 24: P5 after SE recalibration

  # The final detection head takes the SE-recalibrated layers 16 (P3), 20 (P4), 24 (P5)
  - [[16, 20, 24], 1, Detect, [nc]] # 25

Here, nc is the number of classes.
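The config plugs in a custom SEAttention module whose code is not shown in this post. For reference, here is a minimal sketch of the classic Squeeze-and-Excitation block; note the channel count is passed explicitly in this sketch, whereas the real YAML integration presumably infers it from the incoming feature map.

```python
import torch
import torch.nn as nn

class SEAttention(nn.Module):
    """Minimal Squeeze-and-Excitation block (sketch, not the exact module used)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        hidden = max(channels // reduction, 4)       # keep the bottleneck non-degenerate
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze: global spatial context
        self.fc = nn.Sequential(
            nn.Linear(channels, hidden, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels, bias=False),
            nn.Sigmoid(),                            # per-channel gates in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                 # excite: reweight each channel
```

Because the gates lie in (0, 1), the block can only attenuate channels relative to each other, which is exactly the "spotlight" effect the comments above describe.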

scales:
  # [depth, width, max_channels]
  n: [0.33, 0.25, 1024]
  s: [0.33, 0.50, 1024]
  m: [0.67, 0.75, 768]
  l: [1.00, 1.00, 512]
  x: [1.00, 1.25, 512]
Taking the n scale as an example:

  • depth = 0.33 (depth scaling factor)

  • width = 0.25 (width scaling factor)

  • max_channels = 1024 (upper bound on channel count)
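As a worked example, here is roughly how those three numbers are applied when the model is built. This is a sketch in the spirit of Ultralytics' parse_model logic; the helper names and exact rounding rules here are illustrative.

```python
import math

def make_divisible(x: float, divisor: int = 8) -> int:
    """Round a channel count up to the nearest multiple of `divisor`."""
    return math.ceil(x / divisor) * divisor

def scaled(repeats: int, channels: int, depth: float, width: float, max_channels: int):
    r = max(round(repeats * depth), 1)                        # depth scales block repeats
    c = make_divisible(min(channels, max_channels) * width)   # width scales channels
    return r, c

# 'n' scale [0.33, 0.25, 1024] applied to "- [-1, 3, C2f, [128, True]]":
print(scaled(3, 128, 0.33, 0.25, 1024))  # → (1, 32): 1 repeat, 32 channels
```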

YOLOv8 uses DFL (Distribution Focal Loss).

Instead of predicting a single number per coordinate, each coordinate is predicted as a probability distribution.

In YOLOv8, reg_max is:

the number of discrete bins predicted per coordinate.

The default is 16.

A bounding box has 4 sides:

left
top
right
bottom

So the regression output has 4 × 16 = 64 channels.
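To make the "distribution instead of a number" idea concrete, here is a tiny pure-Python sketch of the DFL decoding step; the logit values are made up for illustration.

```python
import math

# One coordinate is predicted as logits over reg_max bins; the decoded
# offset is the expectation over the softmax of those logits.
reg_max = 16
logits = [0.0] * reg_max
logits[7], logits[8] = 4.0, 4.0        # mass split evenly between bins 7 and 8

exps = [math.exp(v) for v in logits]
total = sum(exps)
probs = [e / total for e in exps]
offset = sum(i * p for i, p in enumerate(probs))  # == 7.5 by symmetry
```

Because the distribution is symmetric about 7.5, the expected bin index is exactly 7.5 grid cells: the model can express sub-cell precision that a single hard prediction cannot.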


[from, repeats, module, args]

[-1, 1, Conv, [64, 3, 2]]

Here, -1 means the input comes from the previous layer's output.

The second field (1) is the module's repeat count.

The third field is the module type.

The fourth field, [64, 3, 2], means:

output channels = 64
kernel size = 3
stride = 2


[-1, 3, C2f, [128, True]]

True is a boolean argument of the C2f module that controls whether shortcut (residual) connections are enabled.

Reference code:

class C2f(nn.Module):
    """Faster Implementation of CSP Bottleneck with 2 convolutions."""

    def __init__(self, c1: int, c2: int, n: int = 1, shortcut: bool = False, g: int = 1, e: float = 0.5):
        """Initialize a CSP bottleneck with 2 convolutions.

        Args:
            c1 (int): Input channels.
            c2 (int): Output channels.
            n (int): Number of Bottleneck blocks.
            shortcut (bool): Whether to use shortcut connections.
            g (int): Groups for convolutions.
            e (float): Expansion ratio.
        """
        super().__init__()
        self.c = int(c2 * e)  # hidden channels
        # Conv(in_channels, out_channels, kernel_size, stride)
        self.cv1 = Conv(c1, 2 * self.c, 1, 1)
        self.cv2 = Conv((2 + n) * self.c, c2, 1)  # fuses channel information and reduces the channel count; optional act=FReLU(c2)
        self.m = nn.ModuleList(Bottleneck(self.c, self.c, shortcut, g, k=((3, 3), (3, 3)), e=1.0) for _ in range(n))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward pass through C2f layer."""
        y = list(self.cv1(x).chunk(2, 1))
        y.extend(m(y[-1]) for m in self.m)
        return self.cv2(torch.cat(y, 1))  # concatenate along the channel dimension

    def forward_split(self, x: torch.Tensor) -> torch.Tensor:
        """Forward pass using split() instead of chunk()."""
        y = self.cv1(x).split((self.c, self.c), 1)
        y = [y[0], y[1]]
        y.extend(m(y[-1]) for m in self.m)
        return self.cv2(torch.cat(y, 1))

Here, c1 is the number of input feature-map channels and c2 is the number of output feature-map channels.

The Bottleneck module is as follows:

class Bottleneck(nn.Module):
    """Standard bottleneck."""

    def __init__(
        self, c1: int, c2: int, shortcut: bool = True, g: int = 1, k: tuple[int, int] = (3, 3), e: float = 0.5
    ):
        """Initialize a standard bottleneck module.

        Args:
            c1 (int): Input channels.
            c2 (int): Output channels.
            shortcut (bool): Whether to use shortcut connection.
            g (int): Groups for convolutions.
            k (tuple): Kernel sizes for convolutions.
            e (float): Expansion ratio.
        """
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, k[0], 1)
        self.cv2 = Conv(c_, c2, k[1], 1, g=g)
        self.add = shortcut and c1 == c2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Apply bottleneck with optional shortcut connection."""
        return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))

Here, c_ is the number of hidden (bottleneck) channels.

  • With the default e=0.5, the hidden channels are half the output channels.
      x
      │
      ├─────────────┐
      │             │
      ▼             │
  Conv(cv1)         │
      │             │
  Conv(cv2)         │
      │             │
      ▼             │
    F(x)            │
      │             │
      └─────── + ───┘
              │
              ▼
              y

Advantages:

① Prevents vanishing gradients

Deep networks become easier to train.

② Easier learning target

The block only needs to learn the residual F(x), not the full mapping.


Why the input channel count of cv2 is (2 + n) * self.c

Step 1: cv1

cv1: c1 → 2c

Output:

(B, 2c, H, W)

Step 2: chunk(2)

Split the channels in two:

(B, c, H, W)
(B, c, H, W)

The list now holds 2 feature maps:

y = [y1, y2]

Step 3: Bottleneck

Code:

for m in self.m:
    y.append(m(y[-1]))

If:

n = 3

this produces:

3 new feature maps

so y becomes:

y = [y1, y2, y3, y4, y5]

Count:

2 + n

y.extend(m(y[-1]) for m in self.m)

For each Bottleneck:

take the last feature map in y as input
compute a new feature map
append it to y
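The list bookkeeping above can be sketched with plain Python, replacing tensors with strings and the Bottleneck modules with hypothetical stand-in functions:

```python
# Pure-Python sketch of C2f's feature-list growth; "b0", "b1", "b2" are
# stand-ins for the n Bottleneck modules.
n = 3
y = ["y1", "y2"]                                   # the two halves from chunk(2, 1)
bottlenecks = [lambda t, i=i: f"b{i}({t})" for i in range(n)]
for m in bottlenecks:
    y.append(m(y[-1]))                             # each block feeds on the previous output
print(len(y))  # → 5, i.e. 2 + n, so cv2 sees (2 + n) * c channels
```

Note how each new entry wraps the previous one (y5 = b2(b1(b0(y2)))): the chain deepens the features, while the concatenation at the end preserves every intermediate stage.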

The Backbone progressively downsamples the image into multi-scale feature maps:

  • Spatial size: 640 → 320 → 160 → 80 → 40 → 20 (halved at every stride-2 conv)
  • Channels: 3 → 64 → 128 → 256 → 512 → 1024 (deeper layers carry stronger semantics)
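A quick sanity check of those spatial sizes: five stride-2 convolutions halve a 640×640 input down to 20×20.

```python
# Each stride-2 conv halves the spatial size.
size, sizes = 640, [640]
for _ in range(5):
    size //= 2
    sizes.append(size)
print(sizes)  # → [640, 320, 160, 80, 40, 20]
```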

The last module of the YOLOv8 backbone is SPPF (Spatial Pyramid Pooling - Fast).
Its purpose:

greatly enlarge the receptive field and fuse multi-scale context, at almost no additional compute cost.

[-1, 1, SPPF, [1024, 5]]

  • -1 → input comes from the previous layer (layer 8)
  • 1 → executed once
  • SPPF → use the SPPF module
  • [1024, 5] → arguments

Argument meanings:

1024 → output channels
5 → max-pool kernel size

Input feature map:

(B, 1024, 20, 20)

Output feature map:

(B, 1024, 20, 20)

The spatial size is unchanged.

Full code:

class SPPF(nn.Module):
    """Spatial Pyramid Pooling - Fast (SPPF) layer for YOLOv5 by Glenn Jocher."""

    def __init__(self, c1: int, c2: int, k: int = 5, n: int = 3, shortcut: bool = False):
        """Initialize the SPPF layer with given input/output channels and kernel size.

        Args:
            c1 (int): Input channels.
            c2 (int): Output channels.
            k (int): Kernel size.
            n (int): Number of pooling iterations.
            shortcut (bool): Whether to use shortcut connection.

        Notes:
            This module is equivalent to SPP(k=(5, 9, 13)).
        """
        super().__init__()
        c_ = c1 // 2  # hidden channels: a 1x1 conv first compresses the channels from c1 to c1 // 2
        self.cv1 = Conv(c1, c_, 1, 1, act=False)  # no activation, so the pooling layers see the raw linear features
        self.cv2 = Conv(c_ * (n + 1), c2, 1, 1)  # original features + n pooled versions
        self.m = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
        self.n = n
        self.add = shortcut and c1 == c2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Apply sequential pooling operations to input and return concatenated feature maps."""
        y = [self.cv1(x)]  # channel compression
        y.extend(self.m(y[-1]) for _ in range(getattr(self, "n", 3)))
        y = self.cv2(torch.cat(y, 1))
        return y + x if getattr(self, "add", False) else y

Its core idea is to enlarge the receptive field through consecutive MaxPool operations and to concatenate and fuse features with different receptive fields.
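The docstring's claim that SPPF(k=5) is equivalent to SPP(k=(5, 9, 13)) can be checked with a tiny 1-D max pool written in plain Python, where padding with -inf plays the role of MaxPool2d's implicit padding:

```python
def maxpool1d(xs, k):
    """1-D max pool, stride 1, padding k // 2 (padded with -inf)."""
    pad = k // 2
    padded = [float("-inf")] * pad + xs + [float("-inf")] * pad
    return [max(padded[i:i + k]) for i in range(len(xs))]

xs = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]
once = maxpool1d(xs, 5)
twice = maxpool1d(once, 5)
assert twice == maxpool1d(xs, 9)                  # two 5-pools == one 9-pool
assert maxpool1d(twice, 5) == maxpool1d(xs, 13)   # three 5-pools == one 13-pool
```

Stacking a stride-1 pool of size k grows the window to 2k-1, then 3k-2, which is exactly why one small pooling layer applied repeatedly replaces SPP's three parallel large kernels at lower cost.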


Detailed explanation of the Head:

What each argument in nn.Upsample, [None, 2, 'nearest'] means.

The usual PyTorch form is:

nn.Upsample(size=None, scale_factor=2, mode='nearest')

which maps onto the args here as follows:

✅ First: None (size)

  • size=None means: no fixed output size is forced (e.g. the output is not pinned to 40×40)
  • the scale is instead determined by scale_factor

✅ Second: 2 (scale_factor)

  • scale_factor=2 means: both height and width are multiplied by 2
  • e.g. 20×20 → 40×40

✅ Third: 'nearest' (mode)

  • mode='nearest' means: nearest-neighbor interpolation
  • Properties:
    • fast, simple operator (embedded-friendly)
    • no heavy interpolation arithmetic
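Nearest-neighbor ×2 upsampling simply duplicates each pixel into a 2×2 block, which is why it is so cheap. A pure-Python sketch of what the operator does per channel:

```python
def upsample_nearest_2x(grid):
    """Nearest-neighbor x2 upsampling of a 2-D grid (one channel)."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(2)]  # duplicate each column
        out.append(wide)
        out.append(list(wide))                     # duplicate each row
    return out

g = [[1, 2],
     [3, 4]]
print(upsample_nearest_2x(g))
# → [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
```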

[[16, 20, 24], 1, Detect, [nc]]

Part            Meaning
[16, 20, 24]    inputs come from three feature layers (the SE outputs)
1               executed once
Detect          detection head module
[nc]            number of classes

The Detect code is as follows:

class Detect(nn.Module):
    """YOLO Detect head for object detection models.

    This class implements the detection head used in YOLO models for predicting bounding boxes and class probabilities.
    It supports both training and inference modes, with optional end-to-end detection capabilities.

    Attributes:
        dynamic (bool): Force grid reconstruction.
        export (bool): Export mode flag.
        format (str): Export format.
        end2end (bool): End-to-end detection mode.
        max_det (int): Maximum detections per image.
        shape (tuple): Input shape.
        anchors (torch.Tensor): Anchor points.
        strides (torch.Tensor): Feature map strides.
        legacy (bool): Backward compatibility for v3/v5/v8/v9 models.
        xyxy (bool): Output format, xyxy or xywh.
        nc (int): Number of classes.
        nl (int): Number of detection layers.
        reg_max (int): DFL channels.
        no (int): Number of outputs per anchor.
        stride (torch.Tensor): Strides computed during build.
        cv2 (nn.ModuleList): Convolution layers for box regression.
        cv3 (nn.ModuleList): Convolution layers for classification.
        dfl (nn.Module): Distribution Focal Loss layer.
        one2one_cv2 (nn.ModuleList): One-to-one convolution layers for box regression.
        one2one_cv3 (nn.ModuleList): One-to-one convolution layers for classification.

    Methods:
        forward: Perform forward pass and return predictions.
        forward_end2end: Perform forward pass for end-to-end detection.
        bias_init: Initialize detection head biases.
        decode_bboxes: Decode bounding boxes from predictions.
        postprocess: Post-process model predictions.

    Examples:
        Create a detection head for 80 classes
        >>> detect = Detect(nc=80, ch=(256, 512, 1024))
        >>> x = [torch.randn(1, 256, 80, 80), torch.randn(1, 512, 40, 40), torch.randn(1, 1024, 20, 20)]
        >>> outputs = detect(x)
    """

    dynamic = False  # force grid reconstruction
    export = False  # export mode
    format = None  # export format
    max_det = 300  # max_det
    agnostic_nms = False  # class-agnostic NMS switch
    shape = None
    anchors = torch.empty(0)  # init
    strides = torch.empty(0)  # init
    legacy = False  # backward compatibility for v3/v5/v8/v9 models
    xyxy = False  # xyxy or xywh output
    # Coordinate format: the default False means boxes are output as
    # [center_x, center_y, w, h] (xywh); if set to True, the output becomes
    # [x1, y1, x2, y2]. When writing C++ post-processing, make sure you
    # handle the right format!

    def __init__(self, nc: int = 80, reg_max=16, end2end=False, ch: tuple = ()):
        """Initialize the YOLO detection layer with specified number of classes and channels.

        Args:
            nc (int): Number of classes.
            reg_max (int): Maximum number of DFL channels.
            end2end (bool): Whether to use end-to-end NMS-free detection.
            ch (tuple): Tuple of channel sizes from backbone feature maps.
        """
        super().__init__()
        self.nc = nc  # number of classes
        self.nl = len(ch)  # number of detection layers
        self.reg_max = reg_max  # DFL channels (ch[0] // 16 to scale 4/8/12/16/20 for n/s/m/l/x)
        self.no = nc + self.reg_max * 4  # number of outputs per anchor
        self.stride = torch.zeros(self.nl)  # strides computed during build
        c2, c3 = max((16, ch[0] // 4, self.reg_max * 4)), max(ch[0], min(self.nc, 100))  # channels
        # For each detection scale, build a bounding-box regression head:
        # two 3x3 convs extract features, then a 1x1 conv outputs the DFL
        # distribution (4 * reg_max channels).
        self.cv2 = nn.ModuleList(
            nn.Sequential(Conv(x, c2, 3), Conv(c2, c2, 3), nn.Conv2d(c2, 4 * self.reg_max, 1)) for x in ch
        )
        self.cv3 = (
            nn.ModuleList(nn.Sequential(Conv(x, c3, 3), Conv(c3, c3, 3), nn.Conv2d(c3, self.nc, 1)) for x in ch)
            if self.legacy
            else nn.ModuleList(
                nn.Sequential(
                    nn.Sequential(DWConv(x, x, 3), Conv(x, c3, 1)),
                    nn.Sequential(DWConv(c3, c3, 3), Conv(c3, c3, 1)),
                    nn.Conv2d(c3, self.nc, 1),
                )
                for x in ch
            )
        )
        self.dfl = DFL(self.reg_max) if self.reg_max > 1 else nn.Identity()

        if end2end:
            self.one2one_cv2 = copy.deepcopy(self.cv2)
            self.one2one_cv3 = copy.deepcopy(self.cv3)

    @property
    def one2many(self):
        """Returns the one-to-many head components, here for v3/v5/v8/v9/11 backward compatibility."""
        return dict(box_head=self.cv2, cls_head=self.cv3)

    @property
    def one2one(self):
        """Returns the one-to-one head components."""
        return dict(box_head=self.one2one_cv2, cls_head=self.one2one_cv3)

    @property
    def end2end(self):
        """Checks if the model has one2one for v3/v5/v8/v9/11 backward compatibility."""
        return getattr(self, "_end2end", True) and hasattr(self, "one2one")

    @end2end.setter
    def end2end(self, value):
        """Override the end-to-end detection mode."""
        self._end2end = value

    def forward_head(
        self, x: list[torch.Tensor], box_head: torch.nn.Module = None, cls_head: torch.nn.Module = None
    ) -> dict[str, torch.Tensor]:
        """Concatenates and returns predicted bounding boxes and class probabilities."""
        if box_head is None or cls_head is None:  # for fused inference
            return dict()
        bs = x[0].shape[0]  # batch size
        boxes = torch.cat([box_head[i](x[i]).view(bs, 4 * self.reg_max, -1) for i in range(self.nl)], dim=-1)
        scores = torch.cat([cls_head[i](x[i]).view(bs, self.nc, -1) for i in range(self.nl)], dim=-1)
        return dict(boxes=boxes, scores=scores, feats=x)

    def forward(
        self, x: list[torch.Tensor]
    ) -> dict[str, torch.Tensor] | torch.Tensor | tuple[torch.Tensor, dict[str, torch.Tensor]]:
        """Concatenates and returns predicted bounding boxes and class probabilities."""
        preds = self.forward_head(x, **self.one2many)
        if self.end2end:
            x_detach = [xi.detach() for xi in x]
            one2one = self.forward_head(x_detach, **self.one2one)
            preds = {"one2many": preds, "one2one": one2one}
        if self.training:
            return preds
        y = self._inference(preds["one2one"] if self.end2end else preds)
        if self.end2end:
            y = self.postprocess(y.permute(0, 2, 1))
        return y if self.export else (y, preds)

    def _inference(self, x: dict[str, torch.Tensor]) -> torch.Tensor:
        """Decode predicted bounding boxes and class probabilities based on multiple-level feature maps.

        Args:
            x (dict[str, torch.Tensor]): List of feature maps from different detection layers.

        Returns:
            (torch.Tensor): Concatenated tensor of decoded bounding boxes and class probabilities.
        """
        # Inference path
        dbox = self._get_decode_boxes(x)
        return torch.cat((dbox, x["scores"].sigmoid()), 1)

    def _get_decode_boxes(self, x: dict[str, torch.Tensor]) -> torch.Tensor:
        """Get decoded boxes based on anchors and strides."""
        shape = x["feats"][0].shape  # BCHW
        if self.dynamic or self.shape != shape:
            self.anchors, self.strides = (a.transpose(0, 1) for a in make_anchors(x["feats"], self.stride, 0.5))
            self.shape = shape

        dbox = self.decode_bboxes(self.dfl(x["boxes"]), self.anchors.unsqueeze(0)) * self.strides
        return dbox

    def bias_init(self):
        """Initialize Detect() biases, WARNING: requires stride availability."""
        for i, (a, b) in enumerate(zip(self.one2many["box_head"], self.one2many["cls_head"])):  # from
            a[-1].bias.data[:] = 2.0  # box
            b[-1].bias.data[: self.nc] = math.log(
                5 / self.nc / (640 / self.stride[i]) ** 2
            )  # cls (.01 objects, 80 classes, 640 img)
        if self.end2end:
            for i, (a, b) in enumerate(zip(self.one2one["box_head"], self.one2one["cls_head"])):  # from
                a[-1].bias.data[:] = 2.0  # box
                b[-1].bias.data[: self.nc] = math.log(
                    5 / self.nc / (640 / self.stride[i]) ** 2
                )  # cls (.01 objects, 80 classes, 640 img)

    def decode_bboxes(self, bboxes: torch.Tensor, anchors: torch.Tensor, xywh: bool = True) -> torch.Tensor:
        """Decode bounding boxes from predictions."""
        return dist2bbox(
            bboxes,
            anchors,
            xywh=xywh and not self.end2end and not self.xyxy,
            dim=1,
        )

    def postprocess(self, preds: torch.Tensor) -> torch.Tensor:
        """Post-processes YOLO model predictions.

        Args:
            preds (torch.Tensor): Raw predictions with shape (batch_size, num_anchors, 4 + nc) with last dimension
                format [x, y, w, h, class_probs].

        Returns:
            (torch.Tensor): Processed predictions with shape (batch_size, min(max_det, num_anchors), 6) and last
                dimension format [x, y, w, h, max_class_prob, class_index].
        """
        boxes, scores = preds.split([4, self.nc], dim=-1)
        scores, conf, idx = self.get_topk_index(scores, self.max_det)
        boxes = boxes.gather(dim=1, index=idx.repeat(1, 1, 4))
        return torch.cat([boxes, scores, conf], dim=-1)

    def get_topk_index(self, scores: torch.Tensor, max_det: int) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        """Get top-k indices from scores.

        Args:
            scores (torch.Tensor): Scores tensor with shape (batch_size, num_anchors, num_classes).
            max_det (int): Maximum detections per image.

        Returns:
            (torch.Tensor, torch.Tensor, torch.Tensor): Top scores, class indices, and filtered indices.
        """
        batch_size, anchors, nc = scores.shape  # i.e. shape(16,8400,84)
        # Use max_det directly during export for TensorRT compatibility (requires k to be constant),
        # otherwise use min(max_det, anchors) for safety with small inputs during Python inference
        k = max_det if self.export else min(max_det, anchors)
        if self.agnostic_nms:
            scores, labels = scores.max(dim=-1, keepdim=True)
            scores, indices = scores.topk(k, dim=1)
            labels = labels.gather(1, indices)
            return scores, labels, indices
        ori_index = scores.max(dim=-1)[0].topk(k)[1].unsqueeze(-1)
        scores = scores.gather(dim=1, index=ori_index.repeat(1, 1, nc))
        scores, index = scores.flatten(1).topk(k)
        idx = ori_index[torch.arange(batch_size)[..., None], index // nc]  # original index
        return scores[..., None], (index % nc)[..., None].float(), idx

    def fuse(self) -> None:
        """Remove the one2many head for inference optimization."""
        self.cv2 = self.cv3 = None
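The index arithmetic in get_topk_index (index // nc for the anchor, index % nc for the class) is easiest to see in a tiny plain-Python version of the flatten-then-topk trick; the score values below are made up.

```python
# One image, 4 anchors x 3 classes: flattening the (anchor, class) score
# table lets a single top-k pick both the anchor and the class.
nc = 3
scores = [
    [0.1, 0.9, 0.2],
    [0.8, 0.05, 0.1],
    [0.3, 0.2, 0.95],
    [0.4, 0.1, 0.1],
]
flat = [(s, i) for i, s in enumerate(v for row in scores for v in row)]
top2 = sorted(flat, reverse=True)[:2]
picks = [(i // nc, i % nc, s) for s, i in top2]  # (anchor, class, score)
print(picks)  # → [(2, 2, 0.95), (0, 1, 0.9)]
```

The tensor version does the same thing with topk and gather so it stays batched and export-friendly, but the recovered (anchor, class) pairs are identical.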