Top-Down Analysis

A Top-Down Method for Performance Analysis and Counters Architecture

Why was this methodology proposed?

The traditional performance-evaluation approach:

(figure: the traditional performance-evaluation approach)

Its limitations are as follows:

  1. Stalls overlap: many units of an OoO processor work in parallel, so a dcache-miss penalty can be hidden (independent instructions keep executing during the miss)
  2. Speculative execution: wrong-path results produced by speculation never commit, so their real impact on performance is small (relative to the correct path), yet naive accounting still charges for them
  3. Penalties are workload-dependent: the penalty of an event is not a fixed number of cycles but depends on the workload, whereas the traditional approach assumes the same penalty for every occurrence
  4. Restriction to a pre-defined set of miss-events: only the common events are accounted for, so subtler problems (e.g. the frontend failing to supply enough instructions) can be missed
  5. Superscalar inaccuracy: a superscalar processor can execute and retire several instructions per cycle, so per-instruction penalty accounting breaks down; limited issue bandwidth is one example

Intel therefore proposed the Top-Down analysis method (TMA), which can be used not only for microarchitecture exploration but also for software analysis.

The analysis works like a tree: you walk from the root toward the leaves step by step, at each level drilling into the category with the largest share. Its key features are:

  1. Introduces a performance event for bad speculation (misprediction) and places it at the top of the hierarchy

  2. Adds a small set of dedicated counters (12 of them)

  3. Picks the key pipeline stage (issue) and the right moment to count

    For example, rather than accumulating the total time spent on memory accesses, it observes how long execution-unit utilization stays low while memory accesses are outstanding

  4. Uses generic events

The analysis hierarchy is as follows:

(figure: the TMA analysis hierarchy)

The issue stage is chosen as the point at which events are decomposed.

Performance events are split into four categories: Frontend Bound, Backend Bound, Bad Speculation, and Retiring.

(figure: top-level breakdown at the issue stage)
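The top-level decision can be pictured per issue slot, per cycle: if the frontend filled the slot, the slot counts as Retiring when the uop is on the correct path and as Bad Speculation otherwise; if the slot is empty, it counts as Backend Bound when the backend refused to accept uops and as Frontend Bound otherwise. Below is a minimal Chisel sketch of that decision tree; the per-slot signals slotIssued, slotRetires and backendStall are hypothetical stand-ins (a real implementation cannot know at issue time whether a uop will retire, which is why the paper counts recovery bubbles at retirement instead).

import chisel3._
import chisel3.util._

// Classifies every issue slot of the current cycle into exactly one TMA category.
class TopDownSlotClassifier(coreWidth: Int) extends Module {
  val io = IO(new Bundle {
    val slotIssued    = Input(Vec(coreWidth, Bool()))  // frontend delivered a uop to this slot
    val slotRetires   = Input(Vec(coreWidth, Bool()))  // that uop is on the correct path
    val backendStall  = Input(Bool())                  // backend refused to accept uops this cycle
    val retiring      = Output(UInt(log2Ceil(coreWidth + 1).W))
    val badSpec       = Output(UInt(log2Ceil(coreWidth + 1).W))
    val frontendBound = Output(UInt(log2Ceil(coreWidth + 1).W))
    val backendBound  = Output(UInt(log2Ceil(coreWidth + 1).W))
  })
  io.retiring      := PopCount(io.slotIssued.zip(io.slotRetires).map { case (i, r) => i && r })
  io.badSpec       := PopCount(io.slotIssued.zip(io.slotRetires).map { case (i, r) => i && !r })
  io.backendBound  := PopCount(io.slotIssued.map(i => !i && io.backendStall))
  io.frontendBound := PopCount(io.slotIssued.map(i => !i && !io.backendStall))
}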

Frontend Bound category

Frontend Bound is split into Latency and Bandwidth:

Frontend Latency Bound: icache misses and similar events leave the frontend unable to deliver any valid instructions.

Frontend Bandwidth Bound: the decoders are inefficient, or the fetch width is too small (the backend executes faster than the frontend can supply).

BOOM rarely ends up Latency Bound here: its backend has few execution units and cannot exploit much out-of-order parallelism, so the backend tends to be the limiter instead.

Bad Speculation category

It is subdivided into branch mispredictions and machine clears.

Backend Bound category

This covers situations such as dcache misses or long-latency divider execution.

It is further subdivided into Memory Bound and Core Bound.

To sustain maximum IPC on a 4-wide machine, the execution units have to be kept as busy as possible.

Memory Bound: simply put, the performance lost in the memory subsystem (LSU).

Core Bound: poor execution-port utilization, e.g. long-latency operations such as divides.

The figure below further subdivides Memory Bound:

(figure: Memory Bound sub-breakdown)

Counter architecture

(figure: the proposed counter architecture)

Entries marked with * already exist in modern PMUs; only eight new performance events are needed to implement Level-1 TMA. The formulas are as follows:

(figure: Level-1 TMA formulas)
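For reference, the Level-1 formulas in the paper are roughly the following, where Total_Slots is the issue width times the number of clock cycles (4 × Clocks on a 4-wide machine):

Frontend_Bound  = Fetch_Bubbles / Total_Slots
Bad_Speculation = (Slots_Issued - Slots_Retired + Recovery_Bubbles) / Total_Slots
Retiring        = Slots_Retired / Total_Slots
Backend_Bound   = 1 - (Frontend_Bound + Bad_Speculation + Retiring)

For example, over 1,000 cycles a 4-wide machine has 4,000 slots; if 2,400 retired slots are counted, Retiring is 60%.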

Rocket Chip performance counters

RISC-V only standardizes the performance-counter interface (the hpmcounter/hpmevent CSRs); it does not prescribe what the counters actually count, so that is left to the implementation. Rocket Chip does it as follows.

An EventSets object is built from three EventSet instances (roughly: committed-instruction-type events, pipeline stall/hazard events, and cache/TLB events); within each set, individual events are chosen through the event selector.

val perfEvents = new EventSets(Seq(
  new EventSet((mask, hits) => Mux(wb_xcpt, mask(0), wb_valid && pipelineIDToWB((mask & hits).orR)), Seq(
    ("exception", () => false.B),
    ("load", () => id_ctrl.mem && id_ctrl.mem_cmd === M_XRD && !id_ctrl.fp),
    ("store", () => id_ctrl.mem && id_ctrl.mem_cmd === M_XWR && !id_ctrl.fp),
    ("amo", () => usingAtomics.B && id_ctrl.mem && (isAMO(id_ctrl.mem_cmd) || id_ctrl.mem_cmd.isOneOf(M_XLR, M_XSC))),
    ("system", () => id_ctrl.csr =/= CSR.N),
    ("arith", () => id_ctrl.wxd && !(id_ctrl.jal || id_ctrl.jalr || id_ctrl.mem || id_ctrl.fp || id_ctrl.mul || id_ctrl.div || id_ctrl.csr =/= CSR.N)),
    ("branch", () => id_ctrl.branch),
    ("jal", () => id_ctrl.jal),
    ("jalr", () => id_ctrl.jalr))
    ++ (if (!usingMulDiv) Seq() else Seq(
      ("mul", () => if (pipelinedMul) id_ctrl.mul else id_ctrl.div && (id_ctrl.alu_fn & aluFn.FN_DIV) =/= aluFn.FN_DIV),
      ("div", () => if (pipelinedMul) id_ctrl.div else id_ctrl.div && (id_ctrl.alu_fn & aluFn.FN_DIV) === aluFn.FN_DIV)))
    ++ (if (!usingFPU) Seq() else Seq(
      ("fp load", () => id_ctrl.fp && io.fpu.dec.ldst && io.fpu.dec.wen),
      ("fp store", () => id_ctrl.fp && io.fpu.dec.ldst && !io.fpu.dec.wen),
      ("fp add", () => id_ctrl.fp && io.fpu.dec.fma && io.fpu.dec.swap23),
      ("fp mul", () => id_ctrl.fp && io.fpu.dec.fma && !io.fpu.dec.swap23 && !io.fpu.dec.ren3),
      ("fp mul-add", () => id_ctrl.fp && io.fpu.dec.fma && io.fpu.dec.ren3),
      ("fp div/sqrt", () => id_ctrl.fp && (io.fpu.dec.div || io.fpu.dec.sqrt)),
      ("fp other", () => id_ctrl.fp && !(io.fpu.dec.ldst || io.fpu.dec.fma || io.fpu.dec.div || io.fpu.dec.sqrt))))),
  new EventSet((mask, hits) => (mask & hits).orR, Seq(
    ("load-use interlock", () => id_ex_hazard && ex_ctrl.mem || id_mem_hazard && mem_ctrl.mem || id_wb_hazard && wb_ctrl.mem),
    ("long-latency interlock", () => id_sboard_hazard),
    ("csr interlock", () => id_ex_hazard && ex_ctrl.csr =/= CSR.N || id_mem_hazard && mem_ctrl.csr =/= CSR.N || id_wb_hazard && wb_ctrl.csr =/= CSR.N),
    ("I$ blocked", () => icache_blocked),
    ("D$ blocked", () => id_ctrl.mem && dcache_blocked),
    ("branch misprediction", () => take_pc_mem && mem_direction_misprediction),
    ("control-flow target misprediction", () => take_pc_mem && mem_misprediction && mem_cfi && !mem_direction_misprediction && !icache_blocked),
    ("flush", () => wb_reg_flush_pipe),
    ("replay", () => replay_wb))
    ++ (if (!usingMulDiv) Seq() else Seq(
      ("mul/div interlock", () => id_ex_hazard && (ex_ctrl.mul || ex_ctrl.div) || id_mem_hazard && (mem_ctrl.mul || mem_ctrl.div) || id_wb_hazard && wb_ctrl.div)))
    ++ (if (!usingFPU) Seq() else Seq(
      ("fp interlock", () => id_ex_hazard && ex_ctrl.fp || id_mem_hazard && mem_ctrl.fp || id_wb_hazard && wb_ctrl.fp || id_ctrl.fp && id_stall_fpu)))),
  new EventSet((mask, hits) => (mask & hits).orR, Seq(
    ("I$ miss", () => io.imem.perf.acquire),
    ("D$ miss", () => io.dmem.perf.acquire),
    ("D$ release", () => io.dmem.perf.release),
    ("ITLB miss", () => io.imem.perf.tlbMiss),
    ("DTLB miss", () => io.dmem.perf.tlbMiss),
    ("L2 TLB miss", () => io.ptw.perf.l2miss)))))

To use a Rocket Chip performance counter, software first writes the corresponding hpmevent CSR to describe what should be counted:

Specifically, the low 8 bits of hpmevent select an EventSet, and the remaining bits select events within that set. For example, writing 0x4200 to mhpmevent3 makes mhpmcounter3 increment whenever an integer load commits or a conditional branch instruction commits.

(figure: hpmevent encoding)
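A worked example of this encoding (plain Scala arithmetic; the event indices follow the EventSet 0 definition shown above):

val eventSel = 0x4200
val set  = eventSel & 0xFF   // = 0x00 -> EventSet 0, the committed-instruction events
val mask = eventSel >> 8     // = 0x42 -> bits 1 and 6 set
// In EventSet 0, event 1 is "load" and event 6 is "branch", so mhpmcounter3
// increments whenever an integer load or a conditional branch commits.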

The following shows how the relevant functions are used. This fragment comes from the CSR module; it handles writes to the counters (used for initialization) and to the event selectors. Note that event writes go through maskEventSelector, which prevents illegal bit positions from being written.

for (((e, c), i) <- (reg_hpmevent zip reg_hpmcounter).zipWithIndex) {
  writeCounter(i + CSR.firstMHPC, c, wdata)                                              // write the counter value
  when (decoded_addr(i + CSR.firstHPE)) { e := perfEventSets.maskEventSelector(wdata) }  // write the event selector (masked to legal bits)
}

Concretely, eventSetIdBits is 8. setMask covers the bits that choose an EventSet, maskMask covers the bits that choose individual events within a set, and eventSel & (setMask | maskMask).U guarantees that only legal bits can be written.

// The low 8 bits of the selector choose the EventSet; setMask masks those set-selection
// bits, while maskMask masks the per-event bits of the largest EventSet.
def maskEventSelector(eventSel: UInt): UInt = {
  // allow full associativity between counters and event sets (for now?)
  val setMask = (BigInt(1) << eventSetIdBits) - 1
  val maskMask = ((BigInt(1) << eventSets.map(_.size).max) - 1) << maxEventSetIdBits
  eventSel & (setMask | maskMask).U
}
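A hypothetical worked example: with the three EventSets shown earlier and both mul/div and the FPU enabled, the largest set has 18 events, so

val setMask  = (BigInt(1) << 8) - 1          // 0xFF: bits that pick the EventSet
val maskMask = ((BigInt(1) << 18) - 1) << 8  // 0x3FFFF00: one bit per event of the largest set
// eventSel & (setMask | maskMask) == eventSel & 0x3FFFFFF, so any stray higher bits
// written by software are silently dropped.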

How does a performance event actually fire, and how does the corresponding counter get incremented?

First, the desired events are configured by writing reg_hpmevent.

Then, in the Rocket core there is:

csr.io.counters foreach { c => c.inc := RegNext(perfEvents.evaluate(c.eventSel)) }

That is, each counter evaluates whether any of its selected events fired and produces an increment, which is fed into the CSR module:

val reg_hpmcounter = io.counters.zipWithIndex.map { case (c, i) =>
  WideCounter(CSR.hpmWidth, c.inc, reset = false, inhibit = reg_mcountinhibit(CSR.firstHPM+i)) }

reg_mcountinhibit controls whether the corresponding counter is allowed to increment (1 inhibits it). The counters are only 40 bits wide here (CSR.hpmWidth), since full 64-bit counters would waste flip-flops.

Next, how evaluate works. The code below consists of methods of the EventSets class:

private def decode(counter: UInt): (UInt, UInt) = {
  require(eventSets.size <= (1 << maxEventSetIdBits))
  require(eventSetIdBits > 0)
  (counter(eventSetIdBits-1, 0), counter >> maxEventSetIdBits)
}

def evaluate(eventSel: UInt): Bool = {
  val (set, mask) = decode(eventSel)
  val sets = for (e <- eventSets) yield {
    require(e.hits.getWidth <= mask.getWidth, s"too many events ${e.hits.getWidth} wider than mask ${mask.getWidth}")
    e check mask
  }
  sets(set)
}

decode returns set (which EventSet is selected) and mask (which events inside that set are selected). evaluate then iterates over all eventSets and calls check on each one.

def size = events.size
val hits = WireDefault(VecInit(Seq.fill(size)(false.B)))
def check(mask: UInt) = {
  hits := events.map(_._2())
  gate(mask, hits.asUInt)
}

check fills hits with the result of each event's predicate and then applies gate, the combining function passed in when the EventSet was constructed (e.g. (mask, hits) => (mask & hits).orR in the definitions above), to report whether any selected event fired. Finally, sets(set) picks the result of the chosen EventSet.

Rocket Chip's performance counters can only increment by 1 per cycle, so this logic cannot be carried over to BOOM as-is: a superscalar core may need to count several events of the same kind in a single cycle.

  • BOOM compatibility

The counters are now compatible with BOOM, and Level-1 TMA has been added.
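A minimal sketch of the idea behind such a change (not the actual modification; MultiIncCounter and its ports are made up for illustration): a counter whose increment port is wider than one bit, so that a superscalar core can report, for example, the number of uops retired in a cycle.

import chisel3._

// A counter that can advance by more than 1 per cycle, unlike the +1-only
// counters described above.
class MultiIncCounter(counterWidth: Int, incWidth: Int) extends Module {
  val io = IO(new Bundle {
    val inc     = Input(UInt(incWidth.W))    // e.g. number of matching events this cycle
    val inhibit = Input(Bool())              // analogous to mcountinhibit
    val value   = Output(UInt(counterWidth.W))
  })
  val count = RegInit(0.U(counterWidth.W))
  when (!io.inhibit) { count := count + io.inc }
  io.value := count
}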