Oscar compiler for power reduction


Published on September 26, 2013

Author: magoroku15

Source: slideshare.net

Description

ARM quad power reduction OSCAR android LCPC mobile

OSCAR Compiler Controlled Multicore Power Reduction on Android Platform
Hideo Yamamoto¹, Tomohiro Hirano¹, Kohei Muto¹, Hiroki Mikami¹, Takashi Goto¹, Dominic Hillenbrand¹, Moriyuki Takamura², Keiji Kimura¹ and Hironori Kasahara¹
¹Green Computing Systems Research and Development Center, Waseda University
²FUJITSU LABORATORIES LTD.
LCPC2013

Presentation Outline
•  Background
   –  Power consumption in multicore
   –  Power control mechanism of the OSCAR Compiler
   –  Power control on the Android™ platform
•  Experimental
   –  Evaluation target, power rail and measurement device
   –  Precise power measurement method using GPIO
   –  Bind mode
   –  Clock gating method using WFI instruction
•  Highlight event in data
   –  Power consumption of MPEG2 decoder
•  Conclusion

BACKGROUND

A Plethora of Smart Devices
[Figure: timeline (2007, 2011, 2013, 2014) of processor configurations in high-performance smart devices: Linux on a single ARM11/Cortex-A8 core; Linux 2-core SMP (2 × Cortex-A9); Linux 4-core SMP (4 × Cortex-A9); Linux 5-core 4+1 vSMP (5 × Cortex-A9); Linux 8-core big.LITTLE (4 × Cortex-A15 + 4 × Cortex-A7); Linux 8-core HMP (4 × Cortex-A15 + 4 × Cortex-A7).]
Cumulative smart device shipments: iOS 700,000,000; Android 1,000,000,000

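Power Consumption in Multicore
•  Uni core:   P = f*c*v^2      (Eq. 1)
•  Multi core: P = n*f*c*v^2    (Eq. 2)
Here P is power, n the number of cores, f the clock frequency, c the capacitance and v the supply voltage.
In the quad-core case, 'f' can be reduced to ¼ while keeping the same performance. If 'v' can then be lowered to 0.6 v at ¼ 'f', power consumption is reduced to 0.36 of the uni-core value (4 × ¼ × 0.6^2 = 0.36).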
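The arithmetic behind the 0.36 figure can be checked with a small helper. This is a minimal sketch: the function names are illustrative, not part of the OSCAR tooling, and the capacitance c is assumed to be identical in both cases so that it cancels out.

```c
#include <assert.h>
#include <stdio.h>

/* Relative power from Eq. 2, normalised to one core running at nominal
 * frequency and voltage. f_scale and v_scale are fractions of the nominal
 * values; the capacitance c cancels out of the ratio. */
double relative_power(int n, double f_scale, double v_scale)
{
    assert(n > 0);
    return n * f_scale * v_scale * v_scale;
}

/* Prints the quad-core case from the slide: 4 cores, f/4, 0.6 v. */
void print_quad_core_example(void)
{
    printf("relative power: %.2f\n", relative_power(4, 0.25, 0.6));
}
```

With n = 4, f_scale = 0.25 and v_scale = 0.6 the frequency reduction exactly offsets the core count, so the whole saving comes from the squared voltage term.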

OSCAR Compiler (Waseda University)
Multigrain Parallel Processing
•  Hierarchical and global parallelization
•  Coarse grain task parallel
•  Loop iteration parallel
•  Statement level parallel
Data Locality Optimization
•  Task (or loop) decomposition considering cache size or local memory size
•  Task scheduling considering data affinity
Low Power Optimization
•  Power scheduling with DVFS, clock gating and power gating by software
[Figure: program structure showing a doall loop, a sequential loop, and task level or statement level parallelization.]

Power Control Mechanism of the OSCAR Compiler
•  Estimate the execution time of each MT (macro task) and find the critical path
•  Determine the execution time that satisfies the given deadline
•  Decide the optimal frequency and voltage of each MT
[Figure: a statically scheduled MTG (macro task graph) with MT1 to MT9 on Core0 to Core2, shown on a time axis before and after power scheduling against the given deadline. After scheduling, MT3 and MT5 run at low frequency, and the margin and idle periods are covered by clock gating and power gating. Caption: power scheduling with DVFS, clock gating and power gating by software.]
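The third step, picking a frequency for a task once its time budget is known, can be sketched as follows. This is a simplified illustration and not the OSCAR compiler's scheduling algorithm: the function name is hypothetical, the task cost is assumed to be known in CPU cycles, and the three operating points are the frequencies that appear later in the MPEG2 waveform slides.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed operating points in MHz, lowest first. */
static const unsigned freqs_mhz[] = { 200, 400, 1700 };

/* Returns the lowest frequency (MHz) that completes `cycles` CPU cycles
 * within `deadline_us` microseconds, or 0 if even the highest one misses. */
unsigned lowest_freq_meeting_deadline(unsigned long long cycles,
                                      unsigned long long deadline_us)
{
    size_t i;
    assert(deadline_us > 0);
    for (i = 0; i < sizeof freqs_mhz / sizeof freqs_mhz[0]; i++) {
        /* A core at f MHz executes f cycles per microsecond. */
        unsigned long long capacity =
            (unsigned long long)freqs_mhz[i] * deadline_us;
        if (capacity >= cycles)
            return freqs_mhz[i];
    }
    return 0;
}
```

A task of 1,000,000 cycles with a 5000 us budget fits at 200 MHz, while one extra cycle pushes it to the 400 MHz point.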

Power Control on Android
•  CPUFreq
   –  Frequency and voltage scaling of a target CPU
•  CPUIdle
   –  Manages the idle level of each CPU core
•  HotPlug (> 10 ms)
   –  Extends the functions of CPUFreq and CPUIdle
   –  Adds another core to distribute the load under high utilization
   –  Shuts down excess cores under low utilization
   –  Decides heuristically whether to bring cores online or offline
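For reference, user space on mainline Linux can request a core on or off line through the standard sysfs file /sys/devices/system/cpu/cpuN/online. The sketch below uses that interface; it is not taken from the slides, the helper names are illustrative, and the write needs root privileges and a kernel built with CPU hotplug support.

```c
#include <assert.h>
#include <stdio.h>

/* Formats the sysfs hotplug control file path for `cpu` into `buf`.
 * Returns the value of snprintf (length of the full path). */
int cpu_online_path(char *buf, size_t len, unsigned cpu)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "/sys/devices/system/cpu/cpu%u/online", cpu);
}

/* Writes "1" (online) or "0" (offline) to the control file.
 * Returns 0 on success and -1 on any failure. */
int set_cpu_online(unsigned cpu, int online)
{
    char path[64];
    FILE *fp;
    int n = cpu_online_path(path, sizeof path, cpu);

    if (n < 0 || (size_t)n >= sizeof path)
        return -1;
    fp = fopen(path, "w");
    if (fp == NULL)
        return -1;
    if (fputs(online ? "1" : "0", fp) == EOF) {
        fclose(fp);
        return -1;
    }
    return fclose(fp) == 0 ? 0 : -1;
}
```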

Problems of Linux Power Control and Parallel Processing
•  Hotplug cannot bring a core online and bind threads to it swiftly
   –  In the worst case it needs several hundred milliseconds
•  Non real-time
   –  Linux cannot control timing at resolutions finer than 5 to 10 ms
[Figure: measured startup time of 440.6 ms.]

Background
•  Motivation
   –  Parallelization is effective for low-power execution with DVFS, power gating and clock gating
   –  The OSCAR compiler can generate power control API calls automatically
•  Obstacle
   –  Linux needs a long startup time to distribute load to multiple cores
   –  Lack of fine-resolution time control
•  Challenge
   –  Low-power execution on the Android platform through parallelization

EXPERIMENTAL

Evaluation Board: ODROID-X2
•  Samsung Exynos4412 Prime
   –  ARM Cortex-A9 quad core
   –  Maximum clock frequency 1.7 GHz
   –  Used in Samsung's Galaxy S3
•  DVFS cannot be applied to each core independently
•  An Android Open Source version is available for the board
•  The circuit schematic is available on request

Power Rail for Exynos4412
•  The Exynos4412 is powered by four voltage rails from the PMIC (Power Management IC)
   –  VDD_ARM: CORE
   –  VDD_INT: interrupt controller and L2
   –  VDD_G3D: GPU
   –  VDD_MIF: DDR memory
•  The power consumption of VDD_ARM (CORE) was measured
[Figure: Exynos4412 SoC block diagram. Four Cortex-A9 cores, each with 32 KB I/D caches and NEON, on VDD_ARM; interrupt controller + L2 on VDD_INT; GPU on VDD_G3D; DDR on VDD_MIF; all rails supplied by the PMIC.]

Modified Circuit Diagram of ODROID-X2
[Figure: modified schematic with current and voltage probe points. Voltage (V) x Current (A) = Power (W).]

How to Measure CORE Power on ODROID-X2
•  A 40 mΩ shunt resistor is added to VDD_ARM
[Figure: the shunt sits between the PMIC and the SoC; an instrumentation amplifier measures the voltage drop across it.]
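The conversion from the measured voltage drop to core power follows from Ohm's law and Voltage x Current = Power. A minimal sketch, assuming the drop has already been divided by the instrumentation amplifier's gain (the gain is not given in the slides) and using illustrative function names:

```c
#include <assert.h>

#define SHUNT_OHMS 0.040 /* 40 mOhm shunt on VDD_ARM, from the slide */

/* Current through the shunt in amperes, from the voltage drop in volts. */
double shunt_current_a(double drop_v)
{
    assert(drop_v >= 0.0);
    return drop_v / SHUNT_OHMS;
}

/* Power drawn from the rail in watts: rail voltage times shunt current. */
double rail_power_w(double rail_v, double drop_v)
{
    assert(rail_v >= 0.0);
    return rail_v * shunt_current_a(drop_v);
}
```

For example, a 20 mV drop corresponds to 0.5 A, which at a 1.4 V rail is 0.7 W.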

Synchronization Between Program and Waveforms Using GPIO

"bind" Mode
•  The core assignment logic of Android Linux hotplug is heuristic
•  A new core assignment mode called "bind" mode was developed for efficient parallel execution
•  "bind" mode is integrated into Android Linux as the OSCAR runtime and API
•  Specification of the OSCAR API for "bind" mode
   –  Core 0 is reserved for the Android system and non-OSCAR parallel programs
   –  The application can disable hotplug and control whether cores are online or offline
   –  The application can bind cores 1, 2 and 3 to the OSCAR parallel program
[Figure: measured startup time of 7.2 ms.]
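The core split described by the specification can be modelled as a bitmask. The OSCAR runtime API itself is not shown in the slides, so the names below are hypothetical; on Linux a mask of this shape is the kind of value an affinity call such as sched_setaffinity works with.

```c
#include <assert.h>

#define NUM_CORES 4u /* quad-core Exynos4412 */

/* Builds a bitmask of cores for the parallel program. Core 0 is never
 * included because it stays reserved for the Android system, so at most
 * NUM_CORES - 1 workers can be requested. */
unsigned oscar_bind_mask(unsigned worker_count)
{
    unsigned mask = 0;
    unsigned core;

    assert(worker_count < NUM_CORES);
    for (core = 1; core <= worker_count; core++)
        mask |= 1u << core;
    return mask;
}

/* Returns 1 if `core` is part of `mask`, otherwise 0. */
int core_is_bound(unsigned mask, unsigned core)
{
    return (int)((mask >> core) & 1u);
}
```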

Clock Gating
•  WFI instruction
   –  The WFI instruction suspends execution of the processor core and stops its clock until a timer event occurs
•  Clock gating driver using the WFI instruction
   –  The WFI instruction is a privileged instruction
   –  The API lets a user program execute the WFI instruction inside a Linux driver

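Fine Timing Control by WFI Driver
•  The time slot is 500 us
•  call_wfi_api(N) clock-gates the core for N slots; the wake-up time T satisfies (N - 1) x 500 us < T < N x 500 us

    while (1) {
        gpio_value(1);
        call_wfi_api(1);   /* 1 slot: 0 us < T < 500 us */
        gpio_value(0);
    }

    while (1) {
        gpio_value(1);
        call_wfi_api(4);   /* 4 slots: 1500 us < T < 2000 us */
        gpio_value(0);
    }

[Figure: GPIO and current waveforms (250 mA and 500 mA levels) for both loops, marking the clock-gated interval, the wake-up point, and the 1500 us (3 slots) and 2000 us (4 slots) boundaries.]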
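The wake-up bound (N - 1) x 500 us < T < N x 500 us can be written as a small helper. This sketch only evaluates the bound from the slide; the type and function names are illustrative and it does not call the WFI driver.

```c
#include <assert.h>

#define SLOT_US 500u /* time slot length from the slide */

/* Exclusive lower and upper bounds of the wake-up time, in microseconds. */
typedef struct {
    unsigned min_us;
    unsigned max_us;
} wake_window;

/* Wake-up window for a request of `slots` time slots (slots >= 1). */
wake_window wfi_wake_window(unsigned slots)
{
    wake_window w;

    assert(slots >= 1);
    w.min_us = (slots - 1) * SLOT_US;
    w.max_us = slots * SLOT_US;
    return w;
}
```

The uncertainty is always one slot wide, because the request can arrive at any point inside the current slot.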

Current Waveform of Busy Wait Without Clock Gating
[Figure: current waveform (axis marks at 500 mA, 1000 mA, 1500 mA, 2000 mA) for 1, 2, 3 and 4 cores during busy wait in ordinary execution.]

Current Waveform of Busy Wait With Clock Gating
[Figure: current waveform (axis marks at 500 mA, 1000 mA, 1500 mA, 2000 mA) for 1, 2, 3 and 4 cores during busy wait with clock gating, showing the points where all cores are clock gated and where all cores wake up.]

Comparison of Current Waveforms
[Figure: the two previous waveforms side by side for 1 to 4 cores: busy wait in ordinary execution versus busy wait with clock gating.]

MPEG2 DECODER: Highlight data

Power Consumption of MPEG2 Decoder on ODROID-X2
[Figure: bar chart of power consumption by number of cores, with and without power reduction control. Annotated reductions: 1/7 (13.3%) and 1/3 (38.1%).]

Demo

Power Waveform of MPEG2 Decoder for 1 Core
[Figure: (a) Without power reduction control: MPEG2 decode execution at high clock and voltage (1.7 GHz, 1.4 V), followed by busy-wait execution. (b) With power reduction control: decode at 1.7 GHz, 1.4 V, with the busy wait replaced by clock gating by WFI. Shaded areas mark the consumed power and the power reduced by WFI.]

Power Waveform of MPEG2 Decoder for 3 Cores
[Figure: (a) Without power reduction control: MPEG2 decode execution at high clock and voltage (1.7 GHz, 1.4 V), followed by busy-wait execution. (b) With power reduction control: decode at low clock and voltage through DVFS (operating points shown: 400 MHz, 1.05 V and 200 MHz, 0.92 V; P = n*f*c*V^2), with clock gating by WFI. Shaded areas mark the consumed power and the power reduced by DVFS and WFI.]

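Power Consumption of MPEG2 Decoder on ODROID-X2
[Figure: bar chart of power (W) by number of cores with the values 2.79, 0.97, 0.63 and 0.37. The reductions are attributed to WFI and to DVFS plus WFI; the annotation 1/3 (38.1%) marks the consumed versus reduced share.]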
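The percentage annotations can be reproduced from the bar values. In this sketch, 0.97 W and 0.37 W are the 1-core and 3-core figures with power control named in the conclusions; reading 2.79 W as the 1-core figure without power control is an assumption based on the 13.3% annotation. The function name is illustrative.

```c
#include <assert.h>

/* Expresses `value_w` as a percentage of `baseline_w`. */
double percent_of(double value_w, double baseline_w)
{
    assert(baseline_w > 0.0);
    return 100.0 * value_w / baseline_w;
}
```

percent_of(0.37, 0.97) gives about 38.1, the "1/3" annotation, and percent_of(0.37, 2.79) gives about 13.3, the "1/7" annotation.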

Conclusions
•  The ODROID-X2 circuit was modified so that
   1.  Precise power waveforms at the output of the PMIC can be observed, and
   2.  The power waveforms and parallel program events can be correlated in time for OSCAR compiler optimization.
•  An efficient parallel program execution platform on Android was established by
   1.  "bind" mode, and
   2.  The WFI instruction issued under OSCAR compiler control.
•  The newly developed OSCAR compiler power control mechanism reduced power to about one third, from 0.97 W on 1 core to 0.37 W on 3 cores, when running the MPEG2 decoder on the Android platform.

BACKUP SLIDE

OPTICAL FLOW: Highlight data

Power Consumption of Optical Flow on ODROID-X2
[Figure: bar chart of power consumption. Annotated reductions: 13.4% and 31.5%.]

Power Waveform of Optical Flow for 1 Core
[Figure: Optical Flow execution followed by busy-wait execution; with clock gating by WFI, the power of wasted CPU cycles is reduced.]

Power Waveform of Optical Flow for 3 Cores
[Figure: Optical Flow execution at high clock and voltage followed by busy-wait execution, compared with Optical Flow execution at low clock and voltage (P = n*f*c*V^2) with clock gating by WFI.]

Low-power Code with OSCAR API
[Figure: compilation flow. Sequential programs (C/Fortran) → OSCAR Compiler (multigrain parallelization, memory optimization, data transfer optimization, DVFS, clock gating) → low-power parallel C/Fortran programs including OSCAR API → backend compiler (API decoder, which translates OSCAR API into library calls, followed by the native compiler) → executable object. Example schedule: Proc0 runs T1 and is then off; Proc1 runs T2 and T4; Proc2 runs T3 and T6 (slow).]
OSCAR API directives shown:
    #pragma oscar fvcontrol(pe, (id, state))
    #pragma oscar get_fvstatus(pe, id, state)
    #pragma oscar get_current_time(current, timer_no

ODROID Original
[Figure: schematic and board layout of the original VDD_ARM supply from the PMIC, with L and C filter components to GND.]

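ODROID After Rework
[Figure: reworked VDD_ARM supply from the PMIC with the shunt resistor R added alongside the L and C components; the drop voltage and the rail voltage are brought out on a single 5-pin connector.]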

How Hotplug Works
[Figure: timeline of core states (L, 0, 1, 2) as the governor moves between the up, down, idle and disable states; transitions are separated by up2g0_delay, up2gn_delay and down_delay.]

Auto Hotplug Governor

tegra_cpu_set_speed_cap

    int tegra_cpu_set_speed_cap(unsigned int *speed_cap)
    {
        /* ... (lines omitted on the slide are marked like this) */
        unsigned int new_speed = tegra_cpu_highest_speed();
        /* ... */
        new_speed = tegra_throttle_governor_speed(new_speed);
        new_speed = edp_governor_speed(new_speed);
        new_speed = user_cap_speed(new_speed);
        /* ... */
        ret = tegra_update_cpu_speed(new_speed);
        /* ... */
        tegra_auto_hotplug_governor(new_speed, false);
        /* ... */
    }

tegra_auto_hotplug_governor parameters

    Parameter     LP-mode         GP-mode
    up_delay      up2g0_delay     up2gn_delay
    down_delay    down_delay      down_delay
    top_freq      idle_top_freq   idle_bottom_freq
    bottom_freq   0               idle_bottom_freq

    Current state   Requested freq    New state   Delay to take effect
    IDLE            > top_freq        UP          up_delay
    IDLE            <= bottom_freq    DOWN        down_delay
    DOWN            > top_freq        UP          up_delay
    DOWN            > bottom_freq     IDLE        NA
    UP              < bottom_freq     DOWN        down_delay
    UP              <= top_freq       IDLE        NA

[Figure: block diagram relating CpuFreq, Throttle_table / throttle_index, thermal_cooling_device, Edp_Thermal, update from user, Suspend and Auto Hot plug.]
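The state transition table on this slide can be expressed as a pure function. This is a sketch that follows the table literally and is not the Tegra kernel source: the enum and function names are illustrative, and the up_delay and down_delay timers that the real governor applies before a transition takes effect are left out.

```c
#include <assert.h>

typedef enum { HP_IDLE, HP_DOWN, HP_UP } hp_state;

/* Next governor state for a requested frequency, per the slide's table.
 * Rows are tested in the order they appear, so for the DOWN state the
 * "> top_freq" row takes priority over the "> bottom_freq" row. */
hp_state hp_next_state(hp_state cur, unsigned freq,
                       unsigned top_freq, unsigned bottom_freq)
{
    assert(bottom_freq <= top_freq);
    switch (cur) {
    case HP_IDLE:
        if (freq > top_freq)
            return HP_UP;
        if (freq <= bottom_freq)
            return HP_DOWN;
        return HP_IDLE;
    case HP_DOWN:
        if (freq > top_freq)
            return HP_UP;
        if (freq > bottom_freq)
            return HP_IDLE;
        return HP_DOWN;
    case HP_UP:
        if (freq < bottom_freq)
            return HP_DOWN;
        if (freq <= top_freq)
            return HP_IDLE;
        return HP_UP;
    }
    return cur; /* unreachable for valid states */
}
```

Keeping the transition logic free of timers makes each row of the table checkable on its own; a caller would add the delay handling around it.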
