Andrea Mondelli

PhD Student supervised by Pierre Michaud and André Seznec @ ALF project-team

Contact information

IRISA/INRIA
Campus de Beaulieu
35042 RENNES Cedex
FRANCE

tel : +33 2 99 84 71 91

E-mail: andrea DOT mondelli AT inria.fr

PhD thesis: Sequential Accelerators in Future Manycore Processors

Description: For almost four decades, steadily improving integration technology has made it possible to double the number of transistors on a single processor chip every two years. By 2010, manufacturers were able to implement 8 to 12 high-end superscalar processors, or up to 100 simple cores, on a single die. Ten years from now, it will be possible to put a large number of general-purpose cores on the same chip, certainly hundreds of highly complex cores or thousands of simple cores. This will lead to the so-called “many-core” processors. We believe that, despite huge future investment in developing parallel applications and in parallelizing existing ones, most applications will still exhibit a significant amount of sequential code.

Amdahl’s law indicates that, for applications featuring even a small fraction of sequential sections, the overall execution time on a massively parallel many-core processor will depend significantly on the performance of these sequential sections. Hence we consider that a significant fraction of the transistor budget, as well as of the energy budget, in the many-core processors of the 2020s can be dedicated to achieving the highest possible performance on sequential codes or on sequential sections of parallel codes.

Processor frequency is limited by the power constraint, but even more by temperature. In a recent study, we proposed a radically new approach to achieving very high performance on sequential code: the Sequential Accelerator (SACC). A SACC occupies the area of many complex cores (e.g. one quarter of a 2020s many-core processor chip) and can consume the energy quota associated with this share of the chip. A SACC features several complex cores that are clocked at a very high frequency. Unlike traditional designs, these complex cores are not built to run continuously; they are designed on the assumption that they are inactive most of the time. A single core is active at any time, and the inactive cores are power-gated to minimize power consumption. The active core can draw a lot of electric power, so its temperature rises quickly. To prevent the temperature from becoming too high, execution is migrated periodically to a new core. This way, heat generation is spread uniformly over the whole SACC area, which keeps the temperature at an acceptable level.

Several microarchitectures for implementing a SACC are possible, e.g. different issue widths, different instruction window sizes, different L1 and L2 cache sizes, different pipeline lengths, different voltages and clock frequencies, etc. The power density distribution will not be the same for these different microarchitectures, hence different temperatures for the same total power. For a thermally constrained microarchitecture, this will also have an impact on performance. The objective of the thesis is to explore the design space of possible SACC designs, including microarchitecture but also technology, and to propose architecture candidates that achieve very high performance on sequential code. Experiments will be conducted with simulation tools for modeling performance, power and temperature.
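
As a rough illustration of the Amdahl’s law argument above, the following minimal sketch computes the speedup of a many-core chip over a single baseline core. The function name and the `sequential_boost` parameter are hypothetical and only stand in for the effect of a sequential accelerator; the sketch also assumes, for simplicity, that adding the accelerator does not reduce the number of cores available for the parallel sections.

```python
def amdahl_speedup(sequential_fraction, n_cores, sequential_boost=1.0):
    """Speedup over one baseline core according to Amdahl's law.

    sequential_fraction: share of the baseline execution time that is sequential
    n_cores:             number of cores running the parallel sections
    sequential_boost:    hypothetical factor by which sequential sections are
                         accelerated (1.0 means no sequential accelerator)
    """
    if not 0.0 <= sequential_fraction <= 1.0:
        raise ValueError("sequential_fraction must be between 0 and 1")
    if n_cores < 1 or sequential_boost <= 0:
        raise ValueError("n_cores must be >= 1 and sequential_boost > 0")
    sequential_time = sequential_fraction / sequential_boost
    parallel_time = (1.0 - sequential_fraction) / n_cores
    return 1.0 / (sequential_time + parallel_time)


if __name__ == "__main__":
    # With 5% sequential code, the speedup stays below 1 / 0.05 = 20
    # however many cores are added; speeding up the sequential part
    # raises that ceiling.
    for boost in (1.0, 2.0, 4.0):
        print(boost, amdahl_speedup(0.05, 1000, boost))
```

The point of the sketch is the ceiling: once the parallel part is spread over many cores, the sequential term dominates the denominator, which is the motivation for spending area and energy on sequential performance.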

Description

I am a PhD student. Usually I write software in C. Sometimes in Ruby, C++, asm, or others.

I work on computer architecture, trying to improve single-core performance. I like to play with cache coherence protocols and system simulators like COTSon, GEMS and (of course) our in-house microarchitectural simulator.

I worked in the ancient world of Computer Security and Expert Systems. I have liked taking things apart since I was a child, but I still do not know how to reassemble them.

I am a faithful disciple of the oldest religion in my field: Amdahl’s Law.

Publications (and deliverables):

“Revisiting Clustered Microarchitecture for Future Superscalar Cores: A Case for Wide Issue Clusters” ACM Trans. Archit. Code Optim. 12, 3, Article 28 (August 2015), 22 pages. DOI: http://dx.doi.org/10.1145/2800787
“Dataflow Support in x86_64 Multicore Architectures through Small Hardware Extensions”, DSD2015 – Euromicro Conference on Digital System Design, Madeira, Portugal, August 2015, doi:10.1109/DSD.2015.62
“Enhancing an x86_64 Multi-Core Architecture with Data-Flow Execution Support”, CF2015 – ACM Proc. of Computing Frontiers, Ischia, Italy, May 2015, pp. 1-2, doi:10.1145/2742854.2742896, ISBN:978-1-4503-3358-0
“Simulating a Multi-core x86_64 Architecture with Hardware ISA Extension Supporting a Data-Flow Execution Model”, AIMS2014 – IEEE Proc. AIMS-2014, Madrid, Spain, Nov. 2014, pp. 264-269
“Analisi e Valutazione di schemi di Replicazione per Memorie Cache”, OmniScriptum GmbH & Co. KG, 136 pages, ISBN:978-3639656091
“A Scalable Distributed Data-flow Scheduler for Many-Cores”, HiPEAC/ACACES-2013
“PIKE Improving COTSon Interface for Easier Design Space Exploration”, HiPEAC/ACACES-2013
“Advanced Version of the Compilation Tools”, TERAFLUX D4.7, Siena, Italy, May 2014, pp. 1-25
“Rep. on knowledge transfer and training”, TERAFLUX D7.4, Siena, Italy, Dec. 2012, pp. 1-50
“Fine-tuned TERAFLUX Execution Model”, TERAFLUX D6.3, Siena, Italy, Dec. 2012, pp. 1-34

Previous experiences:

2012 – 2013: Teraflux Project, Development of simulators and virtualization tools for research projects on multicore and reconfigurable systems @ University of Siena (IT)
2007 – 2009: Security Engineer for Managed Services, assurance on security related platforms within H3G Italy ICT infrastructure and Deployment of RSA Ericsson’s authentication systems @ Ericsson ENSA
2006: Intrusion Detection with Expert Systems (Immune Genetic Algorithm) @ Polytechnic of Bari (IT)
2005: Java security compliant module development, testing of safety requirements @ Sun Microsystems (Saarbrücken, DE)
2003 – 2004: Unix Security Senior administration @ Medianext (IT)
1999 – 2001: Unix System Junior administration @ Isrnet (IT)

Past projects (for fun&profit):

TSU – Thread Scheduling Unit
Victim-Cache Coherence Protocol design and simulation for D-NUCA architectures
Digital circuit design of a CIC filter (VHDL)
Analysis of properties and topology of the Internet using Autonomous Systems methodology
Resolution of the TSP problem using genetic algorithms
Design of a didactic Security Protocol for Mutual Trust exchange of cryptographic keys
Monitoring Device with Real Time Protocol on FLEX board
Client-Server iteration system with extended synchronization mechanism in Java
Distributed memory system with high scalability in C language
Safety-oriented Unix WebServer
Deployment of Sensage system for enterprise tracking log
Deployment of RSA Ericsson’s authentication systems
L.I.S.A. – Danger Theory in a Computer Intrusions recognition system
Toy Graphics engine in OpenGL and C
Security Code Revision of Java JDK
Security-oriented Linux distribution: “MaliGNUz”
Pike4Cotson: a wrapper for the COTSon simulator

More details:

The thesis is based on the development of multicluster microprocessors to increase performance. Technological evolution imposes physical limits on the growth of the internal structures of processors. My work focuses on exploring alternative solutions such as clustering and the implementation of new instruction scheduling policies. The starting point is the current Haswell microprocessor, which is used to analyze performance when it is operated in clustered mode. The front-end has been extended in line with technology forecasts for the coming years, while the back-end will be extended through the study of alternatives to current solutions. The ultimate goal is to improve performance by working around current hardware problems such as wire delays, power consumption, the size of internal structures, and the limit on microprocessor speed.

*****

In my thesis I am proposing and developing new approaches to achieve very high performance on sequential code through sequential accelerators. Putting more cores on a single chip has increased total chip throughput and benefits applications with thread-level parallelism. However, most applications have low thread-level parallelism, so having more cores is not sufficient: it is also important to accelerate individual threads.

A first approach is to reduce hardware complexity and increase the number of instructions in flight using a multicluster architecture. The architecture that I propose rests on assumptions that differ from past research on clustered microarchitecture: I assume wide-issue clusters (≥ 8-issue), whereas past research mostly focused on narrow-issue clusters (≤ 4-issue). Going from narrow-issue to wide-issue clusters is not just a quantitative change; it has a qualitative impact on the clustering problem, in particular on the steering policy. Past research on steering policies showed that minimizing inter-cluster communications while achieving good cluster load balancing is a difficult problem. My work tries to depict what future superscalar cores may look like in 10 years. A clustered architecture is not the only possible solution for a future sequential accelerator.
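
To make the tension between inter-cluster communication and load balancing concrete, here is a toy steering sketch. It is not the policy studied in the thesis: the function name, the instruction encoding and the `imbalance_limit` threshold are assumptions made for illustration only. Each instruction goes to the cluster that produced most of its source operands, unless that cluster is ahead of the least-loaded one by more than the threshold.

```python
def steer(instructions, n_clusters=2, imbalance_limit=4):
    """Toy dependence-based steering with a load-balance override.

    instructions: list of (dest_reg, [src_regs]) in program order;
                  dest_reg may be None for instructions writing no register.
    Returns (cluster chosen for each instruction,
             number of operands read from another cluster).
    """
    producer_cluster = {}      # register -> cluster that last wrote it
    load = [0] * n_clusters    # instructions steered so far (nothing retires here)
    assignment = []
    transfers = 0
    for dest, srcs in instructions:
        # Count how many source operands were produced in each cluster.
        votes = [0] * n_clusters
        for reg in srcs:
            if reg in producer_cluster:
                votes[producer_cluster[reg]] += 1
        least_loaded = min(range(n_clusters), key=lambda c: load[c])
        # Prefer the cluster holding most operands; break ties by lower load.
        preferred = max(range(n_clusters), key=lambda c: (votes[c], -load[c]))
        if load[preferred] - load[least_loaded] > imbalance_limit:
            chosen = least_loaded   # give up locality to rebalance
        else:
            chosen = preferred
        transfers += sum(1 for reg in srcs
                         if reg in producer_cluster
                         and producer_cluster[reg] != chosen)
        load[chosen] += 1
        assignment.append(chosen)
        if dest is not None:
            producer_cluster[dest] = chosen
    return assignment, transfers


if __name__ == "__main__":
    chain = [("r1", []), ("r2", ["r1"]), ("r3", ["r2"])]
    print(steer(chain, imbalance_limit=4))  # chain stays in one cluster
    print(steer(chain, imbalance_limit=0))  # strict balancing forces a transfer
```

A tight threshold balances the clusters at the price of extra operand transfers, while a loose one keeps dependence chains together but lets one cluster fill up. A real core would track window occupancy rather than a running count.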

The results of this approach are described in the paper “Revisiting clustered microarchitecture for future superscalar cores: a case for wide-issue clusters”, which I have submitted to the TACO journal (http://taco.acm.org).

I also propose another approach: a hardware accelerator for loops. Many programs spend a significant part of their execution in dynamic loops, i.e., periodic sequences of dynamic instructions. I am exploring the possibility of implementing an accelerator that is specialized for dynamic loops and does not require any help from the compiler or the programmer. One way to do this could be to reorganize the µops (micro-ops) in the loop buffer to enable optimizations: for example, taking advantage of read-after-write and write-after-write register dependencies to increase register renaming bandwidth, or bypassing predicted load-store dependencies and removing the corresponding load. Such a hardware accelerator requires a careful evaluation of out-of-order parameters and of branch predictor behavior during loop execution, because a loop exit differs from an ordinary branch misprediction.
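
The notion of a dynamic loop as a periodic sequence of dynamic instructions can be sketched in software as follows. This is only an illustrative model, not the hardware mechanism: the function name and the `max_period` and `min_repeats` parameters are assumptions, and the trace is simply a list of instruction addresses.

```python
def detect_dynamic_loop(pc_trace, max_period=64, min_repeats=3):
    """Return the smallest period p such that the last min_repeats * p
    entries of pc_trace repeat with period p, or None if there is none.

    pc_trace:    instruction addresses in dynamic (execution) order
    max_period:  longest loop body considered, in instructions
    min_repeats: consecutive iterations required before reporting a loop
    """
    n = len(pc_trace)
    for period in range(1, max_period + 1):
        window = min_repeats * period
        if window > n:
            break  # not enough history for this or any longer period
        tail = pc_trace[n - window:]
        # Periodic means every entry equals the one a full period later.
        if all(tail[i] == tail[i + period] for i in range(window - period)):
            return period
    return None


if __name__ == "__main__":
    # Two straight-line instructions followed by three iterations
    # of a three-instruction loop body.
    trace = [0x10, 0x14] + [0x20, 0x24, 0x28] * 3
    print(detect_dynamic_loop(trace))
```

Because detection relies only on the executed instruction stream, it needs no hint from the compiler or the programmer, which matches the constraint stated above.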