Performance and Reliability in Computer Architecture


Computer architecture is both a science and an art. On one hand, it is about choosing how to design each hardware component of a computer; on the other, it is about combining those components reasonably and efficiently so that, at the hardware level, they form a complete machine.

Computers are everywhere in the scientific community, yet there is no perfect computer. The goal of computer architecture, however, never changes: to design computers with higher performance, more complete functionality, and greater efficiency at lower cost. That goal is almost impossible to reach fully, but the researchers and developers working on computer architecture keep moving closer to it.

Through researchers' continuous exploration, computer architecture has developed in a highly coordinated way across functionality, performance, efficiency, and cost. It has thus entered a new era in which multi-functional, high-performance, high-efficiency computers are achieved under the constraints of cost, energy, and availability.

Although many researchers keep pushing in these directions, the current state of computer architecture also reveals real limitations, both in logic design and at the hardware level. This report reviews papers published at the ISCA and MICRO conferences over the last three years. Reading them carefully, I found that their authors confront the various bottlenecks of computer architecture directly: starting from the limitations of today's systems, they investigate, analyze, and then offer corresponding suggestions and feasible solutions.

Computer architecture covers a very wide range, and its definition is broad: it includes boundary design, instruction-set organization, hardware and software, and it touches on applications, technology, parallelism, interfaces, programming languages, compilers, operating systems, and more. As a center of technical development, architecture has kept moving forward.

Looking back over the past sixty years of computer architecture, its development is easy to trace. In the earliest era, computer systems developed very quickly on fairly simple, general-purpose hardware. From the mid-1960s to the mid-1970s, architecture made great progress: multiprogramming and multi-user systems introduced new concepts of human-computer interaction, opened new realms of application, and raised both hardware and software to a new level. From the mid-1970s, distributed systems appeared and became popular, greatly increasing system complexity, and microprocessors came into wide use. Today, architecture has entered its fourth generation: hardware and software are exploited together comprehensively, the centralized host environment has quickly given way to a distributed client/server (or browser/server) environment, and new technologies keep emerging. Overall, though, the problems to solve still center on computer functionality. With the development of RISC and innovations in cache technology, the PC has drawn ever closer to professional machines, and with every advance, minimizing component cost has become the most important consideration.


This report, in the form of reading notes, first introduces some macro-level measures in computer architecture for improving performance and efficiency from two angles: cache design and parallel computing. It then offers a glimpse into a third topic, "reliable architecture".


1. Cache Design

For anyone studying computer architecture, "cache" is certainly no strange word. A cache is a memory that is much smaller than main memory but much faster, used to hold copies of instructions and data fetched from main memory (likely to be needed in upcoming processing steps).

The cache sits between main memory and the CPU as a level of the memory hierarchy. Built from static RAM (SRAM) chips, its capacity is relatively small, but its speed is much higher than main memory's, approaching that of the CPU.

A cache consists mainly of three parts: the cache store, which holds the instruction and data blocks transferred from main memory; the address-translation unit, which maintains a directory to translate main-memory addresses into cache addresses; and the replacement unit, which, when the cache is full, replaces data blocks according to some policy and updates the address-translation unit accordingly.
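How these three parts interact can be shown with a minimal sketch: a toy set-associative cache in Python, with LRU replacement. All names here are illustrative, not any real hardware interface.

```python
# Toy set-associative cache illustrating the three parts described above:
# the data store, the address-translation (tag lookup) unit, and an LRU
# replacement unit.
class ToyCache:
    def __init__(self, num_sets=4, ways=2, block_size=16):
        self.num_sets, self.ways, self.block_size = num_sets, ways, block_size
        # Each set maps tag -> block data; dict insertion order tracks recency.
        self.sets = [dict() for _ in range(num_sets)]
        self.hits = self.misses = 0

    def _translate(self, addr):
        """Address-translation part: split an address into (tag, set index)."""
        block = addr // self.block_size
        return block // self.num_sets, block % self.num_sets

    def access(self, addr, memory):
        tag, idx = self._translate(addr)
        s = self.sets[idx]
        if tag in s:                       # hit: refresh LRU position
            self.hits += 1
            s[tag] = s.pop(tag)
        else:                              # miss: fetch the block from "main memory"
            self.misses += 1
            if len(s) >= self.ways:        # replacement part: evict the LRU tag
                s.pop(next(iter(s)))
            s[tag] = memory(addr)
        return s[tag]

cache = ToyCache()
mem = lambda a: a // 16                    # stand-in for a main-memory fetch
for a in [0, 4, 64, 0, 128, 192, 64]:
    cache.access(a, mem)
```

Running the short trace above exercises all three parts: repeated addresses hit, new blocks are fetched, and a full set forces an eviction.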

The first paper, "Vantage: Scalable and Efficient Fine-Grain Cache Partitioning", presents Vantage, a scalable and efficient fine-grained cache partitioning scheme.

Cache partitioning is widely used in chip multiprocessors, including in shared-cache control techniques that guarantee quality of service and security. However, existing partitioning techniques (way-partitioning, for example) are limited to coarse granularity: they inevitably support only a handful of partitions and reduce cache associativity, causing performance to decline. Computer architecture has now entered an era of many-core chips running many applications at once, so coarse-grained partitioning, practical only for 2-4 cores, cannot be extended to chips with tens of cores without suffering performance loss it cannot cope with.

The authors therefore present a new cache-partitioning technique: Vantage. It overcomes the limitations above: it supports tens of fine-grained partitions of specified sizes, while keeping high associativity and strong partition isolation. Vantage builds on cache arrays with good hashing and associativity, which make it easy to manage individual cache lines, and it enforces capacity allocations by controlling the replacement process.

Unlike prior cache-partitioning schemes, Vantage provides strict isolation guarantees over most of the cache (for example, 90%) rather than all of it. Vantage is derived from analytical models that allow it to provide strong guarantees on both associativity and isolation. And it is not hard to implement: with less than 2% state overhead and only simple modifications to the cache controller, it achieves significant improvement.

The authors support these claims with extensive simulation data on the fine-grained partitioning problem. Evaluating a 32-core system with 350 multiprogrammed workloads, partitioning the last-level cache with conventional techniques degrades throughput for 71% of the workloads versus an unpartitioned cache (by 7% on average, and by up to 25%), even with a 64-way cache. In contrast, Vantage improves throughput for 98% of the workloads (by 8% on average, up to 20%) using only a 4-way cache.

In sum, this thesis puts forward Vantage, a fine-grained cache-partitioning technique with good scalability and high efficiency. It works by matching the insertion and demotion rates of each partition, keeping each partition's size roughly constant. Under partitioning policies such as UCP, Vantage outperforms existing schemes even on small-scale chip multiprocessors, and, more importantly, it continues to provide benefits as chip multiprocessors scale to tens of threads, which prior schemes could not do.


The paper above viewed the cache mainly from the angle of partitioning for scalability and efficiency, but the study of computer architecture certainly cannot stop at efficiency. In caches, the coherence problem is just as important: if cache coherence is not solved, or if its overhead is large, the cache loses its point. In "Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks", the authors discuss how to avoid coherence maintenance for private memory blocks in order to improve the effectiveness of directory caches.

The authors note that, to satisfy the performance demands of shared-memory servers, multiprocessor systems must adopt efficient and scalable cache-coherence protocols, such as directory-based coherence. A directory can solve cache coherence to a great extent, but as the system scales up, the directory naturally grows larger and larger, which causes a serious problem: frequent evictions of directory entries. Each such eviction invalidates cached blocks, greatly reducing system performance.

Profiling processor caches shows that a large fraction of memory blocks are accessed by only one processor, even in parallel applications, so this fraction requires no coherence maintenance. The authors therefore propose to dynamically identify those private memory blocks, deactivate the coherence protocol for them, and treat them as in a uniprocessor system. Deactivation lets the directory cache ignore a substantial number of blocks it would otherwise have to track, reducing its load and increasing its effective size. Because the operating system cooperates in detecting private blocks, the authors argue that only simple modifications are needed.
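The OS-side classification idea can be sketched roughly as follows: each page starts out private to its first accessor, and a touch from any other core promotes it to shared, at which point coherence must be (re)activated. This is my own hedged sketch of the concept; the class and method names are hypothetical, not the paper's interface.

```python
# Hypothetical sketch of the OS-level classification the paper relies on:
# a page is private to its first accessor until another core touches it,
# at which point it becomes shared and needs coherence again.
PRIVATE, SHARED = "private", "shared"

class PageClassifier:
    def __init__(self):
        self.owner = {}          # page -> first core that touched it
        self.state = {}          # page -> PRIVATE or SHARED

    def on_access(self, page, core):
        if page not in self.owner:
            self.owner[page] = core
            self.state[page] = PRIVATE        # coherence deactivated
        elif self.state[page] == PRIVATE and core != self.owner[page]:
            self.state[page] = SHARED         # coherence recovery needed here

    def needs_directory_entry(self, page):
        # Only shared pages consume directory-cache capacity.
        return self.state.get(page) == SHARED

pc = PageClassifier()
for page, core in [(0, 0), (0, 0), (1, 1), (0, 2)]:
    pc.on_access(page, core)
```

In this trace, page 0 becomes shared once core 2 touches it, while page 1 stays private and never needs a directory entry.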

After a series of theoretical analyses, the authors simulated the full design. The simulation results show that, with the proposal in this paper, directory caches avoid tracking about 75% of the blocks accessed, and can therefore use their capacity much better.

Two benefits follow. On one hand, the logic becomes clearer because fewer processors are involved in coherence; on the other, parallel applications run faster: the experimental results show runtime reductions of up to 15% when the original directory-cache size is kept unchanged. Alternatively, if system performance is kept constant, a directory cache 8 times smaller can be used.

After reading this article, I think its key points are well worth reference. It proposes only a very simple method, yet significantly improves the effectiveness of directory caches: since private blocks need no coherence protocol, the idea is simply to avoid tracking them. The performance improvements were noted above, but the recovery mechanism also deserves mention. The coherence-recovery mechanism has no significant impact on system performance and only slightly affects energy consumption; a flushing-based recovery mechanism therefore seems the most appropriate choice, since it provides performance similar to an updating-based one.


Returning to caches in general, it is undeniable that the cache is a great mechanism for easing the mismatch between processor and memory access speeds. The appearance of the cache was a key point for performance, and as caches have developed, researchers have found more and more areas for gradual improvement, so cache work in computer architecture keeps moving in a healthy direction. The next paper discusses bypass and insertion algorithms for exclusive last-level caches (LLCs). It was published at the famous ISCA conference in 2011 under the title "Bypass and insertion algorithms for exclusive last-level caches".

From earlier study of multilevel caches, we know one difficulty of cache design: replacing old contents when the cache is full. The simplest policy is LRU. In practice, since future references cannot be known in advance, only approximate LRU can be implemented, which may cause some elements to be moved in and out of the cache frequently. An improved scheme considers access frequency: frequently used elements are not evicted even if they do not appear for a while. But implementing LRU together with frequency tracking is troublesome; a simple scheme is to keep elements of different frequencies in different cache levels, each level using LRU.
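The basic LRU policy described above can be sketched in a few lines with Python's `OrderedDict` (a software illustration only, not how hardware implements it):

```python
from collections import OrderedDict

# Minimal LRU replacement policy: on a hit the block moves to the
# most-recently-used position; on a miss with a full cache the
# least-recently-used block is evicted.
class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()
        self.evictions = []

    def access(self, key):
        """Return True on a hit, False on a miss (allocating the block)."""
        if key in self.store:
            self.store.move_to_end(key)       # hit: promote to MRU
            return True
        if len(self.store) >= self.capacity:  # miss on a full cache
            victim, _ = self.store.popitem(last=False)
            self.evictions.append(victim)     # evict the LRU block
        self.store[key] = True
        return False

c = LRUCache(3)
trace = ["a", "b", "c", "a", "d"]   # "b" is least recent when "d" arrives
results = [c.access(k) for k in trace]
```

The trace shows the policy's behavior: the re-access of `a` hits and promotes it, so when `d` arrives the victim is `b`, not `a`.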

With the above as background, this thesis focuses on the last-level cache (hereafter LLC), and specifically on bypass and insertion algorithms. Academia has many ideas about LLCs, because the LLC is the last cache level interacting with main memory; as such a watershed, it is a place where many policies can be applied. For example, an ISCA 2010 paper, "The virtual write queue: coordinating DRAM and last-level cache policies", proposed a virtual write queue to coordinate the LLC with DRAM.

Broadly speaking, inclusive LLCs waste much silicon by replicating cache blocks across levels, and as the industry deepens its multilevel cache hierarchies, that wasted space leads to greater performance loss compared with exclusive LLCs. The advantage of exclusive LLCs is thus obvious, but they also bring a difficulty: designing their replacement algorithm becomes more challenging. In an ordinary (inclusive) LLC, a block's access-history information can be collected and filtered; in an exclusive LLC this is impossible, because on a hit the block is deallocated from the LLC. The problem this creates is that the most widely used replacement policy, LRU, and its derivatives fail, so in exclusive LLCs choosing the appropriate insertion policy for each cache block becomes more critical. On the other hand, not every block needs to be allocated at all, which is a prerogative of an exclusive LLC. This is called selective cache bypassing; it is not possible in an inclusive LLC, because it would violate the inclusion property.

Building on these design ideas, the authors discuss insertion and bypass algorithms for exclusive LLCs. Execution-driven simulation shows that the best combination of insertion and bypass policies improves performance by up to 61.2%, a staggering figure, with a geometric mean of 3.4%, over 97 single-threaded dynamic instruction traces drawn from SPEC 2006 and server applications, running on a 2 MB 16-way exclusive LLC with multi-stream hardware prefetching; the baseline for comparison is a conventional exclusive LLC design. The corresponding improvements in throughput for 35 4-way multiprogrammed workloads running on an 8 MB 16-way shared exclusive LLC are up to 20.6%, with a geometric mean of 2.5%.

In general, the paper's main work is to raise the important problems of exclusive LLCs, including the failure of LRU and LRU-like algorithms in this setting, and then to propose a series of designs, mainly selective bypass and insertion policies. These rely on two attributes of a block: first, its trip count, the number of trips the block has made between the L2 cache and the LLC; second, its use count, the number of hits it received while resident in L2. Finally, the authors recommend a best design: a bypass and insertion algorithm combining trip count with use frequency.
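The flavor of such a trip-count/use-count policy can be sketched as a simple decision function. The thresholds and action names below are my own illustrative choices, not the tuned values or exact mechanism from the paper:

```python
# Hedged sketch of a trip-count/use-count decision for an exclusive LLC.
# trip_count: how many L2<->LLC round trips the block has made so far;
# use_count:  how many L2 hits it received during its last L2 residency.
# Thresholds here are illustrative, not the paper's tuned values.
def on_l2_eviction(trip_count, use_count):
    """Decide what to do with a block just evicted from L2."""
    if trip_count == 0 and use_count == 0:
        return "bypass"            # dead on arrival: don't allocate in the LLC
    if trip_count >= 2 or use_count >= 2:
        return "insert_high"       # reused block: insert near the MRU position
    return "insert_low"            # lukewarm block: insert near the LRU position

decisions = [on_l2_eviction(t, u) for t, u in [(0, 0), (0, 1), (3, 0), (1, 5)]]
```

The point of the sketch is the structure: blocks that show no reuse never occupy LLC space (bypass), while blocks with a reuse history are inserted with higher priority.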



2. Parallel Technology

Parallelism is a very important concept in computer architecture. "Parallel" refers to events occurring in the system at the same time. When dealing with parallel problems, two aspects usually matter: detecting and responding to external events that may occur in any order, and ensuring the response comes within the required time.

In computer architecture, parallelism can be achieved at several levels: processors, memory, pipelines, and so on. On a simple view, exploiting parallelism brings many benefits: work is divided into blocks and completed by something like the idea of divide and conquer, and running in parallel saves time. But introducing parallelism also brings negative effects: the system's logic becomes more complex than before, and power consumption increases. These drawbacks do not stop researchers from exploring performance, because the pursuit of a more efficient system naturally requires trade-offs in ambiguous areas.
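The divide-and-conquer idea mentioned above can be sketched with Python's standard library: split the work into chunks, hand each chunk to a worker, and combine the results. The function names are my own; a thread pool keeps the sketch portable, though for CPU-bound work a process pool would be needed to actually spread chunks across cores.

```python
from concurrent.futures import ThreadPoolExecutor

# Divide and conquer: partition the data, process chunks in parallel workers,
# then combine the partial results.
def chunk_sum(chunk):
    return sum(x * x for x in chunk)

def parallel_sum_squares(data, workers=4):
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves chunk order; sum() is the "combine" step.
        return sum(pool.map(chunk_sum, chunks))

total = parallel_sum_squares(list(range(1000)))
```

The combine step here is a plain sum, so the result is identical to the sequential computation; the parallel version simply overlaps the per-chunk work.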

Consulting these papers, I found that many take trade-off design as their topic. This shows that parallel technology today is not an endless pursuit of one aspect, but a more rational weighing: choosing a direction of development from a sound angle rather than chasing a single best angle.


Among the many discussions is a paper titled "Exploring the tradeoffs between programmability and efficiency in data-parallel accelerators", whose main point is exactly the trade-off between programmability and efficiency in data-parallel accelerators.

In this paper, the authors classify data-parallel accelerator design patterns, including MIMD, vector-SIMD, subword-SIMD, SIMT, and vector-thread (VT), and present a modular implementation methodology. They develop a new VT microarchitecture, Maven, based mainly on the traditional vector-SIMD microarchitecture; compared with previous VT designs, it is easier to program and generate code for. Using an extensive design-space exploration of complete VLSI implementations of many accelerators, they evaluate the differing trade-offs between programmability and execution efficiency among the MIMD, vector-SIMD, and VT patterns. Their common test environment is a set of compiled microbenchmarks and application kernels.

They find that vector cores are more efficient than MIMD cores, even on fairly irregular kernels. The experimental results show that their Maven VT microarchitecture is superior to the traditional vector-SIMD architecture, providing both greater efficiency and easier programming.

After reading this article, my takeaway is that effective data-parallel accelerators must handle both regular and irregular data-level parallelism (DLP) efficiently while still retaining programmability. The authors' detailed VLSI results confirm this point: microarchitectures based on vector cores need less energy and area than those based on scalar cores, even when the data-level parallelism is quite irregular.

This paper refers to the concept of data-level parallelism (DLP). There is a similar concept, memory-level parallelism (MLP). The paper "Dynamic warp subdivision for integrated branch and memory divergence tolerance" comes to MLP when discussing memory divergence. In SIMD architectures, fetch and decode are amortized across multiple processing units to maximize throughput, but the lanes must run in lockstep, so whenever threads stall on long-latency memory accesses, the idle cycles generated are extremely expensive. Interleaving the execution of multiple warps can hide latency, but using many warps greatly increases the cost of the register file, and cache contention can degrade performance. This paper proposes dynamic warp subdivision (DWS), in which a single warp is allowed to occupy more than one scheduler slot without extra register-file space. The final result of this design is improved latency hiding and memory-level parallelism (MLP). The evaluation uses a cache hierarchy with private L1 caches and a shared L2.
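Why splitting a warp improves MLP can be shown with a toy cycle count. In the sketch below (entirely my own simplified model, not the paper's simulator), two lanes each take one long-latency miss at different points in a 6-instruction program: in lockstep the two stalls serialize, while with subdivision they overlap.

```python
# Toy model of the benefit of dynamic warp subdivision (DWS).
MISS_LAT = 10                         # cycles for a long-latency miss
miss_at = {"laneA": 2, "laneB": 4}    # instruction index where each lane misses
N_INSTR = 6

def lockstep_cycles():
    # Without DWS the whole warp stalls whenever any lane misses,
    # so the two misses serialize: no memory-level parallelism across lanes.
    cycles = 0
    for i in range(N_INSTR):
        cycles += 1
        if i in miss_at.values():
            cycles += MISS_LAT
    return cycles

def subdivided_cycles():
    # With DWS each sub-warp proceeds independently, so the misses overlap;
    # total time is bounded by the slowest lane, not the sum of the stalls.
    def lane_finish(lane):
        cycles = 0
        for i in range(N_INSTR):
            cycles += 1
            if miss_at[lane] == i:
                cycles += MISS_LAT
        return cycles
    return max(lane_finish(lane) for lane in miss_at)
```

With these numbers the lockstep warp takes 26 cycles (6 instructions plus two serialized 10-cycle stalls), while the subdivided warp takes 16, because the second miss is issued while the first is still outstanding.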


Parallel technology, broadly speaking, extends to many aspects.

Next, the paper "Virtualizing performance-asymmetric multi-core systems" focuses on the performance changes that virtualization brings.

Performance-asymmetric multicores are composed of heterogeneous cores that support the same ISA but have different computational capabilities. To maximize the throughput of an asymmetric multicore system, the operating system is responsible for scheduling threads onto the different types of cores. However, asymmetric multicores challenge virtualization, because the hypervisor hides the heterogeneity of the physical cores from the guest operating systems. In this paper, the authors discuss the design of an asymmetry-aware scheduler in the hypervisor, one that requires no asymmetry awareness in the guest OS. The proposed scheduler is efficient: it maps virtual cores onto physical cores so that each fast core runs at high utilization and its benefit is maximized. Besides overall system throughput, the authors consider two further important aspects of asymmetric multicore virtualization: fairness between virtual machines, and the scalability of performance across fast and slow cores.

The authors not only argue the possibility in theory but also implement the asymmetry-aware scheduler in the open-source Xen hypervisor. By running a variety of applications, they evaluate exactly how effective the proposed scheduler is at improving system throughput compared with an asymmetry-unaware one. The modified scheduler improves performance by up to 40% over the stock Xen scheduler on a 12-core system with four fast and eight slow cores. The results show that, with only modest performance degradation for virtual machines scheduled on slow cores, the scheduler can still provide scalable performance as the number of fast cores increases.
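The fairness aspect can be sketched simply: rotate the virtual cores over the fast physical cores so that, over a full rotation, every vCPU accumulates the same amount of fast-core time. This is my own toy model of the fairness idea, not Xen's actual credit accounting; all names are hypothetical.

```python
from itertools import cycle

# Hedged sketch of fair scheduling on a performance-asymmetric system:
# round-robin the virtual cores over the fast physical cores so every vCPU
# gets an equal share of fast-core quanta over time.
def fast_core_shares(vcpus, n_fast_cores, quanta):
    """Return fast-core quanta received per vCPU after `quanta` rounds."""
    share = {v: 0 for v in vcpus}
    order = cycle(vcpus)
    for _ in range(quanta):
        # Each round, the next n_fast_cores vCPUs in rotation run fast;
        # the remainder run on slow cores (not tracked here).
        for _ in range(n_fast_cores):
            share[next(order)] += 1
    return share

shares = fast_core_shares(vcpus=list("abcd"), n_fast_cores=2, quanta=6)
```

With 4 vCPUs, 2 fast cores, and 6 rounds, 12 fast-core quanta are handed out and each vCPU receives exactly 3, illustrating fairness over the rotation.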


3. Reliable Architecture

By its name, "reliable architecture" is not a specific technical term but a general one: it mainly covers those architectures that, through improved techniques or mechanisms, make their own systems more reliable.

In computer architecture, most researchers hope to design architectures and systems that can withstand the test of the information age, so many people engage in research on reliable architecture.

Before reading the first paper on this topic, I clearly felt my grasp of computer architecture being tested, because the paper is mostly related to memory consistency, and consistency problems are not reflected only here: in distributed memory architectures too, one can see the importance of consistency principles in design.

This paper, "Rebound: scalable checkpointing for coherent shared memory", is mainly about scalable checkpointing for coherent shared memory; the authors name the technique Rebound, which shows their confidence.

The paper first introduces the background of Rebound's application domain. Ours is the era of the transition to many-core, and hardware-based global checkpointing mechanisms designed for small shared-memory machines no longer scale well. The barriers to scaling mainly include inefficient global operations, unnecessary global rollback, and unbalanced I/O load. A scalable checkpointing technique should instead track the interdependences between threads, and perform checkpointing and rollback only over the dynamic groups of processors that have communicated.

Clearly this is a problem that appears when computer architecture develops to a certain point: technology that matured on single- or dual-core machines hits bottlenecks when applied to multicore or even many-core systems. The main point here concerns checkpointing for shared memory.

To solve the scalability problems mentioned above, the paper introduces Rebound, the first hardware-based scheme to coordinate local checkpointing in a multiprocessor, assuming directory-based cache coherence. Rebound adds transactional support to the directory protocol to track dependences between threads. In addition, it improves checkpointing efficiency in the following ways:

A. delaying the write-back of checkpoint data to safe memory;

B. supporting multiple in-flight checkpoint operations;

C. optimizing checkpointing at barrier synchronization points.

Rebound also introduces a distributed algorithm for multiprocessor checkpointing and rollback. In simulations of parallel programs with up to 64 threads, Rebound scales well with very low overhead: for 64 processors, its average performance overhead is only 2%, compared with 15% for global checkpointing.
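The key idea behind the coordinated local checkpointing above can be sketched with a union-find structure: threads that have communicated since their last checkpoint form a dependency group, and only that group must checkpoint (or roll back) together, not the whole machine. This is my own simplified sketch of the grouping concept; the class and method names are hypothetical.

```python
# Hedged sketch of Rebound's grouping idea: threads that have communicated
# since their last checkpoint belong to one dependency group, and only that
# group needs to checkpoint or roll back together.
class CheckpointGroups:
    def __init__(self, n_threads):
        self.parent = list(range(n_threads))   # union-find over threads

    def _find(self, t):
        while self.parent[t] != t:
            self.parent[t] = self.parent[self.parent[t]]  # path halving
            t = self.parent[t]
        return t

    def record_communication(self, producer, consumer):
        """Data flowed between two threads: merge their dependency groups."""
        self.parent[self._find(producer)] = self._find(consumer)

    def checkpoint_set(self, t):
        """All threads that must checkpoint together with thread t."""
        root = self._find(t)
        return {x for x in range(len(self.parent)) if self._find(x) == root}

g = CheckpointGroups(8)
g.record_communication(0, 1)
g.record_communication(1, 2)
g.record_communication(5, 6)
```

Here threads 0-2 form one group and 5-6 another, while thread 3 never communicated, so it can checkpoint entirely on its own — the source of the scalability win.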

The paper above uses the term memory access latency: the delay spent waiting for an access to system memory to complete. This is a very important concept in computer architecture, and many memory-related topics touch on it. For example, "Energy-efficient Mechanisms for Managing Thread Context in Throughput Processors" also improves and optimizes in this area. Modern graphics processors (GPUs), the paper notes, use a large number of hardware threads to hide memory access latency and keep the functional units busy. Such extreme multithreading requires a complex thread scheduler and a large register file, which are expensive to access in both energy and latency. The authors put forward two complementary techniques to reduce the energy consumption of these highly threaded processors. One of the techniques is a two-level thread scheduler that maintains a small set of active threads to hide ALU and local-memory access latency.


Among the many techniques for improving the reliability of a system architecture is one known as hardware performance counters. A 2011 ISCA paper uses them to make software race detection demand-driven; the paper is called "Demand-driven software race detection using hardware performance counters".

Dynamic data-race detection is an important mechanism for building robust parallel programs. Software race detectors instrument the program under test, observe each memory access, and monitor for the concurrency errors that sharing data between threads can cause. But the races themselves are usually very difficult to observe, and this instrumentation-based method suffers high runtime overhead.

This paper presents the technique named above: a hardware-assisted, demand-driven race detector. With hardware performance counters, various cache events can be detected; these events indicate data sharing among threads, and their availability is a major advantage of modern commercial microprocessors. The authors use this advantage to build a race detector that is started only when data sharing between threads actually happens. When little sharing appears, the demand-driven analysis is far faster than today's continuous-analysis tools, with no loss of accuracy.
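The demand-driven structure can be sketched as follows: a cheap monitor (standing in for the hardware sharing events) decides when the expensive race check runs at all. This is my own simplified illustration with a naive lock-based check, not the paper's actual detector; all names are hypothetical.

```python
# Hedged sketch of demand-driven analysis: a cheap sharing monitor gates
# the expensive race check, so it only runs when threads actually share data.
class DemandDrivenDetector:
    def __init__(self):
        self.last_writer = {}     # addr -> (thread, held_lock) of the last write
        self.checks_run = 0       # how many times the expensive check fired
        self.races = []

    def on_write(self, thread, addr, locked):
        prev = self.last_writer.get(addr)
        if prev is not None and prev[0] != thread:
            # Sharing event observed: only now invoke the detailed check.
            self.checks_run += 1
            if not (locked and prev[1]):   # report unless both accesses held the lock
                self.races.append(addr)
        self.last_writer[addr] = (thread, locked)

d = DemandDrivenDetector()
d.on_write(0, 0x10, locked=True)
d.on_write(0, 0x10, locked=True)   # same thread: no sharing, no check
d.on_write(1, 0x10, locked=True)   # sharing, but both locked: no race
d.on_write(2, 0x20, locked=False)
d.on_write(3, 0x20, locked=False)  # sharing and unlocked: race reported
```

Of the five writes, the detailed check runs only twice (the two cross-thread sharing events), which is the whole point: accesses with no sharing cost almost nothing.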


To sum up my reading of these ten papers: they improve computer performance and increase the reliability of computer architecture from different directions, and some of them focus on greatly reducing the cost or energy overhead of computers while performance or reliability declines only slightly. These are the hot topics that computer-architecture researchers take seriously, very promising directions for the field's future development, and of great practical value.

Their basic approach is to start from simple concepts and think one's way to the problems, so as then to solve them. This method suits not only computer architecture but other areas of scientific research as well. It also tells me to use this guiding principle in my own future research, and it paints a promising blueprint for everyone.




References (also marked in the text):

[1] Daniel Sanchez, Christos Kozyrakis. Vantage: Scalable and Efficient Fine-Grain Cache Partitioning. ISCA 2011.

[2] Blas Cuesta, Alberto Ros, María E. Gómez, et al. Increasing the Effectiveness of Directory Caches by Deactivating Coherence for Private Memory Blocks. ISCA 2011.

[3] Jayesh Gaur, Mainak Chaudhuri, Sreenivas Subramoney. Bypass and Insertion Algorithms for Exclusive Last-Level Caches. ISCA 2011.

[4] The Virtual Write Queue: Coordinating DRAM and Last-Level Cache Policies. ISCA 2010.

[5] Yunsup Lee, Rimas Avizienis, Alex Bishara, et al. Exploring the Tradeoffs between Programmability and Efficiency in Data-Parallel Accelerators. ISCA 2011.

[6] Dynamic Warp Subdivision for Integrated Branch and Memory Divergence Tolerance. ISCA 2010.

[7] Youngjin Kwon, Changdae Kim, Seungryoul Maeng. Virtualizing Performance-Asymmetric Multi-core Systems. ISCA 2011.

[8] Rishi Agarwal, Pranav Garg, Josep Torrellas. Rebound: Scalable Checkpointing for Coherent Shared Memory. ISCA 2011.

[9] Mark Gebhart, Daniel R. Johnson, et al. Energy-efficient Mechanisms for Managing Thread Context in Throughput Processors. ISCA 2011.

[10] Demand-Driven Software Race Detection using Hardware Performance Counters. ISCA 2011.
