Channel: Intel® Software Development Emulator

Buying or Renewing Licenses for Intel® Software Development Products


Intel offers a variety of licensing options for its software development products. The available options for purchasing or renewing licenses for Intel® software are listed below.

You can also download free 30-day evaluation versions of Intel® software development products. Free evaluation versions of our products are available from the software evaluation center.

All prices listed below are for a single-developer commercial license. All prices are manufacturer's suggested retail prices and may change without prior notice. Prices DO NOT include VAT or other applicable taxes and fees.

  • For information about transferable licenses, node-locked licenses, and other licensing options, contact a reseller or an Intel representative at intel.software.sales@intel.com.
  • To purchase a license for scientific, research, or educational institutions, select the product you need; the discounted price will be shown at checkout. For more information about all of our academic products, visit the academic offerings center or contact an Intel representative at academicdevelopersinfo@intel.com.
  • Support renewal: one year of support starting from the expiration date of your current support agreement.
  • Existing customers qualify for special upgrade pricing on Intel® Parallel Studio XE 2013, Intel® C++ Studio XE 2013, and Intel® Fortran Studio XE 2013. Learn more about upgrade options.

Category | Product MSRP (single user) | Support renewal MSRP (single user) | Options

Product Suites

Intel® Parallel Studio XE 2013 for Windows*
Includes Intel® Composer XE 2013, Intel® VTune™ Amplifier XE 2013, Intel® Inspector XE 2013, and Intel® Advisor XE 2013
$2,299 | $799** | Find a reseller › | All options ›

Intel® Parallel Studio XE 2013 for Linux*
Includes Intel® Composer XE 2013, Intel® VTune™ Amplifier XE 2013, Intel® Inspector XE 2013, and Intel® Advisor XE 2013
$2,299 | $799** | Find a reseller › | All options ›

Intel® C++ Studio XE 2013 for Windows or Linux
Includes Intel® C++ Composer XE 2013, Intel® VTune™ Amplifier XE 2013, Intel® Inspector XE 2013, and Intel® Advisor XE 2013
$1,599 | $599** | Find a reseller › | All options ›

Intel® Visual Fortran Studio XE 2013 for Windows
Includes Intel® Visual Fortran Composer XE 2013, Intel® VTune™ Amplifier XE 2013, Intel® Inspector XE 2013, and Intel® Advisor XE 2013
$1,899 | $699** | Find a reseller › | All options ›

Intel® Fortran Studio XE 2013 for Linux
Includes Intel® Fortran Composer XE 2013, Intel® VTune™ Amplifier XE 2013, Intel® Inspector XE 2013, and Intel® Advisor XE 2013
$1,899 | $699** | Find a reseller › | All options ›

Intel® Parallel Studio
Includes Intel® Parallel Advisor, Intel® Parallel Amplifier, Intel® Parallel Composer, and Intel® Parallel Inspector
$320 | Find a reseller › | All options ›

Intel® Cluster Studio XE 2013 for Windows
Includes Intel® C++ Composer XE 2013, Intel® Visual Fortran Composer XE 2013, Intel® Trace Analyzer and Collector 8.1, Intel® MPI Library 4.1, Intel® MPI Benchmarks, Intel® Inspector XE 2013, Intel® VTune™ Amplifier XE 2013, and Intel® Advisor XE 2013
$2,949 | $1,049** | Find a reseller › | All options ›

Intel® Cluster Studio XE 2013 for Linux
Includes Intel® C++ Composer XE 2013, Intel® Fortran Composer XE 2013, Intel® Trace Analyzer and Collector, Intel® MPI Library 4.1, Intel® MPI Benchmarks, Intel® Inspector XE 2013, Intel® VTune™ Amplifier XE 2013, and Intel® Advisor XE 2013
$2,949 | $1,049** | Find a reseller › | All options ›

Intel® Cluster Studio 2013 for Windows
Includes Intel® Composer XE 2013, Intel® Trace Analyzer and Collector 8.1, Intel® MPI Library 4.1, and Intel® MPI Benchmarks
$2,049 | $749** | Find a reseller › | All options ›

Intel® Cluster Studio 2013 for Linux
Includes Intel® Composer XE 2013, Intel® Trace Analyzer and Collector 8.1, Intel® MPI Library 4.1, and Intel® MPI Benchmarks
$2,049 | $749** | Find a reseller › | All options ›

Compilers and Libraries

Intel® Composer XE 2013 for Windows
Includes Intel® C++ Composer XE 2013 and Intel® Visual Fortran Composer XE 2013
$1,199 | $449** | Find a reseller › | All options ›

Intel® Composer XE 2013 for Linux
Includes Intel® C++ Composer XE 2013 and Intel® Fortran Composer XE 2013
$1,449 | $499** | Find a reseller › | All options ›

Intel® C++ Composer XE 2013 for Windows, Linux, or OS X*
Includes Intel® C++ Compiler, Intel® Integrated Performance Primitives 7.1, Intel® Math Kernel Library 11.0, and Intel® Parallel Building Blocks
$699 | $249** | Find a reseller › | All options ›

Intel® Visual Fortran Composer XE 2013 for Windows
Includes Intel® Visual Fortran Compiler and Intel® Math Kernel Library 11.0
$849 | $299** | Find a reseller › | All options ›

Intel® Fortran Composer XE 2013 for Linux
Includes Intel® Fortran Compiler and Intel® Math Kernel Library 11.0
$999 | $349** | Find a reseller › | All options ›

Intel® Fortran Composer XE 2013 for OS X
Includes Intel® Fortran Compiler and Intel® Math Kernel Library 11.0
$849 | $299** | Find a reseller › | All options ›

Intel® C++ Compiler Professional Edition with support for the QNX Neutrino* RTOS
Includes Intel® C++ Compiler and Intel® Integrated Performance Primitives 7.1
$599 | $240 | All options ›

Intel® C Compiler for EFI Byte Code
$995 | $398 | Find a reseller › | All options ›

Intel® Visual Fortran Composer XE 2013 for Windows with IMSL 6.0*
Includes Intel® Visual Fortran Compiler, the IMSL* Fortran Numerical Library, and Intel® Math Kernel Library 11.0. Includes one developer license and one deployment license intended for the developer. Providing applications that contain IMSL code to users other than developers requires a deployment license.
$2,049 | $749** | Find a reseller › | All options ›

IMSL* Run-Time Licenses (also called IMSL* Deployment Licenses)

IMSL* Licensing FAQ

Commercial single-user run-time license for applications with IMSL code on systems with up to 16 processor cores
$2,049 | $685 | Find a reseller › | All options ›

Pack of 10 commercial single-user run-time licenses for applications with IMSL code on systems with up to 16 processor cores
$9,709 | $1,826 | Find a reseller › | All options ›

Commercial multi-user run-time license for applications with IMSL code on systems with up to 64 processor cores
$13,592 | $2,557 | Find a reseller › | All options ›

Tools for the Intel® Atom™ Processor

Intel® Embedded Software Development Tool Suite for the Intel® Atom™ Processor
Includes Intel® C++ Compiler, Intel® Application Debugger, Intel® JTAG Debugger, Intel® Integrated Performance Primitives 7.1, and Intel® VTune™ Performance Analyzer
$1,999 | $799 | Find a reseller › | All options ›

Performance Libraries

Intel® Integrated Performance Primitives 7.1 for Windows or Linux
$199 | $69** | Find a reseller › | All options ›

Intel® Math Kernel Library 11.0 for Windows or Linux
$499 | $179** | Find a reseller › | All options ›

Intel® Threading Building Blocks 4.1 for Windows, Linux, or OS X
$499 | $179** | Find a reseller › | All options ›

Application Performance Analyzers

Intel® VTune™ Amplifier XE 2013 for Windows or Linux
$899 | $349** | Find a reseller › | All options ›

Memory and Thread Checking Tools

Intel® Inspector XE 2013 for Windows or Linux
$899 | $349** | Find a reseller › | All options ›

Cluster Tools

Intel® Cluster Studio XE 2013 for Windows
Includes Intel® C++ Composer XE 2013, Intel® Fortran Composer XE 2013, Intel® Trace Analyzer and Collector, Intel® MPI Library, Intel® MPI Benchmarks, Intel® Inspector XE 2013, Intel® VTune™ Amplifier XE 2013, and Intel® Advisor XE 2013
$2,949 | $1,049** | Find a reseller › | All options ›

Intel® Cluster Studio XE 2013 for Linux
Includes Intel® C++ Composer XE 2013, Intel® Fortran Composer XE 2013, Intel® Trace Analyzer and Collector, Intel® MPI Library 4.1, Intel® MPI Benchmarks, Intel® Inspector XE 2013, Intel® VTune™ Amplifier XE 2013, and Intel® Advisor XE 2013
$2,949 | $1,049** | Find a reseller › | All options ›

Intel® Cluster Studio 2013 for Windows
Includes Intel® Composer XE 2013, Intel® Trace Analyzer and Collector 8.1, Intel® MPI Library 4.1, and Intel® MPI Benchmarks
$2,049 | $749** | Find a reseller › | All options ›

Intel® Cluster Studio 2013 for Linux
Includes Intel® Composer XE 2013, Intel® Trace Analyzer and Collector 8.1, Intel® MPI Library 4.1, and Intel® MPI Benchmarks
$2,049 | $749** | Find a reseller › | All options ›

Intel® MPI Library 4.1 for Windows or Linux
$499 | $179** | Find a reseller › | All options ›

System Modeling and Simulation Tools

CoFluent Studio*
N/A | N/A | All options ›

CoFluent Reader
N/A | N/A | All options ›

Graphics Application Profiling and Debugging Tools

Intel® Graphics Application Profiling and Debugging Tools
N/A | N/A | All options ›

**You can renew a license at the lowest price only if you do so before your previous subscription expires. For more information about license renewals, click here.

Intel is committed to protecting your privacy. To learn about our current practices for collecting and processing customer personal information, Intel product serial number data, and other data, see our privacy notice and the serial number verification notice.


Fun with Intel® Transactional Synchronization Extensions


By now, many of you have heard of Intel® Transactional Synchronization Extensions (Intel® TSX). If you have not, I encourage you to check out this page (http://www.intel.com/software/tsx) before you read further. In a nutshell, Intel TSX provides transactional memory support in hardware, making life easier for developers who need to write synchronization code for concurrent and parallel applications. It comes in two flavors: Hardware Lock Elision (HLE) and Restricted Transactional Memory (RTM). If you haven't read the background, go and do so now, since from here on I assume that you have that basic knowledge.

For the past few years I have been developing a Pin-based emulator for Intel TSX. The emulator is now integrated into the Intel Software Development Emulator (Intel SDE). During its development I had a lot of grins and grimaces with respect to HLE/RTM. I would like to share three particularly memorable incidents.

The Incidents

Example 1.

The following codelet is part of a test program written by a colleague of mine who wanted to learn how to use RTM. With the array ‘data’ containing integer values and the array ‘group’ mapping the data’s elements to slots in the array ‘sums’, the test program tries to store the sum of the data belonging to a group in the corresponding slot of the array ‘sums’. Since multiple threads may access the same slot simultaneously, each addition is performed in an RTM transaction. When a transaction aborts, the thread re-executes the addition in the critical section along the fallback path (i.e., the ‘else’ branch). Do you think it is correct? If you don’t, can you spot what is wrong?

#pragma omp parallel for
    for(int i = 0; i < N; i++){
        int mygroup = group[i];
        if(_xbegin()==-1) {
            sums[mygroup] += data[i];
            _xend();
        } else {
            #pragma omp critical
            {
                sums[mygroup] += data[i];
            }
        }
    }

Example 2.

I was taught that code reuse is important when I was in school (sorry, not in kindergarten ;^)). So, I decided to put to work what I learned when a need arose to write an RTM test. The test was similar to the one in Example 1, except that this test alternates RTM and HLE transactions. (Notice that the test does not have the non-speculative fallback path required for the RTM transaction. Having no fallback path makes the test UNSAFE because Intel TSX does not guarantee forward progress; i.e., it can abort RTM transactions forever.) The test has two addition statements: one is protected with RTM and the other is protected with HLE. Quite a feat, eh? I felt proud of myself ;-) ... until I started to run the test. The test occasionally printed out incorrect sums. I panicked at first because the test was simple and looked almost identical to other tests, leading me to believe, however briefly, that the emulator had a nasty bug that had gone unnoticed for a long time. But after a closer look, I realized the test had a flaw. Can you see what I did wrong?

    #define PREFIX_XACQUIRE ".byte 0xF2; "
    #define PREFIX_XRELEASE ".byte 0xF3; "
 
    class mutex_elided {
      uint8_t flag;
      inline bool try_lock_elided() {
        uint8_t value = 1;
        __asm__ volatile (PREFIX_XACQUIRE "lock; xchgb %0, %1" // byte-wide exchange to match the uint8_t flag
                : "=q"(value),"=m"(flag):"0"(value),"m"(flag):"memory" );
        return uint8_t(value^1);
      }
    public:
      inline void acquire() {
        for(;;) {
            exponential_backoff backoff;
            while((volatile unsigned char&)flag==1)
                backoff.pause();
            if(try_lock_elided())
                return;
            __asm__ volatile ("pause\n" : : : "memory" );
        }
      }
 
      inline void release() {
        __asm__ volatile (PREFIX_XRELEASE "movb $0, %0" // byte-wide store to match the uint8_t flag
               : "=m"(flag) : "m"(flag) : "memory" );
      }
    };
    ...
 
 
      mutex_elided m;
#pragma omp parallel for
    for(int i = 0; i < N; i++) {
        int mygroup = group[i];
        if( (i&1) ) {
            while(_xbegin()!=-1) ;
            // must have a fallback path
            sums[mygroup] += 1;
            _xend();
        } else {
            m.acquire();
            sums[mygroup] += 1;
            m.release();
        }
    }

Example 3.

A colleague of mine tried to use RTM to improve performance of a benchmark. (I changed function names for clarity.) The following fragment of the benchmark permutes an array of IDs by, for each ID, swapping its value with that of a randomly picked partner. In the fallback path, elements i and j are exclusively acquired in the increasing order of their indices, and then written back in the reverse order. He was running it on the emulator and came back to me with an occasional hang problem. Can you come up with a sequence of events that leads to an indefinite wait?

bool pause( volatile int64_t* l ) {
    __asm__ __volatile__( "pause\n" : : : "memory" );
    return true;
}
 
int64_t read_and_lock( volatile int64_t* loc ) {
    int64_t val;
    while(1) {
        while( pause( loc ) )
            if(  empty_val != (val = *loc) )
                    break;
        assert( val!=empty_val );
        if ( __sync_bool_compare_and_swap( loc, val, empty_val ) )
            break;
    }
    assert( val!=0 );
    return val;
}
 
void write_and_release( volatile int64_t* loc, int64_t val ) {
    while( pause( loc ) )
        if( __sync_bool_compare_and_swap( loc, empty_val, val ) )
            break;
    return;
}
 
...
#pragma omp parallel for num_threads(16)
    for (int i=0; i<n; i++) {
        int j = (int64_t) ( n * gen_rand() );
 
        if( _xbegin()==-1 ) {
            if(i != j) {
                const vid_t tmp_val = vid_values[i];
                vid_values[i] = vid_values[j];
                vid_values[j] = tmp_val;
            }
            _xend();
        } else {
            if (i < j) {
                const vid_t tmp_val_i = read_and_lock( &vid_values[i] );
                const vid_t tmp_val_j = read_and_lock( &vid_values[j] );
                write_and_release( &vid_values[j], tmp_val_i );
                write_and_release( &vid_values[i], tmp_val_j );
            } else if (j < i) {
                const vid_t tmp_val_j = read_and_lock( &vid_values[j] );
                const vid_t tmp_val_i = read_and_lock( &vid_values[i] );
                write_and_release( &vid_values[i], tmp_val_j );
                write_and_release( &vid_values[j], tmp_val_i );
            }
        }
    }

Analysis

Example 1.

The fallback path has a race with the code in the RTM path. For example, the following interleaving may happen. (Always keep in mind that one should not make any assumptions about the relative speeds of threads!)

Thread 1                           Thread 2
start critical section
read sums[mygroup]
                                   do transaction that updates sums[mygroup]
write sums[mygroup]

As a result, the example occasionally loses the increment done in the RTM transaction.

Example 2.

Don’t let the HLE transaction fool you. When an HLE transaction gets aborted, it re-executes and acquires the same mutex non-speculatively. The RTM transaction on the other path never looks at that mutex, so when this happens the case effectively becomes identical to Example 1.

Example 3.

Again, one should not make any assumptions about the relative speeds of concurrently executing threads. Even though the fallback path is race free on its own, it has a race with the code in the RTM path. For example, the following sequence of events may occur.

Thread 1 (fallback path)                Thread 2 (RTM path)
read_and_lock( vid_values[i] )
                                        do transaction that swaps vid_values[i] and
                                        vid_values[k] and makes vid_values[i] non-zero
read_and_lock( vid_values[j] )
write_and_release( vid_values[j] )
write_and_release( vid_values[i] ): waits forever for vid_values[i] to become 0
 

Possible Fixes

Now that we have a concrete diagnosis for each of the examples, the fixes are straightforward.

Example 1.

Replacing ‘omp critical’ with an atomic increment such as __sync_add_and_fetch would fix the problem. I.e.,

    __sync_add_and_fetch( &sums[mygroup], data[i] );

A more general solution is to use a mutex in the fallback path and add it to the read set of the RTM transaction, so that the transaction is forced to abort if the mutex is acquired by another thread.

mutex fallback_mutex;
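// 'mutex' is assumed to be a simple lock type that also exposes is_acquired(),
// a non-destructive read of its lock word (the type itself is not shown here).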
 
...
#pragma omp parallel for num_threads(8)
    for(int i = 0; i < N; i++){
        int mygroup = group[i];
        if(_xbegin()==-1) {
            if( !fallback_mutex.is_acquired() ) {
                sums[mygroup] += data[i];
            } else {
                _xabort(1);
            }
            _xend();
        } else {
            fallback_mutex.acquire();
            sums[mygroup] += data[i];
            fallback_mutex.release();
        }
    }

Example 2.

Similarly, we may extend mutex_elided to have the is_acquired() method. Since the lock variable is read inside the RTM transaction, any non-speculative execution of the HLE path which makes the change to the lock variable visible will abort the transaction.

    mutex_elided m;
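    // is_acquired() is assumed to be added to mutex_elided as a plain read of 'flag',
    // e.g. return (volatile uint8_t&)flag == 1; (not shown in the class above).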
#pragma omp parallel for num_threads(8)
    for(int i = 0; i < N; i++) {
        int mygroup = group[i];
        if( (i&1) ) {
            while(_xbegin()!=-1) // having no fallback path is
                ;                // UNSAFE
            if( !m.is_acquired() )
                sums[mygroup] += data[i];
            else
                _xabort(0);
            _xend();
        } else {
            m.acquire();
            sums[mygroup] += data[i];
            m.release();
        }
    }

Example 3.

We can also apply the mutex-based approach to this example. Another approach is to read the two ID values in the RTM transaction and check whether either of them contains ‘empty_val’. If so, we abort the transaction and force the thread to follow the fallback path.

#pragma omp parallel for num_threads(16)
    for (int i=0; i<n; i++) {
        int j = (int64_t) ( n * gen_rand() );
        if( _xbegin()==-1 ) {
            if(i != j) {
                const vid_t tmp_val_i = vid_values[i];
                const vid_t tmp_val_j = vid_values[j];
                if( tmp_val_i==0 || tmp_val_j==0 )
                    _xabort(0);
                vid_values[i] = tmp_val_j;
                vid_values[j] = tmp_val_i;
            }
            _xend();
        } else {
            if (i < j) {
                const vid_t tmp_val_i = read_and_lock( &vid_values[i] );
                const vid_t tmp_val_j = read_and_lock( &vid_values[j] );
                write_and_release( &vid_values[j], tmp_val_i );
                write_and_release( &vid_values[i], tmp_val_j );
            } else if (j < i) {
                const vid_t tmp_val_j = read_and_lock( &vid_values[j] );
                const vid_t tmp_val_i = read_and_lock( &vid_values[i] );
                write_and_release( &vid_values[i], tmp_val_j );
                write_and_release( &vid_values[j], tmp_val_i );
            }
        }
    }

Conclusions

So, what have I learned from these examples? As you may have already noticed, all of them are related to the ‘restricted’ part of RTM. Intel TSX has great potential for improving the performance of concurrent/parallel applications. But the synchronization between the speculative code inside the RTM transaction and the non-speculative fallback path needs to be carefully managed, since the interactions are subtle. I gather most programmers won’t need to worry too much about it, because higher-level abstractions in supporting libraries should hide most of the agonizing synchronization details. But for those who are willing to get their hands dirty to squeeze out the last drop of performance, it always pays to keep a watchful eye on the interactions between an RTM code path and its non-speculative fallback. (And we have many tools, such as Intel SDE, to assist you.)

Disclaimer: The opinion expressed in the blog is the author's own and reflects none of his employer's or his colleagues'.


    Analyzing Intel® SDE's TSX-related log data for capacity aborts


    Starting with version 7.12.0, Intel® SDE has Intel® TSX-related instruction and memory access logging features which can be useful for debugging Intel® TSX capacity aborts. With the log data from Intel SDE you can diagnose cache set population to determine whether non-uniform cache set usage is causing capacity overflows. The refined log data can then be used to track down the source of the aborts. Since the raw log file may be too large to navigate and diagnose directly, a simple Python script is presented here to help analyze Intel TSX-related log files and root-cause capacity aborts.

    TxSDELogAnalyzer.py is a simple program which focuses on capacity-aborted transactions. It uses SDE log data, which can be collected with the following set of parameters.

    >$ sde -tsx -hle_enabled 1 -rtm_mode full -tsx_debug_log 3 -tsx_log_inst 1 -tsx_log_file 1

    We will describe a few features of this simple script and why they can be useful for debugging transaction aborts due to capacity overflows. In a typical scenario, you may do the following to find the source of those aborts.

    (i)   Look at the TSX read/write set sizes and cache set usage distribution

    (ii)  Take a random sample of aborted transactions and closely examine them

    (iii) In conjunction with (ii), look at the log data of capacity-aborted transactions as well as of committed transactions and compare how they differ, hoping to spot a difference that leads to the source of the aborts.

    Cache set usage distribution

    You might need to see how the cache sets are populated in an aborted transaction to check for outliers or non-uniformity that could be causing premature overflows. You can then locate the data structure used in this transaction and change it so that it uses the available cache sets uniformly, thus avoiding capacity overflows (see the CIA allocator as a possible mitigation). Under normal circumstances you may need to compare cache access distributions between committed and aborted transactions to see how they differ, for example by plotting a histogram or any suitable graph of both cases.

    TxSDELogAnalyzer.py calculates the average cache set population among all transactions of the same type from the input log file. It outputs the final result in CSV format, making it easy to plot a graph in an Excel sheet, for example. The following command structure is used to obtain the average distribution of cache line accesses.

    >$ TxSDELogAnalyzer.py -D <type_num> [-o <output_file>] <input_log_file>

    Where type_num specifies the type of transaction data to generate the distribution from. type_num can be either 2 for capacity-aborted transactions or 8 for committed transactions. This command calculates the unique cache set accesses for each transaction of type type_num and then averages these cache access distributions.

    A sample output of this command for capacity-aborted transaction log data is given below.

    Average cache set population for capacity aborts
    
    Cache set #, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63
    Avg. population, 5.00, 1.00, 0.00, 0.00, 1.00, 0.00, 1.00, 2.00, 3.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00, 2.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 2.00, 2.00, 0.00, 2.00, 1.00, 0.00, 0.00, 2.00, 1.00, 0.00, 1.00, 2.00, 0.00, 1.00, 3.00, 0.00, 2.00, 2.00, 2.00, 2.00, 3.00, 2.00, 3.00, 2.00, 4.00, 1.00, 1.00, 0.00, 1.00, 1.00, 3.00

    Fig. 1: Average cache set population. A graphical representation of the sample output, showing how many cache lines are used in each of the 64 cache sets of a data cache.

    To interpret the graph in Fig. 1 we need to understand Intel® SDE's "-tsx_cache_set_size" parameter, whose default value is 8; a value of 4 was used to generate the sample output above.

    Intel's L1 data cache is 8-way associative, which means every set of the data cache has 8 ways. The two logical cores share the L1 cache dynamically when Intel® Hyper-Threading Technology is enabled. To approximate this behavior we can statically limit the number of ways per thread to 4 using Intel SDE (this assumes that the usage of ways is equal).

    If "-tsx_cache_set_size" is set to its default value of 8, hyper-threading is not "emulated": all the ways in a cache set are used by the (only) hardware thread in a core. To emulate Hyper-Threading, at least at the data cache level, we set "-tsx_cache_set_size" to 4, so that only half of the ways in each data cache set can be used. With this Hyper-Threading emulation, a maximum of 4 ways per cache set may be uniquely modified during a TSX transaction; otherwise a capacity overflow occurs and aborts the transaction. From Fig. 1 we see that cache set #0 overflows (5 unique way accesses instead of the maximum of 4), and this could lead to a capacity abort.
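
    As a quick illustration of how this check can be automated, a minimal sketch follows. The script name, the input layout (only the two CSV rows shown above, saved to a file of their own), and the way limit are assumptions; it is not part of TxSDELogAnalyzer.py.

    # flag_overflow_sets.py -- hypothetical helper, not part of TxSDELogAnalyzer.py
    import csv
    import sys

    WAYS_PER_THREAD = 4  # matches the -tsx_cache_set_size 4 setting used for the sample output

    def overflowing_sets(csv_path):
        # Expects row 0 = "Cache set #, 0, 1, ..." and row 1 = "Avg. population, 5.00, ..."
        with open(csv_path) as f:
            rows = list(csv.reader(f))
        sets = [int(x) for x in rows[0][1:]]
        pops = [float(x) for x in rows[1][1:]]
        # Any set whose average population exceeds the per-thread way limit is a capacity-overflow suspect
        return [(s, p) for s, p in zip(sets, pops) if p > WAYS_PER_THREAD]

    if __name__ == "__main__":
        for set_idx, pop in overflowing_sets(sys.argv[1]):
            print("cache set %d: avg. population %.2f exceeds %d ways" % (set_idx, pop, WAYS_PER_THREAD))

    Run against the sample data above, it reports cache set #0 (average population 5.00 with a 4-way limit), matching the overflow discussed for Fig. 1.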

    Transactions data filtering

    Intel® SDE already has knobs to start/stop logging, thus limiting the log output size. This is normally done by adding the following command options to the SDE command you use to run your test:

    -control start:interactive:bcast,stop:interactive:bcast -interactive_file tmp

    However, you may sometimes need log data with longer execution coverage, which produces a huge log file. Limiting the log output size may therefore not be enough, and you may need filtering options.

    Moreover, your log data may contain data for various abort reasons as well as for commits. Even log information about thread operations outside transaction execution may be included in the SDE log, so it may be hard to analyze a huge log file with such a mixture of information, especially when you are interested in logs of only a certain type -- e.g., capacity-aborted transactions. TxSDELogAnalyzer.py filters the log data and presents only the data for a specific abort reason, or for commits. The command format to do exactly this is shown below.

    >$ TxSDELogAnalyzer.py -t <type_num> [-o <output_file>] <input_log_file>

    type_num can be 1 (all log data of both committed and aborted transactions), 2 (capacity-aborted transactions only), or 8 (committed transactions only).

    # The command below outputs only capacity-abort related transaction logs
    >$ TxSDELogAnalyzer.py -t 2 sde-tsx-out.txt
    
    # The following command line writes only commit-related transaction logs to a file named log_data_for_commits_data.txt
    >$ TxSDELogAnalyzer.py -t 8 -o log_data_for_commits_data.txt sde-tsx-out.txt

    Sampling transactions

    Since you cannot analyze millions of transactions to identify code path patterns in a limited time, you may want to extract the log data of a few typical transactions. TxSDELogAnalyzer.py does exactly that: it randomly selects num transactions. It supplements these data with cache set accesses, giving a full overview of the cache set population at each operation within a given transaction. This feature works in conjunction with the "Transactions data filtering" described above.

    An example of commands for this feature is as follows.

    >$ TxSDELogAnalyzer.py -t 8 -r <num> <input_log_file> # for commits, num random commits
    >$ TxSDELogAnalyzer.py -t 2 -r <num> <input_log_file> # for capacity aborts, num random capacity aborts

    A sample output from these commands is shown below:

    Transaction #0
                      OPERATION                      ; CACHE LINE ADDRESS; CACHE SET #; NEW CACHE LINE; CACHE SET POPULATION;    FUNCTION              ;  MANGLED FUNCTION         ; LIBRARY                 ; SOURCE
    [ 1 1] @0x400b66 write access to 0x7f7bc2a39e7c:4;     0x7f7bc2a39e40;          57;           TRUE;                    1; tm_begin()               ; _Z8tm_beginv              ; /ABank/aBank:0x000000bee; /ABank/hle_lock.h 45
    [ 1 1] @0x400b69 read access to 0x7f7bc2a39e7c:4 ;     0x7f7bc2a39e40;          57;          FALSE;                    1; tm_begin()               ; _Z8tm_beginv              ; /ABank/aBank:0x000000bee; /ABank/hle_lock.h 45
    [ 1 1] @0x400c29 read access to 0x7f7bc2a39eac:4 ;     0x7f7bc2a39e80;          58;          FALSE;                    1; paySalaries(void*)       ; _Z11paySalariesPv         ; /ABank/aBank:0x000000c29; /ABank/SalaryBatch.cpp 78
    [ 1 1] @0x400a36 write access to 0x7f7bc2a39e80:8;     0x7f7bc2a39e80;          58;          FALSE;                    1; myBank::paySalary(int)   ; _ZN6myBank9paySalaryEi    ; /ABank/aBank:0x000000a36; /ABank/SalaryBatch.cpp 62
    [ 1 1] @0x4008ca read access to 0x7f7bc2a39e1c:4 ;     0x7f7bc2a39e00;          56;          FALSE;                    1; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
    . . .
    [ 1 1] @0x4009b0 write access to 0x605dac:4      ;           0x605d80;          54;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
    [ 1 1] @0x4009b7 read access to 0x605db0:4       ;           0x605d80;          54;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
    [ 1 1] @0x400a4b read access to 0x7f7bc2a39e7c:4 ;     0x7f7bc2a39e40;          57;          FALSE;                    4; myBank::paySalary(int)   ; _ZN6myBank9paySalaryEi    ; /ABank/aBank:0x000000a53; /ABank/SalaryBatch.cpp 65
    [ 1 1] @0x400837 write access to 0x7f7bc2a39e04:4;     0x7f7bc2a39e00;          56;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
    [ 1 1] @0x400a57 read access to 0x7f7bc2a39e7c:4 ;     0x7f7bc2a39e40;          57;          FALSE;                    4; myBank::paySalary(int)   ; _ZN6myBank9paySalaryEi    ; /ABank/aBank:0x000000a53; /ABank/SalaryBatch.cpp 65
    [ 1 1] @0x400a4e write access to 0x7f7bc2a39e60:8;     0x7f7bc2a39e40;          57;          FALSE;                    4; myBank::paySalary(int)   ; _ZN6myBank9paySalaryEi    ; /ABank/aBank:0x000000a53; /ABank/SalaryBatch.cpp 65
    [ 1 1] @0x40098f write access to 0x7f7bc2a39e48:8;     0x7f7bc2a39e40;          57;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
    [ 1 1] @0x400993 read access to 0x7f7bc2a39e28:8 ;     0x7f7bc2a39e00;          56;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
    [ 1 1] @0x400997 read access to 0x605dd0:4       ;           0x605dc0;          55;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
    [ 1 1] @0x400a35 read access to 0x7f7bc2a39e60:8 ;     0x7f7bc2a39e40;          57;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
    [ 1 1] @0x400a53 write access to 0x7f7bc2a39e7c:4;     0x7f7bc2a39e40;          57;          FALSE;                    4; myBank::paySalary(int)   ; _ZN6myBank9paySalaryEi    ; /ABank/aBank:0x000000a53; /ABank/SalaryBatch.cpp 65
    [ 1 1] @0x400a57 read access to 0x7f7bc2a39e7c:4 ;     0x7f7bc2a39e40;          57;          FALSE;                    4; myBank::paySalary(int)   ; _ZN6myBank9paySalaryEi    ; /ABank/aBank:0x000000a53; /ABank/SalaryBatch.cpp 65
    [ 1 1] @0x400a4b read access to 0x7f7bc2a39e7c:4 ;     0x7f7bc2a39e40;          57;          FALSE;                    4; myBank::paySalary(int)   ; _ZN6myBank9paySalaryEi    ; /ABank/aBank:0x000000a53; /ABank/SalaryBatch.cpp 65
    [ 1 1] @0x400980 read access to 0x7f7bc2a39e24:4 ;     0x7f7bc2a39e00;          56;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
    [ 1 1] @0x40098f write access to 0x7f7bc2a39e48:8;     0x7f7bc2a39e40;          57;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
    [ 1 1] @0x400993 read access to 0x7f7bc2a39e28:8 ;     0x7f7bc2a39e00;          56;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
    [ 1 1] @0x400997 read access to 0x605dd8:4       ;           0x605dc0;          55;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
    [ 1 1] @0x40099c read access to 0x7f7bc2a39e28:8 ;     0x7f7bc2a39e00;          56;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
    [ 1 1] @0x400a22 read access to 0x7f7bc2a39e48:8 ;     0x7f7bc2a39e40;          57;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
    [ 1 1] @0x400a26 read access to 0x605dfc:4       ;           0x605dc0;          55;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
    [ 1 1] @0x400a2c read access to 0x7f7bc2a39e48:8 ;     0x7f7bc2a39e40;          57;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
    [ 1 1] @0x400a30 write access to 0x605dfc:4      ;           0x605dc0;          55;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
    [ 1 1] @0x400a33 read access to 0x7f7bc2a39e50:8 ;     0x7f7bc2a39e40;          57;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
    [ 1 1] @0x400a34 read access to 0x7f7bc2a39e58:8 ;     0x7f7bc2a39e40;          57;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
    [ 1 1] @0x400a35 read access to 0x7f7bc2a39e60:8 ;     0x7f7bc2a39e40;          57;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
    [ 1 1] @0x400a53 write access to 0x7f7bc2a39e7c:4;     0x7f7bc2a39e40;          57;          FALSE;                    4; myBank::paySalary(int)   ; _ZN6myBank9paySalaryEi    ; /ABank/aBank:0x000000a53; /ABank/SalaryBatch.cpp 65
    [ 1 1] @0x400a4e write access to 0x7f7bc2a39e60:8;     0x7f7bc2a39e40;          57;          FALSE;                    4; myBank::paySalary(int)   ; _ZN6myBank9paySalaryEi    ; /ABank/aBank:0x000000a53; /ABank/SalaryBatch.cpp 65
    [ 1 1] @0x40082c write access to 0x7f7bc2a39e58:8;     0x7f7bc2a39e40;          57;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
    [ 1 1] @0x400a06 read access to 0x605dfc:4       ;           0x605dc0;          55;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
    [ 1 1] @0x400a0c read access to 0x7f7bc2a39e40:8 ;     0x7f7bc2a39e40;          57;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
    [ 1 1] @0x400a10 write access to 0x605dfc:4      ;           0x605dc0;          55;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
    [ 1 1] @0x400a13 read access to 0x7f7bc2a39e48:8 ;     0x7f7bc2a39e40;          57;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
    [ 1 1] @0x400a17 read access to 0x605e00:4       ;           0x605e00;          56;           TRUE;                    5; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
    [ 1 1] @0x400a17 read access to 0x605e00:4       ;           0x605e00;          56;           TRUE;                    5; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
    [ 1 1] self abort transaction(1) abort reason 9
                                                     ;  TOTAL FOOTPRINT: 250 cache lines;
                                                     ;  TOTAL WRITE SET: 249 cache lines;
    
    
     

     

     


    Attachment: TxSDELogAnalyzer.zip (5.05 KB)
    Co-author: Roman Dementiev (Intel)

    Calculating “FLOP” using Intel® Software Development Emulator (Intel® SDE)


    Purpose

    The floating point operations (FLOP) rate is widely used by the High Performance Computing (HPC) community as a metric for analysis and/or benchmarking purposes. Many HPC nominations (e.g., Gordon Bell) require that the FLOP rate be specified for their application submissions.

    The methodology described here DOES NOT rely on the Performance Monitoring Unit (PMU) events/counters. This is an alternative software methodology to evaluate FLOP using the Intel® SDE.

    Methodology

    • We split the FLOP (floating point operations) count into two categories:
      • Unmasked FLOP: for Intel® architectures that do not support the masking feature
      • Unmasked + Masked FLOP: for Intel® architectures that do support the masking feature
        Examples of Intel® architectures that do not support the masking feature:
        Processor Name                        | Code Name
        2nd gen Intel® Core™ processor family | Sandy Bridge
        3rd gen Intel® Core™ processor family | Ivy Bridge
        4th gen Intel® Core™ processor family | Haswell
        Examples of Intel® architectures that do support the masking feature:
        Processor Name                        | Code Name
        Intel® Xeon Phi™ coprocessor          | Knights Landing
    • There is some debate on what is considered to be a floating point instruction/operation.

    • Provided below is the list of general floating point instructions used in this method: ADD, SUB, MUL, DIV, SQRT, RCP, FMA, FMS, DPP, MAX, MIN (each has many flavors)

    • The high level idea is:
      • Decode every floating point instruction to identify the following:
        • Vector (packed) vs. Scalar
        • Data Type (Single Precision vs. Double Precision)
        • Register Type Used (xmm – 128 bits, ymm – 256 bits, zmm – 512 bits)
        • Masking – masked vs. unmasked instruction
      • Use the above information with its “dynamic execution” count to evaluate the FLOP count for that instruction.
        Example: vfmadd231pd zmm0, zmm30, zmm1 executed 500 times
        • p – packed instruction (vector), without any mask
        • d – double precision data type (64 bit)
        • zmm – operating on 512 bit registers
        • fma – fused multiply and add (2 floating point operations)

      • The FLOP count for the above instruction = 8 * 2 (fma) * 500 (execution count) = 8000 FLOP (a minimal sketch of this arithmetic appears right after this list).
    • You do not need to parse/decode all of the above for every floating point instruction to evaluate the FLOP count for your application.

    • Intel SDE’s instruction mix histogram and dynamic mask profile provide a set of pre-evaluated counters (using the methodology described above + more) that can be used to evaluate the FLOP count on your application.

    The next section describes this in detail.
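
    As a quick illustration of the arithmetic described above, here is a minimal sketch. The function and dictionary are made up for illustration; this is not an Intel SDE API.

    # Hypothetical illustration of the per-instruction FLOP arithmetic; not an Intel SDE API.
    ELEMENTS = {("xmm", "single"): 4, ("xmm", "double"): 2,
                ("ymm", "single"): 8, ("ymm", "double"): 4,
                ("zmm", "single"): 16, ("zmm", "double"): 8}

    def flop_count(register, precision, is_fma, exec_count):
        # FLOP = elements per register * (2 for FMA/FMS, else 1) * dynamic execution count
        ops_per_element = 2 if is_fma else 1
        return ELEMENTS[(register, precision)] * ops_per_element * exec_count

    # vfmadd231pd zmm0, zmm30, zmm1 executed 500 times:
    # 8 double precision elements * 2 (fma) * 500 = 8000 FLOP
    print(flop_count("zmm", "double", is_fma=True, exec_count=500))  # prints 8000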
     

    Instructions to Count Unmasked FLOP

    • This is applicable for all Intel architectures (Sandy Bridge, Ivy Bridge, Haswell, Knights Landing, etc.)
    • Obtain the latest version of Intel SDE here.
    • Generate the instruction mix histogram for your application using Intel SDE as follows:
      • sde -<arch> -iform 1 -omix myapp_mix.out -top_blocks 5000 -- ./myapp.exe
        • <arch> is the architecture that you want to run on (e.g., ivb, hsw, knl)
        • Compile the binary correctly for the architecture you are running on.
        • Supports multi-threaded runs
          Example:
          sde -knl -iform 1 -omix myapp_knl_mix.out -top_blocks 5000 -- ./myapp.knl.exe
    • In the instruction mix output (e.g., myapp_mix.out), under the “EMIT_GLOBAL_DYNAMIC_STATS” section, check for the following pre-evaluated counters:
      1. *elements_fp_(single/double)_(1/2/4/8/16)
      2. *elements_fp_(single/double)_(8/16)_masked
    • The different counters mean the following:
      • elements_fp_single_1 – floating point instructions with single precision, one element (probably scalar), and no mask
      • elements_fp_double_4 – floating point instructions with double precision, four elements, and no mask (ymm)
      • elements_fp_double_8 – floating point instructions with double precision, eight elements, and no mask (zmm)
      • ...
      • elements_fp_single_16_masked – similar to the above, but with masks
        (Note: you will see the mask counts only on architectures + ISA that support masking)
    • The above by itself is not sufficient since the Fused Multiply and Add instruction (FMA) is counted as 1 FLOP by the above counters.

    • The “EMIT_GLOBAL_DYNAMIC_STATS” section also prints dynamic counts of every type/flavor of FMA executed in your application. Look for the following:
      • VFMADD213SD_XMMdq_XMMdq_XMMdq – scalar, double precision, on xmm (128 bit) = 1 element
      • VFMADD231PD_YMMdq_YMMdq_YMMdq – packed, double precision, on ymm (256 bit) = 4 elements
      • VFMADD132PS_ZMMdq_ZMMdq_ZMMdq – packed, single precision, on zmm (512 bit) = 16 elements
      • ...
      Other flavors of FMA like VFNMSUB132PD_YMMqq_YMMqq_MEMqq, VFNMADD231SD_XMMdq_XMMq_XMM, etc. may also be present.

    • Counting FLOPs

      Step 1
      • For each data type (single/double), use the “dynamic” instruction count corresponding to each of the above counters and multiply it by the number of elements (1/2/4/8/16) to get the FLOP count.

        Example:

        Intel SDE (Haswell) Instruction Mix output (snapshot) of a Molecular Dynamics code from Sandia Mantevo Suite (look for the below section under EMIT_GLOBAL_DYNAMIC_STATS).



        Unmasked FLOP (Double Precision) =
        (23513724690 * 1 + 274320019 * 2 + 37317021308 * 4) = ~173.3304 GFLOP

      • Note/Caveats:
        • The above by itself is not sufficient since the Fused Multiply and Add instruction (FMA) is counted as 1 FLOP (see “Step 2” on how to take that into account).
        • For Intel® AVX-512 (KNL) instruction mix output you may see “*elements_fp*_masked” counters as well. Counting masked FLOP is covered in the next section.
        • Also, the “masked” counters above do not specify the actual mask values, so they cannot be taken into account here anyway.
    •   Step 2

      • Taking into account FMA and its flavors
      • For each FMA flavor, determine the data type (single vs. double), packed vs. scalar, and register type as described above; then, using the “dynamic” instruction count corresponding to each FMA, compute the corresponding FLOP and add it “just one more time” to the FLOP count computed in Step 1.

        Example:

        Intel SDE (Haswell) Instruction Mix output (snapshot) of a Molecular Dynamics code from Sandia Mantevo Suite (look for the VFM* section under EMIT_GLOBAL_DYNAMIC_STATS).
         
        VFMADD213PD_XMMdq_XMMdq_XMMdq                               1728000
        VFMADD213PD_YMMqq_YMMqq_YMMqq                              47496488
        VFMADD213SD_XMMdq_XMMq_MEMq                               825422220
        VFMADD213SD_XMMdq_XMMq_XMMq                              5733116808
        VFMADD231PD_XMMdq_XMMdq_XMMdq                                432000
        VFMADD231PD_YMMqq_YMMqq_YMMqq                            3189961568
        VFMADD231SD_XMMdq_XMMq_MEMq                                       4
        VFMADD231SD_XMMdq_XMMq_XMMq                               475482133
        VFMSUB213PD_YMMqq_YMMqq_YMMqq                            1594141168
        VFMSUB231PD_YMMqq_YMMqq_YMMqq                              47064488
        VFMSUB231SD_XMMdq_XMMq_XMMq                                 1656723

         

        Unmasked FMA FLOP (Double Precision) =
        (1728000 * 2 + 47496488 * 4 + 825422220 * 1 + 5733116808 * 1 + 432000 * 2 + 3189961568 * 4 + 4 * 1 + 475482133 * 1 + 1594141168 * 4 + 47064488 * 4 + 1656723 * 1) = ~26.5546 GFLOP

        Note/Caveats:

        • The multiplier used above (1,2,4..) is based on the type of FMA instructions (PD - packed double on XMM/YMM ; SD – scalar double on XMM …)
        • For Intel AVX-512 (KNL/SKL) instruction mix output all FMA (and its flavors) instructions (masked or full vectors) will be marked as masked (e.g., VFMADD132PD_ZMMf64_MASKmskw_ZMMf64_MEMf64_AVX512).
        • The next section “Instructions to Count Masked FLOP” will cover that.
        Step 3

      • Add the FLOP counts from Step 1 and Step 2 (a minimal parsing sketch for Steps 1-3 appears at the end of this section).

        Example (for the Advection routine):
        Total Unmasked FLOP (Double Precision) = 173.3304 + 26.5546 = 199.885 GFLOP

      • If you are running on an architecture that does not support masking, then you have your total FLOP count (you can skip the next section).

      • For floating point operations per second (FLOPS), divide the FLOP count computed using the above method by the application run time measured on the appropriate hardware.



    • On another note, the FLOP count of an application will most likely be the same irrespective of the architecture it is run on (unless the compiler generates completely different code that affects the FLOP count of the two binaries, which is very rare). Thus, to find the FLOP count for an application, compute it as described above on Ivy Bridge (or Haswell) with no hardware masking feature and use the same count for other architectures (like Knights Landing, etc.). This way you do not have to deal with masking at all while evaluating the FLOP count.

    • But if you still need to evaluate the FLOP count on architecture with masking support, refer to the next section, which describes how to count masked FLOP using the dynamic mask profile feature from Intel SDE.
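
    If you want to automate Steps 1-3 before the official scripts mentioned in the Validation section become available, here is a minimal sketch. It assumes that every counter in the mix output appears as a name followed by its dynamic count on a single line (as in the VFM* listing above), and it handles only the unmasked, double-precision, XMM/YMM case from the Haswell example; treat it as an illustration, not as a supported Intel SDE tool.

    # count_unmasked_flop.py -- hypothetical sketch, not part of Intel SDE
    import re
    import sys

    def unmasked_double_flop(mix_path):
        elements_flop = 0   # Step 1: *elements_fp_double_<N> counters (FMA counted once)
        fma_flop = 0        # Step 2: add each FMA's FLOP one more time
        elem_re = re.compile(r"\*?elements_fp_double_(\d+)\s+(\d+)")
        # PD/SD distinguishes packed vs. scalar double; XMM/YMM fixes the element count
        fma_re = re.compile(r"VF(?:N?M(?:ADD|SUB))\d+(PD|SD)_(XMM|YMM)\w*\s+(\d+)")
        fma_elements = {("PD", "XMM"): 2, ("PD", "YMM"): 4, ("SD", "XMM"): 1, ("SD", "YMM"): 1}
        with open(mix_path) as f:
            for line in f:
                m = elem_re.search(line)
                if m and "masked" not in line:
                    elements_flop += int(m.group(1)) * int(m.group(2))
                    continue
                m = fma_re.search(line)
                if m:
                    fma_flop += fma_elements[(m.group(1), m.group(2))] * int(m.group(3))
        return elements_flop + fma_flop   # Step 3

    if __name__ == "__main__":
        print("Unmasked double precision FLOP: %d" % unmasked_double_flop(sys.argv[1]))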

    Instructions to Count Masked FLOP

    • Intel SDE has a dynamic mask profile feature that evaluates and prints the number of operations for each executed instruction with a mask.
    • Generate the dynamic mask profile for your application using Intel SDE as follows:
       
      • sde -<arch> -iform 1 -odyn_mask_profile myapp_msk.out -top_blocks 5000 -- ./myapp.exe
         
        • <arch> is the architecture that you want to run on (e.g., ivb, hsw, knl).
        • Compile the binary correctly for the architecture you are running on.
        • Supports multi-threaded runs.
          Example:
          sde -knl -iform 1 -odyn_mask_profile myapp_knl_msk.out -top_blocks 5000 -- ./myapp.knl.exe
    • The dynamic mask profile is an XML output, with a summary table per thread of the different categories of instructions with and without masking and their total instruction and operation count.

    • In addition, the mask profile also prints the dynamic instruction count and operation count per instruction.

    • Summary Table (Dynamic Mask Profile)

      Example: Intel® SDE (Knights Landing) dynamic mask profile output (snapshot below):

      Column  | Header     | Description
      First   | mask       | Classifies masked vs. unmasked instructions
      Second  | cat        | Classifies the instruction category (e.g., memory instructions (data transfer), sparse (gather/scatter), and computational (mask))
      Third   | vec-length | Specifies the vector register width
      Fourth  | #elements  | Specifies the maximum number of elements possible in the vector register (with the vec-length in the third column) and with the element size (specified in the fifth column)
      Fifth   | element_s  | Specifies the size of an element (or data type) in bits (e.g., 64b = 64 bits = 8 bytes)
      Sixth   | element_t  | Classifies based on element type (e.g., fp – floating point vs. int – integer)
      Seventh | icount     | Total instruction count of each category/type
      Eighth  | comp_count | Corresponding computation count for the executed instructions of each category/type
      Ninth   | %max-comp  | Shows the % vector lane utilization for each category/type
      • For example, in the above snapshot only the highlighted rows have to be used for the “masked” FLOP count.
        • Please note that (in your run) you mainly need to look for “masked” instructions with the “mask” category and “element_t = fp” for the masked FLOP count.
      • The “comp_count” number is basically the masked FLOP count.
        • But again FMA is counted as only one FLOP in the comp_count counter.
        • See the next section on how to take into account masked FMA (to count them as 2 FLOP).
    • Per Instruction Details (Dynamic Mask Profile)
      • In addition to the “summary table” per thread, the dynamic mask profile also prints the computational count on a per instruction basis.
      • Below is a snapshot of it.
      • In this case, the masked “vfmadd213pd” instruction has an execution count = 862280 and the computation count = 5052521. Thus all the executions of this instruction are NOT using all the vector lanes in this case.
      • In the snapshot above, the “vfmadd213pd” instruction has an execution count = 4000 and the computation count = 32000 (4000 * 8). Thus all the executions of this instruction are using all the vector lanes in this case (no mask).
      • Since the summary table accounts for the FMA instructions (and its flavors) as only 1 FLOP, you have to add the computation count for all the masked FMA instructions from the instruction-details (as above) “one more time” to account for 2 FLOP per FMA.
    • Counting Masked FLOP

      Step 1

      • From the summary table add the “comp_count” value from all “masked” instructions with “mask” category and “element_t = fp”.
    •   Step 2

      • Parse all the FMA instructions with a mask from the per-instruction details and add their “computation-counts” one more time to the sum evaluated in Step 1 (a short illustration appears at the end of this section).


      • Thus you have the total Masked FLOP count.

        Note/Caveats:
        • As mentioned in the previous section, in Intel AVX-512 (KNL/SKL) instruction mix output all FMA (and its flavors) instructions (masked or full vectors) are marked as masked (e.g. VFMADD132PD_ZMMf64_MASKmskw_ZMMf64_MEMf64_AVX512).
        • Thus you can use the dynamic mask profile “instruction-details” to evaluate the “computation-count” for all FMA instructions (masked or unmasked – full vectors).
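
      As with the unmasked case, the bookkeeping is simple to script once the numbers have been pulled out of the report. The sketch below is purely illustrative: the lists are assumed to have been extracted from the summary table and the per-instruction details by hand (the XML layout itself is not parsed here), and the numbers reuse the masked vfmadd213pd example above.

      # Hypothetical illustration of the masked-FLOP bookkeeping; inputs extracted by hand.
      # Summary-table rows as (mask, category, element type, comp_count) tuples.
      summary_rows = [
          ("masked", "mask", "fp", 5052521),   # e.g. the masked vfmadd213pd row shown above
      ]
      # comp_count of every masked FMA instruction, taken from the per-instruction details.
      masked_fma_comp_counts = [5052521]

      # Step 1: sum comp_count over all "masked" rows of category "mask" with element_t = fp
      step1 = sum(c for mask, cat, elem, c in summary_rows
                  if mask == "masked" and cat == "mask" and elem == "fp")
      # Step 2: add the FMA computation counts one more time so that each FMA counts as 2 FLOP
      masked_flop = step1 + sum(masked_fma_comp_counts)
      print(masked_flop)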

    Validation

    The above methodology may look a bit overwhelming at first, but the reason for such detailed instructions is so that you can write your own simple scripts to parse the above information. We hope to provide the scripts (currently used internally) to evaluate FLOP count as part of the Intel SDE releases in the future.

    Below is a summary of the FLOP count validation on some applications.

    • The error margin is the ratio of the FLOP count evaluated using Intel® SDE to the reference count.
    • The reason for the difference can be due to reasons like theoretical evaluation vs. actual code generation, instructions counted as FLOP, etc. We have not looked into the details for this difference.
    • But you can see the error margin is very minimal.
    Workload  | Reference FLOP (from NERSC) | MPI ranks | FLOP Count (using Intel® SDE) | Error Margin
    MiniFE    | 5.05435E+12                 | 144       | 5.18039E+12                   | 1.03
    miniGhost | 6.55500E+12                 | 96        | 6.85624E+12                   | 1.05
    AMG       | 1.30418E+12                 | 96        | 1.43311E+12                   | 1.10
    UMT       | 1.30211E+13                 | 96        | 1.38806E+13                   | 1.07

     

    Footnotes:

    Masking: Even on Intel® AVX/AVX2 (Ivy Bridge/Haswell) the compiler supports "masking" internally with blends and so forth. Thus, in vectorized loops with conditionals there will be unused computations (e.g., the compiler computes both the true and false branches and then blends them, throwing away the unused parts). This means that the FLOP count will be an overestimate of the useful computation. Arguably the masked version (KNL/SKL) will be more accurate since the pop count of the mask is exact (assuming the compiler uses masks everywhere).

  • Calculating “FLOP” Using the Intel® Software Development Emulator (Intel® SDE)

    $
    0
    0

    Purpose

    Floating point operation (FLOP) rates are widely used in the high performance computing (HPC) community for analysis and/or benchmarking purposes. Many HPC submissions (for example, for the Gordon Bell Prize) require the FLOP rate of the application to be reported.

    The methodology described in this article does not rely on performance monitoring unit (PMU) events/counters. It is an alternative, software-based approach that uses Intel® SDE to evaluate FLOP.

    Methodology

    • We divide FLOP (floating point operations) into two categories:
      • Unmasked FLOP: for Intel® architectures that do not support masking
      • Unmasked + masked FLOP: for Intel® architectures that support masking

        Examples of Intel® architectures without masking support:
        Processor name                                      Codename
        2nd generation Intel® Core™ processor family        Sandy Bridge
        3rd generation Intel® Core™ processor family        Ivy Bridge
        4th generation Intel® Core™ processor family        Haswell

        Examples of Intel® architectures with masking support:
        Processor name                                      Codename
        Intel® Xeon Phi™ coprocessor                        Knights Landing
    • What exactly counts as a floating point instruction/operation is debatable.

    • The common floating point instructions considered by this methodology are: ADD, SUB, MUL, DIV, SQRT, RCP, FMA, FMS, DPP, MAX, MIN (each with multiple flavors).

    • We recommend:
      • Decoding each floating point instruction to determine the following attributes:
        • Vector (packed) or scalar
        • Data type (single or double precision)
        • Register type used (xmm – 128-bit, ymm – 256-bit, zmm – 512-bit)
        • Masked or unmasked instruction
      • Using the above information together with the “dynamic execution” count to evaluate the FLOP count for that instruction.
        For example: vfmadd231pd zmm0, zmm30, zmm1 executed 500 times
        • p – packed (vector) instruction, no mask
        • d – double precision data type (64-bit)
        • zmm – operates on 512-bit registers
        • fma – fused multiply-add (2 floating point operations)

        FLOP count for the above instruction = 8 (elements) * 2 (fma) * 500 (executions) = 8000 FLOP (see the sketch below).
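
        To make the rule above concrete, here is a tiny sketch in Python (a hypothetical helper, not part of Intel SDE) that applies it to a decoded instruction:

        # Hypothetical helper illustrating the rule above; not part of Intel SDE.
        def flop_count(reg_bits, element_bits, is_fma, executions, active_lanes=None):
            """FLOP = elements * (2 for FMA/FMS, else 1) * executions."""
            elements = active_lanes if active_lanes is not None else reg_bits // element_bits
            return elements * (2 if is_fma else 1) * executions

        # vfmadd231pd zmm0, zmm30, zmm1 executed 500 times:
        print(flop_count(reg_bits=512, element_bits=64, is_fma=True, executions=500))  # 8000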

    • You do not have to do this parsing/decoding of every floating point instruction yourself to evaluate an application's FLOP count.

    • Intel SDE's instruction mix histogram and dynamic mask profile provide a set of pre-evaluated counters (using the above methodology and more) that help you evaluate the application's FLOP count.

    This is described in detail below.
     

    Instructions for Counting Unmasked FLOP

    • This applies to all Intel architectures (Sandy Bridge, Ivy Bridge, Haswell, Knights Landing, etc.).
    • Get the latest version of Intel® SDE here.
    • Generate the instruction mix histogram for your application with Intel SDE, as shown below:
      • sde -<arch> -iform 1 -omix myapp_mix.out -top_blocks 5000 -- ./myapp.exe
        • <arch> is the architecture you want to run for (e.g., ivb, hsw, knl)
        • Compile the binary appropriately for the architecture you are running for.
        • Multi-threaded runs are supported.
          For example:
          sde -knl -iform 1 -omix myapp_knl_mix.out -top_blocks 5000 -- ./myapp.knl.exe
    • In the instruction mix output (e.g., myapp_mix.out), under “EMIT_GLOBAL_DYNAMIC_STATS”, look at the following pre-evaluated counters:
      1. *elements_fp_(single/double)_(1/2/4/8/16)
      2. *elements_fp_(single/double)_(8/16)_masked
    • The counters have the following meaning:
      elements_fp_single_1 – floating point instructions with single precision and one element (probably scalar) and no mask
      elements_fp_double_4 – floating point instructions with double precision and four elements and no mask (ymm)
      elements_fp_double_8 – floating point instructions with double precision and eight elements and no mask (zmm)
      ...
      elements_fp_single_16_masked – similar to the above but with masks
      (Note: you will see the mask counts only on architectures + ISA that support masking)
    • Because these counters count a fused multiply-add (FMA) instruction as only 1 FLOP, this alone is not sufficient.

    • The “EMIT_GLOBAL_DYNAMIC_STATS” section also prints the dynamic count of every FMA flavor executed in the application. For example:
                     VFMADD213SD_XMMdq_XMMdq_XMMdq
                                   scalar, double precision, on xmm (128 bit) = 1 element
                     VFMADD231PD_YMMdq_YMMdq_YMMdq
                                   packed, double precision, on ymm (256 bit) = 4 elements
                     VFMADD132PS_ZMMdq_ZMMdq_ZMMdq
                                   packed, single precision, on zmm (512 bit) = 16 elements

      ......
      Other similar FMA flavors, such as VFNMSUB132PD_YMMqq_YMMqq_MEMqq, VFNMADD231SD_XMMdq_XMMq_XMM, etc., may also appear.

    • Counting FLOP

      Step 1
      • For each data type (single/double precision), multiply the “dynamic” instruction count of each of the above counters by its number of elements (1/2/4/8/16) to get the FLOP count.

        Example:

        Intel SDE (Haswell) instruction mix output (snapshot) for a molecular dynamics code from the Sandia Mantevo suite (see under EMIT_GLOBAL_DYNAMIC_STATS).

        Unmasked FLOP (double precision) =
        (23513724690 * 1 + 274320019 * 2 + 37317021308 * 4) = ~173.3304 GFLOP

      • Notes/Caveats:
        • Because the above counters count an FMA instruction as only 1 FLOP, this calculation by itself is not sufficient (see Step 2 for how to account for FMA).
        • You will also see the “*elements_fp*_masked” counters in the Intel® AVX-512 (KNL) instruction mix output. Counting masked FLOP is covered in the next section.
        • Likewise, the “masked” counters cannot be included here, because they do not specify the actual mask values.
    • Step 2

      • Account for FMA and its flavors.
      • For each FMA flavor, based on its data type (single or double precision), packed or scalar form, and register type, and using its “dynamic” instruction count, add the corresponding FLOP “one more time” to the count from Step 1.

        Example:

        Intel SDE (Haswell) instruction mix output (snapshot) for the molecular dynamics code from the Sandia Mantevo suite (see the VFM* entries under EMIT_GLOBAL_DYNAMIC_STATS).
         
        VFMADD213PD_XMMdq_XMMdq_XMMdq                               1728000
        VFMADD213PD_YMMqq_YMMqq_YMMqq                              47496488
        VFMADD213SD_XMMdq_XMMq_MEMq                               825422220
        VFMADD213SD_XMMdq_XMMq_XMMq                              5733116808
        VFMADD231PD_XMMdq_XMMdq_XMMdq                                432000
        VFMADD231PD_YMMqq_YMMqq_YMMqq                            3189961568
        VFMADD231SD_XMMdq_XMMq_MEMq                                       4
        VFMADD231SD_XMMdq_XMMq_XMMq                               475482133
        VFMSUB213PD_YMMqq_YMMqq_YMMqq                            1594141168
        VFMSUB231PD_YMMqq_YMMqq_YMMqq                              47064488
        VFMSUB231SD_XMMdq_XMMq_XMMq                                 1656723

         

        Unmasked FMA FLOP (double precision) =
        (1728000 * 2 + 47496488 * 4 + 825422220 * 1 + 5733116808 * 1 + 432000 * 2 + 3189961568 * 4 + 4 * 1 + 475482133 * 1 + 1594141168 * 4 + 47064488 * 4 + 1656723 * 1) = ~26.5546 GFLOP

        Notes/Caveats:

        • The multipliers used above (1, 2, 4, ...) depend on the flavor of the FMA instruction (PD – packed double precision on XMM/YMM; SD – scalar double precision on XMM; ...).
        • In the Intel AVX-512 (KNL/SKL) instruction mix output, all FMA instructions (and flavors), masked or full-vector, are marked as masked (e.g., VFMADD132PD_ZMMf64_MASKmskw_ZMMf64_MEMf64_AVX512).
        • These are covered in the next section, “Instructions for Counting Masked FLOP”.
        Step 3

      • Add the FLOP counts from Step 1 and Step 2.

        Example (advection routine):
        Total unmasked FLOP (double precision) = 173.3304 + 26.5546 = 199.885 GFLOP

      • If you are running on an architecture that does not support masking, this is your total FLOP count (you can skip the next section).

      • To compute floating point operations per second (FLOPS), divide the FLOP count obtained above by the application runtime measured on the appropriate hardware. (A parsing sketch for the unmasked counters follows below.)
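
      As a starting point for such a script, below is a minimal sketch in Python for the unmasked count (Steps 1–3). The line formats it expects under EMIT_GLOBAL_DYNAMIC_STATS ("*elements_fp_double_4  <count>", "VFMADD231PD_YMMqq_...  <count>") are assumptions for illustration; verify them against the instruction mix file your SDE version produces and adjust the patterns accordingly.

      # Hypothetical sketch: unmasked FLOP from an Intel SDE instruction mix file (-omix output).
      # The line formats assumed here are illustrative only -- check the
      # EMIT_GLOBAL_DYNAMIC_STATS section of your SDE output and adjust the patterns.
      import re
      import sys

      ELEM_RE = re.compile(r"\*elements_fp_(?:single|double)_(\d+)\s+(\d+)\s*$")
      FMA_RE  = re.compile(r"^\s*(VF(?:N)?M(?:ADD|SUB)\d+[PS][SD]_\S+)\s+(\d+)\s*$")

      def fma_elements(iform):
          """Elements per FMA execution, derived from the iform name (e.g. ...PD_YMM... -> 4)."""
          m = re.match(r"VF\w*?([PS])([SD])_([XYZ])MM", iform)
          if not m or m.group(1) == "S":          # scalar FMA -> 1 element
              return 1
          width = {"X": 128, "Y": 256, "Z": 512}[m.group(3)]
          return width // (64 if m.group(2) == "D" else 32)

      def unmasked_flop(mix_path):
          step1 = 0   # Step 1: elements * dynamic count (FMA still counted as 1 FLOP)
          step2 = 0   # Step 2: FMA elements counted one more time (2 FLOP per FMA)
          with open(mix_path) as f:
              for line in f:
                  m = ELEM_RE.search(line)        # the masked counters do not match this pattern
                  if m:
                      step1 += int(m.group(1)) * int(m.group(2))
                      continue
                  m = FMA_RE.match(line)
                  if m and "MASK" not in m.group(1):   # AVX-512 masked FMA is handled via the mask profile
                      step2 += fma_elements(m.group(1)) * int(m.group(2))
          return step1 + step2

      if __name__ == "__main__":
          print(unmasked_flop(sys.argv[1] if len(sys.argv) > 1 else "myapp_mix.out"))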
    • In other words, an application's FLOP count will most likely stay the same, because it does not really depend on the architecture it runs on (unless the compiler generates substantially different code that changes the FLOP count of the two binaries, which is rare). So if you only need the application's FLOP count, compute it as above on Ivy Bridge (or Haswell), which have no hardware masking, and use the same number for other architectures (e.g., Knights Landing). In that case you do not have to care about masking at all when evaluating FLOP.

    • However, if you still want to evaluate the FLOP count on an architecture that supports masking, see the next section on how to count masked FLOP using Intel SDE's dynamic mask profile feature.

    Instructions for Counting Masked FLOP

    • Intel SDE has a dynamic mask profile feature that evaluates and prints the number of computations performed by instructions that operate under a mask.
    • Generate the dynamic mask profile for your application with Intel SDE as shown below:

      • sde -<arch> -iform 1 -odyn_mask_profile myapp_msk.out -top_blocks 5000 -- ./myapp.exe

        • <arch> is the architecture you want to run for (e.g., ivb, hsw, knl).
        • Compile the binary appropriately for the architecture you are running for.
        • Multi-threaded runs are supported.
          For example:
          sde -knl -iform 1 -odyn_mask_profile myapp_knl_msk.out -top_blocks 5000 -- ./myapp.knl.exe
    • The dynamic mask profile is XML output containing a per-thread summary table of the different instruction categories (masked or unmasked) with their total instruction and computation counts.

    • In addition, the mask profile prints the dynamic instruction and computation counts on a per-instruction basis.

    • Summary Table (Dynamic Mask Profile)

      Example: Intel® SDE (Knights Landing) dynamic mask profile output (snapshot below):

      Header        Description
      mask          Classifies masked vs. unmasked instructions
      cat           Distinguishes the instruction type (e.g., memory (data transfer), sparse (gather/scatter), or computational (mask) instructions)
      vec-length    The vector register width
      #elements     The maximum number of elements in the vector register (per the vec-length in column 3) for the given element size (column 5)
      element_s     The element (data type) size in bits (e.g., 64b = 64 bits = 8 bytes)
      element_t     Categorizes the element type (e.g., fp – floating point, int – integer)
      icount        Total instruction count for the category
      comp_count    The corresponding computation count of the executed instructions in the category
      %max-comp     The vector lane utilization for the category
      • For example, in the snapshot above only the two highlighted rows are needed to count the “masked” FLOP.
        • Note that to count masked FLOP you mainly look at the “masked” instructions with category “mask” and “element_t = fp” (from your run).
      • The number shown in “comp_count” is essentially the masked FLOP count.
        • But again, an FMA is counted as only 1 FLOP in the comp_count counter.
        • See the next section for how to take masked FMA into account (counting them as 2 FLOP each).
    • Per-Instruction Details (Dynamic Mask Profile)
      • In addition to the per-thread “summary table”, the dynamic mask profile also prints the computation count on a per-instruction basis.
      • A snapshot is shown below.
      • In this case, the masked “vfmadd213pd” instruction has an execution count of 862280 and a computation count of 5052521, so not all executions of this instruction use all the vector lanes.
      • In the snapshot above, the “vfmadd213pd” instruction has an execution count of 4000 and a computation count of 32000 (4000 * 8), so all executions of this instruction use all the vector lanes (no mask).
      • Since the summary table counts FMA instructions (and flavors) as only 1 FLOP, you have to add the computation count of all masked FMA instructions from the per-instruction details (as above) one more time to account for 2 FLOP per FMA.
    • Counting Masked FLOP

      Step 1

      • From the summary table, add up the “comp_count” values of all “masked” instructions with category “mask” and “element_t = fp”.

      Step 2

      • Parse all masked FMA instructions from the per-instruction details and add their computation counts to the sum from Step 1 one more time.

      • This gives you the total masked FLOP count.

        Notes/Caveats:
        • As mentioned in the previous section, in the Intel AVX-512 (KNL/SKL) instruction mix output all FMA instructions (and flavors), masked or full-vector, are marked as masked (e.g., VFMADD132PD_ZMMf64_MASKmskw_ZMMf64_MEMf64_AVX512).
        • Thus you can use the dynamic mask profile per-instruction details to evaluate the computation count of all FMA instructions (masked or unmasked – full vector).

    Validation

    The above methodology may look a bit overwhelming at first, but the instructions are this detailed so that you can write your own simple scripts to parse the information above. We hope to provide the scripts we currently use internally to evaluate FLOP counts as part of future Intel SDE releases.

    Below is a summary of the FLOP count validation on some applications.

    • The error margin is basically the discrepancy between the reference count and the FLOP count evaluated using Intel SDE.
    • The difference can stem from factors such as theoretical estimates vs. actual code generation, which instructions are counted as FLOP, etc. We have not investigated the difference in detail.
    • Still, you can see that the error margin is minimal.
    Workload     Reference FLOP    MPI ranks    FLOP Count            Error Margin
                 (from NERSC)                   (using Intel® SDE)
    MiniFE       5.05435E+12       144          5.18039E+12           1.03
    miniGhost    6.55500E+12       96           6.85624E+12           1.05
    AMG          1.30418E+12       96           1.43311E+12           1.10
    UMT          1.30211E+13       96           1.38806E+13           1.07

     

    Footnotes:

    Masking: Even on Intel® AVX/AVX2 (Ivy Bridge/Haswell) the compiler supports "masking" internally with blends and so forth. Thus vectorized loops with conditionals will contain unused computations (e.g., the compiler computes both the true and false branches, blends them, and throws away the unused parts). This means the FLOP count will overestimate the useful computation. Arguably the masked version (KNL/SKL) is more accurate, since the pop count of the mask is exact (assuming the compiler uses masks everywhere).


