Channel: Intel® Software Development Emulator

Emulation of new instructions

Hello and welcome to my blog. This is my first blog posting.

My name is Mark Charney and I work at Intel in Hudson, Massachusetts. Intel has just made available some software that I've been working on for emulation of new instructions: Intel® Software Development Emulator, or Intel® SDE for short. Intel SDE emulates instructions in the SSE4, the AES and PCLMULQDQ instruction set extensions, and also the Intel® AVX instructions. Intel SDE runs on Windows* and Linux* and supports IA-32 architecture and Intel® 64 architecture programs.

Intel SDE is a functional emulator, meaning that it is useful for testing software that uses these new instructions to make sure it computes the right answers. Testing software that uses instructions that do not exist in hardware yet requires an emulator. Intel SDE is not meant for predicting performance.

Intel SDE is actually a "Pintool" built upon the Pin dynamic binary instrumentation system. The Pin that comes with Intel SDE uses a special version of XED, the software encoder/decoder that I also develop. While Intel SDE is primarily useful for learning about the new instructions, it also has some features for doing simple workload analysis. The "mix" tool computes static and dynamic histograms. It can compute histograms by the type of the instruction (ADD, SUB, MUL, etc.), by "iforms", which are XED classifications of instructions that include the operands, or by instruction length.
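To make the static/dynamic distinction concrete, here is a minimal Python sketch. It is not how the mix tool is implemented; the `Instr` record and the sample program are made up for illustration. It only assumes the usual meaning of the terms: a static histogram counts each instruction in the program once, while a dynamic histogram weights it by how often it executed.

```python
from collections import Counter
from typing import NamedTuple


class Instr(NamedTuple):
    """One instruction in the program text (hypothetical record, not SDE's data model)."""
    mnemonic: str    # instruction type, e.g. "ADD"
    length: int      # encoded length in bytes
    exec_count: int  # how many times this instruction executed


def static_histogram(instrs, key):
    """Count each instruction once, no matter how often it ran."""
    return Counter(key(i) for i in instrs)


def dynamic_histogram(instrs, key):
    """Weight each instruction by its execution count."""
    hist = Counter()
    for i in instrs:
        hist[key(i)] += i.exec_count
    return hist


# Made-up program: two ADDs (one hot, one cold), a hot MUL, a SUB that never runs.
program = [
    Instr("ADD", 3, 1000),
    Instr("ADD", 2, 1),
    Instr("MUL", 4, 1000),
    Instr("SUB", 3, 0),
]

by_type_static = static_histogram(program, lambda i: i.mnemonic)
by_type_dynamic = dynamic_histogram(program, lambda i: i.mnemonic)
by_length_dynamic = dynamic_histogram(program, lambda i: i.length)
```

The same two functions serve all the groupings mentioned above; only the `key` changes (instruction type, length, or, with a richer record, something like an iform).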

Intel SDE is fairly speedy. I actually haven't measured it, because it was so much faster than the other emulator we had been using (over 100x faster) that I'm not getting any complaints internally. We routinely run SPEC2006 under Intel SDE using the reference inputs. Most of the inputs run in several hours, while a few of the longer-running inputs take about a day. Emulation performance is tricky to measure, as each instruction requires a different amount of work and each application is different. I could probably measure the slowdown relative to a version of SPEC2006 that uses only native instructions. Intel SDE is faster than our previous "trap-and-emulate" emulator because it does not rely on the illegal-instruction exception, which costs thousands of cycles to dispatch to and return from the emulation routines. Because Intel SDE is built upon Pin, we can JIT-translate the original program and branch directly to the emulation routines, saving that exception overhead.

Right now, there are several ways that I know about to write programs using the new instructions. If you want to use the SSE4, AES and PCLMULQDQ instructions, then you can use the Intel® Compiler. The Intel Compiler supporting Intel AVX is expected to be released in the first quarter of 2009. GCC 4.3 supports SSE4. There is also a version of GCC that supports AES and PCLMULQDQ available in the svn (subversion) repository svn://gcc.gnu.org/svn/gcc/branches/ix86/gcc-4_3-branch . GCC for Intel AVX is under development as well: svn://gcc.gnu.org/svn/gcc/branches/ix86/avx. GNU binutils, which includes the "gas" assembler, is available for AES, PCLMULQDQ and Intel AVX. Also available are the YASM and NASM assemblers.

If anyone has questions about this or suggestions for something they'd like me to write about, please post a comment. I'd like to hear about what is important to you. There are so many aspects of this that I'd like to describe in future postings:

    • How it works

    • Isolation issues

    • Debugging

    • Advanced use options

    • Program checkers



Also if you have software questions you can post them to the Intel® AVX and CPU forum at:
/en-us/forums/intel-avx-and-cpu-instructions/

  • AES
  • emulate
  • emulation
  • new instructions
  • Pin
  • SSE4
  • xed

  • WhatIf Software
  • Intel® Software Development Emulator
  • Intel® Advanced Vector Extensions

  • Licenses for runtime libraries for SDE on Linux

    Read more about GNU General Public License, version 2

    Back to the Intel® Software Development Emulator

    For libstdc++:
    
    // Copyright (C) 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006
    // Free Software Foundation, Inc.
    //
    // This file is part of the GNU ISO C++ Library.  This library is free
    // software; you can redistribute it and/or modify it under the
    // terms of the GNU General Public License as published by the
    // Free Software Foundation; either version 2, or (at your option)
    // any later version.
    
    // This library is distributed in the hope that it will be useful,
    // but WITHOUT ANY WARRANTY; without even the implied warranty of
    // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    // GNU General Public License for more details.
    
    // You should have received a copy of the GNU General Public License along
    // with this library; see the file COPYING.  If not, write to the Free
    // Software Foundation, 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301,
    // USA.
    
    // As a special exception, you may use this file as part of a free software
    // library without restriction.  Specifically, if other files instantiate
    // templates or use macros or inline functions from this file, or you compile
    // this file and link it with other files to produce an executable, this
    // file does not by itself cause the resulting executable to be covered by
    // the GNU General Public License.  This exception does not however
    // invalidate any other reasons why the executable file might be covered by
    // the GNU General Public License.
    
    And for libgcc_s:
    
    /* Copyright (C) 1989, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999,
       2000, 2001, 2002, 2003, 2004, 2005  Free Software Foundation, Inc.
    
    This file is part of GCC.
    
    GCC is free software; you can redistribute it and/or modify it under
    the terms of the GNU General Public License as published by the Free
    Software Foundation; either version 2, or (at your option) any later
    version.
    
    In addition to the permissions in the GNU General Public License, the
    Free Software Foundation gives you unlimited permission to link the
    compiled version of this file into combinations with other programs,
    and to distribute those combinations without any restriction coming
    from the use of this file.  (The General Public License restrictions
    do apply in other respects; for example, they cover modification of
    the file, and distribution when not linked into a combine
    executable.)
    
    GCC is distributed in the hope that it will be useful, but WITHOUT ANY
    WARRANTY; without even the implied warranty of MERCHANTABILITY or
    FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
    for more details.
    
    You should have received a copy of the GNU General Public License
    along with GCC; see the file COPYING.  If not, write to the Free
    Software Foundation, 51 Franklin Street, Fifth Floor, Boston, MA
    02110-1301, USA.  */
  • Linux*
  • WhatIf Software
  • Intel® Software Development Emulator
  • Parallel Computing
  • Recent Intel® AVX Architectural Changes

    Dear Intel® AVX developers,

    We recently made some significant changes to the Intel® Advanced Vector Extensions Programmer’s Reference Manual (please download the latest version at /sites/avx/). If you are writing tools or software based on AVX, this may impact you. The big changes are a very different FMA syntax and the removal of two instructions (4-operand permutes).

    Since the initial Intel® AVX spec was released in April 2008, I wanted to recap the changes we’ve made in the previous two Programmer’s Reference Manual Updates.

    Added: VEX forms of AES instructions
    The AES instructions we plan to release on our upcoming 32-nm cores (codename “Westmere”) will be extended by the VEX prefix in the following cores (codename “Sandy Bridge”). The VEX brings a distinct destination register to 4 of the 5 AES instructions (VAESDEC, VAESDECLAST, VAESENC, and VAESENCLAST… AESIMC already had a distinct destination). Only the 128-bit forms of the instructions are supported – the instructions are not promoted to 256-bit. As with the AES instruction set, VAES instructions may not be enabled by all products in all geographies. For the VAES instructions we have therefore created a unique way of detecting their presence in hardware, by requiring you to check that two CPUID flags are set (CPUID.AES AND CPUID.AVX).
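    The two-flag check can be sketched as follows. This is an illustration in Python, not code from the reference manual: it assumes you already have the ECX value returned by CPUID with EAX = 1 (Python cannot execute CPUID itself, so obtaining that value is left to a native helper). The bit positions used are the architectural ones: AES is ECX bit 25 and AVX is ECX bit 28. A complete AVX check also has to confirm that the operating system saves the YMM state (OSXSAVE plus XGETBV), which this sketch leaves out.

```python
# Feature bits in ECX after CPUID with EAX = 1.
CPUID_1_ECX_AES = 1 << 25
CPUID_1_ECX_AVX = 1 << 28


def has_all(ecx: int, *flags: int) -> bool:
    """True only if every requested feature bit is set in ecx."""
    return all(ecx & flag for flag in flags)


def vaes_reported(ecx: int) -> bool:
    """VEX-encoded AES needs both flags: CPUID.AES AND CPUID.AVX."""
    return has_all(ecx, CPUID_1_ECX_AES, CPUID_1_ECX_AVX)


# Example ECX values (made up for illustration).
both_set = CPUID_1_ECX_AES | CPUID_1_ECX_AVX
aes_only = CPUID_1_ECX_AES
```

    With these values, `vaes_reported(both_set)` is true and `vaes_reported(aes_only)` is false: legacy AES alone is not enough to use the VEX forms.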

    Added: 256-bit forms of streaming stores
    Astute readers noticed that the streaming store instructions MOVNTDQ, MOVNTPS, and MOVNTPD had originally been supported only in 128-bit forms. We now have them in 256-bit forms in our Sandy Bridge cores. It’s not clear that they will be any faster than the 128-bit forms on Sandy Bridge, but we encourage their use for future performance. Note that Streaming Load (VMOVNTDQA) is still (only) 128-bit – yes, that’s intentional.

    Removed: VPERMIL2PS and VPERMIL2PD
    All PERMIL2 instructions are gone – both the 128-bit and 256-bit flavors. Like the FMA below, they used the VEX.W bit to select which source was from memory – we’re not moving in the direction of using VEX.W for that purpose any more.

    Changed: All FMA instructions
    We previously defined 4-operand FMAs with 3 sources and a separate destination. We now have 3 operands – and still 3 sources, so one of them gets destroyed (this makes the FMA instructions unique in AVX). For each of the old forms, we now have 3 new instructions, using 132, 213, and 231 designations. The VEX.W bit no longer selects the source that comes from memory – instead, it selects the floating-point type (single or double precision). Finally, the scalar instructions preserve the upper bits of the destination (up to bit 127) instead of zeroing them. These instructions are (still) not in Sandy Bridge; we are planning to ship them in a subsequent processor.

    Example: we previously had

    VFMADDSS xmm1, xmm2, xmm3, m32, which was xmm1 = xmm2*xmm3 + m32


    (and the upper bits of the XMM register - from bit 32 to 127 - were zeroed).

    NOW we have three forms:

    VFMADD132SS xmm1, xmm2, m32, which is xmm1 = xmm1*m32 + xmm2
    VFMADD213SS xmm1, xmm2, m32, which is xmm1 = xmm2*xmm1 + m32
    VFMADD231SS xmm1, xmm2, m32, which is xmm1 = xmm2*m32 + xmm1


    (and note that now the upper bits of the XMM register – from bit 32 to 127 - are unchanged).

    The numbers in the instruction mnemonics come from the order of the operands in the expression .... so VFMADD132 is 1*3+2; VFMADD213 is 2*1+3, and VFMADD231 is 2*3+1.
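    The naming rule can be written down as a small Python sketch. The function below is a model for illustration only: it takes the three digits of the mnemonic and the three operand values, and returns what would be written back to operand 1. Integers are used so that the single rounding of a real fused multiply-add does not matter here.

```python
def fmadd(order: str, op1, op2, op3):
    """Model VFMADD<order>: digits a, b, c mean operand_a * operand_b + operand_c.

    The result is what the instruction writes to operand 1 (the destination).
    """
    operands = {"1": op1, "2": op2, "3": op3}
    a, b, c = order
    return operands[a] * operands[b] + operands[c]


# With op1 = 2, op2 = 3, op3 = 5:
r132 = fmadd("132", 2, 3, 5)  # 2*5 + 3
r213 = fmadd("213", 2, 3, 5)  # 3*2 + 5
r231 = fmadd("231", 2, 3, 5)  # 3*5 + 2
```

    Reading the forms this way also shows which value is lost: operand 1 is always overwritten, so the three forms differ in whether the destroyed source is a multiplicand (132, 213) or the addend (231).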

    The three forms allow you to avoid having to do extra copies or loads – most of the time. The primary exception is in code where you really did need to re-use all the sources, as in butterflies:

    y0 = x0*c0 + x1
    y1 = x0*c0 - x1


    which generally incurs one copy for every 2 FMA’s. So far, this doesn’t appear to be much of a performance hit.
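    A toy register-file model in Python shows where that copy comes from. Everything here is an illustrative assumption: the register names, the `vmov` helper, and the use of a multiply-subtract in the 213 form (modeled by analogy with the VFMADD213 form described above). The point is only that a destructive destination forces one source to be saved before the first of the two operations.

```python
def vmov(regs, dst, src):
    """Register copy: the extra instruction the butterfly pays for."""
    regs[dst] = regs[src]


def vfmadd213(regs, dst, src2, src3):
    """dst = src2 * dst + src3 (dst is both a source and the destination)."""
    regs[dst] = regs[src2] * regs[dst] + regs[src3]


def vfmsub213(regs, dst, src2, src3):
    """dst = src2 * dst - src3, modeled by analogy with vfmadd213."""
    regs[dst] = regs[src2] * regs[dst] - regs[src3]


regs = {"x0": 3, "c0": 4, "x1": 5}

vmov(regs, "tmp", "x0")             # the one copy per two FMAs
vfmadd213(regs, "tmp", "c0", "x1")  # tmp = c0*x0 + x1
vfmsub213(regs, "x0", "c0", "x1")   # x0  = c0*x0 - x1 (x0 is overwritten last)

y_plus, y_minus = regs["tmp"], regs["x0"]
```

    Without the `vmov`, the first operation would overwrite `x0` and the second would compute with the wrong value; with the old 4-operand forms no copy would have been needed.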

    Added: VEX forms of the PCLMULQDQ instruction
    The PCLMULQDQ instruction we plan to release on our upcoming 32-nm cores (codename “Westmere”) will be extended by the VEX prefix in our subsequent cores (codename “Sandy Bridge”). The VEX form brings a distinct destination register. Only the 128-bit form of the instruction is supported – the instruction is not promoted to 256-bit. As with the VAES instructions, VPCLMULQDQ will require the careful programmer to check that two CPUID flags are set (in this case CPUID.PCLMULQDQ AND CPUID.AVX).

    A number of miscellanea
    We clarified the alignment-check exception (#AC) behavior for MASKMOV instructions (by the way, does anyone actually use #AC?).
    We clarified the exception type for the packed shift instructions (PSLL, PSRL, PSRA).
    The Encoding rule table (4-3) is clarified to reflect PEXTRW.

    We hope these changes are not too disruptive to you and thank you for your support of our ongoing early disclosure policy. We have updated the Intel® Software Development Emulator with all of these changes. If you have any questions or concerns about the impact of these changes to your application, or would like more detail on any of these changes, I encourage you to start a thread at /en-us/forums/intel-avx-and-cpu-instructions, or contact me directly.

    Regards,
    Mark Buxton
    Software Engineer
    Intel Corporation
    Mark.J.Buxton@intel.com

  • Parallel Computing
  • WhatIf Software
  • Intel® Software Development Emulator
  • Intel® Advanced Vector Extensions
  • Developers
  • AVX debugging, or how is it done after all?

    AVX has been defined and finalized, and it is already on its way to us. Much has been said about the various sides of development: compilation, emulation, documentation and even profiling (I highly recommend taking a look at /en-us/avx/), but there has been rather little information about debugging.

    To be honest, though, it was all there already. But today it has become even more convenient, and more visual, to debug the movement of bits across the 256-bit field of the AVX registers.

    In short, I recommend getting better acquainted with SDE (/en-us/articles/intel-software-development-emulator).

    The emulator not only handles the full instruction set well and quietly, it can also show exactly what happened.

    To begin with, I want to draw your attention to the additional help argument -thelp, which expands into a rather long list of options. Among them are the so-called Debugtrace knobs, of which -debugtrace and -dt_start_int3 deserve special mention.

    Using them produces a report file, debugtrace.out (the default name), which clearly shows the instructions and, most importantly, their operands with the values used.

    For example, I get:

    TID0: INS 0x00401f4d                     vrcpss xmm7, xmm5, xmm5
    TID0:      XMM7 := 00000000_00000000_00000000_3ba57800
    XMM7 (doubles) := 0 4.94411e-315
    XMM7 (floats)  := 0 0 0 0.00504971
    TID0: INS 0x00401f51                     vsubss xmm5, xmm1, xmm0
    TID0:      XMM5 := 00000000_00000000_00000000_43460000
    XMM5 (doubles) := 0 5.57633e-315
    XMM5 (floats)  := 0 0 0 198
    TID0: INS 0x00401f55                     vmulss xmm5, xmm5, xmm7
    TID0:      XMM5 := 00000000_00000000_00000000_3f7ff5a0
    XMM5 (doubles) := 0 5.26353e-315
    XMM5 (floats)  := 0 0 0 0.999842



    Here you can clearly see that vmulss (scalar floating-point multiplication) receives 0.00504971 (XMM7) and 198 (XMM5) as operands. The result is left in XMM5 (0.999842), which according to my calculator is correct.
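    The same check can be done without a calculator by reinterpreting the hex words from the trace as single-precision floats. The Python sketch below is only an illustration: the helper names are made up, and the register-dump layout (32-bit words separated by underscores, lowest lane last) is assumed from the sample output shown above.

```python
import struct


def lanes(register_dump: str):
    """Split '00000000_..._3ba57800' into 32-bit hex words (lowest lane last)."""
    return register_dump.strip().split("_")


def hex_to_float32(word: str) -> float:
    """Reinterpret an 8-digit hex word as an IEEE 754 single-precision value."""
    return struct.unpack(">f", bytes.fromhex(word))[0]


def float32_to_hex(value: float) -> str:
    """Round a Python float to single precision and return its bit pattern."""
    return struct.pack(">f", value).hex()


# Low lanes of XMM7 and XMM5 as they appear before the vmulss in the trace.
xmm7 = hex_to_float32(lanes("00000000_00000000_00000000_3ba57800")[-1])
xmm5 = hex_to_float32(lanes("00000000_00000000_00000000_43460000")[-1])

# Scalar multiply, then back to a single-precision bit pattern.
product_bits = float32_to_hex(xmm5 * xmm7)
print(xmm7, xmm5, product_bits)
```

    The resulting bit pattern can be compared directly with the low word of XMM5 on the line after the vmulss entry, which avoids any doubt about decimal rounding in the printed values.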

    The structure of debugtrace.out is actually quite simple, and almost at once, or at least at second glance, you can see the latest values of the registers or memory in use.

    For greater convenience, I also suggest looking at -dt_start_int3, which lets you "bracket" the code of interest for more detailed analysis within SDE.

    I think there are no problems left now, or are there?

  • Open Source
  • Intel® Software Development Emulator
  • Intel® Advanced Vector Extensions
  • Intel® Software Development Emulator Download

    Intel® Software Development Emulator (released November 16, 2013)

    Previous versions of the Intel® Software Development Emulator

    Please take a moment to register with Intel® DZ to participate in forum discussions.

    Back to the Intel® Software Development Emulator page.

  • Developers
  • WhatIf Software
  • Intel® Software Development Emulator
  • Intel® Advanced Vector Extensions
  • Intel® Memory Protection Extensions
  • Intel® Secure Hash Algorithm Extensions
  • Intel® Streaming SIMD Extensions
  • Intel® Transactional Synchronization Extensions
  • Parallel Computing

    Buy or Renew Intel® Software Development Products

    Intel offers several licensing options for our software development products. Review the choices below to buy or renew Intel® software.

    30 day evaluation versions of Intel® Software Development Products are also available for free download. Visit our Software Evaluation Center to download free evaluation versions of the products.

    All prices listed below are for single developer commercial licenses. All prices are Manufacturer Suggested List Prices (MSRP) and subject to change without notice. Prices do NOT include Value Added Taxes (VAT) or any other state or local taxes or charges.

    • For floating licenses, node-locked licenses, or other licensing options, contact a reseller, or contact an Intel representative at intel.software.sales@intel.com.
    • To purchase an academic research license, please select your desired product and the discounted price will be displayed during check out. For additional information on all of our education offerings, visit our Education Offerings Center, or contact an Intel representative at academicdevelopersinfo@intel.com.
    • Support Renewal extends your support for one year from the expiration date of your current support agreement.
    • Existing customers can take advantage of special upgrade prices for Intel® Parallel Studio XE, Intel® C++ Studio XE or Intel® Fortran Studio XE. See details of upgrade offer.

    Each product below is listed with its Product MSRP (single user) and, where one is offered, its Support Renewal MSRP (single user).

    Product Suites

    Intel® Parallel Studio XE for Windows*
    Includes Intel® Composer XE, Intel® VTune™ Amplifier XE, Intel® Inspector XE, Intel® Advisor XE
    Product MSRP: $2,299; Support Renewal MSRP: $799**

    Intel® Parallel Studio XE for Linux*
    Includes Intel® Composer XE, Intel® VTune™ Amplifier XE, Intel® Inspector XE, Intel® Advisor XE
    Product MSRP: $2,299; Support Renewal MSRP: $799**

    Intel® C++ Studio XE for Windows or Linux
    Includes Intel® C++ Composer XE, Intel® VTune™ Amplifier XE, Intel® Inspector XE, Intel® Advisor XE
    Product MSRP: $1,599; Support Renewal MSRP: $599**

    Intel® Visual Fortran Studio XE for Windows
    Includes Intel® Visual Fortran Composer XE, Intel® VTune™ Amplifier XE, Intel® Inspector XE, Intel® Advisor XE
    Product MSRP: $1,899; Support Renewal MSRP: $699**

    Intel® Fortran Studio XE for Linux
    Includes Intel® Fortran Composer XE, Intel® VTune™ Amplifier XE, Intel® Inspector XE, Intel® Advisor XE
    Product MSRP: $1,899; Support Renewal MSRP: $699**

    Intel® Parallel Studio
    Includes Intel® Parallel Advisor, Intel® Parallel Amplifier, Intel® Parallel Composer, Intel® Parallel Inspector
    Product MSRP: $320

    Intel® Cluster Studio XE for Windows
    Includes Intel® C++ Composer XE, Intel® Visual Fortran Composer XE, Intel® Trace Analyzer and Collector, Intel® MPI Library, Intel® MPI Benchmarks, Intel® Inspector XE and Intel® VTune™ Amplifier XE, Intel® Advisor XE
    Product MSRP: $2,949; Support Renewal MSRP: $1,049**

    Intel® Cluster Studio XE for Linux
    Includes Intel® C++ Composer XE, Intel® Fortran Composer XE, Intel® Trace Analyzer and Collector, Intel® MPI Library, Intel® MPI Benchmarks, Intel® Inspector XE and Intel® VTune™ Amplifier XE, Intel® Advisor XE
    Product MSRP: $2,949; Support Renewal MSRP: $1,049**

    Intel® Cluster Studio for Windows
    Includes Intel® Composer XE, Intel® Trace Analyzer and Collector, Intel® MPI Library, Intel® MPI Benchmarks
    Product MSRP: $2,049; Support Renewal MSRP: $749**

    Intel® Cluster Studio for Linux
    Includes Intel® Composer XE, Intel® Trace Analyzer and Collector, Intel® MPI Library, Intel® MPI Benchmarks
    Product MSRP: $2,049; Support Renewal MSRP: $749**

    Intel® System Studio for Linux* including JTAG Debugger
    Intel® VTune™ Amplifier, Intel® Inspector, Intel® C++ Compiler, Intel® Integrated Performance Primitives, Intel® Math Kernel Library, Intel® JTAG Debugger, GDB* Debugger
    Product MSRP: $3,499; Support Renewal MSRP: $1,299

    Intel® System Studio for Linux*
    Intel® VTune™ Amplifier, Intel® Inspector, Intel® C++ Compiler, Intel® Integrated Performance Primitives, Intel® Math Kernel Library, GDB* Debugger
    Product MSRP: $1,999; Support Renewal MSRP: $699

    Compilers and Libraries

    Intel® Composer XE for Windows
    Includes Intel® C++ Composer XE, Intel® Visual Fortran Composer XE
    Product MSRP: $1,199; Support Renewal MSRP: $449**

    Intel® Composer XE for Linux
    Includes Intel® C++ Composer XE, Intel® Fortran Composer XE
    Product MSRP: $1,449; Support Renewal MSRP: $499**

    Intel® C++ Composer XE for Windows, Linux, or OS X*
    Includes Intel® C++ Compiler, Intel® Integrated Performance Primitives 8.0, Intel® Math Kernel Library, Intel® Parallel Building Blocks
    Product MSRP: $699; Support Renewal MSRP: $249**

    Intel® Visual Fortran Composer XE for Windows
    Includes Intel® Visual Fortran Compiler, Intel® Math Kernel Library
    Product MSRP: $849; Support Renewal MSRP: $299**

    Intel® Fortran Composer XE for Linux
    Includes Intel® Fortran Compiler, Intel® Math Kernel Library
    Product MSRP: $999; Support Renewal MSRP: $349**

    Intel® Fortran Composer XE for OS X
    Includes Intel® Fortran Compiler, Intel® Math Kernel Library
    Product MSRP: $849; Support Renewal MSRP: $299**

    Intel® C++ Compiler for Android*
    Product MSRP: $79.95; Support Renewal MSRP: N/A

    Intel® C++ Compiler Professional Edition for QNX Neutrino* RTOS Support
    Includes Intel® C++ Compiler, Intel® Integrated Performance Primitives 8.0
    Product MSRP: $599; Support Renewal MSRP: $240

    Intel® C Compiler for EFI Byte Code
    Product MSRP: $995; Support Renewal MSRP: $398

    Intel® Visual Fortran Composer XE for Windows with IMSL 6.0*
    Includes Intel® Visual Fortran Compiler, IMSL* Fortran Numerical Library, Intel® Math Kernel Library. Includes 1 developer and 1 deployment license for the developer.
    Product MSRP: $2,049; Support Renewal MSRP: $849**

    Embedded and Mobile System Development

    Intel® System Studio for Linux* including JTAG Debugger
    Intel® VTune™ Amplifier, Intel® Inspector, Intel® C++ Compiler, Intel® Integrated Performance Primitives, Intel® Math Kernel Library, Intel® JTAG Debugger, GDB* Debugger
    Product MSRP: $3,499; Support Renewal MSRP: $1,299

    Intel® System Studio for Linux*
    Intel® VTune™ Amplifier, Intel® Inspector, Intel® C++ Compiler, Intel® Integrated Performance Primitives, Intel® Math Kernel Library, GDB* Debugger
    Product MSRP: $1,999; Support Renewal MSRP: $699

    Performance Libraries

    Intel® Integrated Performance Primitives 8.0 for Windows, Linux, or OS X
    Product MSRP: $199; Support Renewal MSRP: $69**

    Intel® Math Kernel Library for Windows or Linux
    Product MSRP: $499; Support Renewal MSRP: $179**

    Intel® Threading Building Blocks for Windows, Linux, or OS X
    Product MSRP: $499; Support Renewal MSRP: $179**

    Performance Profilers

    Intel® VTune™ Amplifier XE for Windows or Linux
    Product MSRP: $899; Support Renewal MSRP: $349**

    Thread and Memory Checkers

    Intel® Inspector XE for Windows or Linux
    Product MSRP: $899; Support Renewal MSRP: $349**

    Cluster Tools

    Intel® Cluster Studio XE for Windows
    Includes Intel® C++ Composer XE, Intel® Fortran Composer XE, Intel® Trace Analyzer and Collector, Intel® MPI Library, Intel® MPI Benchmarks, Intel® Inspector XE and Intel® VTune™ Amplifier XE, Intel® Advisor XE
    Product MSRP: $2,949; Support Renewal MSRP: $1,049**

    Intel® Cluster Studio XE for Linux
    Includes Intel® C++ Composer XE, Intel® Fortran Composer XE, Intel® Trace Analyzer and Collector, Intel® MPI Library, Intel® MPI Benchmarks, Intel® Inspector XE and Intel® VTune™ Amplifier XE, Intel® Advisor XE
    Product MSRP: $2,949; Support Renewal MSRP: $1,049**

    Intel® Cluster Studio for Windows
    Includes Intel® Composer XE, Intel® Trace Analyzer and Collector, Intel® MPI Library, Intel® MPI Benchmarks
    Product MSRP: $2,049; Support Renewal MSRP: $749**

    Intel® Cluster Studio for Linux
    Includes Intel® Composer XE, Intel® Trace Analyzer and Collector, Intel® MPI Library, Intel® MPI Benchmarks
    Product MSRP: $2,049; Support Renewal MSRP: $749**

    Intel® MPI Library for Windows or Linux
    Product MSRP: $499; Support Renewal MSRP: $179**

    System Modeling and Simulation Tools

    CoFluent Studio*
    Product MSRP: N/A; Support Renewal MSRP: N/A

    CoFluent Reader
    Product MSRP: N/A; Support Renewal MSRP: N/A

    Intel® Graphics Performance Analyzers

    Intel® Graphics Performance Analyzers
    Product MSRP: N/A; Support Renewal MSRP: N/A

    **Lowest Price available if you renew prior to current subscription expiration. For more information on renewals click here.

    Intel takes your privacy seriously. Refer to Intel's Privacy Notice and Serial Number Validation Notice regarding the collection and handling of your personal information, the Intel product’s serial number and other information.

    Intel® Software Development Emulator Release Notes

    2013-11-16 version 6.12.0
    • Added support for Mac OS X version 10.9.
    • Improved the TSX statistics information.
    • Various fixes with the emulation of floating-point instructions of Intel AVX-512.
    • Enabled the alignment checker tool by default for instructions that require alignment.
    • Fixed mismatch between mix and dynamic mask profiler.
    • Updated the Intel MPX runtime libraries for Windows. 
    • Performance improvements when modeling a CPU prior to AVX-512.
    2013-09-21 version 6.7.0
    • Debugging with GDB is now supported with Intel® AVX-512. Download the new GDB from here.
    • Emulation of Intel® AVX2 FMA and Intel AVX-512 FMA uses native FMA instructions when running on a Haswell host.
    • Various fixes with the emulation of floating-point and conversion instructions of Intel AVX-512.
    • Disassembly of control transfer instructions displays the 'bnd' prefix when used with Intel® MPX.
    • Updated the XED ISA set names for Intel AVX-512. This is visible in 'mix' statistics output.
    • This release goes with 2013-08-29 version of the Intel MPX runtime.
    2013-07-22 version 6.1.0
    • Emulation support for the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions present on the Intel Knights Landing microarchitecture.
    • Emulation support for the Intel® Secure Hash Algorithm (Intel® SHA) extensions present on the Intel Goldmont microarchitecture.
    • Emulation support for the Intel® Memory Protection Extensions (Intel® MPX)  present on the Intel Skylake and Goldmont microarchitectures.
    • Support for Hardware Lock Elision introduced on the Intel Haswell microarchitecture
    • Improved support for Restricted Transactional Memory introduced on the Intel Haswell microarchitecture.
    • Improved support for the OS X* operating system (Mountain Lion)
    • The footprint tool now has the ability to compute footprint over time for working-set estimation.
    • A new tool called the dynamic mask profiler is provided using -dyn_mask_profile knob. The output is in a simple XML format.
    The Intel SDE development team has grown to include Michael Berezalsky, Mark Charney, Michael Gorin, Omer Mor, Ariel Slonim and Ady Tal.
     
     
    2013-01-03 version 5.38
    • Improvements in RTM emulation stability. Added statistics knobs.  Updated knobs.
    • Support for debugging integration with Microsoft Visual Studio 2012. See main page for information.
    • Improved multithreaded stability when using the AVX/SSE transition checker
    • Mac OS X: support for code-signed binaries, simplifying execution. See main page for information about the "taskport".
    • XED: added elf/dwarf support back to the command line tool
    • TZCNT ZF flags fix
     
    2012-11-01 version 5.31 - major update
    • Major update including fixes for the processor codenamed Haswell and introduction of instructions in the processor codenamed Broadwell
    • First public SDE release for OS X, 10.6 and 10.7.  See additional information on the main Intel SDE web page for required permissions.
    • HSW's RTM mode is supported with the "-rtm-mode full" option. This feature is very new and the Intel SDE implementation might be a little unstable.
    • Completely new mechanism for handling of CPUID. CPUID values now come from an input file.
    • SDE's -chip-check feature checks to make sure instructions are valid for the specified chip. See "sde -help" for the various chip options.
    • Exception handling fixes
    • Haswell BMI emulation fixes, including flags output.
    • Debugtrace multithreading safety improvements
    • Mix top-blocks sorting issues.  Mix also has better support for allocating stats to overlapping blocks.
    • Mix default blocks size is now 1500 instructions to avoid fragmenting large hot blocks.
    • XED now can emit "dot" graphs for specified regions:  path-to-sde-kit/xed -i SOMEEXE -as 0x40316b -ae 0x4031b3 -dot foo.dot; dot -O -Tpdf foo.dot
    • Mix has a legacy-prefix histogram
    • Footprint tool can now collect stats about unique memory pages as well as unique cache lines. The footprint tool is now faster as well.
    • Improved speed of AVX/SSE transition checker by roughly 12%. See the -ast knob in "sde -thelp".
    • Fixed some numerical errors in our software emulation of the FMA instruction for denormal numbers.
    • Various stability improvements from using a newer version of Pin.
    • Better handling of MXCSR exception status bits for AVX1/2 instructions. We still do not support raising unmasked floating point errors from emulated instructions.
    • Can now set environment variables from the command line with the -env VAR VALUE option.
    • The commands for the GDB interface have been updated. See "monitor help sde" when attached as described on the main page. Please use GDB 7.4 or later.
    • The chip check error message includes the instruction bytes of the offending instruction.
    • Multiprocess output file handling. You used to have to supply "-i" to get the process id inserted in to the file name to avoid multiprocess applications from overwriting the common output files. Now we attempt to detect the creating of other processes and add the PID to the file names automatically.  The parent / child relationship is recorded in the file name.
    • Better support for unused bits in the VEX encodings in 32b mode.
     
    2011-12-15 version 4.46
    • Linux* 3.x is supported
    • Better support for running on Intel® AVX-enabled hosts
    • All output files now begin "sde-" and end with ".txt" by default
    • Mix is faster and does more analysis of SIMD operations
    • Mix has line number support for the top blocks when the information is available in the application
    • The -ptr-chk option now checks the memory references of gather operations
    • Fixed a file descriptor leak when exec'ing thousands of threads on Linux*.
    • Misc other stability improvements.
     
    2011-07-01 version 4.29
    • Support for the Haswell new instructions in the Intel AVX programmers reference version 11.
    • Mix now includes category and instruction length histograms automatically so the corresponding knobs were removed.
    • Many other changes
     
    2010-12-23 version 3.89 (Linux* only)
    • Fixed runtime libraries. Version 3.88 accidentally included runtime libraries that require a newer version of glibc than is present on older systems (like RHEL4).
    2010-12-21 version 3.88
     
    • Support for the post-32nm processor instructions for the processor codenamed Ivy Bridge in the 008 revision of the Intel AVX programmers reference document
    • Many stability improvements
    • "sde -thelp" goes to stdout, not stderr
    • mix has a "-demangle 0" option to turn off demangling
    • xed disassembler handles uninitialized code sections in windows binaries
    • xed supports dwarf line number information with the -line knob on Linux*.
    • mix has improved memory efficiency
    • To debug on Linux*, you no longer need the -avx-gdb knob but you must use gdb 7.2 or later which supports a new XML remote-debug protocol.
     
    2010-03-11 version 3.09
     
     
    • When pin or sde crashes due to bugs in user applications, the output of the circular buffer used for -itrace-execute (etc.) was not being dumped to disk. It is now.
    • Fixed circular buffer used for -itrace-execute and -itrace-execute-emulate. It was not initializing the circular buffer when -itrace-lines was used and would just crash immediately.  In addition to *actually* making the feature work, I sped it up immensely by reusing allocated string buffers.
    • Fixed 14 scalar Intel AVX instructions that were referencing too much memory (128b instead of 32b or 64b).
    • Made the xsave emulator be enabled all the time even when xsave is present on the hardware. One can disable it with '-xsave 0'.
    • All output log / stats file names now end in .txt by default.
    • Added a descriptive header to the top of the Intel AVX/Intel SSE transition output file.
    • debugtrace now prints mmx (and x87) register values
    • vmaskmov* instructions are now implemented in a thread-safe way.
    • vpmov[sz]x instructions now correctly reference less memory to avoid extra page accesses.
    • New memory pointer checker. This option checks all memory references for accessibility before the user application program is allowed to access memory. There is also a null pointer checker, which previously would only check Intel AVX instructions. The null checker writes to stderr (if accessible) and to a file sde-null-check.out.txt. The pointer checker writes to stderr (if accessible) and to a file sde-ptr-check.out.txt. The new knobs are: -null-check and -ptr-check
    • Enforcing VL=128 on all Intel AVX scalar instructions.
    • Fixes for the -no-avx and -no-aes knobs in the sde driver
    • xed: many corner case bugs fixed after yet another validation review
     
     
    2010-02-08 version 3.00
    • Changed output files to have .txt suffix.
    • debugtrace prints x87 and mmx registers
    • thread-safety fix for vmaskmov* instructions
    • reduced amount of memory referenced by vpmov[sz]* instructions.
    • New memory pointer checker (See -ptr-check and -null-check knobs)
    • Added VL=128 requirement for Intel AVX scalar instructions.
    • Fixed knobs -no-avx and -no-aes in the sde front end driver
     
    2009-12-31 version 2.94
    Major update.
    • Better support for recent Linux* distributions, like Ubuntu* 9.10.
    • Better support for debugging with GDB on Linux*.
    • Using GDB 7.0.50, and "sde -debug -avx-gdb -- yourapp", gdb can directly obtain Intel AVX register values without requiring "monitor yreg N" or "monitor yregs" commands.
    • Windows version supports latest dbghelp.dll 6.11.1.404
    • Fixes for paths with spaces
    • Using Pin's "safecopy" mechanism to access user memory
    • Spelling fixes
    • Tool arguments grouped more sensibly; See the output of "sde -thelp"
    • Support for Intel AVX unmasked zero divide exceptions on Windows
    • Intel AVX/Intel SSE transition tracing feature with -ast-trace knob
    • Intel AVX/Intel SSE transition checker emits previous block information
    • CPUID leaf-zero emulation support
    • Alignment checker upgrades
    • XED disassembler supports windows debugging symbols (via dbghelp.dll)
    • Fix for NaN case in Intel® SSE4.1 roundss on Linux* only
    • Fix for Intel® SSE4 PEXTRW gpr,xmm
    • More CPUID feature knobs for Intel® SSE technologies
    • Fix for a case in the emulation of single-precision FMA that affected accuracy
    • Support for FZ and DAZ in FMA routines
    • Data watch point support
    • Fix for MXCSR.OE and IE for vcomiss/vucomiss on NaN inputs
    • New chip-check feature to restrict instructions to specific chips. See "sde -thelp"
    • Fast icounting feature (faster than using mix)
    • Fixes for NaN issues on Windows with sqrt, mul, div, sub and cmp: it was quieting SNaNs.
    • The upgraded Pin can execute applications that contain illegal instructions, and an application-installed handler will be invoked.
    • New -itrace* knobs
    • Circular buffer support in debugtrace


    2009-01-30 version 1.70
    Added VPCLMULQDQ
     
    2009-01-09 version 1.61
    Synchronizing with Intel AVX architecture update.
    New 3-operand FMA instructions, removed VPERMIL2{PS,PD}, miscellaneous bug fixes.
    New footprint feature.
    Rearranged mix output, added function summaries.
    New version of dbghelp.dll required for windows (See the FAQ).
     

    2008-08-10 version 1.13

    Initial Release
  • Exploring Intel® Transactional Synchronization Extensions with Intel® Software Development Emulator


    Intel® Transactional Synchronization Extensions (Intel® TSX) is perhaps one of the most non-trivial extensions of instruction set architecture introduced in the 4th generation Intel® Core™ microarchitecture code name Haswell. Intel® TSX implements hardware support for a best-effort “transactional memory”, which is a simpler mechanism for scalable thread synchronization as opposed to inherently complex fine-grained locking or lock-free algorithms. The extensions have two interfaces: Hardware Lock Elision (HLE) and Restricted Transactional Memory (RTM). 

    In this blog I will show how you can write your first RTM code and execute it in an emulated environment now, without waiting until the 4th generation Intel® Core™ processors become available for purchase.

    Before diving in, please make sure you have a basic understanding of the new RTM instructions. I refer you to this blog as an introduction. Also check out the Intel Developer Forum’12 presentation by Ravi Rajwar and Martin Dixon discussing the details of the Intel TSX implementation in Haswell hardware, and a presentation by Andi Kleen on adding lock elision (also using RTM) to Linux.

    My plan was to write a toy bank account processing application using popular C++ thread-unaware data structures from STL with concurrent access to bank records managed by Intel TSX. This way the implementation should be very simple, thread-safe and scalable.

    Development Environment

    For this experiment one needs the recent version (5.31) of Intel® Software Development Emulator (Intel® SDE) and a compiler that can generate RTM instructions (via intrinsics or direct machine code). Please note that performance measurements with Intel SDE running RTM are of limited value because the overhead of emulating TM in software instead of using real hardware is huge, but as you will see later Intel SDE can already demonstrate important points for RTM usage for concurrency library developers and application programmers.

    Since my laptop runs Windows, I decided to try Intel SDE/RTM on Windows. I chose the C++ compiler from “Microsoft Visual Studio 2012 for Windows Desktop” (there is a free “Express” version that works for my purpose too). With a few clicks I quickly set up a console application project and included the immintrin.h header in the main .cpp file to use the RTM intrinsics.

    The Test

    As the bank account structure I chose a simple std::vector<int> from the C++ standard template library. “Accounts[i]” stores the current account balance for account number i. This is a very simple and popular but thread-unsafe data structure, which must be protected by concurrency control mechanisms for parallel access. Usually locks/mutexes are used to limit the number of threads accessing the structure simultaneously. However, for parallel write accesses the whole data structure is usually locked exclusively even if only distinct parts of it have to be updated. Intel TSX should help here since it can optimistically execute writes, and if no real data conflict happens, the writes are committed without serializing.

    To simplify the operations on the accounts I wanted to implement an easy-to-use C++ wrapper for protecting the current C++ scope from unsafe concurrent access to the data:

    {
            std::cout << "open new account"<< std::endl;
            TransactionScope guard; // protect everything in this scope
            Accounts.push_back(0);
    }
    {
            std::cout << "open new account"<< std::endl;
            TransactionScope guard; // protect everything in this scope
            Accounts.push_back(0);
    }
    {
            std::cout << "put 100 units into account 0"<<std::endl;
            TransactionScope guard; // protect everything in this scope
            Accounts[0] += 100; // atomic update due to RTM
    }
    {
            std::cout << "transfer 10 units from account 0 to account 1 atomically!"<< std::endl;
            TransactionScope guard; // protect everything in this scope
            Accounts[0] -= 10;
            Accounts[1] += 10;
    }
    {
            std::cout << "atomically draw 10 units from account 0 if there is enough money"<< std::endl;
            TransactionScope guard; // protect everything in this scope
            if(Accounts[0] >= 10) Accounts[0] -= 10;
    }
    {
            std::cout << "add 1000 empty accounts atomically"<< std::endl;
            TransactionScope guard; // protect everything in this scope
            Accounts.resize(Accounts.size() + 1000, 0);
    }
    

    Legacy applications implement such guards using a lock that allows only a single writer to execute the critical section (read-write locks are more complicated to handle and also do not make much sense here in our case because all accesses are writes/updates):

    class TransactionScope
    {
            SimpleSpinLock & lock;
            TransactionScope(); // forbidden
    public:
            TransactionScope(SimpleSpinLock & lock_): lock(lock_) { lock.lock(); }
            ~TransactionScope() { lock.unlock(); }
    };
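
    The SimpleSpinLock class used by this guard is not shown in the post. A minimal sketch of what such a lock might look like, assuming only the three members that the guards here call (lock(), unlock(), isLocked()) and using std::atomic from C++11; the class in the attached source may well differ:

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for SimpleSpinLock; the interface is an assumption
// derived from how TransactionScope uses the lock in this post.
class SimpleSpinLock
{
        std::atomic<bool> locked;
        SimpleSpinLock(const SimpleSpinLock &);             // forbidden
        SimpleSpinLock & operator=(const SimpleSpinLock &); // forbidden
public:
        SimpleSpinLock() : locked(false) {}

        // spin until this thread flips the flag from false to true
        void lock()
        {
                while(locked.exchange(true, std::memory_order_acquire))
                {
                        // read-only wait until the current holder releases the lock
                        while(locked.load(std::memory_order_relaxed)) { }
                }
        }

        void unlock()
        {
                assert(isLocked()); // unlocking a free lock is a bug
                locked.store(false, std::memory_order_release);
        }

        // true if some thread currently holds the lock
        bool isLocked() const { return locked.load(std::memory_order_acquire); }
};
```

    The isLocked() query is what later allows a transaction to read the lock state without acquiring the lock.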
    

    Implementing and Testing with RTM

    A naive RTM implementation for TransactionScope (handling both read/lookup and write/update accesses transparently) would be (changed lines are marked with █):

    class TransactionScope
    {
    public:
            TransactionScope()
            {
    █               int nretries = 0;
    █               while(1)
    █               {
    █                       ++nretries;
    █                       unsigned status = _xbegin();
    █                       if(status == _XBEGIN_STARTED) return; // successful start
    █                       // abort handler
    █                       std::cout << "DEBUG: Transaction aborted "<< nretries <<
    █                          " time(s) with the status "<< status << std::endl;
    █               }
            }
    █       ~TransactionScope() { _xend(); }
    };
     

    I have successfully compiled this code and tried to run it through Intel SDE:

    ./sde-bdw-external-5.31.0-2012-11-01-win/sde.exe -hsw -rtm-mode full -- ./ConsoleApplication1.exe
    open new account
    DEBUG: Transaction aborted 1 time(s) with the status 0
    DEBUG: Transaction aborted 2 time(s) with the status 0
    DEBUG: Transaction aborted 3 time(s) with the status 0
    DEBUG: Transaction aborted 4 time(s) with the status 0
    DEBUG: Transaction aborted 5 time(s) with the status 0
    DEBUG: Transaction aborted 6 time(s) with the status 0
    DEBUG: Transaction aborted 7 time(s) with the status 0
    DEBUG: Transaction aborted 8 time(s) with the status 0
    DEBUG: Transaction aborted 9 time(s) with the status 0
    DEBUG: Transaction aborted 10 time(s) with the status 0
    DEBUG: Transaction aborted 11 time(s) with the status 0
    DEBUG: Transaction aborted 12 time(s) with the status 0
    DEBUG: Transaction aborted 13 time(s) with the status 0
    DEBUG: Transaction aborted 14 time(s) with the status 0
    DEBUG: Transaction aborted 15 time(s) with the status 0
    DEBUG: Transaction aborted 16 time(s) with the status 0
    

    and so on…

    The program went into an infinite loop, always aborting on the first transaction. The RTM debug log from Intel SDE (emx-rtm.txt) also confirmed that (I used the option “-rtm_debug_log 2”). Well, a general rule is that failure is more or less expected for any implementation that ignores the specification… The Intel® Architecture Instruction Set Extensions Programming Reference explicitly mentions that “the hardware provides no guarantees as to whether an RTM region will ever successfully commit transactionally”. Because of that, software using RTM must provide a (non-transactional) fall-back path that is executed if (many) aborts happen. (By the way: HLE provides the fall-back automatically, since on the first abort the same critical section is executed non-transactionally.)

    Implementing Fall-Back

    Here is our second attempt, which acquires a fall-back spin lock non-transactionally after a specified number of retries.

    LONGLONG naborted = 0; // global abort statistics, alternatively use the "-rtm_debug_log 2" Intel SDE option
     
    class TransactionScope
    {
    █       SimpleSpinLock & fallBackLock;
            TransactionScope(); // forbidden
    public:
    █       TransactionScope(SimpleSpinLock & fallBackLock_, int max_retries = 3) :
    █               fallBackLock(fallBackLock_)
            {
                    int nretries = 0;
                    while(1)
                    {
                            ++nretries;
                            unsigned status = _xbegin();
                            if(status == _XBEGIN_STARTED)
                            {
    █                               if(!fallBackLock.isLocked())
    █                                         return; // successfully started transaction
    █                               /* started transaction but someone is executing 
    █                                  the transaction section non-speculatively (acquired
    █                                  the fall-back lock) -> aborting */
    █                               _xabort(0xff); // abort with code 0xff
                            }
                            // abort handler
                            InterlockedIncrement64(&naborted); // do abort statistics
                            std::cout << "DEBUG: Transaction aborted "<< nretries <<
                                  " time(s) with the status "<< status << std::endl;
    █                       // handle _xabort(0xff) from above
    █                       if((status & _XABORT_EXPLICIT) && _XABORT_CODE(status)==0xff
    █                            && !(status & _XABORT_NESTED))
    █                       {       // wait until the lock is free
    █                               while(fallBackLock.isLocked()) _mm_pause();
    █                       }
    █                       // too many retries, take the fall-back lock
    █                       if(nretries >= max_retries) break;
                    }
    █               fallBackLock.lock();
            }
            ~TransactionScope()
            {
    █               if(fallBackLock.isLocked())
    █                       fallBackLock.unlock();
    █               else
                            _xend();
            }
    };
    

    The output looks much better now:

    open new account
    DEBUG: Transaction aborted 1 time(s) with the status 0
    DEBUG: Transaction aborted 2 time(s) with the status 0
    DEBUG: Transaction aborted 3 time(s) with the status 0
    open new account
    put 100 units into account 0
    transfer 10 units from account 0 to account 1 atomically!
    atomically draw 10 units from account 0 if there is enough money
    add 1000 empty accounts atomically
     

    One can see that all transactions except the first one succeeded on the very first attempt. The first one took the fall-back lock after three attempts. It was special since it had to reserve and touch new memory for the vector from the operating system. This is a very complex process involving system calls, privilege ring transitions (ring 3 [application] -> ring 0 [OS]), page faults and initialization/zeroing of very big chunks of memory, which may not fit into the transactional buffer. All this may cause aborts according to the Intel® Architecture Instruction Set Extensions Programming Reference.

    Leveraging RTM Abort Status Bits

    A further optimization that I came up with is leveraging the abort status information: in case of such “hard” aborts the “retry” bit (position 1) in the abort status is not set. The bit is set if hardware thinks the transaction may succeed on retry. I added the line marked below in the abort handler to implement it:

     

     // handle _xabort(0xff) from above
     if((status & _XABORT_EXPLICIT) && _XABORT_CODE(status)==0xff
          && !(status & _XABORT_NESTED))
     {
            while(fallBackLock.isLocked()) _mm_pause(); // wait until lock is free
    
    █} else if(!(status & _XABORT_RETRY)) break; /* take the fall-back lock
        if the retry abort flag is not set */
     

    The output:

    open new account
    DEBUG: Transaction aborted 1 time(s) with the status 0
    open new account
    put 100 units into account 0
    transfer 10 units from account 0 to account 1 atomically!
    atomically draw 10 units from account 0 if there is enough money
    add 1000 empty accounts atomically
     

    Now we see that the program makes faster progress by taking the fall-back lock sooner in the case of a “hard” abort.
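
    The decision made by the abort handler can also be written as a small pure function, which is easy to test without TSX hardware. This is only a sketch: the kAbort* constants, abortCode() and nextStep() are hypothetical names that mirror the _XABORT_* bit positions from immintrin.h so that the code compiles without that header.

```cpp
#include <cassert>

// Bit layout of the RTM abort status; hypothetical names mirroring immintrin.h.
const unsigned kAbortExplicit = 1u << 0; // _XABORT_EXPLICIT
const unsigned kAbortRetry    = 1u << 1; // _XABORT_RETRY
const unsigned kAbortNested   = 1u << 5; // _XABORT_NESTED

// Same as _XABORT_CODE(status): the _xabort() argument lives in bits 31:24.
inline unsigned abortCode(unsigned status) { return (status >> 24) & 0xff; }

enum NextStep { RetryTransaction, WaitForLockThenRetry, TakeFallBackLock };

// Decide what to do after an abort with the given status.
NextStep nextStep(unsigned status, int nretries, int max_retries)
{
        assert(nretries >= 1 && max_retries >= 1);
        // our own _xabort(0xff): someone held the fall-back lock
        bool lockWasHeld = (status & kAbortExplicit) && abortCode(status) == 0xff
                && !(status & kAbortNested);
        // "hard" abort: the hardware does not expect a retry to succeed
        if(!lockWasHeld && !(status & kAbortRetry)) return TakeFallBackLock;
        // retry budget used up (the handler in the post first waits for the
        // lock to become free in this case; acquiring the lock spins anyway)
        if(nretries >= max_retries) return TakeFallBackLock;
        return lockWasHeld ? WaitForLockThenRetry : RetryTransaction;
}
```

    With this sketch, a status of 0 (as in the output above) maps to TakeFallBackLock on the first abort, while a status of 6 maps to RetryTransaction until the retry budget is exhausted.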

    As you may notice, the changes so far were isolated within a single synchronization interface, TransactionScope. The application code was not changed. As generally available TSX software infrastructure evolves, you should look for a proven existing library that has (scope) locks with RTM support to avoid pitfalls in your synchronization primitives (we will talk about pitfalls in application code in future blogs). For example, a TSX-enabled pthread library for Linux is already available. On the other hand, it is not uncommon for existing applications to use extended or custom synchronization interfaces; converting them to take advantage of TSX is not a complicated task either, if done with care.

    Concurrent Accesses from Several Threads Managed by Intel TSX

    After basic debugging the time has come to see the real power of Intel TSX: run two worker threads doing random concurrent updates to the central account data structure:

    unsigned __stdcall thread_worker(void * arg)
    {
            int thread_nr = (int) arg;
            std::cout << "Thread "<< thread_nr<< " started."<< std::endl;
            // create thread-local TR1 C++ random generator from <random>
            std::tr1::minstd_rand myRand(thread_nr); 
            long int loops = 10000;
     
            while(--loops)
            {
                    {
                            TransactionScope guard(globalFallBackLock);
                            // put 100 units into a random account atomically
                            Accounts[myRand() % Accounts.size()] += 100;
                    }
     
                    {
                            TransactionScope guard(globalFallBackLock);
                            /* transfer 100 units between random accounts 
                               (if there is enough money) atomically */
                            int a = myRand() % Accounts.size();
                            int b = myRand() % Accounts.size();
                            if(Accounts[a] >= 100)
                            {
                                    Accounts[a] -= 100;
                                    Accounts[b] += 100;
                            }
                    }
            }
            std::cout << "Thread "<< thread_nr<< " finished."<< std::endl;
            return 0;
    }
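
    One way to check such a worker is an invariant on the total balance: every round deposits 100 units and a transfer only moves money, so the sum over all accounts must equal 100 x threads x rounds. The sketch below checks this on any machine; note that it swaps the RTM-based TransactionScope for a plain std::mutex and uses C++11 std::thread, and that runWorkers() is a hypothetical helper, not part of the attached source.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <numeric>
#include <random>
#include <thread>
#include <vector>

// Hypothetical helper: runs nthreads workers doing `loops` deposit/transfer
// rounds each and returns the sum of all balances afterwards.
long long runWorkers(int nthreads, long loops)
{
        std::vector<int> accounts(1000, 0);
        std::mutex guardMutex; // plain lock standing in for TransactionScope
        std::vector<std::thread> threads;
        for(int t = 0; t < nthreads; ++t)
        {
                threads.emplace_back([&accounts, &guardMutex, t, loops]()
                {
                        std::minstd_rand myRand(t + 1); // thread-local generator
                        for(long i = 0; i < loops; ++i)
                        {
                                {
                                        // put 100 units into a random account
                                        std::lock_guard<std::mutex> guard(guardMutex);
                                        accounts[myRand() % accounts.size()] += 100;
                                }
                                {
                                        // transfer 100 units if there is enough money
                                        std::lock_guard<std::mutex> guard(guardMutex);
                                        std::size_t a = myRand() % accounts.size();
                                        std::size_t b = myRand() % accounts.size();
                                        if(accounts[a] >= 100)
                                        {
                                                accounts[a] -= 100;
                                                accounts[b] += 100;
                                        }
                                }
                        }
                });
        }
        for(std::size_t i = 0; i < threads.size(); ++i) threads[i].join();
        long long total = std::accumulate(accounts.begin(), accounts.end(), 0LL);
        // deposits add 100 per round; transfers only move money around
        assert(total == 100LL * nthreads * loops);
        return total;
}
```

    The worker above runs 9999 rounds per thread (while(--loops) starting from 10000), so two such threads starting from empty accounts should end with a total of 1999800 units.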
     

    I built a Release build without the DEBUG output and saw only about 100-300 aborts for a total of 20000 transactions. The debug output says that the abort status is 6: the retry and “memory access conflict” bits are set. This is exactly what I expected from Intel TSX: almost all updates are done in parallel and only a few are rolled back due to a conflict.
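
    To make a numeric abort status such as 6 easier to read, the flag bits can be decoded into names. A small sketch, assuming the bit order of the _XABORT_* flags in immintrin.h (explicit, retry, conflict, capacity, debug, nested in bits 0 to 5); hasAbortFlag() and describeAbortStatus() are hypothetical helpers:

```cpp
#include <cassert>
#include <string>

// Names of flag bits 0..5 of the RTM abort status, in bit order
// (_XABORT_EXPLICIT ... _XABORT_NESTED in immintrin.h).
static const char * const kAbortFlagNames[6] =
        { "explicit", "retry", "conflict", "capacity", "debug", "nested" };

// Hypothetical helper: true if flag bit `bit` (0..5) is set in `status`.
bool hasAbortFlag(unsigned status, int bit)
{
        assert(bit >= 0 && bit < 6); // only bits 0..5 are flag bits
        return (status & (1u << bit)) != 0;
}

// Hypothetical helper: space-separated names of all set flag bits.
std::string describeAbortStatus(unsigned status)
{
        std::string text;
        for(int bit = 0; bit < 6; ++bit)
        {
                if(!hasAbortFlag(status, bit)) continue;
                if(!text.empty()) text += " ";
                text += kAbortFlagNames[bit];
        }
        if(text.empty()) text = "none";
        return text;
}
```

    For example, describeAbortStatus(6) yields "retry conflict", and a status of 0 yields "none".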

    To double-check that my conclusions are right and that the emulator works as I expected, I added an increment of a global counter to the transactions to introduce a huge number of conflicting accesses. And yes, it worked: with that change I saw about 5-15K aborts. Although the absolute numbers obtained from the RTM emulator cannot exactly predict the execution metrics on future hardware, the orders of magnitude should still indicate possible issues with RTM usage.

    Last Words

    These were my experiences with RTM and the new Intel® Software Development Emulator. Get prepared for Haswell and check out how your software can use Restricted Transactional Memory with Intel SDE now!

    --

    Roman 

    (the complete source code is attached to the article)

  • Intel Transactional Synchronization Extensions (Intel TSX)
  • Restricted Transactional Memory (RTM)
  • Haswell
  • Intel Software Development Emulator
  • sde

    Attachments: 

    http://software.intel.com/sites/default/files/blog/335035/exploringinteltsx.cpp

  • Building and Simulating an App using the HTML5 Development Environment Beta


    The HTML5 section within the Intel Developer Zone was updated just before the US Thanksgiving holiday to release the new Intel® HTML5 Development Environment Beta, and I tried out a few of the sample apps. It took me about fifteen minutes to get one of the samples packaged into an .apk file and running on my Android tablet. This is the first cloud-based tool set that Intel has provided, and it's a basic solution for learning HTML5 and developing real cross-platform apps that can be submitted to the iOS, Google* Play, or even a Blackberry* app store.

    Before you get started.


    There are a few things you need to do before you can get to this 15-minute example. Some are necessary, some are just recommended, and some you may already have done, but all are free.

    #1 You need to log in to the Intel Developer Zone, or register if you are new: Click at the top right of the page you are reading now.

    #2  Request an account for the HTML5 Development Environment Beta: Click here to go to the HTML5 page, or go to  http://software.intel.com/en-us/form/html5-beta-request

    #3  Download Google Chrome:  The mobile system emulator in the IDE only runs in Chrome, so you will need this once you get your account.

    #4  Get a GitHub account:  https://github.com/signup/free. Recommended, as this can be used by the HTML5 Development Environment Beta, but also as a login ID for Adobe PhoneGap.

    #5 Get an account with http://build.phonegap.com (or use the GitHub account you set up in step #4).

    If you have any questions or comments about this procedure, or the tools, please post them in the HTML5 Forum section. http://software.intel.com/forums/html5-forum

    The Intel® HTML5 Development Environment is enabled with IDZ single sign-on, so once you get your approval email, crank up Chrome, log in to IDZ and click "Launch the Tool" on the HTML5 page.

    Be sure to check out the Mobile Device Emulator included with this tool. You can select from multiple screen sizes and orientations and see how your app will look and run on everything from a small phone to a large tablet.  The Emulator also includes support for tablet and phone sensors, so you can determine if your app responds correctly to GPS timeouts, for example. See the screenshot below for more details.

    And finally, if you can't wait to get your IDE account, you can play some of the HTML5 games in your Chrome Browser while you are waiting, or browse the accompanying articles for the samples that are included.

    Stewart Christie is the HTML5 and Tizen App Community Manager. Follow him on twitter @intel_stewart

    IDE Emulator

  • Purchasing or Renewing Licenses for Intel® Software Development Products


    Intel offers a variety of licensing options for its software development products. The options for purchasing or renewing licenses for Intel® software are listed below.

    You can also download free 30-day evaluation versions of Intel® software development products. Free evaluation versions of our products are available from the software evaluation center site.

    All prices below are for a commercial license for a single developer. All prices listed are the manufacturer's suggested retail prices and are subject to change without notice. Prices do NOT include VAT or other applicable taxes and fees.

    • For information about floating licenses, node-locked licenses, and other licensing options, contact a reseller or an Intel representative at intel.software.sales@intel.com.
    • To purchase a license for academic, research, and educational institutions, select the product you need; the discounted price is shown at checkout. For more information about all of our academic products, visit the academic offerings center site or contact an Intel representative at academicdevelopersinfo@intel.com.
    • Support renewal: support for one year from the expiration date of the current support agreement.
    • Existing customers are eligible for special upgrade pricing on Intel® Parallel Studio XE 2013, Intel® C++ Studio XE 2013, and Intel® Fortran Studio XE 2013. Learn more about upgrade options.

    Category | Suggested retail price, product (single user) | Suggested retail price, support renewal (single user) | Options

    Product suites

    Intel® Parallel Studio XE 2013 for Windows*
    Includes Intel® Composer XE 2013, Intel® VTune™ Amplifier XE 2013, Intel® Inspector XE 2013, and Intel® Advisor XE 2013
    Product: $2,299 | Support renewal: $799** | Find a reseller › | All options ›
    Intel® Parallel Studio XE 2013 for Linux*
    Includes Intel® Composer XE 2013, Intel® VTune™ Amplifier XE 2013, Intel® Inspector XE 2013, and Intel® Advisor XE 2013
    Product: $2,299 | Support renewal: $799** | Find a reseller › | All options ›
    Intel® C++ Studio XE 2013 for Windows or Linux
    Includes Intel® C++ Composer XE 2013, Intel® VTune™ Amplifier XE 2013, Intel® Inspector XE 2013, and Intel® Advisor XE 2013
    Product: $1,599 | Support renewal: $599** | Find a reseller › | All options ›
    Intel® Visual Fortran Studio XE 2013 for Windows
    Includes Intel® Visual Fortran Composer XE 2013, Intel® VTune™ Amplifier XE 2013, Intel® Inspector XE 2013, and Intel® Advisor XE 2013
    Product: $1,899 | Support renewal: $699** | Find a reseller › | All options ›
    Intel® Fortran Studio XE 2013 for Linux
    Includes Intel® Fortran Composer XE 2013, Intel® VTune™ Amplifier XE 2013, Intel® Inspector XE 2013, and Intel® Advisor XE 2013
    Product: $1,899 | Support renewal: $699** | Find a reseller › | All options ›
    Intel® Parallel Studio
    Includes Intel® Parallel Advisor, Intel® Parallel Amplifier, Intel® Parallel Composer, and Intel® Parallel Inspector
    Product: $320 | Find a reseller › | All options ›
    Intel® Cluster Studio XE 2013 for Windows
    Includes Intel® C++ Composer XE 2013, Intel® Visual Fortran Composer XE 2013, Intel® Trace Analyzer and Collector 8.1, Intel® MPI Library 4.1, Intel® MPI Benchmarks, Intel® Inspector XE 2013, Intel® VTune™ Amplifier XE 2013, and Intel® Advisor XE 2013
    Product: $2,949 | Support renewal: $1,049** | Find a reseller › | All options ›
    Intel® Cluster Studio XE 2013 for Linux
    Includes Intel® C++ Composer XE 2013, Intel® Fortran Composer XE 2013, Intel® Trace Analyzer and Collector, Intel® MPI Library 4.1, Intel® MPI Benchmarks, Intel® Inspector XE 2013, Intel® VTune™ Amplifier XE 2013, and Intel® Advisor XE 2013
    Product: $2,949 | Support renewal: $1,049** | Find a reseller › | All options ›
    Intel® Cluster Studio 2013 for Windows
    Includes Intel® Composer XE 2013, Intel® Trace Analyzer and Collector 8.1, Intel® MPI Library 4.1, and Intel® MPI Benchmarks
    Product: $2,049 | Support renewal: $749** | Find a reseller › | All options ›
    Intel® Cluster Studio 2013 for Linux
    Includes Intel® Composer XE 2013, Intel® Trace Analyzer and Collector 8.1, Intel® MPI Library 4.1, and Intel® MPI Benchmarks
    Product: $2,049 | Support renewal: $749** | Find a reseller › | All options ›

    Compilers and libraries

    Intel® Composer XE 2013 for Windows
    Includes Intel® C++ Composer XE 2013 and Intel® Visual Fortran Composer XE 2013
    Product: $1,199 | Support renewal: $449** | Find a reseller › | All options ›
    Intel® Composer XE 2013 for Linux
    Includes Intel® C++ Composer XE 2013 and Intel® Fortran Composer XE 2013
    Product: $1,449 | Support renewal: $499** | Find a reseller › | All options ›
    Intel® C++ Composer XE 2013 for Windows, Linux, or OS X*
    Includes Intel® C++ Compiler, Intel® Integrated Performance Primitives 7.1, Intel® Math Kernel Library 11.0, and Intel® Parallel Building Blocks
    Product: $699 | Support renewal: $249** | Find a reseller › | All options ›
    Intel® Visual Fortran Composer XE 2013 for Windows
    Includes Intel® Visual Fortran Compiler and Intel® Math Kernel Library 11.0
    Product: $849 | Support renewal: $299** | Find a reseller › | All options ›
    Intel® Fortran Composer XE 2013 for Linux
    Includes Intel® Fortran Compiler and Intel® Math Kernel Library 11.0
    Product: $999 | Support renewal: $349** | Find a reseller › | All options ›
    Intel® Fortran Composer XE 2013 for OS X
    Includes Intel® Fortran Compiler and Intel® Math Kernel Library 11.0
    Product: $849 | Support renewal: $299** | Find a reseller › | All options ›
    Intel® C++ Compiler Professional Edition with support for the QNX Neutrino* real-time OS
    Includes Intel® C++ Compiler and Intel® Integrated Performance Primitives 7.1
    Product: $599 | Support renewal: $240 | All options ›
    Intel® C Compiler for EFI Byte Code
    Product: $995 | Support renewal: $398 | Find a reseller › | All options ›
    Intel® Visual Fortran Composer XE 2013 for Windows with IMSL 6.0*
    Includes Intel® Visual Fortran Compiler, IMSL* Fortran Numerical Library, and Intel® Math Kernel Library 11.0. Includes a license for 1 developer and 1 deployment license intended for the developer. Distributing applications that contain IMSL code to users other than developers requires a deployment license.
    Product: $2,049 | Support renewal: $749** | Find a reseller › | All options ›

    Лицензии IMSL* на выполнение (также называемые лицензиями IMSL* на развертывание)

    Вопросы и ответы о лицензировании IMSL*

    Коммерческая однопользовательская лицензия на выполнение приложений с кодом IMSL на системах, содержащих не более 16 процессорных ядер2049 долл. США685 долл. СШАНайти торгового посредника ›
    Все варианты ›
    Пакет из 10 коммерческих однопользовательских лицензий на выполнение приложений с кодом IMSL на системах, содержащих не более 16 процессорных ядер9709 долл. США1826 долл. СШАНайти торгового посредника ›
    Все варианты ›
    Коммерческая многопользовательская лицензия на выполнение приложений с кодом IMSL на системах, содержащих не более 64 процессорных ядер13 592 долл. США2557 долл. СШАНайти торгового посредника ›
    Все варианты ›

    Инструменты для процессора Intel® Atom™

    Набор инструментов Intel® для разработки ПО для встраиваемых систем на базе процессора Intel® Atom™
    Включает Intel® C++ Compiler, Intel® Application Debugger, Intel® JTAG Debugger, Intel® Integrated Performance Primitives 7.1 и Intel® VTune™ Performance Analyzer
    1999 долл. США799 долл. СШАНайти торгового посредника ›
    Все варианты ›

    Библиотеки для повышения производительности

    Intel® Integrated Performance Primitives 7.1 для Windows или для Linux199 долл. США69 долл. США**Найти торгового посредника ›
    Все варианты ›
    Intel® Math Kernel Library 11.0 для Windows или для Linux499 долл. США179 долл. США**Найти торгового посредника ›
    Все варианты ›
    Intel® Threading Building Blocks 4.1 для Windows, для Linux или для OS X499 долл. США179 долл. США**Найти торгового посредника ›
    Все варианты ›

    Анализаторы производительности приложений

    Intel® VTune™ Amplifier XE 2013 для Windows или для Linux899 долл. США349 долл. США**Найти торгового посредника ›
    Все варианты ›

    Средства проверки работы с памятью и потоками

    Intel® Inspector XE 2013 для Windows или для Linux899 долл. США349 долл. США**Найти торгового посредника ›
    Все варианты ›

    Средства для работы с кластерами

    Intel® Cluster Studio XE 2013 для Windows
    Включает Intel® C++ Composer XE 2013, Intel® Fortran Composer XE 2013, Intel® Trace Analyzer and Collector, Intel® MPI Library, Intel® MPI Benchmarks, Intel® Inspector XE 2013 и Intel® VTune™ Amplifier XE 2013, Intel® Advisor XE 2013
    2949 долл. США1049 долл. США**Найти торгового посредника ›
    Все варианты ›
    Intel® Cluster Studio XE 2013 для Linux
    Включает Intel® C++ Composer XE 2013, Intel® Fortran Composer XE 2013, Intel® Trace Analyzer and Collector, Intel® MPI Library 4.1, Intel® MPI Benchmarks, Intel® Inspector XE 2013 и Intel® VTune™ Amplifier XE 2013, Intel® Advisor XE 2013
    2949 долл. США1049 долл. США**Найти торгового посредника ›
    Все варианты ›
    Intel® Cluster Studio 2013 для Windows
    Включает Intel® Composer XE 2013, Intel® Trace Analyzer and Collector 8.1, Intel® MPI Library 4.1 и Intel® MPI Benchmarks
    2049 долл. США749 долл. США**Найти торгового посредника ›
    Все варианты ›
    Intel® Cluster Studio 2013 для Linux
    Включает Intel® Composer XE 2013, Intel® Trace Analyzer and Collector 8.1, Intel® MPI Library 4.1 и Intel® MPI Benchmarks
    2049 долл. США749 долл. США**Найти торгового посредника ›
    Все варианты ›
    Intel® MPI Library 4.1 для Windows или для Linux499 долл. США179 долл. США**Найти торгового посредника ›
    Все варианты ›

    Средства моделирования и имитации работы систем

    CoFluent Studio*Н/дН/дВсе варианты ›
    CoFluent ReaderН/дН/дВсе варианты ›

    Средства для профилирования и отладки графических приложений

    Средства для профилирования и отладки графический приложений Intel®Н/дН/дВсе варианты ›

    **Вы можете обновить лицензии по минимальной цене только в том случае, если сделаете это до истечения срока предыдущей подписки. Для получения дополнительной информации о продлении лицензий нажмите здесь.

    Корпорация Intel прилагает все необходимые усилия для соблюдения конфиденциальности. Ознакомиться с действующими правилами сбора и обработки личных сведений о заказчиках, данных о серийных номерах продуктов Intel и прочих данных можно в нашем уведомлении о конфиденциальностии в уведомлении о проверке серийных номеров.

    Fun with Intel® Transactional Synchronization Extensions


    By now, many of you have heard of Intel® Transactional Synchronization Extensions (Intel® TSX). If you have not, I encourage you to check out this page (http://www.intel.com/software/tsx) before you read further. In a nutshell, Intel TSX provides transactional memory support in hardware, making life easier for developers who need to write synchronization code for concurrent and parallel applications. It comes in two flavors: Hardware Lock Elision (HLE) and Restricted Transactional Memory (RTM). If you haven’t read the background, go and do so now, since from here on I assume that you have that basic knowledge.

    I have been developing a Pin-based emulator for Intel TSX for the past few years. The emulator is now integrated into the Intel Software Development Emulator. During the development, I had a lot of grins and grimaces with respect to HLE/RTM. I would like to share three particularly memorable incidents.

    The Incidents

    Example 1.

    The following codelet is part of a test program written by a colleague of mine who wanted to learn how to use RTM. The array ‘data’ contains integer values, and the array ‘group’ maps each element of ‘data’ to a slot in the array ‘sums’. The test program tries to store the sum of the data belonging to each group in the corresponding slot of ‘sums’. Since multiple threads may access the same slot simultaneously, each addition is performed in an RTM transaction. When a transaction aborts, the thread re-executes the addition in a critical section along the fallback path (i.e., the ‘else’ branch). Do you think it is correct? If you don’t, can you spot what is wrong?

    #pragma omp parallel for
        for(int i = 0; i < N; i++){
            int mygroup = group[i];
            if(_xbegin()==-1) {
                sums[mygroup] += data[i];
                _xend();
            } else {
                #pragma omp critical
                {
                    sums[mygroup] += data[i];
                }
            }
        }

    Example 2.

    I was taught that code reuse is important when I was in school (sorry, not in kindergarten ;^)). So I decided to put what I had learned to work when a need arose to write an RTM test. The test was similar to the one in Example 1, except that it alternates RTM and HLE transactions. (Notice that the test does not have the non-speculative fallback path required for the RTM transaction. Having no fallback path makes the test UNSAFE because Intel TSX does not guarantee forward progress; i.e., it can abort RTM transactions forever.) The test has two addition statements: one is protected with RTM and the other is protected with HLE. Quite a feat, eh? I felt proud of myself ;-) ... until I started to run the test. The test occasionally printed out incorrect sums. I panicked at first because the test was simple and looked almost identical to other tests, leading me to believe, however briefly, that the emulator had a nasty bug that had gone unnoticed for a long time. But after a closer look, I realized the test had a flaw. Can you see what I did wrong?

        #define PREFIX_XACQUIRE ".byte 0xF2; "
        #define PREFIX_XRELEASE ".byte 0xF3; "

        class mutex_elided {
          uint8_t flag;
          inline bool try_lock_elided() {
            uint8_t value = 1;
            // flag is a single byte, so use the byte-sized exchange
            __asm__ volatile (PREFIX_XACQUIRE "lock; xchgb %0, %1"
                : "=r"(value), "=m"(flag) : "0"(value), "m"(flag) : "memory" );
            return uint8_t(value^1);
          }
        public:
          inline void acquire() {
            for(;;) {
                exponential_backoff backoff;
                while((volatile unsigned char&)flag==1)
                    backoff.pause();
                if(try_lock_elided())
                    return;
                __asm__ volatile ("pause\n" : : : "memory" );
            }
          }

          inline void release() {
            // byte-sized store, matching the width of flag
            __asm__ volatile (PREFIX_XRELEASE "movb $0, %0"
                : "=m"(flag) : "m"(flag) : "memory" );
          }
        };
        ...

        mutex_elided m;
    #pragma omp parallel for
        for(int i = 0; i < N; i++) {
            int mygroup = group[i];
            if( (i&1) ) {
                while(_xbegin()!=-1) ;
                // must have a fallback path
                sums[mygroup] += 1;
                _xend();
            } else {
                m.acquire();
                sums[mygroup] += 1;
                m.release();
            }
        }

    Example 3.

    A colleague of mine tried to use RTM to improve performance of a benchmark. (I changed function names for clarity.) The following fragment of the benchmark permutes an array of IDs by, for each ID, swapping its value with that of a randomly picked partner. In the fallback path, elements i and j are exclusively acquired in the increasing order of their indices, and then written back in the reverse order. He was running it on the emulator and came back to me with an occasional hang problem. Can you come up with a sequence of events that leads to an indefinite wait?

    bool pause( volatile int64_t* l ) {
        __asm__ __volatile__( "pause\n" : : : "memory" );
        return true;
    }
     
    int64_t read_and_lock( volatile int64_t* loc ) {
        int64_t val;
        while(1) {
            while( pause( loc ) )
                if(  empty_val != (val = *loc) )
                        break;
            assert( val!=empty_val );
            if ( __sync_bool_compare_and_swap( loc, val, empty_val ) )
                break;
        }
        assert( val!=0 );
        return val;
    }
     
    void write_and_release( volatile int64_t* loc, int64_t val ) {
        while( pause( loc ) )
            if( __sync_bool_compare_and_swap( loc, empty_val, val ) )
                break;
        return;
    }
     
    ...
    #pragma omp parallel for num_threads(16)
        for (int i=0; i<n; i++) {
            int j = (int64_t) ( n * gen_rand() );
     
            if( _xbegin()==-1 ) {
                if(i != j) {
                    const vid_t tmp_val = vid_values[i];
                    vid_values[i] = vid_values[j];
                    vid_values[j] = tmp_val;
                }
                _xend();
            } else {
                if (i < j) {
                    const vid_t tmp_val_i = read_and_lock( &vid_values[i] );
                    const vid_t tmp_val_j = read_and_lock( &vid_values[j] );
                    write_and_release( &vid_values[j], tmp_val_i );
                    write_and_release( &vid_values[i], tmp_val_j );
                } else if (j < i) {
                    const vid_t tmp_val_j = read_and_lock( &vid_values[j] );
                    const vid_t tmp_val_i = read_and_lock( &vid_values[i] );
                    write_and_release( &vid_values[i], tmp_val_j );
                    write_and_release( &vid_values[j], tmp_val_i );
                }
            }
        }

    Analysis

    Example 1.

    The fallback path has a race with the code in the RTM path. For example, the following interleaving may happen. (Always keep in mind that one should not make any assumption about the relative speeds of threads!)

    Thread 1                   Thread 2
    -------------------------  ------------------------------------------
    start critical section
    read sums[mygroup]
                               do transaction that updates sums[mygroup]
    write sums[mygroup]
     

    As a result, the example occasionally loses the increment done in the RTM transaction.

    Example 2.

    Don’t let the HLE transaction fool you. When an HLE transaction gets aborted, it acquires the same mutex non-speculatively. When this happens, the case effectively becomes identical to Example 1.

    Example 3.

    Again, one should not make any assumption about the relative speeds of concurrently executing threads. Even though the fallback path is race free on its own, it has a race with the code in the RTM path. For example, the following sequence of events may occur.

    Thread 1                              Thread 2
    ------------------------------------  ------------------------------------------
    read_and_lock( vid_values[i] )
                                          do transaction that swaps vid_values[i]
                                          and vid_values[k] and makes vid_values[i]
                                          non-zero
    read_and_lock( vid_values[j] )
    write_and_release( vid_values[j] )
    Wait for vid_values[i] to become 0
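
    The sequence can be replayed deterministically in a single thread to see why the last step never completes. The following is only an illustrative sketch: the three-element array, the indices i=0, j=1, k=2, the starting values, and the helper names `try_write_and_release` and `replay_example3` are made up, and `empty_val` is assumed to be 0 as the analysis implies.

```cpp
#include <cassert>
#include <cstdint>

const int64_t empty_val = 0;   // assumption: 0 marks a locked (empty) slot

// One attempt of write_and_release: succeeds only if the slot is still empty.
bool try_write_and_release(volatile int64_t* loc, int64_t val) {
    return __sync_bool_compare_and_swap(loc, empty_val, val);
}

// Replays the interleaving on a made-up array with i=0, j=1, k=2.
// Returns whether Thread 1's final write to slot i succeeds.
bool replay_example3() {
    volatile int64_t v[3] = {10, 20, 30};

    // Thread 1: read_and_lock(&v[0]) keeps 10 and marks slot 0 empty
    int64_t val_i = v[0];
    bool locked_i = __sync_bool_compare_and_swap(&v[0], val_i, empty_val);
    assert(locked_i);

    // Thread 2: the transaction swaps v[0] and v[2] without checking for empty_val
    int64_t tmp = v[0];
    v[0] = v[2];               // slot 0 is now 30, i.e. non-zero
    v[2] = tmp;

    // Thread 1: read_and_lock(&v[1]) keeps 20 and marks slot 1 empty
    int64_t val_j = v[1];
    bool locked_j = __sync_bool_compare_and_swap(&v[1], val_j, empty_val);
    assert(locked_j);

    // Thread 1: write_and_release(&v[1], val_i) succeeds
    bool wrote_j = try_write_and_release(&v[1], val_i);
    assert(wrote_j);

    // Thread 1: write_and_release(&v[0], val_j) fails on every attempt,
    // because slot 0 no longer holds empty_val.
    return try_write_and_release(&v[0], val_j);
}
```

    In this replay the final call returns false, which in the real code means the spin loop in write_and_release keeps retrying for as long as slot i stays non-zero.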
     

    Possible Fixes

    Now that we have a concrete diagnosis for each of the examples, the fixes are straightforward.

    Example 1.

    Replacing ‘omp critical’ with an atomic increment such as __sync_add_and_fetch would fix the problem. I.e.,

        __sync_add_and_fetch( &sums[mygroup], data[i] );
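
    As a rough, self-contained sketch of this fix, the loop below replaces both the transaction and the critical section with one atomic add per element. The helper name `sum_by_group` and its vector-based signature are invented for illustration; `__sync_add_and_fetch` is the GCC/Clang builtin named above. If the code is built without OpenMP support, the pragma is ignored and the loop simply runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: accumulate data[i] into sums[group[i]].
// Every update is a single atomic read-modify-write, so there is no
// separate speculative path that could race with a fallback path.
std::vector<int> sum_by_group(const std::vector<int>& data,
                              const std::vector<int>& group,
                              int num_groups) {
    assert(data.size() == group.size());
    std::vector<int> sums(num_groups, 0);
    const int n = static_cast<int>(data.size());
    int* out = sums.data();
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        __sync_add_and_fetch(&out[group[i]], data[i]);
    }
    return sums;
}
```

    For data {1,2,3,4,5,6} with groups {0,1,0,1,0,1}, the result is {9, 12} regardless of how the iterations are distributed over threads.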

    A more general solution is to use a mutex in the fallback path and add it to the read set of the RTM transaction, which forces the transaction to abort if the mutex is acquired by another thread.

    mutex fallback_mutex;
     
    ...
    #pragma omp parallel for num_threads(8)
        for(int i = 0; i < N; i++){
            int mygroup = group[i];
            if(_xbegin()==-1) {
                if( !fallback_mutex.is_acquired() ) {
                    sums[mygroup] += data[i];
                } else {
                    _xabort(1);
                }
                _xend();
            } else {
                fallback_mutex.acquire();
                sums[mygroup] += data[i];
                fallback_mutex.release();
            }
        }
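
    The `mutex` type used above is not shown in the post. One possible shape for it, offered only as a sketch and not as the author's actual class, is a small spin lock built on `std::atomic<bool>`. The name `spin_mutex` and the extra `try_acquire()` method are assumptions made for this illustration.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical spin mutex offering the calls used in the fix
// (acquire, release, is_acquired).
class spin_mutex {
    std::atomic<bool> locked{false};
public:
    // A plain load of the lock word. Executed inside an RTM transaction,
    // this read places the lock word in the transaction's read set.
    bool is_acquired() const {
        return locked.load(std::memory_order_acquire);
    }

    // Returns true if the lock was free and is now held by the caller.
    bool try_acquire() {
        return !locked.exchange(true, std::memory_order_acquire);
    }

    void acquire() {
        while (!try_acquire()) {
            // spin on reads until the lock looks free, then retry
            while (is_acquired()) { }
        }
    }

    void release() {
        assert(is_acquired());
        locked.store(false, std::memory_order_release);
    }
};
```

    The point of the design is that acquire() writes the same location that is_acquired() reads, so a thread entering the fallback path conflicts with every transaction that has already checked the lock.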

    Example 2.

    Similarly, we may extend mutex_elided to have the is_acquired() method. Since the lock variable is read inside the RTM transaction, any non-speculative execution of the HLE path which makes the change to the lock variable visible will abort the transaction.

        mutex_elided m;
    #pragma omp parallel for num_threads(8)
        for(int i = 0; i < N; i++) {
            int mygroup = group[i];
            if( (i&1) ) {
                while(_xbegin()!=-1) // having no fallback path is
                    ;                // UNSAFE
                if( !m.is_acquired() )
                    sums[mygroup] += data[i];
                else
                    _xabort(0);
                _xend();
            } else {
                m.acquire();
                sums[mygroup] += data[i];
                m.release();
            }
        }

    Example 3.

    We can also apply the mutex-based approach to this example. Another approach is to read the two ID values in the RTM transaction and check whether either of them contains ‘empty_val’ (0 in the code below). If so, we abort the transaction and force the thread to follow the fallback path.

    #pragma omp parallel for num_threads(16)
        for (int i=0; i<n; i++) {
            int j = (int64_t) ( n * gen_rand() );
            if( _xbegin()==-1 ) {
                if(i != j) {
                    const vid_t tmp_val_i = vid_values[i];
                    const vid_t tmp_val_j = vid_values[j];
                    if( tmp_val_i==0 || tmp_val_j==0 )
                        _xabort(0);
                    vid_values[i] = tmp_val_j;
                    vid_values[j] = tmp_val_i;
                }
                _xend();
            } else {
                if (i < j) {
                    const vid_t tmp_val_i = read_and_lock( &vid_values[i] );
                    const vid_t tmp_val_j = read_and_lock( &vid_values[j] );
                    write_and_release( &vid_values[j], tmp_val_i );
                    write_and_release( &vid_values[i], tmp_val_j );
                } else if (j < i) {
                    const vid_t tmp_val_j = read_and_lock( &vid_values[j] );
                    const vid_t tmp_val_i = read_and_lock( &vid_values[i] );
                    write_and_release( &vid_values[i], tmp_val_j );
                    write_and_release( &vid_values[j], tmp_val_i );
                }
            }
        }

    Conclusions

    So, what have I learned from these examples? As you may have already noticed, all of these are related to the ‘restricted’ part of RTM. Intel TSX has great potential for improving the performance of concurrent and parallel applications. But the synchronization between the speculative code inside the RTM transaction and the non-speculative fallback path needs to be carefully managed, since the interactions are subtle. I gather most programmers won’t need to worry too much about it because higher-level abstractions in supporting libraries should hide most of the agonizing synchronization details. But for those who are willing to get their hands dirty to squeeze out the last drop of performance, it always pays to keep a watchful eye on the interactions between an RTM code path and its non-speculative fallback. (And we have many tools such as Intel SDE to assist you.)

    Disclaimer: The opinion expressed in the blog is the author's own and reflects none of his employer's or his colleagues'.


    Licenses for runtime libraries for SDE on Linux


    Read more about GNU General Public License, version 2


    For libstdc++:
    
    // Copyright (C) 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006
    // Free Software Foundation, Inc.
    //
    // This file is part of the GNU ISO C++ Library.  This library is free
    // software; you can redistribute it and/or modify it under the
    // terms of the GNU General Public License as published by the
    // Free Software Foundation; either version 2, or (at your option)
    // any later version.
    
    // This library is distributed in the hope that it will be useful,
    // but WITHOUT ANY WARRANTY; without even the implied warranty of
    // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    // GNU General Public License for more details.
    
    // You should have received a copy of the GNU General Public License along
    // with this library; see the file COPYING.  If not, write to the Free
    // Software Foundation, 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301,
    // USA.
    
    // As a special exception, you may use this file as part of a free software
    // library without restriction.  Specifically, if other files instantiate
    // templates or use macros or inline functions from this file, or you compile
    // this file and link it with other files to produce an executable, this
    // file does not by itself cause the resulting executable to be covered by
    // the GNU General Public License.  This exception does not however
    // invalidate any other reasons why the executable file might be covered by
    // the GNU General Public License.
    
    And for libgcc_s:
    
    /* Copyright (C) 1989, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999,
       2000, 2001, 2002, 2003, 2004, 2005  Free Software Foundation, Inc.
    
    This file is part of GCC.
    
    GCC is free software; you can redistribute it and/or modify it under
    the terms of the GNU General Public License as published by the Free
    Software Foundation; either version 2, or (at your option) any later
    version.
    
    In addition to the permissions in the GNU General Public License, the
    Free Software Foundation gives you unlimited permission to link the
    compiled version of this file into combinations with other programs,
    and to distribute those combinations without any restriction coming
    from the use of this file.  (The General Public License restrictions
    do apply in other respects; for example, they cover modification of
    the file, and distribution when not linked into a combine
    executable.)
    
    GCC is distributed in the hope that it will be useful, but WITHOUT ANY
    WARRANTY; without even the implied warranty of MERCHANTABILITY or
    FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
    for more details.
    
    You should have received a copy of the GNU General Public License
    along with GCC; see the file COPYING.  If not, write to the Free
    Software Foundation, 51 Franklin Street, Fifth Floor, Boston, MA
    02110-1301, USA.  */

    Recent Intel® AVX Architectural Changes


    Dear Intel® AVX developers,

    We recently made some significant changes to the Intel® Advanced Vector Extensions Programmer’s Reference Manual (please download the latest version at /sites/avx/). If you are writing tools or software based on AVX, this may impact you. The big changes are a very different FMA syntax and the removal of two instructions (4-operand permutes).

    Since the initial Intel® AVX spec was released in April 2008, I wanted to recap the changes we’ve made in the previous two Programmer’s Reference Manual Updates.

    Added: VEX forms of AES instructions
    The AES instructions we plan to release on our upcoming 32-nm cores (codename “Westmere”) will be extended by the VEX prefix in the following cores (codename “Sandy Bridge”). The VEX prefix brings a distinct destination register to 4 of the 5 AES instructions (VAESDEC, VAESDECLAST, VAESENC, and VAESENCLAST; AESIMC already had a distinct destination). Only the 128-bit forms of the instructions are supported; the instructions are not promoted to 256-bit. As with the AES instruction set, VAES instructions may not be enabled by all products in all geographies. For the VAES instructions we have therefore created a unique way of detecting their presence in hardware, by requiring you to check that two CPUID flags are set (CPUID.AES AND CPUID.AVX).

    Added: 256-bit forms of streaming stores
    Astute readers noticed that the streaming store instructions MOVNTDQ, MOVNTPS, and MOVNTPD had originally been supported only in 128-bit forms. We now have them in 256-bit forms in our Sandy Bridge cores. It’s not clear that they will be any faster than the 128-bit forms on Sandy Bridge, but we encourage their use for future performance. Note that Streaming Load (VMOVNTDQA) is still 128-bit only; yes, that’s intentional.

    Removed: VPERMIL2PS and VPERMIL2PD
    All PERMIL2 instructions are gone – both the 128-bit and 256-bit flavors. Like the FMA below, they used the VEX.W bit to select which source was from memory – we’re not moving in the direction of using VEX.W for that purpose any more.

    Changed: All FMA instructions
    We previously defined 4-operand FMAs with 3 sources and a separate destination. We now have 3 operands and still 3 sources, so one of the sources gets destroyed (this makes the FMA instructions unique in AVX). For each of the old forms, we now have 3 new instructions, using 132, 213, and 231 designations. The VEX.W bit no longer selects the source that comes from memory; instead, it selects the floating-point type (single or double precision). Finally, the scalar instructions preserve the upper bits of the destination (up to bit 127) instead of zeroing them. These instructions are (still) not in Sandy Bridge; we are planning to ship them in a subsequent processor.

    Example: we previously had

    VFMADDSS xmm1, xmm2, xmm3, m32, which was xmm1 = xmm2*xmm3 + m32


    (and the upper bits of the XMM register - from bit 32 to 127 - were zeroed).

    NOW we have three forms:

    VFMADD132SS xmm1, xmm2, m32, which is xmm1 = xmm1*m32 + xmm2
    VFMADD213SS xmm1, xmm2, m32, which is xmm1 = xmm2*xmm1 + m32
    VFMADD231SS xmm1, xmm2, m32, which is xmm1 = xmm2*m32 + xmm1


    (and note that now the upper bits of the XMM register – from bit 32 to 127 - are unchanged).

    The numbers in the instruction mnemonics come from the order of the operands in the expression, so VFMADD132 is 1*3+2, VFMADD213 is 2*1+3, and VFMADD231 is 2*3+1.
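
    To make the operand ordering concrete, here is a small scalar model of the three forms. It is only a sketch: the function names `fmadd132`, `fmadd213`, and `fmadd231` are invented for this illustration, the arguments are the three operands in instruction order (op1 is the destination), and `std::fma` stands in for the fused multiply-add. The upper-bit behavior of the XMM register is not modeled.

```cpp
#include <cassert>
#include <cmath>

// Each function returns the new value of operand 1 (the destination).
float fmadd132(float op1, float op2, float op3) {
    return std::fma(op1, op3, op2);   // 1*3+2
}

float fmadd213(float op1, float op2, float op3) {
    return std::fma(op2, op1, op3);   // 2*1+3
}

float fmadd231(float op1, float op2, float op3) {
    return std::fma(op2, op3, op1);   // 2*3+1
}
```

    With op1 = 2, op2 = 3, and op3 = 5, the three forms produce 13, 11, and 17 respectively, so the choice of form decides which source value is overwritten and which one plays the role of the addend.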

    The three forms allow you to avoid having to do extra copies or loads – most of the time. The primary exception is in code where you really did need to re-use all the sources, as in butterflies:

    y0 = x0*c0 + x1
    y1 = x0*c0 - x1

    which generally incurs one copy for every 2 FMAs. So far, this doesn’t appear to be much of a performance hit.

    Added: VEX forms of the PCLMULQDQ instruction
    The PCLMULQDQ instruction we plan to release on our upcoming 32-nm cores (codename “Westmere”) will be extended by the VEX prefix in our subsequent cores (codename “Sandy Bridge”). The VEX form brings a distinct destination register. Only the 128-bit form of the instruction is supported; the instruction is not promoted to 256-bit. As with the VAES instructions, VPCLMULQDQ will require the careful programmer to check that two CPUID flags are set (in this case CPUID.PCLMULQDQ AND CPUID.AVX).
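
    The two-flag checks described for VAES and VPCLMULQDQ can be sketched as simple bit tests on the ECX value returned by CPUID leaf 1, where bit 1 reports PCLMULQDQ, bit 25 reports AES, and bit 28 reports AVX. The function and constant names below are invented for this illustration. The sketch covers only the two CPUID flags named in the text; confirming that the operating system has enabled the AVX register state is a separate step that is not shown here.

```cpp
#include <cassert>
#include <cstdint>

// Feature bits in ECX as returned by CPUID leaf 1.
const uint32_t kPclmulqdqBit = 1u << 1;
const uint32_t kAesBit       = 1u << 25;
const uint32_t kAvxBit       = 1u << 28;

// VEX-encoded AES requires both CPUID.AES and CPUID.AVX.
bool has_vex_aes(uint32_t leaf1_ecx) {
    return (leaf1_ecx & kAesBit) != 0 && (leaf1_ecx & kAvxBit) != 0;
}

// VEX-encoded PCLMULQDQ requires both CPUID.PCLMULQDQ and CPUID.AVX.
bool has_vex_pclmulqdq(uint32_t leaf1_ecx) {
    return (leaf1_ecx & kPclmulqdqBit) != 0 && (leaf1_ecx & kAvxBit) != 0;
}
```

    With GCC or Clang, the ECX value can be obtained through `__get_cpuid(1, &eax, &ebx, &ecx, &edx)` from `<cpuid.h>`. Keeping the bit tests in pure functions makes them easy to exercise with synthetic values.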

    A number of miscellanea
    We clarified the alignment-check exception (#AC) behavior for MASKMOV instructions (by the way, does anyone actually use #AC?) .
    We clarified the exception type for the packed shift instructions (PSLL, PSRL, PSRA).
    The Encoding rule table (4-3) is clarified to reflect PEXTRW.

    We hope these changes are not too disruptive to you and thank you for your support of our ongoing early disclosure policy. We have updated the Intel® Software Development Emulator with all of these changes. If you have any questions or concerns about the impact of these changes to your application, or would like more detail on any of these changes, I encourage you to start a thread at /en-us/forums/intel-avx-and-cpu-instructions, or contact me directly.

    Regards,
    Mark Buxton
    Software Engineer
    Intel Corporation
    Mark.J.Buxton@intel.com


    AVX debugging, or how is it actually done?


    AVX is defined, finalized, and already on its way to us. A lot has been said about the various sides of development: compilation, emulation, documentation, and even profiling (I highly recommend taking a look at /en-us/avx/ ), but there has been rather little information about debugging.

    To be honest, though, it was all there already. But today it has become even more convenient, and easier to visualize, to debug the movement of bits across the 256-bit field of the AVX registers.

    In short, I recommend getting better acquainted with SDE (/en-us/articles/intel-software-development-emulator ).

    The emulator not only handles the whole instruction set well and quietly, it can also show exactly what happened.

    To begin with, I want to draw your attention to the additional help argument -thelp, which expands into a fairly long list of arguments. Among them you can find the so-called Debugtrace knobs, of which -debugtrace and -dt_start_int3 deserve special mention.

    Using them lets us create a report file, debugtrace.out (the default name), which clearly shows the instructions and, most importantly, their operands with the values used.

    For example, I get:

    TID0: INS 0x00401f4d                     vrcpss xmm7, xmm5, xmm5
    TID0:      XMM7 := 00000000_00000000_00000000_3ba57800
    XMM7 (doubles) := 0 4.94411e-315
    XMM7 (floats)  := 0 0 0 0.00504971

    TID0: INS 0x00401f51                     vsubss xmm5, xmm1, xmm0
    TID0:      XMM5 := 00000000_00000000_00000000_43460000
    XMM5 (doubles) := 0 5.57633e-315
    XMM5 (floats)  := 0 0 0 198

    TID0: INS 0x00401f55                     vmulss xmm5, xmm5, xmm7
    TID0:      XMM5 := 00000000_00000000_00000000_3f7ff5a0
    XMM5 (doubles) := 0 5.26353e-315
    XMM5 (floats)  := 0 0 0 0.999842

    Here you can clearly see that vmulss (scalar floating-point multiplication) receives 0.00504971 (XMM7) and 198 (XMM5) as operands. The result is left in XMM5 (0.999842), which according to my calculator is correct.
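    The same check can be done without a calculator by reinterpreting the hex patterns from the trace as floats. The sketch below is only an illustration: the helper names are made up, it assumes a 32-bit IEEE-754 float, and it reads the rightmost 32-bit group of each dump as the low lane that the scalar instructions work on.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterpret a 32-bit pattern from the register dump as a single-precision value.
float float_from_bits(std::uint32_t bits) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
    float value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}

// Hypothetical helper: does a * b, computed in single precision,
// reproduce the bit pattern that the trace reports as the result?
bool vmulss_matches(std::uint32_t a_bits, std::uint32_t b_bits,
                    std::uint32_t result_bits) {
    const float product = float_from_bits(a_bits) * float_from_bits(b_bits);
    return product == float_from_bits(result_bits);
}

// Usage example with the three low-lane values from the trace above.
inline void check_trace_example() {
    assert(float_from_bits(0x43460000u) == 198.0f);                 // XMM5 before vmulss
    assert(vmulss_matches(0x3ba57800u, 0x43460000u, 0x3f7ff5a0u));  // XMM7 * XMM5
}
```

    Comparing bit patterns rather than printed decimals avoids the rounding that the six-digit output in the log introduces.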

    The structure of debugtrace.out is actually quite simple, and almost immediately, or at least at second glance, you can see the latest values of the registers or memory in use :).

    For extra convenience I also suggest looking at dt_start_int3, which lets you “surround” the interesting code for a more detailed analysis from within SDE.

    I think there are no problems left now, or are there?

  • Open Source
  • Intel® Software Development Emulator
  • Intel® Advanced Vector Extensions

    Intel® Software Development Emulator Download

    Intel® Software Development Emulator (released July 29, 2014)

    Intel® Software Development Emulator (released July 20, 2014)

    Intel® Software Development Emulator (released March 06, 2014)

    Previous versions of the Intel® Software Development Emulator

    Back to the Intel® Software Development Emulator page.

  • Developers
  • WhatIf Software
  • Intel® Software Development Emulator
  • Intel® Advanced Vector Extensions
  • Intel® Memory Protection Extensions
  • Intel® Secure Hash Algorithm Extensions
  • Intel® Streaming SIMD Extensions
  • Intel® Transactional Synchronization Extensions
  • Parallel Computing

    Buy or Renew Intel® Software Development Products

    Intel offers several licensing options for our software development products. Review the choices below to buy or renew Intel® software. You may use the Product Support Renewal/Upgrade Options page to determine renewal and upgrade options for your product(s).

    30-day evaluation versions of Intel® Software Development Products are also available for free download. Visit our Software Evaluation Center to download free evaluation versions of the products.

    All prices listed below are for named-user licenses. All prices are Manufacturer Suggested List Prices (MSRP) and subject to change without notice. Prices do NOT include Value Added Taxes (VAT) or any other state or local taxes or charges.

    • For floating licenses, node-locked licenses, or other licensing options, contact a reseller, or contact an Intel representative at intel.software.sales@intel.com.
    • To purchase an academic research license, please select your desired product and the discounted price will be displayed during check out. For additional information on all of our education offerings, visit our Education Offerings Center, or contact an Intel representative at academicdevelopersinfo@intel.com.
    • Support Renewal extends your support for one year from the expiration date of your current support agreement.
    • Existing customers can take advantage of special upgrade prices for Intel® Parallel Studio XE, Intel® C++ Studio XE or Intel® Fortran Studio XE. See details of upgrade offer.

    Category Name | Product MSRP (Named-User) | Support Renewal MSRP (Named-User)

    Product Suites

    Intel® Parallel Studio XE for Windows*
    Includes Intel® Composer XE, Intel® VTune™ Amplifier XE, Intel® Inspector XE, Intel® Advisor XE
    Product MSRP: $2,299 | Support Renewal MSRP: $799**

    Intel® Parallel Studio XE for Linux*
    Includes Intel® Composer XE, Intel® VTune™ Amplifier XE, Intel® Inspector XE, Intel® Advisor XE
    Product MSRP: $2,299 | Support Renewal MSRP: $799**

    Intel® C++ Studio XE for Windows or Linux
    Includes Intel® C++ Composer XE, Intel® VTune™ Amplifier XE, Intel® Inspector XE, Intel® Advisor XE
    Product MSRP: $1,599 | Support Renewal MSRP: $599**

    Intel® Visual Fortran Studio XE for Windows
    Includes Intel® Visual Fortran Composer XE, Intel® VTune™ Amplifier XE, Intel® Inspector XE, Intel® Advisor XE
    Product MSRP: $1,899 | Support Renewal MSRP: $699**

    Intel® Fortran Studio XE for Linux
    Includes Intel® Fortran Composer XE, Intel® VTune™ Amplifier XE, Intel® Inspector XE, Intel® Advisor XE
    Product MSRP: $1,899 | Support Renewal MSRP: $699**

    Intel® Cluster Studio XE for Windows
    Includes Intel® C++ Composer XE, Intel® Visual Fortran Composer XE, Intel® Trace Analyzer and Collector, Intel® MPI Library, Intel® MPI Benchmarks, Intel® Inspector XE and Intel® VTune™ Amplifier XE, Intel® Advisor XE
    Product MSRP: $2,949 | Support Renewal MSRP: $1,049**

    Intel® Cluster Studio XE for Linux
    Includes Intel® C++ Composer XE, Intel® Fortran Composer XE, Intel® Trace Analyzer and Collector, Intel® MPI Library, Intel® MPI Benchmarks, Intel® Inspector XE and Intel® VTune™ Amplifier XE, Intel® Advisor XE
    Product MSRP: $2,949 | Support Renewal MSRP: $1,049**

    Intel® Cluster Studio for Windows
    Includes Intel® Composer XE, Intel® Trace Analyzer and Collector, Intel® MPI Library, Intel® MPI Benchmarks
    Product MSRP: $2,049 | Support Renewal MSRP: $749**

    Intel® Cluster Studio for Linux
    Includes Intel® Composer XE, Intel® Trace Analyzer and Collector, Intel® MPI Library, Intel® MPI Benchmarks
    Product MSRP: $2,049 | Support Renewal MSRP: $749**

    Intel® System Studio including JTAG Debugger
    Includes: Intel® C++ Compiler for Embedded OS Linux*, Intel® C++ Compiler for Android*, Intel® Math Kernel Library, Intel® Integrated Performance Primitives, Intel® VTune™ Amplifier for Systems, Intel® Energy Profiler, Intel® Inspector for Systems, Intel® GPA System Analyzers, Intel® JTAG Debugger, SVEN SDK, GDB - The GNU* Project Debugger
    Product MSRP: $2,399 | Support Renewal MSRP: $849

    Intel® System Studio
    Includes: Intel® C++ Compiler for Embedded OS Linux*, Intel® C++ Compiler for Android*, Intel® Math Kernel Library, Intel® Integrated Performance Primitives, Intel® VTune™ Amplifier for Systems, Intel® Energy Profiler, Intel® Inspector for Systems, Intel® GPA System Analyzer, SVEN SDK, GDB - The GNU* Project Debugger
    Product MSRP: $1,649 | Support Renewal MSRP: $599

    Compilers and Libraries

    Intel® Composer XE for Windows
    Includes Intel® C++ Composer XE, Intel® Visual Fortran Composer XE
    Product MSRP: $1,199 | Support Renewal MSRP: $449**

    Intel® Composer XE for Linux
    Includes Intel® C++ Composer XE, Intel® Fortran Composer XE
    Product MSRP: $1,449 | Support Renewal MSRP: $499**

    Intel® C++ Composer XE for Windows, Linux, or OS X*
    Includes Intel® C++ Compiler, Intel® Integrated Performance Primitives, Intel® Math Kernel Library, Intel® Parallel Building Blocks
    Product MSRP: $699 | Support Renewal MSRP: $249**

    Intel® Visual Fortran Composer XE for Windows
    Includes Intel® Visual Fortran Compiler, Intel® Math Kernel Library
    Product MSRP: $849 | Support Renewal MSRP: $299**

    Intel® Fortran Composer XE for Linux
    Includes Intel® Fortran Compiler, Intel® Math Kernel Library
    Product MSRP: $999 | Support Renewal MSRP: $349**

    Intel® Fortran Composer XE for OS X
    Includes Intel® Fortran Compiler, Intel® Math Kernel Library
    Product MSRP: $849 | Support Renewal MSRP: $299**

    Intel® C++ Compiler for Android*
    Product MSRP: $79.95 | Support Renewal MSRP: N/A

    Intel® C++ Compiler Professional Edition for QNX Neutrino* RTOS Support
    Includes Intel® C++ Compiler, Intel® Integrated Performance Primitives
    Product MSRP: $599 | Support Renewal MSRP: $240

    Intel® C Compiler for EFI Byte Code
    Product MSRP: $995 | Support Renewal MSRP: $398

    Intel® Visual Fortran Composer XE with Rogue Wave* IMSL* Fortran Numerical Library 7.0 for Windows
    Includes Intel® Visual Fortran Compiler, IMSL* Fortran Numerical Library, Intel® Math Kernel Library. Includes 1 developer and 1 deployment license for the developer.
    Product MSRP: $1,749 | Support Renewal MSRP: $649**

    Embedded and Mobile System Development

    Intel® System Studio including JTAG Debugger
    Includes: Intel® C++ Compiler for Embedded OS Linux*, Intel® C++ Compiler for Android*, Intel® Math Kernel Library, Intel® Integrated Performance Primitives, Intel® VTune™ Amplifier for Systems, Intel® Energy Profiler, Intel® Inspector for Systems, Intel® GPA System Analyzer, Intel® JTAG Debugger, SVEN SDK, GDB - The GNU* Project Debugger
    Product MSRP: $2,399 | Support Renewal MSRP: $849

    Intel® System Studio
    Includes: Intel® C++ Compiler for Embedded OS Linux*, Intel® C++ Compiler for Android*, Intel® Math Kernel Library, Intel® Integrated Performance Primitives, Intel® VTune™ Amplifier for Systems, Intel® Energy Profiler, Intel® Inspector for Systems, Intel® GPA System Analyzer, SVEN SDK, GDB - The GNU* Project Debugger
    Product MSRP: $1,649 | Support Renewal MSRP: $599

    Performance Libraries

    Intel® Integrated Performance Primitives for Windows, Linux, or OS X
    Product MSRP: $199 | Support Renewal MSRP: $69**

    Intel® Math Kernel Library for Windows or Linux
    Product MSRP: $499 | Support Renewal MSRP: $179**

    Intel® Threading Building Blocks for Windows, Linux, or OS X
    Product MSRP: $499 | Support Renewal MSRP: $179**

    Rogue Wave* IMSL* Fortran Numerical Library 7.0 for Windows
    Product MSRP: $999 | Support Renewal MSRP: $499

    Performance Profilers

    Intel® VTune™ Amplifier XE for Windows or Linux
    Product MSRP: $899 | Support Renewal MSRP: $349**

    Thread and Memory Checkers

    Intel® Inspector XE for Windows or Linux
    Product MSRP: $899 | Support Renewal MSRP: $349**

    Cluster Tools

    Intel® Cluster Studio XE for Windows
    Includes Intel® C++ Composer XE, Intel® Fortran Composer XE, Intel® Trace Analyzer and Collector, Intel® MPI Library, Intel® MPI Benchmarks, Intel® Inspector XE and Intel® VTune™ Amplifier XE, Intel® Advisor XE
    Product MSRP: $2,949 | Support Renewal MSRP: $1,049**

    Intel® Cluster Studio XE for Linux
    Includes Intel® C++ Composer XE, Intel® Fortran Composer XE, Intel® Trace Analyzer and Collector, Intel® MPI Library, Intel® MPI Benchmarks, Intel® Inspector XE and Intel® VTune™ Amplifier XE, Intel® Advisor XE
    Product MSRP: $2,949 | Support Renewal MSRP: $1,049**

    Intel® Cluster Studio for Windows
    Includes Intel® Composer XE, Intel® Trace Analyzer and Collector, Intel® MPI Library, Intel® MPI Benchmarks
    Product MSRP: $2,049 | Support Renewal MSRP: $749**

    Intel® Cluster Studio for Linux
    Includes Intel® Composer XE, Intel® Trace Analyzer and Collector, Intel® MPI Library, Intel® MPI Benchmarks
    Product MSRP: $2,049 | Support Renewal MSRP: $749**

    Intel® MPI Library for Windows or Linux
    Product MSRP: $499 | Support Renewal MSRP: $179**

    System Modeling and Simulation Tools

    CoFluent Studio*
    Product MSRP: N/A | Support Renewal MSRP: N/A

    CoFluent Reader
    Product MSRP: N/A | Support Renewal MSRP: N/A

    Visual Computing Tools

    Intel® Graphics Performance Analyzers
    Product MSRP: N/A | Support Renewal MSRP: N/A

    Intel® Media SDK for Servers
    Product MSRP: $499 | Support Renewal MSRP: N/A

    **Lowest Price available if you renew prior to current subscription expiration. For more information on renewals click here.

    Intel takes your privacy seriously. Refer to Intel's Privacy Notice and Serial Number Validation Notice regarding the collection and handling of your personal information, the Intel product’s serial number and other information.

    Intel® Software Development Emulator Release Notes

    2014-07-29 version 7.2.0
    • Added -p4 (Pentium4) and -p4p (Pentium4-Prescott) knobs to SDE.
    • Updated CPUID definition files.
    2014-07-20 version 7.1.0
    • Added support for additional Intel® AVX-512 instructions.
    • Support for debugging integration with Microsoft Visual Studio 2012 and Visual Studio 2013.
    • New controller implementation.
    • Improved TSX statistics.
    2014-03-06 version 6.22.0
    • Exclude PAUSE from chip-check because it is a NOP on older CPUs and on quark. 
    2014-02-13 version 6.20.0
    • Added support for XSAVEC and CLFLUSHOPT.
    • Disabled TSX CPUID bits when TSX emulation is not requested.
    • Improved disassembly for MPX instructions.
    • Added an option for running chip-check only on the main executable.
    • Added support for -quark (Pentium ISA).
    • Added application debugging for Mac OS X with the lldb debugger.
    2013-11-16 version 6.12.0
    • Added support for Mac OS X version 10.9.
    • Improved the TSX statistics information.
    • Various fixes with the emulation of floating-point instructions of Intel AVX-512.
    • Enabled the alignment checker tool by default for instructions that require alignment.
    • Fixed mismatch between mix and dynamic mask profiler.
    • Updated the Intel MPX runtime libraries for Windows. 
    • Performance improvements when modeling a CPU prior to AVX-512.
    2013-09-21 version 6.7.0
    • Debugging with GDB is now supported with Intel® AVX-512. Download the new GDB from here.
    • Emulation of  Intel® AVX2 FMA and  Intel AVX-512 FMA uses native FMA instructions when running on Haswell host.
    • Various fixes with the emulation of floating-point and conversion instructions of Intel AVX-512.
    • Disassembly of control transfer instructions displays the 'bnd' prefix when used with Intel® MPX.
    • Updated the XED ISA set names for Intel AVX-512. This is visible in 'mix' statistics output.
    • This release goes with 2013-08-29 version of the Intel MPX runtime.
    2013-07-22 version 6.1.0
    • Emulation support for the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions present on the Intel Knights Landing microarchitecture.
    • Emulation support for the Intel® Secure Hash Algorithm (Intel® SHA) extensions present on the Intel Goldmont microarchitecture.
    • Emulation support for the Intel® Memory Protection Extensions (Intel® MPX)  present on the Intel Skylake and Goldmont microarchitectures.
    • Support for Hardware Lock Elision introduced on the Intel Haswell microarchitecture
    • Improved support for Restricted Transactional Memory introduced on the Intel Haswell microarchitecture.
    • Improved support for the OS X* operating system (Mountain Lion)
    • The footprint tool now has the ability to compute footprint over time for working-set estimation.
    • A new tool called the dynamic mask profiler is provided using -dyn_mask_profile knob. The output is in a simple XML format.
    The Intel SDE development team has grown to include Michael Berezalsky, Mark Charney, Michael Gorin, Omer Mor, Ariel Slonim and Ady Tal.
     
     
    2013-01-03 version 5.38
    • Improvements in RTM emulation stability. Added statistics knobs.  Updated knobs.
    • Support for debugging integration with Microsoft Visual Studio 2012. See main page for information.
    • Improved multithreaded stability when using the AVX/SSE transition checker
    • Mac OS X: support for code-signed binaries, simplifying execution. See main page for information about the "taskport".
    • XED: added elf/dwarf support back to the command line tool
    • TZCNT ZF flags fix
     
    2012-11-01 version 5.31 - major update
    • Major update including fixes for the processor codenamed Haswell and introduction of instructions in the processor codenamed Broadwell
    • First public SDE release for OS X, 10.6 and 10.7.  See additional information on the main Intel SDE web page for required permissions.
    • HSW's RTM mode is supported with the "-rtm-mode full" option. This feature is very new and the Intel SDE implementation might be a little unstable.
    • Completely new mechanism for handling of CPUID. CPUID values now come from an input file.
    • SDE's -chip-check feature checks to make sure instructions are valid for the specified chip. See "sde -help" for the various chip options.
    • Exception handling fixes
    • Haswell BMI emulation fixes, including flags output.
    • Debugtrace multithreading safety improvements
    • Mix top-blocks sorting issues.  Mix also has better support for allocating stats to overlapping blocks.
    • Mix default blocks size is now 1500 instructions to avoid fragmenting large hot blocks.
    • XED now can emit "dot" graphs for specified regions:  path-to-sde-kit/xed -i SOMEEXE -as 0x40316b -ae 0x4031b3 -dot foo.dot; dot -O -Tpdf foo.dot
    • Mix has a legacy-prefix histogram.
    • Footprint tool can now collect stats about unique memory pages as well as unique cache lines. The footprint tool is now faster as well.
    • Improved speed of AVX/SSE transition checker by roughly 12%. See the -ast knob in "sde -thelp".
    • Fixed some numerical errors in our software emulation of the FMA instruction for denormal numbers.
    • Various stability improvements from using a newer version of Pin.
    • Better handling of MXCSR exception status bits for AVX1/2 instructions. We still do not support raising unmasked floating point errors from emulated instructions.
    • Can now set environment variables from the command line with the -env VAR VALUE option.
    • The commands for the GDB interface have been updated. See "monitor help sde" when attached as described on the main page. Please use GDB 7.4 or later.
    • The chip check error message includes the instruction bytes of the offending instruction.
    • Multiprocess output file handling. You used to have to supply "-i" to get the process id inserted into the file name to prevent multiprocess applications from overwriting the common output files. Now we attempt to detect the creation of other processes and add the PID to the file names automatically. The parent / child relationship is recorded in the file name.
    • Better support for unused bits in the VEX encodings in 32b mode.
     
    2011-12-15 version 4.46
    • Linux* 3.x is supported
    • Better support for running on Intel® AVX-enabled hosts
    • All output files now begin "sde-" and end with ".txt" by default
    • Mix is faster and does more analysis of SIMD operations
    • Mix has line number support for the top blocks when the information is available in the application
    • The -ptr-chk option now checks the memory references of gather operations
    • Fixed a file descriptor leak when exec'ing thousands of threads on Linux*.
    • Misc other stability improvements.
     
    2011-07-01 version 4.29
    • Support for the Haswell new instructions in the Intel AVX programmers reference version 11.
    • Mix now includes category and instruction length histograms automatically so the corresponding knobs were removed.
    • Many other changes
     
    2010-12-23 version 3.89 (Linux* only)
    • Fixed runtime libraries. Version 3.88 accidentally included runtime libraries that require a newer version of glibc than is present on older systems (like RHEL4).
    2010-12-21 version 3.88
     
    • Support for the post-32nm processor instructions for the processor codenamed Ivy Bridge in the 008 revision of the Intel AVX programmers reference document
    • Many stability improvements
    • "sde -thelp" goes to stdout, not stderr
    • mix has a "-demangle 0" option to turn off demangling
    • xed disassembler handles uninitialized code sections in windows binaries
    • xed supports dwarf line number information with the -line knob on Linux*.
    • mix has improved memory efficiency
    • To debug on Linux*, you no longer need the -avx-gdb knob but you must use gdb 7.2 or later which supports a new XML remote-debug protocol.
     
    2010-03-11 version 3.09
     
     
    • When pin or sde crashes due to bugs in user applications, the output of the circular buffer used for -itrace-execute (etc.) was not being dumped to disk. It is now.
    • Fixed circular buffer used for -itrace-execute and -itrace-execute-emulate. It was not initializing the circular buffer when -itrace-lines was used and would just crash immediately.  In addition to *actually* making the feature work, I sped it up immensely by reusing allocated string buffers.
    • Fixed 14 scalar Intel AVX instructions that were referencing too much memory (128b instead of 32b or 64b).
    • Made the xsave emulator be enabled all the time even when xsave is present on the hardware. One can disable it with '-xsave 0'.
    • All output log / stats file names now end in .txt by default.
    • Added a descriptive header to the top of the Intel AVX/Intel SSE transition output file.
    • debugtrace now prints mmx (and x87) register values
    • vmaskmov* instructions are now implemented in a thread-safe way.
    • vpmov[sz]x instructions now correctly reference less memory to avoid extra page accesses.
    • New memory pointer checker. This option checks all memory references for accessibility before the user application program is allowed to access memory. There is also a null pointer checker, which previously would only check Intel AVX instructions. The null checker writes to stderr (if accessible) and to a file sde-null-check.out.txt. The pointer checker writes to stderr (if accessible) and to a file sde-ptr-check.out.txt. The new knobs are: -null-check and -ptr-check
    • Enforcing VL=128 on all Intel AVX scalar instructions.
    • Fixes for the -no-avx and -no-aes knobs in the sde driver
    • xed: many corner case bugs fixed after yet another validation review
     
     
    2010-02-08 version 3.00
    • Changed output files to have .txt suffix.
    • debugtrace prints x87 and mmx registers
    • thread-safety fix for vmaskmov* instructions
    • reduced amount of memory referenced by vpmov[sz]* instructions.
    • New memory pointer checker (See -ptr-check and -null-check knobs)
    • Added VL=128 requirement for Intel AVX scalar instructions.
    • Fixed knobs -no-avx and -no-aes in the sde front end driver
     
    2009-12-31 version 2.94
    Major update.
    • Better support for recent Linux* distributions, like Ubuntu* 9.10.
    • Better support for debugging with GDB on Linux*.
    • Using GDB 7.0.50, and "sde -debug -avx-gdb -- yourapp", gdb can directly obtain Intel AVX register values without requiring "monitor yreg N" or "monitor yregs" commands.
    • Windows version supports latest dbghelp.dll 6.11.1.404
    • Fixes for paths with spaces
    • Using Pin's "safecopy" mechanism to access user memory
    • Spelling fixes
    • Tool arguments grouped more sensibly; See the output of "sde -thelp"
    • Support for Intel AVX unmasked zero divide exceptions on Windows
    • Intel AVX/Intel SSE transition tracing feature with -ast-trace knob
    • Intel AVX/Intel SSE transition checker emits previous block information
    • CPUID leaf-zero emulation support
    • Alignment checker upgrades
    • XED disassembler supports windows debugging symbols (via dbghelp.dll)
    • Fix for NaN case in Intel® SSE4.1 roundss on Linux* only
    • Fix for Intel® SSE4 PEXTRW gpr,xmm
    • More CPUID feature knobs for Intel® SSE technologies
    • Fix for a case in the emulation of single-precision FMA that affected accuracy
    • Support for FZ and DAZ in FMA routines
    • Data watch point support
    • Fix for MXCSR.OE and IE for vcomiss/vucomiss on NaN inputs
    • New chip-check feature to restrict instructions to specific chips. See "sde -thelp"
    • Fast icounting feature (faster than using mix)
    • Fixes for NaN issues on Windows with sqrt, mul, div, sub and cmp - it was quieting SNaNs.
    • The upgraded Pin can execute applications that contain illegal instructions, and an application-installed handler will be invoked.
    • New -itrace* knobs
    • Circular buffer support in debugtrace


    2009-01-30 version 1.70
    Added VPCLMULQDQ
     
    2009-01-09 version 1.61
    Synchronizing with Intel AVX architecture update.
    New 3-operand FMA instructions, removed VPERMIL2{PS,PD}, miscellaneous bug fixes.
    New footprint feature.
    Rearranged mix output, added function summaries.
    New version of dbghelp.dll required for windows (See the FAQ).
     

    2008-08-10 version 1.13

    Initial Release
  • AVX2
  • FMA
  • TSX
  • RTM
  • HLE
  • AVX-512
  • MPX
  • SHA
  • WhatIf Software
  • Intel® Software Development Emulator
  • Intel® Advanced Vector Extensions
  • Intel® Memory Protection Extensions
  • Intel® Secure Hash Algorithm Extensions
  • Intel® Streaming SIMD Extensions
  • Intel® Transactional Synchronization Extensions
  • Parallel Computing

    Exploring Intel® Transactional Synchronization Extensions with Intel® Software Development Emulator

    Intel® Transactional Synchronization Extensions (Intel® TSX) is perhaps one of the most non-trivial extensions of the instruction set architecture introduced in the 4th generation Intel® Core™ microarchitecture, code name Haswell. Intel® TSX implements hardware support for best-effort “transactional memory”, a simpler mechanism for scalable thread synchronization than inherently complex fine-grained locking or lock-free algorithms. The extensions have two interfaces: Hardware Lock Elision (HLE) and Restricted Transactional Memory (RTM).

    In this blog I will show how you can write your first RTM code and execute it in an emulated environment now, without waiting until the 4th generation Intel® Core™ processors become available for purchase.

    Before diving in, please make sure you have a basic understanding of the new RTM instructions. I refer you to this blog as an introduction. Also check out the Intel Developer Forum’12 presentation by Ravi Rajwar and Martin Dixon discussing the details of the Intel TSX implementation in Haswell hardware, and a presentation by Andi Kleen on adding lock elision (also using RTM) to Linux.

    My plan was to write a toy bank account processing application using popular C++ thread-unaware data structures from STL with concurrent access to bank records managed by Intel TSX. This way the implementation should be very simple, thread-safe and scalable.

    Development Environment

    For this experiment one needs the newest version (later than 5.3.1) of Intel® Software Development Emulator (Intel® SDE) and a compiler that can generate RTM instructions (via intrinsics or direct machine code). Please note that performance measurements with Intel SDE running RTM are of limited value because the overhead of emulating TM in software instead of using real hardware is huge, but as you will see later Intel SDE can already demonstrate important points for RTM usage for concurrency library developers and application programmers.

    Since my laptop runs Windows, I decided to try Intel SDE/RTM on Windows. I chose the C++ compiler from “Microsoft Visual Studio 2012 for Windows Desktop” (there is a free “Express” version that works for my purpose too). With a few clicks I quickly set up a console application project and included the immintrin.h header in the main .cpp file to use the RTM intrinsics.

    The Test

    For the bank account structure I chose the simple std::vector<int> from the C++ standard template library: “Accounts[i]” stores the current balance of account number i. This is a very simple and popular but thread-unsafe data structure, which must be protected by a concurrency control mechanism for parallel access. Usually locks/mutexes are used to limit the number of threads accessing the structure simultaneously. However, for parallel write accesses the whole data structure is usually locked exclusively, even if only distinct parts of it have to be updated. Intel TSX should help here, since it can execute writes optimistically, and if no real data conflict occurs, the writes are committed without serialization.

    To simplify the operations on the accounts I wanted to implement an easy-to-use C++ wrapper for protecting the current C++ scope from unsafe concurrent access to the data:

    
    {
            std::cout << "open new account" << std::endl;
            TransactionScope guard; // protect everything in this scope
            Accounts.push_back(0);
    }
    {
            std::cout << "open new account" << std::endl;
            TransactionScope guard; // protect everything in this scope
            Accounts.push_back(0);
    }
    {
            std::cout << "put 100 units into account 0" << std::endl;
            TransactionScope guard; // protect everything in this scope
            Accounts[0] += 100; // atomic update due to RTM
    }
    {
            std::cout << "transfer 10 units from account 0 to account 1 atomically!" << std::endl;
            TransactionScope guard; // protect everything in this scope
            Accounts[0] -= 10;
            Accounts[1] += 10;
    }
    {
            std::cout << "atomically draw 10 units from account 0 if there is enough money" << std::endl;
            TransactionScope guard; // protect everything in this scope
            if(Accounts[0] >= 10) Accounts[0] -= 10;
    }
    {
            std::cout << "add 1000 empty accounts atomically" << std::endl;
            TransactionScope guard; // protect everything in this scope
            Accounts.resize(Accounts.size() + 1000, 0);
    }
    
    

    Legacy applications implement such guards using a lock that allows only a single writer to execute the critical section (read-write locks are more complicated to handle and also do not make much sense here in our case because all accesses are writes/updates):

    
    class TransactionScope
    {
            SimpleSpinLock & lock;
            TransactionScope(); // forbidden
    public:
            TransactionScope(SimpleSpinLock & lock_): lock(lock_) { lock.lock(); }
            ~TransactionScope() { lock.unlock(); }
    };
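    The SimpleSpinLock type is not shown in this post. As a minimal sketch of what such a lock could look like, here is one possible version built on std::atomic<bool>, together with the lock-based guard in self-contained form. The lock implementation, the try_lock() method and the transfer() helper are my assumptions for illustration only, not the code used for the experiments.

```cpp
#include <atomic>
#include <cassert>

// Assumed stand-in for SimpleSpinLock: a test-and-set spin lock.
class SimpleSpinLock {
    std::atomic<bool> locked{false};
public:
    // Returns true if the lock was free and is now held by the caller.
    bool try_lock() { return !locked.exchange(true, std::memory_order_acquire); }
    void lock()     { while (!try_lock()) { /* busy-wait until released */ } }
    void unlock()   { locked.store(false, std::memory_order_release); }
};

// Lock-based guard with the same shape as the class above:
// acquire in the constructor, release in the destructor.
class TransactionScope {
    SimpleSpinLock& lock;
public:
    explicit TransactionScope(SimpleSpinLock& lock_) : lock(lock_) { lock.lock(); }
    ~TransactionScope() { lock.unlock(); }
    TransactionScope(const TransactionScope&) = delete;
    TransactionScope& operator=(const TransactionScope&) = delete;
};

// Hypothetical usage: move money between two balances under the lock.
void transfer(SimpleSpinLock& lock, int& from, int& to, int amount) {
    assert(amount >= 0);
    TransactionScope guard(lock);  // the whole scope runs with the lock held
    from -= amount;
    to += amount;
}
```

    With this guard every critical section is serialized, even when two threads touch different accounts, which is exactly the cost that the RTM version tries to avoid.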
    
    

    Implementing and Testing with RTM

    A naive RTM implementation of TransactionScope (handling both read/lookup and write/update accesses transparently) would be (changed lines are marked with █):

    
    class TransactionScope
    {
    public:
            TransactionScope()
            {
    █               int nretries = 0;
    █               while(1)
    █               {
    █                       ++nretries;
    █                       unsigned status = _xbegin();
    █                       if(status == _XBEGIN_STARTED) return; // successful start
    █                       // abort handler
    █                       std::cout << "DEBUG: Transaction aborted " << nretries <<
    █                          " time(s) with the status " << status << std::endl;
    █               }
            }
    █       ~TransactionScope() { _xend(); }
    };
    
     

    I have successfully compiled this code and tried to run it through Intel SDE:

    
    ./sde-bdw-external-5.31.0-2012-11-01-win/sde.exe -hsw -rtm-mode full -- ./ConsoleApplication1.exe
    
    open new account
    
    DEBUG: Transaction aborted 1 time(s) with the status 0
    
    DEBUG: Transaction aborted 2 time(s) with the status 0
    
    DEBUG: Transaction aborted 3 time(s) with the status 0
    
    DEBUG: Transaction aborted 4 time(s) with the status 0
    
    DEBUG: Transaction aborted 5 time(s) with the status 0
    
    DEBUG: Transaction aborted 6 time(s) with the status 0
    
    DEBUG: Transaction aborted 7 time(s) with the status 0
    
    DEBUG: Transaction aborted 8 time(s) with the status 0
    
    DEBUG: Transaction aborted 9 time(s) with the status 0
    
    DEBUG: Transaction aborted 10 time(s) with the status 0
    
    DEBUG: Transaction aborted 11 time(s) with the status 0
    
    DEBUG: Transaction aborted 12 time(s) with the status 0
    
    DEBUG: Transaction aborted 13 time(s) with the status 0
    
    DEBUG: Transaction aborted 14 time(s) with the status 0
    
    DEBUG: Transaction aborted 15 time(s) with the status 0
    
    DEBUG: Transaction aborted 16 time(s) with the status 0
    
    

    and so on…

    The program went into an infinite loop, always aborting on the first transaction. The RTM debug log from Intel SDE (emx-rtm.txt, enabled with the option “-rtm_debug_log 2”) confirmed that. As a general rule, failure is more or less expected for any implementation that ignores the specification. The Intel® Architecture Instruction Set Extensions Programming Reference explicitly mentions that “the hardware provides no guarantees as to whether an RTM region will ever successfully commit transactionally”. Because of that, software using RTM must provide a (non-transactional) fall-back path that is executed if (many) aborts happen. (By the way: HLE provides the fall-back automatically, since on the first abort the same critical section is executed non-transactionally.)

    Implementing Fall-Back

    Here is our second attempt, which acquires a fall-back spin lock non-transactionally after a specified number of retries.

    
    LONGLONG naborted = 0; // global abort statistics, alternatively use the "-rtm_debug_log 2" Intel SDE option
    
     
    
    class TransactionScope
    
    {
    
    █       SimpleSpinLock & fallBackLock;
    
            TransactionScope(); // forbidden
    
    public:
    
    █       TransactionScope(SimpleSpinLock & fallBackLock_, int max_retries = 3) :
    
    █               fallBackLock(fallBackLock_)
    
            {
    
                    int nretries = 0;
    
                    while(1)
    
                    {
    
                            ++nretries;
    
                            unsigned status = _xbegin();
    
                            if(status == _XBEGIN_STARTED)
    
                            {
    
    █                               if(!fallBackLock.isLocked())
    
    █                                         return; // successfully started transaction
    
    █                               /* started transaction but someone is executing 
    
    █                                  the transaction section non-speculatively (acquired
    
    █                                  the fall-back lock) -> aborting */
    
    █                               _xabort(0xff); // abort with code 0xff
    
                            }
    
                            // abort handler
    
                            InterlockedIncrement64(&naborted); // do abort statistics
    
                            std::cout << "DEBUG: Transaction aborted "<< nretries <<
    
                                  " time(s) with the status "<< status << std::endl;
    
    █                       // handle _xabort(0xff) from above
    
    █                       if((status & _XABORT_EXPLICIT) && _XABORT_CODE(status)==0xff
    
    █                            && !(status & _XABORT_NESTED))
    
    █                       {       // wait until the lock is free
    
    █                               while(fallBackLock.isLocked()) _mm_pause();
    
    █                       }
    
    █                       // too many retries, take the fall-back lock
    
    █                       if(nretries >= max_retries) break;
    
                    }
    
    █               fallBackLock.lock();
    
            }
    
            ~TransactionScope()
    
            {
    
    █               if(fallBackLock.isLocked())
    
    █                       fallBackLock.unlock();
    
    █               else
    
                            _xend();
    
            }
    
    };
    
    

    The output looks much better now:

    
    open new account
    
    DEBUG: Transaction aborted 1 time(s) with the status 0
    
    DEBUG: Transaction aborted 2 time(s) with the status 0
    
    DEBUG: Transaction aborted 3 time(s) with the status 0
    
    open new account
    
    put 100 units into account 0
    
    transfer 10 units from account 0 to account 1 atomically!
    
    atomically draw 10 units from account 0 if there is enough money
    
    add 1000 empty accounts atomically
    
     

    One can see that all transactions except the first one succeeded on the very first attempt. The first one took the fall-back lock after three attempts. It was special because it had to reserve and touch new memory for the vector from the operating system. This is a very complex process involving system calls, privilege ring transitions (ring 3 [application] -> ring 0 [OS]), page faults and initialization/zeroing of very big chunks of memory which may not fit into the transactional buffer. All of this may cause aborts according to the Intel® Architecture Instruction Set Extensions Programming Reference.

    Leveraging RTM Abort Status Bits

    A further optimization that I came up with is leveraging the abort status information: in case of such “hard” aborts the “retry” bit (position 1) in the abort status is not set. The bit is set if the hardware thinks the transaction may succeed on retry. I added the line marked below to the abort handler to implement it:

     

    
     // handle _xabort(0xff) from above
    
     if((status & _XABORT_EXPLICIT) && _XABORT_CODE(status)==0xff
    
          && !(status & _XABORT_NESTED))
    
     {
    
            while(fallBackLock.isLocked()) _mm_pause(); // wait until lock is free
    
     
    
    █} else if(!(status & _XABORT_RETRY)) break; /* take the fall-back lock
    
        if the retry abort flag is not set */
    
     

    The output:

    
    open new account
    
    DEBUG: Transaction aborted 1 time(s) with the status 0
    
    open new account
    
    put 100 units into account 0
    
    transfer 10 units from account 0 to account 1 atomically!
    
    atomically draw 10 units from account 0 if there is enough money
    
    add 1000 empty accounts atomically
    
     

    Now we see that the program makes faster progress by taking the fall-back lock sooner in the case of a “hard” abort.

    As you may notice, the changes so far were isolated within a synchronization interface, TransactionScope. The application code was not changed. As the generally available TSX software infrastructure evolves, you should look for a proven existing library that has (scope) locks with RTM support to avoid pitfalls in your synchronization primitives (we will talk about pitfalls in application code in future blogs). For example, a TSX-enabled pthread library for Linux is already available. On the other hand, it is not uncommon for existing applications to use extended or custom synchronization interfaces. Converting them to take advantage of TSX is not a complicated task either, if done with care.

    Concurrent Accesses from Several Threads Managed by Intel TSX

     

    After basic debugging the time has come to see the real power of Intel TSX: run two worker threads doing random concurrent updates to the central account data structure:

    
    unsigned __stdcall thread_worker(void * arg)
    
    {
    
            int thread_nr = (int) arg;
    
            std::cout << "Thread "<< thread_nr<< " started."<< std::endl;
    
            // create thread-local TR1 C++ random generator from <random>
    
            std::tr1::minstd_rand myRand(thread_nr); 
    
            long int loops = 10000;
    
     
    
            while(--loops)
    
            {
    
                    {
    
                            TransactionScope guard(globalFallBackLock);
    
                            // put 100 units into a random account atomically
    
                            Accounts[myRand() % Accounts.size()] += 100;
    
                    }
    
     
    
                    {
    
                            TransactionScope guard(globalFallBackLock);
    
                            /* transfer 100 units between random accounts 
    
                               (if there is enough money) atomically */
    
                            int a = myRand() % Accounts.size();
    
                            int b = myRand() % Accounts.size();
    
                            if(Accounts[a] >= 100)
    
                            {
    
                                    Accounts[a] -= 100;
    
                                    Accounts[b] += 100;
    
                            }
    
                    }
    
            }
    
            std::cout << "Thread "<< thread_nr<< " finished."<< std::endl;
    
            return 0;
    
    }
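
The worker above is shown without the code that starts the threads. A portable sketch of such a driver follows, with two loud assumptions: it uses std::thread instead of the Windows thread API, and it replaces TransactionScope with a plain mutex guard so that it runs on machines without RTM. The names run_workers and PortableScope are hypothetical and not part of the attached source. The sketch returns the sum of all balances, which gives a simple correctness check: transfers never change the total, so it must equal the number of deposits times 100.

```cpp
#include <cstddef>
#include <mutex>
#include <random>
#include <thread>
#include <vector>

// Stand-in for the article's TransactionScope: an ordinary mutex guard.
// Hypothetical; it only mimics the "protect everything in this scope" usage.
using PortableScope = std::lock_guard<std::mutex>;

// Starts nthreads workers, each doing `loops` rounds of one deposit and one
// conditional transfer, and returns the sum of all balances afterwards.
long long run_workers(int nthreads, int loops, std::size_t naccounts)
{
        std::vector<long long> accounts(naccounts, 0);
        std::mutex fallBackLock;

        auto worker = [&](int thread_nr)
        {
                std::minstd_rand myRand(thread_nr + 1); // per-thread generator
                for (int i = 0; i < loops; ++i)
                {
                        {
                                PortableScope guard(fallBackLock);
                                // put 100 units into a random account
                                accounts[myRand() % accounts.size()] += 100;
                        }
                        {
                                PortableScope guard(fallBackLock);
                                // transfer 100 units if there is enough money
                                std::size_t a = myRand() % accounts.size();
                                std::size_t b = myRand() % accounts.size();
                                if (accounts[a] >= 100)
                                {
                                        accounts[a] -= 100;
                                        accounts[b] += 100;
                                }
                        }
                }
        };

        std::vector<std::thread> threads;
        for (int t = 0; t < nthreads; ++t) threads.emplace_back(worker, t);
        for (auto & th : threads) th.join();

        long long total = 0;
        for (long long v : accounts) total += v;
        return total;
}
```

Calling run_workers(2, 10000, 1000) should return 2000000 if no update was lost. The same invariant is a cheap sanity check for an RTM-based guard: a lost update would show up as a wrong total.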
    
     

    I built a Release build without the DEBUG output and saw only about 100-300 aborts for a total of 20000 transactions. The debug output shows an abort status of 6: the retry and “memory access conflict” bits are set. This is exactly what I expected from Intel TSX: almost all updates are done in parallel and only a few are rolled back due to a conflict.
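
The abort status values seen in this post (0 and 6) can be decoded bit by bit. The helper below is a small sketch of such a decoder. The constants mirror the bit positions of the _XABORT_* macros from <immintrin.h>, but are redefined under hypothetical names (ABORT_*) so that the sketch compiles without RTM support in the compiler; describeAbort and abortCode are likewise illustrative names, not part of the attached source.

```cpp
#include <string>

// Bit positions of the RTM abort status, mirroring _XABORT_* in <immintrin.h>.
// Redefined here (hypothetical names) so no RTM compiler support is needed.
const unsigned ABORT_EXPLICIT = 1u << 0; // aborted by _xabort()
const unsigned ABORT_RETRY    = 1u << 1; // transaction may succeed on retry
const unsigned ABORT_CONFLICT = 1u << 2; // memory access conflict
const unsigned ABORT_CAPACITY = 1u << 3; // internal buffer overflowed
const unsigned ABORT_DEBUG    = 1u << 4; // debug breakpoint hit
const unsigned ABORT_NESTED   = 1u << 5; // abort inside a nested transaction

// The 8-bit code passed to _xabort(), stored in bits 31:24 of the status.
unsigned abortCode(unsigned status) { return (status >> 24) & 0xffu; }

// Returns the names of the flags set in status, separated by '|'.
std::string describeAbort(unsigned status)
{
        if (status == 0) return "none";
        std::string out;
        auto add = [&](unsigned bit, const char * name)
        {
                if (status & bit)
                {
                        if (!out.empty()) out += '|';
                        out += name;
                }
        };
        add(ABORT_EXPLICIT, "explicit");
        add(ABORT_RETRY,    "retry");
        add(ABORT_CONFLICT, "conflict");
        add(ABORT_CAPACITY, "capacity");
        add(ABORT_DEBUG,    "debug");
        add(ABORT_NESTED,   "nested");
        return out.empty() ? "unknown" : out;
}
```

With this helper, describeAbort(6) yields "retry|conflict", matching the status seen in the two-thread run, while describeAbort(0) yields "none", the "hard" abort case in which the retry bit is not set.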

    To double-check that my conclusions are right and that the emulator works as I expected, I added an increment of a global counter inside the transactions to introduce a huge number of conflicting accesses. And yes, it worked: with that change I saw about 5-15K aborts. Although the absolute numbers obtained from the RTM emulator cannot exactly predict the execution metrics on future hardware, the orders of magnitude should still indicate possible issues with RTM usage.

    Last Words

    These were my experiences with RTM and the new Intel® Software Development Emulator. Get prepared for Haswell and check out how your software can use Restricted Transactional Memory with Intel SDE now!

    --

    Roman 

    (the complete source code is attached to the article)

    Attachments: 

    https://software.intel.com/sites/default/files/blog/335035/exploringinteltsx.cpp

    Building and Simulating an App using the HTML5 Development Environment Beta


    The HTML5 section within the Intel Developer Zone was updated just before the US Thanksgiving holiday to release the new Intel® HTML5 Development Environment Beta, and I tried out a few of the sample apps. It took me about fifteen minutes to get one of the samples packaged into an .apk file and running on my Android tablet. This is the first cloud-based tool set that Intel has provided, and it's a basic solution for learning HTML5 and developing real cross-platform apps that can be submitted to the iOS, Google* Play, or even a Blackberry* app store.

    Before you get started.


    There are a few things you need to do before you can get to this 15-minute example. Some are necessary, some are just recommended, and some you may already have done, but all are free.

    #1 You need to log in to the Intel Developer Zone, or register if you are new: Click at the top right of the page you are reading now.

    #2  Request an account for the HTML5 Development Environment Beta: Click here to go to the HTML5 page, or go to  http://software.intel.com/en-us/form/html5-beta-request

    #3  Download Google Chrome:  The mobile system emulator in the IDE only runs in Chrome, so you will need this once you get your account.

    #4  Get a GitHub account:  https://github.com/signup/free. Recommended, as this can be used by the HTML5 Development Environment Beta, but also as a login ID for Adobe PhoneGap.

    #5 Get an account with http://build.phonegap.com (or use the GitHub account you set up in step #4).

    If you have any questions or comments about this procedure, or the tools, please post them in the HTML5 Forum section. http://software.intel.com/forums/html5-forum

    The Intel® HTML5 Development Environment is enabled with IDZ single sign-on, so once you get your approval email, crank up Chrome, log in to IDZ and click "Launch the Tool" on the HTML5 page.

    Be sure to check out the Mobile Device Emulator included with this tool. You can select from multiple screen sizes and orientations and see how your app will look and run on everything from a small phone to a large tablet.  The Emulator also includes support for tablet and phone sensors, so you can determine whether your app responds correctly to GPS timeouts, for example. See the screenshot below for more details.

    And finally, if you can't wait to get your IDE account, you can play some of the HTML5 games in your Chrome Browser while you are waiting, or browse the accompanying articles for the samples that are included.

    Stewart Christie is the HTML5 and Tizen App Community Manager. Follow him on twitter @intel_stewart

    IDE Emulator
