Emulation of new instructions

↧

Licenses for runtime libraries for SDE on Linux

August 12, 2008, 2:31 pm

Latest and popular articles on Intel Technologies

Read more about GNU General Public License, version 2

Back to the Intel® Software Development Emulator

For libstdc++:

// Copyright (C) 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006
// Free Software Foundation, Inc.
//
// This file is part of the GNU ISO C++ Library.  This library is free
// software; you can redistribute it and/or modify it under the
// terms of the GNU General Public License as published by the
// Free Software Foundation; either version 2, or (at your option)
// any later version.

// This library is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
// GNU General Public License for more details.

// You should have received a copy of the GNU General Public License along
// with this library; see the file COPYING.  If not, write to the Free
// Software Foundation, 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301,
// USA.

// As a special exception, you may use this file as part of a free software
// library without restriction.  Specifically, if other files instantiate
// templates or use macros or inline functions from this file, or you compile
// this file and link it with other files to produce an executable, this
// file does not by itself cause the resulting executable to be covered by
// the GNU General Public License.  This exception does not however
// invalidate any other reasons why the executable file might be covered by
// the GNU General Public License.

And for libgcc_s:

/* Copyright (C) 1989, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999,
   2000, 2001, 2002, 2003, 2004, 2005  Free Software Foundation, Inc.

This file is part of GCC.

GCC is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free
Software Foundation; either version 2, or (at your option) any later
version.

In addition to the permissions in the GNU General Public License, the
Free Software Foundation gives you unlimited permission to link the
compiled version of this file into combinations with other programs,
and to distribute those combinations without any restriction coming
from the use of this file.  (The General Public License restrictions
do apply in other respects; for example, they cover modification of
the file, and distribution when not linked into a combine
executable.)

GCC is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received a copy of the GNU General Public License
along with GCC; see the file COPYING.  If not, write to the Free
Software Foundation, 51 Franklin Street, Fifth Floor, Boston, MA
02110-1301, USA.  */

↧

Recent Intel® AVX Architectural Changes

January 29, 2009, 8:33 am

Dear Intel® AVX developers,

We recently made some significant changes to the Intel® Advanced Vector Extensions Programmer’s Reference Manual (please download the latest version at /sites/avx/). If you are writing tools or software based on AVX, this may impact you. The big changes are a very different FMA syntax and the removal of two instructions (4-operand permutes).

Since the initial Intel® AVX spec was released in April 2008, I wanted to recap the changes we’ve made in the previous two Programmer’s Reference Manual Updates.

Added: VEX forms of AES instructions
The AES instructions we plan to release on our upcoming 32-nm cores (codename “Westmere”) will be extended by the VEX prefix in the following cores (codename “Sandy Bridge”). The VEX prefix brings a distinct destination register to 4 of the 5 AES instructions (VAESDEC, VAESDECLAST, VAESENC, and VAESENCLAST; AESIMC already had a distinct destination). Only the 128-bit forms of the instructions are supported; the instructions are not promoted to 256-bit. As with the AES instruction set, VAES instructions may not be enabled by all products in all geographies. For the VAES instructions we have therefore created a unique way of detecting their presence in hardware, by requiring you to check that two CPUID flags are set (CPUID.AES AND CPUID.AVX).

Added: 256-bit forms of streaming stores
Astute readers noticed that the streaming store instructions MOVNTDQ, MOVNTPS, and MOVNTPD had originally been supported only in 128-bit forms. We now have them in 256-bit forms in our Sandy Bridge cores. It’s not clear that they will be any faster than the 128-bit forms on Sandy Bridge, but we encourage their use for future performance. Note that Streaming Load (VMOVNTDQA) is still (only) 128-bit; yes, that’s intentional.

Removed: VPERMIL2PS and VPERMIL2PD
All PERMIL2 instructions are gone – both the 128-bit and 256-bit flavors. Like the FMA below, they used the VEX.W bit to select which source was from memory – we’re not moving in the direction of using VEX.W for that purpose any more.

Changed: All FMA instructions
We previously defined 4-operand FMAs with 3 sources and a separate destination. We now have 3 operands and still 3 sources, so one of the sources gets destroyed (this makes the FMA instructions unique in AVX). For each of the old forms, we now have 3 new instructions, using 132, 213, and 231 designations. The VEX.W bit no longer selects the source that comes from memory; instead, it selects the floating-point type (single or double precision). Finally, the scalar instructions preserve the upper bits of the destination (up to bit 127) instead of zeroing them. These instructions are (still) not in Sandy Bridge; we are planning to ship them in a subsequent processor.

Example: we previously had

VFMADDSS xmm1, xmm2, xmm3, m32, which was xmm1 = xmm2*xmm3 + m32

(and the upper bits of the XMM register - from bit 32 to 127 - were zeroed).

NOW we have three forms:

VFMADD132SS xmm1, xmm2, m32, which is xmm1 = xmm1*m32 + xmm2
VFMADD213SS xmm1, xmm2, m32, which is xmm1 = xmm2*xmm1 + m32
VFMADD231SS xmm1, xmm2, m32, which is xmm1 = xmm2*m32 + xmm1

(and note that now the upper bits of the XMM register – from bit 32 to 127 - are unchanged).

The numbers in the instruction mnemonics come from the order of the operands in the expression, so VFMADD132 is 1*3+2, VFMADD213 is 2*1+3, and VFMADD231 is 2*3+1.
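The naming rule can be checked with a small model. This is only an illustrative Python sketch, not Intel code: the helper name `fma_form` is hypothetical, and plain Python floats round after the multiply and again after the add, whereas a hardware FMA rounds once.

```python
def fma_form(order):
    """Build a model of VFMADD<order>: op[a] * op[b] + op[c].

    Operands are numbered in instruction order:
    1 = destination (also a source), 2 = second operand, 3 = third operand.
    """
    a, b, c = (int(digit) for digit in order)

    def fma(op1, op2, op3):
        ops = {1: op1, 2: op2, 3: op3}
        # The result would be written back to operand 1 (the destination).
        return ops[a] * ops[b] + ops[c]

    return fma


vfmadd132 = fma_form("132")  # xmm1 = xmm1*m32 + xmm2
vfmadd213 = fma_form("213")  # xmm1 = xmm2*xmm1 + m32
vfmadd231 = fma_form("231")  # xmm1 = xmm2*m32 + xmm1

if __name__ == "__main__":
    xmm1, xmm2, m32 = 2.0, 3.0, 5.0
    print(vfmadd132(xmm1, xmm2, m32))  # 2*5 + 3
    print(vfmadd213(xmm1, xmm2, m32))  # 3*2 + 5
    print(vfmadd231(xmm1, xmm2, m32))  # 3*5 + 2
```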

The three forms allow you to avoid having to do extra copies or loads – most of the time. The primary exception is in code where you really did need to re-use all the sources, as in butterflies:

y0 = x0*c0 + x1
y1 = x0*c0 - x1

which generally incurs one copy for every 2 FMA’s. So far, this doesn’t appear to be much of a performance hit.
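To see where the copy comes from, here is a rough Python model of the butterfly with destructive 3-operand FMAs. The register names and the `butterfly` helper are made up for illustration, and the subtract form is assumed to follow the same 213 ordering as the add form (dst = src2*dst - src3).

```python
def butterfly(x0, c0, x1):
    """Model y0 = x0*c0 + x1 and y1 = x0*c0 - x1 with destructive FMAs.

    Returns (y0, y1, copies), where copies counts the extra register moves.
    """
    regs = {"xmm0": x0, "xmm1": c0, "xmm2": x1}
    copies = 0

    # Both results need x0, but each FMA overwrites its destination,
    # so x0 has to be duplicated before the first FMA.
    regs["xmm3"] = regs["xmm0"]
    copies += 1

    # 213 add form: dst = src2*dst + src3  ->  xmm3 = c0*x0 + x1
    regs["xmm3"] = regs["xmm1"] * regs["xmm3"] + regs["xmm2"]

    # 213 subtract form (assumed): dst = src2*dst - src3  ->  xmm0 = c0*x0 - x1
    regs["xmm0"] = regs["xmm1"] * regs["xmm0"] - regs["xmm2"]

    return regs["xmm3"], regs["xmm0"], copies


if __name__ == "__main__":
    print(butterfly(2.0, 3.0, 1.0))
```

In this model the two FMAs cost exactly one register copy, which matches the "one copy for every 2 FMAs" estimate.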

Added: VEX forms of the PCLMULQDQ instruction
The PCLMULQDQ instruction we plan to release on our upcoming 32-nm cores (codename “Westmere”) will be extended by the VEX prefix in our subsequent cores (codename “Sandy Bridge”). The VEX form brings a distinct destination register. Only the 128-bit form of the instruction is supported; it is not promoted to 256-bit. As with the VAES instructions, VPCLMULQDQ will require the careful programmer to check that two CPUID flags are set (in this case CPUID.PCLMULQDQ AND CPUID.AVX).
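The two-flag checks for VAES and VPCLMULQDQ can be sketched as bit tests on the ECX value returned by CPUID leaf 1. The Python below is a minimal illustration only: the function names are hypothetical, the `ecx` value is assumed to have been read elsewhere (Python cannot execute CPUID without a native helper), and the sketch does not cover checking that the operating system has enabled AVX state.

```python
# Feature bits in ECX after executing CPUID with EAX = 1.
ECX_PCLMULQDQ = 1 << 1
ECX_AES = 1 << 25
ECX_AVX = 1 << 28


def has_all(ecx, mask):
    """True only if every bit in mask is set in ecx."""
    return (ecx & mask) == mask


def has_vaes(ecx):
    # VEX-encoded AES needs both CPUID.AES and CPUID.AVX.
    return has_all(ecx, ECX_AES | ECX_AVX)


def has_vpclmulqdq(ecx):
    # VEX-encoded PCLMULQDQ needs both CPUID.PCLMULQDQ and CPUID.AVX.
    return has_all(ecx, ECX_PCLMULQDQ | ECX_AVX)


if __name__ == "__main__":
    sample_ecx = ECX_AES | ECX_AVX  # made-up value for illustration
    print(has_vaes(sample_ecx), has_vpclmulqdq(sample_ecx))
```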

A number of miscellanea
We clarified the alignment-check exception (#AC) behavior for MASKMOV instructions (by the way, does anyone actually use #AC?).
We clarified the exception type for the packed shift instructions (PSLL, PSRL, PSRA).
The Encoding rule table (4-3) is clarified to reflect PEXTRW.

We hope these changes are not too disruptive to you and thank you for your support of our ongoing early disclosure policy. We have updated the Intel® Software Development Emulator with all of these changes. If you have any questions or concerns about the impact of these changes to your application, or would like more detail on any of these changes, I encourage you to start a thread at /en-us/forums/intel-avx-and-cpu-instructions, or contact me directly.

Regards,
Mark Buxton
Software Engineer
Intel Corporation
Mark.J.Buxton@intel.com

↧

AVX debugging, or how is it done after all?

January 29, 2010, 10:25 pm

AVX is defined, frozen, and already on its way to us. Much has been said about the various aspects of development: compilation, emulation, documentation, and even profiling (I strongly recommend a look at /en-us/avx/), but there has been rather little information about debugging.

Although, to be honest, everything was already there. But today it has become even more convenient, and even more visual, to debug the movement of bits across the 256-bit field of the AVX registers.

In short, I recommend getting better acquainted with SDE (/en-us/articles/intel-software-development-emulator).

The emulator not only handles the full instruction set well and quietly, it can also show exactly what happened.

To begin with, I want to draw your attention to the additional help argument -thelp, which expands into a rather long list of arguments. Among them are the so-called Debugtrace knobs, of which -debugtrace and -dt_start_int3 deserve special mention.

Using them lets us create a report file, debugtrace.out (the default name), in which the instructions and, most importantly, their operands with the values used are clearly visible.

For example, I get:

TID0: INS 0x00401f4d                     vrcpss xmm7, xmm5, xmm5
TID0:      XMM7 := 00000000_00000000_00000000_3ba57800
XMM7 (doubles) := 0 4.94411e-315
XMM7 (floats) := 0 0 0 0.00504971
TID0: INS 0x00401f51                     vsubss xmm5, xmm1, xmm0
TID0:      XMM5 := 00000000_00000000_00000000_43460000
XMM5 (doubles) := 0 5.57633e-315
XMM5 (floats) := 0 0 0 198
TID0: INS 0x00401f55                     vmulss xmm5, xmm5, xmm7
TID0:      XMM5 := 00000000_00000000_00000000_3f7ff5a0
XMM5 (doubles) := 0 5.26353e-315
XMM5 (floats) := 0 0 0 0.999842

Here it is clear that vmulss (scalar floating-point multiplication) receives 0.00504971 (XMM7) and 198 (XMM5) as operands. The result stays in XMM5 (0.999842), which according to my calculator is correct.
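The same check can be done without a calculator by decoding the hex dumps from the trace. The Python sketch below assumes only the dump format shown in the excerpt above (four 32-bit words separated by underscores, lowest word last); the helper names are made up for illustration.

```python
import struct


def f32_from_hex(word):
    """Interpret an 8-digit hex string as an IEEE-754 single-precision value."""
    return struct.unpack(">f", bytes.fromhex(word))[0]


def low_float(dump):
    """Return the lowest 32-bit lane of an XMM dump such as '0000..._3ba57800'."""
    return f32_from_hex(dump.split("_")[-1])


# Values copied from the debugtrace.out excerpt.
xmm7 = low_float("00000000_00000000_00000000_3ba57800")      # vrcpss result
xmm5_in = low_float("00000000_00000000_00000000_43460000")   # vsubss result
xmm5_out = low_float("00000000_00000000_00000000_3f7ff5a0")  # vmulss result

if __name__ == "__main__":
    print(xmm7, xmm5_in, xmm5_out)
    print(abs(xmm7 * xmm5_in - xmm5_out) < 1e-7)
```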

The structure of debugtrace.out is actually quite simple, and almost immediately, or at least at second glance, you can see the latest values of the registers or memory used :).
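Because the format is this regular, pulling out the last value written to each register takes only a few lines. The Python sketch below is an assumption-laden example: it handles only register lines of the form shown in the excerpt ("TID0:      XMM7 := ..."), ignores everything else, and the names `REG_LINE` and `last_register_values` are hypothetical.

```python
import re

# Matches lines such as "TID0:      XMM7 := 00000000_..._3ba57800".
REG_LINE = re.compile(r"^TID(\d+):\s+([A-Z]+\d+)\s+:=\s+([0-9a-fA-F_]+)\s*$")


def last_register_values(lines):
    """Map (thread id, register name) to the last hex dump seen in the trace."""
    last = {}
    for line in lines:
        match = REG_LINE.match(line)
        if match:
            tid, reg, value = match.groups()
            last[(int(tid), reg)] = value
    return last


SAMPLE = """\
TID0: INS 0x00401f4d                     vrcpss xmm7, xmm5, xmm5
TID0:      XMM7 := 00000000_00000000_00000000_3ba57800
XMM7 (floats) := 0 0 0 0.00504971
TID0: INS 0x00401f51                     vsubss xmm5, xmm1, xmm0
TID0:      XMM5 := 00000000_00000000_00000000_43460000
TID0: INS 0x00401f55                     vmulss xmm5, xmm5, xmm7
TID0:      XMM5 := 00000000_00000000_00000000_3f7ff5a0
""".splitlines()

if __name__ == "__main__":
    for key, value in sorted(last_register_values(SAMPLE).items()):
        print(key, value)
```

To run it on a real trace, pass the open file instead of `SAMPLE`, for example `last_register_values(open("debugtrace.out"))`.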

For greater convenience, I also advise paying attention to -dt_start_int3, which lets you "bracket" the interesting code for a more detailed analysis from within SDE.

I think there are no problems left, or are there?

↧

Intel® Software Development Emulator Download

December 16, 2011, 6:56 am

Intel® Software Development Emulator (released November 16, 2013)

DOWNLOAD Intel® SDE for WINDOWS* (sde-external-6.12.0-2013-11-16-win.tar.bz2)
- Note: If you use Cygwin's tar command to unpack the Windows* kit, you must execute a "chmod -R +x" on the unpacked directory.
- DOWNLOAD Intel® SDE debugging integration for WINDOWS* (sde-msvs2012-1.0.5.msi.zip)
DOWNLOAD Intel® SDE for LINUX* (sde-external-6.12.0-2013-11-16-lin.tar.bz2)
DOWNLOAD Intel® SDE for OS X* (sde-external-6.12.0-2013-11-16-mac.tar.bz2)
DOWNLOAD Intel® MPX runtime for LINUX* (2013-10-29-mpx-external-lin.tar.bz2)
DOWNLOAD Intel® MPX runtime for WINDOWS* (2013-10-29-mpx-runtime-external-win.zip)

Previous versions of the Intel® Software Development Emulator

DOWNLOAD Intel® SDE for WINDOWS* (sde-external-6.7.0-2013-09-21-win.tar.bz2)
- Note: If you use Cygwin's tar command to unpack the Windows* kit, you must execute a "chmod -R +x" on the unpacked directory.
- DOWNLOAD Intel® SDE debugging integration for WINDOWS* (sde-msvs2012-1.0.5.msi.zip)
DOWNLOAD Intel® SDE for LINUX* (sde-external-6.7.0-2013-09-21-lin.tar.bz2)
DOWNLOAD Intel® SDE for OS X* (sde-external-6.7.0-2013-09-21-mac.tar.bz2)
DOWNLOAD Intel® MPX runtime for LINUX* (2013-08-29-mpx-external-lin.tar.bz2)
DOWNLOAD Intel® MPX runtime for WINDOWS* (2013-08-29-mpx-external-win.zip)

DOWNLOAD Intel® SDE for WINDOWS* (sde-external-6.1.0-2013-07-22-win.tar.bz2)
- Note: If you use Cygwin's tar command to unpack the Windows* kit, you must execute a "chmod -R +x" on the unpacked directory.
- DOWNLOAD Intel® SDE debugging integration for WINDOWS* (sde-msvs2012-1.0.5.msi.zip)
DOWNLOAD Intel® SDE for LINUX* (sde-external-6.1.0-2013-07-22-lin.tar.bz2)
DOWNLOAD Intel® SDE for OS X* (sde-external-6.1.0-2013-07-22-mac.tar.bz2)
DOWNLOAD Intel® MPX runtime for LINUX* (2013-07-22-mpx-external-lin.tar.bz2)
DOWNLOAD Intel® MPX runtime for WINDOWS* (2013-07-22-mpx-external-win.zip)

DOWNLOAD Intel SDE for WINDOWS* (sde-bdw-external-5.38.0-2013-01-03-win.tar.bz2)
- Note: If you use Cygwin's tar command to unpack the Windows* kit, you must execute a "chmod -R +x" on the unpacked directory.
- DOWNLOAD Intel SDE debugging integration for WINDOWS* (sde-msvs2012-1.0.5.msi.zip)
DOWNLOAD Intel SDE for LINUX* (sde-bdw-external-5.38.0-2013-01-03-lin.tar.bz2)
DOWNLOAD Intel SDE for OS X* (sde-bdw-external-5.38.0-2013-01-03-mac.tar.bz2)

DOWNLOAD Intel SDE for WINDOWS* (sde-bdw-external-5.31.0-2012-11-01-win.tar.bz2)
- Note: If you use Cygwin's tar command to unpack the Windows* kit, you must execute a "chmod -R +x" on the unpacked directory.
DOWNLOAD Intel SDE for LINUX* (sde-bdw-external-5.31.0-2012-11-01-lin.tar.gz)
DOWNLOAD Intel SDE for OS X* (sde-bdw-external-5.31.0-2012-11-01-mac.tar.bz2)

Please take a moment to register with Intel® DZ to participate in forum discussions.

Back to the Intel® Software Development Emulator page.

What If Pre-Release License Agreement

License Agreement:

Protected Attachments:

Attachment	Size
Download sde-bdw-external-5.31.0-2012-11-01-win.tar.bz2	7.67 MB
Download sde-bdw-external-5.31.0-2012-11-01-mac.tar.bz2	7.23 MB
Download sde-bdw-external-5.31.0-2012-11-01-lin.tar.gz	14.41 MB
Download sde-msvs2012-1.0.5.msi.zip	405.59 KB
Download sde-bdw-external-5.38.0-2013-01-03-lin.tar.bz2	13.88 MB
Download sde-bdw-external-5.38.0-2013-01-03-mac.tar.bz2	7.42 MB
Download sde-bdw-external-5.38.0-2013-01-03-win.tar.bz2	7.55 MB
Download sde-external-6.1.0-2013-07-22-lin.tar.bz2	14.52 MB
Download sde-external-6.1.0-2013-07-22-mac.tar.bz2	8.13 MB
Download sde-external-6.1.0-2013-07-22-win.tar.bz2	8.12 MB
Download 2013-07-22-mpx-runtime-external-lin.tar.bz2	18.29 KB
Download 2013-07-22-mpx-runtime-external-win.zip	58.07 KB
Download 2013-08-29-mpx-runtime-external-lin.tar.bz2	16.53 KB
Download 2013-08-29-mpx-runtime-external-win.zip	58.07 KB
Download sde-external-6.7.0-2013-09-21-lin.tar.bz2	14.15 MB
Download sde-external-6.7.0-2013-09-21-mac.tar.bz2	8.4 MB
Download sde-external-6.7.0-2013-09-21-win.tar.bz2	8.21 MB
Download sde-external-6.12.0-2013-11-16-lin.tar.bz2	13.65 MB
Download sde-external-6.12.0-2013-11-16-mac.tar.bz2	8.02 MB
Download sde-external-6.12.0-2013-11-16-win.tar.bz2	7.88 MB
Download 2013-10-29-mpx-runtime-external-lin.tar.bz2	17.86 KB
Download 2013-10-29-mpx-runtime-external-win-0.zip	58.07 KB

↧

Buy or Renew Intel® Software Development Products

March 27, 2012, 8:04 am

Intel offers several licensing options for our software development products. Review the choices below to buy or renew Intel® software.

30 day evaluation versions of Intel® Software Development Products are also available for free download. Visit our Software Evaluation Center to download free evaluation versions of the products.

All prices listed below are for single developer commercial licenses. All prices are Manufacturer Suggested List Prices (MSRP) and subject to change without notice. Prices do NOT include Value Added Taxes (VAT) or any other state or local taxes or charges.

For floating licenses, node-locked licenses, or other licensing options, contact a reseller, or contact an Intel representative at intel.software.sales@intel.com.
To purchase an academic research license, please select your desired product and the discounted price will be displayed during check out. For additional information on all of our education offerings, visit our Education Offerings Center, or contact an Intel representative at academicdevelopersinfo@intel.com.
Support Renewal extends your support for one year from the expiration date of your current support agreement.
Existing customers can take advantage of special upgrade prices for Intel® Parallel Studio XE, Intel® C++ Studio XE or Intel® Fortran Studio XE. See details of upgrade offer.

Category Name	Product MSRP (Single user)	Support Renewal MSRP (Single User)	Options
Product Suites
Intel® Parallel Studio XE for Windows* Includes Intel® Composer XE, Intel® VTune™ Amplifier XE, Intel® Inspector XE, Intel® Advisor XE	$2,299	$799**	Find a reseller › See all options ›
Intel® Parallel Studio XE for Linux* Includes Intel® Composer XE, Intel® VTune™ Amplifier XE, Intel® Inspector XE, Intel® Advisor XE	$2,299	$799**	Find a reseller › See all options ›
Intel® C++ Studio XE for Windows or Linux Includes Intel® C++ Composer XE, Intel® VTune™ Amplifier XE, Intel® Inspector XE, Intel® Advisor XE	$1,599	$599**	Find a reseller › See all options ›
Intel® Visual Fortran Studio XE for Windows Includes Intel® Visual Fortran Composer XE, Intel® VTune™ Amplifier XE, Intel® Inspector XE, Intel® Advisor XE	$1,899	$699**	Find a reseller › See all options ›
Intel® Fortran Studio XE for Linux Includes Intel® Fortran Composer XE, Intel® VTune™ Amplifier XE, Intel® Inspector XE, Intel® Advisor XE	$1,899	$699**	Find a reseller › See all options ›
Intel® Parallel Studio Includes Intel® Parallel Advisor, Intel® Parallel Amplifier, Intel® Parallel Composer, Intel® Parallel Inspector		$320	Find a reseller › See all options ›
Intel® Cluster Studio XE for Windows Includes Intel® C++ Composer XE, Intel® Visual Fortran Composer XE, Intel® Trace Analyzer and Collector, Intel® MPI Library, Intel® MPI Benchmarks, Intel® Inspector XE and Intel® VTune™ Amplifier XE, Intel® Advisor XE	$2,949	$1,049**	Find a reseller › See all options ›
Intel® Cluster Studio XE for Linux Includes Intel® C++ Composer XE, Intel® Fortran Composer XE, Intel® Trace Analyzer and Collector, Intel® MPI Library, Intel® MPI Benchmarks, Intel® Inspector XE and Intel® VTune™ Amplifier XE, Intel® Advisor XE	$2,949	$1,049**	Find a reseller › See all options ›
Intel® Cluster Studio for Windows Includes Intel® Composer XE, Intel® Trace Analyzer and Collector, Intel® MPI Library, Intel® MPI Benchmarks	$2,049	$749**	Find a reseller › See all options ›
Intel® Cluster Studio for Linux Includes Intel® Composer XE, Intel® Trace Analyzer and Collector, Intel® MPI Library, Intel® MPI Benchmarks	$2,049	$749**	Find a reseller › See all options ›
Intel® System Studio for Linux* including JTAG Debugger Includes Intel® VTune™ Amplifier, Intel® Inspector, Intel® C++ Compiler, Intel® Integrated Performance Primitives, Intel® Math Kernel Library, Intel® JTAG Debugger, GDB* Debugger	$3,499	$1,299	Find a reseller › See all options ›
Intel® System Studio for Linux* Includes Intel® VTune™ Amplifier, Intel® Inspector, Intel® C++ Compiler, Intel® Integrated Performance Primitives, Intel® Math Kernel Library, GDB* Debugger	$1,999	$699	Find a reseller › See all options ›
Compilers and Libraries
Intel® Composer XE for Windows Includes Intel® C++ Composer XE, Intel® Visual Fortran Composer XE	$1,199	$449**	Find a reseller › See all options ›
Intel® Composer XE for Linux Includes Intel® C++ Composer XE, Intel® Fortran Composer XE	$1,449	$499**	Find a reseller › See all options ›
Intel® C++ Composer XE for Windows, Linux, or OS X* Includes Intel® C++ Compiler, Intel® Integrated Performance Primitives 8.0, Intel® Math Kernel Library, Intel® Parallel Building Blocks	$699	$249**	Find a reseller › See all options ›
Intel® Visual Fortran Composer XE for Windows Includes Intel® Visual Fortran Compiler, Intel® Math Kernel Library	$849	$299**	Find a reseller › See all options ›
Intel® Fortran Composer XE for Linux Includes Intel® Fortran Compiler, Intel® Math Kernel Library	$999	$349**	Find a reseller › See all options ›
Intel® Fortran Composer XE for OS X Includes Intel® Fortran Compiler, Intel® Math Kernel Library	$849	$299**	Find a reseller › See all options ›
Intel® C++ Compiler for Android*	$79.95	N/A	Find a reseller › See all options ›
Intel® C++ Compiler Professional Edition for QNX Neutrino* RTOS Support Includes Intel® C++ Compiler, Intel® Integrated Performance Primitives 8.0	$599	$240	See all options ›
Intel® C Compiler for EFI Byte Code	$995	$398	Find a reseller › See all options ›
Intel® Visual Fortran Composer XE for Windows with IMSL 6.0* Includes Intel® Visual Fortran Compiler, IMSL* Fortran Numerical Library, Intel® Math Kernel Library Includes 1 developer and 1 deployment license for the developer.	$2,049	$849**	Find a reseller › See all options ›
Embedded and Mobile System Development
Intel® System Studio for Linux* including JTAG Debugger Includes Intel® VTune™ Amplifier, Intel® Inspector, Intel® C++ Compiler, Intel® Integrated Performance Primitives, Intel® Math Kernel Library, Intel® JTAG Debugger, GDB* Debugger	$3,499	$1,299	Find a reseller › See all options ›
Intel® System Studio for Linux* Includes Intel® VTune™ Amplifier, Intel® Inspector, Intel® C++ Compiler, Intel® Integrated Performance Primitives, Intel® Math Kernel Library, GDB* Debugger	$1,999	$699	Find a reseller › See all options ›
Performance Libraries
Intel® Integrated Performance Primitives 8.0 for Windows, Linux, or OS X	$199	$69**	Find a reseller › See all options ›
Intel® Math Kernel Library for Windows or Linux	$499	$179**	Find a reseller › See all options ›
Intel® Threading Building Blocks for Windows, Linux, or OS X	$499	$179**	Find a reseller › See all options ›
Performance Profilers
Intel® VTune™ Amplifier XE for Windows or Linux	$899	$349**	Find a reseller › See all options ›
Thread and Memory Checkers
Intel® Inspector XE for Windows or Linux	$899	$349**	Find a reseller › See all options ›
Cluster Tools
Intel® Cluster Studio XE for Windows Includes Intel® C++ Composer XE, Intel® Fortran Composer XE, Intel® Trace Analyzer and Collector, Intel® MPI Library, Intel® MPI Benchmarks, Intel® Inspector XE and Intel® VTune™ Amplifier XE, Intel® Advisor XE	$2,949	$1,049**	Find a reseller › See all options ›
Intel® Cluster Studio XE for Linux Includes Intel® C++ Composer XE, Intel® Fortran Composer XE, Intel® Trace Analyzer and Collector, Intel® MPI Library, Intel® MPI Benchmarks, Intel® Inspector XE and Intel® VTune™ Amplifier XE, Intel® Advisor XE	$2,949	$1,049**	Find a reseller › See all options ›
Intel® Cluster Studio for Windows Includes Intel® Composer XE, Intel® Trace Analyzer and Collector, Intel® MPI Library, Intel® MPI Benchmarks	$2,049	$749**	Find a reseller › See all options ›
Intel® Cluster Studio for Linux Includes Intel® Composer XE, Intel® Trace Analyzer and Collector, Intel® MPI Library, Intel® MPI Benchmarks	$2,049	$749**	Find a reseller › See all options ›
Intel® MPI Library for Windows or Linux	$499	$179**	Find a reseller › See all options ›
System Modeling and Simulation Tools
CoFluent Studio*	N/A	N/A	See all options ›
CoFluent Reader	N/A	N/A	See all options ›
Intel® Graphics Performance Analyzers
Intel® Graphics Performance Analyzers	N/A	N/A	See all options ›

**Lowest Price available if you renew prior to current subscription expiration. For more information on renewals click here.

Intel takes your privacy seriously. Refer to Intel's Privacy Notice and Serial Number Validation Notice regarding the collection and handling of your personal information, the Intel product’s serial number and other information.

↧

Intel® Software Development Emulator Release Notes

June 15, 2012, 8:44 am

2013-11-16 version 6.12.0

Added support for Mac OS X version 10.9.
Improved the TSX statistics information.
Various fixes with the emulation of floating-point instructions of Intel AVX-512.
Enabled the alignment checker tool by default for instructions that require alignment.
Fixed mismatch between mix and dynamic mask profiler.
Updated the Intel MPX runtime libraries for Windows.
Performance improvements when modeling a CPU prior to AVX-512.

2013-09-21 version 6.7.0

Debugging with GDB is now supported with Intel® AVX-512. Download the new GDB from here.
Emulation of Intel® AVX2 FMA and Intel AVX-512 FMA uses native FMA instructions when running on Haswell host.
Various fixes with the emulation of floating-point and conversion instructions of Intel AVX-512.
Disassembly of control transfer instructions displays the 'bnd' prefix when used with Intel® MPX.
Updated the XED ISA set names for Intel AVX-512. This is visible in 'mix' statistics output.
This release goes with 2013-08-29 version of the Intel MPX runtime.

2013-07-22 version 6.1.0

Emulation support for the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions present on the Intel Knights Landing microarchitecture.
Emulation support for the Intel® Secure Hash Algorithm (Intel® SHA) extensions present on the Intel Goldmont microarchitecture.
Emulation support for the Intel® Memory Protection Extensions (Intel® MPX) present on the Intel Skylake and Goldmont microarchitectures.
Support for Hardware Lock Elision introduced on the Intel Haswell microarchitecture.
Improved support for Restricted Transactional Memory introduced on the Intel Haswell microarchitecture.
Improved support for the OS X* operating system (Mountain Lion).
The footprint tool now has the ability to compute footprint over time for working-set estimation.
A new tool called the dynamic mask profiler is provided using -dyn_mask_profile knob. The output is in a simple XML format.

The Intel SDE development team has grown to include Michael Berezalsky, Mark Charney, Michael Gorin, Omer Mor, Ariel Slonim and Ady Tal.

2013-01-03 version 5.38

Improvements in RTM emulation stability. Added statistics knobs. Updated knobs.
Support for debugging integration with Microsoft Visual Studio 2012. See main page for information.
Improved multithreaded stability when using the AVX/SSE transition checker
Mac OS X: support for code-signed binaries, simplifying execution. See main page for information about the "taskport".
XED: added elf/dwarf support back to the command line tool
TZCNT ZF flags fix

2012-11-01 version 5.31 - major update

Major update including fixes for the processor codenamed Haswell and introduction of instructions in the processor codenamed Broadwell
First public SDE release for OS X, 10.6 and 10.7. See additional information on the main Intel SDE web page for required permissions.
HSW's RTM mode is supported with the "-rtm-mode full" option. This feature is very new and the Intel SDE implementation might be a little unstable.
Completely new mechanism for handling of CPUID. CPUID values now come from an input file.
SDE's -chip-check feature checks to make sure instructions are valid for the specified chip. See "sde -help" for the various chip options.
Exception handling fixes
Haswell BMI emulation fixes, including flags output.
Debugtrace multithreading safety improvements
Mix top-blocks sorting issues. Mix also has better support for allocating stats to overlapping blocks.
Mix default blocks size is now 1500 instructions to avoid fragmenting large hot blocks.
XED now can emit "dot" graphs for specified regions: path-to-sde-kit/xed -i SOMEEXE -as 0x40316b -ae 0x4031b3 -dot foo.dot; dot -O -Tpdf foo.dot
Mix has a legacy-prefix histogram.
Footprint tool can now collect stats about unique memory pages as well as unique cache lines. The footprint tool is now faster as well.
Improved speed of AVX/SSE transition checker by roughly 12%. See the -ast knob in "sde -thelp".
Fixed some numerical errors in our software emulation of the FMA instruction for denormal numbers.
Various stability improvements from using a newer version of Pin.
Better handling of MXCSR exception status bits for AVX1/2 instructions. We still do not support raising unmasked floating point errors from emulated instructions.
Can now set environment variables from the command line with the -env VAR VALUE option.
The commands for the GDB interface have been updated. See "monitor help sde" when attached as described on the main page. Please use GDB 7.4 or later.
The chip check error message includes the instruction bytes of the offending instruction.
Multiprocess output file handling. You used to have to supply "-i" to get the process ID inserted into the file name, to prevent multiprocess applications from overwriting the common output files. Now we attempt to detect the creation of other processes and add the PID to the file names automatically. The parent/child relationship is recorded in the file name.
Better support for unused bits in the VEX encodings in 32b mode.

2011-12-15 version 4.46

Linux* 3.x is supported
Better support for running on Intel® AVX-enabled hosts
All output files now begin "sde-" and end with ".txt" by default
Mix is faster and does more analysis of SIMD operations
Mix has line number support for the top blocks when the information is available in the application
The -ptr-chk option now checks the memory references of gather operations
Fixed a file descriptor leak when exec'ing thousands of threads on Linux*.
Misc other stability improvements.

2011-07-01 version 4.29

Support for the Haswell new instructions in the Intel AVX programmers reference version 11.
Mix now includes category and instruction length histograms automatically so the corresponding knobs were removed.
Many other changes

2010-12-23 version 3.89 (Linux* only)

Fixed runtime libraries. Version 3.88 accidentally included runtime libraries that require a newer version of glibc than is present on older systems (like RHEL4).

2010-12-21 version 3.88

Support for the post-32nm processor instructions for the processor codenamed Ivy Bridge in the 008 revision of the Intel AVX programmers reference document
Many stability improvements
"sde -thelp" goes to stdout, not stderr
mix has a "-demangle 0" option to turn off demangling
xed disassembler handles uninitialized code sections in windows binaries
xed supports dwarf line number information with the -line knob on Linux*.
mix has improved memory efficiency
To debug on Linux*, you no longer need the -avx-gdb knob but you must use gdb 7.2 or later which supports a new XML remote-debug protocol.

2010-03-11 version 3.09

When pin or sde crashes due to bugs in user applications, the output of the circular buffer used for -itrace-execute (etc.) was not being dumped to disk. It is now.
Fixed circular buffer used for -itrace-execute and -itrace-execute-emulate. It was not initializing the circular buffer when -itrace-lines was used and would just crash immediately. In addition to *actually* making the feature work, I sped it up immensely by reusing allocated string buffers.
Fixed 14 scalar Intel AVX instructions that were referencing too much memory (128b instead of 32b or 64b).
Made the xsave emulator be enabled all the time even when xsave is present on the hardware. One can disable it with '-xsave 0'.
All output log / stats file names now end in .txt by default.
Added a descriptive header to the top of the Intel AVX/Intel SSE transition output file.
debugtrace now prints mmx (and x87) register values
vmaskmov* instructions are now implemented in a thread-safe way.
vpmov[sz]x instructions now correctly reference less memory to avoid extra page accesses.
New memory pointer checker. This option checks all memory references for accessibility before the user application program is allowed to access memory. There is also a null pointer checker, which previously would only check Intel AVX instructions. The null checker writes to stderr (if accessible) and to a file sde-null-check.out.txt. The pointer checker writes to stderr (if accessible) and to a file sde-ptr-check.out.txt. The new knobs are: -null-check and -ptr-check
Enforcing VL=128 on any Intel AVX scalar instructions.
Fixes for the -no-avx and -no-aes knobs in the sde driver
xed: many corner case bugs fixed after yet another validation review

2010-02-08 version 3.00

Changed output files to have .txt suffix.
debugtrace prints x87 and mmx registers
thread-safety fix for vmaskmov* instructions
reduced amount of memory referenced by vpmov[sz]* instructions.
New memory pointer checker (See -ptr-check and -null-check knobs)
Added VL=128 requirement for Intel AVX scalar instructions.
Fixed knobs -no-avx and -no-aes in the sde front end driver

2009-12-31 version 2.94
Major update.

Better support for recent Linux* distributions, like Ubuntu* 9.10.
Better support for debugging with GDB on Linux*.
Using GDB 7.0.50, and "sde -debug -avx-gdb -- yourapp", gdb can directly obtain Intel AVX register values without requiring "monitor yreg N" or "monitor yregs" commands.
Windows version supports latest dbghelp.dll 6.11.1.404
Fixes for paths with spaces
Using Pin's "safecopy" mechanism to access user memory
Spelling fixes
Tool arguments grouped more sensibly; See the output of "sde -thelp"
Support for Intel AVX unmasked zero divide exceptions on Windows
Intel AVX/Intel SSE transition tracing feature with -ast-trace knob
Intel AVX/Intel SSE transition checker emits previous block information
CPUID leaf-zero emulation support
Alignment checker upgrades
XED disassembler supports windows debugging symbols (via dbghelp.dll)
Fix for NaN case in Intel® SSE4.1 roundss on Linux* only
Fix for Intel® SSE4 PEXTRW gpr,xmm
More CPUID feature knobs for Intel® SSE technologies
Fix for case emulation of FMA single precision that affected accuracy
Support for FZ and DAZ in FMA routines
Data watch point support
Fix for MXCSR.OE and IE for vcomiss/vucomiss on NaN inputs
New chip-check feature to restrict instructions to specific chips. See "sde -thelp"
Fast icounting feature (faster than using mix)
Fixes for NaN issues on Windows with sqrt, mul, div, sub and cmp - it was quieting SNaNs.
Upgraded pin can execute applications with illegal instructions, and an application-installed handler will be invoked.
New -itrace* knobs
Circular buffer support in debugtrace

2009-01-30 version 1.70

Added VPCLMULQDQ

2009-01-09 version 1.61

Synchronizing with Intel AVX architecture update.

New 3-operand FMA instructions, removed VPERMIL2{PS,PD}, miscellaneous bug fixes.

New footprint feature.

Rearranged mix output, added function summaries.

New version of dbghelp.dll required for windows (See the FAQ).

2008-08-10 version 1.13

Initial Release

Intel Transactional Synchronization Extensions (Intel TSX)

↧

Exploring Intel® Transactional Synchronization Extensions with Intel® Software Development Emulator

November 6, 2012, 6:12 am


Intel® Transactional Synchronization Extensions (Intel® TSX) is perhaps one of the most non-trivial extensions of the instruction set architecture introduced in the 4th generation Intel® Core™ microarchitecture, code name Haswell. Intel® TSX implements hardware support for a best-effort “transactional memory”, which is a simpler mechanism for scalable thread synchronization than inherently complex fine-grained locking or lock-free algorithms. The extensions have two interfaces: Hardware Lock Elision (HLE) and Restricted Transactional Memory (RTM).

In this blog I will show how you can write your first RTM code and execute it in an emulated environment now, without waiting until the 4th generation Intel® Core™ processors become available for purchase.

Before diving in, please make sure you have a basic understanding of the new RTM instructions. I refer you to this blog as an introduction. Also check out the Intel Developer Forum’12 presentation by Ravi Rajwar and Martin Dixon discussing the details of the Intel TSX implementation in Haswell hardware, and a presentation by Andi Kleen on adding lock elision (also using RTM) to Linux.

My plan was to write a toy bank account processing application using popular C++ thread-unaware data structures from STL with concurrent access to bank records managed by Intel TSX. This way the implementation should be very simple, thread-safe and scalable.

Development Environment

For this experiment one needs the recent version (5.31) of Intel® Software Development Emulator (Intel® SDE) and a compiler that can generate RTM instructions (via intrinsics or direct machine code). Please note that performance measurements with Intel SDE running RTM are of limited value because the overhead of emulating TM in software instead of using real hardware is huge, but as you will see later Intel SDE can already demonstrate important points for RTM usage for concurrency library developers and application programmers.

Since my laptop runs Windows, I decided to try Intel SDE/RTM on Windows. I chose the C++ compiler from “Microsoft Visual Studio 2012 for Windows Desktop” (there is a free “Express” version that works for my purpose too). With a few clicks I quickly set up a console application project and included the immintrin.h header in the main .cpp file to use the RTM intrinsics.

The Test

As the bank account structure I chose a simple std::vector<int> from the C++ standard template library. “Accounts[i]” stores the current balance for account number i. This is a very simple and popular but thread-unsafe data structure, which must be protected by concurrency control mechanisms for parallel access. Usually locks/mutexes are used to limit the number of threads accessing the structure simultaneously. However, for parallel write accesses the whole data structure is usually locked exclusively, even if distinct parts of it have to be updated. Intel TSX should help here since it can optimistically execute writes, and if no real data conflict happens, the writes are committed without serializing.

To simplify the operations on the accounts I wanted to implement an easy-to-use C++ wrapper for protecting the current C++ scope from unsafe concurrent access to the data:

{
        std::cout << "open new account"<< std::endl;
        TransactionScope guard; // protect everything in this scope
        Accounts.push_back(0);
}
{
        std::cout << "open new account"<< std::endl;
        TransactionScope guard; // protect everything in this scope
        Accounts.push_back(0);
}
{
        std::cout << "put 100 units into account 0"<<std::endl;
        TransactionScope guard; // protect everything in this scope
        Accounts[0] += 100; // atomic update due to RTM
}
{
        std::cout << "transfer 10 units from account 0 to account 1 atomically!"<< std::endl;
        TransactionScope guard; // protect everything in this scope
        Accounts[0] -= 10;
        Accounts[1] += 10;
}
{
        std::cout << "atomically draw 10 units from account 0 if there is enough money"<< std::endl;
        TransactionScope guard; // protect everything in this scope
        if(Accounts[0] >= 10) Accounts[0] -= 10;
}
{
        std::cout << "add 1000 empty accounts atomically"<< std::endl;
        TransactionScope guard; // protect everything in this scope
        Accounts.resize(Accounts.size() + 1000, 0);
}

Legacy applications implement such guards using a lock that allows only a single writer to execute the critical section (read-write locks are more complicated to handle and also do not make much sense here in our case because all accesses are writes/updates):

class TransactionScope
{
        SimpleSpinLock & lock;
        TransactionScope(); // forbidden
public:
        TransactionScope(SimpleSpinLock & lock_): lock(lock_) { lock.lock(); }
        ~TransactionScope() { lock.unlock(); }
};
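The SimpleSpinLock class used above is not listed in this post (it is part of the attached source). As a rough sketch only, assuming nothing beyond the three member functions the article calls (lock, unlock, isLocked), such a lock could be built on std::atomic. The implementation below and the spinLockDemo helper are hypothetical stand-ins, not the author's code:

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for the article's SimpleSpinLock; the class in
// the attached source may be implemented differently.
class SimpleSpinLock
{
        std::atomic<bool> locked;
        SimpleSpinLock(const SimpleSpinLock &);             // forbidden
        SimpleSpinLock & operator=(const SimpleSpinLock &); // forbidden
public:
        SimpleSpinLock() : locked(false) {}

        void lock()
        {
                // exchange() returns the previous value: true means that
                // somebody else holds the lock, so keep spinning
                while(locked.exchange(true, std::memory_order_acquire))
                {
                        // wait with plain loads until the lock looks free
                        while(locked.load(std::memory_order_relaxed)) { }
                }
        }

        void unlock() { locked.store(false, std::memory_order_release); }

        bool isLocked() const { return locked.load(std::memory_order_acquire); }
};

// Single-threaded walk through the lock states, for illustration only
inline void spinLockDemo()
{
        SimpleSpinLock l;
        assert(!l.isLocked());
        l.lock();
        assert(l.isLocked());
        l.unlock();
        assert(!l.isLocked());
}
```

The isLocked() query matters for the RTM variants later in the post: reading the lock state inside a transaction is what lets a transaction notice that another thread has taken the fall-back lock.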

Implementing and Testing with RTM

A naive RTM implementation for TransactionScope (handling both read/lookup and write/update accesses transparently) would be (changed lines are marked with █):

class TransactionScope
{
public:
        TransactionScope()
        {
█               int nretries = 0;
█               while(1)
█               {
█                       ++nretries;
█                       unsigned status = _xbegin();
█                       if(status == _XBEGIN_STARTED) return; // successful start
█                       // abort handler
█                       std::cout << "DEBUG: Transaction aborted "<< nretries <<
█                          " time(s) with the status "<< status << std::endl;
█               }
        }
█       ~TransactionScope() { _xend(); }
};

I have successfully compiled this code and tried to run it through Intel SDE:

./sde-bdw-external-5.31.0-2012-11-01-win/sde.exe -hsw -rtm-mode full -- ./ConsoleApplication1.exe
open new account
DEBUG: Transaction aborted 1 time(s) with the status 0
DEBUG: Transaction aborted 2 time(s) with the status 0
DEBUG: Transaction aborted 3 time(s) with the status 0
DEBUG: Transaction aborted 4 time(s) with the status 0
DEBUG: Transaction aborted 5 time(s) with the status 0
DEBUG: Transaction aborted 6 time(s) with the status 0
DEBUG: Transaction aborted 7 time(s) with the status 0
DEBUG: Transaction aborted 8 time(s) with the status 0
DEBUG: Transaction aborted 9 time(s) with the status 0
DEBUG: Transaction aborted 10 time(s) with the status 0
DEBUG: Transaction aborted 11 time(s) with the status 0
DEBUG: Transaction aborted 12 time(s) with the status 0
DEBUG: Transaction aborted 13 time(s) with the status 0
DEBUG: Transaction aborted 14 time(s) with the status 0
DEBUG: Transaction aborted 15 time(s) with the status 0
DEBUG: Transaction aborted 16 time(s) with the status 0

and so on…

The program went into an infinite loop, always aborting on the first transaction. The RTM debug log from Intel SDE (emx-rtm.txt, produced with the option “-rtm_debug_log 2”) confirmed that. Well, as a general rule, failure is more or less expected for any implementation that ignores the specification… The Intel® Architecture Instruction Set Extensions Programming Reference explicitly mentions that “the hardware provides no guarantees as to whether an RTM region will ever successfully commit transactionally”. Because of that, software using RTM must provide a (non-transactional) fall-back path that is executed if (many) aborts happen. (By the way, HLE provides the fall-back automatically: on the first abort, the same critical section is executed non-transactionally.)

Implementing Fall-Back

Here is our second attempt, which acquires a fall-back spin lock non-transactionally after a specified number of retries.

LONGLONG naborted = 0; // global abort statistics, alternatively use the "-rtm_debug_log 2" Intel SDE option
 
class TransactionScope
{
█       SimpleSpinLock & fallBackLock;
        TransactionScope(); // forbidden
public:
█       TransactionScope(SimpleSpinLock & fallBackLock_, int max_retries = 3) :
█               fallBackLock(fallBackLock_)
        {
                int nretries = 0;
                while(1)
                {
                        ++nretries;
                        unsigned status = _xbegin();
                        if(status == _XBEGIN_STARTED)
                        {
█                               if(!fallBackLock.isLocked())
█                                         return; // successfully started transaction
█                               /* started transaction but someone is executing 
█                                  the transaction section non-speculatively (acquired
█                                  the fall-back lock) -> aborting */
█                               _xabort(0xff); // abort with code 0xff
                        }
                        // abort handler
                        InterlockedIncrement64(&naborted); // do abort statistics
                        std::cout << "DEBUG: Transaction aborted "<< nretries <<
                              " time(s) with the status "<< status << std::endl;
█                       // handle _xabort(0xff) from above
█                       if((status & _XABORT_EXPLICIT) && _XABORT_CODE(status)==0xff
█                            && !(status & _XABORT_NESTED))
█                       {       // wait until the lock is free
█                               while(fallBackLock.isLocked()) _mm_pause();
█                       }
█                       // too many retries, take the fall-back lock
█                       if(nretries >= max_retries) break;
                }
█               fallBackLock.lock();
        }
        ~TransactionScope()
        {
█               if(fallBackLock.isLocked())
█                       fallBackLock.unlock();
█               else
                        _xend();
        }
};

The output looks much better now:

open new account
DEBUG: Transaction aborted 1 time(s) with the status 0
DEBUG: Transaction aborted 2 time(s) with the status 0
DEBUG: Transaction aborted 3 time(s) with the status 0
open new account
put 100 units into account 0
transfer 10 units from account 0 to account 1 atomically!
atomically draw 10 units from account 0 if there is enough money
add 1000 empty accounts atomically

One can see that all transactions except the first one succeeded on the very first attempt. The first one took the fall-back lock after three attempts. It was special since it had to reserve and touch new memory for the vector from the operating system. This is a very complex process involving system calls, privilege ring transitions (ring 3 [application] -> ring 0 [OS]), page faults and initialization/zeroing of very big chunks of memory which may not fit into the transactional buffer. All of this may cause aborts according to the Intel® Architecture Instruction Set Extensions Programming Reference.
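One idea that follows from this observation, offered as an untested assumption rather than something measured in this post, is to do the vector's memory allocation ahead of time, outside of any transactional region, so that a later push_back only writes into storage that already exists. The sketch below shows just the standard C++ part of that idea; the helper names preallocateAccounts and canOpenAccountWithoutAllocation are made up for illustration. Note that reserve() does not touch the new pages, so page faults on first use could still cause aborts:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

std::vector<int> Accounts;

// Grow the capacity up front, before any TransactionScope is opened.
// (Hypothetical helper, not part of the article's source.)
void preallocateAccounts(std::size_t expected)
{
        Accounts.reserve(expected);
}

// True if the next push_back fits into already reserved storage and
// therefore does not need to ask the allocator for memory.
bool canOpenAccountWithoutAllocation()
{
        return Accounts.size() < Accounts.capacity();
}

// Illustration: after reserving, opening accounts leaves capacity alone
inline void preallocationDemo()
{
        preallocateAccounts(16);
        const std::size_t cap = Accounts.capacity();
        assert(canOpenAccountWithoutAllocation());
        Accounts.push_back(0);
        assert(Accounts.capacity() == cap);
}
```

Whether this actually reduces the abort count under Intel SDE or on real hardware would have to be checked with the abort statistics described in this post.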

Leveraging RTM Abort Status Bits

A further optimization that I came up with is leveraging the abort status information: in case of such “hard” aborts the “retry” bit (position 1) in the abort status is not set. The bit is set if hardware thinks the transaction may succeed on retry. I added the line marked below in the abort handler to implement it:

 // handle _xabort(0xff) from above
 if((status & _XABORT_EXPLICIT) && _XABORT_CODE(status)==0xff
      && !(status & _XABORT_NESTED))
 {
        while(fallBackLock.isLocked()) _mm_pause(); // wait until lock is free

█} else if(!(status & _XABORT_RETRY)) break; /* take the fall-back lock
    if the retry abort flag is not set */

The output:

open new account
DEBUG: Transaction aborted 1 time(s) with the status 0
open new account
put 100 units into account 0
transfer 10 units from account 0 to account 1 atomically!
atomically draw 10 units from account 0 if there is enough money
add 1000 empty accounts atomically

Now we see that the program makes faster progress by taking the fall-back lock sooner in the case of a “hard” abort.
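To make the status values in the debug output easier to read, here is a small decoding sketch. It mirrors the bit layout of the _XABORT_* macros from immintrin.h (explicit = bit 0, retry = bit 1, conflict = bit 2, capacity = bit 3, debug = bit 4, nested = bit 5, abort code in bits 31:24), but redefines them under hypothetical ABORT_* names so that it compiles even with a compiler that has no RTM support. The describeAbortStatus and abortCode helpers are illustrative, not part of the attached source:

```cpp
#include <cassert>
#include <string>

// Same bit positions as the _XABORT_* macros; names are hypothetical
const unsigned ABORT_EXPLICIT = 1u << 0; // aborted by _xabort()
const unsigned ABORT_RETRY    = 1u << 1; // may succeed on retry
const unsigned ABORT_CONFLICT = 1u << 2; // memory access conflict
const unsigned ABORT_CAPACITY = 1u << 3; // transactional buffer overflow
const unsigned ABORT_DEBUG    = 1u << 4; // debug breakpoint hit
const unsigned ABORT_NESTED   = 1u << 5; // abort inside a nested transaction

// The 8-bit argument of _xabort(), valid when ABORT_EXPLICIT is set
unsigned abortCode(unsigned status) { return (status >> 24) & 0xff; }

// Turn a status word into a space-separated list of flag names
std::string describeAbortStatus(unsigned status)
{
        std::string s;
        if(status & ABORT_EXPLICIT) s += "explicit ";
        if(status & ABORT_RETRY)    s += "retry ";
        if(status & ABORT_CONFLICT) s += "conflict ";
        if(status & ABORT_CAPACITY) s += "capacity ";
        if(status & ABORT_DEBUG)    s += "debug ";
        if(status & ABORT_NESTED)   s += "nested ";
        if(s.empty()) return "none";
        s.erase(s.size() - 1); // drop the trailing space
        return s;
}

// Illustration: the status 0 printed above carries no flags at all
inline void abortStatusDemo()
{
        assert(describeAbortStatus(0) == "none");
}
```

With this helper the "status 0" lines above read as "none": no flag is set, including the retry bit, which is why taking the fall-back lock immediately is the sensible reaction.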

As you may notice, the changes so far were isolated within a single synchronization interface, TransactionScope. The application code was not changed. As generally available TSX software infrastructure evolves, you should look for a proven existing library that has (scope) locks with RTM support to avoid pitfalls in your synchronization primitives (we will talk about pitfalls in application code in future blogs). For example, a TSX-enabled pthread library for Linux is already available. On the other hand, it is not uncommon for existing applications to use extended or custom synchronization interfaces; converting them to take advantage of TSX is not a complicated task either, if done with care.

Concurrent Accesses from Several Threads Managed by Intel TSX

After basic debugging the time has come to see the real power of Intel TSX: run two worker threads doing random concurrent updates to the central account data structure:

unsigned __stdcall thread_worker(void * arg)
{
        int thread_nr = (int) arg;
        std::cout << "Thread "<< thread_nr<< " started."<< std::endl;
        // create thread-local TR1 C++ random generator from <random>
        std::tr1::minstd_rand myRand(thread_nr); 
        long int loops = 10000;
 
        while(--loops)
        {
                {
                        TransactionScope guard(globalFallBackLock);
                        // put 100 units into a random account atomically
                        Accounts[myRand() % Accounts.size()] += 100;
                }
 
                {
                        TransactionScope guard(globalFallBackLock);
                        /* transfer 100 units between random accounts 
                           (if there is enough money) atomically */
                        int a = myRand() % Accounts.size();
                        int b = myRand() % Accounts.size();
                        if(Accounts[a] >= 100)
                        {
                                Accounts[a] -= 100;
                                Accounts[b] += 100;
                        }
                }
        }
        std::cout << "Thread "<< thread_nr<< " finished."<< std::endl;
        return 0;
}

I built a Release build without the DEBUG output and saw only about 100-300 aborts for a total of 20000 transactions. The debug output says that the abort status is 6: the retry and “memory access conflict” bits are set. This is exactly what I expected from Intel TSX: almost all updates are done in parallel and only a few have been rolled back due to a conflict.

To double-check that my conclusions are right and that the emulator works as I expected, I added an increment/update of a global counter in the transactions to introduce a huge number of conflicting accesses. And yes, it worked: with that change I saw about 5-15K aborts. Although the absolute numbers obtained from the RTM emulator cannot exactly predict the execution metrics on future hardware, the orders of magnitude should still indicate possible issues with RTM usage.

Last Words

These were my experiences with RTM and the new Intel® Software Development Emulator. Get prepared for Haswell and check out how your software can use Restricted Transactional Memory with Intel SDE now!

Roman

(the complete source code is attached to the article)


Attachments:

http://software.intel.com/sites/default/files/blog/335035/exploringinteltsx.cpp


↧

Building and Simulating an App using the HTML5 Development Environment Beta

November 26, 2012, 2:10 pm


The HTML5 section within the Intel Developer Zone was updated just before the US Thanksgiving holiday to release the new Intel® HTML5 Development Environment Beta, and I tried out a few of the sample apps. It took me about fifteen minutes to get one of the samples packaged into an .apk file and running on my Android tablet. This is the first cloud-based tool set that Intel has provided, and it's a basic solution for learning HTML5 and developing real cross-platform apps that can be submitted to the iOS, Google* Play, or even a Blackberry* app store.

Before you get started.

There are a few things you need to do before you can get to this 15 minute example. Some are necessary, some just recommended, and some you may already have done, but all are free.

#1 You need to log in to the Intel Developer Zone, or register if you are new: Click at the top right of the page you are reading now.

#2 Request an account for the HTML5 Development Environment Beta: Click here to go to the HTML5 page, or go to http://software.intel.com/en-us/form/html5-beta-request

#3 Download Google Chrome: The mobile system emulator in the IDE only runs in Chrome, so you will need this once you get your account.

#4 Get a Github account: https://github.com/signup/free. Recommended, as this can be used by the HTML5 Development Environment Beta, but also as a login ID for Adobe Phonegap.

#5 Get an account with http://build.phonegap.com (or use the Github account you set up in step #4).

If you have any questions or comments about this procedure, or the tools, please post them in the HTML5 Forum section. http://software.intel.com/forums/html5-forum

The Intel® HTML5 Development Environment is enabled with IDZ single sign-on, so once you get your approval email, crank up Chrome, log in to IDZ and click "Launch the Tool" on the HTML5 page.

Be sure to check out the Mobile Device Emulator included with this tool. You can select from multiple screen sizes and orientations and see how your app will look and run on everything from a small phone to a large tablet. The Emulator also includes support for tablet and phone sensors, so you can determine if your app responds correctly to GPS timeouts, for example. See the screenshot below for more details.

And finally, if you can't wait to get your IDE account, you can play some of the HTML5 games in your Chrome Browser while you are waiting, or browse the accompanying articles for the samples that are included.

Stewart Christie is the HTML5 and Tizen App Community Manager. Follow him on twitter @intel_stewart


↧

Purchasing or Renewing Licenses for Intel® Software Development Products

January 29, 2013, 4:28 pm


Intel offers a variety of licensing options for its software development products. The options for purchasing or renewing licenses for Intel® software are listed below.

You can also download free 30-day evaluation versions of Intel® software development products. Free evaluation versions of our products are available from the evaluation software center site.

All prices below are for a commercial single-developer license. All prices listed are the manufacturer's suggested retail prices and are subject to change without notice. Prices do NOT include VAT or other applicable taxes and fees.

For information about floating licenses, node-locked licenses, and other licensing options, contact a reseller or an Intel representative at intel.software.sales@intel.com.
To purchase a license for academic, research, and educational institutions, select the product you need; the discounted price is shown at checkout. For more information about all of our products for educational institutions, visit the academic offerings center site or contact an Intel representative at academicdevelopersinfo@intel.com.
Support renewal: support for one year from the expiration date of the current support agreement.
Existing customers are offered special upgrade pricing for Intel® Parallel Studio XE 2013, Intel® C++ Studio XE 2013, and Intel® Fortran Studio XE 2013. Learn more about upgrade options.

Category	Suggested retail product price (single user)	Suggested retail support renewal price (single user)	Options
Product suites
Intel® Parallel Studio XE 2013 for Windows*. Includes Intel® Composer XE 2013, Intel® VTune™ Amplifier XE 2013, Intel® Inspector XE 2013, and Intel® Advisor XE 2013	US$2,299	US$799**	Find a reseller › All options ›
Intel® Parallel Studio XE 2013 for Linux*. Includes Intel® Composer XE 2013, Intel® VTune™ Amplifier XE 2013, Intel® Inspector XE 2013, and Intel® Advisor XE 2013	US$2,299	US$799**	Find a reseller › All options ›
Intel® C++ Studio XE 2013 for Windows or for Linux. Includes Intel® C++ Composer XE 2013, Intel® VTune™ Amplifier XE 2013, Intel® Inspector XE 2013, and Intel® Advisor XE 2013	US$1,599	US$599**	Find a reseller › All options ›
Intel® Visual Fortran Studio XE 2013 for Windows. Includes Intel® Visual Fortran Composer XE 2013, Intel® VTune™ Amplifier XE 2013, Intel® Inspector XE 2013, and Intel® Advisor XE 2013	US$1,899	US$699**	Find a reseller › All options ›
Intel® Fortran Studio XE 2013 for Linux. Includes Intel® Fortran Composer XE 2013, Intel® VTune™ Amplifier XE 2013, Intel® Inspector XE 2013, and Intel® Advisor XE 2013	US$1,899	US$699**	Find a reseller › All options ›
Intel® Parallel Studio. Includes Intel® Parallel Advisor, Intel® Parallel Amplifier, Intel® Parallel Composer, and Intel® Parallel Inspector		US$320	Find a reseller › All options ›
Intel® Cluster Studio XE 2013 for Windows. Includes Intel® C++ Composer XE 2013, Intel® Visual Fortran Composer XE 2013, Intel® Trace Analyzer and Collector 8.1, Intel® MPI Library 4.1, Intel® MPI Benchmarks, Intel® Inspector XE 2013, Intel® VTune™ Amplifier XE 2013, and Intel® Advisor XE 2013	US$2,949	US$1,049**	Find a reseller › All options ›
Intel® Cluster Studio XE 2013 for Linux. Includes Intel® C++ Composer XE 2013, Intel® Fortran Composer XE 2013, Intel® Trace Analyzer and Collector, Intel® MPI Library 4.1, Intel® MPI Benchmarks, Intel® Inspector XE 2013, Intel® VTune™ Amplifier XE 2013, and Intel® Advisor XE 2013	US$2,949	US$1,049**	Find a reseller › All options ›
Intel® Cluster Studio 2013 for Windows. Includes Intel® Composer XE 2013, Intel® Trace Analyzer and Collector 8.1, Intel® MPI Library 4.1, and Intel® MPI Benchmarks	US$2,049	US$749**	Find a reseller › All options ›
Intel® Cluster Studio 2013 for Linux. Includes Intel® Composer XE 2013, Intel® Trace Analyzer and Collector 8.1, Intel® MPI Library 4.1, and Intel® MPI Benchmarks	US$2,049	US$749**	Find a reseller › All options ›
Compilers and libraries
Intel® Composer XE 2013 for Windows. Includes Intel® C++ Composer XE 2013 and Intel® Visual Fortran Composer XE 2013	US$1,199	US$449**	Find a reseller › All options ›
Intel® Composer XE 2013 for Linux. Includes Intel® C++ Composer XE 2013 and Intel® Fortran Composer XE 2013	US$1,449	US$499**	Find a reseller › All options ›
Intel® C++ Composer XE 2013 for Windows, for Linux, or for OS X*. Includes Intel® C++ Compiler, Intel® Integrated Performance Primitives 7.1, Intel® Math Kernel Library 11.0, and Intel® Parallel Building Blocks	US$699	US$249**	Find a reseller › All options ›
Intel® Visual Fortran Composer XE 2013 for Windows. Includes Intel® Visual Fortran Compiler and Intel® Math Kernel Library 11.0	US$849	US$299**	Find a reseller › All options ›
Intel® Fortran Composer XE 2013 for Linux. Includes Intel® Fortran Compiler and Intel® Math Kernel Library 11.0	US$999	US$349**	Find a reseller › All options ›
Intel® Fortran Composer XE 2013 for OS X. Includes Intel® Fortran Compiler and Intel® Math Kernel Library 11.0	US$849	US$299**	Find a reseller › All options ›
Intel® C++ Compiler Professional Edition with support for the QNX Neutrino* real-time OS. Includes Intel® C++ Compiler and Intel® Integrated Performance Primitives 7.1	US$599	US$240	All options ›
Intel® C Compiler for EFI Byte Code	US$995	US$398	Find a reseller › All options ›
Intel® Visual Fortran Composer XE 2013 for Windows with IMSL 6.0*. Includes Intel® Visual Fortran Compiler, IMSL* Fortran Numerical Library, and Intel® Math Kernel Library 11.0. Includes a license for 1 developer and 1 deployment license intended for the developer. A deployment license is required when providing applications that contain IMSL code to users other than the developers.	US$2,049	US$749**	Find a reseller › All options ›
IMSL* runtime licenses (also called IMSL* deployment licenses). IMSL* licensing questions and answers
Commercial single-user license to run applications with IMSL code on systems with no more than 16 processor cores	US$2,049	US$685	Find a reseller › All options ›
Pack of 10 commercial single-user licenses to run applications with IMSL code on systems with no more than 16 processor cores	US$9,709	US$1,826	Find a reseller › All options ›
Commercial multi-user license to run applications with IMSL code on systems with no more than 64 processor cores	US$13,592	US$2,557	Find a reseller › All options ›
Tools for the Intel® Atom™ processor
Intel® Embedded Software Development Tool Suite for Intel® Atom™ processor. Includes Intel® C++ Compiler, Intel® Application Debugger, Intel® JTAG Debugger, Intel® Integrated Performance Primitives 7.1, and Intel® VTune™ Performance Analyzer	US$1,999	US$799	Find a reseller › All options ›
Performance libraries
Intel® Integrated Performance Primitives 7.1 for Windows or for Linux	US$199	US$69**	Find a reseller › All options ›
Intel® Math Kernel Library 11.0 for Windows or for Linux	US$499	US$179**	Find a reseller › All options ›
Intel® Threading Building Blocks 4.1 for Windows, for Linux, or for OS X	US$499	US$179**	Find a reseller › All options ›
Application performance analyzers
Intel® VTune™ Amplifier XE 2013 for Windows or for Linux	US$899	US$349**	Find a reseller › All options ›
Memory and thread checking tools
Intel® Inspector XE 2013 for Windows or for Linux	US$899	US$349**	Find a reseller › All options ›
Cluster tools
Intel® Cluster Studio XE 2013 for Windows. Includes Intel® C++ Composer XE 2013, Intel® Fortran Composer XE 2013, Intel® Trace Analyzer and Collector, Intel® MPI Library, Intel® MPI Benchmarks, Intel® Inspector XE 2013, Intel® VTune™ Amplifier XE 2013, and Intel® Advisor XE 2013	US$2,949	US$1,049**	Find a reseller › All options ›
Intel® Cluster Studio XE 2013 for Linux. Includes Intel® C++ Composer XE 2013, Intel® Fortran Composer XE 2013, Intel® Trace Analyzer and Collector, Intel® MPI Library 4.1, Intel® MPI Benchmarks, Intel® Inspector XE 2013, Intel® VTune™ Amplifier XE 2013, and Intel® Advisor XE 2013	US$2,949	US$1,049**	Find a reseller › All options ›
Intel® Cluster Studio 2013 for Windows. Includes Intel® Composer XE 2013, Intel® Trace Analyzer and Collector 8.1, Intel® MPI Library 4.1, and Intel® MPI Benchmarks	US$2,049	US$749**	Find a reseller › All options ›
Intel® Cluster Studio 2013 for Linux. Includes Intel® Composer XE 2013, Intel® Trace Analyzer and Collector 8.1, Intel® MPI Library 4.1, and Intel® MPI Benchmarks	US$2,049	US$749**	Find a reseller › All options ›
Intel® MPI Library 4.1 for Windows or for Linux	US$499	US$179**	Find a reseller › All options ›
System modeling and simulation tools
CoFluent Studio*	N/A	N/A	All options ›
CoFluent Reader	N/A	N/A	All options ›
Graphics application profiling and debugging tools
Intel® graphics application profiling and debugging tools	N/A	N/A	All options ›

**Вы можете обновить лицензии по минимальной цене только в том случае, если сделаете это до истечения срока предыдущей подписки. Для получения дополнительной информации о продлении лицензий нажмите здесь.

Корпорация Intel прилагает все необходимые усилия для соблюдения конфиденциальности. Ознакомиться с действующими правилами сбора и обработки личных сведений о заказчиках, данных о серийных номерах продуктов Intel и прочих данных можно в нашем уведомлении о конфиденциальностии в уведомлении о проверке серийных номеров.

↧

Fun with Intel® Transactional Synchronization Extensions

July 25, 2013, 1:32 pm

By now, many of you have heard of Intel® Transactional Synchronization Extensions (Intel® TSX). If you have not, I encourage you to check out this page (http://www.intel.com/software/tsx) before you read further. In a nutshell, Intel TSX provides transactional memory support in hardware, making life easier for developers who write synchronization code for concurrent and parallel applications. It comes in two flavors: Hardware Lock Elision (HLE) and Restricted Transactional Memory (RTM). If you haven’t read the background, go and do so now, since from here on I assume that you have that basic knowledge.

I have been developing a PIN-based emulator for Intel TSX for the past few years. The emulator is now integrated into the Intel Software Development Emulator (Intel SDE). During development, HLE and RTM gave me plenty of grins and grimaces. I would like to share three particularly memorable incidents.

The Incidents

Example 1.

The following codelet is part of a test program written by a colleague of mine who wanted to learn how to use RTM. The array ‘data’ contains integer values, and the array ‘group’ maps each element of ‘data’ to a slot in the array ‘sums’. The test program tries to store the sum of the data belonging to each group in the corresponding slot of ‘sums’. Since multiple threads may access the same slot simultaneously, each addition is performed in an RTM transaction. When a transaction aborts, the thread re-executes the addition in a critical section along the fallback path (i.e. ‘else’). Do you think it is correct? If you don’t, can you spot what is wrong?

#pragma omp parallel for
    for(int i = 0; i < N; i++){
        int mygroup = group[i];
        if(_xbegin()==-1) {
            sums[mygroup] += data[i];
            _xend();
        } else {
            #pragma omp critical
            {
                sums[mygroup] += data[i];
            }
        }
    }

Example 2.

I was taught that code reuse is important when I was in school (sorry, not in kindergarten ;^)). So, when a need arose to write an RTM test, I decided to put what I had learned to work. The test was similar to the one in Example 1, except that it alternates RTM and HLE transactions. (Notice that the test does not have the non-speculative fallback path required for the RTM transaction. Having no fallback path makes the test UNSAFE because Intel TSX does not guarantee forward progress; i.e., it can abort RTM transactions forever.) The test has two addition statements: one is protected with RTM and the other is protected with HLE. Quite a feat, eh? I felt proud of myself ;-) ... until I started to run the test. The test occasionally printed out incorrect sums. I panicked at first because the test was simple and looked almost identical to other tests, leading me to believe, however briefly, that the emulator had a nasty bug that had gone unnoticed for a long time. But after a closer look, I realized the test had a flaw. Can you see what I did wrong?

    #define PREFIX_XACQUIRE ".byte 0xF2; "
    #define PREFIX_XRELEASE ".byte 0xF3; "

    class mutex_elided {
      uint8_t flag;
      // Try to take the lock with an XACQUIRE-prefixed exchange; returns true on success.
      inline bool try_lock_elided() {
        uint8_t value = 1;
        // flag is a single byte, so the exchange must be byte-sized (xchgb).
        __asm__ volatile (PREFIX_XACQUIRE "lock; xchgb %0, %1"
                          : "=r"(value), "=m"(flag) : "0"(value), "m"(flag) : "memory" );
        return uint8_t(value^1);
      }
    public:
      inline void acquire() {
        for(;;) {
            exponential_backoff backoff;
            // Spin on a plain read until the lock looks free.
            while((volatile unsigned char&)flag==1)
                backoff.pause();
            if(try_lock_elided())
                return;
            __asm__ volatile ("pause\n" : : : "memory" );
        }
      }

      inline void release() {
        // XRELEASE-prefixed byte store of 0, matching the size of the acquiring exchange.
        __asm__ volatile (PREFIX_XRELEASE "movb $0, %0"
                          : "=m"(flag) : "m"(flag) : "memory" );
      }
    };
    ...
 
 
      mutex_elided m;
#pragma omp parallel for
    for(int i = 0; i < N; i++) {
        int mygroup = group[i];
        if( (i&1) ) {
            while(_xbegin()!=-1) ;
            // must have a fallback path
            sums[mygroup] += 1;
            _xend();
        } else {
            m.acquire();
            sums[mygroup] += 1;
            m.release();
        }
    }

Example 3.

A colleague of mine tried to use RTM to improve performance of a benchmark. (I changed function names for clarity.) The following fragment of the benchmark permutes an array of IDs by, for each ID, swapping its value with that of a randomly picked partner. In the fallback path, elements i and j are exclusively acquired in the increasing order of their indices, and then written back in the reverse order. He was running it on the emulator and came back to me with an occasional hang problem. Can you come up with a sequence of events that leads to an indefinite wait?

bool pause( volatile int64_t* l ) {
    __asm__ __volatile__( "pause\n" : : : "memory" );
    return true;
}
 
int64_t read_and_lock( volatile int64_t* loc ) {
    int64_t val;
    while(1) {
        while( pause( loc ) )
            if(  empty_val != (val = *loc) )
                    break;
        assert( val!=empty_val );
        if ( __sync_bool_compare_and_swap( loc, val, empty_val ) )
            break;
    }
    assert( val!=0 );
    return val;
}
 
void write_and_release( volatile int64_t* loc, int64_t val ) {
    while( pause( loc ) )
        if( __sync_bool_compare_and_swap( loc, empty_val, val ) )
            break;
    return;
}
 
...
#pragma omp parallel for num_threads(16)
    for (int i=0; i<n; i++) {
        int j = (int64_t) ( n * gen_rand() );
 
        if( _xbegin()==-1 ) {
            if(i != j) {
                const vid_t tmp_val = vid_values[i];
                vid_values[i] = vid_values[j];
                vid_values[j] = tmp_val;
            }
            _xend();
        } else {
            if (i < j) {
                const vid_t tmp_val_i = read_and_lock( &vid_values[i] );
                const vid_t tmp_val_j = read_and_lock( &vid_values[j] );
                write_and_release( &vid_values[j], tmp_val_i );
                write_and_release( &vid_values[i], tmp_val_j );
            } else if (j < i) {
                const vid_t tmp_val_j = read_and_lock( &vid_values[j] );
                const vid_t tmp_val_i = read_and_lock( &vid_values[i] );
                write_and_release( &vid_values[i], tmp_val_j );
                write_and_release( &vid_values[j], tmp_val_i );
            }
        }
    }

Analysis

Example 1.

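The fallback path has a race with the code in the RTM path: the critical section excludes other threads in the critical section, but nothing stops a transaction from committing in the middle of it. For example, the following interleaving may happen. (Always keep in mind that one should not make any assumption about the relative speeds of threads!)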

Thread 1	Thread 2
start critical section
read sums[mygroup]
	do transaction that updates sums[mygroup]
write sums[mygroup]

As a result, the example occasionally loses the increment done in the RTM transaction.
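
The interleaving above can be replayed deterministically in a single thread to see the lost update. The sketch below is only an illustration: the function name and the sample values are made up, and the RTM transaction is modeled as one plain statement, on the grounds that a committed transaction takes effect atomically.

```cpp
// Single-threaded replay of the Example 1 interleaving (illustrative only).
// replay_lost_update and its parameters are hypothetical names, not part of
// the original test program.
int replay_lost_update(int initial, int fallback_add, int txn_add) {
    int slot = initial;              // stands in for sums[mygroup]

    // Thread 1, inside the critical section: "+=" starts by reading the slot.
    int t1_read = slot;

    // Thread 2: its RTM transaction commits here. The critical section does
    // not stop it, because the transaction never looks at the lock.
    slot += txn_add;

    // Thread 1: finishes the "+=" using the stale value it read earlier.
    slot = t1_read + fallback_add;

    return slot;                     // Thread 2's contribution is overwritten
}
```

With an initial value of 0, a fallback addition of 3 and a transactional addition of 5, the function returns 3 rather than 8.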

Example 2.

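Don’t let the HLE transaction fool you. When an HLE transaction aborts, execution restarts at the XACQUIRE-prefixed instruction and acquires the same mutex non-speculatively. From that point on, the HLE path is an ordinary critical section that the RTM path knows nothing about, so the case effectively becomes identical to Example 1.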

Example 3.

Again, one should not make any assumption about the relative speeds of concurrently executing threads. Even though the fallback path is race free on its own, it has a race with the code in the RTM path. For example, the following sequence of events may occur.

Thread 1	Thread 2
read_and_lock( vid_values[i] )
	do transaction that swaps vid_values[i] and vid_values[k] and makes vid_values[i] non-zero
read_and_lock( vid_values[j] )
write_and_release( vid_values[j] )
Wait for vid_values[i] to become 0
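
The same sequence can be replayed in one thread to confirm that the last release can never succeed. This is a sketch under assumptions: empty_val is taken to be 0 (as the asserts in read_and_lock suggest), the slot values 10, 20 and 30 are arbitrary, the helper name is hypothetical, and the transaction is modeled as a plain swap.

```cpp
#include <cstdint>

// Single-threaded replay of the Example 3 interleaving (hypothetical helper).
// Returns true if Thread 1's final release could succeed, and false if it
// would spin forever.
bool replay_final_release_succeeds() {
    const int64_t empty_val = 0;     // assumption: 0 marks a locked slot
    volatile int64_t vid_values[3] = {10, 20, 30};
    const int i = 0, j = 1, k = 2;

    // Thread 1: read_and_lock(&vid_values[i]) empties slot i.
    int64_t t1_val_i = vid_values[i];
    __sync_bool_compare_and_swap(&vid_values[i], t1_val_i, empty_val);

    // Thread 2: the transaction swaps slots i and k and commits, which moves
    // the "locked" marker to slot k and makes slot i non-zero.
    int64_t tmp = vid_values[i];
    vid_values[i] = vid_values[k];
    vid_values[k] = tmp;

    // Thread 1: read_and_lock(&vid_values[j]), then write_and_release on j.
    int64_t t1_val_j = vid_values[j];
    __sync_bool_compare_and_swap(&vid_values[j], t1_val_j, empty_val);
    __sync_bool_compare_and_swap(&vid_values[j], empty_val, t1_val_i);

    // Thread 1: write_and_release on slot i needs the slot to be empty.
    return __sync_bool_compare_and_swap(&vid_values[i], empty_val, t1_val_j);
}
```

The function returns false: slot i holds 30 instead of empty_val, so the compare-and-swap in write_and_release keeps failing unless some other thread happens to lock that slot.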

Possible Fixes

Now that we have a concrete diagnosis for each of the examples, the fixes are straightforward.

Example 1.

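Replacing the ‘omp critical’ section in the fallback path with an atomic add such as __sync_add_and_fetch would fix the problem, because the read and the write of the slot then happen as one indivisible operation that no transaction can slip into. That is: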

    __sync_add_and_fetch( &sums[mygroup], data[i] );
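
For reference, here is a minimal sketch of the accumulation loop written entirely with the atomic add. The RTM path is left out so that the sketch runs on any machine with GCC or Clang; the function name and signature are hypothetical, and in the test program the loop would sit under "#pragma omp parallel for".

```cpp
// Illustrative helper (hypothetical name): the Example 1 loop with each
// addition done as one atomic read-modify-write.
void accumulate_by_group(const int* data, const int* group, int n, int* sums) {
    for (int i = 0; i < n; i++) {
        // GCC/Clang builtin: atomically adds data[i] to the slot and
        // returns the new value (ignored here).
        __sync_add_and_fetch(&sums[group[i]], data[i]);
    }
}
```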

A more general solution is to use a mutex in the fallback path and add it to the read set of the RTM transaction. The transaction then aborts if another thread acquires the mutex while it is running, and aborts itself explicitly if the mutex is already held when it starts.

mutex fallback_mutex;
 
...
#pragma omp parallel for num_threads(8)
    for(int i = 0; i < N; i++){
        int mygroup = group[i];
        if(_xbegin()==-1) {
            if( !fallback_mutex.is_acquired() ) {
                sums[mygroup] += data[i];
            } else {
                _xabort(1);
            }
            _xend();
        } else {
            fallback_mutex.acquire();
            sums[mygroup] += data[i];
            fallback_mutex.release();
        }
    }
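
The listing above relies on a mutex type with an is_acquired() method, which is not shown. One possible shape for it is sketched below, assuming C++11; the class name fallback_spin_mutex and the choice of a simple spin lock are illustrative assumptions, not the class used in the original test.

```cpp
#include <atomic>

// Hypothetical spin mutex exposing is_acquired().
class fallback_spin_mutex {
    std::atomic<bool> locked{false};
public:
    // Reading the flag inside a transaction places it in the read set, so a
    // later acquire() by another thread conflicts with that transaction.
    bool is_acquired() const {
        return locked.load(std::memory_order_acquire);
    }

    void acquire() {
        // Loop until the exchange observes "unlocked" and sets "locked".
        while (locked.exchange(true, std::memory_order_acquire)) {
            // Wait on plain loads to avoid hammering the line with writes.
            while (locked.load(std::memory_order_relaxed)) { }
        }
    }

    void release() {
        locked.store(false, std::memory_order_release);
    }
};
```

A plain flag is enough here because the transaction only needs to observe the lock word; it never writes to it.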

Example 2.

Similarly, we may extend mutex_elided with an is_acquired() method that reads the lock variable. Since the lock variable is then read inside the RTM transaction, any non-speculative execution of the HLE path that makes its change to the lock variable visible will abort the transaction.

    mutex_elided m;
#pragma omp parallel for num_threads(8)
    for(int i = 0; i < N; i++) {
        int mygroup = group[i];
        if( (i&1) ) {
            while(_xbegin()!=-1) // having no fallback path is
                ;                // UNSAFE
            if( !m.is_acquired() )
                sums[mygroup] += data[i];
            else
                _xabort(0);
            _xend();
        } else {
            m.acquire();
            sums[mygroup] += data[i];
            m.release();
        }
    }

Example 3.

We can also apply the mutex-based approach to this example. Another approach is to read the two ID values in the RTM transaction and check whether either of them contains ‘empty_val’ (0 in this benchmark), which means a thread on the fallback path currently holds that element. If so, we abort the transaction and force the thread to follow the fallback path.

#pragma omp parallel for num_threads(16)
    for (int i=0; i<n; i++) {
        int j = (int64_t) ( n * gen_rand() );
        if( _xbegin()==-1 ) {
            if(i != j) {
                const vid_t tmp_val_i = vid_values[i];
                const vid_t tmp_val_j = vid_values[j];
                if( tmp_val_i==0 || tmp_val_j==0 )
                    _xabort(0);
                vid_values[i] = tmp_val_j;
                vid_values[j] = tmp_val_i;
            }
            _xend();
        } else {
            if (i < j) {
                const vid_t tmp_val_i = read_and_lock( &vid_values[i] );
                const vid_t tmp_val_j = read_and_lock( &vid_values[j] );
                write_and_release( &vid_values[j], tmp_val_i );
                write_and_release( &vid_values[i], tmp_val_j );
            } else if (j < i) {
                const vid_t tmp_val_j = read_and_lock( &vid_values[j] );
                const vid_t tmp_val_i = read_and_lock( &vid_values[i] );
                write_and_release( &vid_values[i], tmp_val_j );
                write_and_release( &vid_values[j], tmp_val_i );
            }
        }
    }
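
The check inside the transaction can be exercised on its own, without RTM hardware, by pulling it into a plain function. The sketch below assumes that vid_t is a 64-bit integer and that empty_val is 0; the function name is hypothetical, and a return value of false stands for the point where the real code calls _xabort(0).

```cpp
#include <cstdint>

typedef int64_t vid_t;               // assumption: vid_t is a 64-bit integer

// Hypothetical stand-in for the body of the RTM transaction. Returns false
// where the real code would abort, true where it would reach _xend().
bool swap_unless_locked(volatile vid_t* vid_values, int i, int j) {
    const vid_t empty_val = 0;       // assumption: 0 marks a locked slot
    if (i == j)
        return true;                 // nothing to swap
    const vid_t tmp_val_i = vid_values[i];
    const vid_t tmp_val_j = vid_values[j];
    if (tmp_val_i == empty_val || tmp_val_j == empty_val)
        return false;                // a fallback thread holds one of the slots
    vid_values[i] = tmp_val_j;
    vid_values[j] = tmp_val_i;
    return true;
}
```

Because an empty slot is never moved, the thread that emptied it on the fallback path always finds it still empty when it comes back to release it.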

Conclusions

So, what have I learned from these examples? As you may have already noticed, all of these are related to the ‘restricted’ part of RTM. Intel TSX has great potential for improving the performance of concurrent and parallel applications. But the synchronization between the speculative code inside the RTM transaction and the non-speculative fallback path needs to be carefully managed, since the interactions are subtle. I gather most programmers won’t need to worry too much about it, because higher-level abstractions in supporting libraries should hide most of the agonizing synchronization details. But for those who are willing to get their hands dirty to squeeze out the last drop of performance gain, it always pays to keep a watchful eye on the interactions between an RTM code path and its non-speculative fallback. (And we have many tools such as Intel SDE to assist you.)

Disclaimer: The opinion expressed in the blog is the author's own and reflects none of his employer's or his colleagues'.

↧

Emulation of new instructions

August 11, 2008, 10:45 am

How it works

Isolation issues

Debugging

Advanced use options

Program checkers

Also if you have software questions you can post them to the Intel® AVX and CPU forum at:
/en-us/forums/intel-avx-and-cpu-instructions/

↧

Licenses for runtime libraries for SDE on Linux

August 12, 2008, 2:31 pm

Read more about GNU General Public License, version 2

Back to the Intel® Software Development Emulator

For libstdc++:

// Copyright (C) 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006
// Free Software Foundation, Inc.
//
// This file is part of the GNU ISO C++ Library.  This library is free
// software; you can redistribute it and/or modify it under the
// terms of the GNU General Public License as published by the
// Free Software Foundation; either version 2, or (at your option)
// any later version.

// This library is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
// GNU General Public License for more details.

// You should have received a copy of the GNU General Public License along
// with this library; see the file COPYING.  If not, write to the Free
// Software Foundation, 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301,
// USA.

// As a special exception, you may use this file as part of a free software
// library without restriction.  Specifically, if other files instantiate
// templates or use macros or inline functions from this file, or you compile
// this file and link it with other files to produce an executable, this
// file does not by itself cause the resulting executable to be covered by
// the GNU General Public License.  This exception does not however
// invalidate any other reasons why the executable file might be covered by
// the GNU General Public License.

And for libgcc_s:

/* Copyright (C) 1989, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999,
   2000, 2001, 2002, 2003, 2004, 2005  Free Software Foundation, Inc.

This file is part of GCC.

GCC is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free
Software Foundation; either version 2, or (at your option) any later
version.

In addition to the permissions in the GNU General Public License, the
Free Software Foundation gives you unlimited permission to link the
compiled version of this file into combinations with other programs,
and to distribute those combinations without any restriction coming
from the use of this file.  (The General Public License restrictions
do apply in other respects; for example, they cover modification of
the file, and distribution when not linked into a combine
executable.)

GCC is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received a copy of the GNU General Public License
along with GCC; see the file COPYING.  If not, write to the Free
Software Foundation, 51 Franklin Street, Fifth Floor, Boston, MA
02110-1301, USA.  */

↧

Recent Intel® AVX Architectural Changes

January 29, 2009, 8:33 am

VFMADDSS xmm1, xmm2, xmm3, m32, which was xmm1 = xmm2*xmm3 + m32

(and the upper bits of the XMM register - from bit 32 to 127 - were zeroed).

NOW we have three forms:

VFMADD132SS xmm1, xmm2, m32, which is xmm1 = xmm1*m32 + xmm2
VFMADD213SS xmm1, xmm2, m32, which is xmm1 = xmm2*xmm1 + m32
VFMADD231SS xmm1, xmm2, m32, which is xmm1 = xmm2*m32 + xmm1
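
The three operand orderings are easy to mix up, so here is a small scalar model of them. It is a sketch only: the helper names are made up, std::fma stands in for the fused hardware operation, and only the low single-precision element is modeled (the upper bits of the destination register are ignored).

```cpp
#include <cmath>

// Scalar models of the three forms (hypothetical helper names).
// op1, op2, op3 correspond to xmm1, xmm2, m32 in the listing above; the
// digits in the mnemonic give the operands as multiplicand, multiplier, addend.
float fmadd132(float op1, float op2, float op3) { return std::fma(op1, op3, op2); }
float fmadd213(float op1, float op2, float op3) { return std::fma(op2, op1, op3); }
float fmadd231(float op1, float op2, float op3) { return std::fma(op2, op3, op1); }
```

With op1 = 2, op2 = 3 and op3 = 4, the three forms give 2*4 + 3 = 11, 3*2 + 4 = 10 and 3*4 + 2 = 14 respectively.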

y0 = x0*c0 + x1
y0 = x0*c0 - x1

↧

AVX debugging, or how is it done after all?

January 29, 2010, 10:25 pm

TID0: INS 0x00401f4d                     vrcpss xmm7, xmm5, xmm5
TID0:      XMM7 := 00000000_00000000_00000000_3ba57800
XMM7 (doubles) := 0 4.94411e-315
XMM7 (floats) := 0 0 0 0.00504971
TID0: INS 0x00401f51                     vsubss xmm5, xmm1, xmm0
TID0:      XMM5 := 00000000_00000000_00000000_43460000
XMM5 (doubles) := 0 5.57633e-315
XMM5 (floats) := 0 0 0 198
TID0: INS 0x00401f55                     vmulss xmm5, xmm5, xmm7
TID0:      XMM5 := 00000000_00000000_00000000_3f7ff5a0
XMM5 (doubles) := 0 5.26353e-315
XMM5 (floats) := 0 0 0 0.999842
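
The trace prints each register as raw hex and then again interpreted as doubles and floats. As a way of cross-checking the two views, the low 32-bit lane can be reinterpreted as an IEEE-754 single. The helper below is a hypothetical illustration and is not part of Intel SDE.

```cpp
#include <cstdint>
#include <cstring>

// Reinterpret a 32-bit pattern as an IEEE-754 single (hypothetical helper).
float lane_to_float(uint32_t bits) {
    float f;
    std::memcpy(&f, &bits, sizeof f);   // well-defined byte-wise reinterpretation
    return f;
}
```

For instance, lane_to_float(0x43460000u) is exactly 198, which matches the XMM5 (floats) line after the vsubss instruction.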

↧

Intel® Software Development Emulator Download

December 16, 2011, 6:56 am

Intel® Software Development Emulator (released July 29, 2014)

DOWNLOAD Intel® SDE for WINDOWS* (sde-external-7.2.0-2014-07-29-win.tar.bz2)
- Note: If you use Cygwin's tar command to unpack the Windows* kit, you must execute a "chmod -R +x" on the unpacked directory.
- DOWNLOAD Intel® SDE debugging integration for WINDOWS* (sde-msvs2013-2.0.1.msi)
DOWNLOAD Intel® SDE for LINUX* (sde-external-7.2.0-2014-07-29-lin.tar.bz2)
DOWNLOAD Intel® SDE for OS X* (sde-external-7.2.0-2014-07-29-mac.tar.bz2)
DOWNLOAD Intel® MPX runtime for LINUX* (2014-02-13-mpx-external-lin.tar.bz2)
DOWNLOAD Intel® MPX runtime for WINDOWS* (2014-02-13-mpx-runtime-external-win.zip)

Intel® Software Development Emulator (released July 20, 2014)

DOWNLOAD Intel® SDE for WINDOWS* (sde-external-7.1.0-2014-07-20-win.tar.bz2)
- Note: If you use Cygwin's tar command to unpack the Windows* kit, you must execute a "chmod -R +x" on the unpacked directory.
- DOWNLOAD Intel® SDE debugging integration for WINDOWS* (sde-msvs2013-2.0.1.msi)
DOWNLOAD Intel® SDE for LINUX* (sde-external-7.1.0-2014-07-20-lin.tar.bz2)
DOWNLOAD Intel® SDE for OS X* (sde-external-7.1.0-2014-07-20-mac.tar.bz2)

Intel® Software Development Emulator (released March 06, 2014)

DOWNLOAD Intel® SDE for WINDOWS* (sde-external-6.22.0-2014-03-06-win.tar.bz2)
- Note: If you use Cygwin's tar command to unpack the Windows* kit, you must execute a "chmod -R +x" on the unpacked directory.
- DOWNLOAD Intel® SDE debugging integration for WINDOWS* (sde-msvs2012-1.0.5.msi.zip)
DOWNLOAD Intel® SDE for LINUX* (sde-external-6.22.0-2014-03-06-lin.tar.bz2)
DOWNLOAD Intel® SDE for OS X* (sde-external-6.22.0-2014-03-06-mac.tar.bz2)

Previous versions of the Intel® Software Development Emulator

DOWNLOAD Intel® SDE for WINDOWS* (sde-external-6.20.0-2014-02-13-win.tar.bz2)
- Note: If you use Cygwin's tar command to unpack the Windows* kit, you must execute a "chmod -R +x" on the unpacked directory.
- DOWNLOAD Intel® SDE debugging integration for WINDOWS* (sde-msvs2012-1.0.5.msi.zip)
DOWNLOAD Intel® SDE for LINUX* (sde-external-6.20.0-2014-02-13-lin.tar.bz2)
DOWNLOAD Intel® SDE for OS X* (sde-external-6.20.0-2014-02-13-mac.tar.bz2)
DOWNLOAD Intel® SDE for WINDOWS* (sde-external-6.12.0-2013-11-16-win.tar.bz2)
- Note: If you use Cygwin's tar command to unpack the Windows* kit, you must execute a "chmod -R +x" on the unpacked directory.
- DOWNLOAD Intel® SDE debugging integration for WINDOWS* (sde-msvs2012-1.0.5.msi.zip)
DOWNLOAD Intel® SDE for LINUX* (sde-external-6.12.0-2013-11-16-lin.tar.bz2)
DOWNLOAD Intel® SDE for OS X* (sde-external-6.12.0-2013-11-16-mac.tar.bz2)
DOWNLOAD Intel® MPX runtime for LINUX* (2013-10-29-mpx-external-lin.tar.bz2)
DOWNLOAD Intel® MPX runtime for WINDOWS* (2013-10-29-mpx-runtime-external-win.zip)

Please take a moment to register with Intel® DZ to participate in forum discussions.

Back to the Intel® Software Development Emulator page.

License Agreement: What If Pre-Release License Agreement

Protected Attachments:

Attachment	Size
Download sde-external-7.2.0-2014-07-29-lin.tar.bz2	15.27 MB
Download sde-external-7.2.0-2014-07-29-mac.tar.bz2	8.79 MB
Download sde-external-7.2.0-2014-07-29-win.tar.bz2	8.75 MB
Download sde-msvs2013-2.0.1.msi	1.27 MB
Download sde-external-7.1.0-2014-07-20-lin.tar.bz2	15.18 MB
Download sde-external-7.1.0-2014-07-20-mac.tar.bz2	8.79 MB
Download sde-external-7.1.0-2014-07-20-win.tar.bz2	8.77 MB
Download sde-msvs2012-1.0.5.msi.zip	405.59 KB
Download sde-external-6.22.0-2014-03-06-lin.tar.bz2	13.5 MB
Download sde-external-6.22.0-2014-03-06-win.tar.bz2	7.93 MB
Download sde-external-6.22.0-2014-03-06-mac.tar.bz2	7.78 MB
Download 2014-02-13-mpx-runtime-external-win.zip	23.92 KB
Download 2014-02-13-mpx-runtime-external-lin.tar.bz2	17.87 KB
Download sde-external-6.20.0-2014-02-13-win.tar.bz2	8.1 MB
Download sde-external-6.20.0-2014-02-13-mac.tar.bz2	8.03 MB
Download sde-external-6.20.0-2014-02-13-lin.tar.bz2	13.73 MB
Download 2013-10-29-mpx-runtime-external-win-0.zip	58.07 KB
Download 2013-10-29-mpx-runtime-external-lin.tar.bz2	17.86 KB
Download sde-external-6.12.0-2013-11-16-lin.tar.bz2	13.65 MB
Download sde-external-6.12.0-2013-11-16-mac.tar.bz2	8.02 MB
Download sde-external-6.12.0-2013-11-16-win.tar.bz2	7.88 MB

↧

Buy or Renew Intel® Software Development Products

March 27, 2012, 8:04 am

Intel offers several licensing options for our software development products. Review the choices below to buy or renew Intel® software. You may use the Product Support Renewal/Upgrade Options page to determine renewal and upgrade options for your product(s).

All prices listed below are for named-user licenses. All prices are Manufacturer Suggested List Prices (MSRP) and subject to change without notice. Prices do NOT include Value Added Taxes (VAT) or any other state or local taxes or charges.

For floating licenses, node-locked licenses, or other licensing options, contact a reseller, or contact an Intel representative at intel.software.sales@intel.com.
To purchase an academic research license, please select your desired product and the discounted price will be displayed during check out. For additional information on all of our education offerings, visit our Education Offerings Center, or contact an Intel representative at academicdevelopersinfo@intel.com.
Support Renewal extends your support for one year from the expiration date of your current support agreement.
Existing customers can take advantage of special upgrade prices for Intel® Parallel Studio XE, Intel® C++ Studio XE or Intel® Fortran Studio XE. See details of the upgrade offer.

Category Name	Product MSRP (Named-User)	Support Renewal MSRP (Named-User)	Options
Product Suites
Intel® Parallel Studio XE for Windows* Includes Intel® Composer XE, Intel® VTune™ Amplifier XE, Intel® Inspector XE, Intel® Advisor XE	$2,299	$799**	Find a reseller › See all options ›
Intel® Parallel Studio XE for Linux* Includes Intel® Composer XE, Intel® VTune™ Amplifier XE, Intel® Inspector XE, Intel® Advisor XE	$2,299	$799**	Find a reseller › See all options ›
Intel® C++ Studio XE for Windows or Linux Includes Intel® C++ Composer XE, Intel® VTune™ Amplifier XE, Intel® Inspector XE, Intel® Advisor XE	$1,599	$599**	Find a reseller › See all options ›
Intel® Visual Fortran Studio XE for Windows Includes Intel® Visual Fortran Composer XE, Intel® VTune™ Amplifier XE, Intel® Inspector XE, Intel® Advisor XE	$1,899	$699**	Find a reseller › See all options ›
Intel® Fortran Studio XE for Linux Includes Intel® Fortran Composer XE, Intel® VTune™ Amplifier XE, Intel® Inspector XE, Intel® Advisor XE	$1,899	$699**	Find a reseller › See all options ›
Intel® Cluster Studio XE for Windows Includes Intel® C++ Composer XE, Intel® Visual Fortran Composer XE, Intel® Trace Analyzer and Collector, Intel® MPI Library, Intel® MPI Benchmarks, Intel® Inspector XE and Intel® VTune™ Amplifier XE, Intel® Advisor XE	$2,949	$1,049**	Find a reseller › See all options ›
Intel® Cluster Studio XE for Linux Includes Intel® C++ Composer XE, Intel® Fortran Composer XE, Intel® Trace Analyzer and Collector, Intel® MPI Library, Intel® MPI Benchmarks, Intel® Inspector XE and Intel® VTune™ Amplifier XE, Intel® Advisor XE	$2,949	$1,049**	Find a reseller › See all options ›
Intel® Cluster Studio for Windows Includes Intel® Composer XE, Intel® Trace Analyzer and Collector, Intel® MPI Library, Intel® MPI Benchmarks	$2,049	$749**	Find a reseller › See all options ›
Intel® Cluster Studio for Linux Includes Intel® Composer XE, Intel® Trace Analyzer and Collector, Intel® MPI Library, Intel® MPI Benchmarks	$2,049	$749**	Find a reseller › See all options ›
Intel® System Studio including JTAG Debugger Includes: Intel® C++ Compiler for Embedded OS Linux, Intel® C++ Compiler for Android, Intel® Math Kernel Library, Intel® Integrated Performance Primitives, Intel® VTune™ Amplifier for Systems, Intel® Energy Profiler, Intel® Inspector for Systems, Intel® GPA System Analyzers, Intel® JTAG Debugger, SVEN SDK, GDB - The GNU* Project Debugger	$2,399	$849	Find a reseller › See all options ›
Intel® System Studio Includes: Intel® C++ Compiler for Embedded OS Linux, Intel® C++ Compiler for Android, Intel® Math Kernel Library, Intel® Integrated Performance Primitives, Intel® VTune™ Amplifier for Systems, Intel® Energy Profiler, Intel® Inspector for Systems, Intel® GPA System Analyzer, SVEN SDK, GDB - The GNU* Project Debugger	$1,649	$599	Find a reseller › See all options ›
Compilers and Libraries
Intel® Composer XE for Windows Includes Intel® C++ Composer XE, Intel® Visual Fortran Composer XE	$1,199	$449**	Find a reseller › See all options ›
Intel® Composer XE for Linux Includes Intel® C++ Composer XE, Intel® Fortran Composer XE	$1,449	$499**	Find a reseller › See all options ›
Intel® C++ Composer XE for Windows, Linux, or OS X* Includes Intel® C++ Compiler, Intel® Integrated Performance Primitives, Intel® Math Kernel Library, Intel® Parallel Building Blocks	$699	$249**	Find a reseller › See all options ›
Intel® Visual Fortran Composer XE for Windows Includes Intel® Visual Fortran Compiler, Intel® Math Kernel Library	$849	$299**	Find a reseller › See all options ›
Intel® Fortran Composer XE for Linux Includes Intel® Fortran Compiler, Intel® Math Kernel Library	$999	$349**	Find a reseller › See all options ›
Intel® Fortran Composer XE for OS X Includes Intel® Fortran Compiler, Intel® Math Kernel Library	$849	$299**	Find a reseller › See all options ›
Intel® C++ Compiler for Android*	$79.95	N/A	Find a reseller › See all options ›
Intel® C++ Compiler Professional Edition for QNX Neutrino* RTOS Support Includes Intel® C++ Compiler, Intel® Integrated Performance Primitives	$599	$240	See all options ›
Intel® C Compiler for EFI Byte Code	$995	$398	Find a reseller › See all options ›
Intel® Visual Fortran Composer XE with Rogue Wave* IMSL* Fortran Numerical Library 7.0 for Windows Includes Intel® Visual Fortran Compiler, IMSL* Fortran Numerical Library, Intel® Math Kernel Library Includes 1 developer and 1 deployment license for the developer.	$1,749	$649**	Find a reseller › See all options ›
Embedded and Mobile System Development
Intel® System Studio including JTAG Debugger Includes: Intel® C++ Compiler for Embedded OS Linux, Intel® C++ Compiler for Android, Intel® Math Kernel Library, Intel® Integrated Performance Primitives, Intel® VTune™ Amplifier for Systems, Intel® Energy Profiler, Intel® Inspector for Systems, Intel® GPA System Analyzer, Intel® JTAG Debugger, SVEN SDK, GDB - The GNU* Project Debugger	$2,399	$849	Find a reseller › See all options ›
Intel® System Studio Includes: Intel® C++ Compiler for Embedded OS Linux, Intel® C++ Compiler for Android, Intel® Math Kernel Library, Intel® Integrated Performance Primitives, Intel® VTune™ Amplifier for Systems, Intel® Energy Profiler, Intel® Inspector for Systems, Intel® GPA System Analyzer, SVEN SDK, GDB - The GNU* Project Debugger	$1,649	$599	Find a reseller › See all options ›
Performance Libraries
Intel® Integrated Performance Primitives for Windows, Linux, or OS X	$199	$69**	Find a reseller › See all options ›
Intel® Math Kernel Library for Windows or Linux	$499	$179**	Find a reseller › See all options ›
Intel® Threading Building Blocks for Windows, Linux, or OS X	$499	$179**	Find a reseller › See all options ›
Rogue Wave* IMSL* Fortran Numerical Library 7.0 for Windows	$999	$499	Find a reseller › See all options ›
Performance Profilers
Intel® VTune™ Amplifier XE for Windows or Linux	$899	$349**	Find a reseller › See all options ›
Thread and Memory Checkers
Intel® Inspector XE for Windows or Linux	$899	$349**	Find a reseller › See all options ›
Cluster Tools
Intel® Cluster Studio XE for Windows Includes Intel® C++ Composer XE, Intel® Fortran Composer XE, Intel® Trace Analyzer and Collector, Intel® MPI Library, Intel® MPI Benchmarks, Intel® Inspector XE and Intel® VTune™ Amplifier XE, Intel® Advisor XE	$2,949	$1,049**	Find a reseller › See all options ›
Intel® Cluster Studio XE for Linux Includes Intel® C++ Composer XE, Intel® Fortran Composer XE, Intel® Trace Analyzer and Collector, Intel® MPI Library, Intel® MPI Benchmarks, Intel® Inspector XE and Intel® VTune™ Amplifier XE, Intel® Advisor XE	$2,949	$1,049**	Find a reseller › See all options ›
Intel® Cluster Studio for Windows Includes Intel® Composer XE, Intel® Trace Analyzer and Collector, Intel® MPI Library, Intel® MPI Benchmarks	$2,049	$749**	Find a reseller › See all options ›
Intel® Cluster Studio for Linux Includes Intel® Composer XE, Intel® Trace Analyzer and Collector, Intel® MPI Library, Intel® MPI Benchmarks	$2,049	$749**	Find a reseller › See all options ›
Intel® MPI Library for Windows or Linux	$499	$179**	Find a reseller › See all options ›
System Modeling and Simulation Tools
CoFluent Studio*	N/A	N/A	See all options ›
CoFluent Reader	N/A	N/A	See all options ›
Visual Computing Tools
Intel® Graphics Performance Analyzers	N/A	N/A	See all options ›
Intel® Media SDK for Servers	$499	N/A	See all options ›

**Lowest Price available if you renew prior to current subscription expiration. For more information on renewals click here.

↧

Intel® Software Development Emulator Release Notes

June 15, 2012, 8:44 am

2014-07-29 version 7.2.0

Added -p4 (Pentium4) and -p4p (Pentium4-Prescott) knobs to SDE.
Updated CPUID definition files.

2014-07-20 version 7.1.0

Added support for additional Intel® AVX-512 instructions.
Support for debugging integration with Microsoft Visual Studio 2012 and Visual Studio 2013.
New controller implementation.
Improved TSX statistics.

2014-03-06 version 6.22.0

Exclude PAUSE from chip-check because it is a NOP on older CPUs and on quark.

2014-02-13 version 6.20.0

Added support for XSAVEC and CLFLUSHOPT.
Disabled TSX CPUID bits when TSX emulation is not requested.
Improved disassembly for MPX instructions.
Added an option for running chip-check only on the main executable.
Added support for -quark (Pentium ISA).
Added application debugging for Mac OS X with the lldb debugger.

2013-11-16 version 6.12.0

Added support for Mac OS X version 10.9.
Improved the TSX statistics information.
Various fixes with the emulation of floating-point instructions of Intel AVX-512.
Enabled the alignment checker tool by default for instructions that require alignment.
Fixed mismatch between mix and dynamic mask profiler.
Updated the Intel MPX runtime libraries for Windows.
Performance improvements when modeling a CPU prior to AVX-512.

2013-09-21 version 6.7.0

Debugging with GDB is now supported with Intel® AVX-512. Download the new GDB from here.
Emulation of Intel® AVX2 FMA and Intel AVX-512 FMA uses native FMA instructions when running on Haswell host.
Various fixes with the emulation of floating-point and conversion instructions of Intel AVX-512.
Disassembly of control transfer instructions displays the 'bnd' prefix when used with Intel® MPX.
Updated the XED ISA set names for Intel AVX-512. This is visible in 'mix' statistics output.
This release goes with 2013-08-29 version of the Intel MPX runtime.

2013-07-22 version 6.1.0

Emulation support for the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions present on the Intel Knights Landing microarchitecture.
Emulation support for the Intel® Secure Hash Algorithm (Intel® SHA) extensions present on the Intel Goldmont microarchitecture.
Emulation support for the Intel® Memory Protection Extensions (Intel® MPX) present on the Intel Skylake and Goldmont microarchitectures.
Support for Hardware Lock Elision introduced on the Intel Haswell microarchitecture.
Improved support for Restricted Transactional Memory introduced on the Intel Haswell microarchitecture.
Improved support for the OS X* operating system (Mountain Lion)
The footprint tool now has the ability to compute footprint over time for working-set estimation.
A new tool called the dynamic mask profiler is provided using -dyn_mask_profile knob. The output is in a simple XML format.

The Intel SDE development team has grown to include Michael Berezalsky, Mark Charney, Michael Gorin, Omer Mor, Ariel Slonim and Ady Tal.

2013-01-03 version 5.38

Improvements in RTM emulation stability. Added statistics knobs. Updated knobs.
Support for debugging integration with Microsoft Visual Studio 2012. See main page for information.
Improved multithreaded stability when using the AVX/SSE transition checker
Mac OS X: support for code-signed binaries, simplifying execution. See main page for information about the "taskport".
XED: added elf/dwarf support back to the command line tool
TZCNT ZF flags fix

2012-11-01 version 5.31 - major update

Major update including fixes for the processor codenamed Haswell and introduction of instructions in the processor codenamed Broadwell
First public SDE release for OS X, 10.6 and 10.7. See additional information on the main Intel SDE web page for required permissions.
HSW's RTM mode is supported with the "-rtm-mode full" option. This feature is very new and the Intel SDE implementation might be a little unstable.
Completely new mechanism for handling of CPUID. CPUID values now come from an input file.
SDE's -chip-check feature checks to make sure instructions are valid for the specified chip. See "sde -help" for the various chip options.
Exception handling fixes
Haswell BMI emulation fixes, including flags output.
Debugtrace multithreading safety improvements
Fixed Mix top-blocks sorting issues. Mix also has better support for allocating stats to overlapping blocks.
Mix default blocks size is now 1500 instructions to avoid fragmenting large hot blocks.
XED now can emit "dot" graphs for specified regions: path-to-sde-kit/xed -i SOMEEXE -as 0x40316b -ae 0x4031b3 -dot foo.dot; dot -O -Tpdf foo.dot
Mix has a legacy-prefix histogram.
Footprint tool can now collect stats about unique memory pages as well as unique cache lines. The footprint tool is now faster as well.
Improved speed of AVX/SSE transition checker by roughly 12%. See the -ast knob in "sde -thelp".
Fixed some numerical errors in our software emulation of the FMA instruction for denormal numbers.
Various stability improvements from using a newer version of Pin.
Better handling of MXCSR exception status bits for AVX1/2 instructions. We still do not support raising unmasked floating point errors from emulated instructions.
Can now set environment variables from the command line with the -env VAR VALUE option.
The commands for the GDB interface have been updated. See "monitor help sde" when attached as described on the main page. Please use GDB 7.4 or later.
The chip check error message includes the instruction bytes of the offending instruction.
Multiprocess output file handling. Previously you had to supply "-i" to get the process ID inserted into the file name, to keep multiprocess applications from overwriting the common output files. Now we attempt to detect the creation of other processes and add the PID to the file names automatically. The parent / child relationship is recorded in the file name.
Better support for unused bits in the VEX encodings in 32b mode.

2011-12-15 version 4.46

Linux* 3.x is supported
Better support for running on Intel® AVX-enabled hosts
All output files now begin "sde-" and end with ".txt" by default
Mix is faster and does more analysis of SIMD operations
Mix has line number support for the top blocks when the information is available in the application
The -ptr-chk option now checks the memory references of gather operations
Fixed a file descriptor leak when exec'ing thousands of threads on Linux*.
Misc other stability improvements.

2011-07-01 version 4.29

Support for the Haswell new instructions in the Intel AVX programmers reference version 11.
Mix now includes category and instruction length histograms automatically so the corresponding knobs were removed.
Many other changes

2010-12-23 version 3.89 (Linux* only)

Fixed runtime libraries. Version 3.88 accidentally included runtime libraries that require a newer version of glibc than is present on older systems (like RHEL4).

2010-12-21 version 3.88

Support for the post-32nm processor instructions for the processor codenamed Ivy Bridge in the 008 revision of the Intel AVX programmers reference document
Many stability improvements
"sde -thelp" goes to stdout, not stderr
mix has a "-demangle 0" option to turn off demangling
xed disassembler handles uninitialized code sections in windows binaries
xed supports dwarf line number information with the -line knob on Linux*.
mix has improved memory efficiency
To debug on Linux*, you no longer need the -avx-gdb knob but you must use gdb 7.2 or later which supports a new XML remote-debug protocol.

2010-03-11 version 3.09

When pin or sde crashes due to bugs in user applications, the output of the circular buffer used for -itrace-execute (etc.) was not being dumped to disk. It is now.
Fixed circular buffer used for -itrace-execute and -itrace-execute-emulate. It was not initializing the circular buffer when -itrace-lines was used and would just crash immediately. In addition to *actually* making the feature work, I sped it up immensely by reusing allocated string buffers.
Fixed 14 scalar Intel AVX instructions that were referencing too much memory (128b instead of 32b or 64b).
Made the xsave emulator be enabled all the time even when xsave is present on the hardware. One can disable it with '-xsave 0'.
All output log / stats file names now end in .txt by default.
Added a descriptive header to the top of the Intel AVX/Intel SSE transition output file.
debugtrace now prints mmx (and x87) register values
vmaskmov* instructions are now implemented in a thread-safe way.
vpmov[sz]x instructions now correctly reference less memory to avoid extra page accesses.
New memory pointer checker. This option checks all memory references for accessibility before the user application program is allowed to access memory. There is also a null pointer checker which previously would only check Intel AVX instructions. The null checker writes to stderr (if accessible) and to a file sde-null-check.out.txt. The pointer checker writes to stderr (if accessible) and to a file sde-ptr-check.out.txt. The new knobs are: -null-check and -ptr-check
Enforcing VL=128 on any Intel AVX scalar instructions.
Fixes for the -no-avx and -no-aes knobs in the sde driver
xed: many corner case bugs fixed after yet another validation review

2010-02-08 version 3.00

Changed output files to have .txt suffix.
debugtrace prints x87 and mmx registers
thread-safety fix for vmaskmov* instructions
reduced amount of memory referenced by vpmov[sz]* instructions.
New memory pointer checker (See -ptr-check and -null-check knobs)
Added VL=128 requirement for Intel AVX scalar instructions.
Fixed knobs -no-avx and -no-aes in the sde front end driver

2009-12-31 version 2.94
Major update.

Better support for recent Linux* distributions, like Ubuntu* 9.10.
Better support for debugging with GDB on Linux*.
Using GDB 7.0.50, and "sde -debug -avx-gdb -- yourapp", gdb can directly obtain Intel AVX register values without requiring "monitor yreg N" or "monitor yregs" commands.
Windows version supports latest dbghelp.dll 6.11.1.404
Fixes for paths with spaces
Using Pin's "safecopy" mechanism to access user memory
Spelling fixes
Tool arguments grouped more sensibly; See the output of "sde -thelp"
Support for Intel AVX unmasked zero divide exceptions on Windows
Intel AVX/Intel SSE transition tracing feature with -ast-trace knob
Intel AVX/Intel SSE transition checker emits previous block information
CPUID leaf-zero emulation support
Alignment checker upgrades
XED disassembler supports windows debugging symbols (via dbghelp.dll)
Fix for NaN case in Intel® SSE4.1 roundss on Linux* only
Fix for Intel® SSE4 PEXTRW gpr,xmm
More CPUID feature knobs for Intel® SSE technologies
Fix for a case in the emulation of single-precision FMA that affected accuracy
Support for FZ and DAZ in FMA routines
Data watch point support
Fix for MXCSR.OE and IE for vcomiss/vucomiss on NaN inputs
New chip-check feature to restrict instructions to specific chips. See "sde -thelp"
Fast icounting feature (faster than using mix)
Fixes for NaN issues on Windows with sqrt, mul, div, sub and cmp - it was quieting SNaNs.
Upgraded Pin can execute applications that contain illegal instructions; an application-installed handler will be invoked.
New -itrace* knobs
Circular buffer support in debugtrace

2009-01-30 version 1.70

Added VPCLMULQDQ

2009-01-09 version 1.61

Synchronizing with Intel AVX architecture update.

New 3-operand FMA instructions, removed VPERMIL2{PS,PD}, miscellaneous bug fixes.

New footprint feature.

Rearranged mix output, added function summaries.

New version of dbghelp.dll required for windows (See the FAQ).

2008-08-10 version 1.13

Initial Release

↧

Exploring Intel® Transactional Synchronization Extensions with Intel® Software Development Emulator

November 6, 2012, 6:12 am

Development Environment

This experiment requires a recent version (5.31 or later) of the Intel® Software Development Emulator (Intel® SDE) and a compiler that can generate RTM instructions (via intrinsics or direct machine code). Performance measurements taken with Intel SDE emulating RTM are of limited value, because emulating transactional memory in software instead of running on real hardware carries a huge overhead. As you will see later, however, Intel SDE can already demonstrate important points about RTM usage for concurrency library developers and application programmers.

The Test

To simplify the operations on the accounts I wanted to implement an easy-to-use C++ wrapper for protecting the current C++ scope from unsafe concurrent access to the data:


{
        std::cout << "open new account" << std::endl;
        TransactionScope guard; // protect everything in this scope
        Accounts.push_back(0);
}

{
        std::cout << "open new account" << std::endl;
        TransactionScope guard; // protect everything in this scope
        Accounts.push_back(0);
}

{
        std::cout << "put 100 units into account 0" << std::endl;
        TransactionScope guard; // protect everything in this scope
        Accounts[0] += 100; // atomic update due to RTM
}

{
        std::cout << "transfer 10 units from account 0 to account 1 atomically!" << std::endl;
        TransactionScope guard; // protect everything in this scope
        Accounts[0] -= 10;
        Accounts[1] += 10;
}

{
        std::cout << "atomically draw 10 units from account 0 if there is enough money" << std::endl;
        TransactionScope guard; // protect everything in this scope
        if(Accounts[0] >= 10) Accounts[0] -= 10;
}

{
        std::cout << "add 1000 empty accounts atomically" << std::endl;
        TransactionScope guard; // protect everything in this scope
        Accounts.resize(Accounts.size() + 1000, 0);
}


class TransactionScope
{
        SimpleSpinLock & lock;
        TransactionScope(); // forbidden
public:
        TransactionScope(SimpleSpinLock & lock_): lock(lock_) { lock.lock(); }
        ~TransactionScope() { lock.unlock(); }
};
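
The listing above uses SimpleSpinLock without showing it. A minimal sketch of a lock with the same lock()/unlock()/isLocked() interface follows, assuming C++11 std::atomic is available. Only the three method names come from the article; the implementation is an illustration, not necessarily the author's code (the complete source is in the attachment).

```cpp
#include <atomic>

// Hypothetical stand-in for the article's SimpleSpinLock.
// Only the lock()/unlock()/isLocked() names are taken from the article.
class SimpleSpinLock
{
        std::atomic<bool> locked;
public:
        SimpleSpinLock() : locked(false) {}
        SimpleSpinLock(const SimpleSpinLock &) = delete;
        SimpleSpinLock & operator=(const SimpleSpinLock &) = delete;

        void lock()
        {
                // exchange() returns the previous value: we own the lock
                // only if that previous value was false
                while(locked.exchange(true, std::memory_order_acquire))
                {
                        // spin on a plain load so waiting threads do not
                        // keep writing to the shared cache line
                        while(locked.load(std::memory_order_relaxed)) { }
                }
        }

        void unlock() { locked.store(false, std::memory_order_release); }

        // read-only query, used later by the RTM fall-back path
        bool isLocked() const { return locked.load(std::memory_order_acquire); }
};
```

A read-only isLocked() matters for the RTM versions later in the article: reading the lock word inside a transaction places it in the transaction's read set, so a thread that takes the lock afterwards causes the transaction to abort.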

Implementing and Testing with RTM

A naive RTM implementation for TransactionScope (handling both read/lookup and write/update accesses transparently) would be (changed lines are marked with █):


class TransactionScope
{
public:
        TransactionScope()
        {
█               int nretries = 0;
█               while(1)
█               {
█                       ++nretries;
█                       unsigned status = _xbegin();
█                       if(status == _XBEGIN_STARTED) return; // successful start
█                       // abort handler
█                       std::cout << "DEBUG: Transaction aborted " << nretries <<
█                          " time(s) with the status " << status << std::endl;
█               }
        }
█       ~TransactionScope() { _xend(); }
};

I have successfully compiled this code and tried to run it through Intel SDE:


./sde-bdw-external-5.31.0-2012-11-01-win/sde.exe -hsw -rtm-mode full -- ./ConsoleApplication1.exe
open new account
DEBUG: Transaction aborted 1 time(s) with the status 0
DEBUG: Transaction aborted 2 time(s) with the status 0
DEBUG: Transaction aborted 3 time(s) with the status 0
DEBUG: Transaction aborted 4 time(s) with the status 0
DEBUG: Transaction aborted 5 time(s) with the status 0
DEBUG: Transaction aborted 6 time(s) with the status 0
DEBUG: Transaction aborted 7 time(s) with the status 0
DEBUG: Transaction aborted 8 time(s) with the status 0
DEBUG: Transaction aborted 9 time(s) with the status 0
DEBUG: Transaction aborted 10 time(s) with the status 0
DEBUG: Transaction aborted 11 time(s) with the status 0
DEBUG: Transaction aborted 12 time(s) with the status 0
DEBUG: Transaction aborted 13 time(s) with the status 0
DEBUG: Transaction aborted 14 time(s) with the status 0
DEBUG: Transaction aborted 15 time(s) with the status 0
DEBUG: Transaction aborted 16 time(s) with the status 0
and so on…

Implementing Fall-Back

Here is our second attempt, which acquires a fall-back spin lock non-transactionally after a specified number of retries.


LONGLONG naborted = 0; // global abort statistics, alternatively use "-rtm_debug_log 2" Intel SDE option

class TransactionScope
{
█       SimpleSpinLock & fallBackLock;
        TransactionScope(); // forbidden
public:
█       TransactionScope(SimpleSpinLock & fallBackLock_, int max_retries = 3) :
█               fallBackLock(fallBackLock_)
        {
                int nretries = 0;
                while(1)
                {
                        ++nretries;
                        unsigned status = _xbegin();
                        if(status == _XBEGIN_STARTED)
                        {
█                               if(!fallBackLock.isLocked())
█                                       return; // successfully started transaction
█                               /* started transaction but someone is executing
█                                  the transaction section non-speculatively (acquired
█                                  the fall-back lock) -> aborting */
█                               _xabort(0xff); // abort with code 0xff
                        }
                        // abort handler
                        InterlockedIncrement64(&naborted); // do abort statistics
                        std::cout << "DEBUG: Transaction aborted " << nretries <<
                              " time(s) with the status " << status << std::endl;
█                       // handle _xabort(0xff) from above
█                       if((status & _XABORT_EXPLICIT) && _XABORT_CODE(status)==0xff
█                            && !(status & _XABORT_NESTED))
█                       {       // wait until the lock is free
█                               while(fallBackLock.isLocked()) _mm_pause();
█                       }
█                       // too many retries, take the fall-back lock
█                       if(nretries >= max_retries) break;
                }
█               fallBackLock.lock();
        }
        ~TransactionScope()
        {
█               if(fallBackLock.isLocked())
█                       fallBackLock.unlock();
█               else
                        _xend();
        }
};

The output looks much better now:


open new account
DEBUG: Transaction aborted 1 time(s) with the status 0
DEBUG: Transaction aborted 2 time(s) with the status 0
DEBUG: Transaction aborted 3 time(s) with the status 0
open new account
put 100 units into account 0
transfer 10 units from account 0 to account 1 atomically!
atomically draw 10 units from account 0 if there is enough money
add 1000 empty accounts atomically

Leveraging RTM Abort Status Bits


 // handle _xabort(0xff) from above
 if((status & _XABORT_EXPLICIT) && _XABORT_CODE(status)==0xff
      && !(status & _XABORT_NESTED))
 {
        while(fallBackLock.isLocked()) _mm_pause(); // wait until lock is free
█} else if(!(status & _XABORT_RETRY)) break; /* take the fall-back lock
    if the retry abort flag is not set */

The output:


open new account
DEBUG: Transaction aborted 1 time(s) with the status 0
open new account
put 100 units into account 0
transfer 10 units from account 0 to account 1 atomically!
atomically draw 10 units from account 0 if there is enough money
add 1000 empty accounts atomically

Now we see that the program makes faster progress by taking the fall-back lock sooner in the case of a “hard” abort.
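
The decision made in the abort handler can be isolated into a small function that is testable without RTM hardware. The sketch below redefines the status bits under hypothetical ABORT_* names that mirror the _XABORT_* macros from <immintrin.h>. The helpers abortCode, describeAbort and nextStep are illustrative and are not part of the article's code.

```cpp
#include <string>

// Hypothetical names; the values mirror the _XABORT_* macros in <immintrin.h>.
const unsigned ABORT_EXPLICIT = 1u << 0; // aborted by _xabort()
const unsigned ABORT_RETRY    = 1u << 1; // a retry may succeed
const unsigned ABORT_CONFLICT = 1u << 2; // memory conflict with another thread
const unsigned ABORT_CAPACITY = 1u << 3; // transactional buffer overflowed
const unsigned ABORT_DEBUG    = 1u << 4; // debug breakpoint was hit
const unsigned ABORT_NESTED   = 1u << 5; // abort happened in a nested transaction

// Same extraction as _XABORT_CODE(status): the top 8 bits carry the _xabort argument.
inline unsigned abortCode(unsigned status) { return (status >> 24) & 0xff; }

// Human-readable list of the flags set in an abort status.
std::string describeAbort(unsigned status)
{
        static const struct { unsigned bit; const char * name; } flags[] = {
                { ABORT_EXPLICIT, "explicit" }, { ABORT_RETRY, "retry" },
                { ABORT_CONFLICT, "conflict" }, { ABORT_CAPACITY, "capacity" },
                { ABORT_DEBUG, "debug" }, { ABORT_NESTED, "nested" } };
        std::string out;
        for(const auto & f : flags)
                if(status & f.bit)
                {
                        if(!out.empty()) out += '|';
                        out += f.name;
                }
        return out.empty() ? std::string("none") : out;
}

enum NextStep { WAIT_FOR_LOCK, RETRY, TAKE_LOCK };

// Mirrors the if / else-if chain of the abort handler shown above.
NextStep nextStep(unsigned status)
{
        if((status & ABORT_EXPLICIT) && abortCode(status) == 0xff
             && !(status & ABORT_NESTED))
                return WAIT_FOR_LOCK; // our own _xabort(0xff): the lock was held
        if(!(status & ABORT_RETRY))
                return TAKE_LOCK;     // "hard" abort: retrying is unlikely to help
        return RETRY;
}
```

A status of 0, as in the output above, has no bits set. The retry flag is therefore clear, and nextStep(0) yields TAKE_LOCK, which matches the single abort line in the log.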

Concurrent Accesses from Several Threads Managed by Intel TSX

After basic debugging the time has come to see the real power of Intel TSX: run two worker threads doing random concurrent updates to the central account data structure:


unsigned __stdcall thread_worker(void * arg)
{
        int thread_nr = (int) arg;
        std::cout << "Thread " << thread_nr << " started." << std::endl;
        // create thread-local TR1 C++ random generator from <random>
        std::tr1::minstd_rand myRand(thread_nr);
        long int loops = 10000;

        while(--loops)
        {
                {
                        TransactionScope guard(globalFallBackLock);
                        // put 100 units into a random account atomically
                        Accounts[myRand() % Accounts.size()] += 100;
                }

                {
                        TransactionScope guard(globalFallBackLock);
                        /* transfer 100 units between random accounts
                           (if there is enough money) atomically */
                        int a = myRand() % Accounts.size();
                        int b = myRand() % Accounts.size();
                        if(Accounts[a] >= 100)
                        {
                                Accounts[a] -= 100;
                                Accounts[b] += 100;
                        }
                }
        }
        std::cout << "Thread " << thread_nr << " finished." << std::endl;
        return 0;
}
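
The worker above uses the Windows __stdcall convention and depends on TSX. As a portable sketch of the same experiment, the following uses std::thread and replaces TransactionScope with a std::mutex guard, an assumption made so that it runs on hosts without TSX. The names ScopeGuard, accountsLock, worker and runWorkers are hypothetical. Because transfers conserve money and every iteration deposits exactly 100 units, the final total is predictable, which gives a simple correctness check.

```cpp
#include <cstddef>
#include <mutex>
#include <random>
#include <thread>
#include <vector>

// Stand-in for the article's TransactionScope: a plain mutex guard.
typedef std::lock_guard<std::mutex> ScopeGuard;

std::vector<long> Accounts;
std::mutex accountsLock; // plays the role of globalFallBackLock

void worker(unsigned seed, int loops)
{
        std::minstd_rand myRand(seed); // thread-local random generator
        for(int i = 0; i < loops; ++i)
        {
                {
                        ScopeGuard guard(accountsLock);
                        // put 100 units into a random account
                        Accounts[myRand() % Accounts.size()] += 100;
                }
                {
                        ScopeGuard guard(accountsLock);
                        // transfer 100 units if there is enough money
                        std::size_t a = myRand() % Accounts.size();
                        std::size_t b = myRand() % Accounts.size();
                        if(Accounts[a] >= 100)
                        {
                                Accounts[a] -= 100;
                                Accounts[b] += 100;
                        }
                }
        }
}

// Runs nthreads workers on 16 empty accounts and returns the final total.
long runWorkers(int nthreads, int loops)
{
        Accounts.assign(16, 0);
        std::vector<std::thread> threads;
        for(int t = 0; t < nthreads; ++t)
                threads.push_back(std::thread(worker, t + 1, loops));
        for(std::size_t t = 0; t < threads.size(); ++t)
                threads[t].join();
        long total = 0;
        for(std::size_t i = 0; i < Accounts.size(); ++i)
                total += Accounts[i];
        return total;
}
```

With two threads and 1000 iterations each, runWorkers(2, 1000) should return 2 * 1000 * 100. A different total would indicate a lost update, that is, a guard that failed to make the account updates atomic.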

Last Words

Roman

(the complete source code is attached to the article)

Attachments:

https://software.intel.com/sites/default/files/blog/335035/exploringinteltsx.cpp
