Lost Interrupt: 2010

Thursday, December 30, 2010

x86 encoding

Dammit, x86. There are FPU escape codes, 1-byte, 2-bytes and 3-bytes opcodes, prefixes to specify which command does specific opcode refer to, sometimes this define a group of commands and specific command is determined using reg/opc field in ModR/M byte (so these are only instructions with one argument or those which have an immediate as a second argument), but this isn't an end: since first argument can refer either to memory or to register some commands sharing the same opcode and reg/opc are distinguished using this information (mod field of ModR/M). And the hell is there are two pairs of operations in a regular opcodes tables which one can separate only by the type (memory or register) of the second argument:

0x0f 0x12 movlps VqMq or movhlps VqUq
0x0f 0x16 movhps VqMq or movlhps VqUq

arm carry flag vs x86 carry flag

For add operation arm_cf and x86_cf are the same. For sub they differ. This is because in arm sub rd, rn, rm is equivalent to add rd, rn, -rm. Carry flag is set to 1 after add operation if the sum of operands is greater than 2^32. So on arm carry flag after sub is set if rn + (2^32 - rm) > 2^32 equivalently rn > rm. But on x86 carry would be set for sub if rn < rm which is exactly the opposite.

intel x86 far ret

if after return stack size is 16 then esp[31:16] (higher half of esp) is not touched
if after return stack size is 32 and before return operand size is 16 then read 16-bit sp is zero extended to 32-bit esp

Saturday, December 25, 2010

Understanding CRC

There are some documents on the internet describing CRC32 and related algorithms. A Painless Guide to CRC Error Detection Algorithms by Ross Williams explains CRC as an engineer would do explicitly describing what is happening with bits. It also covers reasons of great variety of CRC types. A mathematician would probably be more happy reading an article The iSCSI CRC32C Digest and the Simultaneous Multiply and Divide Algorithm by Luben Tuikov and Vicente Cavanna. CRC is presented rigorously as a division over F_2 field. The second article can lead to an idea how to split CRC calculation over different threads. Same thoughts appear in article Fast Parallel CRC Algorithm and Implementation on a Configurable Processor by H. Michael Ji and Earl Killian.

Thursday, November 25, 2010

on even path problem

There is a natural problem: find a simple even s-t path in an undirected graph.

There is an algorithm in [1]. The following was proposed by M. Babenko as an attempt to get a linear time solution: find 2-connected components (can be done by finding articulation points, see [2] or Chapter 2, exercise 16 in [3]), then each block is either biparate or contains an odd cycle which can be used to produce either odd or even path connecting articulation points.

For the shortest even s-t path problem there is a polynomial algorithm described in [4], page 274: first an auxilary graph is constructed (take two copies of G: G_s, G_t, remove t from G_s, remove s from G_t, connect all other vertices in G_s with their copies in G_t using edges of weight zero, all edges corresponding to edges in G have weight one) then a minimum weighted perfect matching is to be found on auxilary graph. Edges of weight one correspond to a shortest simple even path in G. In [4] the origin of this algorithm is referensed to GLS88 (probably this is [5] but I can't check this since google books hides reference page). Similar algorithm for a shortest odd path is presented in [6] (pp 247-248) and is said to be due to Edmonds (unpublished result), the reader is then asked to reduce an even path problem to matching as an exercise. An O(n^3) algorithm for minimum weight perfect matching problem is described in [3], chapter 11. In abstract to [7] it is pointed out that one can suffice by finding only by finding a single shortest augmenting path (it seems with respect to unweighted matching consisting only of edges connecting different copies of vertexes).

Just as a note a shortest even s-t path (not necessarily simple) can easily be found in linear time: take two copies of V: V',V''. For each edge (u, v) in new graph there are cross edges: (u',v'') and (u'',v'). Then find a shortest s'-t' path.

[1] Andrea S. Lapaugh, Christos H. Papadimitriou. The even-path problem for graphs and digraphs. Networks Volume 14, Issue 4, pages 507–513, Winter 1984. http://onlinelibrary.wiley.com/doi/10.1002/net.3230140403/abstract

[2] http://www.boost.org/doc/libs/1_41_0/libs/graph/doc/biconnected_components.html

[3] Bernhard Korte, Jens Vygen. Combinatorial Optimization Theory and Algorithms 4th ed., 2008

[4] Naum Zuselevich Shor, Nondifferentiable Optimization and Polynomial Problems (Nonconvex Optimization and Its Applications), 1998

[5] Grotschel M., Lovasz L., Schrijver A. Geometric Algorithms and Combinatorial Optimization. 1988
[6] L. Lovasz, M. D. Plummer. Matching Theory, 1986
[7] Ulrich Derigs. An efficient Dijkstra-like labeling method for computing shortest odd/even paths // Information Processing Letters Volume 21, Issue 5, 18 November 1985, Pages 253-258

Wednesday, September 29, 2010

why division is not a shift

For signed values division by a power of two is not equal to shift. This can be easily showed mathematically: suppose x is negative then x>>1 is ceil((x+1)/2)) - 1 = ceil(x/2 + 1/2) - 1 = floor(x/2) which is not the x/2 = ceil(x/2).

Friday, September 10, 2010

llvm + qemu

Seems like everybody who tried to use llvm in qemu came to conlcusion that llvm should be invoked only for hot regions but noone actually tried this.

here are some llvm+qemu attempts:

http://lists.cs.uiuc.edu/pipermail/llvmdev/2008-April/013689.html a story about Google Summer of Code 2007.

http://code.google.com/p/llvm-qemu

http://infoscience.epfl.ch/record/149975 technical report (huge performance degradation)

http://gitorious.org/qemu-async-llvm probably is a bachelor work: Using the LLVM Compiler Infrastructure for Optimised, Asynchronous Dynamic Translation in Qemu. Andrew Jeffery. 2009

Some articles on binary translation

On Static Binary Translation and Optimization for ARM based Applications. Jiunn-Yeu Chen, Wuu Yang, Tzu-Han Hung, Charlie Su, Wei Chung Hsu. 2008

ARM instructions are translated into MIPS. Some statistics of arm instructions in EEMBC benchmark is given. Nearly all optimizations described in the article are for flags and conditional operations. Experemental results are given. Most performance growth (35%) is caused by cross-block selective flag updates (each flag is updated only if it will be used). Common optimizations (dead code elimination, common sub-expression elimination, constant folding, constant propagation, copy propagation) give 5%. Since it is static translator the translation time itself is not benchmarked.

Binary Translation Using Peephole Superoptimizers. Alex Aiken, Sorav Bansal. 2008

Another static binary translator. They are using some terrific peephole rules generator (probably described in other articles http://www.cse.iitd.ernet.in/~sbansal/publications.html) which allows them to perform binary translation based only on peephole rules.

Walkabout - A Retargetable Dynamic Binary Translation Framework. Cristina Cifuentes, Brian Lewis, David Ung. 2002

Interesting part is they try to use some languages and tools to generate front-ends and back-ends. Some general optimizations (mostly at basic blocks level: code layout, chaining) are discussed.

Specification-Driven Dynamic Binary Translation. Jens Troger. 2004

A dissertation about Yirr-Ma binary translator (walkaboot extension)

Retargetable and Reconfigurable Software Dynamic Translation. K. Scott, N. Kumar, S. Velusamy, B. Childers, J. W. Davidson, and M. L. Soffa. 2003

An article about Strata binary translator. They try to evaluate performance overhead of dynamic binary translation (which includes context save/restore) when translating to the same instruction set (sparc, mips, x86).

Jello: a retargetable Just-In-Time compiler for LLVM bytecode. Chris Lattner, Misha Brukman, Brian Gaeke. 2002

An article about early days of lli utility from LLVM. Some general target-independent passes are described (register allocaton, live variable analysis) as well as some x86-related optimizations (fpu, stack).