Sunday, May 26, 2013

Quote



( The INEXACT flag mandated by IEEE Standard 754 for Binary Floating–Point Arithmetic would, if MATLAB granted us access to that flag, help us discriminate between solutions x that are exactly right and those, perhaps arbitrarily wrong, whose residuals vanish by accident.) - http://www.eecs.berkeley.edu/~wkahan/ieee754status/baleful.pdf

Useful commands



leaks
heap
fs_usage
lsof
vm_stat
netstat
tcpdump
sc_usage
otool
xxd
pbcopy
pdfgrep
du -sh /private/var/vm/sleepimage

Friday, May 3, 2013

Not having to specify DYLD_LIBRARY_PATH for your app when using Intel TBB (on Darwin/Mac OS X)

When using Intel Threading Building Blocks,  you have to link against their dynamic library. They do not provide a way for static linking, because when dynamic linking is used, their library has a shared state and they use that to prevent contention from over-threading.

However, the only way they provide to load their library at run-time is by having to provide a path to their library (libtbb.dylib) through the
DYLD_LIBRARY_PATH environment variable. Before your application launches you are going to have to set it in the shell. They provide a handy script to do it for you too. It may be inconvenient but it is doable (especially in non-production environments like scientific experiments).

However, although Valgrind is able to load their library correctly (and therefore run your program), GDB (7.5) can not do it. I am not able to debug my program because GDB says it can not load libtbb and dies.

So I used my Google-fu and found a solution. Just run the following command in the tbb/lib directory:

$ install_name_tool -id /absolute/path/to/libtbb.dylib libtbb.dylib


There you go. The difference is that without doing it, LD linker only puts the library name in the binary. I.e., otool -L shows only libtbb.dylib, no path, no nothing. After the command, LD puts the absolute path, and therefore it gets encoded in the binary; no need for DYLD_LIBRARY_PATH again.

I found the useful information at https://dev.lsstcorp.org/trac/wiki/LinkingDarwin.

Tuesday, April 30, 2013

Segmentation fault

Getting a segmentation fault out of no where after a long execution can be frustrating. One reason is that you usually do not get any information on the source of error. Another reason is that using valgrind or gdb to debug will make the program run even slower until you hit the error, if you hit it again. Frustrating indeed. But not all hope lost. You can modify your program to catch the segmentation fault and generate the needed debugging information for you (only for POSIX-compliant operating systems – that is like every operating system in the world, except Windows).

Now when my program generates a segmentation fault that's what I get as output:

Segmentation fault: Address not mapped to object 
Invalid read of the address 0x0

Stack trace: 
 1: 0x103432c37 <attack_dataset int="" x2b7=""> (/Users/malaggan/Desktop/mac-only-ws/SVDAttack/jni/./svdattack)
 2: 0x103498f8b <main x9b=""> (/Users/malaggan/Desktop/mac-only-ws/SVDAttack/jni/./svdattack)

Line number: 
got symbolicator for svdattack, base address 100000000
attack_dataset(int) (in svdattack) (attack_dataset.cpp:104)

How it is done is fairly simple. When a segmentation fault happens, the operating system sends a signal to the program. The program, in turn, can handle this signal and act based on it, via a signal handler. The signal handler is a function that is called by the operating system, to which it also provide useful information for debugging (and a lot more, like user-space threading, and copy-on-write, and other things). The entire state of the machine, including its general registers, and floating point registers, are recorded and sent to the signal handler. The one register we are interested about is the instruction pointer register, which contains the address of the instruction which caused the segmentation fault. Along with that also goes the address that the program was trying to access, and whether the program was trying to read it or write it. The operating system also gives information about that address; whether it is a totally wrong address (not mapped in the program's address space), or a valid address but the program has no permission to read (or write) it.

The code I wrote to generate the message above goes next (the portion about the stack trace was taken from stackoverflow). It is written in C++11 and the machine-specific part is for Mac.


#include <sys/ucontext.h>
#include <dlfcn.h>
#include <cxxabi.h>
#include <unistd.h>


void print(int sig, siginfo_t *info, void *c)
{
    ucontext_t *context = reinterpret_cast<ucontext_t*>(c);

    // defined in mach/i386/_structs.h
    __darwin_mcontext64* mc = context->uc_mcontext;
    __darwin_x86_thread_state64* ss = &mc->__ss;
    __darwin_x86_exception_state64* es = &mc->__es;
    __darwin_x86_float_state64* fs = &mc->__fs;

    bool write_fault = (es->__err & 2) ? true : false;

    fprintf(stderr,"\n%s: %s \n"
        "Invalid %s of the address %p\n",
        sys_siglist[sig],
        (info->si_code == SEGV_MAPERR) ? "Address not mapped to object" : "Invalid permissions for mapped object",
        (write_fault)?"write":"read",
        es->__faultvaddr
    );

    // http://stackoverflow.com/questions/5397041/getting-the-saved-instruction-pointer-address-from-a-signal-handler
    void **ip = reinterpret_cast<void**>(ss->__rip);
    void **bp = reinterpret_cast<void**>(ss->__rbp);

    fprintf(stderr,"\nStack trace: \n");

    Dl_info dlinfo;
    int f = 0;
    while(bp && ip) 
    {
        if(!dladdr(ip, &dlinfo))
            break; 

        const char *symname = dlinfo.dli_sname;
 
        int status;
        char * tmp = abi::__cxa_demangle(symname, nullptr, 0, &status);

        if (status == 0 && tmp)
            symname = tmp;

        fprintf(stderr,"% 2d: %p <%s+0x%lx> (%s)\n",
                 ++f,
                 ip,
                 symname,
                 reinterpret_cast<unsigned long>>ip) - reinterpret_cast<unsigned long>(dlinfo.dli_saddr),
                 dlinfo.dli_fname);

        if (tmp)
            free(tmp);

        if(dlinfo.dli_sname && std::string(dlinfo.dli_sname) == "main")
            break;

        ip = reinterpret_cast<void**>(bp[1]);
        bp = reinterpret_cast<void**>(bp[0]);
    }

    fprintf(stderr,"\nLine number: \n"); 
    
    auto load_addr = dlinfo.dli_fbase;

    char str_pid[100]={0},
         str_rip[100]={0}, 
         str_load_addr[100]={0};

    sprintf(str_pid, "%d", getpid());
    sprintf(str_rip, "%p", ss->__rip);
    sprintf(str_load_addr, "%p", load_addr);

    char *argv[]={"atos","-o", "./svdattack", "-l", str_load_addr, str_rip, nullptr};
    char *envp[] = { nullptr };
    execve("/usr/bin/atos", argv, envp);

    fprintf(stderr,"\n\n");

    std::exit(1);
}



/*this is how the signal was registered*/
struct sigaction sa;
sigemptyset (&sa.sa_mask);
sa.sa_flags = SA_SIGINFO;
sa.sa_sigaction = print;

sigaction(SIGSEGV, &sa, nullptr);
sigaction(SIGBUS, &sa, nullptr);


Monday, April 29, 2013

OpenMP provides a keyword called THREADPRIVATE that allow thread-local storage for certain variables. That means that each thread will have its own copy of that variable. For instance:

int x = 0;
#pragma omp threadprivate(x)
#pragma omp parallel
{
    x++;   
    // here x will have value 1, for all threads, regardless of how many threads there are.
    // the next line will print n different addresses (n is the number of threads) ! 
    // there is actually n different variables at n different memory locations.
    cout << hex << &x << endl; 
}
// here x will be 0, untouched by the threads.

C++11 provides a standard way to do this, the keyword thread_local. Now things get complicated if the thread local variable has non-trivial constructors or destructors. For example:

struct A {
    int x;
    A(int _x) : x(_x) { x++; /* then print x */ }
    void some_method() { /* do something */ }
    ~A() { x--; /* then print x */ } 
};

thread_local A a(3);

int main() {
    #pragma omp parallel
    {
        a.some_method();
    }
}

If the number of threads is 2, then program will print:

4
4
3
3

Each thread executes the constructor of (with the parameter 3 !) when it starts, and executes the destructor when it terminates. This behavior is specified in the standard. However, that is not what happens with the current implementation of OpenMP. What actually happens is that only the constructors are executed. I looked at the code and it appears that they use some sort of thread pool. However, even if they do, it should be expected that when the program terminates, all the threads are also terminated, even if they are in a thread pool. 

When I tired an equivalent code with pthreads, the correct behavior happened. So I compared both the disassemblies, and it appears that there is no special difference, hinting that the problem is with OpenMP binaries instead of the compiler, which is weird. I found a mention [1] in the GCC implementation of OpenMP (libgomp), and it appears they do not support dynamic construction of thread local storage (although they use pthreads as their implementation). 

The destruction of the threads depend on a call to __cxa_thread_atexit [2], and for some reason this is called with pthreads but not with OpenMP. I do not know why, but if the behavior of thread_local depends on external libraries, beyond the compiler itself, then this is a serious threat to its adoption. 

Conclusion: if you plan to use thread_local with classes having non-trivial destructors, don't (yet).

Saturday, April 27, 2013

Templates in C++ do not hinder optimizations, such as auto-vectorization, contrary to what I expected.

Example:


// /dev/bin/g++ -m64 -S -std=c++11 -march=native -O3 test.cpp 
#include 

constexpr int SIZE = (1L << 16);
/*
void test1(float *__restrict__ a, float *__restrict__ b)
{
  int i;

  float *x = (float*)__builtin_assume_aligned(a, 16);
  float *y = (float*)__builtin_assume_aligned(b, 16);

  for (i = 0; i < SIZE; i++)
    {
      x[i] += y[i];
    }
}
*/

template
void test1(T *__restrict__ a, T *__restrict__ b)
{
  int i;

  T *x = (T*)__builtin_assume_aligned(a, 16);
  T *y = (T*)__builtin_assume_aligned(b, 16);

  for (i = 0; i < SIZE; i++)
    {
      x[i] += y[i];
    }
}

template void test1(float * __restrict__ a, float * __restrict__ b);
template void test1(double * __restrict__ a, double * __restrict__ b);
template void test1(int * __restrict__ a, int * __restrict__ b);
template void test1(char * __restrict__ a, char * __restrict__ b);




The generated assembly file (by GCC 4.9) will be:
 .text
 .align 4,0x90
 .globl void test1(float*, float*)
void test1(float*, float*):
LFB237:
 xorl %eax, %eax
 .align 4,0x90
L2:
 movaps (%rdi,%rax), %xmm0
 addps (%rsi,%rax), %xmm0
 movaps %xmm0, (%rdi,%rax)
 addq $16, %rax
 cmpq $262144, %rax
 jne L2
 rep; ret
LFE237:
 .align 4,0x90
 .globl void test1(double*, double*)
void test1(double*, double*):
LFB238:
 xorl %eax, %eax
 .align 4,0x90
L6:
 movapd (%rdi,%rax), %xmm0
 addpd (%rsi,%rax), %xmm0
 movapd %xmm0, (%rdi,%rax)
 addq $16, %rax
 cmpq $524288, %rax
 jne L6
 rep; ret
LFE238:
 .align 4,0x90
 .globl void test1(int*, int*)
void test1(int*, int*):
LFB239:
 xorl %eax, %eax
 .align 4,0x90
L9:
 movdqa (%rdi,%rax), %xmm0
 paddd (%rsi,%rax), %xmm0
 movdqa %xmm0, (%rdi,%rax)
 addq $16, %rax
 cmpq $262144, %rax
 jne L9
 rep; ret
LFE239:
 .align 4,0x90
 .globl void test1(char*, char*)
void test1(char*, char*):
LFB240:
 xorl %eax, %eax
 .align 4,0x90
L12:
 movdqa (%rsi,%rax), %xmm0
 paddb (%rdi,%rax), %xmm0
 movdqa %xmm0, (%rdi,%rax)
 addq $16, %rax
 cmpq $65536, %rax
 jne L12
 rep; ret
LFE240:
 .section __TEXT,__eh_frame,coalesced,no_toc+strip_static_syms+live_support
EH_frame1:
 .set L$set$0,LECIE1-LSCIE1
 .long L$set$0
LSCIE1:
 .long 0
 .byte 0x1
 .ascii "zR\0"
 .byte 0x1
 .byte 0x78
 .byte 0x10
 .byte 0x1
 .byte 0x10
 .byte 0xc
 .byte 0x7
 .byte 0x8
 .byte 0x90
 .byte 0x1
 .align 3
LECIE1:
LSFDE1:
 .set L$set$1,LEFDE1-LASFDE1
 .long L$set$1
LASFDE1:
 .long LASFDE1-EH_frame1
 .quad LFB237-.
 .set L$set$2,LFE237-LFB237
 .quad L$set$2
 .byte 0
 .align 3
LEFDE1:
LSFDE3:
 .set L$set$3,LEFDE3-LASFDE3
 .long L$set$3
LASFDE3:
 .long LASFDE3-EH_frame1
 .quad LFB238-.
 .set L$set$4,LFE238-LFB238
 .quad L$set$4
 .byte 0
 .align 3
LEFDE3:
LSFDE5:
 .set L$set$5,LEFDE5-LASFDE5
 .long L$set$5
LASFDE5:
 .long LASFDE5-EH_frame1
 .quad LFB239-.
 .set L$set$6,LFE239-LFB239
 .quad L$set$6
 .byte 0
 .align 3
LEFDE5:
LSFDE7:
 .set L$set$7,LEFDE7-LASFDE7
 .long L$set$7
LASFDE7:
 .long LASFDE7-EH_frame1
 .quad LFB240-.
 .set L$set$8,LFE240-LFB240
 .quad L$set$8
 .byte 0
 .align 3
LEFDE7:
 .constructor
 .destructor
 .align 1
 .subsections_via_symbols

Wednesday, April 17, 2013

Accessing private members in C++


#define private public
#include "A.h"
#undef private
 
void accessPrivateState()
{
    A a;
    a.x = 3;
}

Source: http://www.drdobbs.com/cpp/testing-complex-c-systems/240147275?pgno=2

Monday, April 15, 2013

تعلم الرياضيات

حأقول رأيي بس خليك دايما عارف أن كل واحد له طريقته وكلامي مجرد اللي أشتغل معايا وممكن يختلف بأختلاف الفرد. في رأيي أن في حاجتين بيساعدوا كبداية: أول حاجة قراءة النقد لطرق التدريس الحالية، زي A mathematician's Lament، وما شابهها، عشان الواحد يطهّر تفكيره من الركام اللي تراكم بعد كل السنين ديه. وتاني حاجة قراءة نوعين مهمين من المقالات الرياضية: تاريخ الرياضيات (http://en.wikipedia.org/wiki/History_of_mathematics) والرياضيات الpopular (http://en.wikipedia.org/wiki/Popular_mathematics) (ممكن تبدأ تدور من المراجع المذكورة في الصفحات ديه مثلاً) . تاريخ الرياضيات لأنه مثلا اللي عمل حاجة معينة في الرياضيات من ألف سنة كان خلفيته قريبة جداً من خلفيتنا، (لأن الرياضيات اللي درسها تقريباً هي هي نفس الرياضيات اللي درسناها!!!!!)، وسهل نشوف أيه المشكلة اللي قابلتهم وفكروا أزاي في حلها، وهكذا، لحد ما نوصل لليوم الحالي. أما الرياضيات الشعبية (مش عارف أترجمها أزاي :D)، فتخدم هدفين: أول هدف هو تشغيل المخ في حاجات مش محتاجة خلفية كبيرة (وعادة بتبقى بعيدة عن الرموز والحاجات ديه فبتعبر عن روح الرياضيات أكتر!)، وفي علماء رياضيات كتير كانت ديه بدايتهم كأطفال، وتاني هدف مش الاستمتاع في حد ذاته، ولكن تدريب النفس على أنها تعمل ارتباط شرطي بين الرياضيات وبين المتعة، بحيث لما الواحد يشوف رياضيات بعد كدة حتى لو متقدمة أو صعبة حيحس ببعض المتعة فتكون عون له وتصبيره على ما يفهم. زائد الاستمتاع طبعاً.
الفيديو ده (https://www.facebook.com/photo.php?v=4565631532665) مثال للحاجات اللي حتلاقيها في تاريخ الرياضيات !
أهم حاجة بلا استثناء أنك متعملش حاجة مضايقاك، يعني تقرا اللي بيشدك بس، ولو لقيت مقالة معينة خانقاك سيبها! عن علماء رياضيات كتير بيقولك أيه: القارئ بيستمتع بالمواضيع اللي مخه جاهز أنه يفهمها. فعشان تعرف تقرا أيه، جرب حاجات كتير واللي بيشدك يبقى هو اللي المفروض تكمل فيه. وتاني حاجة عشان لو قريت حاجات بتخنقك فده حيطور ارتباط شرطي بين الرياضيات وبين الخنقة، فتبقى مخنوق من الرياضيات كدة علطول من غير سبب، وعلى أيه :)  !
وعامة لو ملقتش أي حاجة بتشدك خالص عادي برضه، يبقى مش شرط تبقى الرياضيات مجالك.. ممكن نفس الأفكار تطبق في أي مجال تاني يشدك!
ومش شرط لو قريت حاجة ومفهمتش أنك تسيب المقالة كلها، ممكن تبقى المقدمة مملة أو ملهاش علاقة بأهتماماتك، فممكن تسيب الجزء الممل وتقرا من الآخر، أنا بأعمل كدة كتير، والأسلوب ده بينفع جداً في الرياضيات لأن غالبا في المقدمة بيبقوا مخبيين حاجات عن القارئ، وبيبقى الجمال كله في الآخر فعلاً!
حاجة تانية مهمة.... متفكرش في اللي بتقراه أنه "رياضيات" أو "كيمياء" أو "فيزياء" أو "علم نفس" أو اي حاجة. التقسيم ده مصطنع، عشان المدارس والجامعات أضطروا يقسموهم بس مش أكتر لأسباب إدارية. العلم كله حاجة واحدة، وهدفه فهم الحياة والتفكير والإنسان، وإعمار الكون والتعرف على الخالق. وفي الواقع كل المجالات ديه مهما بُعدت فهي متصلة ببعضها البعض. مفيش فرع منعزل عن الباقي. مثلاً كنت بأقرا مقالة من اسبوع عن مزيج بين الهندسة التفاضلية (فرع من الرياضيات) وعلم التشريح (بتاع الطب ده)!!! أي حاجة تقراها فكر فيها كقطعة من العلم، مش مهم هي رياضيات ولا حاجة تانية، في الآخر أنت بتتعلم عن نفسك أكتر من عن الحاجة اللي بتقراها. القراءة لإرواء الفضول وإثراء التفكير، ملهاش علاقة بمجال معين. وده يعود بينا لنقطة قراءة اللي يشدك، كأن العلم بستان وبتقطف الزهور اللي عاجباك.

Saturday, February 9, 2013

The future of C++, as I see it

The C++ syntax is too terse with no reason; very simple ideas which can be written down simply are forced to be put in an ugly and incomprehensible form. 
Concepts (a feature to be included in the next standard), is just a hack to put some meaningful compile-time type checking on templates, BEFORE they are instantiated.
The funny things is that they go again and introduce "late_check" to allow bypassing the "BEFORE" part and coming back to the original template type-checking "AT" instantiation. 
Going back and forth, several times, just means they missed the target by a few centimeters (say, to the left) and need to "overshoot" to the other direction (the right) to hit the target. The problem is that they still miss it, again! So they overshoot to the left again! And so on. Even if they eventually converge (I doubt that), C++ will be overloaded with irrelevant historical hacks that no one knows why they were designed in this weird way.
One part of the problem is the unnecessarily terse syntax, but another part is also the perceived need to support legacy code. C++ will be much better if they stopped trying to be backwards compatible (the D language tries to do this, but is not widely adopted yet). The way things are done currently, will turn C++ into a collection of ad-hoc hacks that need deep awareness of the historical reasons to understand: reasons which will not be relevant a few years later, if not already the case now. 
People stick to C++ because of its pass-by-value semantics, const-correctness, compile-time generics, interoperability with C, and wide support, but they are going to abandon it to the next language with same features and better syntax and cleaner slate (with no historical burden). D turns out to be such a candidate (see a comparison).