Fast Base 64

I found a nice paper online to encode/decode Base 64 data using AVX instructions. Check out the download area to download the package.

mrMath Update

A new rolling median (like moving average) function was introduced. Fixed many things and now the package supports an Pascal only path that allows to use the library for ARM targets.

mrMath updates

I began the transition to more lapack based Eigenvalue functions. New is the Hessenberg decomposition as a start. And just for fun I implemented a new random number generator method based on the ChaCha (salsa20) stream cipher optimized by SSE and AVX instructions. The latter is actually able to create 2 instead of one ChaCha matrix in one step.

mrAI updates

There are now a few bugfixes for the multithreaded decission Stump and SVM classfier.

mrMath updates

The mrMath package now includes a unit for polynom root finding and function root finding.
in addition the check for CPU features has changed.

mrMath new subspace method

There is now a new analysis method called Singular Spectrum Analysis available for the mrMath library.
In addition there are new row and column based selectors available that makes subranges easier to create.
A bug in scaling matrices as a prestep for calculating eigenvalues was fixed. This happend on matrices with Rank 0.

mrAI small updates

Thanks to "zwjchin163" who made me aware of a typing I changed some unit names

new optimized matrix initialization

based on optimized SSE and AVX code the initialization is now twice as fast as the standard loop.

multithreaded C4.5

the C4.5 algorithm has been extended by multi core features and a new stopping criteria that allows to stop if the branch has lower than a certain limit of examples. There is also a big speedup due to changes in the algorithm

new thread pool

For windows exists now a new thread pool based on IO Completion port which seems to be a bit faster and has less overhead than the conventional thread pools.

In addition there is now a very basic Kernel PCA algorithm available in the library.

some updates

A few mminor updates where more interfaces and less direct objects are used to reduce the amount of code. And... added a few minor auxilary functions.

mrMath Matrix mult

A new threading strategy on matrix multiplicaton may have an acceleration of up to 3 times (30% on larger matrices) on my CoreI7. Note that the mult may be slower on hyperthreading cpu's (my older I7) but generally it's faster.

mrMath 32bit assembly revisited

Completely rewrote the 32 bit part (sse, avx, fma) for the standard register(delphi)/assembler(fpc) calling conventions. Spared a few cycles here and there ;). The testing for AVX/FMA OS support was not there. That was introduced with Win7 SP1 -> OS's below that would create troubles.
And as a last step I changed the directory structures so the source is on the base directory, packages in designated package directory.

mrMath updates unix x64 calling conventions

I struggled quite much to get the proper registers in place so it works with the "assembler" directive in x64 too (sometimes I forgot to directly use the register but instead used the variable name. No problem on windows but on unix it's translated to the "wrong" anticipated register). And woohooo 2^8 commits ;) .

mrMath updates part 2

I started changing the calling convention for the assembler routins and started with the init and swap routines.

mrMath updates:

The x64 assembler routines have been updated for FPC to ensure it's treated as "assembler" procedures. When stack checking was set on FPC sometimes the compiler inserted code that destroyed the content of the registers that were used in the routine as paramters.
A base simple convolution method was introduced.

updates on the optimized function

A recurring procedure that occurs in many of these algorithms is to remove (or add) the mean or any other vector from a matrix. So I took the time to implement an assembler optimized version of adding or substracting a vector row or columnwise from a matrix. A speedup of 3 was achieved in large matrices.

introdocuded a more modern geometric median algorithm

In perfect "Ginglish": "The Weiszfeld geometric median has come into years". So I looked for a more modern approach and found one that has a steering parameter one can use for a per problem based "learning param". By tuning this param one can reduce the numbers iterations used to find the geometric median quite a bit. In addition the algorithm does not suffer from the regularization problems that the Weiszfeld algorithm had (hence the introduced param beta there).

In addition the sum function now allows to not destroy the matrix memory itself but rather stores the data in the first column/row and setting the submatrix properties. This allows in quite some cases to avoid any memory allocations.

And again... multithreading is (still) hard (quoting Primož Gabrijelčič here ;)). The windows thread pool actually serialized at least one thread in the multithreading calculations. The reason is that the winpool threads are bound to a cpu to avoid thread resource reallocation over different cpu's. The thing is that there was no check if the threadpool that creates the tasks (and which is used in the calculus as well) does not check if the current thread runs on the same cpu than the released task. Since the current thread also executes math code at least one thread was serialized.

fixing mrMath AVX/FMA

The x64 ABI defines the data behind the stack pointer is voliatile and may be overwritten by OS/Debugger. I didn't encounter any problems but anyway I changed the stack behaviour such that rsp/esp is handled as intendent without the small optimizations. This results in a few more stack operations and data movements.

base class for distance calculations

The base class is meant to be used with an initial point cloud and allows to calculate "distances" the given distribution:
* Mahalonobis distance from a given covariance estimation
* L1 (absolute) distance. Initialized with the geometric Median
* Euclidian distance from the mean of the given point distribution.

Note that the new class can be used to be initialized with given data (covariance or mean) or the different distance measurements can be initialized by a given distribution of points. The initialization is either:
* the mean for the euclidian distance.
* The geometric median (by the Weiszfeld algorithm) for the L1 measure
* The mean + covariance matrix for the mahalonobis distance.
Also extended the matrix class with new resize and append methods as well a fix to the some stack issues within the 64 bit assembler routines.

progress on the math library

* Added FMA support for all multiplication algorithms.
* Added Robust B-Splines. Features overdetermined and robust spline parameter estimation.
* Added Expectation Maximization

major updates on the math library

* Added AVX support. All SSE optimized functions has been optimized for AVX(1) support.

updates on matrix class

* Added a faster (but not so robust) and simple linear regression
* Some updates on the Linux support - took back the 64 bit changes that caused problems.

added linux support

* Added 64bit Linux support for CodeTyphoon
* Some minor changes that spares a few cycles and registers

updated neural network learning methods

* new initialization types
* neural network now supports weighted learning by applying the weighting factor on the backpropagation step

added a dynamic time warp method and correlation

A fast dynamic time warp based on the well known divide an conquer algorithm has been added to the library. Check out this wikipedia article and the fast algorithm "FastDTW: Toward Accurate Dynamic Time Warping in Linear Time and Space".
In addition I introduced an intermediate class on the global subspace methods that allows to steere which matrix class is used in the calculations (threaded or single core).

new global subspace method: Partial Least Squares

Thanks to Gustav Kaiser the library now includes this new global subspace method.

a new method for least squares equation solving is there

Based on the QR decomposition there now exists a least squares optimal linear equation solver method. Check out the matrix class and the associated unit test. Of course there also exists a multithreaded version of this.

New cumulative sum and elementwise differentiation as well as repeat matrix functions have been introduced in the matrix interface. This makes it often easier to convert octave/matlab scripts directly to the code.

Added a neat Delphi example on how to use the regression method. Check out the download page ;)

SVD and others have been updated

By throwing more threads at the problem of matrix updates in the SVD another speed gain could be achieved. e.g. a SVD on a 1024x2048 matrix takes on my test computer approx. 3.7s multithreaded and 8.7s single threaded whereas on my reference Matlab 2009 system it takes 10.8s.

All linear algebra functions are now assembler optimized.

speed update on the SVD implementation

amazing... implementing the plane rotation routines of the SVD in assembler got a real boost to the whole implementation. You can expect about 40% faster execution (e.g. svd of 543x500 took at best before 1.6s now its 1.1s)

new singular value decomposition implementation

A new tile based and multithreaded implementation based on lapacks svd implementation is now available in this library. In addition the matrix vector routines have been revisited and quite a speedup was achieved by reducing the number of memory accesses. Of course there are also high speed ASM versions for the new matrix vector mult routines available.

speed updates to the generic matrix operations

Where applicable I changed the pointer arrithmetic to standard array access via PConstDoubleArr casts. This gives quite a speedup since the compiler can optimize these accesses quite well

faster cholesky decomposition

A rank k symmetric updated routine solely used in the cholesky decomposition has been updated such that the matrix access is more optimal for row major matrixes (like the ones used in this library) resulting in around 10% faster large cholesky decompositions.

new persistence reader write class for JSON formated data

The library is now able to export/import the data not only in a home brewn binary format but also in JSON.

neural network classifier

There is now a new simple feed forward neural network class avaialable.

matrix sorting

A new auxilary function for sorting matrices (column or row wise) has arrived - including a threaded version that splits the sorting into independent pieces. E.g. 2 cores will split a column wise sort call into an upper and a lower matrix.

new cholesky decomposition implementation

A new block based algorithm was introduced to perform a cholesky decomposition. In addition there exists a threaded version which utilizes a the multithreaded matrix multiplication in the cholesky decomposition. There has also been a few minor bugfixes in one of the scaling functions (wrong assertion).

random number generation

The latest update features a new class to generate pseudo random numbers and also random numbers created by the operating system. The pseudonumber generators implemented are the standard Delphi random number generator as well as the Mersenne Twister engine and utilization of the Intel RDRAND instruction (if avail).

new subspace method: independent component analysis

The implementation features the most of the functionality of the matlab fast ica package. (ICA is still experimental).

kmeans classifer

The mrAI package now includes a kmeans classifier featuring normal, median update steps as well as kmeans++ initial center searching.

radial basis functions

There is now a simple Radial Basis Function classifier for Delphi AI package available. In addition the matrix class has been extended by a median function (row and column wise) as well as a vector indexing property.

small bufix to mrMath

Fixed the incompatible exit codes so the project is again compilable under Delphi 2007. In addition the eigenvalue test case failed because of an uninitialized local variable - the code expected the matrix to be initialized with zeros.
As usual you can download the library from here as a zip file or from github!

projects moved to github

Google announced to shutdown their services for so I needed to move my projects to github. You can find the subversion/github repositories on:
* mrMath:
* mrImageUtils:
* mrAI:


Thanks to some mindful usrs a few bugs were reported lately which I resolved in the latest release. There was a wrong size initialized matrix when calling the normalize function as well as calling the symmetric eigenvalue procedure. Also I revised the unsymmetric eigenvalue procedure which had some bugs in it and added a few new test routines.

active appearance models

The image processing library has been extended by an Active Appearance Models framework based on the work of "Active Appearance Models" from Cootes et al. The framework extends the basic principles of this paper by allowing a different warping mechanism. A simple annotator project and a nice testing project has been included. Currently the project is only compilable under Delphi.
As usual you can get the library from the download section or fetch it from google code.

full QR decomposition

Until now the implementation of of the matrix always lacked the feature of a full QR (full economy size Q and R matrices) decomposition. With the current update this feature has been added. Of course the implementation is based on a blocked thus cache oblivious algorithm and there is a multi threaded version too.
As usual you can download that version from the download page or check it out from the google svn repository.

new global subspace method added: Non-Negative matrix factorization

The non-negative matrix factorization is implemented by 3 different iterative approaches:
* Divergence updates
* Euclidean updates
* Alternating Least Squares

Of course the new class supports persistences as well as all the others.

new cache oblivious QR decomposition

The mrMath library now implements Lapacks QR decomposition which creates an economy size QR matrix. Only subtle differences to the outcome on e.g. Matlab can be seen when it comes to rank deficient matrix decoposition. The QR decomposition also comes in a multithreaded flavour where the big matrix multiplication updates are multithreaded (approx. 20% speedup on my testing machine). The matrix class has been extended by new matrix multiplication methods implementing automatic matrix transposition on one of the input matrices. e.g. dest = mt1'*mt2 or dest = mt1*mt2'.
The updated sources can of course be downloaded on the download section or from google code.

A long planned new library has arrived!

The mrAI library is a collection of state of the art classification algorithms. To name a few: Support Vector Machines, Fisher Linear Discriminant Analysis for face recognition (also in a robust fashion), Parts of Viola Jones AdaBoost algorithm for face detection and others. The library is an extension to the mrMath and mrImageUtils library also found on this server. Beside the zipped version of this library the code can also be downloaded from google code:
Check out the unit testing project within the library to see how to use it.

A new neat utility function: Variance calculation

Since quite a few problems need the calculation of the variance of a vector the latest version adds column or row wise assembler optimized variance calculation.
Check out the download page or get the code from

FPC x64 compatibility

The library was made x64 Freepascal compatible. The complete stack handling had to be rewritten since FPC does not know the pseudoinstractions (.savemm, .pushnv) Delphi knows. Removed some unused units from the library.

FreePascal compatibility

A new Unit test project for FreePascal has been added and the 32 bit source has been made FreePascal ready. You will notice quite a few new ifdefs in the source and a few different syntactic approaches when it comes to references on procedures. Also a few minor bugs were discovered and fixed like a memory leak in the threaded version of the LU decomposition.

bugfix and new helper function added

Thanks to Mr. Kaiser a wrong assertion was found in the rowwise matrix sum function. This has been fixed in the latest version. As a goodie there exists an elementwise divide function which divides eache element of matrix 1 with matrix 2.

New Subspace Method Added

The latest version of the matrix library is know able to do the Canonical Correlation Analysis (CCA). In addition the latest package package contains a few new functions: calculating the elementwise absolute value in a matrix and adding a constant value to each element.


A problem was found within the svd function in the matrix class when the width was bigger than the height.


There was a problem with the Strassen matrix multiplication when multiplying a matrix with a small matrices (or a vector). A memory corruption was happening due to a memory allocation which was too small.

MacOS support

Special thanks to William Cantrall who coded the MacOS support for the mrMath library. This new version supports also native multithreading on both platforms via the Grand Central Dispatch on MacOs or simple Windows thread pools.

Incremental PCA

The library got a new additional class to calculate the PCA in an incremental fashion. This may result in faster calculation and less memory usage. Of course there are robust PCA implementations as well. The library also had a problem with many cores cpu's and the threaded implementations. This was fixed in the latest version.


Special thanks to "sutetemp" who discovered a flaw in the multiplication routine. The problem actually occurs if you multiply to matrices which result in a matrix with width=1. The bugfix is already applied to the lib and can be checked out via google's svn or directly on the downloads page.

A plea for more unit testing....

So it happend again - I found a small bug in the element wise multiplication functionality (fixed of course) but it wouldn't have happend if there would be more unit testings which would test every aspect of the library. I added some unit tests (I just don't have the time to add all of them) which do some checks for the found bug. Anyone out there adding some unit tests?

A new image processing library has arrived

Good news everyone - the mrMath library has been extended by a few nice image processing features. The new mrImageUtils library contains simple rgb to matrix and vice versa functions, a triangulation class which is able to create Delauny tringulations, a class for procrustes analysis and a methods for mapping point clouds - via a triangulations and linear interpolation or Thin Plate Splines - to new coordinate systems. Check out the new library on the download page!
The code is also available at google code as subversion download.

Small bugfixes

A new version was uploaded to the site an can be downloaded now. The new version fixes a memory leak which was found by Joshua W and a problem in the matrix determinant calculation where the result was only correct for even matrix widths. As usual the fix was also applied to the subversion version on google

Algorithmic change in LU decomposition and some bugfixes

The LU decomposition is now implemented using a recursive algorithm which is in nature cache oblivious and thus has a great speedup in contrast to the old numerical recipe algorithm. In addition there is also a threaded version of this algorithm which implements threaded version of the backsubstitution and matrix multiplication part. Of course there are a few bugfixes and some small other enhancements like a progress callback for linear algebra functions. Of course one can download the latest sources either from this servers download section or from google code.

Bugfix update and a new goodie: A simple non linear fitting class

For all out there who are using the matrix scaling operations (AddAndScale and such) - please update as soon as possible. There has been a nasty copy paste bug in these routines when it came to scaling and large matrices.
A new goodie was introduced within the latest update as well - there exists now a simple class for non linear fitting. Simple because it only estimates the derivatives from the data but anyway it's worth to try!

New PCA class and faster Matrix multiplication

I'm proud to announce an update to the Delphi Matrix library available on this page. The update includes a speedup in the common matrix multiplication using techniques used in the current BLAS 3  implementations. For all people who do Image processing I also uploaded a Principal Component Analysis (PCA) class, an algorithm widly used as in many algorithms either as preprocessing step or for compression. In contrast to common PCA  implementations my algorithm features a fast robust algorithm which includes a fast outlier detection. Have fun!

New Download page available

Hey Delphi Matrix fans - the Matrix library has a second home on google code:

New Delphi website

This is a new Delphi site which shows the work I've done over the last time. So I decided to share them and release them under an open source licence. In the download section you will find a matrix library including some advanced operations like singular value or LU decomposition, pseudo inversion and others as well as a large set of assembler hand optimized matrix primitive functions. The assembler version are available for x64 code as well by the way. Quite a few of these matrix primitives are also available as multi threaded versions.  All functions are encapsulated in an easy to handle matrix class (or interface).

The library is released under the apache licence meaning that it may also be integrated into commercial products.

Update: There was a bug in the 64 pointer arithmetic in some of the matrix class methods.