Extreme Asterisk Cloud Performance – Part I

*** This post was originally posted at http://www.greenfieldtech.net

Here’s a challenging question for the technically savvy Asterisk users among you… What is the top performance you can squeeze out of an Asterisk box running on Amazon EC2 – or, for that matter, on any cloud infrastructure? If you scout the Internet, you may find various answers; however, most of them aren’t backed up by real numbers or real information presented in a readable form.
Recently, we’ve become heavily involved in a project requiring massive usage of cloud-based infrastructure. I won’t go into details as to what the project is or what we are doing; however, I feel that some interesting facts about Asterisk 11.0.1 and cloud infrastructure can be shared with the rest of you.

Before we dig deep into the actual results, let’s talk about the various measurements usually associated with performance assessments of an Asterisk box – mainly, the machine’s load average. In order to continue, we must first understand what the Linux load average actually is. Most of you know the load average as something like the below:

Load Average Example

Most people know the load average as those three numbers, ranging from 0 upwards, and if the numbers reach a certain level – it’s bad. But the question is: what is a good number, and what makes a number bad? First, let’s understand what the number represents. The load average is an exponentially weighted average of your machine’s processes that are demanding the CPU: running processes, processes waiting for a CPU and, on Linux, also processes currently blocked waiting for I/O. These numbers are directly correlated to the number of processors/cores your server has. In general terms, on a machine with a single core, any number higher than 1 is considered bad – 1 represents 100% of the resources being consumed. So, if your machine has 4 cores, 4 is your top-most number – and from there it’s linear. Can we factor Hyper-Threading, multiple CPU pipelines or SSD access into the equation? In Linux, all of these come into play. In other words, we’ll never know the actual top limit, but working with a rule of thumb based upon the number of cores is good practice – specifically if your operational environment is a virtualized one.

Now, there are three numbers in there – a 1-minute average, a 5-minute average and a 15-minute average. Technically speaking, the 1-minute average isn’t really interesting, as it is highly affected by context switches and process bootstrapping; thus, there is a good chance that its number will be higher than the “advised” number. The numbers that are more interesting are the 5-minute and 15-minute averages. If your machine’s load average is considerably higher than the advised value on these, something is definitely wrong.
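To make the per-core rule of thumb concrete, here is a minimal Python sketch – my own illustration, not part of Asterisk or any monitoring package – that reads /proc/loadavg and reports each average as a fraction of the available cores:

```python
#!/usr/bin/env python3
"""Minimal sketch: compare the Linux load averages against the core count."""
import os

def check_load(warn_ratio=1.0):
    # /proc/loadavg holds the 1, 5 and 15 minute averages as its first three fields.
    with open("/proc/loadavg") as f:
        one, five, fifteen = (float(x) for x in f.read().split()[:3])
    cores = os.cpu_count() or 1
    # Rule of thumb from above: a sustained load above the number of cores is bad.
    for label, value in (("1 min", one), ("5 min", five), ("15 min", fifteen)):
        ratio = value / cores
        state = "OK" if ratio <= warn_ratio else "over budget"
        print(f"{label}: {value:.2f} ({ratio:.0%} of {cores} cores) -> {state}")

if __name__ == "__main__":
    check_load()
```

As noted above, it’s the 5- and 15-minute figures that are worth alerting on; the 1-minute figure spikes far too easily to be meaningful on its own.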

Updates, Astricon, Allo.com and more

Ok, before all of you jump down my throat for not posting for a long time – I’m sorry. I can’t believe the last time I posted was around June; much has changed since then. So, let’s start with some updates… well, there is no big news. Since I left Humbug, I’ve been doing my best to keep busy with GreenfieldTech projects. We’ve successfully completed 10 different custom development projects since June and we’ve started a brand new services branch at GreenfieldTech – the VoIP Security Audits branch – but that’s a different post altogether.

So, I spoke at this year’s Astricon, which took place in Atlanta, Georgia! This was my first time in the southern parts of the USA and, might I add, Southern Hospitality is something you need to experience (I’ll write about that later). Much has changed at Digium in the past two years. Many of the long-time project members have left for new grounds and Asterisk-SCF is now officially no longer in active development. In other words, this year’s Astricon was a breath of fresh air, specifically with new people in the driver’s seat and new ideas springing up.

So, let’s talk PRI cards. Every so often a company approaches me to evaluate their products. I do my best to be as impartial as I can – after all, Digium products are my favorite. However, I’m always happy to see a product that can compete nicely with Digium, simply because I believe it will make Digium products better and stronger. So, a couple of months ago, Allo.com approached me to evaluate their PRI card. I agreed, and they sent me the PCIe version of their quad-span E1/T1 card.

Allo.Com PRI Card

So, let’s start with what I didn’t like about the card. People, it’s 2012; technology has progressed a great deal since the old Zaptel days, not to mention the old Xilinx Spartan chips – come on, even the crappy Chinese boards don’t use those anymore – move on. Ok, let’s put aside the issue of the actual chips being used on the board – can someone PLEASE explain to me why I need to patch my DAHDI modules to support these cards? How shall I put it: patching DAHDI/Zaptel is so 2004. Make your card fully compatible with DAHDI, no patching, streamline the card with the stock DAHDI kernel modules – OpenVox did it, Yeastar did it (to an extent). If you want me to use your hardware, make it easy to install and simple to update and maintain.

OK, regardless of my somewhat reluctant feelings regarding driver compatibility, I had the unit installed in a test gateway. It performed as I would expect from a low-cost compatible card. It held up nicely under normal traffic, but when I tried pushing 30 call initiations per second at the card, it heated up slightly and CPU spikes could be observed here and there. In an office scenario, sure, this card will do nicely; in a service provider scenario, I’ll think twice. In the past I’ve seen similar performance from other clone cards, so my estimate is that there is a group of engineers passing from one company to another, coming in with the know-how for a single design, and wrapping that into a card.
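For the curious, here is a rough Python sketch of one way such a fixed call rate can be generated through the Asterisk Manager Interface. The host, credentials, channel group and dialplan context below are illustrative assumptions, not the actual test setup, and a purpose-built load generator is usually the better tool:

```python
#!/usr/bin/env python3
"""Rough sketch: pace call originations at a fixed rate via the Asterisk
Manager Interface (AMI). Credentials, channel and context are placeholders."""
import socket
import time

HOST, PORT = "127.0.0.1", 5038          # default AMI port
USERNAME, SECRET = "admin", "secret"    # assumed manager.conf credentials
RATE = 30                               # call initiations per second
DURATION = 10                           # seconds to keep generating calls

def send(sock, action):
    # An AMI action is a set of "Key: Value" lines terminated by an empty line.
    sock.sendall(("".join(f"{k}: {v}\r\n" for k, v in action.items()) + "\r\n").encode())

def main():
    with socket.create_connection((HOST, PORT)) as sock:
        send(sock, {"Action": "Login", "Username": USERNAME,
                    "Secret": SECRET, "Events": "off"})
        interval = 1.0 / RATE
        for _ in range(RATE * DURATION):
            send(sock, {
                "Action": "Originate",
                "Channel": "DAHDI/g0/18005551212",  # assumed: span group 0 on the card
                "Context": "load-test",             # assumed dialplan context
                "Exten": "s",
                "Priority": "1",
                "Async": "true",
            })
            time.sleep(interval)
        send(sock, {"Action": "Logoff"})

if __name__ == "__main__":
    main()
```

The sketch fires actions blindly and never reads the responses, which is fine for a quick experiment but not for anything serious; keep an eye on the load average from the previous section while it runs.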

A final word regarding the Allo.com card – not my favorite, but definitely a possibility for office environments. I have no idea how their other equipment holds up, but I hope it holds up just as well, so that the office/SMB market has a new option to choose from.

I will post my Astricon update later on.

Can you trust your integrator with Fraud Analysis?

As some of you know, over the past 9 months I’ve been heavily involved in the establishment of Humbug. For those who may not know, Humbug is a Call Analytics and Fraud Analysis SaaS. Unlike many of the current telephony SaaS projects, we are not based on Amazon EC2 or some other public cloud infrastructure; we build our own cloud environment. Why do we build our own cloud? Simple: we need to keep your data secure and confidential. At Humbug, we see ourselves as a cross between Google Analytics – in our ability to analyze and handle data – and Verisign – in our security and confidentiality requirements and methodologies.

The question could be asked: why do people trust Verisign to provide SSL certificates around the world? What makes Verisign’s CA better than a privately owned CA? The answer is simple: it’s a third party that two entities can trust at the same time. Humbug aims to provide the same level of trust, simply because we regard your data as sacred and valuable.

For about two months now, we’ve been contacting various Asterisk integrators around the world, inviting them to evaluate Humbug’s services. While some integrators and vendors were somewhat reluctant, others were more than happy to join. We now have over 250 monitored systems around the world, with systems being monitored and analyzed in Israel, the USA, the UK, Brazil and more.

The thing that amazed me about some of the integrators who decided not to participate was their claim: “we provide our customers our own brew of fraud analysis service, we don’t require your SaaS”. Now, while I can accept the fact that an integrator would offer such a SaaS as an in-house service, I can’t see why a customer would rely on these services. In my view, relying on your integrator to provide fraud analysis services is like relying on the integrator of your alarm system to provide hired guard services – it just doesn’t make any sense to me. Why doesn’t it make sense? In Hebrew we say: “Go prove that you have a sister”. Imagine that your PBX integrator offers you such a service, then, in some obscure manner, your PBX gets hijacked and you get slammed with 50K$ worth of phone calls to Somalia. Your integrator would say: “Hmmmmm… that’s odd, we didn’t even get those CDR events to our system… you really got hacked bad…” – sure, if you only rely on CDR records to do your analysis (which is what 99.9% of integrators do). There is much, much more to fraud analysis than just CDR analysis – if it all began and ended with CDR analysis, then Cvidya, Verint, NICE and many others would long since have been made redundant.

Allowing your integrator to provide you with a fraud analysis SaaS is like letting the fox guard the hen house: when things go wrong (and they may), he’s the first one to bail out saying: “It’s not my fault”.

Humbug takes a totally different approach to fraud analysis, specifically in the way we regard the various PBX systems and integrators. We are vendor agnostic and integrator agnostic – we will provide you with the clear and concise information you require in order to make an educated decision as to how you were defrauded (if defrauded at all), and provide you with faster alerting and response times. Our recent adventures have lowered our fraud alert response time from 60 minutes down to 14 minutes in some cases. Most fraud analysis systems carry a 24–36 hour turnaround time; by that time, you can be out of 50K$. Our aim is to lower that number to no more than 100$ in the worst case. Ambitious? Yes. Downright crazy? Probably so. But we always say: “Aim for the moon, you’ll land on a star!” – so we know we’ll get there.

Asterisk, Greed and Revenue Shares

Revenue sharing is one of the oldest methods of earning profits; actually, I believe it may be right up there with the trading of goods and food. For those of you not in the know, I’ll explain what revenue sharing is:

  1. A content provider wishes to distribute a certain type of content – charging for it.
  2. The content provider has no ability to charge the consumers directly, thus he partners with another party – the transport maintainer.
  3. The transport maintainer charges the consumer, while keeping a certain percentage in his pocket.
  4. Everybody is happy.

In general, this model works really well in many markets – specifically those that are driven by unique content, for example the mobile content market (ringtones, screen savers, games, apps) – the Apple App Store is a wonderful example of how this works.

In the telecom industry, the revenue share business is very common; however, in many cases it is closely guarded as a secret – the main reason being that no one wants anybody else to know how they do it. This hiding of information usually results in some problems: when information is hidden, only those in the know are able to access it. Those in the know are called “mediators”, or in Hebrew, “Machers”. In this entire ordeal, the mediator also takes a small percentage – leaving the content provider with slightly less. So, now it looks like this:

  1. A content provider wishes to distribute a certain type of content – charging for it.
  2. The content provider has no ability to charge the consumers directly, thus he contacts a mediator to find him a transport partner.
  3. The mediator engages the prospective transport maintainer.
  4. The transport maintainer charges the consumer, while keeping a certain percentage in his pocket and passing some funds to the mediator as well.
  5. Everybody is happy.

So, if everybody’s so happy – why am I bitching about it? Very simple: people are greedy and always want more, throwing the entire model into a frenzy. To give an example, let’s imagine the following scenario (a small sketch checking the per-minute arithmetic follows the list):

  1. Company A provides IVR-based content using an Asterisk server connected to the Internet.
  2. The mediator engages a premium-number company, and the service brings in a total revenue of 0.08$ for every inbound minute of traffic.
  3. The premium-number company keeps 0.01$ in its pocket and also pays the mediator a fee of 0.01$ per minute.
  4. The content provider gets 0.06$ of the 0.08$ – 75% of the revenue goes to the content provider.
  5. The content provider says: “Hell, I want the mediator’s 0.01$ as well, and I think the premium company should only get 0.005$, so I would get 0.075$ in the end.”
  6. The content provider contacts the premium provider and starts complaining.
  7. The premium provider negotiates and strikes a deal for 0.07$ to the content provider, leaving the premium provider with 0.005$ and the mediator with 0.005$.
  8. The premium provider says: “I’m not making enough money on this – actually, I’m losing money – I’ll find a better alternative service for that access number.”
  9. The premium provider asks the mediator to bring in a new customer providing similar content – the mediator has a clear incentive here.
  10. The premium provider gets a new customer and transfers the access number to the new customer – returning to its previous profits.
  11. The original content provider is left with no profits and only greed in his hands.
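Just to make the arithmetic above explicit, here is a tiny Python sketch reproducing the per-minute splits; the figures come straight from the scenario, and the helper function is purely illustrative:

```python
# Per-minute revenue splits from the scenario above; figures are illustrative only.
TARIFF = 0.08  # $ billed for every inbound minute

def split(premium_cut, mediator_cut):
    """Return the (content provider, premium company, mediator) shares per minute."""
    provider = round(TARIFF - premium_cut - mediator_cut, 3)
    return provider, premium_cut, mediator_cut

print(split(0.01, 0.01))    # (0.06, 0.01, 0.01)   -> provider keeps 75% of the tariff
print(split(0.005, 0.005))  # (0.07, 0.005, 0.005) -> barely worth the premium company's while
```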

Over the past 10 years, I’ve seen this vicious cycle happen over and over again, in various formats and scenarios – but it always ends with the same outcome: the content provider suffers. If you’re a content provider and you provide IVR-based services, let the people who provide you the access, and the people in the middle, make their cut – without them, you will have a service with no access, which means no service at all. Don’t go about thinking you can keep all the profits to yourself; you will break the equilibrium of this business, and eventually, no one will want to do business with you.
