post

Statistics, Analytics – stop whacking off

Managers! Project Managers, Sales Managers, Marketing Managers, Performance Managers – they are all obsessed! Why are they obsessed? Because we made them that way – it’s our fault! Since the invention of the electronic spreadsheet, managers have relied on the same tools for making decisions: charts, tables and graphs. Between us, I hate Excel or, for that matter, any other “spreadsheet” product. Managers rely on charts to translate the ever more complex world we live in into calculable, simple to understand, dry and boring numbers.
Just to give a rough idea, I have a friend who’s the CEO of a high-tech company in the “social media” sector. He knows how to calculate how every dollar he spends on Google AdWords translates back into sales. He is able to tell me exactly how much a new customer costs him, and he can quote these numbers just like that. When we last met he asked me: “Say, how do you determine performance on your network? Is there a proper, agreed-upon metric you use?” It got me thinking: I’ve been using ASR and ACD for years, but have we been using them wrong?

So, the question is: what is the proper way of calculating your ASR and ACD? And is MoS a truly reputable measure for assessing your service quality?

Calculating ACD and Why MoS is so Biased

ACD stands for Average Call Duration (in most cases), meaning the average duration of answered calls. Normally, ACD is used as an indicator of whether the quality of your termination is good – in a very empirical manner only, of course. If you ask anyone in the industry, he will say the following: “If the ACD is over 3.5 minutes, your general quality is good. If your ACD is under 1 minute, your quality is degraded or just shitty. Anything in between is a little hard to say.” So, in that respect, MoS comes to the rescue. MoS stands for Mean Opinion Score – in general terms, it means: judging from one side of the call, how does that side perceive the general quality of the line? MoS is presented as a floating-point number ranging from 1 to 5, where 1 is the absolute worst quality you can get (to be honest, I’ve never seen anything worse than 3.2) and 5 represents the best quality you can get (again, I’ve never seen anything go above 4.6).

So, this means that if our ACD is anywhere between 1 minute and 3.5 minutes, we should consult our MoS to see if the quality is OK or not. But here is a tricky question: where do you monitor the quality? At the client or at the server? On the connection coming into the network, or on the connection going out of it? In other words: too many factors, too many places to check, too much statistical data to analyse. Many graphs, many charts – no real information provided.

If your statistical information isn’t capable of providing you with concise information, like: “The ACD to Canada has dropped 15 points in the past 15 minutes and is currently at 1.8 minutes per call – get this sorted!”, then all the graphs you may have are pointless.
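To make that kind of alert concrete, here is a minimal sketch in Python. It assumes call records carry a destination, an end time and a duration in seconds (the `Call` class, its field names, the 15-minute window and the 15% threshold are all illustrative assumptions, not anything taken from a real billing system). It compares the ACD of the current window with the previous one and produces a one-line message only when the drop is large enough.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Call:
    destination: str   # e.g. "Canada" (hypothetical field)
    end_time: float    # seconds since some epoch (hypothetical field)
    duration: float    # seconds; 0 means the call was never answered


def acd_minutes(calls, destination, start, end):
    """ACD in minutes for answered calls to `destination` ending in [start, end)."""
    durations = [
        c.duration
        for c in calls
        if c.destination == destination and start <= c.end_time < end and c.duration > 0
    ]
    return mean(durations) / 60 if durations else None


def acd_alert(calls, destination, now, window=900, drop_threshold=0.15):
    """Return a one-line alert if ACD fell by at least `drop_threshold`, else None."""
    current = acd_minutes(calls, destination, now - window, now)
    previous = acd_minutes(calls, destination, now - 2 * window, now - window)
    if current is None or previous is None:
        return None  # not enough answered calls to compare
    drop = (previous - current) / previous
    if drop >= drop_threshold:
        return (
            f"The ACD to {destination} dropped {drop:.0%} in the past "
            f"{window // 60} minutes and is currently at {current:.1f} minutes per call"
        )
    return None
```

Unanswered calls are excluded on purpose, since ACD is defined over answered calls only. A real system would also want a minimum sample size per window before trusting the number.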

Calculating ASR and the Release Cause Forest

While ISDN (Q.931) made the question of understanding your release cause fairly simple, VoIP turned that once fairly clear world into a mess. Why is that? With Q.931, release causes were very much preset for you at the network layer, whereas SIP makes it easy for an admin to set up his own release causes. For example, I have a friend who says: “I translate all 500 errors from my providers to a 486 error to my customers.” Why would he do that? Why in God’s name would somebody deliberately show his customers a falsified view of their termination quality? Simple: SLAs and commitments. If my commitment to a customer were a 90% success service level, I would make sure that the release causes I send him don’t include that many 5XX errors. A SIP 486 isn’t an error or an issue, the subscriber is simply busy – what more can you ask for?

As I see it, ASR (Answer-Seizure Ratio) should be broken down into 3 distinct numbers: SUCCESS, FAILURE and NOS (None Other Specified). NOS is very similar to the old Q.931 release of “Normal, unspecified” – Release Cause 31. So what goes where exactly?

SUCCESS has only one value in it – ANSWER, or Q.931 Release Cause 16 – Normal Call Clearing.

FAILURE will include anything in the range of 5XX errors: “Server failure”, “Congestion”, etc.

NOS will include the following: “No Answer”, “Busy (486)”, “Cancel (487)”, “Number not found (404)”, etc

Each one of these should get a proper percentage. You will be amazed at your results. We’ve implemented this methodology for several of our customers who were complaining that all their routes were performing badly. We were amazed to find out that their routes had 40% success, 15% failure and 45% NOS. Are we done? Not even close.
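The three-way split can be sketched in a few lines of Python. The sketch assumes each call attempt is represented by its final SIP response code, with 200 standing for an answered call; that representation, and the function names, are assumptions made for illustration.

```python
from collections import Counter


def classify(sip_code):
    """Map a final SIP response code to SUCCESS, FAILURE or NOS."""
    if sip_code == 200:            # answered call
        return "SUCCESS"
    if 500 <= sip_code <= 599:     # server failure, congestion, etc.
        return "FAILURE"
    return "NOS"                   # busy, cancel, not found, no answer, ...


def asr_breakdown(sip_codes):
    """Return the percentage of attempts falling into each of the three buckets."""
    counts = Counter(classify(code) for code in sip_codes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        bucket: round(100 * counts[bucket] / total, 1)
        for bucket in ("SUCCESS", "FAILURE", "NOS")
    }
```

Feeding it 8 answered calls, 3 calls ending in 503 and 9 calls ending in 486, 487 or 404 gives 40% SUCCESS, 15% FAILURE and 45% NOS, the same shape as the customer routes described above.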

The NOS Drill Down

Now, NOS should be drilled down – but that analysis should not be part of the general ASR calculation. We should now re-calculate our NOS according to the following grouping:

“BUSY GROUP” – Will include the number of busy release codes examined

“CANCEL GROUP” – Will include the number of cancelled calls examined

“NOT FOUND” – Will include any situation where the number wasn’t found (short number, ported, wrong dialing code, etc)

“ALL OTHERS” – Anything that doesn’t fall into the above categories
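As a rough sketch of this grouping in Python: the mapping below from SIP codes to groups is an assumption made for illustration (486 and 600 as busy, 487 as cancel, 404, 484 and 604 as not found), and each network will need its own table. The percentages are calculated against the NOS calls only, not against all attempts.

```python
from collections import Counter

# Hypothetical mapping of SIP response codes to NOS groups; adjust per network.
NOS_GROUPS = {
    486: "BUSY GROUP",     # Busy Here
    600: "BUSY GROUP",     # Busy Everywhere
    487: "CANCEL GROUP",   # Request Terminated (call cancelled by the caller)
    404: "NOT FOUND",      # Not Found
    484: "NOT FOUND",      # Address Incomplete (short number)
    604: "NOT FOUND",      # Does Not Exist Anywhere
}


def nos_drill_down(sip_codes):
    """Percentage of each NOS group, relative to NOS calls only."""
    nos = [c for c in sip_codes if c != 200 and not 500 <= c <= 599]
    if not nos:
        return {}
    counts = Counter(NOS_GROUPS.get(code, "ALL OTHERS") for code in nos)
    return {group: round(100 * n / len(nos), 1) for group, n in counts.items()}
```

Tracking these percentages over time, per route or per customer, is what makes a disproportionate group stand out.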

This drill down can rapidly show any of the below scenarios:

  • BUSY GROUP is not proportional – Normally indicates a large number of calls to similar destinations on your network, which may point to one of the following issues:
    • It’s holiday season and many people are on the phone – common
    • You have a large number of call center customers, targeting the same locations – common
    • One of your signalling gateways is being attacked – rare
    • One or more of your termination providers is returning the wrong release code – common
  • CANCEL GROUP is not proportional – Normally indicates that a large number of calls are being cancelled at the source, either a routed source or a direct source, which may point to one of the following issues:
    • You have severe latency issues in your network and your PDD (Post Dial Delay) has increased – rare
    • Your network is under attack, causing a higher PDD – common
    • You have a customer originating the annoying “Missed Call” dialing methodology – common
    • One of your termination providers has False Answer Supervision due to usage of SIM gateways – common when dialing Africa
  • NOT FOUND GROUP is not proportional – Normally indicates that a large number of calls are being rejected by your carriers, which may point to one of the following issues:
    • One of your call center customers is using a shitty data list to generate calls – common
    • One of your call center customers is trying to phish numbers – common
    • One of your signalling gateways is under attack and you are currently being scanned – common
    • One of your upstream carriers is returning the wrong release code for error 503 – common

So, now the ball is in the hands of the tech teams to investigate the issue and understand the source. The most dangerous issues are the ones where your upstream carrier will change release causes, as these are the most problematic to analyse. If you do find a carrier that does this – just drop them completely, don’t complain, just pay them their dues and walk away. Don’t expect to get your money’s worth out of them, the chances are very slim for that.

 

 

post

Don’t Replicate – Federate

For many years, the question of high availability has circled the same old subject of replication: how do we replicate data across nodes? How do we replicate the configuration so it stays unified across nodes? Is active-active truly better than active-passive? And most importantly, what happens beyond the two-node scenario?

Since the inception of the Linux-HA project (and I do believe it’s been around for over 15 years now), it has been the pivotal tool for creating Linux-based high-availability clusters. Heartbeat, STONITH and Mon will take care of floating the IP addresses and services across – no biggy there. Making sure the data is consistent across the board is something completely different. Recently, one of the better known Asterisk commercial offerings launched an Asterisk-HA solution. It’s long overdue; it’s just a shame it’s a commercial offering without an Open Source derivative – after all, it is Open Source based (I hope).

However, being a high-availability solution doesn’t mean you are truly a clustered solution. It is an active-passive solution with a major caveat (at least as I see it): if your data sync fails for some reason, you end up with a split-brain issue, and your entire solution is rendered moot. Don’t get me wrong here, I think that for now the solution is the best thing since sliced bread, simply because there is no other solution out there. However, the fact that this is the only solution doesn’t make it the right solution.

What does federating mean in this respect? It means that data doesn’t need to be replicated across the board; it is automatically trickled across the network, making sure all nodes in the network have clear visibility of it. If a node fails inside the cluster, clients automatically redirect themselves to a new node, with no need for floating IP addresses. Call routing is automatically determined upon request and is never preset for the entire platform. And most importantly, the amount of data transferred between the nodes is as small as possible, preventing excessive usage of network resources and I/O.

What would it mean to federate the configuration of a PBX system? First of all, make sure each unit is capable of working on its own. Information should be trickled across the nodes via two methodologies: a multicast/broadcast mechanism (for local, LAN-connected nodes) and a publish/subscribe relationship (for externally connected nodes). When a change is made to any of the systems, only that change is propagated to all the other systems. The configuration is never fully transmitted between nodes (apart from when a new node joins the cluster). Routing decisions are made dynamically across the network; they are not predetermined or preconfigured. There is no need to keep the cluster nodes in perfect physical alignment; mixing hardware specifications should be considered the norm. External devices should be able to “speak” to the cluster without being aware of its existence.
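The "only the change travels" idea can be illustrated with a small in-memory sketch in Python. Everything here is hypothetical: `Bus` stands in for whatever multicast or publish/subscribe transport is actually used, `Node` for a PBX unit, and the per-key version counter is just one simple way to ignore stale updates. It does not try to resolve two nodes changing the same key at the same moment.

```python
class Bus:
    """Stand-in for the multicast or publish/subscribe transport (hypothetical)."""

    def __init__(self):
        self.nodes = []

    def join(self, node):
        # A joining node is the only case where the full configuration is sent.
        if self.nodes:
            node.config = dict(self.nodes[0].config)
        self.nodes.append(node)
        node.bus = self

    def publish(self, origin, delta):
        # Deliver a single change to every node except the one that made it.
        for node in self.nodes:
            if node is not origin:
                node.apply(delta)


class Node:
    """A unit that holds its own copy of the configuration and works on its own."""

    def __init__(self, name):
        self.name = name
        self.config = {}   # key -> (version, value)
        self.bus = None

    def set(self, key, value):
        version = self.config.get(key, (0, None))[0] + 1
        delta = (key, version, value)
        self.apply(delta)
        self.bus.publish(self, delta)   # only the delta goes on the wire

    def apply(self, delta):
        key, version, value = delta
        if version > self.config.get(key, (0, None))[0]:
            self.config[key] = (version, value)

    def get(self, key):
        return self.config[key][1]
```

A change made on one node shows up on the others without any node ever pushing its whole configuration, and a late joiner gets a one-time snapshot.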

Once we achieve all of the above, we’ll truly get to a point where we’ve clustered Asterisk (or another open source project) the right way.

post

Federating Asterisk – truth or myth?

During this year’s Asterisk Developers’ Conference, one of the subjects I raised was: “Federating Multiple Asterisk Instances”. Now, for the seasoned Asterisk user/developer, the answer would be simple: use Kamailio/OpenSIPS for that scalability, and use Asterisk as a media gateway or application server.

But I ask the following: “What if we could federate Asterisk without the need for an external component? What if we could federate Asterisk in such a way that our users aren’t even aware of the federation process, and it’s fully autonomous? What would actually be required in order to do that?”

I’m normally confronted with these questions on a day to day basis, looking at the problem from different angles – thinking to myself: “Ok, I know the normal box here – but where are the outer limits? what can I do to make it more robust on one hand, without truly making a mess out of it.”

A federated database is defined as: “A federated database system is a type of meta-database management system (DBMS), which transparently maps multiple autonomous database systems into a single federated database. The constituent databases are interconnected via a computer network and may be geographically decentralized. Since the constituent database systems remain autonomous, a federated database system is a contrastable alternative to the (sometimes daunting) task of merging several disparate databases. A federated database, or virtual database, is a composite of all constituent databases in a federated database system. There is no actual data integration in the constituent disparate databases as a result of data federation.” – http://en.wikipedia.org/wiki/Federated_database_system

So, we would like to virtually create a “map-reduce” functionality for Asterisk? can we truly create a map-reduce’ish functionality for Asterisk? should it be internal? should it be external?

In order to accomplish this, we are required to create a federator – a device capable of handling the information regarding each user, device, trunk, provider or other SIP/IAX2 entity connected to our system. The federator, for all practical purposes, is a data store, be it a key-value store, a database, a shared memory environment or some other form of data distribution layer.
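As a thought experiment, the federator can be pictured as a key-value map from an entity to the node currently serving it. The Python sketch below is only that: the `Federator` class, its method names and the time-to-live expiry are assumptions made for illustration, and the plain dictionary stands in for whichever distributed store would really hold the data.

```python
import time


class Federator:
    """Toy key-value federator: entity id -> node currently serving it."""

    def __init__(self, ttl=60.0, clock=time.monotonic):
        self.ttl = ttl          # seconds an announcement stays valid
        self.clock = clock      # injectable clock, handy for testing
        self._entries = {}      # entity -> (node, expires_at)

    def announce(self, entity, node):
        """Record (or refresh) which node an entity is attached to."""
        self._entries[entity] = (node, self.clock() + self.ttl)

    def locate(self, entity):
        """Return the serving node, or None if unknown or expired."""
        entry = self._entries.get(entity)
        if entry is None:
            return None
        node, expires_at = entry
        if self.clock() >= expires_at:
            del self._entries[entity]
            return None
        return node

    def drop_node(self, node):
        """Forget every entity attached to a failed node."""
        self._entries = {e: v for e, v in self._entries.items() if v[0] != node}
```

Routing then becomes a lookup at call time (`locate("sip:alice")`) rather than something preconfigured on every box.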

Here are some key issues that true federation may be required to tackle:

  1. Geo-Position Agnostic – A truly federated system should render services identically across the board, regardless of where the user is located.
  2. Services Agnostic – A truly federated system doesn’t care if the user is connected to an Asterisk server version 12 or 13, it should behave identically.
  3. Version Agnostic – A truly federated infrastructure can leverage older versions and even other software, without changing the underlying federation layer.
  4. Predictable Scalability – A truly federated infrastructure will allow for growth to be planned linearly, with discrete measure methods.

So, you want a tip on how to start federating your systems? here’s step number 1 – there is no central registry, there is no SIP proxy, there is only the cloud and the services it renders. Start thinking from this point and see where you go.

post

Asterisk ARI – What AGI/AMI should have been

Asterisk ARI – for a seasoned AGI/AMI developer like myself, ARI is a serious mind warp. Why is it a mind warp? Simple: it’s all the things we wanted AGI to be, and the reliability we wanted AMI to have, minus all the workarounds we needed in the past to get similar functionality.

So, is ARI truly a replacement for AGI/AMI? Well… I think the true answer is NO. Is it a replacement for the Asterisk dialplan? Well… I think the answer to that is NO as well. “Say, are you messed in the head? First you say ‘What AGI/AMI should have been’, and then you say it’s not a replacement? Are you mental?” Well, there are a few reasons why I claim it’s not a direct replacement, and I’ll detail them here.

In order to explain, I’ll give a few examples, using the “in-development” PHP ARI wireframe that I’m developing, called Stanley.

Synchronous vs. Asynchronous

ARI by definition is asynchronous. Keeping that in mind, it means that any command you give it will get queued or spooled in some manner, and will return an immediate result. Just to illustrate, let’s examine the following code segment:

$this->stasisLogger->notice("Stasis Start");
$lastResult = $this->channels->channel_playback($this->ari_endpoint, $messageData->channel->id, "sound:hello-world");
$this->stasisLogger->notice("Last result: " . $lastResult);
$lastResult = $this->channels->channel_playback($this->ari_endpoint, $messageData->channel->id, "sound:demo-congrats");
$this->stasisLogger->notice("Last result: " . $lastResult);

For all practical purposes, you should regard $this->stasisLogger as a simple logging object, and $this->channels as a model to initiate ARI Channel requests. If you use the above code and activate it from within a Stasis application, you would hear the “hello-world” and “demo-congrats” segments. Now, let us examine the following code segment:

$this->stasisLogger->notice("Stasis Start");
$lastResult = $this->channels->channel_playback($this->ari_endpoint, $messageData->channel->id, "sound:hello-world");
$this->stasisLogger->notice("Last result: " . $lastResult);
$lastResult = $this->channels->channel_playback($this->ari_endpoint, $messageData->channel->id, "sound:demo-congrats");
$this->stasisLogger->notice("Last result: " . $lastResult);
$this->channels->channel_delete($this->ari_endpoint, $messageData->channel->id);

The only difference here is the last line. If you activate this code, you will hear the word “Hello”, immediately followed by a disconnect. “Wait a minute, what just happened? Wasn’t I supposed to hear everything?” That’s exactly the point, the answer is NO! The asynchronous nature of ARI will simply queue the first 2 playback requests, while the hangup is performed almost immediately – the playbacks simply never get to finish.

In other words, if you need something to be synchronous within the dialplan, you may need to go about it differently. If you are familiar with Node.js, you are fairly familiar with this issue.
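The same pitfall can be reproduced without Asterisk at all. The Python sketch below is a simulation only: `FakeChannel` is a made-up class, not the ARI API and not part of Stanley. Its `play()` queues the sound and returns at once, just like the playback requests above, and hands back an event that fires when the sound has finished. In real ARI the comparable signal is the PlaybackFinished event delivered to the Stasis application.

```python
import asyncio


class FakeChannel:
    """Simulated channel: mimics queue-then-return, NOT the real ARI API."""

    def __init__(self):
        self.heard = []                      # sounds that finished playing
        self._queue = asyncio.Queue()
        self._worker = asyncio.get_running_loop().create_task(self._run())

    async def _run(self):
        while True:
            sound, done = await self._queue.get()
            await asyncio.sleep(0.01)        # pretend the sound takes time to play
            self.heard.append(sound)
            done.set()

    def play(self, sound):
        """Queue a playback and return immediately with a 'finished' event."""
        done = asyncio.Event()
        self._queue.put_nowait((sound, done))
        return done

    def hangup(self):
        self._worker.cancel()                # nothing queued will play after this


async def fire_and_forget():
    ch = FakeChannel()
    ch.play("hello-world")
    ch.play("demo-congrats")
    ch.hangup()                              # runs before any playback completes
    await asyncio.sleep(0.05)
    return ch.heard


async def wait_for_each():
    ch = FakeChannel()
    await ch.play("hello-world").wait()      # wait for the finished signal
    await ch.play("demo-congrats").wait()
    ch.hangup()
    return ch.heard
```

Running `asyncio.run(fire_and_forget())` returns an empty list, while `asyncio.run(wait_for_each())` returns both sounds in order. The point is that the hangup has to be driven by the "playback finished" signal, not by the order of the lines in the source file.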

ARI is for writing applications, not IVRs

When the Asterisk team created ARI, their idea was simple: “Don’t manage the queue application, simply write your own”. The same applies to managing multi-party conference calls, call origination, etc. In 2009 I wrote a book about AGI programming, where I explained the methodology for “Atomic AGI development”. The concept behind Atomic AGI was to contain small logic units in AGI scripts, and leave most of the heavy lifting to the dialplan. This methodology makes it possible to create scalable Asterisk platforms with relative ease, and to introduce additional technologies without adding odd things into Asterisk.

ARI is meant to do something similar, in that you can create your own logic, contain it in a single application and activate it when you require – for example, rewriting the queue application. One of the first applications that I decided to rewrite using ARI was a radio broadcasting system that I developed in 2006. The problem with that application was that I needed to hold about 600 callers in a single queue, and attach them to the broadcasting booth as required. Of course, I needed to enable full call control, caller management, UI and more. Initially, I used MeetMe, MySQL and AMI to do this. Later on it changed to MeetMe, Redis, AstManProxy and some other tools – but it never seemed to please me. The fact that I needed to maintain 2 MeetMe bridges, one for holding people and one for the actual broadcasting, really bugged me. Yes, when Asterisk 1.8 came out I migrated to the Bridge application and yes, I updated bits and pieces here and there, but it was never what I wanted it to be.

When I started playing around with ARI, I said to myself: this is the perfect application to migrate to ARI. The only thing I needed was a simple Stasis application to read my state correctly, one that would be activated once the caller is put into the waiting area – so, in effect, I’ve developed a very simple queue application.

IVR heavy lifting was done using dialplan, but the actual service was done with ARI.

Blades and Bleeding Edge

Now, before you go about migrating all your existing code to ARI, you must remember this: if you walk on the bleeding edge, expect the blade to cut you here and there. So far, I haven’t seen any proper ARI wireframe available. I’ve seen some work done with Node.js and Ruby, but I can’t say that I’ve taken a fancy to any of those. Honestly, my comfort zone is very much PHP and C/C++. What can I say, I’m old school.

When I started building the Stanley wireframe, it was fairly frustrating, simply because not everything was all that clear and clean. In addition, as Asterisk advances, ARI will change and advance as well. Whatever you write, make sure it’s modular enough so you can change it as required.

 

post

Mobile VoIP OTT is Dead! – Long Live Mobile VoIP OTT!

What do the following have in common: Skype, Viber, Whatsapp, Line2, Tango and Kakao? Yes, they are all OTT apps for your mobile phone that enable you to communicate with your peers. Skype, Viber, Line2, Tango and Kakao actually enable you to call one another. Each one dominates a section of the world: Kakao and Line2 are dominant in the Far East, Viber dominates Japan and Eastern Europe, and Skype kinda says: “Look at me bit**es, I’m all of you combined”.

What do the following have in common: VoipDiscount, Nymgo, WiCall, VoIPstunt, Vox Mobile, Cloud Roam, Skuku? All of these are mobile VoIP OTT apps, similar to the above, and yet no one has truly heard of them or is using them. Each one is more or less a replica of the previous one, maybe with one or more added features – but in general they are all the same pitch and bit**: make cheap calls over VoIP via our service.

So, what does it all mean? It means one simple thing: no one has truly cracked the formula for making money in the Mobile VoIP OTT business – everybody is still looking for the killer business model/VoIP OTT application. What is the right way? Providing low cost calls? Providing business oriented services? Providing simple roaming solutions? Maybe bundling roaming data plans and SIM cards? Or maybe all of these are sooooooo passé that the world just says: “Stop fu**ing about and create something truly new, change how we think and how we work completely. For 1 or 2 dollars less per month, I’m not gonna change my service – it’s pointless.”

So, what are the true killer apps that will truly say: “this is a game changer, from this point onward, VoIP OTT will no longer be the same!” – Here is a list that I believe will make the difference:

1. Make calls completely social – Phone numbers are so 18th century, they are pointless

2. Make your phone aware – Presence and availability is key

3. Drop the stupid things – call recording, visual voicemail, funny sounds, funky tones – stop the bullshit, give me proper services rather than stupid features

4. Make your service reliable – stop behaving like a website operator and think like eBay, every minute your service is down or affected by bad service you are losing money

5. Make it work, then make it pretty – application design is important, product design is important, but not more than the product itself

6. Invest in support and monitoring – relying on your suppliers to do it for you is stupid and childish

7. Only blame yourself! – when something fu**s up, it means that you did your job wrong and you cut corners. Don’t start blaming your colleagues or your contractors, they are only doing what you asked them to do

And most importantly, remember the following statement: “I’ve seen the furthest, because I sat on the shoulders of giants.” – don’t tell the world how you’re going to obliterate Whatsapp and Skype, look at them, strive to be them, and then do it better.

I wish all of you good luck.