1/30/13

My Ideal Greenfield Development Platform: Now vs. 5 Years ago


As I've grown as a developer (sometimes two steps forward and one step back - ha ha!) I've had the privilege to work with some very smart people - not just "IQ" intelligent but "savvy" people who saw where things were going technology-wise. So I've learned a lot in the last five years about my technology preferences - sometimes by choice, sometimes by necessity - in some cases just playing catch-up (e.g. JavaEE vs. Spring) and on the bleeding edge in others (e.g. Memcached, JBehave).


As a result of that experience, every so often I dream of starting a project from scratch and imagine what technologies I would choose for my Java stack based on what I know now. Admittedly what's below is a very Java-centric stack, and I need to look into Ruby on Rails and Python/Django to broaden my skill set too.


Anyway I hope you find the following table an interesting comparison of technology choices now vs. 5 years ago. In many cases the "Then" technology is still around and still viable. Typically what makes the "Now" technology more attractive is a fantastic combination of lower price and better / faster (and free) support - after that features and speed of execution are winning attributes.




| Component | Then (5 Years Ago) | Now (2013) | Why? |
| --- | --- | --- | --- |
| Middle-Tier Framework | JavaEE (JBoss) | Spring | In 2005-2006 I went from WebSphere to preferring JBoss because IBM could not move fast and the app server was slow. Then I found JBoss also had some issues - fewer of them - but enough. Just figuring out which JMS messaging solution they would implement in a release was a chore (JBossMQ? JBoss Messaging? HornetQ?). Spring is so much faster in terms of performance, has fewer issues, has faster release schedules and has no "two tier" system of GA vs. supported releases. |
| Deployment Environment | Your Environment / Your Data Center | The Cloud (AWS) | You need to scale on demand these days and pay only for what you need - and that includes Ops folks, DBAs etc. And why AWS as opposed to someone else? Simple - the maturity of their offering. |
| Build Tool | Ant | Maven | Why? "Convention over configuration", although I have found Maven and its plugins a bit more buggy than those for Ant. So it's not clear cut. |
| Build Environment | Local Builds | Jenkins | What, do you live under a rock? If you aren't doing continuous builds with all your tests automated and hooked in (unit, integration, acceptance, performance) you're crazy! :-) |
| Relational Database | Oracle | MySQL | There's no reason to pay for a database anymore - Facebook runs on MySQL for crying out loud - InnoDB too! |
| Key-Value Store / NoSQL | None | MySQL or DynamoDB (or Couchbase) | Sometimes you need a relational DB and sometimes you just need a KV store. Well, since I'm all about AWS you gotta start with DynamoDB. But with AWS you gotta implement your inter-region replication yourself - however an up-and-comer that impressed me, and one to look at, is Couchbase. |
| Caching Tier | Roll your own / Terracotta | Memcached | Fast, cheap and great support on the web. |
| Web Container | Tomcat | Tomcat or NGINX | Tomcat still rocks and I haven't built an app yet where Tomcat was the bottleneck, but NGINX is getting some play and is worth a look. |
| Unit Testing | JUnit | JUnit | JUnit always rocks! Always will! |
| Functional Testing | WinRunner/LoadRunner etc. | JBehave | Again in the spirit of "fast, cheap" and with hooks into JUnit, JBehave is emerging as a great BDD tool. Hopefully it will keep emerging and develop some snazzy reports. |
| Bug Tracking | Roll Your Own / ClearQuest | JIRA | JIRA just works . . . very well. I wish the Scrum interactions were a bit better (GreenHopper I find non-intuitive at times). |
| Source Code Control | Perforce | Git | Git is free, fast and branches are easy. It's a bit hard to get used to but once you do you'll never look back. Oh and GitHub. |
| IDE | Eclipse | Eclipse | As much as I'd love to switch to IntelliJ the plugins and support just don't match Eclipse (esp. for AWS). |
| UI | Web (HTML/Javascript/CSS) | Web AND Native Mobile | I still love the web and Javascript but HTML5 is still in its infancy. With mobile web and app usage skyrocketing the best way to get good performance is (for now) to go native. Still, perhaps in 3 or 4 years HTML5 support will be better. |
| Messaging Platform | JMS | SQS | Again since I love AWS I gotta go with SQS, but what I really love about it is its failover capabilities (backed by 3 copies in S3) and its separation of read vs. delete. Brilliant. Their Pub-Sub solution (SNS) is equally great. |
| Testing | 50% Manual, 50% Automated | 100% Automated | Speed matters - automation + continuous integration (and ideally continuous release) is critical to that end. |

I could go into more stuff like Team communication / projects (e.g. Sharepoint vs. Wikis vs. SaaS providers) but I haven't seen anything that makes me go "Wow" - although after my experience with JIRA I'd probably start with Atlassian stuff.



I'm sure in 5 years time I will be making newer and more informed choices. The great thing about Software is all parts of it are always on the move. The hard thing about Software is trying to keep up with the same! :-)

I'd like to hear people's thoughts on the above, their own personal experiences, preferences and if they are aware of anything I've forgotten or if they have questions about technology X?

9/16/12

I wish Steve McConnell was on Twitter . . . . Book review of "Software Estimation"

Published over 6 years ago, "Software Estimation" by Steve McConnell is a great read.
As a practitioner of the agile arts I must say in reading it now this book seems like the last great attempt to "fix" waterfall and "big design up front" (BDUF) methodologies which were known for their very distinct big phases of requirements, design, development, testing and release. The kicker for these techniques was often that the development and testing estimates were VERY far off. If they followed McConnell's advice as he laid out in this book they'd have had more success.

Agile basically works around many of the problems McConnell tries to solve by focusing on short iterations (of < 4 weeks) with new releasable and functional software produced at the end. Basically it avoids many of the risks inherent in Waterfall/BDUF by making the software cycle "too small to fail".

That said there are a great many things still to be learned from Steve's great book. Waterfall and BDUF are by no means dead, and even so there are some lessons here about the inherent nature of (errors in) estimations by developers. Even in agile I have experienced serious under-estimation (by a factor of 2x or 3x) by developers - where a story that should have taken 1 sprint takes 3 or more. So we still have much to learn. However the key theme this book drove home to me was pretty much that "Software estimation is so hard that we pretty much gave up and are doing short iterations because that's the most we can estimate". I am sure that wasn't Steve's point, but that was my inference, because since the book's release estimation has taken a back seat to story points, burn down charts, stand-ups and sprints.
More people are becoming Scrum Masters and fewer are taking the PMP.

In this long blog article I try to capture some of the key lessons I took from this book.

Part 1: Critical estimation concepts

CHAPTER 1: What Is an "Estimate"?

Tip #1: Distinguish between Estimates, Targets and Commitments

  • When business people are asking for an "estimate", they're really often asking for a commitment or a plan to meet a target
  • Estimation is not planning
  • When you see a "single point estimate" ask whether the number is an estimate or whether it's really a target
  • A common assumption is that the distribution of software project outcomes follows a bell curve. The reality is much more skewed.
  • What is a good estimate?
  • The approach should provide estimates that are within 25% of the actual results 75% of the time
  • Events that happen during the project always invalidate the assumptions that were used in the estimate.
  • The primary purpose of software estimation is to determine whether a project's targets are realistic enough to allow the project to be controlled to meet them.  The executives want a plan to deliver as many features as possible by a certain date.
  • "A good estimate is an estimate that provides a clear enough view of the project reality to allow the project leadership to make good decisions about how to control the project to hit its targets"

CHAPTER 2: How Good an Estimator are you?
  • Studies have confirmed that most people's intuitive sense of "90% confident" is really comparable to something closer to "30% confident" 
  • Where does the pressure to use narrow ranges come from? You or an external source?

CHAPTER 3: Value of Accurate Estimates?
  • Is it better to overestimate or underestimate?
  • If a project is overestimated, stakeholders fear that Parkinson's law will kick in - work will expand to fill the time allotted.
  • Another concern is given too much time, developers will procrastinate until late in the project.
  • A related motivation for underestimating is the desire to instill a sense of urgency
  • Figure 3.1: The penalties for underestimation are more severe than the penalties for overestimation
  • The Software Industry's Estimation Track Record
    • Failure rates
    • 1000 LOC   2%
    • 10,000 LOC   7%
    • 100,000 LOC 20%
    • 1,000,000 LOC 48%
    • 10,000,000 LOC 65%
  • The Software industry has an underestimation problem.
  • What top executives value most is predictability - businesses need to make commitments to customers, investors, suppliers, the marketplace and other stakeholders

CHAPTER 4: Where Does Estimation Error Come From
  • Four Generic sources
    • Inaccurate information about the project being estimated
    • Inaccurate information about the capabilities of the organization that will perform the project
    • Too much chaos IN the project to support accurate estimation (i.e. try to estimate a moving target)
    • Inaccuracies arising from the estimation process itself
  • Simple example of a Telephone Number checker and the requirements questions / uncertainties that could result in very different design approaches.
  • The cone of uncertainty
    • Initial Concept  0.25x to 4x  (Range = 16x)
    • Approved Product Definition  0.5x to 2x (Range = 4x)
    • Requirements Complete 0.67x to 1.5x
    • UI Design Complete 0.8x to 1.25x
    • Detailed Design Complete 0.9x to 1.1x
  • The cone of uncertainty is the BEST-case accuracy possible to have. It isn't possible to be more accurate - it's only possible to be more lucky.
  • The cone does not narrow itself - if a project is not well controlled you can end up with a cloud of uncertainty that contains even more estimation error. 
  • "What you give up with approaches that leave requirements undefined until the beginning of each iteration is long-range predictability"
  • Sources of project chaos
    • Requirements that were not investigated very well
    • Poor designs leading to lots of code rewrite
    • Poor coding practices leading to extensive bug fixing
    • Inexperienced personnel
    • Incomplete or unskilled project planning
    • Prima Donna team members
    • Abandoning planning under pressure
    • Developer gold-plating
    • Lack of source code control software
  • In practice, project managers often neglect to update their cost and schedule assumptions as requirements change.
  • Omitted Activities (p. 44)
    • Missing Requirements
      • Non functional requirements: Accuracy, modifiability, Performance, Scalability, Security, Usability etc.
    • Missing software-development activities
      • Ramp-up time for team members
      • Mentoring
      • Build & Smoke Test support
      • Requirements clarification
      • Creating test data
      • Beta program management
      • Technical reviews
      • Integration work
      • Attendance at meetings
      • Performance tuning
      • Learning new tools
      • Answering questions
      • Reviewing technical documentation etc.
    • Missing non-software-development activities
      • Vacations, Holidays, Sick days, Training, Weekends(!?!?)
      • Company meetings, department meetings, setting up new workstations
    • Developer estimates tend to contain an optimism factor of 20 to 30%. Although managers complain that developers sandbag their estimates - the reverse is true.  Boehm also found a "fantasy factor" of 1.33
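
To make the cone of uncertainty concrete, here is a minimal Python sketch that turns a single-point estimate into a range for each phase. The multipliers are copied from the list above; the function name `estimate_range` and the 12 staff-month example are my own illustration, not something from the book.

```python
# Multipliers copied from the cone-of-uncertainty list in these notes.
CONE = {
    "Initial Concept": (0.25, 4.0),
    "Approved Product Definition": (0.5, 2.0),
    "Requirements Complete": (0.67, 1.5),
    "UI Design Complete": (0.8, 1.25),
    "Detailed Design Complete": (0.9, 1.1),
}


def estimate_range(point_estimate, phase):
    """Return the (low, high) range implied by the cone for a given phase."""
    low, high = CONE[phase]
    return point_estimate * low, point_estimate * high


# Hypothetical example: a 12 staff-month point estimate at each phase.
for phase in CONE:
    low, high = estimate_range(12.0, phase)
    print(f"{phase}: {low:.1f} to {high:.1f} staff months")
```

Remember these are best-case ranges: a poorly controlled project can land well outside them.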

CHAPTER 5: Estimate Influences

  • The obvious one: Project Size
  • Diseconomies of scale: a 1M LOC project takes more than 10x the effort of a 100k LOC project.
  • The basic issue is that in larger projects coordination among larger groups of people requires more communication. As project size increases, the number of communication paths among people increases as a SQUARED function of the number of people on the project.
  • Lines of code per staff year
    • 10k LOC project --> 2k to 25k
    • 100k LOC project --> 1k to 20k
    • 1M LOC project --> 700 to 10k
    • 10M LOC project --> 350 to 5k
  • Other influences: The kind of software being developed
  • Personnel factors
    • According to Cocomo II on a 100k LOC project the combined effect of personnel factors can swing a project estimate by as much as a factor of 22!
    • The KEY personnel decision: Requirements Analyst Capability and only THEN the programmer
    • The magnitude of these factors has been confirmed in numerous other studies
  • Other influences: Programming Language
  • Lots of other adjustment factors: See table 5-4 on page 66
  • Key Learning: Small and Medium-sized projects can succeed largely on the basis of strong individuals. Large projects however still need strong individuals but project management, organizational maturity and how well the team coalesces are just as significant.
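
The "squared function" for communication paths in this chapter is easy to see with a quick calculation. This sketch assumes the usual pairwise count n(n-1)/2, where every person can talk to every other person; the notes above only say "squared", so treat the exact formula as my assumption.

```python
def communication_paths(people):
    """Number of distinct person-to-person links in a team (assumed n*(n-1)/2)."""
    return people * (people - 1) // 2


# Hypothetical team sizes, just to show how quickly the count grows.
for size in (2, 5, 10, 25, 50):
    print(f"{size} people -> {communication_paths(size)} paths")
```

Doubling the team from 5 to 10 people more than quadruples the number of paths, which is the diseconomy of scale in miniature.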

PART II: Fundamental Estimation Techniques

CHAPTER 6: Introduction to Estimation Techniques

Considerations in choosing estimation techniques

  1. What's being estimated - features, schedule, effort
  2. Project Size 
    1. Small: < 5 total technical staff. Best estimates are usually "bottom-up" techniques created by the individuals who will do the actual work
    2. Large: 25+ people on a project that lasts 6 to 12 months or more. For these teams the best estimation approaches tend to be "top-down" approaches in the early stages. As the project progresses more bottom-up techniques are introduced and the project's own historical data will provide more accurate estimates.
    3. Medium: 5 to 25 people lasting 3 to 12 months. Can use any of the techniques above.
  3. Software Development Style: Iterative vs. Sequential
    1. Evolutionary Prototyping
    2. Extreme Programming
    3. Evolutionary Delivery
    4. Staged Delivery
    5. RUP
    6. Scrum
CHAPTER 7: Count, Compute, Judge
  • Count first
  • Count if at all possible, compute when you can't count. Use judgement alone ONLY as a last resort
  • What to count? Find something to count that's highly correlated with the size of the software you are estimating. And find something to count that is available sooner rather than later in the development.
  • Historical data
    • Average effort hours per requirement for development
    • Average total effort hours per use case / story
    • Average dev/test/doc effort per change request 

CHAPTER 8: Calibration and Historical Data
  • Used to convert counts to estimates - lines of code to effort, user stories to calendar time, requirements to number of test cases
  • Your estimates can be calibrated using any of three kinds of data
    • Industry data
    • Historical data
    • Project data
  • Using data helps avoid subjectivity, unfounded optimism and some biases.
  • It also helps reduce estimation politics
  • Start with a small set of data
    • Size (LOC)
    • Effort (Staff months)
    • Time (Calendar months)
    • Defects (classified by severity)
  • Be careful how you measure e.g. 8 hour work days? How about vacations? Overtime?
  • It is surprisingly difficult in many organizations to determine how long a particular project lasted
CHAPTER 9: Individual Expert Judgment
  • To create the task-level estimates, have the people who will actually do the work create the estimates
  • When estimating at the task level, decompose the work into tasks that will require no more than about 2 days of effort each. Tasks larger than that will contain too many places where unexpected work can hide. Ending up with estimates at 0.25 to 0.5 days of granularity is appropriate.
  • Use Ranges to help identify risks and where things can (and often do) go wrong
    • Best Case
    • Most Likely Case
    • Worst Case
    • Expected Case
  • Estimate Checklist
    • Is what's being estimated clearly defined?
    • Does the estimate include all the KINDS of work needed to complete the task?
    • Does the estimate include all the FUNCTIONALITY AREAS needed to complete the task?
    • Is the estimate broken down into enough detail to expose hidden work?
    • Have you looked at notes from past work rather than estimating from pure memory?
    • Is the estimate approved by the person who will actually do the work?
    • Is the productivity assumed in the estimate similar to what has been achieved on similar assignments in the past
    • Does the estimate include a Best Case, Worst Case and Expected Case?
    • Have the assumptions in the estimate been documented?
    • Has the situation changed since the estimate was prepared?
  • Compare actual performance to estimated performance so that you can improve estimates over time.
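
One common way to get from Best / Most Likely / Worst Case to an Expected Case is the PERT weighted average, (best + 4 x most likely + worst) / 6. The sketch below assumes that formula; the function name and the sample numbers are mine and purely illustrative.

```python
def expected_case(best, most_likely, worst):
    """PERT-style expected value: the most likely case gets four times the weight."""
    return (best + 4 * most_likely + worst) / 6


# Hypothetical task estimated at 2 days best, 4 days most likely, 12 days worst.
print(expected_case(2, 4, 12))
```

Notice how a long worst-case tail pulls the expected case above the most likely case, which is exactly the risk a single-point estimate hides.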

CHAPTER 10: Decomposition and Recomposition
  • The key is if you create several smaller estimates some of the estimation errors will be on the high side and some will be on the low side. The errors will tend to cancel each other out to some extent. Research has found that summing task durations was negatively correlated with cost and schedule overruns.
  • Since developers tend to give near-Best case estimates, schedule overruns often compound on one another since the chance of each of the estimates coming in as scheduled is so very low.

CHAPTER 11: Estimation by Analogy
  1. Get detailed size, effort and cost results for a similar previous project
  2. Compare the size of the new project to a similar past project
  3. Build up the estimate for the new project's size as a percentage of the old project's size
  4. Create an effort estimate based on the size of the new project compared to the previous project
  5. Check for consistent assumptions across the old and new projects

CHAPTER 12: Proxy-based Estimates
  • Fuzzy Logic
    • Very Small
    • Small
    • Medium
    • Large
    • Very Large
  • As a rule of thumb the differences in size between adjacent categories should be at least a factor of 2
  • Story Points e.g. Fibonacci sequence. 
  • Cautions about rating scales - the use of a numeric scale implies that you can perform numeric operations on the numbers: multiplication, addition, subtraction and so on. But if the underlying relationships aren't valid - that is, a story worth 13 points doesn't really require 13/3 times as much effort as a story worth 3 points - then performing numeric operations on the 13 isn't any more valid than performing numeric operations on "Large" or "Very Large"
  • T-Shirt Sizing
    • Remember that the goal of software estimation is not pinpoint accuracy but estimates that are accurate enough to support effective project control
    • In this approach developers classify each feature's size relative to other features as Small, Medium, Large, XL etc.
    • This allows the business to trade-off and look for features with the most business value and lowest development cost.
CHAPTER 13: Expert Judgment in Groups
  • Group Reviews
    • Have each team member estimate pieces of the project individually, and then meet to compare estimates
    • Don't just average estimates - discuss the differences
    • Arrive at a consensus estimate that the whole group accepts
  • Individual estimates have a Magnitude of Relative Error (MRE) of 55%.
  • Group-reviewed estimates average an error of only 30%
  • Studies have found that the use of 3 to 5 experts with different backgrounds seems to be sufficient.
  • Wideband-Delphi Technique
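
For reference, Magnitude of Relative Error is usually computed as |actual - estimate| / actual. Here is a small sketch under that assumption; the helper names and the sample figures are hypothetical, not data from the book.

```python
def mre(actual, estimate):
    """Magnitude of Relative Error for one estimate (assumed |actual - estimate| / actual)."""
    return abs(actual - estimate) / actual


def mean_mre(pairs):
    """Average MRE over a list of (actual, estimate) pairs."""
    return sum(mre(actual, estimate) for actual, estimate in pairs) / len(pairs)


# Hypothetical history of (actual, estimated) effort in staff days.
history = [(100, 70), (40, 20), (10, 12)]
print(f"Mean MRE: {mean_mre(history):.0%}")
```

Tracking this number per person and per team over time is one way to do the "compare actuals to estimates" step from Chapter 9.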

CHAPTER 14: Software Estimation Tools
  • Allows you to simulate different project outcomes
  • Data you'll need to calibrate tools
    • Effort in staff months
    • Schedule, in elapsed months
    • Size, in lines of code
  • Summary of available tools - see p. 163 (valid as of 2006)

CHAPTER 15: Use of Multiple Approaches
  • Use multiple estimation techniques and look for convergence or spread among the results

CHAPTER 16: Flow of Software Estimates on a Well-Estimated Project
  • When you reestimate in response to a missed deadline base the new estimate on the project's ACTUAL progress not on the project's planned progress.

CHAPTER 17: Standardized Estimation Procedures

Estimation should be fit into a Stage-Gate process
  • Discovery
    • Approved preliminary business case
  • Scoping
    • Approved product vision
    • Approved marketing requirements
  • Planning
    • Approved software development plans
    • Approved budget
    • Approved final business case
  • Development
    • Approved software release plan
    • Approved marketing launch plan and operations plan
    • Approved software test plan
    • Pass release criteria
  • Testing and Validation 
    • Pass release criteria
  • Launch
The process should
  • Emphasize counting and computing rather than use of judgement
  • Call for the use of multiple estimation approaches
  • Communicate a plan at predefined points
  • Contain a clear description of an estimate's inaccuracy
  • Define when an estimate can be used as the basis for a project budget
  • Define when an estimate can be used as the basis for internal or external commitments.

PART III: Specific Estimation Challenges

CHAPTER 18: Special Issues in Estimating Size
  • Using Lines of Code in Size estimation (data is easily collected but translation into "staff months" of effort is error prone)
  • Function-Point Estimation
    • The number of function points in a program is based on the number and complexity of
      • External inputs (e.g. screens, forms, dialog boxes)
      • External outputs (e.g. screens, reports, graphs etc)
      • External queries
      • Internal logical files
      • External interface files

CHAPTER 19: Special Issues in Estimating Effort
  • Productivity variations among different kinds of software projects can show very different effort estimates (per LOC) and cost (per LOC)
CHAPTER 20: Special Issues in Estimating Schedule
  • Basic Schedule Equation
    • Schedule In Months = 3.0 x cubeRoot(StaffMonths)
  • Schedule compression and the shortest possible schedule
    • If the feature set is not flexible, shortening the schedule depends on adding staff to do more  work in less time
    • Numerous estimation researchers have investigated the effects of compressing a nominal schedule.
    • All researchers have concluded that shortening the nominal schedule will increase total development effort.
    • There is also an impossible zone and you can't beat it - the consensus of researchers is that schedule compression of more than 25% from nominal is not possible
    • Similarly you can reduce costs by lengthening the schedule and conducting the project with a smaller team
    • Lawrence Putnam conducted fascinating research on the relationship between team size, schedule and productivity.
    • Schedule decreases (and effort increases) as you add team members - until you hit 5-7 on a team. After  that the effort goes up very much more quickly and schedule ALSO starts to get longer.
    • Thus a team size of 5 to 7 people appears to be economically optimal for medium-sized business system projects.
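
The Basic Schedule Equation and the 25% compression limit from this chapter can be put together in a few lines. This is just a sketch: the function names and the 27 staff-month example are mine, and the 0.75 factor encodes the "no more than 25% compression" consensus noted above.

```python
def nominal_schedule_months(staff_months):
    """Basic Schedule Equation: 3.0 x cube root of the effort in staff months."""
    return 3.0 * staff_months ** (1.0 / 3.0)


def shortest_schedule_months(staff_months, max_compression=0.25):
    """Edge of the 'impossible zone': nominal schedule compressed by at most 25%."""
    return nominal_schedule_months(staff_months) * (1.0 - max_compression)


# Hypothetical project estimated at 27 staff months of effort.
effort = 27
print(f"Nominal:  {nominal_schedule_months(effort):.1f} months")
print(f"Shortest: {shortest_schedule_months(effort):.2f} months")
```

Anything shorter than the second number is in the impossible zone, no matter how many people you add.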
CHAPTER 21: Estimating Planning Parameters
  • Estimating Architecture, Requirements, Management effort for projects of different sizes. The larger the project the more the architecture, test, requirements and management costs.
  • Developer-to-test ratio is settled more by planning than by estimation - that is it is determined more by what you think you SHOULD do than by what you predict you will do.
  • Good analogy about ideal time and planned time: football game - 60 minutes vs. 2 to 4 hours elapsed time.
  • Defect Removal
    • Formal Design Inspections: 55% rate of removal (mode)
    • Informal design review: 35%
    • Formal code inspection: 60%
    • Informal code review: 25%
    • Low Volume (< 10 sites) Beta Test: 35%
    • High Volume (> 1,000 sites): 75%
    • System Test: 40%
  • Other rules of thumb
    • To go from one-company, one-campus development to multi-company, multi-city: allow for a 25% increase in effort.
    • To go from one-company, one campus development to international outsource, allow for a 40% increase in effort.
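
A rough way to see why stacking defect removal steps pays off: if you assume each step independently catches its listed share of whatever defects are still left (a simplifying assumption of mine, not a claim from the book), the combined removal rate is 1 minus the product of the miss rates.

```python
def combined_removal_rate(rates):
    """Combined removal rate, ASSUMING each step acts independently on the remainder."""
    remaining = 1.0
    for rate in rates:
        remaining *= 1.0 - rate  # fraction of defects that slip past this step
    return 1.0 - remaining


# Rates taken from the list above: formal code inspection (60%) then system test (40%).
print(f"{combined_removal_rate([0.60, 0.40]):.0%}")
```

Under that assumption, two mediocre-looking steps in series already beat any single step on the list.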
CHAPTER 22: Estimate Presentation Style
  • Communicating Estimate Assumptions
    • Which features are in scope
    • Which features are out of scope 
    • Availability of resources
    • Dependencies on 3rd-parties (and their performance)
    • Unknowns
  • Expressing Uncertainty
  • Try to present your estimate in units that are consistent with the estimate's underlying accuracy
  • Ranges are the most accurate way to reflect the inherent uncertainty in estimates at various points in the Cone of uncertainty.
  • Do not present a commitment as a range, a commitment needs to be specific

CHAPTER 23: Politics, Negotiation and Problem Solving
  • Estimate negotiations tend to be between introverted and more junior technical staff and seasoned professional negotiators.
  • Understand that executives are assertive by nature and by job description and plan your estimation discussions accordingly.
  • You can negotiate the commitment but do NOT negotiate the estimate
  • Educate nontechnical stakeholders about effective software estimation practices
  • Treat estimation discussions as problem solving, not negotiations. Recognize that all project stakeholders are on the same side of the table. Everyone wins, or everyone loses.
  • Getting to Yes
    • Separate the people from the problem
    • Focus on interests, not positions
    • Invent options for mutual gain
    • Insist on using objective criteria


Frank's Summary
This is a great book I should have read a few years ago. Everyone should. Even if you are doing agile development there are tons of great tips and tricks (e.g. effectiveness of design & code inspections, using best/worst case estimates, negotiation techniques) that are useful regardless.

Like I said above I think the rise of agile techniques pretty much indicates, at least to me, that most software practitioners do not have the patience, determination and doggedness to follow the practices McConnell outlines above. Because of that their estimates (and thus their perceived performance by their customers) are poor. Agile methods, especially Scrum and Kanban, have achieved success by trying to limit the cognitive planning and estimating load - keeping the process simple and light and "result" focused (ship!). What I like to call "Too Small To Fail". Even so a lot of organizations have trouble adopting agile and need help. There are various reasons for this but they are the same reasons their other processes were flawed - the problem is not in the process itself but in how it is being executed and the ability of those trying to do the execution.

I just wish Steve McConnell was on twitter though - I could do with a daily dose of the knowledge and wisdom he puts into his books.







9/1/12

Securing your data in the cloud - key management hell

Invariably when you start migrating to the cloud you'll find you need to encrypt some or all of your data that you store there. Apart from the performance hit this seems easy right?
Except when you start to think of how you are going to manage keys used to encrypt / decrypt.
Because your Data and your business logic (app servers) are no longer in your control you can't just leave your keys on your app server (EC2) instances. If a hacker compromises those data keys then they can access your data. The same is true in normal in-house environments but at least you can trust the folks who run your data center - or at least there is direct accountability - you can fire them or pursue them legally.  If an AWS person in Tokyo for example goes rogue what is your recourse?

So your data encryption keys THEMSELVES need to be encrypted. OK not a big deal.
But now where do you store the "master" keys to decrypt the data keys . . . . and so on. Maybe you store the master keys in your non-cloud environment and call out from EC2 to get them (but now you could be subject to another type of attack).  So far I haven't heard a good architectural solution (barring something like human based Two factor authentication required when starting up an EC2 instance? But now your auto-scaling is hosed).

Anyone have any ideas or see any workable reasonable solutions?

3/20/12

Architecting in the Cloud

So I am starting to investigate architecting an app for AWS. I have the basics of EC2, S3, SQS, RDS and SimpleDB (now replaced by DynamoDB). But I had some lingering questions about architecting solutions for the cloud, as I am always a believer that "There ain't no such thing as a free lunch".

So I came across the following wonderful talk on YouTube and the associated slides on Slideshare.
I figured I'd share all the links here for easy access. Enjoy!


Architecting in the Cloud by Simone Brunozzi
AWS Cloud Tour, July 2011

Part 1

Part 2

Part 3

Part 4

Part 5


Slides on Slideshare

1/29/12

The Perils of Asynchrony

Every time you come across anything more than a rudimentary system that has some moderately serious performance needs, someone somewhere on the team considers using asynchronous processing to help reduce (perceived) response time. That person needs to be identified and quickly locked in a padded room . . . . Just kidding!

Often that person is me and often I end up writing the code and relearning why doing things in an asynchronous way (that also meets a certain "near guarantee" SLA including DR and HA needs) is very very hard.

So it was with some chagrin that I was tasked with coding some infrastructure components to implement a Task Queue for my current team. The goal was to satisfy a need to improve response time of the application when it was doing some back-end tasks that required 100ms or more. The tasks themselves aren't time critical but they are important e.g. replicating data to a remote site.

Fortunately this time I'm not writing financial systems software - there you typically need to guarantee that a task, although asynchronous, will be done within some SLA (e.g. 2-3 seconds). So at least I didn't have that worry. That said, for most apps asynchronous doesn't mean performing the task "whenever" - it means a little bit later than now. In addition you still have to ensure that your solution is Highly Available and has built-in Disaster (or just plain VM crash) Recovery.

Anyway the default solutions in Java for Asynchronous processing are
1) Threads (and java.util.concurrent - which is awesome)
2) JMS (Java Message Service)

Why Threads aren't the solution
java.util.concurrent is great - a really great step up from raw Threads and writing your own infrastructure to start, stop and generally manage threads and thread pools. If you haven't dug into this package yet do so now. It's a life saver.

The downside to Threads / java.util.concurrent is that when you launch a job to be executed in the same JVM and same VM slice then
1) If your JVM or your VM dies without some persistence mechanism your job is lost forever.
2) You never have just one asynchronous job and you'd like to be able to balance the load of asynchronous jobs across your entire farm of virtual machines and their JVMs.

JMS in a Nutshell
So that's where people start to invoke JMS (Java Message Service). JMS defines two core models of asynchronous communication.
1) Queues - which are a point-to-point communication system. VM #3 writes a message to a queue and some other thread in some other VM reads that message from the queue. Each message has one writer and is consumed by exactly one reader.
2) Topics - which are a "broadcast" communication system. VM #3 broadcasts a message to a topic and multiple threads on multiple VMs (each with potentially different processing goals) read the message.
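
To make the difference between the two models concrete, here's a tiny in-memory illustration. It is NOT the JMS API (no broker, no javax.jms) and the class and method names are invented. The round-robin loop is just a stand-in for readers competing for messages on a queue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

// In-memory illustration only; this is not JMS and all names are hypothetical.
class QueueVsTopicSketch {

    // Point-to-point: each message is handed to exactly one reader.
    static List<String> deliverViaQueue(List<String> messages, int readerCount) {
        ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>(messages);
        List<String> deliveries = new ArrayList<>();
        int reader = 0;
        String msg;
        while ((msg = queue.poll()) != null) {
            deliveries.add("reader" + reader + ":" + msg);
            reader = (reader + 1) % readerCount; // round-robin stands in for competing readers
        }
        return deliveries;
    }

    // Broadcast: every subscriber gets its own copy of every message.
    static List<String> deliverViaTopic(List<String> messages, int subscriberCount) {
        List<String> deliveries = new ArrayList<>();
        for (String msg : messages) {
            for (int s = 0; s < subscriberCount; s++) {
                deliveries.add("subscriber" + s + ":" + msg);
            }
        }
        return deliveries;
    }
}
```

With two messages and three readers, the queue version produces two deliveries and the topic version produces six.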

Here come the problems . . . .
The key problem with Queues is ensuring that, although there is only one reader, the job actually gets executed once it is read from the queue.

The key problem with Topics is often ensuring that although there are multiple readers - every individual type of processing (e.g. store a record in a database table) happens only once - otherwise you are wasting resources.

There are other problems too. JMS doesn't have any built-in specification for disaster recovery or high availability. Therefore each JMS provider (and there are many e.g. MQSeries, Progress Sonic, ActiveMQ) provides its own mechanisms.

Persistence
The first of these other problems is ensuring that if a message is written (to a topic or queue) and one of the JMS processes (a broker) crashes or there's a network hiccup, the message is not lost. For that you need a persistence mechanism. The persistence mechanisms either use the local filesystem of the broker or a relational DB. Either way you are just pushing your DR problem a bit further away from your JMS broker - now you are worrying about the local disk or your database. IMO it's not exactly a solution to replace one DR nightmare with another one. On top of that, another MAJOR downside to persistence is that it typically reduces throughput by an order of magnitude or more. See for example this link for ActiveMQ alone and this link comparing ActiveMQ, JBossMQ and JBoss Messaging.

Guaranteed Messaging - whose guarantee?
The other problem is ensuring that once a message is read by a thread, the thread executes the task successfully. That guarantee is impossible - VMs crash, exceptions are thrown etc. Ideally if the thread can't handle the task you would want to put it back on the queue for someone else to attempt to handle it. This is hard with JMS although there is a way to "trick" JMS into supporting this that involves using message acknowledgement. That is, you do not use the AUTO_ACKNOWLEDGE default. Instead you use CLIENT_ACKNOWLEDGE and only acknowledge receipt of the message to the broker after the message has been successfully processed. However please check how your broker does this as there is some evidence that doing this just blocks the broker - giving you more guaranteed message handling at a huge cost in scalability.

Another problem with this approach is that you need to handle the possibility of messages being handled more than once. The reason for this is that the thread that originally picked up the message may have processed it successfully but failed JUST before it was going to acknowledge the message. Thus hopefully your message processing is either idempotent or can handle duplicate processing of messages without too much trouble.
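
One common way to cope with duplicates is to make the handler remember which message IDs it has already processed. Here's a rough sketch of that idea. It assumes every message carries a unique ID (the messageId parameter is an assumption, as are all the names), and it keeps the IDs in memory, so a real version would need to store them somewhere durable alongside the work itself.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical duplicate-tolerant handler; not tied to any messaging API.
class IdempotentHandlerSketch {
    // IDs already handled. In-memory only, so this is lost on a crash.
    private final Set<String> processedIds = ConcurrentHashMap.newKeySet();
    private final AtomicInteger sideEffects = new AtomicInteger();

    // Returns true if the message was processed, false if it was a duplicate.
    boolean handle(String messageId, String payload) {
        if (!processedIds.add(messageId)) {
            return false; // seen before: skip the work, the caller can still acknowledge
        }
        sideEffects.incrementAndGet(); // stand-in for the real work, e.g. a DB write
        return true;
    }

    int sideEffectCount() {
        return sideEffects.get();
    }
}
```

Delivering the same ID twice then does the work only once. The second call returns false and the caller can safely acknowledge the message anyway.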

High Availability
The other side of the coin for persistence is high availability. Although you want a broker storing data in a file system or a database for persistence, you need some number of other brokers (at least one) up and running and ready to take over. Ideally they are on a different VM, a different blade and perhaps in a different data center. None of which is easy.

Just a side note real quick - one thing I really liked about Amazon's Simple Queue Service (SQS) is how it handles persistence. Unlike JMS, which has simple "Send Message" and "Read Message" semantics, SQS has "Send Message", "Read Message" and "Delete Message". Unlike in JMS, reading a message in SQS does not remove the message from the queue. It "locks" it temporarily and if the message consumer does not call SQS to "delete" the message within a certain time (30 seconds is the default), SQS itself puts the message back on the queue and assumes the original consumer failed to process it. So why not use SQS? Well frankly all my processing is local and I'd rather not take the Boston -> Virginia network hit (on the send and the receive). But it would make sense if my app was deployed entirely inside EC2.
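
The receive / delete idea is simple enough to sketch in memory. To be clear, this is NOT the AWS SDK and none of these names come from SQS. It's just an illustration of the "hide on read, put back if not deleted in time" behaviour described above. The clock is passed in as a parameter to keep the example deterministic, and message bodies are assumed to be unique.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// In-memory illustration of receive / delete with a timeout; all names are hypothetical.
class VisibilityTimeoutQueueSketch {
    private final long timeoutMillis;
    private final Deque<String> visible = new ArrayDeque<>();
    // message -> time at which it becomes visible again if nobody deletes it
    private final Map<String, Long> inFlight = new HashMap<>();

    VisibilityTimeoutQueueSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    void send(String message) {
        visible.addLast(message);
    }

    // Returns the next visible message (or null) and hides it until the timeout expires.
    String receive(long nowMillis) {
        requeueExpired(nowMillis);
        String message = visible.pollFirst();
        if (message != null) {
            inFlight.put(message, nowMillis + timeoutMillis);
        }
        return message;
    }

    // The consumer calls this only after it has finished the work.
    boolean delete(String message) {
        return inFlight.remove(message) != null;
    }

    // Anything not deleted in time is assumed failed and goes back on the queue.
    private void requeueExpired(long nowMillis) {
        Iterator<Map.Entry<String, Long>> it = inFlight.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (entry.getValue() <= nowMillis) {
                visible.addLast(entry.getKey());
                it.remove();
            }
        }
    }
}
```

A consumer that crashes between receive and delete costs you one timeout period of delay, and then another consumer gets the message.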

Another problem: message ordering
Oh wait here's another gem for you to ponder. Say your app has some basic CRUD features - Create, Read, Update, Delete. Now you start to put CRUD related activities on the queue. Queues do not implicitly guarantee ordering. So a thread may begin processing an Update event BEFORE the related Create event was finished. Awesome eh!!! So what do you do in this case?

Sometimes you can detect that failure (you can't update something that wasn't created) and just put the message back on the queue or retry a small number of times before failing. Others make the choice to have a single reader for these critical cases to ensure proper ordering.
The downside to THAT of course is
1) Scalability sucks
2) How are you going to handle disaster recovery / failover?

For #2 I've seen folks configure 1 thread in N separate JVMs across a Linux VM farm. Only one thread is truly active at a time - the others are on "hot standby" using some form of ping / heartbeat functionality to detect if the "one true" thread dies and then some mechanism for becoming the active thread and telling all the other threads. MAN was that complicated and buggy code.
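
Going back to the first option (detect the failure and put the message back on the queue), here's a minimal sketch of the retry logic. The Event class, the "CREATE" / "UPDATE" strings and the limit of 3 attempts are all assumptions picked for the example. It runs on a single in-memory queue, so it only shows the bookkeeping, not a real broker.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: requeue an UPDATE that arrives before its CREATE.
class OutOfOrderRetrySketch {
    static final int MAX_ATTEMPTS = 3; // arbitrary limit for the example

    static final class Event {
        final String type; // "CREATE" or "UPDATE"
        final String id;
        int attempts;

        Event(String type, String id) {
            this.type = type;
            this.id = id;
        }
    }

    // Processes events in arrival order and returns a log of what happened.
    static List<String> process(List<Event> arrivals) {
        Deque<Event> queue = new ArrayDeque<>(arrivals);
        Set<String> created = new HashSet<>();
        List<String> log = new ArrayList<>();
        Event e;
        while ((e = queue.pollFirst()) != null) {
            e.attempts++;
            if (e.type.equals("CREATE")) {
                created.add(e.id);
                log.add("created " + e.id);
            } else if (created.contains(e.id)) {
                log.add("updated " + e.id);
            } else if (e.attempts < MAX_ATTEMPTS) {
                queue.addLast(e); // can't update what doesn't exist yet: try again later
                log.add("requeued update " + e.id);
            } else {
                log.add("failed update " + e.id); // give up and surface the error
            }
        }
        return log;
    }
}
```

An UPDATE that shows up ahead of its CREATE gets requeued once and then succeeds. An UPDATE whose CREATE never arrives is requeued twice and fails on its third attempt.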

Oh just one more problem - queue backups!
Even after solving all these problems
- Persistence
- High Availability
- Messages handled more than once
- Messages arrive out of order.

you still get the problem of what happens if your asynchronous task processing slows down, your thread pool becomes maxed out and your queue starts to back up. In one case we have to replicate data from Europe to Asia. Although it's over a high speed network, the latency is at LEAST 250 ms. And it could be worse. You can't just add threads - you might max out your JVM's memory. I don't live in an EC2 world (yet) where I can just spin up new Linux VMs - and even if I did the cost might be prohibitive at some point.
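
One way to at least keep a backed-up queue from eating all your memory is to put a hard bound on it and decide explicitly what happens to the overflow. Here's a sketch using ThreadPoolExecutor with a bounded ArrayBlockingQueue and the AbortPolicy rejection handler. The class name, pool size of 1 and the latch used to simulate a slow remote site are assumptions for the example. What you do with the rejected work is a separate decision.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: a bounded work queue that rejects work instead of growing forever.
class BoundedQueueSketch {

    // Waits on the latch; stands in for a task stuck on a slow remote call.
    static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Submits `tasks` slow jobs to a single worker with a bounded queue
    // and returns how many were rejected because the queue was full.
    static int countRejected(int tasks, int queueCapacity) {
        CountDownLatch release = new CountDownLatch(1);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.AbortPolicy());
        int rejected = 0;
        try {
            for (int i = 0; i < tasks; i++) {
                try {
                    pool.execute(() -> awaitQuietly(release));
                } catch (RejectedExecutionException e) {
                    rejected++; // queue is full: this is where you'd spill to disk, drop or push back
                }
            }
        } finally {
            release.countDown(); // let the stuck worker finish
            pool.shutdown();
        }
        return rejected;
    }

    public static void main(String[] args) {
        System.out.println("rejected: " + countRejected(5, 2));
    }
}
```

With one worker and a queue capacity of 2, five slow jobs means one is running, two are queued and two are rejected. Swapping AbortPolicy for ThreadPoolExecutor.CallerRunsPolicy makes the submitting thread do the work itself, which slows the producer down instead of throwing.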

I could start to persist some of the queue somewhere (disk / DB) and have a thread read from that storage area at a later point and retry. But again I'm going to have to write code to handle all of this, plus testing is going to be a bear and the edge cases are killer.

Asynchronous - don't unless you have to!
So here's the problem with asynchronous solutions - they are powerful but introduce HUGE complexities in testing, scalability, disaster recovery, failover and high availability.
Even if you get the code right the first time - you'd better automate all your scenarios to ensure any future "fix" doesn't regress on your edge case handling.

In general my advice is to avoid asynchronous solutions unless ABSOLUTELY necessary. And when you do use them, watch for the "tail wagging the dog" - when the solutions you design and architect to handle asynchronous processing cases start to overwhelm your normal design/architecture. Beware introducing unnecessary complexity - but with asynchronous processing much of this complexity is sadly necessary.

If you are reading this and are aware of any good solutions to these asynchronous problems please let me know. Thanks!

References
Some good references I use when designing asynchronous solutions (to make sure I'm not reinventing the wheel)