Communication errors with Amazon SWF - Ruby Flow
We are having an issue with the new Ruby Flow wrapper for Amazon SWF.
Several times an hour, workflow and activity workers are unable to
communicate correctly with the SWF server. This manifests in various ways:
- workflows or activities fail to register when workers for new versions are
  started
- workflow or activity workers crash
- activity workers finish a task and then get an error when reporting
  that they are done, so the entire execution fails
For worker crashes (either kind), we see the following:
andy@Andy-MBP:Crucible $ RAILS_ENV=development rake crucible:swf:ingress_wf_start
rake aborted!
execution expired
/Users/andy/.rvm/gems/ruby-1.9.3-p448@rails3/gems/aws-sdk-1.11.1/lib/aws/core/http/connection_pool.rb:301:in `start_session'
/Users/andy/.rvm/gems/ruby-1.9.3-p448@rails3/gems/aws-sdk-1.11.1/lib/aws/core/http/connection_pool.rb:125:in `session_for'
/Users/andy/.rvm/gems/ruby-1.9.3-p448@rails3/gems/aws-sdk-1.11.1/lib/aws/core/http/net_http_handler.rb:52:in `handle'
/Users/andy/.rvm/gems/ruby-1.9.3-p448@rails3/gems/aws-sdk-1.11.1/lib/aws/core/client.rb:238:in `block in make_sync_request'
When the failure is an activity worker being unable to report to the server
that a task has finished, the backtrace is very similar.
This doesn't seem to be an SWF issue per se (that is, it's not a timeout
on the activity execution); it looks like a Ruby HTTP communication issue.
There are similar questions on Stack Overflow about communicating with the
Twitter API. To be clear, no SWF timeout is expiring: the workflow has a
timeout of a day and each activity has a timeout of an hour, and the
failures occur well within those limits.
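For context, "execution expired" is the default message of Ruby's Timeout::Error, which as far as I can tell is what Net::HTTP on Ruby 1.9.3 raises when opening a connection takes too long. Below is a minimal sketch that reproduces the same message with no network or AWS involved; the method name and the durations are made up for illustration, and the sleep merely stands in for a slow connection attempt.

```ruby
require 'timeout'

# Hypothetical helper for illustration only: returns the message of the
# Timeout::Error raised when the block exceeds limit_seconds, or nil if
# the block finishes in time.
def timeout_message(limit_seconds)
  # The sleep stands in for a connection attempt that takes too long.
  Timeout.timeout(limit_seconds) { sleep(limit_seconds + 1) }
  nil
rescue Timeout::Error => e
  e.message
end

puts timeout_message(0.05)
```

This suggests the error comes from the client-side HTTP layer rather than from SWF's own workflow or activity timeouts.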
Unfortunately, it mostly works. I can usually start workflow executions, but
I get this sort of error frequently enough that we cannot finish anything
other than trivial jobs. The errors are random enough that troubleshooting
is extremely difficult.
We have reproduced this on different machines and from different networks.
We're still trying out SWF in development, so none of the failing
workers are running on EC2 instances.
Is there an underlying cause that I should investigate?
Is there a pattern or setting that will allow me to retry these
communications?
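To make the second question concrete, this is roughly the kind of retry pattern I have in mind. It is only a sketch under assumptions: with_retries is a hypothetical helper, not something provided by aws-sdk or the Flow framework, and the attempt count and delays are arbitrary. It retries a block when Timeout::Error is raised, backing off exponentially between attempts.

```ruby
require 'timeout'

# Hypothetical helper (not part of aws-sdk or the Flow framework): runs the
# block and retries it when one of the retry_on errors is raised, sleeping
# with exponential backoff between attempts. The last error is re-raised
# once max_attempts is reached.
def with_retries(max_attempts = 3, base_delay = 0.5, retry_on = [Timeout::Error])
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue *retry_on
    raise if attempt >= max_attempts
    # Delay doubles each time: base_delay, 2 * base_delay, 4 * base_delay, ...
    sleep(base_delay * (2**(attempt - 1)))
    retry
  end
end

# Usage: the block simulates a call that times out twice, then succeeds.
result = with_retries(4, 0.01) do |attempt|
  raise Timeout::Error, 'execution expired' if attempt < 3
  "reported on attempt #{attempt}"
end
puts result
```

I have not verified which SWF calls are safe to repeat, so wrapping something like a task-completion report this way assumes that sending it twice is harmless.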