Debugging a race condition in HTTP and TCP
About
I developed a terraform provider for the Anki Connect plugin so that I can author and manage my flash cards programatically using Terraform.
Hint: The Anki Connect is an Anki add-on that deploys a web API to interact with the data in Anki.
After a certain number of cards, the provider began experiencing intermittent connection errors:
╷
│ Error: Error reading note deck
│
│ with anki_note.what_is_the_inversion_after_63,
│ on chords.tf line 14, in resource "anki_note" "what_is_the_inversion_after_63":
│ 14: resource "anki_note" "what_is_the_inversion_after_63" {
│
│ get decks: failed to execute request: Post "http://localhost:8765": read tcp 127.0.0.1:49802->127.0.0.1:8765: read: connection reset by peer
│Critically thinking about logs
The Anki Connect plugin is the server, and it listens on port 8765.
The provider is the client, so when we read connection reset by peer, we
should consider that Anki Connect plugin is resetting the connection.
We should also consider the possibility of an intermediary (a system between the client and the server) is resetting the connection.
Tracing connections in the server
The Anki Connect plugin implements its own web server so I was able to trace
the close() method calls with some logs.
I asked Claude Code to implement the tracing and ran the server while I
executed terraform plan a few times.
I searched for the connection reset event in the traces and found that the connection was successfully closed.
[anki-connect] closing client peer=('127.0.0.1', 49802) readBuff=0 writeBuff=0 reason=response fully sentThis leaves the kernel as remaining culprit for resetting the connection!
Kernel culprit
Since I had a running conversation with Claude, I pasted this message into the chat.
[Provider error] is what i see on terraform's side, but anki connect says it closed it gracefully in the logs. [Anki connect traces]
I thought that I would get a fluff answer from as it tries to reason this contradiction, but I was surprised to learn it had such good knowledge of race conditions caused by HTTP!
Although I really appreciate the model's message, I understand it's very contextual to me so I'll avoid pasting it verbatim.
I'm sure we're all numb of reading responses to another person's prompt.
I'll try to paraphrase as best as I can.
Explanation
The client's requests are sent with a Connection: keep-alive header, which
hints to the server that the client would like to reuse the underlying TCP
connection on future requests.
The server closes the sockets manually but http response headers sent by the
server do not include a Connection: close! Therefore client can
potentially reuse a TCP connection that are set to close!
Normally a client would avoid sending data to a closed socket, but a race
condition can cause the kernel to receive the FIN signal to close the socket
after the client decided to re-use the socket!
When the data arrives to the closed socket (in FIN_WAIT_2 state), the kernel
answers a RST on behalf of the server and that is what we see in the logs.
Conclusion
In conclusion, the peer in the provider log is the kernel and not the client.
The fix was simply to include a Connection: close in the response headers:
return [
['HTTP/1.1 200 OK', None],
['Content-Type', 'application/json'],
['Access-Control-Allow-Origin', corsOrigin],
['Access-Control-Allow-Headers', '*'],
- ['Content-Length', str(len(body))]
+ ['Content-Length', str(len(body))],
+ ['Connection', 'close'],
]
I wonder if changing the first header to communicate HTTP/1.0 protocol would
also help?
🤔
The details escape me and it is a bit past my self-imposed but loosely respected bedtime!
Growth is for those who sleep, which I deeply regret losing to video games in my teenage years.