Heap corruption at silc client disconnect time (memory reuse after 'free')
Skywing
Skywing at valhallalegends.com
Thu Mar 27 16:34:07 CET 2008
Hi,
I've been having a heap corruption problem in the silc client (leads to random crashes in malloc_consolidate, typically). This has been a longstanding bug since the last several silc versions at least, and also seems to affect the silc irssi plugin in 1.1.4.
After trying unsuccessfully to get various debugging tools (valgrind/efence) to operate on amd64, I finally broke down and installed a copy of silc on a 32-bit system to run under valgrind continuously to see if it would trap anything. It did, after a server disconnect, as follows:
==1351== Invalid read of size 1
==1351== at 0x80E7824: silc_client_unref_client (silcatomic.h:793)
==1351== by 0x80F1675: silc_client_notify_signoff (client_notify.c:550)
==1351== by 0x8121293: silc_fsm_start_sync (silcfsm.c:429)
==1351== by 0x80E6185: silc_client_packet_receive (client.c:120)
==1351== by 0x810DA88: silc_packet_dispatch (silcpacket.c:2006)
==1351== by 0x810BB18: silc_packet_read_process (silcpacket.c:2317)
==1351== by 0x810E019: silc_packet_stream_io (silcpacket.c:438)
==1351== by 0x815D4A5: silc_socket_stream_io (silcunixsocketstream.c:40)
==1351== by 0x811AC4A: silc_schedule_dispatch_fd (silcschedule.c:61)
==1351== by 0x811AEE1: silc_schedule_iterate (silcschedule.c:455)
==1351== by 0x811A9B4: silc_schedule_one (silcschedule.c:484)
==1351== by 0x80E5E72: silc_client_run_one (client.c:1093)
==1351== Address 0x1BDAD6DC is 1020 bytes inside a block of size 1024 free'd
==1351== at 0x1B904B04: free (vg_replace_malloc.c:152)
==1351== by 0x80E95A8: silc_client_del_client (client_entry.c:1000)
==1351== by 0x80F1662: silc_client_notify_signoff (client_notify.c:549)
==1351== by 0x8121293: silc_fsm_start_sync (silcfsm.c:429)
==1351== by 0x80E6185: silc_client_packet_receive (client.c:120)
==1351== by 0x810DA88: silc_packet_dispatch (silcpacket.c:2006)
==1351== by 0x810BB18: silc_packet_read_process (silcpacket.c:2317)
==1351== by 0x810E019: silc_packet_stream_io (silcpacket.c:438)
==1351== by 0x815D4A5: silc_socket_stream_io (silcunixsocketstream.c:40)
==1351== by 0x811AC4A: silc_schedule_dispatch_fd (silcschedule.c:61)
==1351== by 0x811AEE1: silc_schedule_iterate (silcschedule.c:455)
==1351== by 0x811A9B4: silc_schedule_one (silcschedule.c:484)
==1351==
==1351== Invalid read of size 4
==1351== at 0x812AD84: silc_cipher_get_name (silccipher.c:378)
==1351== by 0x80E879A: silc_client_del_channel (client_entry.c:1725)
==1351== by 0x80E5340: silc_client_del_connection (client.c:706)
==1351== by 0x80E602D: silc_client_connection_destructor (client.c:40)
==1351== by 0x8120029: silc_fsm_finish_fsm (silcfsm.c:493)
==1351== by 0x811AD9A: silc_schedule_dispatch_timeout (silcschedule.c:114)
==1351== by 0x811AEBD: silc_schedule_iterate (silcschedule.c:460)
==1351== by 0x811A9B4: silc_schedule_one (silcschedule.c:484)
==1351== by 0x80E5E72: silc_client_run_one (client.c:1093)
==1351== by 0x80AE4C3: my_silc_scheduler_fd (silc-core.c:91)
==1351== by 0x1B976DBE: (within /usr/lib/libglib-2.0.so.0.600.4)
==1351== by 0x1B951581: (within /usr/lib/libglib-2.0.so.0.600.4)
==1351== Address 0x0 is not stack'd, malloc'd or (recently) free'd
==1351==
==1351== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==1351== Access not within mapped region at address 0x0
==1351== at 0x812AD84: silc_cipher_get_name (silccipher.c:378)
==1351== by 0x80E879A: silc_client_del_channel (client_entry.c:1725)
==1351== by 0x80E5340: silc_client_del_connection (client.c:706)
==1351== by 0x80E602D: silc_client_connection_destructor (client.c:40)
==1351== by 0x8120029: silc_fsm_finish_fsm (silcfsm.c:493)
==1351== by 0x811AD9A: silc_schedule_dispatch_timeout (silcschedule.c:114)
==1351== by 0x811AEBD: silc_schedule_iterate (silcschedule.c:460)
==1351== by 0x811A9B4: silc_schedule_one (silcschedule.c:484)
==1351== by 0x80E5E72: silc_client_run_one (client.c:1093)
==1351== by 0x80AE4C3: my_silc_scheduler_fd (silc-core.c:91)
==1351== by 0x1B976DBE: (within /usr/lib/libglib-2.0.so.0.600.4)
==1351== by 0x1B951581: (within /usr/lib/libglib-2.0.so.0.600.4)
==1351==
Seems to me like we're deleting a silc client object while there's still outstanding references to it.
This results in a strange null deref in silc_cipher_get_name that I had been previously patching in my local builds with a check for null, even though it didn't seem like the right "real" solution to the problem for some time (though not in the build I did on the x86 box):
Core was generated by `'.
Program terminated with signal 11, Segmentation fault.
#0 0x0812ad84 in silc_cipher_get_name (cipher=0x0) at silccipher.c:378
378 return (const char *)cipher->cipher->name;
(gdb) bt
#0 0x0812ad84 in silc_cipher_get_name (cipher=0x0) at silccipher.c:378
#1 0x080ed8bd in silc_client_del_channel_private_keys (client=0x1bb56b98, conn=0x1bc24908, channel=0x1bcb4470) at client_channel.c:629
#2 0x080e879b in silc_client_del_channel (client=0x1bb56b98, conn=0x1bc24908, channel=0x1bcb4470) at client_entry.c:1725
#3 0x080e5341 in silc_client_del_connection (client=0x1bb56b98, conn=0x1bc24908) at client.c:706
#4 0x080e602e in silc_client_connection_destructor (fsm=0x1bb98aec, fsm_context=0x0, destructor_context=0x1bb83a28) at client.c:40
#5 0x0812002a in silc_fsm_finish_fsm (schedule=0x1bb79858, app_context=0x1bb56b98, type=SILC_TASK_EXPIRE, fd=0, context=0x1bb98aec) at silcfsm.c:493
#6 0x0811ad9b in silc_schedule_dispatch_timeout (schedule=0x1bb79858, dispatch_all=0 '\0') at silcschedule.c:114
#7 0x0811aebe in silc_schedule_iterate (schedule=0x1bb79858, timeout_usecs=0) at silcschedule.c:460
#8 0x0811a9b5 in silc_schedule_one (schedule=0x1bb79858, timeout_usecs=0) at silcschedule.c:484
#9 0x080e5e73 in silc_client_run_one (client=0x1bb56b98) at client.c:1093
#10 0x080ae4c4 in my_silc_scheduler_fd (source=0x1bb3b200, condition=G_IO_IN, data=0x7) at silc-core.c:91
#11 0x1b976dbf in g_vasprintf () from /usr/lib/libglib-2.0.so.0
#12 0x1b951582 in g_main_depth () from /usr/lib/libglib-2.0.so.0
#13 0x1b9525f8 in g_main_context_dispatch () from /usr/lib/libglib-2.0.so.0
#14 0x1b952930 in g_main_context_dispatch () from /usr/lib/libglib-2.0.so.0
#15 0x1b952b7d in g_main_context_iteration () from /usr/lib/libglib-2.0.so.0
#16 0x080843d3 in main (argc=134931616, argv=0x1bb78740) at silc.c:376
(gdb) x/1i $pc
0x812ad84 <silc_cipher_get_name+4>: mov (%ecx),%edx
(gdb) inf reg
eax 0x0 0
ecx 0x0 0
edx 0x1bc15498 465654936
ebx 0x0 0
esp 0x52bfe05c 0x52bfe05c
ebp 0x52bfe078 0x52bfe078
esi 0x1bcb4470 466306160
edi 0x1bc24908 465717512
eip 0x812ad84 0x812ad84 <silc_cipher_get_name+4>
eflags 0x246 [ PF ZF IF ]
cs 0x0 0
ss 0x0 0
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x33 51
(gdb)
It would now seem to me that the likely underlying cause of this might that we're using a deleted silc client object, which is in a torn-down state (being freed to free() ) and thus not in any way suitable for use.
(In this case, it would look like the "channel" struct is still mostly valid, but the silc client object would seem to have been already used after it's been deleted at this point from the prior valgrind complaints: )
(gdb) frame 1
#1 0x080ed8bd in silc_client_del_channel_private_keys (client=0x1bb56b98, conn=0x1bc24908, channel=0x1bcb4470) at client_channel.c:629
629 channel->cipher = silc_cipher_get_name(channel->internal.send_key);
(gdb) list
624 silc_hmac_free(entry->hmac);
625 silc_free(entry);
626 }
627
628 channel->internal.curr_key = NULL;
629 channel->cipher = silc_cipher_get_name(channel->internal.send_key);
630 channel->hmac = silc_hmac_get_name(channel->internal.hmac);
631
632 silc_dlist_uninit(channel->internal.private_keys);
633 channel->internal.private_keys = NULL;
(gdb) print channel;
Invalid character ';' in expression.
(gdb) print *channel
$1 = {channel_name = 0x1bccc6a8 "<channel name>", server = '\0' <repeats 256 times>, topic = 0x0, founder_key = 0x0, channel_pubkeys = 0x0, id = {ip = {data = "ÐnF\214", '\0' <repeats 11 times>,
data_len = 4 '\004'}, port = 49666, rnd = 43822}, mode = 519, user_limit = 0, user_list = 0x1bbe5f28, cipher = 0x1bb56d98 "aes-256-cbc", hmac = 0x1bb56d98 "aes-256-cbc", context = 0x0, internal = {
lock = 0x1bbd5870, old_channel_keys = 0x0, old_hmacs = 0x0, private_keys = 0x1bc15498, curr_key = 0x0, send_key = 0x0, receive_key = 0x0, hmac = 0x0, iv = '\0' <repeats 15 times>, resolve_cmd_ident = 0,
refcnt = {value = 0}}}
(gdb)
(Channel name intentionally changed to "<channel name>" for the mailinglist posting, though it was a valid string for a channel that client was joined to.)
Any ideas on this, before I try and take a crack at it myself? (Unfortunately, not generally familiar with the silc code base myself. I do have a core file + binary describing this crash (this time, using the 1.1.4 standalone silc client, not the irssi plugin, though I have observed similar failures with the irssi plugin as well and have no reason to believe it would be specific to the standalone client).
It is fairly easily repro'able by having a silc client run under valgrind and having the server go away out from under it (i.e. network failure) - sometimes, silc will trip valgrind by touching free'd memory, and sometimes more, it will crash out with this null deref. In this particular instance, the channel in question had a peer to peer key (e.g. /key channel set peer-to-peer-key-string), but I am not certain that this matters without looking at the code more.)
I've observed silc regularly segfaulting due to the heap corruption that seems to result from a disconnect+reconnect, so getting this fixed would definitely be a Good Thing for me. Not to mention that memory reuse after free bugs have historically been probable candidates for exploits, depending on the circumstance (haven't looked into this in depth enough to know whether that is a specific concern here, but it's certainly not a good thing).
N.B. Observed this problem on several boxes, so I'm quite sure it's not a random hardware problem (bad ram or the like).
- S
More information about the silc-devel
mailing list