Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible HCI.Poll() bug with _recvIndex and _recvBuffer #73

Open
mattleesmi opened this issue May 3, 2020 · 8 comments
Open

Possible HCI.Poll() bug with _recvIndex and _recvBuffer #73

mattleesmi opened this issue May 3, 2020 · 8 comments
Labels
status: waiting for information More information must be provided before work can proceed type: imperfection Perceived defect in any part of project

Comments

@mattleesmi
Copy link

Hello,

I came across an occasional crash while using BLE/HCI.poll(). The crash occurs when _recvIndex goes above the size of _recvBuffer. The crash in my tests occurs when _recvIndex = 533.

I explained how I came across it in this Arduino forum post: https://forum.arduino.cc/index.php?topic=680797.0 but the long and the short of it is that it seems to happen after receiving an HCI_ACLDATA_PKT (possibly without being connected to anything) which leads to _recvIndex++ over and over without being caught and reset by the if statements and reset (not sure why). The basic fix I currently have running in HCI.cpp within the while (HCITransport.available()) loop is:

if (_recvIndex > 257) { _recvIndex = 0; if (_debug) { _debug->println(); _debug->println("***Overflow Catch***"); _debug->println(); } }

Not groundbreaking but seems to be stopping the crash.

The "overflow catch" is still occurring in my tests, however it may be something else. Please let me know if more information would help or if I am missing something obvious.

Thanks,
Matt

@polldo
Copy link
Contributor

polldo commented Jul 29, 2020

Hi @mattleesmi ,
Are you using a nano 33 ble?
This issue is similar to this #102
If too much time passes between calls to BLE.poll() function, it can happen that some bytes are dropped. This can lead to unpredictable results: for example, as in your case, the length of a wrong packet (with byte drop) could be larger than the _recvBuffer size and this would lead to the overflow you have experienced.

@polldo polldo added the status: waiting for information More information must be provided before work can proceed label Jul 29, 2020
@mattleesmi
Copy link
Author

Hello @polldo

Yes, I am using a BLE and IoT, and it does seem similar to #102. I have stepped away from the work using the boards for a few months now but I hope to return to it soon so I can look this further as well as the #70 issue I have been encountering.

I will report back when I can!

Thank you for the input.
Matt

@mattleesmi
Copy link
Author

Hello,

Sorry for the very late update, I was running more tests on this and I am still getting crashes with the latest version with the same issues where _recvIndex is going above 258 and crashing at 533.

I am testing a catch if statement under the while loop like this

    while (HCITransport.available()) {
        byte b = HCITransport.read();
        _debug->print("_recvIndex: "), _debug->println(_recvIndex);
        if(_recvIndex == 258) _recvIndex = 0;

It seems that somehow the else statement of void HCIClass::poll is being skipped somehow.

Thanks,
Matt

@polldo
Copy link
Contributor

polldo commented Jun 23, 2021

Hi @mattleesmi
do you have this problem on the nano 33 ble only?
If so, please check if this patch solves your issue #96

@mattleesmi
Copy link
Author

Hey @polldo

So far I have only had this problem on the IoT's, as far as I can see my little tweak is working at the moment. Is that patch of IoT's too? If so how do I integrate it?

Thanks
Matt

@polldo
Copy link
Contributor

polldo commented Jun 23, 2021

No, it's a nano33ble specific patch.
Anyway, I took a look at your sketch. My suspicion is that too much time is spent between BLE updates, so the controller's queue becomes full and some packets are dropped.

The BLE should be continuously updated, by executing BLE.poll(). Try to execute this poll statement at each loop, so for example change your code into:

void loop() {
      BLE.poll();
      ....
}

@mattleesmi
Copy link
Author

mattleesmi commented Jun 23, 2021

With my current code setup it is running about 140 polls per second (have been tracking it), is that too little?

edit: I took a look again as I had switch it off for a bit, it still most gets around 140 or more although sometimes it dips to quite low numbers

@mattleesmi
Copy link
Author

mattleesmi commented Nov 4, 2021

Small update:

I was trying things out again and I found myself in a very noisy Bluetooth room that was causing very frequent crashes even with quite a lot of polls per second.

I did a bit more digging in HCI.cpp and I found that one of the main causes what the poll was picking up ACL packets that were too big (this may be linked to the other stuff that has been mentioned with regards to the buffer not being read enough)

So from what I have managed to understand is that this piece of code:
if ((_recvIndex > 5) && _recvIndex >= (5 + (_recvBuffer[3] + (_recvBuffer[4] << 8))))
is being used to determine if there is enough of an ACL packet to read and what the size of that packet is. However when I started to read the size of the packets some of them were way too big for the array. What was happening was that (5 + (_recvBuffer[3] + (_recvBuffer[4] << 8)))) was really big, but _recvBuffer[0] was still equal to HCI_ACLDATA_PKT so the index just goes up and up until it crashes.

So I put a separate if statement that looks like this:

if (_recvIndex > 5 && ((5 + (_recvBuffer[3] + (_recvBuffer[4] << 8))) > (sizeof(_recvBuffer) - 1))) {
                _recvIndex = 0;
                if (_debug) {
                    _debug->print("ACL too big: "), _debug->println((5 + (_recvBuffer[3] + (_recvBuffer[4] << 8))));
                }
            }

With this, I would get results like: ACL too big: 21920. As you can see I tried to catch this and reset the index. This seems to work but there are still crashes. Anecdotally these crashes seem to happen when the code tries to handle the packet but I am not sure.

A similar thing did occur with RX events but it was less frequent, although sometimes I did get large RX events like this:

HCI EVENT RX <- 043EFA827339E132A4043EB1964DD52DA2043EB1964DD59E043E2EB1964DD52DA3043E4DD52D2728A3043E2A02010001643E983370D71E0201061AFF4C000215F0910DA64FA24E988024BC5B71E0893E0002017EB8B3043E2202010401643E983370D71607094C4453313334020AF80A160DD03275466D343151B3043E28020102019DBE2EF627551C03036FFD17166FFD75046B93FD537784D430D0F1B2FA827339E13246A2043E0C020104019DBE2EF6275500A0043E2D2728DDA1043E2A02010001643E983370D71E0201061AFF4C000215F0910DA64FA24E988024BC5B71E0893E0002017EB8B3043E2202010401643E983370D71607094C445331 Size:253

Now I might be wrong but that feels like several events mashed up into one as most of the time they are between 45-150 characters and there is a hard limit of 246 (or thereabouts) in the reading capacity of the BLE #70

Is it possible that the if statements at the moment are not flexible enough to deal with data that is coming in "wrong"?

I have uploaded the full HCI.poll that I have been playing with but I am sure that the issue is deeper than this.

If this is of interest I can try to use the extra catches and see what it looks like when it still crashes.

HCI_Poll_Test.txt

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status: waiting for information More information must be provided before work can proceed type: imperfection Perceived defect in any part of project
Projects
None yet
Development

No branches or pull requests

3 participants