Any logic analyzer with a software component will let you extract the data. Here's what the 'data dump' looks like from the Saleae software:

You can open up this data in any spreadsheet program. You'll see a long list with a Sample column (sample time) and a Channel 0 (data) column. There's only one data column and we extracted only the transitions, so you'll see alternating 1's and 0's only.

The first sample is the pre-trigger (-1200000 us before trigger); you can just ignore that.

Afterwards, you see alternating 1's and 0's about 105 'somethings' apart. You might at first think it's maybe ms or microseconds, but it's not; it's actually the sample number, based on the sample rate. You need to know that the rate we sampled at here is 12MHz, so each sample point is 0.083 us. Doing the math, the period between a 0, 1 and back to zero transition is ~210 samples. The period is 210 * 0.083us = 17.5us, which is the same as ~57KHz. So the first burst is 57KHz modulated.
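Here's that same by-hand arithmetic as a quick Python sketch. The variable names are ours, picked just for illustration, and the 210-sample figure is the eyeballed value from the CSV:

```python
SAMPLERATE = 12_000_000    # samples per second (12 MHz, as captured here)
samples_per_cycle = 210    # one high + one low, read off the CSV by eye

sample_time_us = 1_000_000 / SAMPLERATE          # microseconds per sample
period_us = samples_per_cycle * sample_time_us   # microseconds per full cycle
freq_khz = 1000 / period_us                      # cycles per millisecond = kHz

print("%0.3f us/sample, %0.1f us period, %0.1f kHz"
      % (sample_time_us, period_us, freq_khz))
```

Swap in a different sample count or sample rate to check other bursts the same way.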

We could go thru the entire 600,000 point CSV file but of course that would be tedious! Let's use python instead.

Jupyter to the Rescue

Our new favorite way to manage data with Python is to use Jupyter (also sometimes referred to as a Python notebook). Jupyter is free and lets you do data analysis with ease. I personally like that data is managed in chunks, so you can read in all the data in one chunk, then do math in other chunks, rather than re-running the whooooole thing over and over.

Here's our notebook, you can load it with any Jupyter install you've got

Block #1

Let's start with the first block, where we read in the dataset:

Here, we open the 'raw tvbgone.csv' file as 'r'eadable text, then read the first two lines and toss them. Then we read each line, split the CSV into a list, and append that list to one big-ass list called dataset. At the end, we check: did we read the right number?

Yep, the last line is 603966 and we tossed the first line (text header) and the first datapoint (the -1200000 pre-trigger marker), so 603964 is correct.
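The notebook's own read-in code isn't reproduced on this page, so here is a minimal sketch of what that step could look like. The helper name read_dataset is hypothetical. We're also assuming each row is "sample, value" with integer fields, and the little in-memory CSV is made-up stand-in data so the example runs without the real file:

```python
import io

def read_dataset(f):
    """Read [sample, value] rows from an open CSV text file."""
    f.readline()   # toss the text header
    f.readline()   # toss the pre-trigger marker row
    dataset = []
    for line in f:
        line = line.strip()
        if not line:
            continue                       # skip any blank trailing lines
        sample, value = line.split(",")
        dataset.append([int(sample), int(value)])
    return dataset

# Made-up stand-in for the real export, just to exercise the function
demo = io.StringIO("Sample, Channel 0\n-1200000, 0\n0, 1\n105, 0\n210, 1\n")
dataset = read_dataset(demo)
print(len(dataset))
```

With the real capture you'd pass in open('raw tvbgone.csv', 'r') instead of the StringIO object.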

Block #2

OK, this block is where we do all the work, so we'll chunk it up into pieces.

In this code, we define our sample rate (12 MHz is a common rate), then loop thru the dataset, stepping through it two points at a time. hi_p is the 'high pulse', the point where the signal goes to logic 1. lo_p is the 'low pulse', the point where it drops back to logic 0. We also make a hi2_p, which is the next pulse that goes high after the low pulse. We will fake this if we're at the end of the dataset; otherwise, we just take the next point.

SAMPLERATE = 12000000       # 12 Mhz default

unusual_codes = []          # These are manchester coded or otherwise non-standard!

frequency_pairs = []
pulse_points = []

# This loop eats up two points at a time (but peeks at the third)
# to calculate the high and low pulse lengths. As a pair, it determines
# the frequency (usually 38KHz - 57KHz) and stores the freq in pulse_points
# until it gets to a long low pulse (e.g. between bits or signals). It then
# checks that the pulses so far are all the same frequency, and compresses 
# them into a triplet of the frequency, the amount of time that freq is emitted
# and the amount of time the signal is 0 into frequency_pairs
for p in range(0, len(dataset), 2):  # take points two at a time
    hi_p = dataset[p]
    lo_p = dataset[p+1]
    if (p+2) == len(dataset):
        # we make a fake final pulse
        hi2_p = [lo_p[0] + 100000, 1]
    else:
        # otherwise, just take the next point
        hi2_p = dataset[p+2]

Now we do a quick sanity check: the high pulses should have value '1' and the low pulse should have value '0'. If somehow that's not the case, we bail.

Then, we take the actual length of time in samples of the high and low pulses, by taking the differences (deltas) between the pulse's timecode and the next one. Once we have the deltas, add them to make one cycle, and divide by the samplerate to convert to seconds, then invert to get the frequency of those two pulses. This is basically the stuff we did by hand at the top of this page, but now it's done in code.

    if (hi_p[1] != 1) or (lo_p[1] != 0) or (hi2_p[1] != 1):
        print("Error in matching pulse polarity")
        break   # bail out of the loop
    delta_high = lo_p[0] - hi_p[0]  # length of high pulse
    delta_low = hi2_p[0] - lo_p[0]  # length of low pulse
    pulse_period = (delta_high + delta_low) / SAMPLERATE
    pulse_freq = 1 / pulse_period
    #print("%d, %d -> %0.2f" % (delta_high, delta_low, pulse_freq))

Now we've got a pulse of on/off light. Check at the bottom of this block and we have this section after our special-case checks:

    # otherwise, add this pulse point
    pulse_points.append(pulse_freq)

That is, assuming nothing special, we'll append the frequency reading we just made to a list for later handling. We'll do this 99% of the time, calculating the frequency of a pair of pulses, then appending until....

Now we come back to the special cases at the top of the if statement. If the low pulse is over 30 times longer than the high pulse, we're probably at the end of a pillar of modulated signal (we picked 30 arbitrarily). Let's check if we have anything stored in pulse_points. If not, it means we had a single blip of light, which is super weird (but did happen to us), so we store it in unusual_codes.

Otherwise, let's figure out what happened in this 'pillar' of pulses. We calculate avg_freq, which is just the plain 'mean' average. Then we check that all the pulses are within 10% (over 0.9x and under 1.1x the average). If any pulse falls outside that range, we flag it and keep going. This did happen a few times; we just dropped these points.

Finally, if all the frequency-pulses in a pillar are within our exacting standards, we simplify them all down to a 3-part list. The list contains the average frequency, the length of time the pulses were active, and then that long delta_low pulse converted to seconds.

We then loop around and keep going to the next 'pillar'

    if 30*delta_high < delta_low: # e.g. the last pulse
        if not pulse_points:
            print("#%d: %d, %d -> %0.2f" % (p, delta_high, delta_low, pulse_freq))
            print("Found an unusual pulse, storing for later")
            unusual_codes.append([p, delta_high, delta_low])
            continue # nothing to average, go to next pair of pulses
        # Lets get the avg frequency of all the pulse_points (they do have some slight variation)
        avg_freq = sum(pulse_points) / len(pulse_points)
        if not all([ 0.9*avg_freq<i<1.1*avg_freq for i in pulse_points ]):
            print("#%d: %d, %d -> %0.2f" % (p, delta_high, delta_low, pulse_freq))
            print("Found an unusual code, storing for later")
            pulse_points = []
            continue # drop this pillar, go to next pair of pulses
        # we'll just store the frequency, and the length of time on, then the length of time off
        # We add one pulse for the 'final' pair we're on now
        frequency_pairs.append([avg_freq, 1/avg_freq * (len(pulse_points)+1), delta_low / SAMPLERATE])
        pulse_points = []
        continue # go to next pair of pulses
    # otherwise, add this pulse point
    pulse_points.append(pulse_freq)
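Since Block #2 is shown in pieces, here is a condensed, self-contained sketch of the same idea that you can run without the capture file. It is not the notebook's exact code: the function names compress_bursts and make_burst are ours, the input is synthetic (two made-up bursts at 105 samples high / 105 samples low), and the polarity check raises an error instead of printing.

```python
SAMPLERATE = 12000000   # 12 MHz, matching the capture described above

def compress_bursts(dataset, samplerate=SAMPLERATE):
    """Turn [sample, level] transitions into [freq, on_s, off_s] triplets."""
    frequency_pairs, unusual_codes, pulse_points = [], [], []
    # len - 1 guards against a dangling high point with no matching low
    for p in range(0, len(dataset) - 1, 2):
        hi_p, lo_p = dataset[p], dataset[p + 1]
        if p + 2 == len(dataset):
            hi2_p = [lo_p[0] + 100000, 1]        # fake final pulse
        else:
            hi2_p = dataset[p + 2]
        if hi_p[1] != 1 or lo_p[1] != 0 or hi2_p[1] != 1:
            raise ValueError("Error in matching pulse polarity")
        delta_high = lo_p[0] - hi_p[0]           # samples spent high
        delta_low = hi2_p[0] - lo_p[0]           # samples spent low
        pulse_freq = samplerate / (delta_high + delta_low)
        if 30 * delta_high < delta_low:          # long gap: end of a pillar
            if not pulse_points:                 # lone blip of light
                unusual_codes.append([p, delta_high, delta_low])
                continue
            avg_freq = sum(pulse_points) / len(pulse_points)
            if not all(0.9 * avg_freq < f < 1.1 * avg_freq for f in pulse_points):
                pulse_points = []                # inconsistent pillar, drop it
                continue
            on_time = (len(pulse_points) + 1) / avg_freq   # +1 for this pair
            frequency_pairs.append([avg_freq, on_time, delta_low / samplerate])
            pulse_points = []
            continue
        pulse_points.append(pulse_freq)
    return frequency_pairs, unusual_codes

def make_burst(start, cycles, half=105):
    """Synthetic data: `cycles` pulses, each `half` samples high then low."""
    points = []
    for n in range(cycles):
        points.append([start + 2 * half * n, 1])
        points.append([start + 2 * half * n + half, 0])
    return points

dataset = make_burst(0, 10) + make_burst(50000, 5)
pairs, unusual = compress_bursts(dataset)
for freq, on_s, off_s in pairs:
    print("%0.1f Hz, on %0.1f us, off %0.2f ms" % (freq, on_s * 1e6, off_s * 1e3))
```

Wrapping the loop in a function makes it easy to try on small hand-made inputs like this before turning it loose on all 600,000 points.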

Block #3

OK, so far we've taken all the sub-modulated 1/0's and converted them to frequencies with on/off durations. In theory that's all we need to fully duplicate the TV-B-Gone, but it would be a huge amount of data and hard to manage. What we'll do now is group all the pulses within a chain for an emitted code. Usually 10-30 are in a row, and they look like this (it's common to have one big burst in the beginning to 'get the attention' of the TV):

In order to know when a code is done, we'll look back at the logic analyzer data. Just from scanning the data it seems like a lot of codes are 'repeated' about 65ms apart:

And then there is a 0.25 second delay between code-types:

We want to keep the 'duplicated' codes together (we'll deal with 'compressing' them later), so as an 'Intra Code Delay' we'll pick 0.2 seconds.

In block #3, we kinda do the same thing we did in block #2, but instead of individual light pulses, we'll group together modulated light chunks:

# given the high frequency pairs, group them together by frequency and before a long (10ms?) pulse
all_codes = []
code = []
INTRA_CODE_DELAY = 0.2   # in seconds
for f in frequency_pairs:
    freq, high, low = f
    #print("%0.2f %0.2f @ %0.1f" % (high * 1000, low * 1000, freq))
    code.append(f)   # add this on/off pair to the code we're building
    if low > INTRA_CODE_DELAY:
        code_freqs = [p[0] for p in code]
        avg_freq = sum(code_freqs) / len(code_freqs)
        if not all([ 0.9*avg_freq<i<1.1*avg_freq for i in code_freqs ]):
            print("Got an aberrant frequency, bailing!")
            code = []
            continue   # skip this code and start collecting the next one
        only_pulses = [[p[1], p[2]] for p in code]
        all_codes.append({'freq':avg_freq, 'pulses':only_pulses})
        code = []

print("Decoded: ", len(all_codes))

For each on/off pair, we add it to our list called code. We keep going until the 'off' half of a pair is longer than that 0.2 seconds, in which case we'll assume all the pairs till now are grouped together. We take the average modulation frequency of all the pairs and verify all are within 10%. Once we know they're all the same frequency, we don't have to save that part anymore, so only_pulses contains only the on/off timings. We then put those pulses in a dictionary that has the overall modulation frequency and the pulses, save it to all_codes, and continue until we've finished processing all the on/off pairs.

According to our script, we've got 207 codes, which means about 207 different brands/models of TVs.

If we ask Python to print out the first code with print(all_codes[0]) we'll get this:

{'freq': 56697.911251837904, 'pulses': [[0.003968534673961258, 0.003993666666666667], [0.0004937998948659543, 0.0020033333333333335], [0.0004937998948659543, 0.00200325], [0.0004938864353312305, 0.00200325], [0.0004938864353312304, 0.0020033333333333335], [0.0004937998948659543, 0.0010096666666666667], [0.0004938864353312304, 0.00100975], [0.0004937998948659543, 0.00200325], [0.0004938864353312304, 0.00100975], [0.0004937998948659543, 0.0020033333333333335], [0.0004937998948659543, 0.0010096666666666667], [0.0004938864353312304, 0.0020033333333333335], [0.0004937998948659543, 0.0010096666666666667], [0.0004938864353312304, 0.00100975], [0.0004937998948659543, 0.0010096666666666667], [0.0004938864353312304, 0.00100975], [0.0004937998948659543, 0.0010096666666666667], [0.0004938864353312304, 0.0020033333333333335], [0.0004937998948659543, 0.00200325], [0.0004938864353312305, 0.00100975], [0.0004937998948659543, 0.00200325], [0.0004937998948659543, 0.0010096666666666667], [0.0004938864353312304, 0.0020033333333333335], [0.0004937998948659543, 0.00100975], [0.0004938864353312305, 0.00200325], [0.0004938864353312304, 0.007964666666666667], [0.003968450847028007, 0.003993666666666667], [0.0004937998948659543, 0.00200325], [0.0004937998948659543, 0.00200325], [0.0004938864353312304, 0.00200325], [0.0004938864353312304, 0.0020033333333333335], [0.0004937998948659543, 0.0010096666666666667], [0.0004938864353312305, 0.00100975], [0.0004937998948659543, 0.00200325], [0.0004938864353312305, 0.00100975], [0.0004937998948659543, 0.00200325], [0.0004938864353312305, 0.0010096666666666667], [0.0004938864353312304, 0.0020033333333333335], [0.0004937998948659543, 0.0010096666666666667], [0.0004938864353312304, 0.00100975], [0.0004937998948659543, 0.0010096666666666667], [0.0004938864353312304, 0.00100975], [0.0004938864353312305, 0.0010096666666666667], [0.0004938864353312304, 0.0020033333333333335], [0.0004937998948659543, 0.00200325], [0.0004938864353312305, 
0.0010096666666666667], [0.0004938864353312304, 0.0020033333333333335], [0.0004937998948659543, 0.0010096666666666667], [0.0004938864353312304, 0.0020033333333333335], [0.0004937998948659543, 0.0010096666666666667], [0.0004938864353312304, 0.0020033333333333335], [0.0004937998948659543, 0.21241525]]}

Thanks to the precision of floating point numbers this is very wordy. Starting at the beginning, the dictionary item has {'freq': 56697.911251837904 which implies that the average frequency of this code is about 56.7KHz. If we look at the logic analyzer, we see that this is correct (each pulse has slight variation)

Zooming out, the first pillar starts at 0ms and ends at about 4ms, then is off for about 4 ms. Then the next pillar of pulses starts at about 7.9ms and ends at 8.3ms (so about 0.4ms long).

That corresponds to the first few entries in our pulses list:

'pulses': [[0.003968534673961258, 0.003993666666666667], [0.0004937998948659543, 0.0020033333333333335]...

Note that all the times in the list are in seconds. Python has double-precision floats and we're not worried about running out of memory, so doubles are fine for storage. Anyhow, it's always good to check what your parser puts out, compared to the raw data in the logic analyzer!
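One way to do that cross-check without squinting at the screen is to add up the on/off times and see where each burst should start and end. This is just a sketch, using the first two entries copied from the printout above. The variable names are ours:

```python
# First two [on, off] entries copied from the all_codes[0] printout (seconds)
pulses = [[0.003968534673961258, 0.003993666666666667],
          [0.0004937998948659543, 0.0020033333333333335]]

t = 0.0
timeline = []                        # (start_ms, end_ms) of each burst
for on, off in pulses:
    timeline.append((t * 1000, (t + on) * 1000))
    t += on + off                    # the next burst starts after the off time

for start_ms, end_ms in timeline:
    print("burst from %0.2f ms to %0.2f ms" % (start_ms, end_ms))
```

The second burst's computed start should land close to the roughly 7.9ms mark read off the analyzer. If your numbers are way off, something went wrong in the parsing.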

Let's continue!

Block #4

Now we've got all our codes in a nice dictionary format, with the frequency and on/off pulses stored away. We're going to keep making improvements to the formatting. Why? Well, for one, we want to compress the data a little so we can fit it on a Gemma. As is, the output of all_codes is 384KB.

That will work on a Circuit Playground Express or other Express boards. But we wanted to make it fit in a Gemma M0 for a super-compact project, and that would require the whole source code to take less than about 40KB. So, time for compression!

First up, those double-precision floats take up a lot more space ASCII-wise than if we just converted to microseconds, which will keep each entry at about 2-4 digits rather than the 6+ we have now:

int_codes = []
for code in all_codes:
    # convert to integers and make a dictionary
    int_code = {'freq':int(code['freq'])}
    pulses = []
    for p in range(len(code['pulses'])):
        pulses.append(int(code['pulses'][p][0] * 1000000))   # convert to us
        pulses.append(int(code['pulses'][p][1] * 1000000))   # convert to us
    if len(pulses) % 2 == 0:
        x = pulses.pop()
        int_code['delay'] = x / 1000000  # convert to s
    int_code['pulses'] = pulses

We also make the dictionary object for the codes a little more comprehensive. To start, the frequency is converted to an integer (we really don't need to have more than 3 digits of precision for the frequency, so even this is overkill!). Then we go thru each pulse and multiply by 10^6 to convert it to microseconds. The very last entry, which is the final 'off' pulse, is removed, renamed 'delay', and re-converted to seconds.

Next, remember we mentioned a lot of codes are repeated? That gives you a better chance of hitting the TV. So, the remainder of this block divides the pulses in half, then compares each on/off timing entry to verify it's 'similar':

    # lets see if we can cut it in half and compare both halves
    half = len(int_code['pulses']) // 2
    left_half = int_code['pulses'][0:half]
    repeat_delay = int_code['pulses'][half]
    right_half = int_code['pulses'][half+1:]
    equiv = True
    for i in range(len(left_half)):
        if not similar(left_half[i], right_half[i]):
            equiv = False
    if equiv:
        # many/most codes repeat twice!
        int_code['repeat'] = 2
        int_code['repeat_delay'] = repeat_delay / 1000000 # convert to seconds
        int_code['pulses'] = left_half
    else:
        #print("NOT REPEAT!")
        pass
    int_codes.append(int_code)   # save the (possibly halved) code

The middle 'off' pulse is the repeat delay, usually about 100ms (0.1 seconds). We have a helper function that checks if two values are within 5% of each other. Since the timings vary slightly, we will accept that much variation to consider both 'halves' equivalent.

def similar(a, b, percent=0.05):
  return (abs(1.0 - a / b) < percent)

In theory we could check if the codes repeat 3 or 4 times instead of 2, but from scanning thru the data we could tell it was pretty much either once or twice per code.
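To see the half-and-compare trick in isolation, here's a small sketch. The similar helper is the one above. The split_repeat wrapper and the two short pulse lists are ours, made up for illustration, and the sketch assumes a non-empty list with no zero-length timings:

```python
def similar(a, b, percent=0.05):
    return abs(1.0 - a / b) < percent

def split_repeat(pulses):
    """Return (one_half, repeat_delay_us) if the code repeats, else (pulses, None)."""
    half = len(pulses) // 2
    left = pulses[:half]
    middle = pulses[half]            # the 'off' gap between the two copies
    right = pulses[half + 1:]
    if len(left) == len(right) and all(similar(l, r) for l, r in zip(left, right)):
        return left, middle
    return pulses, None

# Made-up timings in microseconds; the second copy wobbles by a microsecond or two
once = [3968, 3993, 493, 2003, 493, 1009, 493]
twice = once + [7964] + [3970, 3991, 494, 2001, 493, 1010, 492]

print(split_repeat(twice))   # repeating code: one half plus the gap
print(split_repeat(once))    # non-repeating code: returned untouched
```

Returning the gap alongside the half keeps everything needed to rebuild the repeat later.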

Outputting all the int_codes, we see they now look like this (the first code)

{'freq': 56697, 'delay': 0.212415, 'pulses': [3968, 3993, 493, 2003, 493, 2003, 493, 2003, 493, 2003, 493, 1009, 493, 1009, 493, 2003, 493, 1009, 493, 2003, 493, 1009, 493, 2003, 493, 1009, 493, 1009, 493, 1009, 493, 1009, 493, 1009, 493, 2003, 493, 2003, 493, 1009, 493, 2003, 493, 1009, 493, 2003, 493, 1009, 493, 2003, 493], 'repeat': 2, 'repeat_delay': 0.007964}

Which is way more compact than the previous floating point and non-repeat-optimized version. Our entire text file of codes is now 82 KB compared to the previous 382KB - a very nice compression that 'cost' us nothing.

Block #5

Buuuut....82KB is still too big, we need it to be less than half that. Let's look at more ways to compress the data. Looking at the first code:

{'freq': 56697, 'delay': 0.212415, 'pulses': [3968, 3993, 493, 2003, 493, 2003, 493, 2003, 493, 2003, 493, 1009, 493, 1009, 493, 2003, 493, 1009, 493, 2003, 493, 1009, 493, 2003, 493, 1009, 493, 1009, 493, 1009, 493, 1009, 493, 1009, 493, 2003, 493, 2003, 493, 1009, 493, 2003, 493, 1009, 493, 2003, 493, 1009, 493, 2003, 493], 'repeat': 2, 'repeat_delay': 0.007964}

We see some patterns. The numbers 493, 1009, and 2003 show up a lot. In fact, they're nearly all of the timing points! That's not too surprising: nearly all infrared remotes send data that is encoded as 0's and 1's, and they do so with different length pulse pairs. In this code, there are 3 distinct pulse pairs:

  1. 3968, 3993 - This is the initial 'attention' pulse, about 4000us on and 4000us off
  2. 493, 2003 - about 500us on, 2000us off, will be decoded as a zero or one
  3. 493, 1009 - about 500us on, 1000us off, will be decoded as the opposite of the pulse above

Instead of just repeating those full values over and over, let's 'compress' the pairs by just having a single-digit number for each pair. That's what we'll do in the next block:

paired_codes = []

for c in int_codes:
    pair_table = []
    pair_lookup = []
    for p in range(0, len(c['pulses']), 2):
        pair = (c['pulses'][p:p+2])
        if len(pair) == 1:               # for the last entry, which is solitary
            for pairs in pair_table:     # match it up with the first pair we find
                if pair[0] == pairs[0]:  # where the first pulse matches
                    pair.append(pairs[1])# (put in a false 'off' pulse)
                    break                # stop at the first match
        if not pair in pair_table:
            pair_table.append(pair)      # first time we've seen this pair
        pair_lookup.append(pair_table.index(pair))
    p_code = {'freq': c['freq'], 'delay': c['delay']}
    try:
        p_code['repeat'] = c['repeat']
        p_code['repeat_delay'] = c['repeat_delay']
    except KeyError:
        pass                             # this code does not repeat
    p_code['table'] = pair_table
    p_code['index'] = pair_lookup
    paired_codes.append(p_code)
Once it completes, you'll see comparisons of the pre-tableified and post-tableified codes like so:

{'freq': 56697, 'delay': 0.212415, 'pulses': [3968, 3993, 493, 2003, 493, 2003, 493, 2003, 493, 2003, 493, 1009, 493, 1009, 493, 2003, 493, 1009, 493, 2003, 493, 1009, 493, 2003, 493, 1009, 493, 1009, 493, 1009, 493, 1009, 493, 1009, 493, 2003, 493, 2003, 493, 1009, 493, 2003, 493, 1009, 493, 2003, 493, 1009, 493, 2003, 493], 'repeat': 2, 'repeat_delay': 0.007964}

{'freq': 56697, 'delay': 0.212415, 'repeat': 2, 'repeat_delay': 0.007964, 'table': [[3968, 3993], [493, 2003], [493, 1009]], 'index': [0, 1, 1, 1, 1, 2, 2, 1, 2, 1, 2, 1, 2, 2, 2, 2, 2, 1, 1, 2, 1, 2, 1, 2, 1, 1]}

As you can see, there's a new dictionary entry called 'table' with 3 entries: [[3968, 3993], [493, 2003], [493, 1009]]. Then there's an index list, starting with a 0, then lots of 1's and 2's. Those are the indices into the pulse pair table.
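Going the other way is just a table lookup. Here's a sketch that expands the tableified code from above back into a flat list of on/off timings. The expand function is ours, for illustration only, and isn't part of the notebook:

```python
# The tableified first code, copied from the output above
code = {'freq': 56697, 'delay': 0.212415, 'repeat': 2, 'repeat_delay': 0.007964,
        'table': [[3968, 3993], [493, 2003], [493, 1009]],
        'index': [0, 1, 1, 1, 1, 2, 2, 1, 2, 1, 2, 1, 2,
                  2, 2, 2, 2, 1, 1, 2, 1, 2, 1, 2, 1, 1]}

def expand(code):
    """Rebuild the flat on/off list (in us) from the pair table and index."""
    pulses = []
    for i in code['index']:
        pulses.extend(code['table'][i])   # each index stands for one on/off pair
    return pulses

pulses = expand(code)
print(len(pulses), pulses[:6])
```

Note the expanded list comes out one entry longer than the original 'pulses' list, because the solitary last 'on' pulse picked up a false 'off' pulse when it was matched to a table entry.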

Block #6

Now we're down to about 45KB - which is pretty good. We could try to convert all the codes into pure binary format instead of having indices, but considering the wide range of encoding schemes, and that we've essentially reached our target code size, we can stop.

We can squeeze just a tiny bit more space out by removing spaces and reducing the floating point precision. That's what the final block does: it rounds off the floating points and takes out all the whitespace, then writes the codes out to a text file that we can load onto our CircuitPython board.

# Compactify and print!

with open("codes.txt", "w") as f:
    for code in paired_codes:
        code['delay'] = round(code['delay'],2)  # keep only 2 digits of precision for the long delay
        try:
            code['repeat_delay'] = round(code['repeat_delay'],3) # only 1ms precision for short delay
        except KeyError:
            pass                         # this code does not repeat
        s = str(code).replace(' ', '')  # remove whitespace!
        f.write(s + "\n")               # one code per line
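Because each line is just a Python dictionary literal with the spaces squeezed out, it can be turned back into a dictionary later. Here's a desktop-Python sketch of that round trip using ast.literal_eval from the standard library. How the board-side code actually loads the file isn't shown on this page, so treat this as our assumption. The index list is also shortened here for brevity:

```python
import ast

# First code from above, with a shortened index list for brevity
code = {'freq': 56697, 'delay': 0.212415, 'repeat': 2, 'repeat_delay': 0.007964,
        'table': [[3968, 3993], [493, 2003], [493, 1009]], 'index': [0, 1, 1, 2]}

code['delay'] = round(code['delay'], 2)                # long delay: 2 digits
code['repeat_delay'] = round(code['repeat_delay'], 3)  # short delay: 1ms precision
s = str(code).replace(' ', '')                         # compact, whitespace-free text

restored = ast.literal_eval(s)   # safely parse the literal back into a dict
print(len(s), restored == code)
```

Stripping spaces is safe here only because none of the keys or values contain spaces themselves.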

And here's the final output

This guide was first published on Mar 18, 2018. It was last updated on Jul 17, 2024.

This page (Parsing Data) was last updated on Mar 08, 2024.
