AccuWeather has an API for programmers, but it’s a complex affair requiring a signed contract and is not used here. Instead, our code downloads a forecast web page the same way your browser does…but instead of displaying it, we just pick through the raw HTML source for any kernels of interest. “Web scraping,” it’s called.

The good news is that even a small Arduino-compatible board like the Feather HUZZAH ESP8266 can handle this task.

The bad news is that it requires some familiarity with HTML — the “source code” of web pages — and that the AccuWeather site is particularly complex (a typical forecast page is over 120 kilobytes).

Start by loading a page of interest in your web browser…let’s try the same migraine forecast…

Right off the bat, I identified a problem…a bug in AccuWeather’s migraine forecast…the current conditions always indicate migraine weather! Any time, any location. I’ve reported the bug, but it has yet to be fixed. This probably doesn’t apply to other predictions, but it’s something to check for.

Likewise, nighttime and “early AM” predictions always show migraine weather. Daytime predictions are fine, but night presents a false positive.

Fortunately, there’s still enough usable information on the page that we can work around this! Software to the rescue…

We know that the first “Migraine Headache” string is right out, and the second is out if it follows the “Tonight” forecast. So the trick here is to look specifically for the word “Today” (in which case the forecast immediately following is valid), or “Tomorrow” (also valid). Anything else, like “Tonight” or “Early AM”, presents problems and should be ignored.

Thing is, words like “Migraine” and “Today” appear all over the page. We need to narrow in on a specific instance by looking at the HTML source and snagging some of the surrounding formatting tags (normally invisible on the page) along with the text. Each section of the page usually has some unique formatting going on.

Use your browser’s “view source” option. If it doesn’t offer one or you can’t find it, you can also save the page source to a file and examine it in a text editor.

There are 56 occurrences of the word “migraine” on this page, so one really has to dig through the source to find one that correctly identifies something in the forecast, and not just a random label or banner ad.

Some browsers have a more advanced “inspector” mode that’ll take you right to the part of the source corresponding to an element on the page.

Once some reliable “beacon” text is identified, it’s copied and pasted verbatim into the Arduino sketch. This must include exact spaces, HTML tags, everything.

In our case, to get past these false positives, we want to look for two different strings…if we find either of those, it serves as a starting point for locating the migraine forecast on the page. As explained above, “Today” provides a viable forecast, as does “Tomorrow”. Near the top of the code, there’s this list of search strings:

} matchList0[] = {
  { "<h3>Today</h3>"    },
  { "<h3>Tomorrow</h3>" },
  { NULL                }  // END OF LIST, don't remove this
}; // Can create add'l string match lists here if needed

A couple of things to notice here:

  • There’s some HTML in each of the strings, to uniquely tell them apart from any “Today” that’s on the page. It’s exactly as it appears in the HTML source.
  • Each string is inside a set of { braces }, because these are actually C struct elements. Formatting is super persnickety here.
  • The list of strings is assigned a unique name: matchList0[]. More complex programs might have multiple lists (e.g. matchList1[], etc.), but this example just needs one.
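To see why that final NULL entry matters, here is a minimal desktop C++ sketch of a NULL-terminated list. It is not the guide's actual code: the names MatchEntry, exampleList and countEntries are made up for illustration, and the real struct in the Arduino sketch may hold more than this one field.

```cpp
#include <cstddef>

// Hypothetical one-field struct; the real sketch's struct may differ.
struct MatchEntry {
  const char *text;  // search string, or nullptr to mark the end of the list
};

// Same shape as the list above: each entry in its own { braces },
// with an end-of-list marker as the last element.
static const MatchEntry exampleList[] = {
  { "<h3>Today</h3>"    },
  { "<h3>Tomorrow</h3>" },
  { nullptr             }  // end of list
};

// Walks the list until the end marker. Without that marker the loop
// would run off the end of the array, since no length is passed in.
std::size_t countEntries(const MatchEntry *list) {
  std::size_t n = 0;
  while (list[n].text != nullptr) ++n;
  return n;
}
```

Calling countEntries(exampleList) returns the number of real search strings, not counting the terminator. Code that scans a list this way has no other means of knowing where the list stops, which is why the last entry has to stay.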

Later in the code…in the loop() function…you’ll see this line:

    if((multiFind(matchList0) >= 0) &&

This searches the page for any of the items in matchList0, and returns the index of the first string found (e.g. if "<h3>Today</h3>" is found, a value of 0 is returned), or -1 if no match was found.
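The behavior just described can be approximated on a desktop machine. The following is only a sketch under stated assumptions, not the multiFind() from the Arduino code: multiFindSketch is a hypothetical name, a std::istream stands in for the network client, and the matching strategy (comparing the tail of a small sliding window against each string) is my own choice.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for multiFind(): reads `in` one character at a time
// and returns the index of the first pattern whose text has just been read
// in full, or -1 if the stream ends before any pattern is seen.
int multiFindSketch(std::istream &in, const std::vector<std::string> &patterns) {
  std::size_t longest = 0;
  for (const std::string &p : patterns)
    if (p.size() > longest) longest = p.size();

  std::string window;  // holds only the most recent `longest` characters
  char c;
  while (in.get(c)) {
    window.push_back(c);
    if (window.size() > longest) window.erase(0, 1);
    for (std::size_t i = 0; i < patterns.size(); ++i) {
      const std::string &p = patterns[i];
      // Does the window currently end with this pattern?
      if (!p.empty() && window.size() >= p.size() &&
          window.compare(window.size() - p.size(), p.size(), p) == 0)
        return static_cast<int>(i);
    }
  }
  return -1;
}
```

Given a page containing “Tonight”, then “Tomorrow”, then “Today” headings, and a list of { "<h3>Today</h3>", "<h3>Tomorrow</h3>" }, the first call returns 1: “Tonight” is skipped because it isn't in the list, and “Tomorrow” is the first listed string encountered. The return value is the string's position in the list, not its position on the page.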

Usually, when reading the migraine forecast, you will find either a Today string (during the day) or a Tomorrow string (any time). Those get us past the false positives, and we can search from that point forward for a valid migraine “beacon” in the HTML source, which is what the next line in the code does…

       client.find("Migraine Headache <span>Weather")) {

If both conditions are true, then migraine conditions were detected in the forecast. The variables “hi” and “lo” are set to the LED blink on and off times, in milliseconds…slow and steady if a match was found, or a tiny periodic “blip” to indicate there’s nothing in the forecast, but the microcontroller is still active and hasn’t locked up.
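As a rough illustration of the two blink patterns, here is a small sketch in plain C++. The millisecond values, the BlinkTimes struct and both function names are assumptions made up for this example; the actual sketch may use different numbers and a different structure.

```cpp
// All names and timing values here are assumed for illustration.
struct BlinkTimes {
  unsigned long hiMs;  // LED on time, in milliseconds
  unsigned long loMs;  // LED off time, in milliseconds
};

// Picks a blink pattern based on whether the forecast matched.
BlinkTimes chooseBlink(bool migraineInForecast) {
  if (migraineInForecast) return {1000UL, 1000UL};  // slow, steady blink
  return {10UL, 4990UL};                            // brief "still alive" blip
}

// True if the LED should be lit `nowMs` milliseconds into the pattern.
bool ledIsOn(unsigned long nowMs, const BlinkTimes &t) {
  return (nowMs % (t.hiMs + t.loMs)) < t.hiMs;
}
```

On real hardware, nowMs would typically come from millis(), and the result of ledIsOn() would be written to the LED pin each time through loop().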

The real “trick” to web scraping like this is that you only get ONE PASS forward through the data. You CANNOT REWIND and try a new search again…one must plan a combination of find() and/or multiFind() calls and if/then/else conditions that will produce an acceptable result.
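Here is a small desktop C++ sketch of that one-pass constraint. It is an illustration only: findForward and beaconAfterHeading are hypothetical names, and a std::istringstream stands in for the incoming web page.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: consumes characters from `in` until `target` has just
// been read. Returns true on a match, false if the stream ran out first.
// Either way, everything read so far is gone; there is no backing up.
bool findForward(std::istream &in, const std::string &target) {
  std::string window;  // the most recent target.size() characters
  char c;
  while (in.get(c)) {
    window.push_back(c);
    if (window.size() > target.size()) window.erase(0, 1);
    if (window == target) return true;
  }
  return false;
}

// True only if `beacon` appears somewhere AFTER `heading` on the page.
// The && short-circuits, so the second search starts where the first stopped.
bool beaconAfterHeading(const std::string &page, const std::string &heading,
                        const std::string &beacon) {
  std::istringstream in(page);
  return findForward(in, heading) && findForward(in, beacon);
}
```

Order matters. On a stream containing “Today” followed by “Tomorrow”, searching for "<h3>Tomorrow</h3>" first succeeds but reads straight past “Today”, so a later search for "<h3>Today</h3>" on the same stream fails. That is the reason for checking both headings in a single multi-string search rather than one after the other.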

Last updated on 2016-03-17 at 03.02.55 PM Published on 2016-03-20 at 09.47.39 PM