As a data scientist, one of my most important tasks is to sample data from vendors and create proofs of concept for new applications. These must meet a certain threshold of accuracy before we can move forward. I need to understand the historical time series of the dataset and how frequently it's updated. Whether the process is long or short depends on the kind of hypothesis or solution we're trying to build, as with our national map of broadband Internet connectivity. For that project, we needed alternative data that wasn't created for our intended use but worked well anyway. The first proof of concept took a few weeks to complete, but developing a solution clients could use was a much longer process.
Our new Data Platform can make this kind of work easier for our customers. Previously, they had to complete several steps just to get a data dictionary and understand what data was available on our platform. Now they can go straight to the developer portal, browse the attributes of every dataset, and decide whether they want to move forward. If they do, the process is more accessible than before. In the age of the tech platform, it makes sense to give customers and prospects the ability to vet what we offer quickly.
Customers using the platform can obtain a small sample dataset to learn more about the data and how they might use it. They can also use an API to dive deeper, or simply request a complete data dump that lets them get to work. The API is especially beneficial when it comes time to move a model from development to production, because at that point users don't need to maintain the data on their own systems; LightBox keeps it up to date.
We've done a lot of research to solve the broadband accessibility issue on a national scale. We sourced external data such as Wi-Fi access points, but it didn't solve the problem on its own, so we also used our own data. Wi-Fi access-point data is geospatial: a stream of latitude-longitude points. We have a lot of boundary data, mainly parcel boundaries and building footprints. We had to bring the vendor's point data into our geospatial polygon datasets to link the two, and we attached other third-party datasets to help with the analysis.
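The core of that linking step is a point-in-polygon join: each access-point coordinate is assigned to the parcel polygon that contains it. Here is a minimal sketch of the idea in plain Python, with hypothetical parcel IDs and coordinates; in practice this would be a spatial join in a geospatial stack such as GeoPandas or PostGIS.

```python
# Sketch: link vendor point data (Wi-Fi access points as lon/lat pairs)
# to parcel polygons via a ray-casting point-in-polygon test.
# All parcel shapes, IDs, and access-point coordinates are hypothetical.

def point_in_polygon(x, y, polygon):
    """Return True if (x, y) lies inside the polygon, given as an
    ordered list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from (x, y) cross edge (x1,y1)-(x2,y2)?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Hypothetical parcel boundaries keyed by parcel ID.
parcels = {
    "APN-001": [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)],
    "APN-002": [(10.0, 0.0), (20.0, 0.0), (20.0, 10.0), (10.0, 10.0)],
}

def link_points_to_parcels(points, parcels):
    """Attach each point to the first parcel that contains it (or None)."""
    links = []
    for px, py in points:
        match = next((pid for pid, poly in parcels.items()
                      if point_in_polygon(px, py, poly)), None)
        links.append(((px, py), match))
    return links

# Hypothetical Wi-Fi access-point stream.
access_points = [(3.5, 4.2), (15.0, 8.1), (25.0, 5.0)]
for pt, parcel_id in link_points_to_parcels(access_points, parcels):
    print(pt, "->", parcel_id)
```

A production version would use a spatial index rather than scanning every parcel, but the join logic is the same.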
Occasionally there are inconsistencies with external data, so we have to convert it. Geospatial data is fairly easy to convert, but there are challenges. GPS data carries an accuracy figure, on average perhaps plus or minus 15 meters. So if a user is linking a dataset with that level of uncertainty, they can't be 100 percent certain a given point will intersect the correct boundary. Users can manage such discrepancies with heuristics or other methodologies.
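One simple heuristic for that uncertainty: if a GPS reading sits within its accuracy radius of a parcel boundary, flag the assignment as ambiguous rather than trusting a hard intersection. The sketch below uses hypothetical coordinates treated as planar meters for simplicity; real latitude-longitude data would first be projected.

```python
# Sketch of a positional-uncertainty heuristic: a reading within ~15 m of a
# parcel boundary cannot be assigned with confidence. Coordinates and the
# parcel shape below are hypothetical, in planar meters.
import math

GPS_ACCURACY_M = 15.0  # typical horizontal accuracy cited for the source

def dist_point_to_segment(px, py, ax, ay, bx, by):
    """Shortest distance from point P to segment AB."""
    abx, aby = bx - ax, by - ay
    ab2 = abx * abx + aby * aby
    if ab2 == 0:
        return math.hypot(px - ax, py - ay)
    t = max(0.0, min(1.0, ((px - ax) * abx + (py - ay) * aby) / ab2))
    return math.hypot(px - (ax + t * abx), py - (ay + t * aby))

def dist_to_boundary(px, py, polygon):
    """Shortest distance from a point to any edge of the polygon."""
    n = len(polygon)
    return min(
        dist_point_to_segment(px, py, *polygon[i], *polygon[(i + 1) % n])
        for i in range(n)
    )

def classify_match(px, py, polygon, accuracy=GPS_ACCURACY_M):
    """Label a point/parcel pairing 'ambiguous' if the reading sits within
    the accuracy radius of the boundary, else 'confident'."""
    if dist_to_boundary(px, py, polygon) <= accuracy:
        return "ambiguous"
    return "confident"

parcel = [(0.0, 0.0), (100.0, 0.0), (100.0, 100.0), (0.0, 100.0)]  # meters
print(classify_match(50.0, 50.0, parcel))  # well inside the parcel
print(classify_match(50.0, 10.0, parcel))  # within 15 m of an edge
```

Ambiguous points could then be resolved with additional signals, such as which candidate parcel contains more of the error circle.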
While the accuracy level of our parcel boundaries and footprints is high, transactions regularly change, create, or destroy parcels. If you intersect a tax assessor record from the past with a current lot, the property characteristics may not match up. Time series is critical: some state and local governments publish updates quickly, while others don't, and that lag affects models. Ultimately, any project like this aims to streamline and centralize data so that a single source of truth drives the most accurate analysis possible.
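The time-series concern above can be made concrete with validity intervals: a historical assessor record should be matched against the parcel version that was valid on the record's date, not today's geometry. A minimal sketch, with hypothetical parcel IDs and `valid_from`/`valid_to` fields:

```python
# Sketch: match a dated record to parcel versions valid on that date.
# Parcel IDs, split history, and dates below are hypothetical.
from datetime import date

# Each entry: (parcel_id, valid_from, valid_to); None = still current.
parcel_versions = [
    ("APN-001", date(2010, 1, 1), date(2018, 6, 30)),  # later split in two
    ("APN-001A", date(2018, 7, 1), None),
    ("APN-001B", date(2018, 7, 1), None),
]

def versions_valid_on(record_date, versions):
    """Return parcel IDs whose validity interval contains record_date."""
    return [
        pid for pid, start, end in versions
        if start <= record_date and (end is None or record_date <= end)
    ]

print(versions_valid_on(date(2015, 3, 1), parcel_versions))  # pre-split parcel
print(versions_valid_on(date(2020, 3, 1), parcel_versions))  # post-split parcels
```

Joining a 2015 assessor record against the 2020 geometry would silently attach old characteristics to the wrong lots; the interval check avoids that.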
Learn more about what Zach and the data scientist team are working on by visiting the LightBox Labs page.