From the course: Data Planning, Strategy, and Compliance for AI Initiatives

Identifying data sources

- [Instructor] Let's consider how we go about identifying data sources. In this section, we're going to explore the internal, external, and synthetic data sources we can use to power our AI initiatives. We'll also look at the key factors for evaluating potential data sources against our specific business needs, how to navigate the legal and ethical landscape when sourcing data, and the common pitfalls and red flags to watch for when evaluating potential sources.

With regards to internal data sources, we have department databases and data warehouses, which are existing repositories of structured information that's already aligned with our business processes. We also have enterprise-scale systems like CRMs and ERPs, which hold rich customer and operational data. And we want to be careful not to overlook employee-generated documents and other content, because they contain valuable institutional knowledge and expertise that AI can leverage.

With regards to external data sources, there are multiple types: public datasets from government, academic, and open data initiatives, and commercial datasets from vendors that offer specialized data, where we want to evaluate licensing costs and terms carefully. Similarly, there are service providers that offer APIs, which give real-time access to third-party data in standardized formats. Web scraping is useful for gathering publicly available information, and social media and public forums provide insight into customer sentiment, trends, and related signals.

Sometimes we have to turn to synthetic datasets. For example, synthetic datasets offer a solution when privacy regulations restrict the use of real customer information. We can also use augmented datasets, a hybrid approach that combines actual data with synthetic elements to address gaps or enhance features. And we can use AI to create diverse training examples for scenarios that are difficult to collect naturally, especially edge cases. (A short sketch of the synthetic-data idea appears after this section.)

We want to think in terms of relevance and alignment: data sources must align with the goal of the AI initiative we're trying to accomplish. We also want to ensure the data reflects our specific industry realities and application scenarios, and verify that it includes the variables our models will need to make accurate predictions.

We also want to consider data quality factors such as completeness, which we evaluate by checking whether the data has all the necessary fields. We want to think about accuracy, verifying that the information is correct and reasonably free of errors. For consistency, we check whether the data is formatted uniformly across sources. Data should also be timely, meaning we assess whether it is current enough for our specific application needs. And we want to determine whether the level of detail is sufficient without being unnecessarily complex. (A second sketch after this section shows what simple checks along these lines might look like.)
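As a rough illustration of the synthetic-data idea mentioned above, here is a minimal Python sketch that fabricates customer-like records using only the standard library. The field names, value ranges, and 15% churn rate are illustrative assumptions, not figures from the course; a real project would match the schema and distributions of the data it stands in for.

```python
import random
import datetime

# Minimal sketch: generate synthetic customer records when real
# customer data can't be used for privacy reasons. All field names
# and value ranges here are illustrative assumptions.

SEGMENTS = ["consumer", "small_business", "enterprise"]

def synthetic_customer(customer_id: int) -> dict:
    """Return one fabricated customer record (no real data involved)."""
    signup = datetime.date(2020, 1, 1) + datetime.timedelta(
        days=random.randint(0, 1500))
    return {
        "customer_id": f"SYN-{customer_id:06d}",   # clearly marked as synthetic
        "segment": random.choice(SEGMENTS),
        "signup_date": signup.isoformat(),
        "monthly_spend": round(random.lognormvariate(3.0, 0.8), 2),
        "churned": random.random() < 0.15,          # assumed 15% churn rate
    }

if __name__ == "__main__":
    # Generate a handful of example records and print them.
    for record in (synthetic_customer(i) for i in range(5)):
        print(record)
```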
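And to make the quality factors a little more concrete, here is one possible sketch of completeness, accuracy, consistency, and timeliness checks using pandas. The column names ("order_date", "amount", "region"), the expected region codes, and the 90-day freshness threshold are assumptions for the example only.

```python
import pandas as pd

# Sketch of simple data-quality checks for a candidate data source.
# Column names and thresholds are assumptions for illustration only.

def quality_report(df: pd.DataFrame, max_age_days: int = 90) -> dict:
    report = {}

    # Completeness: share of non-missing values per column.
    report["completeness"] = (1 - df.isna().mean()).round(3).to_dict()

    # Accuracy / validity: flag obviously bad values (negative amounts).
    report["negative_amounts"] = int((df["amount"] < 0).sum())

    # Consistency: check that categorical codes match an expected set.
    expected_regions = {"NA", "EMEA", "APAC"}
    report["unexpected_regions"] = sorted(
        set(df["region"].dropna().unique()) - expected_regions)

    # Timeliness: how stale is the most recent record?
    latest = pd.to_datetime(df["order_date"]).max()
    age_days = (pd.Timestamp.today() - latest).days
    report["days_since_latest_record"] = age_days
    report["fresh_enough"] = age_days <= max_age_days

    return report

if __name__ == "__main__":
    sample = pd.DataFrame({
        "order_date": ["2024-05-01", "2024-06-15", None],
        "amount": [120.5, -3.0, 87.0],
        "region": ["NA", "EMEA", "LATAM"],
    })
    print(quality_report(sample))
```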
There are several technical considerations to keep in mind when identifying data sources. We want to evaluate how difficult it will be to incorporate the data into our existing workflows and systems, and ensure that data formats work with our chosen AI platforms to minimize conversion effort. We also want to assess whether we have the infrastructure needed to handle the volume, velocity, and complexity of the data we're considering, and how often the data needs refreshing, along with the resources required to keep it current.

We also need to consider legal requirements around our data sources. In particular, we want to verify that we have the proper rights to use the data for our intended AI applications, and ensure adherence to frameworks like the GDPR, the California Consumer Privacy Act, and other privacy laws affecting our operations. We should identify and address any industry-specific regulatory requirements unique to our sector, for example, HIPAA in healthcare or the FCRA in financial services. We also want to be careful when navigating the complex landscape of international data movement, which is particularly important for organizations that operate on a global scale.

With regards to risk management, we want to establish processes to validate the origin and chain of custody of all our data sources. We also want to evaluate how well providers protect their data, especially sensitive information, and understand the track record of the companies we're dealing with. And we want to consider potential dependencies and plan for alternatives in case we need to swap providers. (A simple sketch of how these factors might be documented per source appears after this section.)

Finally, there are some things we want to watch out for. Gaps or outdated information often indicate deeper quality issues that can undermine our AI models. Without adequate documentation of how the data was collected, we can't properly assess its suitability. Biased datasets lead to biased AI outcomes. We want to understand usage restrictions, so we need to be aware of license limits and how we're allowed to test and validate the data. We should watch for pricing models that might become unsustainable as we move forward. And we want to consider dependencies on inconsistent data sources and make sure we can tolerate any service disruptions that might occur.
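One lightweight way to act on the risk-management and legal points above is to document each candidate source in a structured record and flag obvious gaps. The sketch below is an assumed starting checklist, not a prescribed schema from the course; the field names, vendor name, and flag rules are illustrative.

```python
from dataclasses import dataclass, field, asdict
from typing import Optional

# Sketch of a simple record for documenting a candidate data source's
# provenance, licensing, and refresh needs. Field names are assumptions
# meant as a starting checklist, not a standard schema.

@dataclass
class DataSourceRecord:
    name: str
    owner: str                      # internal team or external vendor
    origin: str                     # where the data comes from (provenance)
    license_terms: str              # usage rights and restrictions
    contains_personal_data: bool    # triggers GDPR/CCPA review if True
    applicable_regulations: list = field(default_factory=list)
    refresh_frequency: str = "unknown"        # e.g., daily, weekly, ad hoc
    fallback_source: Optional[str] = None     # alternative if this provider fails

def review_flags(record: DataSourceRecord) -> list:
    """Return simple red flags worth a closer look before using the source."""
    flags = []
    if record.contains_personal_data and not record.applicable_regulations:
        flags.append("personal data present but no regulations listed")
    if record.license_terms.lower() in {"", "unknown"}:
        flags.append("license terms not documented")
    if record.fallback_source is None:
        flags.append("no fallback if this provider becomes unavailable")
    return flags

if __name__ == "__main__":
    src = DataSourceRecord(
        name="vendor_sales_feed",
        owner="Acme Data Co. (hypothetical vendor)",
        origin="vendor API",
        license_terms="unknown",
        contains_personal_data=True,
    )
    print(asdict(src))
    print(review_flags(src))
```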