AI Takes Over GitHub Analysis: The Future of Data Science is Here

GitHub is a popular platform for hosting and collaborating on software projects, and it generates a vast amount of data about the activities of its users and repositories. ChatGPT, the AI chatbot that has taken the internet by storm, is designed to respond to natural language dialogue and answer complex queries.

In this post, we will use ChatGPT to explore some of the insights that can be gleaned from three of the main data tables on GitHub:

github_users

( `id` int(11) NOT NULL, `login` varchar(255) NOT NULL, `type` char(3) NOT NULL DEFAULT 'N/A', `is_bot` tinyint(1) NOT NULL DEFAULT '0', `NAME` varchar(255) NOT NULL DEFAULT '', `email` varchar(255) NOT NULL DEFAULT '', `organization` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT '', `organization_formatted` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT '', `address` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT '', `country_code` char(3) NOT NULL DEFAULT 'N/A', `region_code` char(3) NOT NULL DEFAULT 'N/A', `state` varchar(255) NOT NULL DEFAULT '', `city` varchar(255) NOT NULL DEFAULT '', `longitude` decimal(11,8) NOT NULL DEFAULT '0', `latitude` decimal(10,8) NOT NULL DEFAULT '0', `public_repos` int(11) NOT NULL DEFAULT '0', `followers` int(11) NOT NULL DEFAULT '0', `followings` int(11) NOT NULL DEFAULT '0', `created_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00', `updated_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00', `is_deleted` tinyint(1) NOT NULL DEFAULT '0', `refreshed_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP, PRIMARY KEY (`id`)
/*T![clustered_index] CLUSTERED */
, KEY `index_gu_on_login_is_bot_organization_country_code` (`login`,`is_bot`,`organization_formatted`,`country_code`), KEY `index_gu_on_address` (`address`), KEY `index_gu_on_organization` (`organization`) ) engine=innodb DEFAULT charset=utf8mb4 COLLATE=utf8mb4_bin

show less

github_events

( `id` bigint(20) NOT NULL DEFAULT '0', `type` varchar(29) NOT NULL DEFAULT 'Event', `created_at` datetime NOT NULL DEFAULT '1970-01-01 00:00:00', `repo_id` bigint(20) NOT NULL DEFAULT '0', `repo_name` varchar(140) NOT NULL DEFAULT '', `actor_id` bigint(20) NOT NULL DEFAULT '0', `actor_login` varchar(40) NOT NULL DEFAULT '', `language` varchar(26) NOT NULL DEFAULT '', `additions` bigint(20) NOT NULL DEFAULT '0', `deletions` bigint(20) NOT NULL DEFAULT '0', `action` varchar(11) NOT NULL DEFAULT '', `number` int(11) NOT NULL DEFAULT '0', `commit_id` varchar(40) NOT NULL DEFAULT '', `comment_id` bigint(20) NOT NULL DEFAULT '0', `org_login` varchar(40) NOT NULL DEFAULT '', `org_id` bigint(20) NOT NULL DEFAULT '0', `state` varchar(6) NOT NULL DEFAULT '', `closed_at` datetime NOT NULL DEFAULT '1970-01-01 00:00:00', `comments` int(11) NOT NULL DEFAULT '0', `pr_merged_at` datetime NOT NULL DEFAULT '1970-01-01 00:00:00', `pr_merged` tinyint(1) NOT NULL DEFAULT '0', `pr_changed_files` int(11) NOT NULL DEFAULT '0', `pr_review_comments` int(11) NOT NULL DEFAULT '0', `pr_or_issue_id` bigint(20) NOT NULL DEFAULT '0', `event_day` date NOT NULL, `event_month` date NOT NULL, `event_year` int(11) NOT NULL, `push_size` int(11) NOT NULL DEFAULT '0', `push_distinct_size` int(11) NOT NULL DEFAULT '0', `creator_user_login` varchar(40) NOT NULL DEFAULT '', `creator_user_id` bigint(20) NOT NULL DEFAULT '0', `pr_or_issue_created_at` datetime NOT NULL DEFAULT '1970-01-01 00:00:00', KEY `index_github_events_on_id` (`id`) /*!80000 INVISIBLE */, KEY `index_github_events_on_actor_login` (`actor_login`), KEY `index_github_events_on_created_at` (`created_at`), KEY `index_github_events_on_repo_name` (`repo_name`), KEY `index_github_events_on_repo_id_type_action_month_actor_login` (`repo_id`,`type`,`action`,`event_month`,`actor_login`), KEY `index_ge_on_repo_id_type_action_pr_merged_created_at_add_del` (`repo_id`,`type`,`action`,`pr_merged`,`created_at`,`additions`,`deletions`), KEY `index_ge_on_creator_id_type_action_merged_created_at_add_del` (`creator_user_id`,`type`,`action`,`pr_merged`,`created_at`,`additions`,`deletions`), KEY `index_ge_on_actor_id_type_action_created_at_repo_id_commits` (`actor_id`,`type`,`action`,`created_at`,`repo_id`,`push_distinct_size`), KEY `index_ge_on_repo_id_type_action_created_at_number_pdsize_psize` (`repo_id`,`type`,`action`,`created_at`,`number`,`push_distinct_size`,`push_size`), KEY `index_ge_on_repo_id_type_action_created_at_actor_login` (`repo_id`,`type`,`action`,`created_at`,`actor_login`) )

show less

github_repos

(   `repo_id` int(11) NOT NULL,   `repo_name` varchar(150) NOT NULL,   `owner_id` int(11) NOT NULL,   `owner_login` varchar(255) NOT NULL,   `owner_is_org` tinyint(1) NOT NULL,   `description` varchar(512) NOT NULL DEFAULT '',   `primary_language` varchar(32) NOT NULL DEFAULT '',   `license` varchar(32) NOT NULL DEFAULT '',   `size` bigint(20) NOT NULL DEFAULT '0',   `stars` int(11) NOT NULL DEFAULT '0',   `forks` int(11) NOT NULL DEFAULT '0',   `parent_repo_id` int(11) DEFAULT NULL,   `is_fork` tinyint(1) NOT NULL DEFAULT '0',   `is_archived` tinyint(1) NOT NULL DEFAULT '0',   `is_deleted` tinyint(1) NOT NULL DEFAULT '0',   `latest_released_at` timestamp NULL DEFAULT NULL,   `pushed_at` timestamp NULL DEFAULT NULL,   `created_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',   `updated_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',   `last_event_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',   `refreshed_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',   PRIMARY KEY (`repo_id`) /*T![clustered_index] CLUSTERED */,   KEY `index_gr_on_owner_id` (`owner_id`),   KEY `index_gr_on_repo_name` (`repo_name`),   KEY `index_gr_on_stars` (`stars`),   KEY `index_gr_on_repo_id_repo_name` (`repo_id`,`repo_name`),   KEY `index_gr_on_created_at_is_deleted` (`created_at`,`is_deleted`),   KEY `index_gr_on_owner_login_owner_id_is_deleted` (`owner_login`,`owner_id`,`is_deleted`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin

show less

Initial reading from ChatGPT

ChatGPT: These tables contain information about GitHub users, repositories, and events that occur on the platform. The github_users table stores information about individual GitHub users, such as their login name, type, whether or not they are a bot, and their organization. The github_repos table stores information about GitHub repositories, such as their name, owner, and description. The github_events table stores information about events that occur on GitHub, such as the type of event, the time it occurred, and the repository it occurred in.

Each of these tables has a number of columns that store various pieces of information about the entities they represent. For example, the github_users table has columns for the user’s login name, type, and whether or not they are a bot. The github_repos table has columns for the repository’s name, owner, and description. The github_events table has columns for the type of event, the time it occurred, and the repository it occurred in.

These tables also have indices, which are used to improve the performance of queries on the tables. Indices allow the database to quickly look up and retrieve the data that is needed to answer a query, without having to scan the entire table. For example, the github_repos table has an index on the owner_id column, which allows the database to quickly look up all repositories that are owned by a specific user.

Overall, these tables provide a wealth of information about GitHub users, repositories, and events that can be used to answer a wide variety of interesting questions.

Interesting questioning “thoughts” by ChatGPT

What are the interesting questions ChatGPT would recommend asking about these 3 tables?

ChatGPT: Here are some interesting questions that could be asked about these tables:

What is the average number of followers for users in the github_users table?
What is the most popular primary language used in repositories in the github_repos table?
How many events in the github_events table are of type “PushEvent”?
What is the average number of stars for repositories in the github_repos table that are owned by organizations, as indicated by the owner_is_org column?
What is the most popular license used by repositories in the github_repos table?
How many events in the github_events table were performed by users who are marked as bots in the github_users table?
What is the average number of public repositories for users in the github_users table who are located in a specific country, as indicated by the country_code column?
What is the most popular type of event in the github_events table, as indicated by the type column?
How many events in the github_events table were performed on repositories that are marked as deleted in the github_repos table?
What is the average number of comments on pull requests in the github_events table?

Those are some examples, and actually, there could be more and more complex as you might dig into the detail.

More complex insights with ChatGPT

What is the average number of public repositories for users in the github_users table who are located in a specific country?

ChatGPT: To answer this question, you can use a SQL query like the following:

SELECT AVG(public_repos)
FROM github_users
WHERE country_code = '<country code>';

This query selects the average value of the public_repos column from the github_users table, but only for rows where the country_code column is equal to the specified country code. This will return the average number of public repositories for users in the github_users table who are located in the specified country.

You can replace <country code> in the query with the actual country code that you want to filter on. For example, if you wanted to find the average number of public repositories for users located in the United States, you could use a query like this:

SELECT AVG(public_repos)
FROM github_users
WHERE country_code = 'US';

This query would return the average number of public repositories for users located in the United States.

What is the most popular primary language used in repositories in the github_repos table?

ChatGPT: To answer this question, you can use a SQL query like the following:

SELECT primary_language, COUNT(*) as count
FROM github_repos
GROUP BY primary_language
ORDER BY count DESC
LIMIT 1;

This query first groups the rows in the github_repos table by the primary_language column, and then counts the number of rows in each group. This will give you a count of how many repositories use each primary language. The query then sorts the results in descending order by the count column, and finally limits the results to just the first row. This will return the primary language that is used by most repositories.

For example, if the github_repos table contained the following rows:

repo_id	repo_name	primary_language
1	repo1	Python
2	repo2	Python
3	repo3	Ruby
4	repo4	Python
5	repo5	Ruby

Then the query would return the following result:

primary_language	count
Python	3

This indicates that the most popular primary language used in repositories in the github_repos table is Python, as it is used by 3 repositories.

But `primary_language` may be empty. How to avoid this?

ChatGPT: To avoid including empty values for the primary_language column in the query results, you can add a WHERE clause to the query that filters out rows with empty values for this column. For example, you could use a query like the following:

SELECT primary_language, COUNT(*) as count
FROM github_repos
WHERE primary_language != ''GROUP BY primary_language
ORDER BY count DESC
LIMIT 1;

This query is the same as the previous one, except that it adds a WHERE clause that filters out rows where the primary_language column is empty. This will ensure that only rows with non-empty values for this column are included in the query results.

For example, if the github_repos table contained the following rows:

repo_id	repo_name	primary_language
1	repo1	Python
2	repo2	Python
3	repo3	Ruby
4	repo4	Python
5	repo5

Then the query would return the following result:

primary_language	count
Python	3

Any More?

ChatGPT: There are many possible insights that can be derived from these tables, depending on the specific data that they contain and the questions that you want to answer. Here are a few examples of insights that you might be able to glean from these tables:

By joining the github_events and github_users tables on the actor_id column, you can find out which users have performed the most events in the github_events table, and which types of events they have performed the most. This can give you an idea of who the most active users on GitHub are, and what kinds of activities they are most involved in.
By joining the github_events and github_repos tables on the repo_id column, you can find out which repositories have the most events in the github_events table, and what kinds of events are most common for those repositories. This can give you an idea of which repositories are the most active, and what kinds of activities are happening in those repositories.
By analyzing the created_at and updated_at columns in the github_repos table, you can find out how long repositories typically stay active, and how often they are updated. This can give you an idea of the typical lifespan of a repository on GitHub, and how often users are committing changes to their repositories.
By analyzing the created_at and last_event_at columns in the github_repos table, you can find out how long it typically takes for a repository to receive its first event, and how long it takes for a repository to receive its last event. This can give you an idea of how quickly repositories become active, and how long they remain active before becoming inactive.
By analyzing the stars and forks columns in the github_repos table, you can find out which repositories are the most popular, and how many forks they have. This can give you an idea of which repositories are the most widely used, and how many users they have.

GitHub is an immensely popular platform for open-source software development, hosting millions of repositories and millions of users. This wealth of data offers a unique opportunity for data scientists and researchers to study and understand the dynamics of open-source software development, and to uncover hidden trends and patterns in this data. AI and machine learning are transforming the way we analyze GitHub data, and they are enabling researchers to uncover hidden trends and patterns in this data. With AI, we can automate and accelerate the process of data analysis, and we can make more accurate and reliable predictions about the future of open-source software development.

An illustration of AI with a database from DALL.E 2

What is the takeaway here?

ChatGPT: As the world becomes more and more reliant on technology, the importance of data grows exponentially. The amount of data that is generated and collected by organizations on a daily basis is staggering, and it is only going to continue to increase. This means that the way we store, manage, and analyze data needs to evolve to keep up with the demand.

One area that has seen significant advances in recent years is artificial intelligence (AI) and its ability to write and optimize SQL queries. SQL, or Structured Query Language, is a programming language that is used to manage and analyze data stored in relational databases. It is an essential tool for anyone working with data, and the ability of AI to write and optimize SQL queries can greatly improve the efficiency and effectiveness of data management and analysis.

One of the key benefits of AI-powered SQL query writing is the ability to rewrite cross-protocol SQL queries. This eliminates the need for manual rewriting of SQL queries, which can be time-consuming and error-prone.

Another key factor is to simplify the complex database infrastructure that is commonly used today. Many organizations use multiple databases for different purposes, such as operational databases, analytical databases, cache databases, and messaging databases. This can be difficult to manage and can lead to inefficiencies. It is possible to consolidate all of these databases into a single, one-stop database that can handle all of the organization’s data needs. This not only simplifies the infrastructure but also improves the performance and reliability of the database system.

Finally, the combination of AI-powered SQL query writing and a one-stop database can greatly improve accessibility to data. With the ability to write and optimize SQL queries automatically, even non-technical users can easily access and analyze data without needing to have expertise in SQL or database management. This opens up the world of data to a wider audience, making it easier for everyone to gain insights and make informed decisions based on data.

In conclusion, the advances in AI-powered SQL query writing and the simplification of database infrastructure are revolutionizing the way we manage and analyze data. By automating the process of writing and optimizing SQL queries and consolidating multiple databases into a single one-stop database, organizations can improve the efficiency, performance, and accessibility of their data systems. This will enable them to make better use of their data and gain valuable insights that can help drive their business forward.

Note: The answers and conclusions in this post are all provided by AI.

Book a Demo