CMU and Google Researchers Open-Source ‘python_graphs’, a Library for Representing Python Programs as Graphs for Machine Learning Research

On Aug 27, 2022

Graphs are a staple storytelling tool for data scientists and engineers, but there is another kind of graph: the code graph. Code graphs represent the structure of a program and the flow of its execution, and they are widely used in machine learning projects. Thanks to the work of researchers at CMU and Google Research, the python_graphs library now makes it possible to build such graphs directly from Python source code for use in machine learning research. The graph types it supports are the abstract syntax tree (AST), control-flow graph (CFG), data-flow graph, inter-procedural control-flow graph (ICFG), interval graph, and composite “program graphs.” Users can construct these graphs directly from code, or use the library's tools as building blocks for other varieties of graphs.

Graph representations of programs are a standard tool in machine learning research, the most common being the abstract syntax tree (AST), and several research papers have used them. A typical syntax-based graph has an AST backbone with control-flow, data-flow, and syntactic information encoded on top of it. Other graph-building systems rely on external tools such as CodeQL, which can lead to compilation errors or bugs because they encode additional information not given in the code. To improve the situation, the researchers created python_graphs, which needs no such external tool and is therefore free of those disadvantages.
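To make the idea of an AST viewed as a graph concrete, here is a minimal sketch that uses only Python's standard-library ast module, not python_graphs itself. The function name ast_to_graph and the (nodes, edges) output format are assumptions chosen for illustration and are not part of the library.

```python
import ast


def ast_to_graph(source):
    """Return (nodes, edges) describing the AST of `source`.

    nodes: list of AST node type names; the list index is the node id.
    edges: list of (parent_id, child_id) pairs.
    Illustrative helper only; this is not the python_graphs API.
    """
    nodes = []
    edges = []

    def visit(node, parent_id):
        # Give every visit a fresh id so the result is always a tree,
        # even where CPython reuses one object (e.g. the Load context).
        node_id = len(nodes)
        nodes.append(type(node).__name__)
        if parent_id is not None:
            edges.append((parent_id, node_id))
        for child in ast.iter_child_nodes(node):
            visit(child, node_id)

    visit(ast.parse(source), None)
    return nodes, edges


nodes, edges = ast_to_graph("x = 1\ny = x + 2\n")
for parent, child in edges:
    print(f"{nodes[parent]} -> {nodes[child]}")
```

A syntax-based program graph would start from a tree like this and then add further edges for control flow and data flow.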

Control-flow graphs show the flow of execution through the code, with each node in the graph representing a basic block of code. In addition to these control-flow graphs, the python_graphs library can also produce statement-level control-flow graphs, in which each node represents a single statement (line) of the code. The library also creates program graphs, which are graphs with an abstract syntax tree as their backbone, where each node in the program graph corresponds to a single node in the AST.
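The statement-level control-flow graph described above can be sketched with the standard-library ast module alone. This is a simplified illustration, not how python_graphs is implemented: the helper name build_cfg is hypothetical, it handles only straight-line statements, `if`, and `while`, and it ignores `break`, `continue`, `return`, exceptions, and `while ... else`. It assumes Python 3.9 or later for ast.unparse.

```python
import ast


def build_cfg(source):
    """Build a statement-level CFG for straight-line code, `if`, and `while`.

    Returns (labels, edges): labels maps node id -> source text of the
    statement (or of the test expression for `if`/`while`); edges is a
    set of (src_id, dst_id) pairs. Hypothetical helper for illustration.
    """
    labels = {}
    edges = set()

    def new_node(node):
        node_id = len(labels)
        labels[node_id] = ast.unparse(node)
        return node_id

    def visit_block(stmts, preds):
        # `preds` are the nodes from which control enters this block.
        # The return value is the list of nodes where control leaves it.
        for stmt in stmts:
            preds = visit_stmt(stmt, preds)
        return preds

    def visit_stmt(stmt, preds):
        if isinstance(stmt, ast.If):
            test = new_node(stmt.test)
            edges.update((p, test) for p in preds)
            then_exits = visit_block(stmt.body, [test])
            else_exits = visit_block(stmt.orelse, [test])
            return then_exits + else_exits
        if isinstance(stmt, ast.While):
            test = new_node(stmt.test)
            edges.update((p, test) for p in preds)
            body_exits = visit_block(stmt.body, [test])
            # Back edges: the end of the loop body returns to the test.
            edges.update((b, test) for b in body_exits)
            return [test]
        node_id = new_node(stmt)
        edges.update((p, node_id) for p in preds)
        return [node_id]

    visit_block(ast.parse(source).body, [])
    return labels, edges


example = """
x = 0
while x < 3:
    if x % 2:
        x += 2
    else:
        x += 1
y = x
"""
labels, edges = build_cfg(example)
for src, dst in sorted(edges):
    print(f"{labels[src]!r} -> {labels[dst]!r}")
```

Grouping consecutive statements that have no branches between them into one node would turn this statement-level graph into a basic-block graph.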

python_graphs also allows for alternative composite program graphs, which let the user select the desired nodes and edges to construct the graph. Inter-procedural control-flow graphs connect multiple functions rather than representing just a single function. Data-flow graphs show dependencies in the program: the nodes represent variable access locations, and the edges represent the relationships between these accesses. Span-mapped graphs are tokenized graphs designed to be useful for machine learning applications. There are two tokenizations: per-node and whole-program. In whole-program tokenization, you tokenize the entire program and then use python_graphs to create the graph of the program. In per-node tokenization, you split the program into chunks, and the chunks are arranged automatically according to the nodes they are part of.
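As a rough illustration of the data-flow idea, where nodes are variable accesses and edges relate them, the sketch below links each variable read to the most recent write of the same name. It uses only the standard-library ast module; the function name def_use_edges and the tuple format are assumptions for illustration, and it handles only straight-line, module-level code. Real data-flow analysis has to follow the control-flow graph to handle branches and loops.

```python
import ast


def def_use_edges(source):
    """Link each variable read to the most recent write of that name.

    Returns a list of (name, write_line, read_line) tuples.
    Simplification: straight-line module-level code only, and augmented
    assignments such as `x += 1` are not treated as reads.
    """
    last_write = {}
    edges = []
    for stmt in ast.parse(source).body:
        names = [n for n in ast.walk(stmt) if isinstance(n, ast.Name)]
        # Within one statement, reads happen before writes
        # (for example `a = b * a` reads the old value of `a`).
        for name in names:
            if isinstance(name.ctx, ast.Load) and name.id in last_write:
                edges.append((name.id, last_write[name.id], name.lineno))
        for name in names:
            if isinstance(name.ctx, ast.Store):
                last_write[name.id] = name.lineno
    return edges


program = "a = 1\nb = a + 2\na = b * a\nc = a\n"
flow = def_use_edges(program)
for name, write_line, read_line in flow:
    print(f"{name}: written on line {write_line}, read on line {read_line}")
```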

While this library is intended to make life easier for data scientists and researchers around the globe, it still has limitations of its own. One of them is inherent to the language: the library analyzes code statically, that is, without running it, whereas Python is a very dynamic language. The library therefore does a best-effort analysis and cannot guarantee that its results are 100% correct.
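A small example shows why static analysis of Python can only be best-effort. The snippet below is a sketch using the standard-library ast module, not python_graphs; the variable names are made up for illustration. A purely syntactic scan for assignments misses a variable that only comes into existence when the code runs.

```python
import ast

# The first line creates `answer` at run time, inside a string.
source = 'exec("answer = 42")\nresult = answer\n'

# Static view: every name the syntax tree shows being assigned.
static_writes = {
    node.id
    for node in ast.walk(ast.parse(source))
    if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store)
}

# Dynamic view: run the code and look at the resulting namespace.
namespace = {}
exec(source, namespace)

print("statically visible writes:", static_writes)
print("answer exists at run time:", "answer" in namespace)
```

The static scan sees the assignment to `result` but has no way of knowing where `answer` was defined, so any data-flow edge for it would be missing.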

Like every other library, python_graphs has its flaws and benefits, but it exists for a specific reason: to make life easier for the people who work with code. Despite its faults, it does more for the particular task it was created for than its predecessors, and the people behind it deserve some praise for that.

This article is a research summary written by Marktechpost Staff based on the research paper 'A Library for Representing Python Programs as Graphs for Machine Learning'. All credit for this research goes to the researchers on this project. Check out the paper and GitHub link.


Asif Razzaq is an AI Journalist and Cofounder of Marktechpost, LLC. He is a visionary, entrepreneur and engineer who aspires to use the power of Artificial Intelligence for good.

Asif’s latest venture is the development of an Artificial Intelligence Media Platform (Marktechpost) that will revolutionize how people can find relevant news related to Artificial Intelligence, Data Science and Machine Learning.

Asif was featured by Onalytica in its ‘Who’s Who in AI? (Influential Voices & Brands)’ as one of the ‘Influential Journalists in AI’ (https://onalytica.com/wp-content/uploads/2021/09/Whos-Who-In-AI.pdf). His interview was also featured by Onalytica (https://onalytica.com/blog/posts/interview-with-asif-razzaq/).
