{"id":165265,"date":"2021-06-03T18:30:13","date_gmt":"2021-06-03T13:30:13","guid":{"rendered":"https:\/\/venturebeat.com\/?p=2693608"},"modified":"2021-06-03T18:30:13","modified_gmt":"2021-06-03T13:30:13","slug":"researchers-open-source-benchmarks-measuring-quality-of-ai-generated-code","status":"publish","type":"post","link":"https:\/\/www.technologyforyou.org\/researchers-open-source-benchmarks-measuring-quality-of-ai-generated-code\/","title":{"rendered":"Researchers open-source benchmarks measuring quality of AI-generated code"},"content":{"rendered":"<div id=\"boilerplate_2682874\" class=\"post-boilerplate boilerplate-before\">\n<p><em>Elevate your enterprise data technology and strategy at <a href=\"https:\/\/venturebeat.com\/event\/transform-2021\/register\/#\" data-type=\"URL\" target=\"_blank\" rel=\"noreferrer noopener\">Transform 2021<\/a><\/em>. <\/p>\n<hr class=\"wp-block-separator is-style-wide\">\n<\/div>\n<p>The applications of computer programming are vast in scope. And as computers become ubiquitous, the demand for quality code draws an ever-growing number of aspiring programmers to the profession. After years of study to become proficient at coding, experts learn to convert abstract ideas into concrete, executable programs. But what if AI could do the same?<\/p>\n<p>In recent years, large-scale AI language models have shown promise in generalizing to tasks including writing code, implying that humans\u2019 work may one day be supplemented by AI systems. 
But while some studies show that language models can translate code and fix compilation issues, there\u2019s been little work on rigorously testing the coding ability of models given general coding problems.<\/p>\n<p>That\u2019s why a <a href=\"https:\/\/arxiv.org\/pdf\/2105.09938.pdf\">team of researchers<\/a> at the University of California at Berkeley, Cornell, the University of Chicago, and the University of Illinois at Urbana-Champaign created <a href=\"https:\/\/github.com\/hendrycks\/apps\">APPS<\/a>, a benchmark for code generation from natural language specifications. Unlike prior work on code generation, which mostly focuses on code translation and pseudocode-to-code, the researchers tested models on their ability to take specifications and write code that meets these specifications.<\/p>\n<p>Their work comes on the heels of the release of IBM\u2019s Project CodeNet, one of the largest open source datasets for benchmarking AI for code. But CodeNet centers on the problems of code translation, code similarity, and code constraints. APPS is broader in scope, evaluating models not only on their ability to understand coding syntax but on their ability to comprehend task descriptions and create algorithms to solve these tasks.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"size-large wp-image-2693630 aligncenter\" src=\"https:\/\/venturebeat.com\/wp-content\/uploads\/2021\/05\/Screenshot-2021-05-28T154439.892.png?w=800&amp;resize=800%2C196&amp;strip=all\" alt=\"Code generation AI\" width=\"800\" height=\"196\" data-recalc-dims=\"1\"><\/p>\n<p>\u201cAPPS enables robust evaluation of models along several dimensions, providing a precise and comprehensive view of code generation ability,\u201d the coauthors wrote in a paper detailing their work. 
\u201cIf a model were to perform well on APPS, this would indicate an ability to flexibly use data structures and programming techniques, as well as an ability to correctly interpret diverse task specifications, follow instructions, and understand human intent.\u201d<\/p>\n<p>APPS contains 10,000 programming problems in Python, Java, and C++ ranging in difficulty from introductory to coding competition challenges, as well as a bank of over 130,000 test cases and more than 230,000 human-written solutions for evaluation. The test cases were chosen to create a gold-standard metric for model performance, including correct functionality across edge cases. And most were taken from open access coding websites including Codeforces and Kattis.<\/p>\n<p>The introductory problems in APPS, which include counting the number of appearances of a substring and determining whether a string is a palindrome, can be solved by programmers with 1-2 years of experience without requiring complicated algorithms. The intermediate, interview-level problems are more difficult and on par with questions asked in typical technical interviews. As for the competition-level problems, they\u2019re even more challenging and representative of those in high school and collegiate programming competitions like the United States of America Computing Olympiad (USACO).<\/p>\n<h2>Results<\/h2>\n<p>The researchers tested several types of models on APPS, including OpenAI\u2019s GPT-2, GPT-3, and an open source version of GPT-3 called <a href=\"https:\/\/venturebeat.com\/2021\/05\/15\/gpt-3s-free-alternative-gpt-neo-is-something-to-be-excited-about\/\">GPT-Neo<\/a>. In experiments, they discovered that the models could learn to generate code that solves easier problems but not without syntax errors. Approximately 59% of GPT-3\u2019s solutions for introductory problems had errors, while GPT-Neo averaged 3%. 
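To make the two accuracy figures reported here concrete, the following is a hedged sketch, not the actual APPS evaluation harness: the toy problems, candidate solutions, and function names are invented for illustration. Average test case accuracy scores each problem by the fraction of its test cases a generated program passes, while strict accuracy credits a problem only when every test case, including edge cases, passes.

```python
# Hedged sketch (not the actual APPS harness): how an average
# "test case accuracy" and a "strict accuracy" metric can differ.
# The two toy problems and candidate solutions below are invented.

def is_palindrome(s: str) -> str:
    # Candidate solution to an introductory problem: palindrome check.
    return "YES" if s == s[::-1] else "NO"

def count_substring(args) -> int:
    # Candidate solution to substring counting; str.count is
    # non-overlapping, which an edge case below exposes.
    text, sub = args
    return text.count(sub)

# Each problem: (candidate solution, list of (input, expected output) cases).
problems = [
    (is_palindrome, [("level", "YES"), ("apps", "NO"), ("", "YES")]),
    # The second case expects overlapping matches ("aaaa" contains "aa"
    # three times counting overlaps), so the naive candidate fails it.
    (count_substring, [(("banana", "an"), 2), (("aaaa", "aa"), 3)]),
]

def evaluate(problems):
    # Fraction of test cases passed, per problem.
    per_problem = [
        sum(fn(inp) == expected for inp, expected in cases) / len(cases)
        for fn, cases in problems
    ]
    test_case_accuracy = sum(per_problem) / len(per_problem)
    # Strict accuracy: a problem counts only if *every* case passes.
    strict_accuracy = sum(p == 1.0 for p in per_problem) / len(per_problem)
    return test_case_accuracy, strict_accuracy

print(evaluate(problems))  # (0.75, 0.5)
```

In this toy run, the substring solution misses the overlapping-occurrence edge case, so it passes only half of its test cases and fails strict accuracy entirely, which is why a model's strict accuracy on APPS can be far below its average test case accuracy.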
Moreover, the best-performing model \u2014 GPT-Neo \u2014 attained only 10.15% accuracy (excluding edge cases) and 1.12% strict accuracy (including edge cases) across introductory-, interview-, and competitive-level problems, indicating that there\u2019s substantial room for improvement.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"size-large wp-image-2693632 aligncenter\" src=\"https:\/\/venturebeat.com\/wp-content\/uploads\/2021\/05\/Screenshot-2021-05-28T154517.154.png?w=800&amp;resize=800%2C323&amp;strip=all\" alt=\"Code generation AI\" width=\"800\" height=\"323\" data-recalc-dims=\"1\"><\/p>\n<p>\u201cThese results position code generation as a challenging but tractable testbed for large-scale language models \u2026 Writing code to meet specifications in natural language is an economically valuable task with widespread social implications should it be solved, in that it could eventually facilitate malicious code generation and one day result in job automation. As large-scale language models have the potential to make significant progress on code generation, it is essential that we begin to track advancements on this task,\u201d the researchers wrote.<\/p>\n<p>Several efforts are underway to create viable AI-powered coding tools, including Intel\u2019s <a href=\"https:\/\/venturebeat.com\/2020\/12\/03\/intels-controlflag-taps-ai-to-automatically-detect-errors-in-code\/\">ControlFlag<\/a>, which can autonomously detect errors in code. 
<a href=\"https:\/\/venturebeat.com\/2020\/04\/27\/codota-raises-12-million-for-ai-that-suggests-and-autocompletes-code\/\">Codota<\/a>&nbsp;is developing a platform that suggests and autocompletes scripts in Python, C, HTML, Java, Scala, Kotlin, and JavaScript.&nbsp;<a href=\"https:\/\/venturebeat.com\/2020\/07\/14\/ponicode-raises-3-4-million-to-develop-ai-that-automates-code-testing\/\">Ponicode<\/a>&nbsp;taps AI to check the accuracy of code, and&nbsp;<a href=\"https:\/\/venturebeat.com\/2019\/08\/06\/deepcode-learns-from-github-project-data-to-give-developers-ai-powered-code-reviews\/\">DeepCode<\/a>&nbsp;offers a machine learning-powered system for whole-app code reviews (<a href=\"https:\/\/venturebeat.com\/2020\/06\/29\/amazon-launches-ai-powered-code-review-service-codeguru-in-general-availability\/\">as does Amazon<\/a>). Perhaps one of the most impressive projects to date is <a href=\"https:\/\/venturebeat.com\/2020\/06\/08\/facebooks-transcoder-ai-converts-code-from-one-programming-language-into-another\/\">TransCoder<\/a>, an AI system Facebook researchers developed that converts code from one programming language into another. Another contender is a <a href=\"https:\/\/twitter.com\/i\/broadcasts\/1OyKAYWPRrWKb\">model<\/a>&nbsp;from OpenAI that was trained on GitHub repositories to generate entire functions from English-language comments.<\/p>\n<p>According to a&nbsp;<a href=\"http:\/\/citeseerx.ist.psu.edu\/viewdoc\/download?doi=10.1.1.370.9611&amp;rep=rep1&amp;type=pdf\">study<\/a> from the University of Cambridge\u2019s Judge Business School, programmers spend 50.1% of their work time not programming; half of the rest of their time is spent debugging. And the total estimated cost of debugging is $312 billion per year. 
AI-powered code suggestion and review tools, then, promise to cut development costs substantially while enabling coders to focus on more creative, less repetitive tasks.<\/p>\n<div id=\"boilerplate_2660155\" class=\"post-boilerplate boilerplate-after\">\n<h3>VentureBeat<\/h3>\n<p>VentureBeat&#8217;s mission is to be a digital town square for technical decision-makers to gain knowledge about transformative technology and transact. Our site delivers essential information on data technologies and strategies to guide you as you lead your organizations. We invite you to become a member of our community, to access:<\/p>\n<ul>\n<li><span>up-to-date information on the subjects of interest to you<\/span><\/li>\n<li><span>our newsletters<\/span><\/li>\n<li><span>gated thought-leader content and discounted access to our prized events, such as <a href=\"https:\/\/events.venturebeat.com\/transform2021\/\"><strong>Transform 2021<\/strong>: Learn More<\/a><\/span><\/li>\n<li><span>networking features, and more<\/span><\/li>\n<\/ul>\n<p><a class=\"membership-link\" href=\"https:\/\/venturebeat.com\/venturebeat-membership-plans\/\">Become a member<\/a><\/div>\n<p><!-- Boilerplate CSS for \"after\" --> <a href=\"http:\/\/feedproxy.google.com\/~r\/venturebeat\/SZYF\/~3\/O7ZHWgjqg3c\/\">Source Link<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Elevate your enterprise data technology and strategy at Transform 2021. The applications of computer programming are vast in scope. And as computers become ubiquitous, the demand for quality code draws an ever-growing number of aspiring programmers to the profession. 
After years of study to become proficient at coding, experts learn to convert abstracts into concrete, [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[27765,27766,14083],"tags":[20,379,37,20910,16621,16672,31926,18430,16413,76,15210,16390,17194,22830],"class_list":{"0":"post-165265","1":"post","2":"type-post","3":"status-publish","4":"format-standard","6":"category-artificial-intelligence-news","7":"category-machine-learning-news","8":"category-technology-industry-news","9":"tag-ai","10":"tag-apps","11":"tag-artificial-intelligence","12":"tag-benchmark","13":"tag-category-computers-electronics-programming","14":"tag-category-science-computer-science","15":"tag-computer-programming","16":"tag-computer-science","17":"tag-dev","18":"tag-machine-learning","19":"tag-open-source","20":"tag-programming","21":"tag-study","22":"tag-vb-home-page"},"_links":{"self":[{"href":"https:\/\/www.technologyforyou.org\/wp-json\/wp\/v2\/posts\/165265","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.technologyforyou.org\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.technologyforyou.org\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.technologyforyou.org\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.technologyforyou.org\/wp-json\/wp\/v2\/comments?post=165265"}],"version-history":[{"count":0,"href":"https:\/\/www.technologyforyou.org\/wp-json\/wp\/v2\/posts\/165265\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.technologyforyou.org\/wp-json\/wp\/v2\/media?parent=165265"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.technologyforyou.org\/wp-json\/wp\/v2\/categories?post=165265"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.technologyforyou.org\/wp-json\/wp\/v2\/tags?post=165265"}
],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}