In-depth EVM: the risks behind the seemingly trivial matter of contract classification

2023-06-13 05:22:30

In the field of smart contracts, the Ethereum Virtual Machine (EVM), together with its algorithms and data structures, is the first principle.

This article starts from why contracts should be classified, looks at the malicious attacks each scenario may face, and finally gives a relatively safe algorithm for classifying contracts.

**Although the technical content is dense, it can also be read as a casual essay:** a glance at the dark forest of games played between decentralized systems.

1. Why should contracts be classified?

Because classification is so important: it can be said to be the cornerstone of Dapps such as exchanges, wallets, blockchain explorers, and data analysis platforms.

A transaction counts as an ERC20 transfer because its behavior complies with the ERC20 standard. At a minimum:

The status of the transaction is success
The To address is a contract that conforms to the ERC20 standard
The transfer function is called, which is recognizable because the first 4 bytes of the transaction's CallData are 0xa9059cbb
After execution, a Transfer event is emitted by the To address
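The four conditions above can be sketched as a single check. This is a minimal illustration, not the author's implementation: `tx` and `receipt` are assumed to be dicts shaped like JSON-RPC results, and `is_erc20_contract` is a hypothetical callback standing in for the contract classifier that the rest of this article is about. The topic constant is the keccak256 hash of `Transfer(address,address,uint256)`.

```python
TRANSFER_SELECTOR = "a9059cbb"  # first 4 bytes of CallData for transfer(address,uint256)
# keccak256("Transfer(address,address,uint256)"), the first topic of a Transfer log
TRANSFER_TOPIC = "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef"


def looks_like_erc20_transfer(tx, receipt, is_erc20_contract):
    """Return True only if all four conditions from the text hold."""
    to = (tx.get("to") or "").lower()
    # 1. The transaction succeeded.
    if receipt.get("status") != "0x1":
        return False
    # 2. The To address is a contract classified as ERC20 (hypothetical callback).
    if not to or not is_erc20_contract(to):
        return False
    # 3. The CallData starts with the transfer selector (skip the "0x" prefix).
    if tx.get("input", "")[2:10].lower() != TRANSFER_SELECTOR:
        return False
    # 4. A Transfer event was emitted by the To address itself.
    return any(
        log["address"].lower() == to
        and log["topics"]
        and log["topics"][0].lower() == TRANSFER_TOPIC
        for log in receipt.get("logs", [])
    )
```

Note that condition 2 is the only one that cannot be read straight off the transaction, which is why the classification of the To address carries so much weight.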

If the classification is wrong, the transaction behavior will be misjudged

Since transaction behavior is the cornerstone, whether the To address is classified accurately leads to completely different conclusions when interpreting its CallData. For a Dapp, communication between on-chain and off-chain depends heavily on monitoring transaction events, and an event with a given signature can only be trusted if it is emitted by a contract that meets the standard.

If the classification is wrong, the transaction will go into the black hole by mistake

If a user transfers a Token into a contract that has no built-in function for transferring Tokens out, the funds are locked as if they had been burned, and can no longer be controlled.

Now that a large number of projects have begun to add built-in wallet support, managing wallets on behalf of users is unavoidable. This requires classifying newly deployed contracts from the chain in real time, continuously, to determine whether they meet the asset standards.

2. What are the risks of classification?

**The chain is a place with no identity and no rule of law. You cannot stop a valid transaction, even if it is malicious.**

An attacker can be a wolf pretending to be a grandma, doing most of the things you expect a grandma to do, but with the purpose of breaking into the house and robbing it.

Claims to follow a standard, but may not actually comply

A common classification method is to use the EIP-165 standard directly to query whether an address declares support for a standard such as ERC20. This is efficient, but the contract is controlled by the other party, so the declaration can be forged.

The 165 query is merely the cheapest way, within the limited set of opcodes available on chain, to prevent funds from being transferred into a black hole.
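To make the forgery concrete, here is a small sketch of an EIP-165 query and of a contract that lies. The selector `0x01ffc9a7` for `supportsInterface(bytes4)` and the three-step detection procedure come from EIP-165. The example uses the ERC-721 interface id `0x80ac58cd` because that id is published in its EIP. The `call` parameter is a hypothetical stand-in for an `eth_call` wrapper, and `forged_contract` is an invented fake used only for illustration.

```python
SUPPORTS_INTERFACE = "01ffc9a7"  # supportsInterface(bytes4), defined by EIP-165
ERC721_ID = "80ac58cd"           # ERC-721 interface id
INVALID_ID = "ffffffff"          # EIP-165 requires this id to be reported as unsupported


def encode_query(interface_id):
    """ABI-encode supportsInterface(interface_id): selector + bytes4 right-padded to 32 bytes."""
    return "0x" + SUPPORTS_INTERFACE + interface_id.ljust(64, "0")


def is_true(ret):
    """Interpret a 32-byte ABI-encoded bool returned as a hex string."""
    body = ret[2:] if ret.startswith("0x") else ret
    return len(body) == 64 and int(body, 16) == 1


def claims_interface(call, address, interface_id):
    """EIP-165 detection: supports 165 itself, rejects 0xffffffff, then claims the target id."""
    if not is_true(call(address, encode_query(SUPPORTS_INTERFACE))):
        return False
    if is_true(call(address, encode_query(INVALID_ID))):
        return False
    return is_true(call(address, encode_query(interface_id)))


def forged_contract(address, data):
    """A fake 'contract' that answers every query the way a checker wants to hear."""
    asked = data[10:18]  # the 4-byte interface id right after "0x" + selector
    return "0x" + "0" * 63 + ("0" if asked == INVALID_ID else "1")
```

Calling `claims_interface(forged_contract, "0x...", ERC721_ID)` returns `True` even though nothing behind the address implements any transfer logic, which is exactly the weakness described above: the answer is a self-declaration.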

This is why our earlier NFT analysis specifically mentioned the safeTransferFrom method in the standard, where Safe means the standard checks that the receiving contract declares, in the same self-reporting style as 165, that it can handle NFTs.

Only by starting from the contract bytecode, performing static analysis at the code level, and checking the contract's expected behavior can classification be more accurate.

3. Contract classification scheme design

Next, we systematically analyze the overall plan. Note that our ultimate goal is two core indicators: "precision" and "efficiency".

Even when the direction is right, the route to the other side of the ocean is not obvious. The first stop in bytecode analysis is obtaining the code.

3.1. How to get the code?

From the chain side, there is getCode (eth_getCode), an RPC method that returns the bytecode stored at a given address. It reads very quickly, because the codeHash sits at the top level of the account structure in the EVM.
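As a rough sketch, an `eth_getCode` request over JSON-RPC looks like this. The `rpc_url` argument is an assumption (any node endpoint you have access to), and the helper names are invented for illustration.

```python
import json
import urllib.request


def get_code_payload(address, block="latest", request_id=1):
    """Build the JSON-RPC body for eth_getCode(address, block)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "eth_getCode",
        "params": [address, block],
    }


def get_code(rpc_url, address, block="latest"):
    """POST the request to a node and return the runtime bytecode as bytes."""
    body = json.dumps(get_code_payload(address, block)).encode()
    req = urllib.request.Request(
        rpc_url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        result = json.load(resp)["result"]
    # "0x" (empty bytes) means the address holds no code, i.e. it is not a contract.
    return bytes.fromhex(result[2:])
```

An empty result is itself a useful first filter: addresses without code never need to go through the classifier.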

But this method only fetches the code of one address at a time. What if we want to improve accuracy and efficiency further?

If it is a contract deployment transaction, how do we get the deployed code right after it executes, or even while it is still in the memory pool?

If the transaction uses the contract factory pattern, is the code present in the transaction's CallData?

In the end, my approach is to classify in a sieve-like manner:

For transactions that do not deploy contracts, directly use getCode on the addresses involved and classify them.
For the latest memory pool transactions, pick out those whose to address is empty; their CallData is the code including the constructor.
For contract factory transactions, a contract deployed by the factory may in turn call other contracts that perform further deployments, so recursively analyze the sub-calls of the transaction and record every call whose type is CREATE or CREATE2.

When I built a demo implementation, I found that it needs a fairly recent RPC version, because the hardest part of the whole process is recursively finding calls of a given type in the third case above. The lowest-level approach is to reconstruct the call context from the opcodes. I was taken aback!

Fortunately, current geth versions provide a debug_traceTransaction method, which reconstructs the context of each call from the executed opcodes and exposes the core fields.

In the end, the original bytecode can be obtained for every deployment mode (direct deployment, single deployment through a factory, and batch deployment through a factory).
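The recursive search for CREATE and CREATE2 calls can be sketched as a walk over a nested call tree. The frame shape below (`type`, `from`, `to`, `output`, `calls`, `error`) is an assumption modeled on the output of geth's `callTracer` for `debug_traceTransaction`, where a creation frame's `to` is the new address and `output` is the deployed code; check your node's actual output before relying on it.

```python
def collect_creations(frame, found=None):
    """Depth-first walk of a call tree, recording every successful CREATE/CREATE2."""
    if found is None:
        found = []
    if frame.get("type") in ("CREATE", "CREATE2") and "error" not in frame:
        found.append(
            {
                "creator": frame.get("from"),
                "address": frame.get("to"),            # assumed: the newly created address
                "runtime_code": frame.get("output", "0x"),  # assumed: the deployed bytecode
            }
        )
    # A deployed contract's constructor may itself deploy more contracts,
    # so child frames are always visited, whatever the parent's type.
    for child in frame.get("calls", []):
        collect_creations(child, found)
    return found
```

Because the walk does not stop at the first creation, it handles single and batch factory deployments with the same code path.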

3.2. How to classify from the code?

The simplest but unsafe way is to do direct string matching on the code. Taking ERC20 as an example, the functions required by the standard are:

totalSupply() 0x18160ddd
balanceOf(address) 0x70a08231
transfer(address,uint256) 0xa9059cbb
transferFrom(address,address,uint256) 0x23b872dd
approve(address,uint256) 0x095ea7b3
allowance(address,address) 0xdd62ed3e

After each function name is the function's signature. As mentioned in the previous analysis, a transaction finds its target function by matching the first 4 bytes of the CallData.

Therefore, the signatures of these 6 functions must be stored in the contract bytecode.

Of course, this method is very fast: a search finds all 6. The unsafe part is that if I write a Solidity contract with a variable whose stored value is 0x18160ddd, the matcher will conclude that I have this function.
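A minimal sketch of the string-matching approach, and of how easily it is fooled. The selector table lists the six mandatory ERC20 functions; the `decoy` bytecode is invented for illustration and does nothing except push a constant that happens to contain all six selectors.

```python
ERC20_SELECTORS = {
    "totalSupply()": "18160ddd",
    "balanceOf(address)": "70a08231",
    "transfer(address,uint256)": "a9059cbb",
    "transferFrom(address,address,uint256)": "23b872dd",
    "approve(address,uint256)": "095ea7b3",
    "allowance(address,address)": "dd62ed3e",
}


def naive_is_erc20(code_hex):
    """Unsafe classifier: true if every selector appears anywhere in the hex string."""
    body = code_hex.lower()
    if body.startswith("0x"):
        body = body[2:]
    return all(sel in body for sel in ERC20_SELECTORS.values())


# PUSH32 (0x7f) of a 32-byte constant made of the six selectors plus padding,
# followed by STOP (0x00). There is no function dispatch here at all.
decoy = "0x7f" + "".join(ERC20_SELECTORS.values()) + "00" * 8 + "00"
```

`naive_is_erc20(decoy)` returns `True`, so a contract with no callable functions is classified as a token.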

3.3. Accuracy improvement 1: decompilation

A more accurate method is to decompile to opcodes! Decompilation is the process of converting the obtained bytecode into opcodes. More advanced decompilation converts it into pseudocode, which is easier for humans to read, but we do not need that this time. The decompilation method is listed in the appendix at the end of the article.

Solidity (high-level language) -> bytecode -> opcode (operation code)

One feature stands out clearly: a function signature is always pushed by the PUSH4 opcode. So the next refinement is to extract the operand after each PUSH4 across the whole code and match it against the standard's functions.
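The PUSH4 extraction can be sketched with a tiny disassembler. The author uses Go libraries for this step; the Python below is only an illustrative equivalent. It relies on one fact about EVM encoding: opcodes 0x60 through 0x7f are PUSH1 through PUSH32, and each is followed by that many bytes of immediate data, which must be skipped rather than decoded as opcodes.

```python
PUSH1, PUSH4, PUSH32 = 0x60, 0x63, 0x7F


def disassemble(code):
    """Yield (offset, opcode, immediate_bytes) for each instruction in the bytecode."""
    pc = 0
    while pc < len(code):
        op = code[pc]
        # PUSHn carries n bytes of immediate data right after the opcode byte.
        size = op - PUSH1 + 1 if PUSH1 <= op <= PUSH32 else 0
        yield pc, op, code[pc + 1 : pc + 1 + size]
        pc += 1 + size


def push4_values(code):
    """Return the operand of every PUSH4 instruction, as hex strings."""
    return [imm.hex() for _, op, imm in disassemble(code) if op == PUSH4 and len(imm) == 4]


# A PUSH32 constant that embeds 18160ddd, followed by a real PUSH4 a9059cbb.
sample = bytes.fromhex("7f" + "18160ddd" + "00" * 28 + "63a9059cbb")
```

On `sample`, `push4_values` returns only `["a9059cbb"]`: the bytes `18160ddd` inside the PUSH32 constant are skipped as data, which is the improvement over plain string matching.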

I also ran a simple performance experiment, and I have to say that Go is very efficient: 10,000 decompilations take only 220ms.

What follows will be difficult

3.4. Accuracy improvement 2: find the code block

The accuracy above has improved, but not enough, because the search for PUSH4 still covers the full code. We can still construct a variable of type bytes4, which also produces a PUSH4 instruction.

While I was stuck, I thought of how some open source projects handle this. ETL is a tool that reads on-chain data for analysis. It parses ERC20 and 721 transfers into separate tables, so it must be able to classify contracts.

Analysis shows that it classifies based on code blocks and only processes the PUSH4 instructions in the first block, basic_blocks[0].

So the question becomes: how do we accurately delimit a code block?

A code block boundary is defined by two consecutive opcodes, REVERT + JUMPDEST. Both must appear back to back, because inside the opcode range of the function selector itself, when there are too many functions, paging logic appears, and with it a standalone JUMPDEST.
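A sketch of the block rule as stated here: a block ends at a REVERT that is immediately followed by a JUMPDEST, and only PUSH4 operands from the first block are kept. This follows the article's description, not the actual code of any ETL project, and the `sample` bytecode is synthetic (it is not a runnable contract).

```python
REVERT, JUMPDEST, PUSH4 = 0xFD, 0x5B, 0x63


def instructions(code):
    """Decode bytecode into a list of (opcode, immediate) pairs, skipping PUSH data."""
    out, pc = [], 0
    while pc < len(code):
        op = code[pc]
        size = op - 0x5F if 0x60 <= op <= 0x7F else 0
        out.append((op, code[pc + 1 : pc + 1 + size]))
        pc += 1 + size
    return out


def split_blocks(code):
    """Split instructions into blocks; a block ends at REVERT directly followed by JUMPDEST."""
    ins = instructions(code)
    blocks, current = [], []
    for i, (op, imm) in enumerate(ins):
        current.append((op, imm))
        next_op = ins[i + 1][0] if i + 1 < len(ins) else None
        if op == REVERT and next_op == JUMPDEST:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


def first_block_selectors(code):
    """PUSH4 operands found in the first block only."""
    blocks = split_blocks(code)
    return [imm.hex() for op, imm in blocks[0] if op == PUSH4] if blocks else []


# Two selector comparisons separated by a lone JUMPDEST (the "paging" case),
# then REVERT+JUMPDEST, then a function body that pushes a bytes4 constant.
sample = bytes.fromhex("63a9059cbb14" "5b" "6318160ddd14" "fd5b" "63deadbeef00")
```

The lone JUMPDEST in the middle does not split the block, so both selectors are kept, while the `deadbeef` constant in the later block is ignored.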

3.5. Accuracy improvement 3: find the function selector

The function selector reads the first 4 bytes of the transaction's CallData, matches them against the function signatures preset in the contract code, and directs execution to jump to the code location of the matching function.

Let's try a minimal mock execution.

This part concerns the selectors of two functions, store(uint256) and retrieve(), whose signatures can be calculated as 6057361d and 2e64cec1 respectively.
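For readers who want to reproduce the calculation: a selector is the first 4 bytes of the Keccak-256 hash of the canonical function signature. Python's standard `hashlib.sha3_256` uses different padding and gives different results, so the sketch below carries its own compact Keccak-256, written after the structure of the public compact reference implementation. It is meant for illustration; in practice a maintained library would be the better choice.

```python
MASK = (1 << 64) - 1


def _rol(value, n):
    """Rotate a 64-bit lane left by n bits."""
    n %= 64
    return ((value << n) | (value >> (64 - n))) & MASK


def _keccak_f(s):
    """Keccak-f[1600] permutation on 25 lanes, indexed as x + 5*y."""
    r = 1
    for _ in range(24):
        # theta
        c = [s[x] ^ s[x + 5] ^ s[x + 10] ^ s[x + 15] ^ s[x + 20] for x in range(5)]
        d = [c[(x + 4) % 5] ^ _rol(c[(x + 1) % 5], 1) for x in range(5)]
        s = [s[i] ^ d[i % 5] for i in range(25)]
        # rho and pi
        x, y, cur = 1, 0, s[1]
        for t in range(24):
            x, y = y, (2 * x + 3 * y) % 5
            cur, s[x + 5 * y] = s[x + 5 * y], _rol(cur, (t + 1) * (t + 2) // 2)
        # chi
        for y in range(5):
            row = s[5 * y : 5 * y + 5]
            for x in range(5):
                s[5 * y + x] = row[x] ^ (~row[(x + 1) % 5] & row[(x + 2) % 5])
        # iota: round constant bits generated by an LFSR
        for j in range(7):
            r = ((r << 1) ^ ((r >> 7) * 0x71)) % 256
            if r & 2:
                s[0] ^= 1 << ((1 << j) - 1)
    return s


def keccak256(data):
    """Original Keccak-256 (rate 136 bytes, padding 0x01 ... 0x80), as used by Ethereum."""
    rate = 136
    padded = bytearray(data) + b"\x01" + b"\x00" * (-(len(data) + 1) % rate)
    padded[-1] |= 0x80
    state = [0] * 25
    for off in range(0, len(padded), rate):
        block = padded[off : off + rate]
        for i in range(rate // 8):
            state[i] ^= int.from_bytes(block[8 * i : 8 * i + 8], "little")
        state = _keccak_f(state)
    return b"".join(lane.to_bytes(8, "little") for lane in state[:4])


def selector(signature):
    """First 4 bytes of keccak256 of the canonical signature, as hex."""
    return keccak256(signature.encode())[:4].hex()
```

The signature string must be canonical: no spaces, no parameter names, and full type names such as `uint256` rather than `uint`.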

After decompiling, the resulting opcode sequence can be divided into two parts.

First part:

In compiled code, only the function selector part of the contract reads the CallData in this way, that is, it extracts the function call signature from the CallData.

We can see the effect by simulating how the EVM stack changes as these opcodes execute.

Second part:

The process of checking whether the value matches a selector:

PUSH4 pushes the 4-byte function signature of retrieve() (0x2e64cec1) onto the stack.
The EQ opcode pops 2 values from the stack, here 0x2e64cec1 and 0x6057361d, and checks whether they are equal.
PUSH2 pushes 2 bytes of data (0x003b here, 59 in decimal) onto the stack. The EVM keeps a program counter that specifies the position of the next instruction to execute in the bytecode; we push 59 because that is where the retrieve() bytecode starts.
JUMPI stands for "jump if": it pops 2 values from the stack as input, and if the condition is true, the program counter is updated to 59.

This is how the EVM determines the location of the function bytecode it needs to execute based on the function call in the contract.

In reality, this is just a simple set of "if statements" for every function in the contract and where they jump to.
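The chain of "if statements" can be simulated with a toy interpreter that supports only the handful of opcodes a dispatcher uses. This is an illustrative sketch, not a real EVM: the `DISPATCHER` bytecode is hand-assembled, the jump target 0x003b for retrieve() is the one mentioned in the text, and the target 0x0059 for store(uint256) is a made-up value for the example.

```python
def dispatch(code, calldata):
    """Run a selector prelude and return the jump destination chosen, or None."""
    stack, pc = [], 0
    while pc < len(code):
        op = code[pc]
        if 0x60 <= op <= 0x7F:  # PUSH1..PUSH32: push the immediate bytes as an integer
            n = op - 0x5F
            stack.append(int.from_bytes(code[pc + 1 : pc + 1 + n], "big"))
            pc += 1 + n
            continue
        if op == 0x35:  # CALLDATALOAD: read a 32-byte word, zero-padded on the right
            offset = stack.pop()
            stack.append(int.from_bytes(calldata[offset : offset + 32].ljust(32, b"\0"), "big"))
        elif op == 0x1C:  # SHR: pops the shift amount first, then the value
            shift, value = stack.pop(), stack.pop()
            stack.append(value >> shift if shift < 256 else 0)
        elif op == 0x80:  # DUP1: duplicate the top of the stack
            stack.append(stack[-1])
        elif op == 0x14:  # EQ: compare the top two values
            stack.append(int(stack.pop() == stack.pop()))
        elif op == 0x57:  # JUMPI: pops destination, then condition
            dest, cond = stack.pop(), stack.pop()
            if cond:
                return dest
        else:
            return None  # any other opcode is outside this toy model
        pc += 1
    return None  # no selector matched


DISPATCHER = bytes.fromhex(
    "6000" "35" "60e0" "1c"          # PUSH1 00; CALLDATALOAD; PUSH1 e0; SHR
    "80" "632e64cec1" "14" "61003b" "57"  # DUP1; PUSH4 retrieve(); EQ; PUSH2 003b; JUMPI
    "80" "636057361d" "14" "610059" "57"  # DUP1; PUSH4 store(uint256); EQ; PUSH2 0059; JUMPI
)
```

With CallData `2e64cec1` the function returns 59, matching the walk-through above; with the store(uint256) selector it falls through the first comparison and takes the second jump; with any other selector it returns `None`.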

4. Scheme Summary

The overall scheme, in brief:

Each contract address's deployed bytecode can be obtained through the getCode RPC method or debug_traceTransaction, and the opcodes are obtained by decompiling it with the VM and ASM libraries in Go.
By the way the EVM operates, a contract has the following characteristics:

REVERT+JUMPDEST delimits code blocks.
The contract must contain a function selector, and it must be in the first code block.
In the function selector, every function signature is pushed with the PUSH4 opcode.
The opcodes of the selector contain the consecutive sequence PUSH1 00; CALLDATALOAD; PUSH1 e0; SHR; DUP1. Its core function is to load the CallData and shift it. No other contract-level syntax generates this sequence.

The corresponding function signatures are defined in the EIP, with clear indications of which are mandatory and which optional.
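Putting the rules of this summary together gives roughly the following classifier. It is a simplified sketch under the rules exactly as stated here, checked only against synthetic bytecode; real compiler output can differ (for example in what precedes the selector), so treat it as a starting point rather than a finished detector. All names are invented for illustration.

```python
REVERT, JUMPDEST, PUSH4 = 0xFD, 0x5B, 0x63
# PUSH1 00; CALLDATALOAD; PUSH1 e0; SHR; DUP1
PRELUDE = [(0x60, b"\x00"), (0x35, b""), (0x60, b"\xe0"), (0x1C, b""), (0x80, b"")]
ERC20_REQUIRED = {"18160ddd", "70a08231", "a9059cbb", "23b872dd", "095ea7b3", "dd62ed3e"}


def instructions(code):
    """Decode bytecode into (opcode, immediate) pairs, skipping PUSH data."""
    out, pc = [], 0
    while pc < len(code):
        op = code[pc]
        size = op - 0x5F if 0x60 <= op <= 0x7F else 0
        out.append((op, code[pc + 1 : pc + 1 + size]))
        pc += 1 + size
    return out


def first_block(ins):
    """Instructions up to and including the first REVERT that is followed by JUMPDEST."""
    for i in range(len(ins) - 1):
        if ins[i][0] == REVERT and ins[i + 1][0] == JUMPDEST:
            return ins[: i + 1]
    return ins


def has_prelude(block):
    """True if the CallData load-and-shift sequence appears consecutively in the block."""
    return any(block[i : i + len(PRELUDE)] == PRELUDE for i in range(len(block) - len(PRELUDE) + 1))


def is_erc20(code):
    """Apply the three rules: first block, selector prelude, required PUSH4 set."""
    block = first_block(instructions(code))
    if not has_prelude(block):
        return False
    selectors = {imm.hex() for op, imm in block if op == PUSH4}
    return ERC20_REQUIRED <= selectors
```

The optional ERC20 functions (`name()`, `symbol()`, `decimals()`) are deliberately left out of `ERC20_REQUIRED`, since a contract may omit them and still conform.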

4.1. Proof of Uniqueness

At this point, we can say that an efficient and accurate contract analysis method has essentially been achieved. Since we have been rigorous this far, we might as well go further. The scheme above uses REVERT+JUMPDEST to delimit code blocks and combines that with the unavoidable CallData load and shift to make a unique determination. Is it possible to write a Solidity contract that produces a similar opcode sequence?

I ran a control experiment. Although Solidity offers ways to read CallData at the language level, such as msg.sig, they compile to different opcode sequences.
